Screaming in the Cloud - How to Get 75 Gigs of Free Storage in AWS with xssfox
Episode Date: April 8, 2020
About xssfox: Just a dumb fox
Links Referenced:
DigitalOcean: https://www.digitalocean.com/
CHAOSSEARCH: http://CHAOSSEARCH.io
Big Buck AWS: https://github.com/xssfox/bigbuckaws
Corey's talk, "Terrible Ideas in Git": https://www.lastweekinaws.com/blog/terrible-ideas-in-git-by-corey-quinn/
Twitter: https://twitter.com/xssfox
Screaming in the Cloud: http://ScreamingintheCloud.com
Transcript
Hello and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world of cloud,
thoughtful commentary on the state of the technical world,
and ridiculous titles for which Corey refuses to apologize.
This is Screaming in the Cloud. This episode is sponsored by DigitalOcean, and this is important to me: no billing surprises. With simple, predictable pricing that's flat across 12 global data center regions
and a UX developers around the world love,
you can control your cloud infrastructure costs
and have more time for your team to focus on growing your business.
See what businesses are building on DigitalOcean and get started for free
at do.co slash screaming. That's
do.co slash screaming. And my thanks to DigitalOcean for their continuing support of this
ridiculous podcast. This week's episode is sponsored by Chaos Search. If you've ever tried
managing Elasticsearch yourself, you know that it's of the devil. You have to manage a
series of instances, you have to potentially
deal with a managed service. What if all of that
went away? Chaos Search does
that. It winds up taking
the data that lives in your S3
buckets and indexing that
and providing an Elasticsearch-compatible API. You don't have to
manage infrastructure, you don't have to
play stupid slap and tickle games
with various licensing arrangements. And fundamentally, you wind up dealing with a
better user experience for roughly 80% less than you'll spend on managing actual Elasticsearch.
Chaos Search is one of those rare companies where I don't just advertise for them. I actively
recommend them to my clients because fundamentally
they're hitting it out of the park. To learn more, look at chaossearch.io. Chaos Search is,
of course, all in capital letters because despite chaos searching, they cannot find the caps lock
key to turn it off. My thanks to Chaos Search for sponsoring this ridiculous podcast.
Welcome to Screaming in the Cloud. I'm Corey Quinn.
I'm joined this week by Michael, who, similar to a previous guest, is an Australian code terrorist.
Michael, welcome to the show. Hey, thanks for having me.
So you wrote something a while back that really sort of took some of the serverless world by storm. It's a GitHub repo that you called BigBuckAWS.
What does that do exactly?
So it's sort of like a tech demo, I guess,
of how you can abuse some of the inner workings of AWS,
specifically Lambda.
And it sort of allows you to stream content, so like an MP4 HLS stream, without really paying for much data. And it does that by abusing the fact that you can pull your code back out of Lambda. Yeah, so the thing is, when you upload code into Lambda, they also let
you download it again to view it in like the online editor, or if you have some sort of internal tool,
you can use it. But the way it works internally in AWS is when you upload it, it ends up in an
S3 bucket. You can grab that data back out from the
S3 bucket through a signed URL, but that bucket is Amazon's bucket, not yours. So you don't get
charged for it. So that's kind of the key part of it. And that always struck me as something very
strange. I'm sure there's a technical reason behind it, but you get 75 gigs of Lambda storage
per account. It's an Amazon bucket. There's no requester pays or anything like that.
And it just never occurred to me that,
A, you could pull data back out of it,
mostly because I guess I lack the appropriate level
of code terrorism style imagination,
but also because it's something
that has always happened under the hood.
Oh, that's where the code gets uploaded there
and then it magically runs as your Lambda function.
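A minimal sketch of the behavior being described, assuming you already have some function deployed (the name here is a placeholder):

```python
# GetFunction returns, alongside the configuration, a short-lived pre-signed
# URL pointing into the Amazon-owned S3 bucket that holds your package.
import boto3

resp = boto3.client("lambda").get_function(FunctionName="my-function")
print(resp["Code"]["Location"])  # downloads here hit Amazon's bucket, not yours
```

Since the bucket belongs to Amazon rather than to your account, fetching the package through that signed URL doesn't show up as a data transfer charge on your bill.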
I guess, starting at the beginning,
what got you thinking down this path? I guess the key thing is I was waiting for an awfully
large Lambda function to upload. And I was just looking at like, how big of a Lambda function can
I actually upload and what sort of limitations there are. So that got me looking at the limits
in that. And that's sort of when I started thinking down the path of why is there a 75 gig sort of limit to Lambda functions? There's probably a good reason for
that. And from there I sort of thought, oh, it's because, you know, it's not hosted in your bucket.
Unlike CodeDeploy or OpsWorks, where you run the infrastructure, it's
actually in Amazon's bucket. So that's why they've put that sort of cap on it. And then I thought about ways that that could possibly be, I guess, abused to store your own
stuff for free. Oh, it can be abused awesomely. This is also, I guess, this is not the first time
that I've looked into that 75 gigabyte free storage option for, I guess, effectively stealing
resources that they didn't think anyone would actually use
in order to build something horrifying.
Ben Kehoe and I have talked a bit
about building out PackRatDB,
which is how much of a database
can you actually shove into an AWS account
without paying for anything?
And we're continuing to explore
what that might look like.
And this was absolutely one of the single largest data stores you can get. Everything else is about how free tags on resources that don't cost anything build an awesome key-value store, etc., etc.
But this one sort of blows away everything else we've found so far,
as far as, hey, what can I get massive storage-wise
without having to pay for any of it?
Yeah, yeah, certainly.
And the other thing is thinking about transfer costs.
So a lot of places you can store data,
but you somehow get charged for a request to pull it back out.
And so that's what was a little bit weird with the Lambda functions.
And I guess the only tricky part is making it usable for an end client
without having to have a Lambda function that pulls it out, does some data mangling to get it into the right format, and then sends it back to the client.
So that's sort of the only tricky part with using Lambda functions like that.
Okay. So let's continue through the demo. You figured out that you could have 75 gigs of data
just hanging out there and you could pull it back at no charge. Where'd you go from there?
Right. So I remembered a Medium post by Laurent, and that blog post was about
using Google Docs to store video. And they used the HLS format to basically skip through to the section that actually contained the video content.
So in their example, they were uploading the videos sort of inside a PNG file, and then they could skip through to that.
Their purpose was that they wanted to hide it away from Google.
So Google thought that it was just a picture, didn't try to do any sort of copyright data matching on it. And from there, I thought, hey, I could probably use that inside
Lambda to make use of the storage. So I guess the key problem with Lambda is you need to upload a
zip file. So you need to somehow have the video content in the zip file, but still accessible by the client.
So for that, what we do is we take the zip file
and we actually compress it with zero compression.
So depending on your zip utility,
there's lots of ways of enabling that.
But basically you say, zip it up, but don't compress it.
So the entire file is there
uncompressed. You just sort of need to jump to the right sort of byte offset. And that's where the
HLS stream comes in. You can say, hey, just jump to this part of the file and you can skip all the
zip header stuff. Right. So you have the header itself that's there, but effectively when people
are saying they're looking for a great compression algorithm, you were looking for the exact opposite of that.
Correct, yeah.
We want something that's not compressed, because the video client doesn't know how to deal with the compression.
So yeah, if we have it not compressed, then that works better in our favor.
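A minimal sketch of that zero-compression step using Python's standard library (file names are illustrative; with the Info-ZIP CLI, `zip -0` does the same thing):

```python
import zipfile

# Build the deployment package with ZIP_STORED (no compression), so the
# segment's raw bytes sit contiguously inside the archive.
with zipfile.ZipFile("function.zip", "w", compression=zipfile.ZIP_STORED) as zf:
    zf.write("segment0.ts")  # the HLS media segment
    zf.writestr("handler.py", "def handler(event, context):\n    return {}\n")

# Work out where the segment's bytes begin, so the playlist can point the
# player past the zip's local file header.
with zipfile.ZipFile("function.zip") as zf:
    info = zf.getinfo("segment0.ts")
    # Local file header: 30 fixed bytes + filename + extra field. The extra
    # field can differ between the central directory and the local header,
    # so treat this as an approximation to verify against the actual archive.
    offset = info.header_offset + 30 + len(info.filename) + len(info.extra)
    print(offset, info.file_size)
```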
Gotcha.
So you wind up then effectively having one giant object sitting there,
but you can also do the byte offset to tell the video player where to get it from.
Is that using the byte range stuff that is in S3's GET API,
or is it using something else?
Yeah, so if you do a normal GET to S3, it uses the Range header with a byte range as part of the request, so it'll skip through to that.
Yeah, I think that's one of those things
that people aren't generally aware exists,
where you don't need to pull the entire object down. You can just say, give me this very specific
portion of it. Yeah, it's a very handy feature for more production-like workloads.
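To make that concrete, a hand-written playlist for this trick might look roughly like the sketch below. `EXT-X-BYTERANGE` (length@offset) is what makes the player issue those ranged GETs, skipping the zip header. The numbers are illustrative, and the URIs stand in for whatever fronts each chunk (in this project, the API Gateway endpoint discussed next); if every chunk zip shares the same internal layout, the offset repeats:

```
#EXTM3U
#EXT-X-VERSION:4
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:10.0,
#EXT-X-BYTERANGE:10485760@157
https://abc123.execute-api.us-east-1.amazonaws.com/prod/chunk/0
#EXTINF:10.0,
#EXT-X-BYTERANGE:10485760@157
https://abc123.execute-api.us-east-1.amazonaws.com/prod/chunk/1
#EXT-X-ENDLIST
```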
So you wound up then putting this behind an API gateway and then hooking that up to a Lambda
function. Yep. Have you looked at all into their new HTTP API option,
which I think is now in beta?
They talked about it a lot at re:Invent,
but I haven't had the chance to play with it myself yet.
Yeah, so I actually tried to,
because I thought this would be a brilliant sort of demo
of testing that out.
And I tried to set that up and I followed all the steps
and I just could not get the API to return anything but like a
403 or a 500 or something. So I clearly... That sounds like most of my early explorations with API Gateway, start to finish, until I started just using something like the Serverless Framework to wrap it for me. I feel like for a long time, most of the stuff I was getting from API Gateway was just a comedy of errors.
It was not the most intuitive thing to learn,
and I'm disheartened to hear that that potentially
is what we're seeing from the new version as well.
Yeah, I'm not sure.
I feel like I've probably done something wrong.
There's probably some key part of documentation,
or possibly I just didn't wait long enough
for DNS to propagate or whatnot for the new API.
But I quickly sort of jumped to...
Who has the patience for that?
Yeah, yeah.
This episode is sponsored by ExtraHop.
ExtraHop provides threat detection and response
for the enterprise, not the starship.
On-prem security doesn't translate well to cloud
or multi-cloud environments,
and that's not even counting IoT. ExtraHop automatically discovers everything inside the perimeter, including your
cloud workloads and IoT devices, detects threats up to 35% faster, and helps you act
immediately. Ask for a free trial of detection and response for AWS today at extrahop.com slash trial. Okay, so once this
video is up and running and you give someone a URL that's fronted by an API gateway, that's
awesome. But what sort of, I guess, viewers or clients can wind up understanding that?
Yeah, so one of the problems you have with this is, because it's hosted in S3, in their bucket, they haven't enabled CORS for us, and, you know, that's sort of not surprising. But you can play it in basically every sort of modern HLS player, so VLC and the like; I even got it running in Windows Media Player. The problem you have is you can't embed it into a web page, because of CORS. So you can link someone to it and they can open it in their player, but you can't open it in a browser. Forgive me if I'm
remembering this backwards: would it be possible to adjust that by modifying the payload response that winds up being returned, maybe through the Lambda function itself? So, yes, but the problem is, the only thing we're doing in the Lambda response is providing a 302, and it's the S3 bucket where all the Lambda functions are stored that's where the CORS needs to be, and we don't control that. Gotcha. And if we want to continue to make AWS eat the bill for it, there's not really a great series of answers around it. Yeah, yeah. So in theory,
and this is where my knowledge falls apart, in theory I believe there's like a same-origin, or there's an HTML attribute that you could potentially put in the video tags to get it to work. But it seems that none of the out-of-the-box sort of HLS streaming JavaScript libraries that I looked at support doing that. And I don't know if it is possible, but it seems to imply that it is.
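That 302-only handler really can be tiny. A hedged sketch of the idea, not the actual BigBuckAWS code (the chunk-naming scheme and the proxy-integration event shape are assumptions):

```python
import boto3

lambda_client = boto3.client("lambda")

def handler(event, context):
    # Which chunk the player asked for, e.g. /chunk/0 (path shape assumed).
    chunk = event["pathParameters"]["chunk"]
    # Fetch the short-lived pre-signed URL of that chunk's deployment package.
    resp = lambda_client.get_function(FunctionName=f"bigbuck-chunk-{chunk}")
    # Redirect the player; its ranged GET then runs against Amazon's bucket.
    return {
        "statusCode": 302,
        "headers": {"Location": resp["Code"]["Location"]},
    }
```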
What other limitations do you see in something like this?
So you are limited in your chunk size, so you can only have like a 50-meg file per function. And that sort of poses a problem if you're doing like 4K video, because you won't have a very large, or like a long, duration of video content in that. And that can cause some issues with some of the clients trying to buffer, because they only sort of try to download the next couple of files, not based on time. So you end up with sort of stuttering. So for 1080p video, it seemed to work okay.
I haven't tested it with 4K.
I did this all by hand by manually writing the HLS stream file. It's trivial to automate, but if I was to do a 4K one, I'd have hundreds and hundreds of Lambda functions I'd need to upload.
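Automating that would look roughly like the following sketch: segment the video, then upload each segment as its own stored-zip function. All the names and the role ARN are placeholders, and a 10-second 1080p segment only usually lands under the 50 MB package cap, so sizes would need checking:

```python
import glob
import subprocess
import zipfile

import boto3

# Split the video into ~10-second MPEG-TS segments without re-encoding.
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-c", "copy", "-f", "hls",
     "-hls_time", "10", "-hls_list_size", "0", "seg.m3u8"],
    check=True,
)

client = boto3.client("lambda")
# Sort numerically ("seg10.ts" would otherwise sort before "seg2.ts").
segments = sorted(glob.glob("seg*.ts"), key=lambda p: int(p[3:-3]))
for i, seg in enumerate(segments):
    zpath = f"chunk{i}.zip"
    with zipfile.ZipFile(zpath, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.write(seg)
        zf.writestr("handler.py", "def handler(event, context):\n    return {}\n")
    with open(zpath, "rb") as f:
        client.create_function(
            FunctionName=f"bigbuck-chunk-{i}",
            Runtime="python3.8",
            Role="arn:aws:iam::123456789012:role/lambda-basic",  # placeholder
            Handler="handler.handler",
            Code={"ZipFile": f.read()},
        )
```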
I'm guessing this is probably not intended for anything remotely resembling production use?
Most certainly not. I imagine Amazon is going to do something to block this somehow.
There's a few ways I've thought of that they could do that, but I don't know sort of what approach they'll take, because they do have to sort of maintain backwards compatibility with...
Right.
They view APIs as promises, as they love telling us.
So the question then becomes, what could they change to make this a non-workable solution? And of that list,
which of those are viable without having to break existing functionality? Yeah. So the two that come to mind are: they could put sort of Requester Pays on the S3 bucket, and that would basically just mean that you'd start paying the cost for downloading that. I'm not sure if that's something they would do. The other one that comes to mind is if they did the zipping themselves as well, so that it is actually compressed. I'm not sure internally how that would go, like if that would break things.
It would fix using this for HLS, but it could still be used for other data purposes.
Yeah, it's one of those interesting areas where it would solve this particular use case,
but there's nothing preventing someone from building out something like this that just grants,
oh, run this one-liner and you'll get 75 gigs of storage per account.
Yep.
It's interesting to see how this might wind up, I guess, influencing AWS.
I mean, there is always the option where they just decide that this is an acceptable loss.
They've made something like, what, $32 billion in revenue last year.
Yeah, if people want to go through these kinds of hoops, okay, they won, we'll let them. Unless they start seeing widespread abuse of this, which frankly I kind of have a hard time envisioning, I don't know that this is necessarily going to be on the top of their list of things to chase down. Yeah, I'm not certain about that, because I guess one of the use cases for this is video piracy. So it could potentially be used
to pirate movies and stuff as they sort of come out. And it means that the person doing the pirating, or hosting the pirated content, apart from having their Amazon account shut down, doesn't really risk spending a huge amount of money. But the other use case I kind of thought about,
I mean, I just did this for fun.
I had no use for it.
But the other use case I thought of after I built this
was those times where media outlets have this huge story
and they want to release it to the world,
but they know that they're going to take a pretty big hit
in terms of hosting cost for it.
They could just quickly do this and that would save them like probably millions if everyone's
looking at the same video content.
Right.
Effectively a freestyle CDN to some extent.
The counterpoint is, I can't really see any reputable media organization going down this path. It just seems like it would be a little bit too far towards the "what do you folks think you're doing" model. Yes, yeah, certainly. I can't imagine anyone doing it, but people will find any way to cut costs, I guess. As soon as one person does it, at that point I feel like they're no longer able to ignore this as just some weird proof of concept someone on the internet threw up.
Yeah, certainly. I imagine this will last a
while. I imagine they'll sort of monitor it, maybe run
some analytics on it, and then at some point, once it gets to a tipping
point where it's worth changing, then they'll look into it.
Right, and there's always the customer-unfriendly approaches, where they can just solve this entirely in their terms of service,
and after they find egregious users
and start effectively turning off their AWS accounts,
the message would probably get out.
It feels like something that a company that isn't Amazon
would be likelier to do.
Yeah, yeah.
I imagine, based on some of the previous sort of examples of this sort of, I guess, code terrorism, it seems like they are more likely to just sort of eat the bill until it becomes a huge problem.
Yeah.
I feel like I need to highlight yet again that this is not something that people should use for production use. For a while, I was giving a talk called
Terrible Ideas in Git. And I had a Docker container that was published and ready to be used for this
just because resetting a whole bunch of Git repositories after you've mangled the hell out
of them is obnoxious. Just run a Docker container every time you give the talk and things are great.
The container was called Terrible Ideas. And I'm sure someone was using it for something in production because people do terribly stupid things without any rationale. Similar to the time where I started
making jokes about using Route 53 as a database. And I started getting people responding with,
well, that's not the worst idea in the world. What if we did it like this? And it's no, no,
no, no, no. That at some point the joke takes on a life of its own but you kind of want to at least
keep the sharp edges away from people who may not understand yes what it exactly is they're doing
yeah and that's why i tried to put a fairly decent write-up on how it works and also
the limitations sort of a disclaimer to say hey this probably won't work in the future. But I am very scared about how many stars
this has gotten on GitHub.
And there's apparently three forks of it already.
So hopefully no one's actually using this
in the production sense.
One would very much like to hope.
The counter argument, though,
is that people will always surprise you
with what ridiculous things they're doing.
Back when I was doing open source development work on SaltStack, I figured that, oh, the problem clearly is that everyone who's used Puppet or Chef or anything like that was just, oh, they weren't very good at what they do and the tool was not adequate.
We've built this thing. It's going to be amazing.
And that lasted right until I saw my first customer use case where, oh, it turns out
that anything is a hammer if you hold it wrong. It's difficult to get people to see the vision.
And I feel like the things that you build never survive encounters with other people's horrifying
use cases. Yeah. A lot of the things I have built in the past have been, I guess, horrible, horrible things, mostly just for fun, just seeing how far you can push a tool to make it work. I guess an example of that is I have built in the past a
Lambda function that works as a custom resource in CloudFormation that creates a Mechanical Turk question
and provides access keys and secret keys.
So you can, in free text, create your CloudFormation
and just say, hey, can you build me an S3 bucket?
And it will fire off a Mechanical Turk request
and ask someone on Mechanical Turk
to build the S3 bucket for you.
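For flavor, the skeleton of a Lambda-backed custom resource along those lines might look something like this. It's a guess at the shape rather than Michael's actual code, every HIT parameter here is invented, and it skips the credential-provisioning part he mentions:

```python
import json
import urllib.request

import boto3

mturk = boto3.client("mturk")

QUESTION_XML = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>task</QuestionIdentifier>
    <QuestionContent><Text>{text}</Text></QuestionContent>
    <AnswerSpecification><FreeTextAnswer/></AnswerSpecification>
  </Question>
</QuestionForm>"""

def handler(event, context):
    if event["RequestType"] == "Create":
        # Free-text instructions from the template, e.g.
        # "Please create an S3 bucket called ..."
        text = event["ResourceProperties"]["Instructions"]
        mturk.create_hit(
            Title="Provision an AWS resource by hand",
            Description="Follow the instructions using the provided credentials.",
            Reward="0.50",
            AssignmentDurationInSeconds=3600,
            LifetimeInSeconds=86400,
            MaxAssignments=1,
            Question=QUESTION_XML.format(text=text),
        )
    # Signal CloudFormation so the stack doesn't hang waiting on the human.
    body = json.dumps({
        "Status": "SUCCESS",
        "PhysicalResourceId": "human-provisioned-resource",
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }).encode()
    req = urllib.request.Request(event["ResponseURL"], data=body, method="PUT")
    urllib.request.urlopen(req)
```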
On some level, you have to wonder at what point they just automate a lot of these common solutions
into something that is an AWS Solutions offering or a Quick Start, versus how much of it is something
like AWS IQ, where you can effectively pay people a few bucks to do common or uncommon things as
the case may be. I would not put this past being wrapped around an official AWS
offering at some point.
Yeah, I can imagine.
I'm assuming that this is not the sort of thing that springs fully formed from
I'm going to go online today on my first day on the job and go ahead and build something like this.
Where does it come from? Where did you wind up, I guess,
starting down the path of thinking about creative use of services like this?
I always think about limits and how can I at least make use of the limits,
get to the sort of boundary, like that limit is set.
How do I get right up to that edge and make the most out of this?
So I guess every time I look
at something, if I see a limitation... I guess the prime example here is S3. When they first released it, I'm pretty sure they only charged for data transfer; they didn't have any sort of billing for GET requests or HEAD requests or OPTIONS and all of that stuff. All those requests weren't billed. So I heard this story a long time ago about someone
that essentially used that as a database because those requests were free. They weren't really
grabbing any data out of it. And that's sort of when Amazon had to add that limitation. I'm like, at some point,
I really want to be doing that. I want to be the reason why Amazon puts in that limit. So
every time I look at a new service when it's released, I look at the limits and try and work out how I can use it to its fullest potential, in ways that Amazon never actually planned on it being used.
Right. I want to be the exception case. How do I make that possible?
Yeah, exactly.
And you always think, well, no one would actually go to the trouble of doing that stuff. Well, have you met me? I mean, your Lambda function is a whopping 36 lines all in, in Python. It's not a massive amount of code. It's not anything that is overly complex.
I think it just requires looking at these things from a certain point of view that very
often the people building it never considered.
Exactly.
And I feel like there's probably a few other cases in Amazon that this sort of approach, like not this exact code, but this approach, can be applied to. I haven't been able to see those yet. It's just I happen to use Lambda enough
that I have worked out the inner workings a bit more. But I'm sure there's other places where you
can upload data and get it back down for free that could be abused, maybe not for video streaming,
but at least as a free database. You almost start to wonder, okay, what is the upper bound of data
you can attach to an AWS support ticket?
Yes.
Because it does have an API.
Yeah.
One other thing I thought was kind of neat too,
right around the same time that this came out,
was someone did a whole write-up
about how anything outside of the handler function in Lambda ran with the full three gigs of resources and two vCPUs, and wound up not billing, or something on the order of that: until it entered the handler, it was either unbilled or billed at a small fraction of what was being charged.
Remember what I'm talking about?
Yeah, yeah, I've seen that and read that.
And so, yeah, I think, if I remember correctly, it's just that on that cold start, it has full power, full memory, to just set everything up. You know, it's designed for big Java applications and whatnot, so it can quickly get started, so that cold start time is really short, and then run. But you can sort of abuse that by running your code outside the handler.
And a lot of the stuff I do professionally, you know, we do a lot of stuff outside the handler for, you know, setting everything up. But I never really thought about using that extra time. But I have wondered, could you expand that to be more useful, as like a clustered, distributed computing system? It'd be really cool to see that expanded on, because Amazon gave that the tick of approval to say that's fine. So, yeah, they said have fun. Multiple folks who are in a position to be authoritative on Twitter said, yeah, go for it. See what you can build.
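As a tiny illustration of the pattern being discussed: work at module scope runs during the cold-start initialization phase, which is where that extra CPU reportedly shows up, while the handler body is what's metered per invocation. The workload here is just a stand-in:

```python
import time

# Module scope: runs once per cold start, during the init phase, which the
# write-up mentioned above found runs at full speed (and was billed lightly
# or not at all) regardless of the configured memory size.
_t0 = time.time()
PRECOMPUTED = sum(i * i for i in range(5_000_000))  # stand-in for real setup work
INIT_SECONDS = time.time() - _t0

def handler(event, context):
    # The billed invocation starts here; the heavy lifting already happened.
    return {"init_seconds": INIT_SECONDS, "precomputed": PRECOMPUTED}
```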
So, all right, challenge accepted. This is the danger of goading
people on who have very little sense of, oh, I shouldn't do
that. They wouldn't like it. Oh, no. At some point, you've got to get portions of
that AWS bill back. Yes, yes, certainly. Amazon
is definitely not losing out on this.
If you're using any of their services, they're winning.
Absolutely.
I've yet to see a single exploit like this
that didn't result in, yes,
and it winds up causing a slight discounting on my bill
that is already a phone number.
This is very much a rounding error, even per user account, for most of these things.
Yes.
So we'll see.
So if people want to discover more of your various acts of code terrorism, where can
they find you?
They can find me on Twitter.
So, at xssfox: X-ray, Sierra, Sierra, Foxtrot, Oscar, X-ray.
Excellent.
Thank you so much for taking the time to speak with me today.
I appreciate it.
No worries.
Thank you.
Michael, code terrorist at undisclosed location for obvious reasons.
I'm cloud economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave an excellent review on Apple Podcasts.
If you've hated this podcast, please leave an excellent review on Apple Podcasts. This has been a HumblePod production. Stay humble.