Screaming in the Cloud - Creating “Quinntainers” with Casey Lee
Episode Date: April 20, 2022

About Casey: Casey spends his days leveraging AWS to help organizations improve the speed at which they deliver software. With a background in software development, he has spent the past 20 years architecting, building, and supporting software systems for organizations ranging from startups to Fortune 500 enterprises.

Links Referenced:
"17 Ways to Run Containers in AWS": https://www.lastweekinaws.com/blog/the-17-ways-to-run-containers-on-aws/
"17 More Ways to Run Containers on AWS": https://www.lastweekinaws.com/blog/17-more-ways-to-run-containers-on-aws/
kubernetestheeasyway.com: https://kubernetestheeasyway.com
snark.cloud/quinntainers: https://snark.cloud/quinntainers
ECS Chargeback: https://github.com/gaggle-net/ecs-chargeback
twitter.com/nektos: https://twitter.com/nektos
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode is sponsored by our friends at Revelo.
Revelo is the Spanish word of the day, and it's spelled R-E-V-E-L-O.
It means I reveal.
Now, have you tried to hire an engineer lately?
I assure you it is significantly harder than it sounds.
One of the things that Revelo has recognized is something I've been talking about for a while.
Specifically, that while talent is evenly distributed, opportunity is absolutely not.
They're exposing a new talent pool to basically those of us without a presence in Latin America via their
platform.
It's the largest tech talent marketplace in Latin America with over a million engineers
in their network, which includes but isn't limited to talent in Mexico, Costa Rica, Brazil,
and Argentina.
Now, not only do they wind up vetting all of their talent on English ability as well
as, you know, their engineering skills, but they go significantly beyond that. Some of the folks on their platform
are hands down the most talented engineers that I've ever spoken to. Let's also not forget that
Latin America has high time zone overlap with what we have here in the United States. So you can hire
full-time remote engineers who share most of the workday with your team. It's an end-to-end talent service. So you can find and hire engineers in
Central and South America without having to worry about, frankly, the colossal pain of cross-border
payroll and benefits and compliance because Revelo handles all of it. If you're hiring engineers, check out revelo.io slash screaming to get 20%
off your first three months. That's R-E-V-E-L-O dot I-O slash screaming.
Couchbase Capella: database-as-a-service is flexible, full-featured, and fully managed, with built-in access via key-value, SQL, and full-text search. Flexible JSON documents align to your applications and workloads. Build faster with blazing-fast in-memory performance and automated replication and scaling while reducing costs. Capella has the best price-performance of any fully managed document database.
Welcome to Screaming in the Cloud.
I'm Corey Quinn. My guest today is someone that I had the pleasure of meeting at re:Invent last year, but we'll get to that story in a minute.
Casey Lee is the CTO at a company called Gaggle, which is, as they frame it, saving lives. Now, that seems to be a relatively common position that an awful lot of different
tech companies take. We're saving lives here. You show banner ads, and some of them are
attack platforms for JavaScript malware. Let's be serious here. Casey, thank you for joining me.
And what makes the statement that Gaggle saves lives not patently ridiculous?
Sure. Thanks, Corey. Thanks for having me on the show. So Gaggle, we're an ed tech company. We sell
software to school districts and school districts use our software to help protect their students
while the students use the school issued Google or Microsoft accounts. So we're looking for signs
of bullying, harassment, self-harm, and potentially suicide from K-12 students while they're using these platforms.
They will take the thoughts, concerns, emotions they're struggling with and write them in their school-issued accounts.
We detect that, and then we notify the school districts, and they get the students the help they need before they can do any permanent damage to themselves.
We protect about 6 million students
throughout the U.S. We ingest a lot of content. Last school year, over 6 billion files and about an
equal number of emails ingested. We're looking for concerning content. And then we have humans review
the stuff that our machine learning algorithms detect and flag. About 40 million items had to go in front of humans last
year, resulted in about 20,000 what we call PSSs. These are possible student situations where
students are talking about harming themselves or harming others. And that resulted in what we like
to track as lives saved: 1,400 incidents last school year where a student was dealing with suicidal ideation. They were
planning to take their own lives. We detect that and get them help within minutes before they can
act on that. That's what Gaggle's been doing. We're using tech, solving tech problems, and also
saving lives as we do it. It's easy to lob the criticism at some of the things you're alluding to. The idea
of, oh, you're using machine learning on student data for young kids, yada, yada, yada. Look at
the outcome, look at the privacy controls you have in place, and look at the outcomes you're
driving to. Now, I don't necessarily trust a number of school administrations not to become
heavy-handed and overbearing with it, but let's be clear, that's not the intent. That is not what the success stories you have allude to.
I've got to say, I'm a fan.
So thanks for doing what you're doing.
I don't say that very often to people who work in tech companies.
Cool. Thanks, Corey.
But let's rewind a bit,
because you and I had passed like ships in the night on Twitter for a while,
but last year at re:Invent,
something odd happened. First, my business partner procrastinated at getting his ticket.
That's not the odd part. He does that a lot. But then suddenly ticket sales slammed shut and none were to be had anywhere. You reached out to him: hey, I have a spare ticket because
someone can't go; let me get it to you. And I said, terrific. Let me pay you for the ticket and take you to dinner.
You said, yes, on the dinner,
but I'd rather you just look at my AWS bill
and don't worry about the cost of the ticket.
All right, said I.
I know a deal when I see one.
We grabbed dinner at the Venetian.
I said, bust out your laptop.
And you said, oh, I was kidding.
And I said, great, I wasn't. Bust it out.
And you went from laughing to taking notes
in about the usual time that happens
when I start looking at these things.
But how's your recollection of that?
I always tend to romanticize some of these things.
And then everyone in the restaurant just turned,
stopped and clapped the entire time.
Maybe that part didn't happen.
Everything was right up until the clapping part.
That was a really cool experience.
I appreciate you walking through that with me.
Yeah, we've got lots of opportunity to save on our AWS bill here at Gaggle. And in that little
bit of time that we had together, I think I walked away with no more than a dozen ideas for where to
shave some costs. The most obvious one, the first thing that you keyed in on, is we had our RIs
coming due that weren't really well optimized, and you steered me towards
savings plans. We put that in place, and we're able to apply those savings plans not just to
our EC2 instances, but also to our serverless spend as well. So that was a very worthwhile
and cost-effective dinner for us. The thing that was most surprising, though, Corey,
was your approach. Your approach to how to review our bill was not what I thought at all.
Well, what did you expect my approach was going to be?
Because this always is of interest to me.
Like, did you expect me to, like, whip a portable machine learning rig out of my backpack full
of GPUs or something?
I didn't know if you had, like, some secret tool you were going to hit.
Or if nothing else, I thought you were going to go for Cost Explorer.
I spent a lot of time in Cost Explorer.
That's my go-to tool.
And you wanted nothing to do with Cost Explorer.
I think I was actually pulling up Cost Explorer for you. And you said, I'm not interested. Take me to the
bills. So we went right to the billing dashboard. You start opening up the invoices. And I thought
to myself, I don't remember the last time I looked at an AWS invoice. I just, it's noise. It's not
something that I pay attention to. And I learned something that you get a real quick view of both
the cost and the usage.
And that's what you're keyed in on, right?
And you were looking at things relative to each other.
Okay, I have no idea about Gaggle or what they do, but normally for a company that's
spending X amount of dollars in EC2, why is your data transfer cost the way it is?
Is that high or low?
So you were looking for kind of relative numbers.
But it was really cool watching you slice and dice that bill through the dashboard there.
There are a few things I tie together there. Part of it is that this is sort of a surprising
thing that people don't think about. But start with the big numbers first rather than going
alphabetically, because I don't really care about your $6 Alexa for Business spend. I care a bit
more about the $6 million, or whatever it happens to be, on EC2.
I'm pulling numbers completely out of the ether.
Let's be clear.
I don't recall what the exact magnitude of your bill is and it's not relevant to the conversation.
And then you see that and it's like,
huh, okay, you're spending $6 million on EC2.
Why are you spending 400 bucks on S3?
Seems to me that those two should be a little closer aligned.
What's the deal here?
Oh God, you're using eight petabytes of EBS
volumes. Oh, dear. And it just tends to lead to interesting stuff. Break it down by region,
service, and use case, or usage type, rather, which is what shows up on those exploded bills,
and that's where I tend to start. It also is one of the easiest things to wind up having someone
throw into a PDF and email my way if I'm not doing it in a restaurant with, you know, people clapping, standing around.
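For listeners who want to approximate that "biggest numbers first" pass programmatically rather than by eyeballing invoices, here is a minimal sketch using the Cost Explorer API via boto3. The date range is illustrative, and this is just one way to do it, not the Duckbill method:

```python
# A minimal sketch (not from the episode) of the "biggest numbers first" pass:
# rank last month's spend by service so you start with the $6 million line
# item, not the $6 one. Assumes boto3 and AWS credentials are configured.
import boto3

ce = boto3.client("ce")  # Cost Explorer

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-03-01", "End": "2022-04-01"},  # illustrative dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

groups = resp["ResultsByTime"][0]["Groups"]
# Sort descending by cost so the conversation starts where the money is.
for g in sorted(groups,
                key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]),
                reverse=True):
    amount = float(g["Metrics"]["UnblendedCost"]["Amount"])
    print(f'{g["Keys"][0]:50s} ${amount:,.2f}')
```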
Right on.
I also want to highlight that you've been using AWS for a long time.
You're a container hero.
You are not bad at understanding the nuances and depths of AWS.
So I take praise from you around this stuff and value it very highly.
This stuff is not intuitive. It is deeply nuanced, and you
have a business outcome you are working towards that invariably is not oriented day in, day out
around, how do I get these services for less money than I'm currently paying? But that is how I see
the world, and I tend to live in a very different space just based on the nature of what I do. It's
sort of a case study in the advantage of specialization. But I know remarkably little about containers, which is
how we wound up reconnecting about a week or so before this recording.
Yeah, I saw your tweet. You were trying to run some workload, container workload,
and I could hear the frustration on the other end of Twitter when you were shaking your fist.
I should not tweet angrily, and I did in this case.
And every time I do, I regret it.
But it played well with the people.
So that does help.
I believe my exact comment was: Me: I've got this container. Run it, please. Google Cloud Run: You got it, boss. AWS has 17 ways to run containers, and they all suck.
And that's painting with an overly broad brush, let's be clear.
But that was at the tail end of two or three days of work
trying to solve a very specific, very common business problem
that I was just beating my head off of a wall again and again and again.
And it took less than half an hour from start to finish with Google Cloud Run
and I didn't have to think about it anymore.
And it was, it's one of those moments where you look at this and realize that the future is here,
we just don't see it in certain ways. And you took exception to this. So please, let's dive in,
because 280 characters of text after half a bottle of wine is not the best context to have a nuanced discussion that leaves friendships intact the following
morning. Nice. Well, I just want to make sure I understand the use case first, because I was
trying to read between the lines on what you needed, but let me take a guess. My guess is
you got your source code in GitHub, you have a Docker file, and you want to be able to take that
repo from GitHub and just have it continuously deployed somewhere and run.
And you don't want to have headaches with it.
You just want to push more changes up to GitHub,
Docker build runs, and update some service somewhere.
Am I right so far?
Ish, but think a little further up the stack.
It was in service of this show.
So this show, as people who are listening to this
are probably aware by this point, periodically has sponsors, which we love. We thank them for participating in the ongoing support of this show, which empowers conversations like this. And their links are, first, you misspelled your company name from the common English word; there are three sub-levels within the domain.
And then you have a complex UTM tagging, tracking.
Yeah, you realize people are driving to work when they're listening to this.
So I built a while back a link shortener, snark.cloud.
Because is it the shortest thing in the world?
Not really.
But it's easily understandable when I say that.
And people hear it
for what it is. And that's been running for a long time as an S3 bucket full of redirects behind CloudFront. So I wind up adding a zero-byte object with a redirect parameter on it, and it just works.
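For the curious, that zero-byte trick looks something like the following sketch; the bucket and slug names are placeholders, and it assumes the bucket is served through its S3 website endpoint behind CloudFront:

```python
# Hedged sketch of the S3 redirect trick Corey describes: a zero-byte object
# whose website-redirect metadata makes the S3 website endpoint (fronted by
# CloudFront) answer with a redirect. All names here are illustrative.
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-redirects-bucket",   # hypothetical bucket
    Key="some-slug",                     # the path after the short domain
    Body=b"",                            # zero bytes; the redirect lives in metadata
    WebsiteRedirectLocation="https://example.com/very/long/destination?utm_source=podcast",
)
```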
Now, the challenge that I have here as a business is that I am increasingly prolific these days. So anything that I am not
directly required to be doing, I probably shouldn't necessarily be the one to do it. And
care and feeding of those redirect links is a prime example of this. So I went hunting, and the
things that I was looking for were, obviously, do the redirect. Now, if you pull up GitHub, there are
hundreds of solutions here. There are AWS blog posts. One that I really liked and almost got working was Eric Johnson's three-part blog post on how to do it serverlessly with API Gateway and DynamoDB, no lambdas required. I really liked aspects of what that was, but it was complex. I kept smacking into weird challenges as I went, and front-end is just baffling to me,
because I needed a front-end app for people to be able to use here. I need to be able to secure that,
because it turns out that if you just have it open, anyone who stumbles across the URL can redirect
things to other places; well, you've just empowered a whole bunch of spam email, and you're going to
find that service abused, and everyone starts blocking it, and then you have trouble. Nothing survives first contact with jerks.
And I was getting more and more frustrated,
and then, with a few creative search terms, I found something on GitHub
by a Twitter engineer who used to work at Google Cloud.
And what it uses as a client is it doesn't build any kind of custom web app.
Instead, as a database, it uses not S3 objects,
not Route 53, the ideal database,
but a Google Sheet,
which sounds ridiculous,
but every business user here knows how to use that.
And it looks for the two columns.
The first one is the slug after the snark.cloud,
and the second is the long URL.
And it has a TTL of five seconds on the cache, so make a change to that spreadsheet, and five seconds later, it's live. Everyone gets it. I don't have to build anything new. I just put it somewhere where only the relevant people can access it.
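The pattern is easy to picture. This is not the actual project's code, just a minimal sketch of the idea as described: two columns in a published sheet, re-fetched at most every five seconds, with slugs lowercased to match the case-insensitivity tweak Corey mentions later. The sheet ID is a placeholder:

```python
# Minimal sketch (not the real tool) of "Google Sheet as redirect database":
# column A = slug, column B = long URL, cached in memory for five seconds so
# spreadsheet edits go live within roughly that window.
import csv
import io
import time
import urllib.request

SHEET_CSV_URL = "https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID/export?format=csv"
TTL_SECONDS = 5

_cache = {"fetched_at": 0.0, "redirects": {}}

def lookup(slug: str):
    now = time.time()
    if now - _cache["fetched_at"] > TTL_SECONDS:
        # Cache expired: re-read the sheet.
        with urllib.request.urlopen(SHEET_CSV_URL) as resp:
            rows = csv.reader(io.StringIO(resp.read().decode("utf-8")))
            redirects = {}
            for row in rows:
                if len(row) >= 2 and row[0].strip():
                    # Lowercase the slug: the case-insensitive behavior.
                    redirects[row[0].strip().lower()] = row[1].strip()
        _cache["redirects"] = redirects
        _cache["fetched_at"] = now
    return _cache["redirects"].get(slug.lower())
```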
I gave a tutorial and a giant warning on it, and everyone gets that. And it just works well. It was click
here to deploy, follow the steps. And the documentation was a little,
okay, I had to undo it once and redo it again. Getting the domain
ported over took a bit of time, and there were some SSL errors as the certificates were set up.
But once all of that was done, it just worked. And I tested the heck out of it. And cold starts
are relatively low and the entire thing fits within the free tier. And it is reminiscent of the magic that
I first saw when I started working with some of the cloud provider services years ago. It's been
a long time since I had that level of delight with something, especially after three days of
frustration. It's one of the, this is a great service. Why are people not shouting about this
from the rooftops? That was my perspective. And I put it out on Twitter and, oh Lord, did I get comments. What was your take on it? Well, so my take was when you're evaluating
a platform to use for running your applications, how fast it can get you to hello world is not
necessarily the best way to go. I just assumed you're wrong. I assumed, of the 17 ways AWS has
to run containers, Corey just doesn't understand. And so I went after it and I said, okay, let me see if I can
find a way that solves his use case as I understand it through a quick tweet. And so I tried App
Runner. I saw that App Runner does not meet your needs because you have to somehow get your Docker
image pushed up to a repo. AppRunner can take an image that's already
been pushed up and deployed for you, or it can build from source. But neither of those were
the way I understood your use case. Having used App Runner before via the Copilot CLI,
it is the closest, as best I can tell, to achieving what I want. But also, let's be clear,
I don't believe there's a free tier. There needs to be a load balancer in front of it. So you're
starting with $15 a month for this thing, which is not the end of
the world. Had I known at the beginning that all of this was going to be there, I would have just
signed up for a Bitly account and called it good. But here we are. I tried Copilot. Copilot is a
great developer experience, but it also is just pulling together tons of pieces. I mean, just
trying to do a Copilot service deploy: VPCs are being created, and tons of IAM roles are being created,
CodePipelines.
There's just so much going on.
I was like 20 minutes into it
and I said, yeah, this is not fitting the bill
for what Corey was looking for.
Plus, it doesn't solve the way
I understood your use case,
which is you don't want to worry about builds.
You just want to push code
and have new Docker images get built for you.
Well, honestly, let's be clear here. Once it's up and running, I don't want to ever have to
touch the silly thing again. And that so far has been the case. After I got it set up, I forked the repo
and made a couple of changes to it that I wanted to see. One of them was to render the entire thing
case insensitive because I get that one wrong a lot. And the other is I wanted to change the permanent 301 redirect to a temporary 302
redirect because occasionally sponsors will want to change where it goes in the fullness of time,
and that is just fine. But I want to be able to support that and not have to deal with old
cache data. So getting that up and running was a bit of a challenge. But the way that it worked
was following the instructions in the GitHub repo. The developer environment spun up in Google's Cloud Shell, which was just spectacular.
It prompted me for a few things, and it told me step-by-step what to do.
This is the sort of thing I could have given a basically non-technical user, and they would have had success with it.
So I tried it as well.
I said, well, okay, if I'm going to respond to Corey here and challenge him on this, I need to try Cloud Run.
I had no experience with Cloud Run.
I had a small example repo that loosely mapped what I understood you were trying to do.
Within five minutes, I had Cloud Run working.
And I was surprised.
Anytime I pushed a new change within 45 seconds, the change was built and deployed.
So here's my conclusion, Corey.
Google Cloud Run is great for your use
case, and AWS doesn't have the perfect answer. But here's my challenge to you. I think that you
just proved why there's 17 different ways to run containers on AWS. It's because there's that many
different types of users that have different needs, and you just happen to be number 18 that
hasn't gotten the right attention yet from AWS.
Well, let's be clear.
Like my gag about 17 ways to run containers on AWS was largely a joke.
And it went around the internet three times.
So I wrote a list of them on the blog post of 17 ways to run containers in AWS.
And people liked it.
And then a few months later, I wrote 17 more ways to run containers on AWS,
listing 17 additional services that all run containers.
And my favorite email that I think I've ever received in feedback was from a salty AWS employee saying that one of them didn't really count because of some esoteric reason.
And it turns out that when I'm trying to make a point of you have a sarcastic number of ways to run containers, pointing out that, well, one of them isn't quite valid, doesn't really shatter the argument. Let's be very clear here. So I appreciate the feedback.
I always do. And it's partially snark, but there is an element of truth to it in that
customers don't want to run containers by and large. That is what they do in service of a
business goal. And they want their application
to run, which in turn serves the business goal that continues to abstract out and to
remain a going concern via the current position the company stakes out. In your case, it is saving
lives. In my case, it is fixing horrifying AWS bills and making fun of Amazon at the same time.
And in most other places, there are somewhat more prosaic answers to that. But containers are simply an implementation detail, to some extent, to my way of thinking,
of getting to that point.
An important one, let's be clear, I was very anti-container for a long time.
I wrote a talk, Heresy in the Church of Docker, that then was accepted at ContainerCon.
It's like, oh boy, I'm not going to leave here alive.
And the honest answer is,
many years later, that Kubernetes solves almost all the criticisms that I had with the downside of, well, first you have to learn Kubernetes. And that continues to be mind-bogglingly complex
from where I sit. There's a reason that I've registered kubernetestheeasyway.com and repointed
it to ECS, Amazon's container service, which does not require you to cosplay
as a cloud provider yourself.
But even ECS has a number of challenges to it.
I want to be very clear here.
There are no silver bullets in this.
And you're completely correct
in that I have a large, complex environment
and the application is nuanced
and I'm willing to invest a few weeks
in setting up the baseline underlying infrastructure
on AWS with some of these services. Ideally, not all of them at once, because that's something a lunatic would
do, but getting them up and running. The other side of it, though, is that if I am trying to
evaluate a cloud provider's handling of containers and how this stuff works, the reason that everyone
starts with a Hello World-style example is that it delivers, ideally, the mean time to dopamine.
There's a reason that Hello World doesn't have 18 different dependencies across a bunch of different databases and message
queues and all the other complicated parts of running a modern application, because you just
want to see how it works out of the gate. And if getting that baseline empty container that just
returns the string Hello World is that complicated and requires that much work,
my takeaway is not that this user experience is going to get better once I make the application
itself more complicated. So I find that off-putting. My approach has always been to find something that I
can get the easy minimum viable thing up and running on. And then as I expand, know that
you'll be there to catch me as my needs
intensify and become ever more complex. But if I can't get the baseline thing up and running,
I'm unlikely to be super enthused about continuing to beat my head against the wall. Like, well,
I'll just make it more complex. That'll solve the problem because it often does not. That's my
position. Yeah. I agree that that dopamine hit is valuable in getting attached to and wanting to invest
in whatever tech stack you're using. The challenge is your second part of that. Your
second part is, will it grow with me and scale with me and support the complex edge cases that
I have? And the problem I've seen is a lot of organizations will start with something that's
very easy to get started with and then quickly outgrow it, and then come up with all sorts of weird
Rube Goldberg-type solutions
because they jumped all in before seeing whether it would scale with them.
I've got kind of an example of that.
I'm happy to announce that there's now 18 ways
to run containers on AWS
because your use case,
in the spirit of AWS customer obsession,
I hear your use case. I've created
an open source project that I want to share called Quinntainers. Oh, no. And it solves... Yes.
Quinntainers is live and is ready for the world. So now we've got 18 ways to run containers. And
if you have Corey's use case of, hey, here's my container, run it for me, now we've got one
command that you can run to get things going for you. I can share a link for you and you can check it out.
This is a little... Oh, we're putting that in the show notes for sure. In fact, if you go to
snark.cloud/quinntainers, you'll find it. You'll find it. There you go. The idea here was
this. There is a real use case that you had. And I looked, and AWS does not have an out-of-the-box, simple solution for you.
I agree with that.
And Google Cloud Run does.
Well, the answer from AWS would have been: well, then we need to make that solution.
And so that's what this was: a way to demonstrate that it is a solvable problem.
AWS has all the right primitives.
Just that use case hadn't been covered.
So how does Quinntainers work?
Real straightforward. It's a command line, an npm tool. You just run npx quinntainers. It sets
up a GitHub Actions role in your AWS account. It then creates a GitHub Actions workflow in your
repo and then uses the Quinntainers reusable GitHub Action that builds the image for you
every time you push to the branch,
pushes it up to ECR, and then automatically pushes up that new version of the image to
App Runner for you. So now it's using App Runner under the covers, but it's providing that nice
developer experience that you were getting out of Cloud Run.
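To make that flow concrete: this is not Quinntainers' actual source, just a sketch of what such a workflow's deploy step boils down to once the image is built and pushed to ECR. The service ARN and image URI are placeholders:

```python
# A sketch (not Quinntainers' real code) of the deploy step after the GitHub
# Action has built the image and pushed it to ECR: point App Runner at the
# new image tag. ARN and image URI below are placeholders.
import boto3

apprunner = boto3.client("apprunner")

apprunner.update_service(
    ServiceArn="arn:aws:apprunner:us-east-1:123456789012:service/example/abc123",
    SourceConfiguration={
        "ImageRepository": {
            "ImageIdentifier": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example:sha-deadbeef",
            "ImageRepositoryType": "ECR",
            "ImageConfiguration": {"Port": "8080"},
        },
    },
)
# (With auto-deployments enabled on a fixed tag, a bare
# apprunner.start_deployment(ServiceArn=...) call would redeploy instead.)
```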
Look, is Quinntainers really the right way to go with running containers? No, I'm not making that point at all. But the point is, it might very well be.
Well, if you want to show a good Hello World experience,
Quinntainers is the best.
Because within 30 seconds, your app is now set up
to continuously deliver containers into AWS
for your very specific use case.
The problem is, it's not going to grow for you.
I mean, it was something I did
over the weekend just for fun. It's not something that would ever be worthy of hitching up a real
production workload to. So the point there is you can build frameworks and tools that are very good
at getting that initial dopamine hit, but then are not going to be there for you necessarily
as you mature and get more complex.
And yet, I've tilted a couple of times at the windmill
of integrating GitHub Actions
in anything remotely resembling a programmatic way
with AWS's services, as far as instance roles go.
Are you using permanent credentials for this,
as stored secrets, or are you doing the OIDC handoff?
OIDC. So what happens is the tool creates the IAM role for you with the trust policy
on GitHub's OIDC provider, sets all that up for you in your account, locks it down so that just
your repo and your main branch is able to assume the role. The role is set up
just to allow deployments to App Runner and the ECR repository. And then that's it.
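That trust relationship is worth seeing spelled out. Here's a minimal sketch of the kind of role Casey describes, with the account ID, org, repo, and role name as placeholders; the actual tool sets this up for you:

```python
# Minimal sketch of the GitHub OIDC trust pattern: an IAM role that only
# GitHub Actions runs from one repo's main branch can assume. All
# identifiers below are placeholders.
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
        },
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringEquals": {
                "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
            },
            "StringLike": {
                # Locked down to one repo and its main branch, as described.
                "token.actions.githubusercontent.com:sub": "repo:example-org/example-repo:ref:refs/heads/main"
            },
        },
    }],
}

iam.create_role(
    RoleName="github-actions-deploy",  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
# A permissions policy scoped to App Runner and the one ECR repository
# would then be attached to this role.
```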
At that point, it's out of your way and you just git push.
And a couple minutes later, your updates are now running in AppRunner for you.
This episode is sponsored in part by our friends at Vultr.
Optimized cloud compute plans have landed at Vultr to deliver lightning-fast processing power, courtesy of third-gen AMD
EPYC processors, without the I/O or hardware limitations of a traditional multi-tenant
cloud server. Starting at just $28 a month, users can deploy general-purpose CPU, memory,
or storage-optimized cloud instances in more than 20 locations across five continents.
Without looking, I know that once again Antarctica has gotten the short end of the stick. Launch your
Vultr optimized compute instance in 60 seconds or less on your choice of included operating systems,
or bring your own. It's time to ditch convoluted and unpredictable giant tech company billing practices and say goodbye to
noisy neighbors and egregious egress forever. Vultr delivers the power of the cloud with
none of the bloat. Screaming in the Cloud listeners can try Vultr for free today
with $150 in credit when they visit getvultr.com slash morning. That's G-E-T-V-U-L-T-R dot com
slash morning. My thanks to them for sponsoring this ridiculous podcast.
Don't undersell what you've just built. This is something that, is this what I would use for a
large-scale production deployment? Obviously not, but it has streamlined and made incredibly
accessible things that previously have been very complex for folks to get up and running. One of the most disturbing themes behind some
of the feedback I got was at one point I said that, well, have you tried running a Docker
container on Lambda? Because now it supports containers as a packaging format. And I said,
no, because I spent a few weeks getting Lambda up and running when it first came out. And I
basically been copying and pasting
what I got working ever since,
the way most of us do.
And the response is, oh, that explains a lot,
with the implication being that I'm just a fool.
Maybe, but let's be clear,
I am never the only person in the room
who doesn't know how to do something.
I'm just loud about what I don't know.
And the failure mode of a bad user experience
is that a customer feels dumb.
And that's not okay because this stuff is complicated.
And when a user has a bad time, it's a bug.
I learned that in 2012 from Jordan Sissel,
the creator of Logstash.
He has been an inspiration to me for the last 10 years.
And that's something I try to live by,
that if a user has a bad time, something needs to get fixed.
Maybe it's the tool itself. Maybe it's the documentation.
Maybe it's the way the GitHub repo's README is structured, in a way that just makes it accessible.
Because I am not a trailblazer in most things, and nor do I intend to be.
I'm not the world's best engineer by a landslide.
Just look at my code, and you'd argue the fact that I'm an engineer at all. But if it's bad and it works, how bad is it? It's sort of the
other side of it. So my problem is that there needs to be a couple of things. Ignore for a
second the aspect of making it the right answer to get something out of the door. The fact that
I want to take this container and just run it, and you and I both reach for App Runner as the default AWS service that does this, because I've been swimming in the AWS waters a while, and you're a freaking container hero. There are, I believe, 15 ways to run containers on mobile and 19 ways to run containers on non-mobile, which is just fascinating in its own right.
And it's overwhelming, it's confusing, and it's not something that makes it abundantly clear what the golden path is.
First, get it up and working.
Get it running.
Then you can add
nuance and flavor and the rest. And I think that's something that's gotten overlooked in our mad rush
to pretend that we're all Google engineers circa 2012. I think people get stressed out when they
try to run containers in AWS because they think, what is that golden path? You said golden path.
And my advice to people is there is no golden path. And the great thing about AWS
is they do continue to invest
in the solutions they come up with.
I'm still bitter about Google Reader.
As am I.
Yeah, I spent so much time
getting my perfect set of RSS feeds,
and then I had to find somewhere else.
With AWS, the different offerings
that are available for running containers,
those are there intentionally.
It's not by accident. They're there to solve specific problems. So the trick is finding what
works best for you and don't feel like one is better than the other or is going to get more
attention than others. And they each have different use cases. And I approach it this way.
I've seen a couple of different people do some great flow charts. I think Forrest did one,
Vlad did one on ways to make the decision on how
to run your containers. And I break it down to three questions. I ask people, first of all,
where are you going to run these workloads? If someone says it has to be in the data center,
okay, cool. Then ECS Anywhere or EKS Anywhere, and we'll figure out if Kubernetes is needed.
If they have specific requirements, so that they say, no, we can run in the cloud,
but we need privileged mode for containers,
or we need EBS volumes,
or we want really small container sizes,
like less than a quarter of a vCPU,
or less than half a gig of RAM,
or if you have custom log requirements,
Fargate's not going to work for you,
so you're going to run on EC2.
Otherwise, run it on Fargate.
But that's the first question.
Figure out where are you going to run your containers.
That leads to the second question.
What's your control plane?
But those are different, sort of related, but different questions.
And I only see six options there.
That's AppRunner for your control plane.
Lightsail for your control plane.
ROSA, if you're invested in OpenShift already.
EKS, either if you have momentum in Kubernetes or you have a bunch of
engineers that have a bunch of experience with Kubernetes. If you don't have either, don't choose
it. Or ECS. The last option is Elastic Beanstalk, but let's leave that as: if you're not currently
invested in Elastic Beanstalk, don't start today. But I look at those as, okay, so first question,
where am I going to run my containers? Second question, what do I want to use for my control plane? And there's different pros and cons of each of those.
And then the third question, how do I want to manage them? What tools do I want to use for
managing deployment? All those other tools like Copilot or App2Container or Proton, those aren't
my control plane. Those aren't where I run my containers. That's how I manage, deploy, and orchestrate all
the different containers. So I look at it as those three questions.
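Casey's three questions are concrete enough to write down. Here's a small sketch that encodes the heuristic exactly as he states it; it's a conversation aid, not official AWS guidance:

```python
# A sketch encoding Casey's three-question heuristic as stated in the
# episode. Inputs and option names come straight from his description.

def where_to_run(on_prem: bool, needs_privileged: bool, needs_ebs: bool,
                 tiny_tasks: bool, custom_logging: bool) -> str:
    # Question 1: where do the workloads run?
    if on_prem:
        return "ECS Anywhere or EKS Anywhere"
    if needs_privileged or needs_ebs or tiny_tasks or custom_logging:
        return "EC2-backed"   # Fargate won't fit these requirements
    return "Fargate"

def control_plane(openshift: bool, k8s_experience: bool,
                  on_beanstalk_already: bool, wants_scale_to_zero: bool,
                  wants_console_simplicity: bool) -> str:
    # Question 2: what's the control plane? Six options, per Casey.
    if openshift:
        return "ROSA"
    if k8s_experience:
        return "EKS"               # momentum or a team that knows it; else skip
    if wants_scale_to_zero:
        return "App Runner"        # scale-to-zero, like Cloud Run
    if wants_console_simplicity:
        return "Lightsail"         # simple monthly pricing, console-first
    if on_beanstalk_already:
        return "Elastic Beanstalk" # don't start today if you're not on it
    return "ECS"

# Question 3 -- how do you manage and deploy (Copilot, App2Container,
# Proton, ...) -- is tooling on top, not a control plane, so it isn't
# encoded here.
```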
But I don't know, what do you think of that, Corey? I think you're onto something. I think that that is a terrific way of exploring
that question. I would argue that setting up a framework like that one or very similar
is what the AWS containers page should be. It's coming from
the perspective of what is the neophyte customer experience. On some level, you almost need a
slider of choose your level of experience ranging from what's a container to I named my kid
Kubernetes because I make terrible life decisions and anywhere in between. Sure. Yeah. Well, and I
think that really dictates the control plane level. So for
example, LightSail, where does LightSail fit? To me, the value of LightSail is the simplicity.
I'm looking at a monthly pricing, seven bucks a month for a container. I don't know how this
other stuff works, but I can think in terms of monthly pricing and it's tailored towards a console
user. Someone just wants to click in, point to an image. That's a very specific user. There's
thousands of customers that are very happy with that experience and they use it. AppRunner presents that scale to zero.
That's one of the big selling points I see with AppRunner. Likewise with Google Cloud Run. I've
got that scale to zero. I can't do that with ECS or EKS or any of the other platforms. So if you've
got something that has a ton of idle time, I'd really be looking at those. I would argue that, I think I did the math, Google Cloud Run is about 30% more expensive
than App Runner. Yeah, if you disregard the free tier, I think that to have it running persistently
at all times throughout the month, to drop the cold starts, would cost something like 40-some-odd
bucks a month or something like that. Don't quote me on it. Again, and to be clear, I wound
up doing this very congratulatory and complimentary tweet about them
on, I think it was Thursday, and then they immediately apparently took one look at this
and said, holy shit, Corey's saying nice things about us, what do we do, what do we do? Panic.
And the next morning they raised prices on a bunch of cloud offerings. Whoo, that'll fix it!
Like, did you miss the direction you're going on here?
No, that's the exact opposite of what you should be doing.
But here we are.
Interestingly enough, to tie our two conversation threads together,
when I look at an AWS bill, unless you're using Fargate,
I can't tell whether you're using Kubernetes or not.
Because EKS is a small charge in almost every case for the control plane or Fargate under it.
Everything else just manifests as EC2 spend.
From the perspective of the cloud provider, if you're running a Kubernetes cluster, it is a single-tenant application that can have some very funky behaviors like cross-AZ chatter back and forth.
Because there's no internal mechanism to say, talk to the free thing rather than the two cents a gigabyte thing. It winds up spinning up and down in a bunch of different ways. And the behavior patterns, because of how placement works,
are not necessarily deterministic, depending upon workload. And that becomes something that
people find odd when, okay, you look at our bill for a week, what could you say? Well, first
question, are you running Kubernetes at all? And they're like, who invited these clowns?
Understand, we're not prying into your workloads for a variety of excellent
legal and contractual reasons here.
We are looking at how they behave
and for specific workloads, once we have a
conversation with the engineering team, yeah, we're going to dive
in. But it is not
at all intuitive from the outside to make
any determination whether you're running containers
or whether you're running VMs
that you just haven't done anything with in
20 years, or what exactly is going on.
And that's just an artifact of the billing system.
We ran into this challenge at Gaggle.
We don't use EKS, we use ECS, but we have some shared clusters.
Lots of EC2 spend, hard to figure out which team is creating the services that's running that up.
We actually ended up creating a tool.
We open sourced it: ECS Chargeback. And what it does is it looks at the CPU and memory reservations for each task definition, and then
prorates the overall charge of the ECS cluster and creates metrics in Datadog to give us
a breakdown of cost per ECS service. And it also measures what we like to refer to as waste,
right? Because if you're reserving four gigs of memory but your utilization never goes over two
gigs, we're paying for that reservation, but you're underutilizing. So we're able to also show
which services have the highest degree of waste, not just utilization. So it helps us go after it.
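The published tool lives at github.com/gaggle-net/ecs-chargeback; what follows is only a back-of-the-envelope sketch of the proration and waste math as described, with the even CPU/memory blend being an assumption rather than the tool's actual formula:

```python
# Back-of-the-envelope sketch of the chargeback idea: prorate the cluster's
# cost by each service's CPU/memory reservations, and report waste as the
# reserved-but-unused fraction. The 50/50 CPU/memory blend is an assumption;
# see github.com/gaggle-net/ecs-chargeback for the real tool.

def prorate(cluster_cost: float, services: dict) -> None:
    total_cpu = sum(s["cpu_reserved"] for s in services.values())
    total_mem = sum(s["mem_reserved"] for s in services.values())
    for name, s in services.items():
        share = (0.5 * s["cpu_reserved"] / total_cpu
                 + 0.5 * s["mem_reserved"] / total_mem)
        # e.g. 4 GiB reserved, peak 2 GiB used -> 50% waste
        waste = 1 - s["mem_peak"] / s["mem_reserved"]
        print(f"{name}: ${cluster_cost * share:,.2f}/mo, memory waste {waste:.0%}")

# Illustrative numbers only (CPU in ECS CPU units, memory in MiB):
prorate(10_000.0, {
    "api":    {"cpu_reserved": 4096, "mem_reserved": 8192, "mem_peak": 6144},
    "worker": {"cpu_reserved": 2048, "mem_reserved": 4096, "mem_peak": 2048},
})
```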
But this is a hard problem. I'd be curious, how do you approach these shared ECS resources and slicing and dicing those bills?
Everyone has a different approach to this. There is no unifiable correct answer. A previous show
guest, Peter Hamilton over at Remind had done something very similar, open sourced a bunch of
these things. Understanding what your spend is, is important on this. And it comes down to getting at the actual business
concern, because in some cases, effectively, dead reckoning's enough. You take a look at the cluster
that is really hard to attribute because it's a shared service. Great. It is 5% of your bill.
First pass, why don't we just agree that it is a third for service A, two-thirds for service B,
and we'll call it mostly good at that point. That can be enough in a lot of cases. With scale, you're just sort of hand-waving over
many millions of dollars a year there. How about we get into some more depth,
and then you start instrumenting and reporting to something, be it CloudWatch, be it Datadog,
be it something else, and understanding what the use case is. In some cases,
customers have broken apart shared clusters for that specific reason. I don't think that's necessarily the best approach from an engineering perspective. But again, this is not purely an engineering decision. It comes down to serving the business need. And if you're taking a partial credit on that cluster for a tax credit for R&D, for example, you want that position to be extraordinarily defensible and spending a few
extra dollars to ensure that it is, is the right business decision. I mean, again, we're pure
advisory. We advise customers on what we would do in their position, but people often mistake that
to be, we're going to go for the lowest possible price, bad idea, or that we're going to wind up
doing this from a purely engineering-centric point of view. It's: be aware that, in almost every case,
with some very notable weird exceptions,
the AWS bill costs significantly less
than the payroll expense that you have
of people working on the AWS environment in various ways.
People are more expensive.
So the idea of, well, if
you can save a whole bunch of engineering effort
by spending a bit more on your cloud bill,
Yeah, let's go ahead and do that. Yeah, good point. The real mark of someone who's senior
enough is their answer to almost any question is it depends. And I feel I've fallen into that trap
as well. But I'd love to sit here and say, oh, it's really simple. You do X, Y, and Z.
Honestly, my answer, the simple answer is I think that we orchestrate a cyberbullying campaign
against AWS through the AWS wishlist hashtag. We get people to harass their account managers with
repeated requests for, hey, could you go ahead and put that request in? Give that a plus-one for me,
whatever internal system you're using. Just because this is a problem we're seeing more and more,
given that it's an unbounded growth problem, we're going to see it more and more for the
foreseeable future. So I wish I had a better answer for you. But yeah, "that stuff's super hard" is honest, but it's also not the most useful answer for most folks.
Well, I'd love feedback from anyone from you or your team on that tool that we created.
I can share a link after the fact.
ECS Chargeback is what we call it.
Excellent.
I will follow up with you separately on that.
That is always worth diving into. I'm curious to see new and exciting approaches to this. Just be aware that we have
an obnoxious talent sometimes for seeing these things: well, what about some weird
corner or edge case that either invalidates the entire thing, or you're like, who on earth would
ever have a problem like that? And the answer is always the next customer. For a bounded problem
space of the AWS bill, every time I think I've seen it all, I just talk to one more customer.
Cool. In fact, the way that we approached your teardown in the restaurant is how we launch
our first-pass approach, because there's value in something like that that is different than
the value of a six-to-eight-week-long deep-dive engagement into every nook and cranny.
Yeah, for sure. It was valuable to us. Yeah. Having someone come in and just spend a day with your team, diving into it
up one side and down the other, it seems like a weird thing. How much good could you possibly do
in a day? And the answer in some cases is we had Honeycomb saying that in a couple of days of
something like this, we wound up blowing 10% off their entire operating budget for the company. It led to an increased valuation. Liz Fong-Jones has said on multiple
occasions that the company would not be what it was without our efforts on their bill, which is
just incredibly gratifying to hear. It's easy to get lost in the idea of, well, it's the AWS bill.
It's just making big companies spend a little bit less to another big company. And that's not exactly saving the lives of K-12 students here. It's opening up opportunities.
Yeah. It's about optimizing for the win for everyone. Because now AWS gets a lot more money
from a Honeycomb than they would if Honeycomb had not continued on their trajectory. You can
charge customers a lot right now, or you can charge them a little bit over time
and grow with them in a partnership context.
I've always opted for the second model
rather than the first.
Right on.
But here we are.
I want to thank you for taking so much time
out of, well, several days now
to argue with me on Twitter,
which is always appreciated,
particularly when it's, you know, constructive.
Thanks for that,
for helping me get my business partner to re:Invent.
Although then he got me that horrible puzzle
of a thousand pieces
for the Cloud Native Computing Foundation landscape.
And now I don't ever want to see him again.
So, you know, that happens.
And of course, spending the time to write Quinntainers,
which is going to be at snark.cloud/quinntainers
as soon as we're done with this recording.
Then I'm going to kick the tires
and send some pull requests.
Right on.
Yeah.
Thanks for having me.
I appreciate you starting the conversation.
I would just conclude with, I think that, yes, there are a lot of ways to run containers
in AWS.
Don't let it stress you out.
They're there with intention.
They're there by design.
Understand them.
I would also encourage people to go a little deeper, especially if you've got a significantly large workload.
You've got to get your hands dirty.
As a matter of fact, there's a hands-on lab that a company called Liatrio does.
They call it their Ignite Lab.
It's a one-day, free, hands-on lab where you run legacy monolithic Java applications on Kubernetes.
It gives you firsthand experience on how to get all the way up into observability
and doing things like canary deployments.
That's a great, great lab.
But you got to do something like that to really get your hands dirty and understand how these
things work.
So don't sweat it.
There's not one right way.
There's a way that'll probably work best for each user.
And just take the time and understand the ways to make sure you're applying the one
that's going to give you the most runway for your workload. I will definitely dig into that myself. I think you're right. I
think you have nailed a point that is, again, a nuanced one and challenging to put in a rage tweet.
But these services don't exist in a vacuum. They're not there because, despite the joke,
someone wants to get promoted. It's because there are customer needs that are going on that,
and this is another way of meeting those needs. I think there could be better guidance, but I also
understand that there are a lot of nuanced perspectives here and that hell is someone
else's workflow. And there's always value in broadening your perspective a bit on those
things. If people want to learn more about you and how you see the world, where's the best place
to find you? Probably on Twitter, twitter.com slash Nektos, N-E-K-T-O-S.
That might be the first time Twitter has been described
as the best place for anything,
but thank you once again for your time.
It is always appreciated.
Thanks, Corey.
Casey Lee, CTO at Gaggle and AWS Container Hero,
and apparently writing code in anger
to invalidate my points,
which is always appreciated. Please do more of that, folks. I'm cloud economist Corey Quinn,
and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star
review on your podcast platform of choice or the YouTube comments, which is always a great place to
go for reading. Whereas if you've hated this podcast, please leave a five-star review in the usual
places and an angry comment
telling me that I'm completely wrong and then launching your own open source tool to point out
exactly what I've gotten wrong this time. If your AWS bill keeps rising and your blood pressure is
doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it
smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations
to your business, and we get to the point. Visit duckbillgroup.com to get started. This has been a HumblePod production. Stay humble.