Screaming in the Cloud - Creating “Quinntainers” with Casey Lee

Episode Date: April 20, 2022

About Casey

Casey spends his days leveraging AWS to help organizations improve the speed at which they deliver software. With a background in software development, he has spent the past 20 years architecting, building, and supporting software systems for organizations ranging from startups to Fortune 500 enterprises.

Links Referenced:

“17 Ways to Run Containers in AWS”: https://www.lastweekinaws.com/blog/the-17-ways-to-run-containers-on-aws/
“17 More Ways to Run Containers on AWS”: https://www.lastweekinaws.com/blog/17-more-ways-to-run-containers-on-aws/
kubernetestheeasyway.com: https://kubernetestheeasyway.com
snark.cloud/quinntainers: https://snark.cloud/quinntainers
ECS Chargeback: https://github.com/gaggle-net/ecs-chargeback
twitter.com/nektos: https://twitter.com/nektos

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode is sponsored by our friends at Revelo. Revelo is the Spanish word of the day, and it's spelled R-E-V-E-L-O.
Starting point is 00:00:38 It means I reveal. Now, have you tried to hire an engineer lately? I assure you it is significantly harder than it sounds. One of the things that Ravello has recognized is something I've been talking about for a while. Specifically, that while talent is evenly distributed, opportunity is absolutely not. They're exposing a new talent pool to basically those of us without a presence in Latin America via their platform. It's the largest tech talent marketplace in Latin America with over a million engineers
Starting point is 00:01:11 in their network, which includes but isn't limited to talent in Mexico, Costa Rica, Brazil, and Argentina. Now, not only do they wind up spreading all of their talent on English ability as well as, you know, their engineering skills, but they go significantly beyond that. Some of the folks on their platform are hands down the most talented engineers that I've ever spoken to. Let's also not forget that Latin America has high time zone overlap with what we have here in the United States. So you can hire full-time remote engineers who share most of the workday as your team. It's an end-to-end talent service. So you can find and hire engineers in Central and South America without having to worry about, frankly, the colossal pain of cross-border
Starting point is 00:01:57 payroll and benefits and compliance because Revelo handles all of it. If you're hiring engineers, check out revelo.io slash screaming to get 20% off your first three months. That's R-E-V-E-L-O dot I-O slash screaming. Couchbase Capella. Database as a service is flexible, full featured, and fully managed with built-in access via key value, SQL, and full-text search. Flexible JSON documents align to your applications and workloads. Build faster with blazing fast in-memory performance and automated replication and scaling while reducing cost. Capella has the best price performance of any fully managed document database. Welcome to Screaming in the Cloud. I'm Corey Quinn. My guest today is someone that I had the pleasure of meeting at reInvent last year, but we'll get to that story in a minute.
Starting point is 00:03:14 Casey Lee is the CTO at a company called Gaggle, which is, as they frame it, saving lives. Now, that seems to be a relatively common position that an awful lot of different tech companies take. We're saving lives here. It's you show banner ads, and some of them are attack platforms for JavaScript malware. Let's be serious here. Casey, thank you for joining me. And what makes the statement that Gaggle saves lives not patently ridiculous? Sure. Thanks, Corey. Thanks for having me on the show. So Gaggle, we're an ed tech company. We sell software to school districts and school districts use our software to help protect their students while the students use the school issued Google or Microsoft accounts. So we're looking for signs of bullying, harassment, self-harm, and potentially suicide from K-12 students while they're using these platforms.
Starting point is 00:04:09 They will take the thoughts, concerns, emotions they're struggling with and write them in their school-issued accounts. We detect that, and then we notify the school districts, and they get the students the help they need before they can do any permanent damage to themselves. We protect about 6 million students throughout the U.S. We ingest a lot of content. Last school year, over 6 billion files, about the equal number of emails ingested. We're looking for concerning content. And then we have humans review the stuff that our machine learning algorithms detect and flag. About 40 million items had to go in front of humans last year, resulted in about 20,000 what we call PSSs. These are possible student situations where students are talking about harming themselves or harming others. And that resulted in what we like
Starting point is 00:04:57 to track as lives saved. 1,400 incidents last school year where a student was dealing with suicide ideation. They were planning to take their own lives. We detect that and get them help within minutes before they can act on that. That's what Kaggle's been doing. We're using tech, solving tech problems, and also saving lives as we do it. It's easy to lob the criticism at some of the things you're alluding to. The idea of, oh, you're using machine learning on student data for young kids, yada, yada, yada. Look at the outcome, look at the privacy controls you have in place, and look at the outcomes you're driving to. Now, I don't necessarily trust the number of school administrations not to become heavy-handed and overbearing with it, but let's be clear, that's not the intent. That is not what the success stories you have allude to.
Starting point is 00:05:48 I've got to say, I'm a fan. So thanks for doing what you're doing. I don't say that very often to people who work in tech companies. Cool. Thanks, Corey. But let's rewind a bit, because you and I had passed ships in the night on Twitter for a while, but last year at reInvent, something odd happened. First, my business partner procrastinated at getting his ticket.
Starting point is 00:06:14 That's not the odd part. He does that a lot. But then suddenly ticket sales slam shot and none were to be had anywhere. You reached out with him. Hey, I have a spare ticket because someone can't go. I'll let me get it to you. And I said, terrific. Let me pay you for the ticket and take you to dinner. You said, yes, on the dinner, but I'd rather you just look at my AWS bill and don't worry about the cost of the ticket. All right, said I. I know a deal when I see one. We grabbed dinner at the Venetian.
Starting point is 00:06:37 I said, bust out your laptop. And you said, oh, I was kidding. And I said, great, I wasn't busted out. And you went from laughing to taking notes and about the usual time that happens when I start looking at these things. But how's your recollection of that? I always tend to romanticize some of these things.
Starting point is 00:06:52 And then everyone in the restaurant just turned, stopped and clapped the entire time. Maybe that part didn't happen. Everything was right up until the clapping part. That was a really cool experience. I appreciate you walking through that with me. Yeah, we've got lots of opportunity to save on our AWS bill here at Gaggle. And in that little bit of time that we had together, I think I walked away with no more than a dozen ideas for where to
Starting point is 00:07:16 shape some costs. The most obvious one, the first thing that you keyed in on is we had our eyes coming due that weren't really well optimized, and you steered me towards savings plans. We put that in place, and we're able to apply those savings plans not just to our EC2 instances, but also to our serverless spend as well. So that was a very worthwhile and cost-effective dinner for us. The thing that was most surprising, though, Corey, was your approach. Your approach to how to review our bill was not what I thought at all. Well, what did you expect my approach was going to be? Because this always is of interest to me.
Starting point is 00:07:49 Like, did you expect me to, like, whip a portable machine learning rig out of my backpack full of GPUs or something? I didn't know if you had, like, some secret tool you were going to hit. Or if nothing else, I thought you were going to go for the cost explorer. I spent a lot of time in cost explorer. That's my go-to tool. And you wanted nothing to do with cost explorer. I think I was actually pulling up cost explorer for you. and you said, I'm not interested. Take me to the
Starting point is 00:08:07 bills. So we went right to the billing dashboard. You start opening up the invoices. And I thought to myself, I don't remember the last time I looked at an AWS invoice. I just, it's noise. It's not something that I pay attention to. And I learned something that you get a real quick view of both the cost and the usage. And that's what you're keyed in on, right? And you were looking at things relative to each other. Okay, I have no idea about Gaggle or what they do, but normally for a company that's spending X amount of dollars in EC2, why is your data transfer cost the way it is?
Starting point is 00:08:38 Is that high or low? So you were looking for kind of relative numbers. But it was really cool watching you slice and dice that bill through the dashboard there. There are a few things I tie together there. Part of it is that this is sort of a surprising thing that people don't think about. But start with the big numbers first rather than going alphabetically, because I don't really care about your $6 Alexa for business spend. I care a bit more about the $6 million or whatever it happens to be at EC2. I'm pulling numbers completely out of the ether.
Starting point is 00:09:07 Let's be clear. I don't recall what the exact magnitude of your bill is and it's not relevant to the conversation. And then you see that and it's like, huh, okay, you're spending $6 million on EC2. Why are you spending 400 bucks on S3? Seems to me that those two should be a little closer aligned. What's the deal here? Oh God, you're using eight petabytes of EBS
Starting point is 00:09:25 volumes. Oh, dear. And it just tends to lead to interesting stuff. Break it down by region, service, and use case, or usage type, rather, is what shows up on those exploded bills, and that's where I tend to start. It also is one of the easiest things to wind up having someone throw into a PDF and email my way if I'm not doing it in a restaurant with, you know, people clapping, standing around. Right on. I also want to highlight that you've been using AWS for a long time. You're a container hero. You are not bad at understanding the nuances and depths of AWS.
Starting point is 00:09:58 So I take praise from you around this stuff as valuing it very highly. This stuff is not intuitive. It is deeply nuanced, and you have a business outcome you are working towards that invariably is not oriented day in, day out around, how do I get these services for less money than I'm currently paying? But that is how I see the world, and I tend to live in a very different space just based on the nature of what I do. It's sort of a case study in the advantage of specialization. But I know remarkably little about containers, which is how we wound up reconnecting about a week or so before this recording. Yeah, I saw your tweet. You were trying to run some workload, container workload,
Starting point is 00:10:39 and I could hear the frustration on the other end of Twitter when you were shaking your fist. I should not tweet angrily, and I did in this case. And every time I do, I regret it. But it played well with the people. So that does help. I believe my exact comment was, me, I've got this container. Run it, please. Google Cloud Run.
Starting point is 00:10:58 You got it, boss. AWS has 17 ways to run containers, and they all suck. And that's painting with an overly broad brush, let's be clear. But that was at the tail end of two or three days of work trying to solve a very specific, very common business problem that I was just beating my head off of a wall again and again and again. And it took less than half an hour from start to finish with Google Cloud Run and I didn't have to think about it anymore.
Starting point is 00:11:30 And it was, it's one of those moments where you look at this and realize that the future is here, we just don't see it in certain ways. And you took exception to this. So please, let's dive in, because 280 characters of text after half a bottle of wine is not the best context to have a nuanced discussion that leaves friendships intact the following morning. Nice. Well, I just want to make sure I understand the use case first, because I was trying to read between the lines on what you needed, but let me take a guess. My guess is you got your source code in GitHub, you have a Docker file, and you want to be able to take that repo from GitHub and just have it continuously deployed somewhere and run. And you don't want to have headaches with it.
Starting point is 00:12:08 You just want to push more changes up to GitHub, Docker build runs, and update some service somewhere. Am I right so far? Ish, but think a little further up the stack. It was in service of this show. So this show, as people who are listening to this are probably aware by this point, periodically has sponsors, which we love. We thank them for participating in the ongoing support of this show, which empowers conversations like this. And it's, first, you misspelled your company name from the common English word. There are three sub-levels within the domain. And then you have a complex UTM tagging, tracking.
Starting point is 00:12:50 Yeah, you realize people are driving to work when they're listening to this. So I built a while back a link shortener, snark.cloud. Because is it the shortest thing in the world? Not really. But it's easily understandable when I say that. And people hear it for what it is. And that's been running for a long time as an S3 bucket with full of redirects behind CloudFront. So I wind up adding a zero byte object with a redirect parameter on it,
Starting point is 00:13:16 and it just works. Now, the challenge that I have here as a business is that I am increasingly prolific these days. So anything that I am not directly required to be doing, I probably shouldn't necessarily be the one to do it. And care and feeding of those redirect links is a prime example of this. So I went hunting and the things that I was looking for were obviously do the redirect. Now, if you pull up GitHub, there are hundreds of solutions here. There are AWS blog posts. One that I really liked and almost got working was Eric Johnson's three-part blog post on how to do it serverlessly with API Gateway and DynamoDB, no lambdas required. I really liked aspects of what that was, but it was complex. I kept smacking into weird challenges as I went, and front-end is just baffling to me, because I needed a front-end app for people to be able to use here. I need to be able to secure that, because it turns out that if you just have a, anyone who stumbles across the URL can redirect
Starting point is 00:14:16 things to other places, well, you've just empowered a whole bunch of spam email, and you're going to find that service abused, and everyone starts blocking it, and then you have trouble. Nothing lasts the first encounter with jerks. And I was getting more and more frustrated, and then I found something by a Twitter engineer on GitHub with a few creative search terms who used to work at Google Cloud. And what it uses as a client is it doesn't build any kind of custom web app. Instead, as a database, it uses not S3 objects, not Route 53, the ideal database,
Starting point is 00:14:49 but a Google Sheet, which sounds ridiculous, but every business user here knows how to use that. And it looks for the two columns. The first one is the slug after the snark.cloud, and the second is the long URL. And it has a TTL of five seconds on cache, so make a change to that spreadsheet. Five seconds later, it's live. Everyone gets it. I don't have to build
Starting point is 00:15:11 anything new. I just put it somewhere where only the relevant people can access it. I gave a tutorial and a giant warning on it, and everyone gets that. And it just works well. It was click here to deploy, follow the steps. And the documentation was a little, okay, I had to undo it once and redo it again. Getting the domain registered was getting ported over, took a bit of time. And there were some SSL errors as the certificates were set up. But once all of that was done, it just worked. And I tested the heck out of it. And cold starts are relatively low and the entire thing fits within the free tier. And it is reminiscent of the magic that I first saw when I started working with some of the cloud provider services years ago. It's been
Starting point is 00:15:51 a long time since I had that level of delight with something, especially after three days of frustration. It's one of the, this is a great service. Why are people not shouting about this from the rooftops? That was my perspective. And I put it out on Twitter and, oh Lord, did I get comments. What was your take on it? Well, so my take was when you're evaluating a platform to use for running your applications, how fast it can do, get you to hello world is not necessarily the best way to go. I just assumed you're wrong. I assumed of the 17 ways AWS has to run containers, Corey just doesn't understand. And assumed of the 17 ways AWS has to run containers, Corey just doesn't understand. And so I went after it and I said, okay, let me see if I can find a way that solves his use case as I understand it through a quick tweet. And so I tried App
Starting point is 00:16:36 Runner. I saw that App Runner does not meet your needs because you have to somehow get your Docker image pushed up to a repo. AppRunner can take an image that's already been pushed up and deployed for you, or it can build from source. But neither of those were the way I understood your use case. Having used AppRunner before via the co-pilot CLI, it is the closest, as best I can tell, to achieving what I want. But also, let's be clear, I don't believe there's a free tier. There needs to be a load balancer in front of it. So you're starting with $15 a month for this thing, which is not the end of the world. Had I known at the beginning that all of this was going to be there, I would have just
Starting point is 00:17:10 signed up for a Bitly account and called it good. But here we are. I tried Copilot. Copilot is a great developer experience, but it also is just pulling together tons of people. I mean, just trying to do a Copilot service deploy, VPCs are being created and tons of people. I mean, just trying to do a co-pilot service deploy. VPCs are being created and tons of IAM roles are being created, code pipelines. There's just so much going on. I was like 20 minutes into it and I said, yeah, this is not fitting the bill
Starting point is 00:17:35 for what Corey was looking for. Plus, it doesn't solve the way I understood your use case, which is you don't want to worry about builds. You just want to push code and have new Docker images get built for you. Well, honestly, let's be clear here. Once it's up and running, I don't want to ever have to touch the silly thing again. And that so far has been the case. After I made up, I forked the repo
Starting point is 00:17:51 and made a couple of changes to it that I wanted to see. One of them was to render the entire thing case insensitive because I get that one wrong a lot. And the other is I wanted to change the permanent 301 redirect to a temporary 302 redirect because occasionally sponsors will want to change where it goes in the fullness of time, and that is just fine. But I want to be able to support that and not have to deal with old cache data. So getting that up and running was a bit of a challenge. But the way that it worked was following the instructions in the GitHub repo. The developer environment had spun up, and the Google's Cloud Shell was just spectacular. It prompted me for a few things, and it told me step-by-step what to do. This is the sort of thing I could have given a basically non-technical user, and they would have had success with it.
Starting point is 00:18:39 So I tried it as well. I said, well, okay, if I'm going to respond to Corey here and challenge him on this, I need to try Cloud Run. I had no experience with Cloud Run. I had a small example repo that loosely mapped what I understood you were trying to do. Within five minutes, I had Cloud Run working. And I was surprised. Anytime I pushed a new change within 45 seconds, the change was built and deployed. So here's my conclusion, Corey.
Starting point is 00:19:03 Google Cloud Run is great for your use case, and AWS doesn't have the perfect answer. But here's my conclusion, Corey. Google Cloud Run is great for your use case, and AWS doesn't have the perfect answer. But here's my challenge to you. I think that you just proved why there's 17 different ways to run containers on AWS. It's because there's that many different types of users that have different needs, and you just happen to be number 18 that hasn't gotten the right attention yet from AWS. Well, let's be clear. Like my gag about 17 ways to run containers on AWS was largely a joke. And it went around the internet three times.
Starting point is 00:19:33 So I wrote a list of them on the blog post of 17 ways to run containers in AWS. And people liked it. And then a few months later, I wrote 17 more ways to run containers on AWS, listing 17 additional services that all run containers. And my favorite email that I think I've ever received in feedback was from a salty AWS employee saying that one of them didn't really count because of some esoteric reason. And it turns out that when I'm trying to make a point of you have a sarcastic number of ways to run containers, pointing out that, well, one of them isn't quite valid, doesn't really shatter the argument. Let's be very clear here. So I appreciate the feedback. I always do. And it's partially snark, but there is an element of truth to it in that customers don't want to run containers by and large. That is what they do in service of a
Starting point is 00:20:23 business goal. And they want their application to run, which is in turn serves the business goal that continues to abstract out and to remain a going concern via the current position the company stakes out. In your case, it is saving lives. In my case, it is fixing horrifying AWS bills and making fun of Amazon at the same time. And in most other places, there are somewhat more prosaic answers to that. But containers are simply an implementation detail, to some extent, to my way of thinking, of getting to that point. An important one, let's be clear, I was very anti-container for a long time. I wrote a talk, Heresy in the Church of Docker, that then was accepted at ContainerCon.
Starting point is 00:21:01 It's like, oh boy, I'm not going to leave here alive. And the honest answer is, many years later, that Kubernetes solves almost all the criticisms that I had with the downside of, well, first you have to learn Kubernetes. And that continues to be mind-bogglingly complex from where I sit. There's a reason that I've registered kubernetestheeasyway.com and repointed it to ECS, Amazon's container service, that is not requiring you to cosplay as a cloud provider yourself. But even ECS has a number of challenges to it. I want to be very clear here.
Starting point is 00:21:30 There are no silver bullets in this. And you're completely correct in that I have a large, complex environment and the application is nuanced and I'm willing to invest a few weeks in setting up the baseline underlying infrastructure on AWS with some of these services. Ideally, not all of them at once, because that's something a lunatic would do, but getting them up and running. The other side of it, though, is that if I am trying to
Starting point is 00:21:53 evaluate a cloud provider's handling of containers and how this stuff works, the reason that everyone starts with a Hello World-style example is that it delivers, ideally, the mean time to dopamine. There's a reason that Hello World doesn't have 18 different dependencies across a bunch of different databases and message queues and all the other complicated parts of running a modern application, because you just want to see how it works out of the gate. And if getting that baseline empty container that just returns the string Hello World is that complicated and requires that much work. My takeaway is not that this user experience is going to get better once I make the application itself more complicated. So I find that off-putting. My approach has always been find something that I
Starting point is 00:22:37 can get the easy minimum viable thing up and running on. And then as I expand, know that you'll be there to catch me as my needs intensify and become ever more complex. But if I can't get the baseline thing up and running, I'm unlikely to be super enthused about continuing to beat my head against the wall. Like, well, I'll just make it more complex. That'll solve the problem because it often does not. That's my position. Yeah. I agree that that dopamine hit is valuable in getting attached to want to invest into whatever tech stack you're using. The challenge is your second part of that. Your second part is, will it grow with me and scale with me and support the complex edge cases that
Starting point is 00:23:15 I have? And the problem I've seen is a lot of organizations will start with something that's very easy to get started with and then quickly outgrow it and then come up with all sorts of weird Rube Goldberg-type solutions because they jumped all in before seeing. I've got kind of an example of that. I'm happy to announce that there's now 18 ways to run containers on AWS because your use case,
Starting point is 00:23:41 in the spirit of AWS customer obsession, I hear your use case. I've created an open source project that I want to share called Quintainers. Oh, no. And it solves. Yes. Quintainers is live and is ready for the world. So now we've got 18 ways to run containers. And if you have Corey's use case of, hey, here's my container, run it for me. Now we've got a one command that you can run to get things going for you. Now we've got a one command that you can run to get things going for you. I can share a link for you and you can check it out. This is a little- Oh, we're putting that in the show notes for sure. In fact, if you go to
Starting point is 00:24:13 snark.cloud slash containers, you'll find it. You'll find it. There you go. The idea here was this. There is a real use case that you had. And I looked and AWS does not have a out of the box, simple solution for you. I agree with that. And Google Cloud Run does. Well, the answer would have been from AWS. Well, then here we need to make that solution. And so that's what this was, was a way to demonstrate that it is a solvable problem. AWS has all the right primitives.
Starting point is 00:24:41 Just that use case hadn't been covered. So how does containers work? Real straightforward. It's a command line. It's an NPM tool. You just run an NPX container. It sets up a GitHub action role in your AWS account. It then creates a GitHub action workflow in your repo and then uses the container GitHub action, reusable action that creates the image for you every time you push to the branch, pushes it up to ECR, and then automatically pushes up that new version of the image to AppRunner for you. So now it's using AppRunner under the covers, but it's providing that nice
Starting point is 00:25:14 developer experience that you were getting out of Cloud Run. Look, is Quintainer really the right way to go with running containers? No, I'm not making that point at all. But the point is, it might very well be. Well, if you want to show a good containers? No, I'm not making that point at all. But the point is, it might very well be. Well, if you want to show a good Hello World experience, containers the best. Because within 30 seconds, your app is now set up to continuously deliver containers into AWS for your very specific use case.
Starting point is 00:25:41 The problem is, it's not going to grow for you. I mean, it was something I did over the weekend just for fun. It's not something that would ever be worthy of hitching up a real production workload to. So the point there is you can build frameworks and tools that are very good at getting that initial dopamine hit, but then are not going to be there for you necessarily as you mature and get more complex. And yet, I've tilted a couple of times at the windmill of integrating GitHub Actions
Starting point is 00:26:10 in anything remotely resembling a programmatic way with AWS's services, as far as instance roles go. Are you using permanent credentials for this, as stored secrets, or are you doing the OICD handoff? OIDC, so what happens is the tool creates the IAM role for you with the trust policy on GitHub's OIDC handoff? OIDC. So what happens is the tool creates the IAM role for you with the trust policy on GitHub's OIDC provider, sets all that up for you in your account, locks it down so that just your repo and your main branch is able to push or is able to assume the role. The role is set up just to allow deployments to AppRunner and ECR repository. And then that's it.
Starting point is 00:26:45 At that point, it's out of your way and you're just get push. And a couple minutes later, your updates are now running in AppRunner for you. This episode is sponsored in part by our friends at Vulture. Optimized cloud compute plans have landed at Vulture to deliver lightning-fast processing power, courtesy of third-gen AMD Epyc processors without the I.O. or hardware limitations of a traditional multi-tenant cloud server. Starting at just $28 a month, users can deploy general-purpose CPU, memory, or storage-optimized cloud instances in more than 20 locations across five continents. Without looking, I know that once again Antarctica has gotten the short end of the stick. Launch your
Starting point is 00:27:32 vulture optimized compute instance in 60 seconds or less on your choice of included operating systems or bring your own. It's time to ditch convoluted and unpredictable giant tech company billing practices and say goodbye to noisy neighbors and egregious egress forever. Vulture delivers the power of the cloud with none of the bloat. Screaming in the Cloud listeners can try Vulture for free today with $150 in credit when they visit getvulture.com slash morning. That's G-E-T-V-U-L-T-R dot com slash morning. My thanks to them for sponsoring this ridiculous podcast. Don't undersell what you've just built. This is something that, is this what I would use for a large-scale production deployment? Obviously not, but it has streamlined and made incredibly
Starting point is 00:28:22 accessible things that previously have been very complex for folks to get up and running. One of the most disturbing themes behind some of the feedback I got was at one point I said that, well, have you tried running a Docker container on Lambda? Because now it supports containers as a packaging format. And I said, no, because I spent a few weeks getting Lambda up and running when it first came out. And I basically been copying and pasting what I got working ever since, the way most of us do. And the response is, oh, that explains a lot,
Starting point is 00:28:49 with the implication being that I'm just a fool. Maybe, but let's be clear, I am never the only person in the room who doesn't know how to do something. I'm just loud about what I don't know. And the failure mode of a bad user experience is that a customer feels dumb. And that's not okay because this stuff is complicated.
Starting point is 00:29:10 And when a user has a bad time, it's a bug. I learned that in 2012 from Jordan Sissel, the creator of Logstash. He has been an inspiration to me for the last 10 years. And that's something I try to live by, that if a user has a bad time, something needs to get fixed. Maybe it's the tool itself. Maybe it's the documentation. Maybe it's the way the GitHub repos readme is structured in a way that just makes it accessible.
Starting point is 00:29:35 Because I am not a trailblazer in most things, and nor do I intend to be. I'm not the world's best engineer by a landslide. Just look at my code, and you'd argue the fact that I'm an engineer at all. But if it's bad and it works, how bad is it? It's sort of the other side of it. So my problem is that there needs to be a couple of things. Ignore for a second the aspect of making it the right answer to get something out of the door. The fact that I want to take this container and just run it, and you and I both reach for AppRunner as the default AWS service that does this, because I've been swimming in the AWS waters a while, and you're a freaking, I believe, 15 ways to run containers on mobile and 19 ways to run containers on non-mobile, which is just fascinating in its own right. And it's overwhelming, it's confusing, and it's not something that makes it abundantly clear what the golden path is. First, get it up and working.
Starting point is 00:30:43 Get it running. Then you can add nuance and flavor and the rest. And I think that's something that's gotten overlooked in our mad rush to pretend that we're all Google engineers circa 2012. I think people get stressed out when they try to run containers in AWS because they think, what is that golden path? You said golden path. And my advice to people is there is no golden path. And the great thing about AWS is they do continue to invest in the solutions they come up with.
Starting point is 00:31:10 I'm still bitter about Google Reader. As am I. Yeah, I built so much time getting my perfect set of RSS feeds and then I had to find somewhere else. With AWS, the different offerings that are available for running containers, those are there intentionally.
Starting point is 00:31:24 It's not by accident. They're there to solve specific problems. So the trick is finding what works best for you and don't feel like one is better than the other or is going to get more attention than others. And they each have different use cases. And I approach it this way. I've seen a couple of different people do some great flow charts. I think Forrest did one, Vlad did one on ways to make the decision on how to run your containers. And I break it down to three questions. I ask people, first of all, where are you going to run these workloads? If someone says it has to be in the data center, okay, cool. Then ECS Anywhere or EKS Anywhere, and we'll figure out if Kubernetes is needed.
Starting point is 00:32:01 If they need specific requirements, so that they say, no, we can run in the cloud, but we need privilege mode for containers, or we need EBS volumes, or we want really small container sizes, like less than a quarter of ECPU, or less than half a gig of RAM, or if you have custom log requirements, Fargate's not going to work for you,
Starting point is 00:32:17 so you're going to run on EC2. Otherwise, run it on Fargate. But that's the first question. Figure out where are you going to run your containers. That leads to the second question. What's your control plane? But those are different, sort of related, but different questions. And I only see six options there.
Starting point is 00:32:34 That's AppRunner for your control plane. Lightsail for your control plane. Rosa, if you're invested in OpenShift already. EKS, either if you have momentum in Kubernetes or you have a bunch of engineers that have a bunch of experience with Kubernetes. If you don't have either, don't choose it. Or ECS, the last option, Elastic Beanstalk, but let's leave that as a, if you're not currently investing in Elastic Beanstalk, don't start today. But I look at those as, okay, so first question, where am I going to run my containers? Second question, what do I want to use for my control plane? And there's different pros and cons of each of those.
Starting point is 00:33:08 And then the third question, how do I want to manage them? What tools do I want to use for managing deployment? All those other tools like Copilot or App2Container or Proton, those aren't my control plane. Those aren't where I run my containers. That's how I manage, deploy, and orchestrate all the different containers. So I look at it as those three questions. But I don't know, what do you think of that, Corey? I think you're onto something. I think that that is a terrific way of exploring that question. I would argue that setting up a framework like that one or very similar is what the AWS containers page should be. It's coming from the perspective of what is the neophyte customer experience. On some level, you almost need a
Starting point is 00:33:50 slider of choose your level of experience ranging from what's a container to I named my kid Kubernetes because I make terrible life decisions and anywhere in between. Sure. Yeah. Well, and I think that really dictates the control plane level. So for example, LightSail, where does LightSail fit? To me, the value of LightSail is the simplicity. I'm looking at a monthly pricing, seven bucks a month for a container. I don't know how this other stuff works, but I can think in terms of monthly pricing and it's tailored towards a console user. Someone just wants to click in, point to an image. That's a very specific user. There's thousands of customers that are very happy with that experience and they use it. AppRunner presents that scale to zero.
Starting point is 00:34:30 That's one of the big selling points I see with AppRunner. Likewise with Google Cloud Run. I've got that scale to zero. I can't do that with ECS or EKS or any of the other platforms. So if you've got something that has a ton of idle time, I'd really be looking at those. I would argue that, I think I did the math, Google Cloud Run is about 30% more expensive than AppRunner. Yeah, if you disregard the free tier, I think that to have it running persistently at all times throughout the month, the dropout cold starts would cost something like 40 some odd bucks a month or something like that. Don't quote me on it. Again, and to be clear, I wound up doing this very congratulatory and complimentary tweet about them on i think it was thursday and then they immediately apparently took one look at this
Starting point is 00:35:10 and said holy shit cory's saying nice things about us what do we do what do we do panic and the next morning they raised prices on a bunch of cloud offerings whoo that'll fix it like did did you miss the direction you're going on here? No, that's the exact opposite of what you should be doing. But here we are. Interestingly enough, to tie our two conversation threads together, when I look at an AWS bill, unless you're using Fargate, I can't tell whether you're using Kubernetes or not.
Starting point is 00:35:39 Because EKS is a small charge in almost every case for the control plane or Fargate under it. Everything else just manifests as EC2 spend. From the perspective of the cloud provider, if you're running a Kubernetes cluster, it is a single-tenant application that can have some very funky behaviors like cross-AZ chatter back and forth. Because there's no internal mechanism to say, talk to the free thing rather than the two cents a gigabyte thing. It winds up spinning up and down in a bunch of different ways. And the behavior patterns, because of how placement works, are not necessarily deterministic, depending upon workload. And that becomes something that people find odd when, okay, you look at our bill for a week, what could you say? Well, first question, are you running Kubernetes at all? And they're like, who invited these clouds? Understand, we're not prying into your workloads for a variety of excellent
Starting point is 00:36:26 legal and contractual reasons here. We are looking at how they behave and for specific workloads, once we have a conversation with the engineering team, yeah, we're going to dive in. But it is not at all intuitive from the outside to make any determination whether you're running containers or whether you're running VMs
Starting point is 00:36:42 that you just haven't done anything with in 20 years, or what exactly is going on. And that's just an artifact of the billing system. We ran into this challenge at Kaggle. We don't use EKS, we use ECS, but we have some shared clusters. Lots of EC2 spend, hard to figure out which team is creating the services that's running that up. We actually ended up creating a tool. We open sourced it, ECS Chargeback. And what it does is it looks at the CPU memory reservations for each task definition and then
Starting point is 00:37:12 prorates the overall charge of the ECS cluster and then creates metrics in Datadog to give us a breakdown of cost per ECS service. And it also measures what we like to refer to as waste, right? Because if you're reserving four gigs of memory, but your utilization never goes over two gigs, we're paying for that reservation, but you're under utilizing. So we're able to also show which services have the highest degree of waste, not just utilization. So it helps us go after it. But this is a hard problem. I'd be curious, how do you approach these shared ECS resources and slicing and dicing those bills? Everyone has a different approach to this. There is no unifiable correct answer. A previous show guest, Peter Hamilton over at Remind had done something very similar, open sourced a bunch of
Starting point is 00:37:58 these things. Understanding what your spend is, is important on this. And it comes down to getting at the actual business concern, because in some cases, effectively, dead reckoning's enough. You take a look at the cluster that is really hard to attribute because it's a shared service. Great. It is 5% of your bill. First pass, why don't we just agree that it is a third for service A, two-thirds for service B, and we'll call it mostly good at that point. That can be enough in a lot of cases. With scale, you're just sort of hand-waving over many millions of dollars a year there. How about we get into some more depth, and then you start instrumenting and reporting to something, be it CloudWatch, be it Datadog, be it something else, and understanding what the use case is. In some cases,
Starting point is 00:38:41 customers have broken apart shared clusters for that specific reason. I don't think that's necessarily the best approach from an engineering perspective. But again, this is not purely an engineering decision. It comes down to serving the business need. And if you're taking a partial credit on that cluster for a tax credit for R&D, for example, you want that position to be extraordinarily defensible and spending a few extra dollars to ensure that it is, is the right business decision. I mean, again, we're pure advisory. We advise customers on what we would do in their position, but people often mistake that to be, we're going to go for the lowest possible price, bad idea, or that we're going to wind up doing this from a purely engineering centric point of view. It's be aware that in almost every case, with some very notable weird exceptions, the AWS bill costs significantly less than the payroll expense that you have
Starting point is 00:39:33 of people working on the AWS environment in various ways. People are more expensive. So the idea of if, well, you can save a whole bunch of engineering effort by spending a bit more on your cloud bill. Yeah, let's go ahead and do that. Yeah, good point. The real mark of someone who's senior enough is their answer to almost any question is it depends. And I feel I've fallen into that trap as well. But I'd love to sit here and say, oh, it's really simple. You do X, Y, and Z.
Starting point is 00:39:58 Honestly, my answer, the simple answer is I think that we orchestrate a cyberbullying campaign against AWS through the AWS wishlist hashtag. We get people to harass their account managers with repeated requests for, hey, could you go ahead and dip that thing in? Give that a plus one for me, whatever internal system you're using. Just because this is a problem we're seeing more and more, given that it's an unbounded growth problem, we're going to see it more and more for the foreseeable future. So I wish I had a better answer for you, but yeah, that stuff's super hard is honest, but it's also not the most useful answer for most folks. Well, I'd love feedback from anyone from you or your team on that tool that we created. I can share a link after the fact.
Starting point is 00:40:38 ECS Chargeback is what we call it. Excellent. I will follow up with you separately on that. That is always worth diving into. I'm curious to see new and exciting approaches to this. Just be aware that we have an obnoxious talent sometimes for seeing these things. Well, what about asking about some weird corner or edge case that either invalidates the entire thing, or you're like, who on earth would ever have a problem like that? And the answer is always the next customer. For a bounded problem space of the AWS bill, every time I think I've seen it all, I just talk to one more customer.
Starting point is 00:41:11 Cool. In fact, the way that we approached your teardown in the restaurant is how we launched our first pass approach because there's value in something like that that is different than the value of a six to eight week long deep dive engagement to every nook and cranny. Yeah, for sure. It was valuable to us. Yeah. Having someone come in and just spend a day with your team, diving into it up one side and down the other, it seems like a weird thing. How much good could you possibly do in a day? And the answer in some cases is we had Honeycomb saying that in a couple of days of something like this, we wound up blowing 10% off their entire operating budget for the company. It led to an increased valuation. Liz Fong-Jones has said on multiple occasions that the company would not be what it was without our efforts on their bill, which is
Starting point is 00:41:55 just incredibly gratifying to hear. It's easy to get lost in the idea of, well, it's the AWS bill. It's just making big companies spend a little bit less to another big company. And that's not exactly saving the lives of K-12 students here. It's opening up opportunities. Yeah. It's about optimizing for the win for everyone. Because now AWS gets a lot more money from a honeycomb than they would if honeycomb had not continued on their trajectory. You can charge customers a lot right now, or you can charge them a little bit over time and grow with them in a partnership context. I've always opted for the second model rather than the first.
Starting point is 00:42:31 Right on. But here we are. I want to thank you for taking so much time out of, well, several days now to argue with me on Twitter, which is always appreciated, particularly when it's, you know, constructive. Thanks for that,
Starting point is 00:42:42 for helping me get my business partner to reinvent. Although then he got me that horrible puzzle of a thousand pieces for the Cloud Native Computing Foundation landscape. And now I don't ever want to see him again. So, you know, that happens. And of course, spending the time to write containers, which is going to be at snark.cloud slash containers
Starting point is 00:43:01 as soon as we're done with this recording. Then I'm going to kick the tires and send some pull requests. Right on. Yeah. Thanks for having me. I appreciate you starting the conversation. I would just conclude with, I think that, yes, there are a lot of ways to run containers
Starting point is 00:43:13 in AWS. Don't let it stress you out. They're there for intention. They're there by design. Understand them. I would also encourage people to go a little deeper, especially if you've got a significantly large workload. You've got to get your hands dirty. As a matter of fact, there's a hands-on lab that a company called Liatrio does.
Starting point is 00:43:33 They call it their Ignite Lab. It's a one-day, free, hands-on, you run legacy monolithic job applications on Kubernetes. It gives you firsthand experience on how to get all the way up into observability and doing things like canary deployments. That's a great, great lab. But you got to do something like that to really get your hands dirty and understand how these things work. So don't sweat it.
Starting point is 00:43:54 There's not one right way. There's a way that'll probably work best for each user. And just take the time and understand the ways to make sure you're applying the one that's going to give you the most runway for your workload. I will definitely dig into that myself. I think you're right. I think you have nailed a point that is, again, a nuanced one and challenging to put in a rage tweet. But these services don't exist in a vacuum. They're not there because, despite the joke, someone wants to get promoted. It's because there are customer needs that are going on that, and this is another way of meeting those needs. I think there could be better guidance, but I also
Starting point is 00:44:28 understand that there are a lot of nuanced perspectives here and that hell is someone else's workflow. And there's always value in broadening your perspective a bit on those things. If people want to learn more about you and how you see the world, where's the best place to find you? Probably on Twitter, twitter.com slash Nektos, N-E-K-T-O-S. That might be the first time Twitter has been described as the best place for anything, but thank you once again for your time. It is always appreciated.
Starting point is 00:44:55 Thanks, Corey. Casey Lee, CTO at Gaggle and AWS Container Hero, and apparently writing code in anger to invalidate my points, which is always appreciated. Please do more of that, folks. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice or the YouTube comments, which is always a great place to go reading. Whereas if you've hated this podcast, please leave a five-star review in the usual
Starting point is 00:45:23 places and an angry comment telling me that I'm completely wrong and then launching your own open source tool to point out exactly what I've gotten wrong this time. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duck Bill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duck Bill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started. This has been a HumblePod production. Stay humble.
