Screaming in the Cloud - Operating in the Kubernetes Cloud on Amazon EKS with Eswar Bala

Episode Date: May 5, 2023

Eswar Bala, Director of Amazon EKS at AWS, joins Corey on Screaming in the Cloud to discuss how and why AWS built a Kubernetes solution, and what customers are looking for out of Amazon EKS. Eswar reveals the concerns he sees from customers about the cost of Kubernetes, as well as the reasons customers adopt EKS over ECS. Eswar gives his reasoning on why he feels Kubernetes is here to stay and not just hype, as well as how AWS is working to reduce the complexity of Kubernetes. Corey and Eswar also explore the competitive landscape of Amazon EKS, and the new product offering from Amazon called Karpenter.

About Eswar
Eswar Bala is a Director of Engineering at Amazon and is responsible for Engineering, Operations, and Product strategy for Amazon Elastic Kubernetes Service (EKS). Eswar leads the Amazon EKS and EKS Anywhere teams that build, operate, and contribute to the services customers and partners use to deploy and operate Kubernetes and Kubernetes applications securely and at scale. With a 20+ year career in software spanning multimedia, networking, and container domains, he has built greenfield teams and launched new products multiple times.

Links Referenced:
Amazon EKS: https://aws.amazon.com/eks/
kubernetesthemuchharderway.com: https://kubernetesthemuchharderway.com
kubernetestheeasyway.com: https://kubernetestheeasyway.com
EKS documentation: https://docs.aws.amazon.com/eks/
EKS newsletter: https://eks.news/
EKS GitHub: https://github.com/aws/eks-distro

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. It's easy to f*** up on AWS, especially when you're managing your cloud environment on your own.
Starting point is 00:00:36 Mission Cloud unf***s your apps and servers. Whatever you need in AWS, they can do it. Head to missioncloud.com for the AWS expertise you need. Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's promoted guest episode is brought to us by our friends at Amazon. Now, Amazon is many things. They sell underpants, they sell books, they sell books about underpants and underpants featuring pictures of books, but they also have a minor cloud computing problem. In fact, some people would call them a cloud computing company with a gift shop that's attached.
Starting point is 00:01:11 Now, the problem with wanting to work at a cloud company is that their interviews are super challenging to pass. If you want to work there but can't pass the technical interview, for a long time, the way to solve that has been, ah, we're going to run Kubernetes, so we get to LARP as if we worked at a cloud company, but don't. Eswar Bala is a Director of Engineering for Amazon EKS and is going to basically suffer my slings and arrows about one of the most complicated and, I would say, overwrought best practices that we're seeing industry-wide. Eswar, thank you for agreeing
Starting point is 00:01:45 to subject yourself to this nonsense. Hey, Corey, thanks for having me here. So I'm a little bit unfair to Kubernetes because I wanted to make fun of it and ignore it, but then I started seeing it in every company that I deal with in one form or another. So yes, I can still sit here and shake my fist at the tide, but it's turned into old man yells at cloud, which I'm thrilled to embrace, but everyone's using it. So EKS is approaching the five-year mark since it was initially launched. What is EKS other than Amazon's flavor of Kubernetes? You know, the best way I can define EKS is EKS is just Kubernetes, not Amazon's version of Kubernetes. It's just Kubernetes that we get from the community and offer it to customers to make it easier for them to consume.
Starting point is 00:02:34 So EKS, I've been with EKS from the very beginning, when we thought about offering a managed Kubernetes service in 2017. And at that point, the goal was to bring Kubernetes to enterprise customers. So we had many customers telling us that they wanted us to make their lives easier by offering a managed version of Kubernetes, which they were actually beginning to adopt at that time, right?
Starting point is 00:03:02 So my goal was to figure out what does that service look like and which customer base should we target the service towards. Kelsey Hightower has a fantastic learning tool out there in a GitHub repo called Kubernetes the hard way, where he talks you through
Starting point is 00:03:17 building the entire thing start to finish. I wound up forking it and doing that on top of AWS. And you can find that at kubernetesthemuchharderway.com. And that was fun. And I went through the process and my response at the end was, why on earth would anyone ever do this more than once? And we got that sorted out, but now it's, customers aren't really running these things from scratch. It's like the Linux from scratch project,
Starting point is 00:03:41 great learning tool, probably don't run this in production in the same way that you might otherwise, because there are better ways to solve for the problems that you will have to solve yourself when you're building these things from scratch. So as I look across the ecosystem, it feels like EKS stands in the place of the heavy undifferentiated lifting of running the Kubernetes control plane so customers functionally don't have to. Is that an effective summation of this? That is precisely right. And I'm glad you mentioned Kubernetes the hard way. I was a big fan of that when it came out. And anyone who did that tutorial, and also your tutorial, Kubernetes the much harder way, would walk away thinking, why would I pick this
Starting point is 00:04:21 technology when it's super complicated to set up? But then you see that customers love Kubernetes, and you see that reflected in the adoption, even in the 2016, 2017 timeframe. And the reason is it made life easier for application developers in terms of offering web services that they wanted to offer to their customer base. And because of all the features that Kubernetes brought along: application lifecycle management, service discovery, and then it evolved to support various application architectures, right? In terms of stateless services, stateful applications, and even daemon sets, right?
Starting point is 00:04:58 Like for running your logging and metrics agents. And these are powerful features at the end of the day, and that's what drove Kubernetes. And because it's super hard to get going to begin with, and then to operate, the day-two operator experience is super complicated. The day-one experience is super hard. And the day-two experience of, okay, now I'm running it and something isn't working the way it used to, where do I start, has been just tremendously overwrought and, frankly, more than intimidating. Exactly, right? And that exactly was our opportunity when we started in 2017.
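For a sense of what that opportunity turned into, here is a minimal sketch, not taken from the episode, of what "managed control plane" means in practice: a single API call through boto3 stands in for building API servers and etcd by hand. The cluster name, IAM role ARN, and subnet IDs below are placeholders.

```python
# Hypothetical sketch: one CreateCluster call in place of "Kubernetes the hard way".
# The cluster name, role ARN, and subnet IDs below are placeholders.
import boto3

eks = boto3.client("eks", region_name="us-east-1")

cluster = eks.create_cluster(
    name="demo-cluster",
    roleArn="arn:aws:iam::111122223333:role/eksClusterRole",  # placeholder IAM role
    resourcesVpcConfig={
        "subnetIds": ["subnet-0abc1234", "subnet-0def5678"],  # placeholder subnets
    },
)

# EKS stands up and operates the highly available control plane behind the scenes;
# the caller just waits for the status to move from CREATING to ACTIVE.
print(cluster["cluster"]["status"])
```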
Starting point is 00:05:34 And when we started, there was a question on, okay, should we really build a service when we have an existing service like ECS in place? And by the way, I did work in ECS before I started working in EKS from the beginning. So the answer then was, it was about giving what customers wanted. And there's space for many container orchestration systems, right? ECS was the AWS service at that point in time. And our thinking was, how do we give customers what they wanted? If they wanted a Kubernetes solution, let's go build that. But we built it in a way that we removed the undifferentiated heavy lifting of managing Kubernetes. One of the weird things that I find is that everyone's using Kubernetes, but I don't see it in the way that I contextualize the AWS universe, which, of course, is on the bill. That's right.
Starting point is 00:06:17 If you don't charge for something in AWS land, and preferably a fair bit, I don't tend to know it exists. Like, what's an IAM and what might that possibly do? Always a reassuring thing to hear from someone who's often called an expert in the space. But, you know, if it doesn't cost money, why do I pay attention to it? The control plane is what EKS charges for unless you're running a bunch of Fargate-managed pods and containers to wind up handling those things. So it mostly just shows up as an addenda to the actual big meaty portions of the bill. It just looks like a bunch of EC2 instances with some really weird behavior patterns, particularly with regard to auto-scaling and crosstalk between all of those
Starting point is 00:07:00 various nodes. So it's a little bit of a murder mystery figuring out, so, what's going on in this environment? Do you folks use containers at all? And the entire Kubernetes shop is looking at me like, are you simple? No, it's just, I tend to disregard the lies that customers say, mostly to themselves, because everyone has this idea of what's going on in their environment, but the bill speaks. It's always been a little bit of an investigation to get to the bottom of anything that involves Kubernetes at significant points of scale. Yeah, you're right. If you look at EKS, we started with managing the control plane, to begin with. And managing the control plane is a drop in the bucket when you actually look at the cost in terms of operating a Kubernetes cluster or running a Kubernetes cluster.
Starting point is 00:07:44 When we look at how our customers use and where they spend most of their cost, it's about where their applications run. It's actually the Kubernetes data plane, and the amount of compute and memory that the applications end up using ends up driving 90% of the cost. And beyond that is the storage,
Starting point is 00:08:01 beyond that is the networking cost, right? And then after that is the actual control plane cost. So the problem right now is figuring out how do we optimize our costs for the application to run on. On some level, it requires a little bit of understanding of what's going on under the hood. There have been a number of cost optimization efforts that have been made in the Kubernetes space, but they tend to focus around stuff that I find relatively, well, I'll call it banal because it basically is.
Starting point is 00:08:30 You're looking at this, the idea of, okay, what size instances should you be running and how well can you fill them and make sure that all the resources per node wind up being taken advantage of. But that's also something that, I guess, from my perspective, isn't really the interesting architectural point of view. Whether or not you're running a bunch of small instances or a few big ones or some combination of the two, that doesn't really move the needle on any architectural shift. Whereas ingesting a petabyte a month of data and passing 50 petabytes back and forth between availability zones, that's where it starts to get really interesting as far as tracking that stuff down.
Starting point is 00:09:12 But what I don't see is a whole lot of energy or effort being put into that. And I mean, industry-wide, to be clear, I'm not attempting to call out Amazon specifically on this. That's not the direction I'm taking this in. For once, I know. I'm still me. But it seems to be just an industry-wide issue where zone affinity for Kubernetes has been a very low priority item, even on project roadmaps on the Kubernetes project. Yeah, Kubernetes does provide the ability for customers to restrict their workloads within a particular AZ, right?
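For readers who want to see what one of those pod spec constraints looks like, here is a minimal sketch using the standard topology.kubernetes.io/zone node label with the Kubernetes Python client; the pod name, image, and zone are placeholders, and a plain YAML manifest with the same nodeSelector would work just as well.

```python
# Minimal sketch: pin a pod to one availability zone with a nodeSelector on the
# well-known topology.kubernetes.io/zone label. Names and the zone are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="zone-pinned-app"),
    spec=client.V1PodSpec(
        node_selector={"topology.kubernetes.io/zone": "us-east-1a"},
        containers=[client.V1Container(name="app", image="nginx:1.25")],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```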
Starting point is 00:09:39 Like there are constraints that you can place on your pod specs that end up driving applications towards a particular AZ if they want, right? You're right, it's still left to the customers to configure. Just because there's a configuration available doesn't mean the customers use it. If it's not defaulted, most of the time it's not picked up. So the opportunity for us is to offer the ability to not only provide visibility by means of reporting, using tools like Kubecost and AWS Cost Explorer, but also provide insights and recommendations
Starting point is 00:10:12 on what customers can do. I agree that there's a gap today, for example, in EKS in terms of that; we're slowly closing that gap, and it's something that we are actively exploring. How do we provide insights across all the resources customers end up using from within a cluster? That includes not just compute and memory, but also storage and networking, right? And that's where we are actually moving towards. That's part of the weird problem I found is that, on some level, you get to play almost
Starting point is 00:10:40 data center archaeologists when you start exploring what's going on in these environments. I found one of the only reliable ways to get answers to some of this stuff has been oral tradition of, okay, this Kubernetes cluster just starts hurling massive data quantities at 3 a.m. every day. What's causing that?
Starting point is 00:10:55 And it leads to, I don't know, if you talk to the data science team, like, oh, you have a data science team, a common AWS bill mistake. And exploring down that particular path sometimes pays dividends. But there's no holistic way to solve that globally today. I'm optimistic about tomorrow, though. Correct. And that's where we're spending our efforts right now. For example, we recently launched our partnership with Kubecost, and Kubecost is now available as an add-on from
Starting point is 00:11:20 the Marketplace that you can easily install and provision on EKS clusters, for example. And that is a start. And Kubecost is amazing in terms of features, in terms of the insights it offers, right? It looks into compute and memory and the optimizations, and the insights it provides you. And we're also working with the AWS cost and usage reporting team to provide a native AWS solution for the cost reporting
Starting point is 00:11:44 and the insights aspect as well in EKS. And it's something that we are going to be working really closely to solve the networking gaps in the near future. What are you seeing as far as customer concerns go with regard to cost and Kubernetes? I see some things, but let's be very clear here. I have a certain subset of the market
Starting point is 00:12:03 that I spend an inordinate amount of time speaking to, and I always worry that what I'm seeing is not holistically what's going on in the broader market. What are you seeing customers concerned about? Well, let's start from the fundamentals here, right? Customers really want to get to market faster, whatever services and applications that they want to offer. And they want to have it cheaper to operate. And if they're adopting EKS, they want it cheaper to operate in Kubernetes in the cloud. They also want high performance. They also want scalability and they want security and isolation. There's so many parameters that they have to deal with before
Starting point is 00:12:37 they put their service in the market and continue to operate. And there's a fundamental tension here, right? Like they want cost efficiency, but they also want to be available in the market quicker and they want the performance and availability. Developers have uptime SLOs and SLAs to consider and they want the maximum possible resources that they want. And on the other side, you've got financial leaders and the business leaders
Starting point is 00:13:01 who want to look at the spending and worry about like, okay, are we allocating our capital wisely? And are we allocating where it makes sense? And are we doing it in a manner that there's very little wastage and aligned with our customer use, for example? And this is where the actual problems arise from at the end of it. I want to be very clear that for a long time, one of the most expensive parts about running Kubernetes has not been the infrastructure itself. It's been the people to run this responsibly, where it's the day two, day three experience,
Starting point is 00:13:33 where for an awful lot of companies, like, oh, we're moving to Kubernetes because, I don't know, we read it in InFlight magazine or something, and all the cool kids are doing it, which honestly, during the pandemic is why suddenly everyone started making better IT choices because their execs were not being exposed to airport ads. I digress. The point, though, is that as the customers are figuring this stuff out and playing around with it, it's not sustainable that every company that wants to run Kubernetes can afford a crack SRE team that is individually incredibly expensive and collectively staggeringly so,
Starting point is 00:14:06 that seems to be the real cost: the complexity tied to it. And EKS has been great in that it abstracts an awful lot of the control plane complexity away. But I still can't shake the feeling that running Kubernetes is mind-bogglingly complicated. Please argue with me and tell me I'm wrong. No, you're right. It's still complicated. And it's a journey towards reducing the complexity. When we launched EKS, we launched only with managing the control plane, to begin with. And that's the way we started. But customers had the complexity of managing the worker nodes.
Starting point is 00:14:38 And then we evolved to manage the Kubernetes worker nodes. In terms of two products, we've got managed node groups and Fargate. And then customers moved on to installing more agents in their clusters before they actually install their business applications. Things like Cluster Autoscaler, things like Metrics Server. Critical components that they've come to rely on, but that don't drive their business logic directly. They are supporting aspects of driving core business logic. And that's how we evolved
Starting point is 00:15:05 into managing the add-ons to make life easier for our customers. And it's a journey where we continue to reduce the complexity of making it easier for customers to adopt Kubernetes. And once you cross that chasm, and we are still trying to cross it, once you cross it, you have the problem of, okay, so adopting Kubernetes is easy. Now we have to operate it, which means that we need to provide better reporting tools, not just for cost,
Starting point is 00:15:34 but also for operations. How easy it is for customers to get to the application-level metrics and how easy it is for customers to troubleshoot issues, how easy for customers to actually upgrade
Starting point is 00:15:48 to newer versions of Kubernetes. All of these challenges come out beyond day one, right? And those are initiatives that we have in flight to make it easier for customers as well. So one of the things I see when I start going deep
Starting point is 00:16:02 into the Kubernetes ecosystem is, well, Kubernetes will go ahead and run the containers for me, but now I need to know what's going on in various areas around it. One of the big booms in the observability space in many cases has come from the fact that you now need to diagnose something in a container you can't log into and incidentally stopped existing 20 minutes before you got the alert about the issue. So you'd better hope your telemetry is up to snuff. Now, yes, that does act as a bit of a complexity burden, but on the other side of it, we don't have to worry about things like failed hard drives taking systems down anymore, that it has successfully been abstracted away by Kubernetes or, you know, your cloud provider,
Starting point is 00:16:45 but that's neither here nor there these days. What are you seeing as far as effectively the sidecar pattern, for example, of, oh, you have too many containers and need to manage them? Have you considered running more containers? Sounds like something a container salesman might say. So running containers demands that you have really solid observability tooling, things that you're able to troubleshoot successfully, debug without the need to log into the containers itself.
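One hedged illustration of what "publish your metrics instead of shelling into the container" tends to look like in practice, using the prometheus_client library; the library choice, metric name, and port are my assumptions rather than anything named in the episode.

```python
# Sketch: expose application metrics over HTTP so Prometheus (or a managed Prometheus
# service) can scrape them, instead of exec-ing into the container to debug.
# The metric name, port, and fake workload below are placeholders.
import random
import time

from prometheus_client import Counter, start_http_server

requests_handled = Counter(
    "demo_requests_handled_total",
    "Requests processed by this hypothetical worker",
)

if __name__ == "__main__":
    start_http_server(8000)          # serves a /metrics endpoint for the scraper
    while True:
        requests_handled.inc()       # stand-in for real work
        time.sleep(random.random())
```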
Starting point is 00:17:13 And in fact, logging into a running container is an anti-pattern, right? You really don't want to have the ability to SSH into a particular container, for example. And being successful at it demands that you publish your metrics and you publish your logs. All of these are things that a developer needs to worry about today in order to adopt containers, for example. And it's on the service providers to actually make it easier for the developers not to worry about these. And all of these are available automatically when you adopt a Kubernetes service, for example. In EKS, we are working with our managed Prometheus service teams inside Amazon, right, and also the CloudWatch teams, to easily enable metrics and logging for customers without having
Starting point is 00:17:59 to do a lot of heavy lifting. Let's talk a little bit about the competitive landscape here. One of my biggest competitors in optimizing AWS bills is Microsoft Excel. Specifically, people are going to go ahead and run it themselves because hiring someone who's really good at this, that sounds expensive. We can screw it up for half the cost, which is great. It seems to me that one of your biggest competitors is people running their own control plane on some level. I don't tend to accept the narrative that, oh, EKS is expensive. That winds up being, what, $35 or $70 or whatever it is per control plane per cluster on a monthly basis.
Starting point is 00:18:37 Okay, yes, that's expensive if you're trying to stay completely within a free tier, perhaps. But if you're running anything that's even slightly revenue generating or for a for-profit company, you will spend far more than that just on people's time. I have no problems for once with the EKS pricing model start to finish. Good work on that. You have successfully nailed it. But are you seeing significant pushback from the industry of, nope, we're going to run our own Kubernetes management system instead because we enjoy pain, corporately speaking. Actually, we're in a good spot there, right?
Starting point is 00:19:11 Like at this point, customers who choose to run Kubernetes on AWS by themselves and not adopt EKS just fall into one main category or two main categories. Number one, they have an existing technical stack built on running Kubernetes themselves, and they'd rather maintain that and not move into EKS. Or they demand certain custom configurations of the Kubernetes control plane that EKS doesn't support. And those are the only two reasons why we see customers not moving into EKS and preferring to run their own Kubernetes clusters on AWS.
Starting point is 00:19:47 It really does seem on some level like there's going to be a, I don't want to say reckoning because that makes it sound vaguely ominous and that's not the direction that I intend for things to go in. But there has to be some form of collapsing of the complexity that is inherent to all of this,
Starting point is 00:20:05 because the entire industry has always done that. An analogy that I fall back on, because I've seen this enough times to have the scars to show for it, is that in the 90s, running a web server took about a week of spare time and an in-depth knowledge of GCC compiler flags. And then it evolved to, ah, I could just unzip a tarball of precompiled stuff, and then RPM or deb became a thing, and then yum or something else, or I guess apt over in the Debian land, to wind up wrapping around that. And then you had things like Puppet and Chef installs, and now it's docker run. And today it's a checkbox in the S3 console that proceeds to yell at you because you're
Starting point is 00:20:41 making a website public, but that's neither here nor there. Things don't get harder with time, but I've been surprised by how I haven't yet seen that sort of geometric complexity collapsing around Kubernetes to make it easier to work with. Is that coming,
Starting point is 00:20:59 or are we going to have to wait for the next cycle of things? Let me think. I actually don't have a good answer to that, Corey. That's good, at least, because if you did, I'd worry that I was just missing something obvious, that that's kind of the entire reason I asked. Like, oh, good, I get to talk to smart people and see what they're picking up on that I'm absolutely missing. I was hoping you had an answer, but I guess it's cold comfort that you don't have one off the top of your head.
Starting point is 00:21:22 But man, is it confusing. Yeah. So there are some discussions in the community out there, right? Like, is Kubernetes the right layer to interact? And there are some tooling that's built on top of Kubernetes. For example, Knative that tries to provide a serverless layer on top of Kubernetes, for example. There are also attempts at abstracting Kubernetes completely and providing tooling that just completely removes any sort of Kubernetes API out of the picture.
Starting point is 00:21:49 And maybe a specific CI CD based solution that takes it from the source and deploys a service without even showing you that there's Kubernetes underneath. All of these are evolutions that are being tested out there in the community. Time will tell whether these end up sticking. But what's clear here is the gravity around Kubernetes. All sorts of tooling that gets built on top of Kubernetes,
Starting point is 00:22:15 all the operators, all sorts of open-source initiatives that are built to run on Kubernetes, for example, Spark, for example, Cassandra. So many of these big, large-scale open-source solutions are now built to run really well on Kubernetes. And that is the gravity that's pushing Kubernetes at this point. I'm curious to get your take on one other, I would consider interestingly competitive spaces. Now, because I have a domain problem, if you go to kubernetestheeasyway.com, you'll wind up on the ECS marketing page. That's right. The worst competition in the world,
Starting point is 00:22:52 the people who work down the hall from you. If someone's considering using ECS, Elastic Container Service, versus EKS, Elastic Kubernetes Service, what is the deciding factor when a customer is making that determination? To be clear, I'm not convinced there's a right or wrong answer, but I am curious to get your take given that you have a vested interest, but also presumably don't want to talk complete smack about your colleagues, but feel free to surprise me. Hey, I love ECS, by the way. Like I said, I started my life in AWS in ECS. So, look, ECS is a hugely successful container orchestration service. I know
Starting point is 00:23:30 we talk a lot about Kubernetes. I know there's a lot of discussions around Kubernetes, but I want to make it a point that ECS is a hugely successful service. Now, what determines how customers go to? If customers are... If the customer's tech stack is entirely on AWS,
Starting point is 00:23:46 right, they use a lot of AWS services, and they want an easy way to get started in the container world that has really tight integration with other AWS services without them having to configure a lot, ECS is the way, right? And customers have actually seen terrific success adopting ECS for that particular use case. Whereas EKS customers, they start with, okay, I want an open source solution. I really love Kubernetes. Or I have a tooling that I really like in the open source land that really works well with Kubernetes. I'm going to go that way. And those set of customers end up picking EKS. I feel like on some level, Kubernetes has become almost the default API across a wide variety of environments. AWS, obviously, but on-prem, other providers. It
Starting point is 00:24:33 seems like even the traditional VPS companies out there that rent a server in the cloud somewhere are all also offering, oh, and we have a Kubernetes service as well. I wound up backing a Kickstarter project that runs a Kubernetes cluster with a shared backplane across a variety of Raspberry Pis, for example. And it seems to be almost everywhere you look. Do you think that there's some validity to that approach of effectively whatever it is that we're going to wind up running in the future, it's going to be done on top of Kubernetes? Or do you think that that's mostly hype-driven these days? It's definitely not hype.
Starting point is 00:25:09 We see the proof in the kind of adoption we see. It's becoming the de facto container orchestration API. And with all the open-source tooling that's continuing to be built on top of Kubernetes, and the CNCF tooling ecosystem that's spawned to support Kubernetes adoption, all of this is solid proof that Kubernetes is here to stay
Starting point is 00:25:31 and is a really strong, powerful API for customers to adopt. So four years ago, I had a prediction on Twitter and I said, in five years, nobody will care about Kubernetes. And it was in February, I believe. And every year I wind up updating and incrementing the link to it, like four years to go, three
Starting point is 00:25:49 years to go. And I believe it expires next year. And I have to say, I didn't really expect when I made that prediction for it to outlive Twitter. But yet here we are, which is neither here nor there. But I'm curious to get your take on this. But before I wind up just letting you savage the naive interpretation of that, my impression has been that it will not be that Kubernetes has gone away. That is ridiculous. It is clearly in enough places that even if they decided to rip it out now, it would take them 10 years. But rather that it's going to slip below the surface level of awareness. Once upon a time, there was a whole bunch of energy and drama and debate around the Linux virtual memory management subsystem. And today, there's like a dozen people
Starting point is 00:26:30 on the planet who really have to care about that. But for the rest of us, it doesn't matter anymore. We are so far past having to care about that, having any meaningful impact on our day-to-day work, that it's just, it's the part of the iceberg that's below the waterline. I think that's where Kubernetes is heading. Do you agree or disagree? And what do you think about the timeline? I agree with you. That's the perfect analogy.
Starting point is 00:26:54 It's going to go the way of Linux, right? It's here to stay. It's just going to get abstracted out if any of the abstraction efforts are going to stick around. And that's where we are testing the waters there. There are many, many open source initiatives there trying to abstract Kubernetes. All of these are yet to gain ground,
Starting point is 00:27:12 but there are some reasonable efforts being made. And if they are successful, they just end up being a layer on top of Kubernetes. Many of the customers, many of the developers don't have to worry about Kubernetes at that point. But a certain subset of us in the tech world will need to deal with Kubernetes. And most likely, teams like mine that end up managing and operating their Kubernetes clusters. So one last question I have for you is that if there's one thing that AWS loves,
Starting point is 00:27:39 it's misspelling things. And you have an open source offering called Karpenter, spelled with a K, that is an extension of that tradition. What does Karpenter do and why would someone use it? Thank you for that. Karpenter is one of my favorite launches in the last year. Presumably because you're terrible at the spelling bee back when you were a kid, but please tell me more. So Karpenter is an open-source, flexible, and high-performance cluster autoscaling solution. So basically, when your cluster needs more capacity to support your workloads,
Starting point is 00:28:16 Karpenter automatically scales the capacity as needed. For people that know the Kubernetes space well, there's an existing component called Cluster Autoscaler that fills this space today. And it's our take on, okay, so what if we could reimagine the capacity management solution available in Kubernetes? And can we do something better,
Starting point is 00:28:39 especially for cases where we expect terrific performance at scale, to enable cost efficiency and optimization use cases for our customers, and, most importantly, to provide a way for customers not to pre-plan a lot of capacity to begin with. This is something we see a lot in the sense of very bursty workloads where, okay, you've got a steady state load, cool, buy a bunch of savings plans, get things set up the way you want them and call it a day. But when it's bursty, there are challenges with it. Folks love using spot, but in the event of a sudden capacity shortfall, the question
Starting point is 00:29:16 is, can we spin up capacity to backfill it within those two minutes of warning that we get? And if the answer is no, then it becomes a bit of a non-starter. Customers have had to build an awful lot of those things around EC2 instances that handle a lot of that logic for them in ways that are tuned specifically for their use cases. I'm encouraged to see there's a Kubernetes story around this that starts to remove some of that challenge from the customer side. Yeah. So that's where the complexity comes in, right? Like many customers, for steady state, they know what their capacity requirements are.
Starting point is 00:29:50 They set up that capacity. They can also reason out what is the effective capacity needed for good utilization for economical reasons, and they can actually pre-plan that and set it up. But once burstiness comes in, which inevitably does it at popular applications, customers worry about, okay, am I going to get the capacity that I need in times that I need to be able to service my customers? And am I confident at it? If I'm not confident, I'm going to actually allocate capacity beforehand, assuming that I'm going to actually get the burst that I need,
Starting point is 00:30:30 which means you're paying for resources that you're not using at the moment, and the burstiness might happen. And then you are on the hook to actually reduce your capacity footprint once the peak subsides at the end of it. And this is a challenging situation. And this is one of the use cases that we targeted Karpenter towards. I find that the idea that you're open sourcing this is fascinating because of two reasons. One, it does show a willingness to engage with a community that, again, it's difficult. When you're a big company, people love to wind up taking issue with almost anything that you do. But for another, it also puts it out in the open on some level where, especially when you're talking about cost optimization and decisions that affect cost, it's all out in public. So people can look at this and think, wait a minute, it's not,
Starting point is 00:31:14 what is this line of code that means if it's toward the end of the month, crank it up because we might need to hit our numbers. Like there's nothing like that in there. At least I'm assuming I'm trusting that other people have read this code because, honestly, that seems like a job for people who are better at that than I am. But that does tend to breed a certain element of trust. Right. It's one of the first things that we thought about when we said, okay, so we have some ideas here to actually improve the capacity management solution for Kubernetes. Okay, should we do it out in the open? And the answer was a resounding yes, right? I think there's a good story here that actually enables not just AWS to offer these ideas out there,
Starting point is 00:31:54 right? And we want to bring it to all sorts of Kubernetes customers. And one of the first things we did is to architecturally figure out all the core business logic of Karpenter, which is, okay, how to schedule better, how quickly to scale, what are the best instance types to pick for this workload. All of that business logic was abstracted out from the actual cloud provider implementation. And the cloud provider implementation is super simple. It's just creating instances, deleting instances, and describing instances. And it's something that we baked in from the get-go, so it's easier for other cloud providers to come in and add their support to it.
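For readers who want to see what handing capacity decisions to Karpenter roughly looks like, here is a sketch that registers a minimal Provisioner using the karpenter.sh/v1alpha5 API that was current around the time of this episode; the requirements, limits, and provider reference shown are illustrative assumptions, and newer Karpenter releases use a NodePool API instead.

```python
# Sketch: register a minimal Karpenter Provisioner so the cluster can add capacity on
# demand instead of pre-planning node groups. Field values below are illustrative;
# check the Karpenter docs for the API version your release actually uses.
from kubernetes import client, config

config.load_kube_config()

provisioner = {
    "apiVersion": "karpenter.sh/v1alpha5",
    "kind": "Provisioner",
    "metadata": {"name": "default"},
    "spec": {
        "requirements": [
            {"key": "karpenter.sh/capacity-type",       # allow spot and on-demand
             "operator": "In", "values": ["spot", "on-demand"]},
        ],
        "limits": {"resources": {"cpu": "1000"}},        # cap total provisioned CPU
        "ttlSecondsAfterEmpty": 30,                      # scale empty nodes back down
        "providerRef": {"name": "default"},              # an AWSNodeTemplate with subnets/AMIs
    },
}

client.CustomObjectsApi().create_cluster_custom_object(
    group="karpenter.sh", version="v1alpha5", plural="provisioners", body=provisioner,
)
```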
Starting point is 00:32:32 And we as a community actually can take these ideas forward in a much faster way than just AWS doing it. I really want to thank you for taking the time to speak with me today about all these things. If people want to learn more, where's the best place for them to find you? The best place to learn about EKS, right, as EKS evolves, is our documentation. We have an EKS newsletter that you can go subscribe to. And you can also find us on GitHub, where we share our product roadmap. Those are great places to learn about how EKS is evolving and also to share your feedback.
Starting point is 00:33:07 in the AWS console where we live waiting for you to stumble upon us, which, yeah, no, it's good to have a lot of different places for people to engage with you. And we'll put links to that, of course, in the show notes. Thank you so much for being so generous
Starting point is 00:33:21 with your time. I appreciate it. Corey, really appreciate you having me. Eswar Bala, Director of Engineering for Amazon EKS. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review
Starting point is 00:33:36 on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice, telling me why when it comes to tracking Kubernetes costs, Microsoft Excel is in fact a superior experience. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS.
Starting point is 00:34:10 We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started.
