PurePerformance - A Minimalistic Approach to Kubernetes with Kelsey Hightower

Episode Date: March 2, 2020

We're back to our regularly scheduled show! Kelsey Hightower (@kelseyhightower) has worn many hats, as it says in his bio, but we also learned from him that he probably doesn't have that many hats at home, as he has been living a minimalist life over the past couple of years. A philosophy, as we learn in this podcast, that also goes well when it comes to building your next platform on Kubernetes.

In this podcast we learn about the do's and don'ts, how you should plan and test for k8s upgrades, which tradeoffs you have to make when it comes to performance, how to think about developer productivity on k8s, and why it is important to read up on security as it relates to the software we build, deploy and run on our k8s clusters.

Thank you Kelsey for supporting our community with your time and expertise. Hope to have you back in the future!

https://twitter.com/kelseyhightower

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. We are Pure Performance. It's definitely been a while for our audience. We had our last official podcast before Perform with Abigail Wilson, no relation. And then we had all those Perform podcasts, and I want to thank everybody; I know those are a lot more Dynatrace-centric, more about our user conference, but thank you for making it through those if you're not interested in the Dynatrace stuff. We're back to regular programming now, and I know there were some issues with some of the uploads, so those should be fixed by now. But yeah, it was really busy at Perform, and I got to see you there, which was always nice. Yeah, even though it was only for a short while. Yeah, it was
Starting point is 00:01:16 only there for a short while, because I was running around and you guys did all the work. And I just felt like I didn't really contribute a whole lot. You contributed a ton, Andy, all those interviews you did with the PMs. Don't sell yourself short there. That was amazing. Amazing work you got done for us. Yeah, it was only a short while, but it felt like forever. They really run us ragged.
Starting point is 00:01:35 But it was great to see all of our customers and vendors and everyone else over at Perform. And hopefully we'll be seeing everyone again next year. But we're back to our normal show, right? Exactly. And we have a very exciting guest today. I say that often, I know. Everyone's a very exciting guest.
Starting point is 00:01:52 Anybody who's willing to come on the show is exciting for me. But to me, I've been following this person, checking out things they've been doing. I've enjoyed the no-code repository. So do you want to go ahead and introduce our guest here, Andy, or anything else you want to contribute before we head in? No, I think you're right. I mean, it's amazing to have a guest like him on the show today.
Starting point is 00:02:15 And the only thing I want to say, I want to read out his tagline on his Twitter profile, and it just says: minimalist. That's all it says. And I actually want to hand over the token to Kelsey Hightower, who is here with us today. I'm not sure where he is today. He officially lives in Portland, Oregon,
Starting point is 00:02:31 but I know he's traveling a lot. So Kelsey, where are you today? And can you please let us know what a minimalist is and why this is driving you? Yeah, so I actually live in Washington state now. Okay. And I work from home, so I'm at home now, so I'm not actually traveling today. But I do travel quite a bit for our customers and, you know, some community work or engineering work at one of the Google offices.
Starting point is 00:02:54 And a minimalist, well, I adopted the term because, you know, it was the best way to describe some of the life changes I made maybe 15, 20 years ago, which is, you know, living debt-free; all of my clothes fit in probably one bag. I own very few items, but I really appreciate the items that I do own. And I kind of keep this minimalist philosophy in terms of the things that I need versus the things that I want, and I keep that balance. And I guess the term minimalist fits the bill. That's awesome. You know, it's funny, when you talk about the clothes, I know you travel a lot, so that requires having enough clean clothes to travel with. But I was just thinking, you know, since I started working from home, the amount of clothes I buy
Starting point is 00:03:41 is very minimal, not quite minimalist, but I wear the same jeans all week long, if I wear pants at all. So, yeah, I get it. I think that's really awesome. It's something I would love to do, but never really had the drive to do, or the push to do. So it's great that you're pulling that off. So Kelsey, you are working for Google, for those people that kind of managed to escape
Starting point is 00:04:11 who you actually are. I think if you Google your name, or if you Google Kubernetes, then your name comes up all the time, because you're doing a fantastic job of going out there and showing people what the technology, in this case Google and Kubernetes, brings to the table, how this can change our life,
Starting point is 00:04:31 and how we can use it, you know, in the best ways to, I think, in the end, deliver better software value faster to our end users. So I need to get this out there, because even though I think a lot of people know Kubernetes, can you still give me your view on where we are right now with Kubernetes? I know it's a hot topic, and it seems it's been widely adopted, and more and more people that we talk to have already
Starting point is 00:05:00 started using Kubernetes also in production, or are about to. But can you let us quickly know what's the state of Kubernetes, and what does Kubernetes give you? But also, maybe, what does Kubernetes not give you? What else do you need on top of Kubernetes in order to become successful with deploying high-scaling, performing applications, which is still dear to the heart of our listeners? Wow. So I think Kubernetes is, you know, it's definitely an application platform. Most people see it for what it is on the surface, right? You have a bunch of containers, you have a bunch of
Starting point is 00:05:34 either bare metal or virtual machines, and you want something to orchestrate the lifecycle of your applications that happen to be packaged in containers, right? This is how most people look at Kubernetes. This is how most people leverage Kubernetes today. And this is just a continuation of the good work that Docker started, right? This idea that you would package your applications in a standard format. You could then share and distribute those using standard runtimes. And I think that's just kind of where most people are today. If you're an advanced user, though, you've probably progressed well past that. You're probably using Kubernetes to build your own automation tools. So this concept of an operator, or people using CRDs, these are custom resource definitions,
Starting point is 00:06:21 and they've taken this whole declarative approach to application management and spread it across other automation tasks, like creating CI/CD pipelines or machine learning pipelines. So I think this is kind of where the industry is, depending on when you got started with Kubernetes. Now, I think you bring up a great point, and I just want to bring up one project
Starting point is 00:06:43 that we are actively working on. As you said, Kubernetes is a great application platform. You deploy a container, and Kubernetes takes care of, you know, managing the health, spinning up the containers in case they're failing. I mean, that's all taken care of. But what we have also realized, as we are maybe in the next maturity state as an engineering organization, is that we need something to better manage the lifecycle of an artifact as it gets deployed into a stage. Then it needs to be properly tested. It needs to be validated and then promoted into the next stage,
Starting point is 00:07:15 kind of an event-driven model for managing the lifecycle of an artifact all the way through into production, and also managing, you know, problems that may come up in production. So the thing we built on top of Kubernetes, and we're now finally also part of the CNCF landscape, is an open source project called Keptn, which we are happy to give back to the world, where we are, you know, automating some of these tasks on top that, quote unquote, bare-metal Kubernetes doesn't give us.
Starting point is 00:07:49 Now, what we have seen, and maybe you see this as well: we have a lot of people that are just getting started with Kubernetes. And then we have those that have already been working with it for years and years, and are really far beyond what anybody that's just starting can imagine. Is there any advice that you can give people that are just
Starting point is 00:08:10 getting started, and there are still a lot that are getting started and getting their hands on it? What are the things people want to make sure that they are addressing, and what are the things where you say, you know, please don't make this mistake, because it's been made over and over again, so avoid these things? Oh, man. I think the biggest one is using Kubernetes for everything. Okay. I think that's a thing where most people look at Kubernetes and say,
Starting point is 00:08:38 wow, I can probably use Kubernetes to recreate every automation tool I've ever built in my life because it's the new hotness. There's declarative configs. You can extend it. And I think people are just getting a little carried away with that. It sits on top of infrastructure, right?
Starting point is 00:08:55 So Kubernetes assumes you have infrastructure that you can use, right? So I think that's the best place to start in terms of putting Kubernetes as a layer to orchestrate the things below it. Some people want to recreate all the things below it, like build a cloud platform on top of Kubernetes itself. That's going to be very challenging if you don't have a lot of experience on how to do that.
Starting point is 00:09:20 These are kind of separate concerns, like IaaS, infrastructure as a service, the thing you get from a cloud provider, or the right extensions and add-ons you may get from tools like VMware. But Kubernetes doesn't really operate at that layer. It actually leverages that lower layer to do its bidding. So without that in place, Kubernetes is really, I don't know if it's going to be as effective. So I think the idea to keep in mind is Kubernetes has a purpose. It's definitely great at what it does. You can extend it to build other platforms, but just understand there's a big difference between building platforms
Starting point is 00:09:55 and building your own platform as a service, if that's your thing, and using Kubernetes. So I think you just have to be fairly clear about your intentions. Andy, this calls to mind a conversation we had a while back, and all the terms and the guests are eluding me right now, but it had to do with Wardley Maps, right? And there was this idea, and Kelsey, I was curious on your perspective on this. There was the idea, it sounds like what you're saying is, since Kubernetes is extensible, since you can kind of make it do a lot of things, the question is: what should you be making it do, and what shouldn't you be making it do?
Starting point is 00:10:35 And I'm not asking for any kind of a vendor endorsement in this question, but more of the question of it seems to, if there are products and services out there that do things very well, and it's not in the core scope of Kubernetes, then leverage those external products. Find out what you should build in-house or make on your own, and then also identify what already exists that can handle the things that you need and use them because it's a lot easier to have a dedicated company or a dedicated set of resources to manage your IaaS components or maybe manage your monitoring or maybe manage all these other components.
Starting point is 00:11:19 So where do you see that case that you're talking about falling into: go to a third party versus build your own? The way I like to think about this is: let's say Kubernetes didn't exist. What decision would you make? You know, what logging stack would you pick? Would you pick Splunk? Why did you pick Splunk? All of those decisions are probably still good decisions. Just because Kubernetes has a default logging tool, Fluentd in this case,
Starting point is 00:11:54 Maybe there's a lower cost there, but it doesn't mean that your previous good decision is no longer valid. And I see that happen a lot. Same thing for your load balancer. One thing you would like to see is, let's say you're using Nginx. It would whole class of reasons that has nothing to do with Kubernetes or the ecosystem that may be introducing a different take on load balancing just because Kubernetes is the new hotness. So I think a lot of times you got to make sure that you're making
Starting point is 00:12:35 decisions that are just kind of good on their own, and not making decisions just because you have Kubernetes. I like that. Hey, so I know you've been talking a lot about all sorts of aspects of Kubernetes. One aspect that we are talking a lot about is performance, talking to some of the performance engineers that we have in-house and to some of the performance engineers that Brian and I often meet, because that's also our background. We have a couple of friends around the world that have been doing performance engineering for 20, 30 years, and they are always skeptical about, you know, the next new platform, especially when it comes to the hype around it. And then: does it really perform as well? Does it really scale as well? As you said, underneath it's still just hardware, where you have just more layers of abstraction on top of it.
Starting point is 00:13:34 And therefore, everything has to be slower by default because you have so many levels of abstractions. Can you talk a little bit about what you have seen from a performance and scalability perspective? What are things performance engineers especially have to consider when they are dealing with Kubernetes? Are there any new patterns we have to be aware of? Are there any new metrics that are interesting when it comes to monitoring the performance of the Kubernetes platform itself? Is there anything that you say, well, the way you did it on the classical infrastructure doesn't make sense anymore in Kubernetes and here is why. So
Starting point is 00:14:16 especially speaking to those folks that have a long history with performance engineering and are now moving to Kubernetes, can you talk a little bit, to kind of address their doubts, but also give them some practical advice? Yeah, so if you're going to be measuring Kubernetes, right, this is a new system. And most people are measuring, whether it's scheduler latency, the metrics, how many pods did it schedule,
Starting point is 00:14:41 how much memory is the Kubernetes API server using. A lot of times the API server does serve as a cache, so it's expected to use a lot of memory, to make a performance trade-off in terms of API calls. So when you say kubectl apply and you deploy your application, you want that interaction to be as fast as possible. So a way of doing that is: Kubernetes does attempt to use a lot of memory on the API
Starting point is 00:15:07 server to protect the database, etcd. And that's a big performance trade-off, right? So some people will look at that and say, wow, this API server takes up way too much memory. But again, that's by design. So just like with any new system, you really need to understand why it's designed the way it is, and what trade-offs it is making for the sake of performance, before you can really start to measure and understand
Starting point is 00:15:30 what your measurements are telling you. So that's step one. So when measuring Kubernetes, it is a different system. There is a control plane: the scheduler, the API server, and all the things that make that control plane work. And there's a data plane, or a couple of data planes if you want to think about it that way, like the kubelet, the thing that takes from the control plane what to run. That thing lives on the node, and it has its own set of metrics that you may care about. So that's Kubernetes itself.
Starting point is 00:15:57 The biggest gotcha I see is with people who have experience, when really thinking about the performance of the application. The defaults are the things where I think a lot of people who have been doing this for a very long time tend to forget the defaults that are on a virtual machine, right? So if you take a VM or a bare metal server, you're probably running as root, even though people won't admit it.
Starting point is 00:16:20 You're usually running as root. You have access to all the memory, all the CPUs. You have no isolation. You have no namespaces limiting what you can do with the memory or CPU. So the whole machine is yours. And if you're used to a performance benchmark under those conditions, you're going to be really surprised when I take your application. It's not the fact that I put it in a tarball, aka a container image; that's not where the impact comes from. It comes from the fact that you're going to be running under a container runtime that by default
Starting point is 00:16:51 will try to isolate that process and also limit how that process gets to interact with the number of CPUs available. So one concrete example would be like a Java application, right? You're running on the JVM and if I give you a 16 node application, right? You're running on the JVM. And if I give you a 16 node box, right? 16 CPUs, typically, depending on how you've configured your JVM,
Starting point is 00:17:11 it's going to see those 16 CPUs and just grab all of them, right? Like I'm Java, take all the memory, takes all the CPU, and that may become your new performance benchmark. Now, let's say you have Kubernetes and you take the same application and you do something like, you know, you may limit the CPUs to two because you've copied and pasted from Stack Overflow. You're not quite sure what you're doing in that pod specification when you deploy the app. The problem, though, is even though you told Kubernetes to only run you on a machine that has two CPU free,
Starting point is 00:17:46 and you may even set some limits, like no more than four CPUs of utilization. That sounds about right. And then you run on the machine and the JVM is still going to see 16 CPUs. And it actually will not necessarily take a queue from your configuration. So the scheduler and the kubelet and the container runtime will limit you, but the JVM will still run as if it was still on the VM and that will manifest itself in thread contention, all kinds of things that will look like the app is now running quote unquote slow. So then is this problem not solved by any of the JVM vendors so that JVMs understand in which runtime environment they really run and therefore automatically adjust the defaults? And is something like this available or not available?
So is this problem not solved by any of the JVM vendors, so that JVMs understand which runtime environment they really run in and automatically adjust their defaults? Is something like this available, or not?
Starting point is 00:18:35 I'm going to say maybe, but like all JVM problems, there are probably 200 additional switches that you can use, right? So there's already, like, 10,000 switches, and most people I know don't really know which ones do what. So most people tend to run with a default configuration. I think your audience is going to be a little bit different. They're probably used to really tuning the GC, or tuning various aspects of the JVM.
Starting point is 00:19:00 But again, even with that knowledge, you still may not know the relationship to your current configuration and the limitations being opposed by the container runtime. So I think those are the things that you're just going to observe that, oh, the kernel is throttling the threads that you're operating on. You would have to understand what Docker is doing to you. So I haven't seen any JVMs that just auto-detect the fact that they're running inside of a container runtime. There's environment variables that you can pass to the JVM, right?
Starting point is 00:19:31 So if you can take your configuration, you can say: hey, this thing only has this many CPUs, I want to limit my thread pool based on that number. So you mentioned a lot of this comes from, you know, you're running Kube as root and all this. Is there something during the initial setup that you would do differently to be able to contain this? Or does it all come down to not just copying and pasting from Stack Overflow, but understanding all of the settings that you're going to put into your containers and everything to limit this? How does one control this, besides that environment variable for the threads, or limiting it to two CPUs specifically? So there's two things at play here.
Starting point is 00:20:10 One, most people are coming from a world where they just had full ownership of the machine, without isolation, running possibly as root. It's probably less about running as root than it is about having access to the entire machine, right? So that's the baseline
Starting point is 00:20:24 most people are tracking, right? I had the whole machine to myself, very few agents or daemons running except for the ones that I put there. And that becomes your new baseline, especially if you've been running like that for years. Then you move into the Kubernetes world. So whether you're running as root or not, there are ways to control the identity that this thing runs as. But the biggest challenge, I think, is communicating the runtime configuration: using network namespaces and cgroups for the first time may introduce limits and overhead that your runtime isn't used to, in the case of the JVM. And the way you communicate that to the JVM is by using what we call the downward API. So in Kubernetes, at runtime, you can actually gather some information about some of these limitations that you're putting in place
Starting point is 00:21:12 and use that to actually configure the JVM as your process is being launched, right? So you'd do it as part of your wrapper script, where you're about to launch the app and you want to set your JVM switches up ahead of time. But that's the problem there: the biggest challenge is communicating the limitations, because what the JVM will see will be largely different from the constraints that are put around it.
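As an editor's aside, a minimal sketch of that downward API pattern might look like the following; resourceFieldRef is the real Kubernetes mechanism, while the image name, the wrapper script, and the specific JVM flag are illustrative assumptions, not something from the episode:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jvm-app                                 # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/jvm-app:1.0   # hypothetical image
      resources:
        limits:
          cpu: "2"
      env:
        - name: CPU_LIMIT
          valueFrom:
            resourceFieldRef:                   # downward API: surface the limit
              containerName: app
              resource: limits.cpu
      # A wrapper script could then launch the JVM with something like:
      #   exec java -XX:ActiveProcessorCount=${CPU_LIMIT} -jar app.jar
      # so thread pools are sized for the cgroup limit, not for the node.
```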
Starting point is 00:21:41 minimalist, isn't that the best advice? If you actually tell folks that used to run their JVMs on a I own the world kind of environment to, hey, you may want to start thinking about what's the minimum environment that your application, what's the minimum environment that your application needs in order to still run as expected from a performance perspective? I think that's a general great advice then. Stay minimum.
Starting point is 00:22:12 And that means you can then really truly also leverage the scalability options that Kubernetes gives you by then spinning up maybe more pods in order to scale. Obviously, assuming your architecture allows for that. But would that be a good kind of advice? Stay minimalistic? Yeah, I think even before Kubernetes, I think this idea of need versus want. So if you look at most people, if we continue with the JVM,
Starting point is 00:22:37 if you look at your class path, do you really need all of those jars in there? And I think a lot of times, those jars increase the startup time, they use more memory. So if you can clean the class path out and just leave what you need, you're probably going to have a much better time in terms of, you know, compute utilization and requirements, and also in understanding what flags you need. I've seen a lot of people who have, like, 30 flags set. I'm like, why are these flags being set? And they're like, I don't know, man, it worked at some point, so we don't change it. And when you
Starting point is 00:23:10 go into a new platform, not understanding what flags are critical and what they do, you never can really get to the point of having only the flags you need to tune the behavior that you want. And going back to the idea of the jars running, sorry for picking your brain on this so much, but curious from from a point of view, a developer might know which ones they're using, which ones they're not. But let's say you're talking to someone from a performance engineering team, RSRE team.
Starting point is 00:23:32 Is there a way with any tools or anything else that they can detect which ones are utilized and which ones aren't so that they can make a recommendation to say, hey, we noticed these ones aren't being leveraged. Maybe you should consider removing them. You know what? I haven't worked with the JVM extensively for a long time,
Starting point is 00:23:51 but I'll tell you what I used to do, just to be very transparent. From my experience, no one knows what they actually need. Because if I ask a lot of developers, what are your dependencies? And most of them will say, I don't know. I need Red Hat version X. I need JVM version X. And then ideally everything in the class path is what I need because it is working. As you remove something, it may stop
Starting point is 00:24:15 working. So don't remove anything. That's usually the status quo that I've seen. Now, if you're lucky, one thing I used to do is just look at the startup process. Like, it's tedious as hell. But you have to be careful because you may have library paths that are only exercised in very unique scenarios. So you don't want to go remove something because you haven't seen it be used in a long time. For example, if you have a Postgres jar, you may say, oh, this app doesn't call Postgres. Well, it may under a reporting condition that runs only once per month. So you kind of need to have a negotiation. So what I like to do is sit side by side with the team, figure out what parts of this is being used. And then what is it being used for? Why is this Postgres jar here at all if we're using Oracle? And someone will say,
Starting point is 00:25:02 oh, we'll use it for reporting. So, okay, it gets to stay. And then I try to remove one by one, until we cut too deep and have to add something back. But to me, it was just more of a trial and error, sitting next to the experts. And it just takes time to get to a point where you have that discipline, knowing what needs to be there. Hey, Kelsey, I've got another question for you. One of the things that we get asked a lot about: so yes, we understand we're going towards containers, we're using Kubernetes. Shall we treat our pre-prod environment and our production environment in a way that we have different Kubernetes clusters, or shall we use the isolation that namespaces give us to isolate the stages, pre-prod and prod,
Starting point is 00:25:51 or shall we even go the next level and say, well, there should only be prod. We don't need any pre-prod anymore because if we're doing it right, we just deploy into prod. And if we have a new version, we just do canary deployments or we use feature flags.
Starting point is 00:26:07 So I know this is a long stretch question, but the question is: if you talk with an organization that is trying to figure out how this works for them, do they need multiple Kubernetes clusters for the different environments,
Starting point is 00:26:22 or is namespace isolation enough or should we go all in and say we only do production and then use these new deployment options that we have? What is your experience? Where does it work? Where does it not work? Because I'm sure it depends also on the application and the maturity level of the organization.
Starting point is 00:26:41 Yes. All of those things. This has been a question for, like, 20 years: should you have one VMware cluster for everything? Should you have one Amazon account or one GCP account for everything? All this starts to break down really quickly when you start to think about security, software upgrades, mistakes, blast radius. There are so many things around maturity that have nothing to do with the happy state of you getting everything right all the time.
Starting point is 00:27:13 So I think as humans and human nature, mistakes are going to happen. That's okay. So here's the thing. Let's just talk about those scenarios really quickly. Let's say you have one Kubernetes cluster for production. Just one big one. You have 5,000 nodes in there and you're going to use namespaces to create environments.
Starting point is 00:27:30 Great. This sounds like a great idea. So in production, your third-party extensions are really locked into Kubernetes 1.14. Just 1.14: some developer tried their best, and they coded around 1.14. Then 1.16 comes out. They didn't test those existing extensions with 1.16, because there was nothing to test against. Turns out 1.16 is incompatible with those third-party extensions that you've come to rely on. You upgrade the cluster. For QA purposes, QA wants to test, but there's only one control plane. Even though you have multiple namespaces, there's only one control plane. You upgrade to 1.16. QA is fine. Production is down. The first thing you're going to do is say: hmm, maybe we should have more than one Kubernetes cluster.
Starting point is 00:28:28 Now, if you feel that more than one could be in production, so you have cluster A and cluster B, and now you have 2,500 nodes in one, 2,500 nodes in the other, and now you have a much easier way of rolling out a Kubernetes control plane upgrade. You can upgrade one cluster, and if a percentage of your applications fall over, then you know not to touch the other cluster where production traffic is still serving. So I think that's going to be your biggest decision factor, is how much of a blast radius
Starting point is 00:28:57 do you want? The other thing that I see is maybe you're doing this performance testing, and you really need to blast this cluster. Okay, if you only have one cluster where all of your stuff is being shared, either you're going to over-provision that cluster to accommodate for big performance tests or what we see a lot in the cloud, especially when you have more tools to manage a cluster, there's almost no reason to put all your eggs in one basket. So let's say your steady state in production is 2,000 nodes. If I want to do a performance test, I can just create another 2,000 node cluster, run my performance test, and delete the entire cluster instead of trying to maintain a maybe 3,000 node cluster that auto scales up and down playing musical chairs. So I think there's
Starting point is 00:29:46 just so many more options that you really don't need to think about one big cluster. Now this is slightly different if you're like on-premise where you don't necessarily have the ability to do elastic compute where you can scale up and down easily. So you may get forced into over-provisioning which we see a lot when it comes to on premise. But even in that scenario, you can add and remove nodes to an existing cluster. So your automation is going to be different. But I strongly recommend people just for the sake of Kubernetes upgrades, and thinking about your third party extensions and dependencies that may not be so may not have the ability to have QA, dev, performance,
Starting point is 00:30:26 and production control plane be at the same version during all the test phases. This is fascinating. Thank you so much. So just to be clear, when we talk about these third-party extensions, we're primarily talking about operators that you have running in your Kubernetes cluster that you may have written yourself or that you brought in from somebody else. What other extensions are there for those people that are not that familiar with Kubernetes
Starting point is 00:30:54 that we need to take care of? So I'll tell you one that I saw break in production for real production people, and I had to help troubleshoot. So think about the network policy extension. Very great company, Calico, and they have a really nice extension that really implements Kubernetes network security policies well.
Starting point is 00:31:15 And the way that works is there's an agent that runs on every node, and you can put in a Kubernetes network policy. So this works like a traditional firewall; it leverages iptables. And what it does is it watches the Kubernetes API, and as it gets configurations about how to configure that firewall, it then goes and updates iptables. Okay. So a new version of Kubernetes comes out.
Starting point is 00:31:38 And to be fair, there was no way for them to test future compatibility. So let's say there's a small bug in either, A, the Kubernetes control plane, where the way it does its configuration or the way it sends its changes may have a bug, or maybe it just changes; or, B, there's a bug in the controller itself that's watching Kubernetes.
Starting point is 00:31:56 That actually changed and all of the firewall rules closed and they failed open or they failed close. That means all traffic in the entire cluster just stopped, period. At that point, you are now hard down completely. But you could have tested this if you had a separate cluster. You could have upgraded the Kubernetes version, saw everything, stopped and said, all right, we're not moving this to the other cluster until we resolve it. This stuff is all very advanced Kubernetes, right?
Starting point is 00:32:32 A lot of times people start by just, we're going to spin it up, we're going to start running some things. And probably until they hit their first hiccups, they're not thinking about these things. For people who want to try to get ahead of this, besides DMing you and trying to take your time, which I know is not reasonable. Yeah, don't do that. Yeah, don't do that, exactly.
Starting point is 00:32:50 But what resources are out there to support people's understanding? Besides going to, like, KubeCon, besides listening to or watching some of, you know, your presentations and other presentations, is there any sort of organized or semi-organized set of best practices, or maybe things to consider,
Starting point is 00:33:05 stuff going on out there at this point? Or is it all kind of trial and error, looking up scenarios, doing a bunch of searches, and trying to figure out how to do this? Yeah, I think that's true of all tech, right? There's no substitute for experience. With that said, though, the nice thing is there is an official certification
Starting point is 00:33:24 for a lot of people that are, you know, trying to become Kubernetes administrators so they can at least learn some of this baseline knowledge about what these components are and how they fit together. KubeCon, I mean, there are just so many talks over the years that cover everything from security to performance tuning and, you know, horror stories like the ones I just mentioned in production from real companies. So I encourage people to learn from others. So watch a couple of those talks, watch a couple of those videos, and then give yourself some headroom. Having more than one cluster will give you the time to make those mistakes, but then limit the impact of those mistakes. And I think that's all we're trying to do is have this delicate balance in tech where we're all kind of learning as we go,
Starting point is 00:34:09 especially when there's new tools like this. We can leverage the people who've already gone through this, listen to them talk, read those blog posts as inputs for our own education, and then leaving ourselves headroom in case we get it wrong. I imagine a lot of people might think, or maybe you tell me if you come across this, I would imagine some people or at least organizations might think, well, if I'm using GKE or EKS or anything else like that, I don't have to worry about this stuff so much. But I would imagine that the proper advice is,
Starting point is 00:34:38 no, you still have to worry about it because they're just providing the framework there or basically the hardware for you, you still need to know what you're doing on a very deep level. Don't rely on the cloud vendors to take care of all this for you because they won't. They're just going to give you what you ask for. Yeah, I remember my first encounter with true automation.
Starting point is 00:34:58 I've always wrote scripts and things like that, but I remember getting configuration management for the first time with Puppet. And I was like, look, I can update all 1,000 servers in two or three minutes. Let me show you. Bam. All the servers are updated. It's like, oh, so we don't even have to worry about deployments anymore. I'm like, oh, no, that's a thing of the past. Now, the problem with that is when you have a wrong configuration or the wrong version of the app, you will totally roll out the wrong version of the app in minutes and not days, which is really, really bad. So when
Starting point is 00:35:31 it comes to Kubernetes, even if you have a fully managed Kubernetes cluster, it still doesn't save you from the issue we talked about earlier, which is an upgrade of a version of Kubernetes that your extensions can't handle. It's just going to happen faster. So when you click on that auto upgrade button, it's just that at some random point in time, you may be a victim of an auto upgrade. So you still need to be cautious. And I think the same discipline applies even when the cluster is fully managed. Hey, Kelsey, I think earlier you mentioned that, you know, one of the big beneficiaries, the group of beneficiaries of Kubernetes
Starting point is 00:36:13 are the application developers. At least that's the way I remember you saying it. And that's obviously true, because it's easy to, you know, deploy your containers, and then, thanks to service meshes like Istio, you can easily deploy them side by side with the previous version and, you know, do things like a canary deployment. Have you seen,
Starting point is 00:36:44 with the people that you have worked with in the past, any kind of, let's say, automation evolving around this? Because knowing that I can deploy my container side by side using Istio, by defining my Helm charts correctly, is great if I know how to do this. But with more and more application developers kind of hopping onto Kubernetes,
Starting point is 00:37:18 that allow them to deploy easier with kind of like a self-service experience, I would almost say, to really truly leverage the power that Kubernetes gives them. But obviously something is needed on top. Yeah, so it's funny. This question comes up every five to seven years. New thing shows up, we leak it to the developers, right?
Starting point is 00:37:39 Hey, virtualization. Hey, everyone gets to log into the VMware console. Hey, Vagrant, everyone gets VMs on their laptops. Hey, Cloud, everyone gets an Amazon or Google account. Hey, Docker, everyone gets a Docker daemon. Oh, what's that? Is that Kubernetes? Everyone gets kubectl. It's like, what are you doing?
Starting point is 00:37:57 If you keep doing that, you're going to keep asking the same question every single time. There's really one serious workflow that you really, really want: I, as a developer, check in my code. There's a way to build my application to produce an artifact. Jar, war, binary, doesn't matter. You can package that in an RPM or a container; doesn't matter to me either. I need that thing to be running on some compute. That's kind of what we were talking about here. Now, I may need to give you some additional metadata, so you know where you want me to run it, or how much memory or CPU it needs. That's usually all you need, regardless of the system: VMs, Kubernetes, containers, it just doesn't matter. The problem is we haven't had a lot of time to create those other systems. There are systems like this, like Cloud Foundry, OpenShift, Heroku, App Engine.
Starting point is 00:38:47 These are very opinionated ways to take a subset of the things we just described, some artifact and some metadata, and to deploy your application, right? That's kind of been the whole goal of these PaaSes. So when you look at Kubernetes, when you look at Istio, these are just platform components for you to build your own PaaS. So if you're an operator and you have a bunch of virtual machines or bare metal machines, you could spend almost a lifetime at this point trying to create all the things that Kubernetes has and all the things that Istio has, to build your own deployment system for your team, or the people you're trying to support with your infrastructure. Or you can download and use components like Istio and Kubernetes and then layer your opinionated workflow on top. So this all boils down to: what workflow do you want your developers to have? I'm going to use one more analogy here. It's like the internet.
Starting point is 00:39:41 The current workflow for most consumers of the internet, you go to somewhere like Best Buy and you buy a modem and it says, this will get you on the internet. It's all right, great. You buy the modem and you open the box and there's a cable in it and you twist it in the wall and lights start flashing. You're probably online at that point. That's it. And then you just get online and if things don't work, you look at the lights and then you unplug it and plug it back in. End of story.
Starting point is 00:40:10 That is the interface. That's the workflow. What we do in the Kubernetes world, we say: no, we can't be having that. We need everyone to know how to create a Cat5 cable. So here's the chart of how to twist the pairs of wires, and here's how you clip the other end. So let's give every developer, every internet consumer, a Cat5 kit so they know how to
Starting point is 00:40:31 create cables. And next, we're going to teach them how BGP works, right? Because that's the backbone of the internet and we're doing DevOps. We don't want people just using the internet. We want them to know how it actually works. It's like, come on, you got to be realistic. So focus on the workflow. Kubernetes and Istio are implementation details and they give you a lot of platform features, but don't make the mistake of leaking all of those details as if that's the end game. I love your analogy.
Starting point is 00:40:58 I love your analogy. And, if it's okay, I may steal, slash, borrow it in some of the presentations that I'm doing. And I completely agree with you, right? This is the challenge that we also faced. And with "we", I mean our own organization, where, when we started with Kubernetes, everyone that tried to use Kubernetes had to figure it out: they were kind of building all these cables and trying to figure out how TCP/IP works, to use your analogy; figuring out how to deploy, figuring out how to configure Istio. And we realized this was greatly slowing us down in how we could actually deliver value to our customers, because we have a lot of developers that were just purely spending time
Starting point is 00:41:47 in trying to figure out how this big thing underneath actually works. And this is also when we started building, call it, an opinionated way of enabling our developers to just say: hey, give me your artifact. And then we are kind of managing the lifecycle of that artifact, by automatically deploying it, automatically configuring Istio,
Starting point is 00:42:12 automatically monitoring your SLIs and SLOs that the developer also specified as metadata, and then giving the developer feedback and giving them the option to promote it into the next stage. I want to be clear here. I'm not saying that everyone in the world who downloads giving them the option to promote it into the next stage. I want to be clear here. I'm not saying that everyone in the world who downloads Kubernetes for the first time will automatically have the time to build this perfect workflow.
Starting point is 00:42:33 What I'm saying is there's a couple of things. In the very beginning, your workflow may very well be people using kubectl apply and pushing YAMLs at the API server. That data definition is also the fundamental element that no matter what workflow you build, it will always create those data elements, those configurations you give to Kubernetes. So the nice thing about this is it becomes your escape hatch.
Starting point is 00:42:58 So let's say you build a workflow that says, if you check in some code, we build the container, we deploy to Kubernetes. That's not very advanced. But the nice thing about that is you may not have any troubleshooting tools in that workflow either. But your pipeline will just create the same artifacts that developers were creating with kubectl. Bonus points if you check them in before you deploy them. Now troubleshooting can just be done on the backend using kubectl until you decide if you want to have higher
Starting point is 00:43:26 level debugging tools to go along with that workflow. So keep that in mind that I'm not expecting perfection day one, but just make sure you understand that there is work to do based on what you get out of the box. And I think that also ties into the concepts of what we see even in our own organization where operations becomes more of moving from the old-fashioned operations type of team to the pipeline and requirements team to say to the developers, check in your code with these additional artifacts, but it's a subset of the artifacts that they need to know about.
Starting point is 00:44:03 They don't need to know every single component. But if the operations team can outline and define what is required along with a code check-in, then the rest of the pipeline will pick everything else up based on the conformity to what's out there and push it. And the developers don't have to think about all the rest of that. Because again, you don't want developers to have to understand all the underlying components. You want them to write code and you want them to write code that works. So the more you can abstract away from them
Starting point is 00:44:32 and put that into the hands of, maybe it's your operations team, maybe it's another team, but someone else defines all that and someone gives them a gated amount of additional components they check in, it becomes a lot more manageable. Well, probably we could talk about performance.
Starting point is 00:44:48 And I also really like the way you kind of explained it. Think about what's the workflow you want your developers to have, right? And how can you make this more efficient, more streamlined, more easy to use? How can we offer things like this as a self-service? But on the other end, and Kelsey, probably you'll agree with me, whether this is Kubernetes or whether this used to be in the past
Starting point is 00:45:14 a system where you were deploying to a big Java application server; in the end, we should have always thought about it that way: how can we make the lives of developers so easy that they can really focus on delivering code, which then gets automatically deployed in the right way, without having to think about all the plumbing underneath? But I guess that's just new. Exactly. Happy paths with escape hatches.
Starting point is 00:45:37 Yeah, exactly. I like that. Yeah, let's jump over to the topic that Brian just brought up, right? So security is a big topic of yours. If I look at some of the meetups you've done in recent weeks, and some I think you're scheduled for, and if I look at your Twitter feed and the podcasts you've done, then security is a big topic in the Kubernetes space.
Starting point is 00:46:04 Can you tell us a little bit more, particularly what are a big topic in the Kubernetes space. Can you tell us a little bit more, particularly what are the big topics that people need to be aware of and things that people have to think of? Yeah, so right now, I mean, it's just been amazing. Like the last five or six years, the number of platforms we have, whether it's Lambda, Cloud Run, Heroku, Cloud Foundry, raw Kubernetes, or just Docker. All of those are now getting us to the point where we can put our software wherever we want it, when we want it. This is a great place to be in.
Starting point is 00:46:33 Now the security question comes in, how do you lock down those platforms? There's tons of discussions around how do you secure those platforms to only run the things that you want to run. Now we're turning our attention to what are we running? So if you build, let's say, a Go application and you may have some OS dependencies,
Starting point is 00:46:52 something like FFmpeg or something like that, that will come from your operating system. Typically, if you're using something like Red Hat, they'll sign some packages, track the CVEs. You can run yum update and maybe rebuild your container or rebuild or update the server that the app is running on and do dynamic linking. Okay, we kind of have a handle somewhat on that. But where there's a blind spot in the industry is our third party dependencies that happen to be software related. So let's say someone creates a go package or go module
Starting point is 00:47:22 that you just import to do some basic functionality. Who is tracking the CVE of that little code snippet that you're getting from somewhere like GitHub? It's not an OS package. It may not have the same scrutiny or visibility that things like FFmpeg have. So now you have this blind spot. So when we package up our applications, let's say in containers, we have to worry about the OS level dependencies. And we also have to worry about the third party dependencies that get compiled in our binary. So this end-to-end software pipeline or governance or chain of trust, how do we establish that chain of trust? And when it's broken, meaning there's a vulnerability in one of those third party dependencies, how do we identify who has it and how do we redeploy when that takes place?
Starting point is 00:48:13 So how do we do this? I mean, how do we... No one knows because... It seems that obviously depending on where you are in the delivery pipeline, when you're, I mean, there's obviously ways where you can probably include code scanners and to figure out, do you just copy paste it something that potentially is vulnerable, right? And then as you're then linking to other libraries, is this, I mean, as you said, maybe nobody knows yet where it all fits in, but there's probably multiple phases in the end-to-end delivery pipeline where you need to think of security and how to make sure you're not including or introducing existing vulnerabilities, or you need to take care of logging on what you include so that later on, in case there is a new vulnerability, you know that you have this vulnerability currently deployed. Yeah, I think people are trying to take an approach that you see in the auto industry, right?
Starting point is 00:49:14 If you build a car, the car has a VIN number, has a make and a model, has a color. And typically, you can trace back a lot of the components like the alternator, the engine, even the airbag to that VIN number. So you have a bill of goods. And you can think of the VIN number as the components like the alternator, the engine, even the airbag to that VIN number. So you have a bill of goods and you can think of the VIN number as the signature for the car. It's probably not a perfect analogy, but now you have a way to trace back. So when there's a recall on a car, you can use the DMV registration to identify everyone who has that car and almost pinpoint the problem down to one manufacturer and say, hey, return your car and we will fix this component. We don't have that for software.
Starting point is 00:49:49 So in the software world, you don't really know who's importing your library. You don't know what version or what checksum of that library that they have. So one way we're trying to solve this is by starting at the very start. You're checking your code. Ideally, people are signing their commits so we can trace back where these things were introduced by the various authors. People like me, we check in our dependencies. So even though we have a third-party dependency, I like to check in that dependency so I can have it with me to prevent someone from replacing a 1.0 and me just re-downloading it in
Starting point is 00:50:22 my next build. So I like to kind of check in the things that I'm using and then rely on the checksum, not just the version, to really tell me what I have. You can analyze that code. There's static analysis that you can use. There's linters and there's also reputation. Do I trust this person? Therefore, I may have a head start on trusting this code.
Starting point is 00:50:41 And then we have to trust the build system, right? So we're not going to talk about recreating the universe from scratch. We're just going to zoom forward to the we have to trust the build system, right? So we're not going to talk about recreating the universe from scratch. We're just going to zoom forward to the point where I trust my build system and the hardware that it runs on. I trust my compiler. And once you have all those things, we want to sign the results
Starting point is 00:50:56 and then carry around that signature as kind of proof that we know where this thing came from. I'm not saying it's going to prevent problems, but it's going to let us identify when problems do crop up and we can trace it all the way back to how that software got introduced. I like the term signature, by the way, because it reminds me of a term that we've been using, which is a performance signature. So we look at different performance characteristics of a system, and then this kind of makes up the signature. And then if that signature changes, let's say from build to build, from configuration change to configuration change, or from workload to workload, then it's a way for us to detect that something is wrong.
Starting point is 00:51:37 And then we can typically also pinpoint it to the area of where that signature changes. So I like the term. So feel free to keep using it. And what you just described is a very much a missing component to this software delivery pipeline. Lots of people are thinking about this from the QA standpoint.
Starting point is 00:51:54 So a QA team runs some quality assurance tests, and they may sign it with their key. We used to do this with RPMs back in the day. And that means QA has approved this to go to the next environment. We see this with security scanning tools that look for vulnerabilities; they may also attach an additional signature saying, hey, I assert that this is free of any vulnerabilities that I know about. So it would attach a signature. But you're right. For people in the performance
Starting point is 00:52:19 community, there could develop like a set of performance baselines and attach yet another signature to say, this piece of software also conforms to our benchmark targets for what it means to be performance software. Yeah, that's awesome. Hey, Kelsey, I know we've been taking up a lot of your time already. And it's really amazing that knowing, at least seeing based on social media, what you're doing, it's great that you could really take out that amount of time off your busy schedule. I know there's a lot of more stuff we can talk about when it comes to Kubernetes, which is obviously dear to your heart. We can, I'm sure, talk more
Starting point is 00:53:02 about what it means to live like a minimalist, and how we can apply this to other parts of our life as well. Typically, and I know, Brian, this is something you always say, typically we kind of wrap it up in a little summary that I give. But is there any other topic that is also dear to you? Is there anything we're missing, maybe something we can also cover in a later episode? I mean, we talked about performance. We talked about people that are getting started, the whole myth of:
Starting point is 00:53:31 should we build one big Kubernetes cluster or just smaller ones, for the sake of upgrades. We talked about security. Is there any other area or topic that we must not forget about? Yeah, so I think service mesh is a really big component, where we're starting to move a lot of smarts into the network layer. So people are taking sidecars like Envoy and control planes like Istio. And one, I think, big topic for performance is if we start to move things like TLS mutual auth
Starting point is 00:54:01 to be mandatory. So imagine a set of microservices, thousands of them. You're going to introduce a real overhead in terms of performance to encrypt all traffic locally. And that trade-off between security and performance is a big one because costs also go up. I think there needs to be more exploration there around that trade-off between security and performance because if the system is so slow that it's unusable, then people will, by default, more than likely turn off that security. So I think we get more professionals who are looking at this from a performance standpoint, not only identifying the performance bottleneck, but also helping people solve it. Maybe there's kernel tunables we could use.
Starting point is 00:54:42 There are different techniques we can do in these proxies to speed things up quite a bit without turning them off. I think this is really a big area because I think right now, most people are just taking the other trade-off, which is turn off the security bits because they can't afford the performance overhead. But isn't that also an argument for, I know there's a lot of people that say,
Starting point is 00:55:04 But isn't that also an argument, though? I know there are a lot of people that say: well, first of all, microservices don't solve every problem. And then you get exactly these arguments from people: well, you wouldn't have this problem if you were smarter in architecting your software, putting things together that belong together, without breaking every little component, every little function, out into its own service, and then having to deal with the overhead of, first of all, transportation, communication, and security. Well, so the thing is, very rarely have I seen microservices solve a performance problem. They solve an organizational problem, where you may have
Starting point is 00:55:35 a large organization that needs to work together without tripping over each other's toes. I can see from a big organization structure, but to believe that we can go from in-memory communication to over the network and somehow solve a performance problem, that one's hard to swallow because I've never seen it, right? I think there are techniques you can do with caching to close the gap, but it's very hard to get back to data locality. I think the reason why that one's a little bit more up in the air is because you don't have to adopt microservices. You can choose an architectural pattern that works for you. You can mitigate different designs, but security is one where you can almost not really make the same trade-off. You need security at all levels. How much you need is really being impacted by,
Starting point is 00:56:23 because I think a lot of times, a lot of security tools don't think about performance. They think about security as the end-all be-all. But the thing is, if it's too slow, it gets turned off. And I think that one is where I think we need more attention. I think microservices have been explored extensively. It's kind of philosophical at this point. But the security one really needs people to really look at it because if you make the wrong trade-off here, it doesn't matter if you have a monolith or microservices, it's just insecure. Do we have, from a performance perspective, do these service meshes that, you know, handle
Starting point is 00:56:56 the secure communication at least, are there benchmarks, Are there metrics that actually tell us how much overhead is involved in securing the communication and encrypting, decrypting? Are you aware of that? Yeah, there's people who are measuring that kind of thing. I don't know if we have a great baseline of acceptable performance yet. I mean, I think there's some things
Starting point is 00:57:22 that have been published about the cost, but I don't know if we understand what's acceptable, meaning: if you hit this threshold, you should live with that. So I think there's room for a little bit more awareness there. In the performance space, we often talk about a performance budget. That means, you know, what's your performance end goal, and how much time can you spend on each individual layer of your stack, in order to fulfill that budget, in order to end up with, let's say, a certain response time, a certain throughput. I would just assume that security, the overhead of security, the quote-unquote performance overhead of security, would just have to be factored into the performance budget. That means: where along the end-to-end transaction can you trade off,
Starting point is 00:58:09 how much time can you leverage or use for security-related purposes in order to not impact your performance goals in the end? One would hope. One would hope, exactly. All right. Well, we thank you very much for taking some time with us today and
Starting point is 00:58:29 look forward to everything else that's going to be coming out with you in the future. People can follow you at Twitter at, what is it again? My Twitter handle is just Kelsey Hightower. I pretty much keep my DMs open. So if you have a question, I was going to say a good question, but I think all much keep my DMs open. So if you have a question,
Starting point is 00:58:45 I was going to say a good question, but I think all questions can be a good question. So you can feel free to DM me there. And yeah, I'll be on Twitter trying to interact and share as I learn. And I can confirm that because Kelsey, thank you so much for responding to my DM. So you really are a man of your word.
Starting point is 00:59:04 So thank you so much for that, and thank you for being an inspiration to so many people, especially with the demos that you do on stage and with all the content that you put out there. It's really phenomenal. Keep inspiring more people, and let's make sure that
Starting point is 00:59:20 let's make the world a better place. Let's end it with this. Awesome. Thanks for having me and thanks for the kind words.
