Software Misadventures - Kelsey Hightower - On ways kubernetes can break, being an effective leader and much more - #1
Episode Date: December 4, 2020. In this episode, we speak with Kelsey Hightower who is currently a Principal Developer Advocate at Google and one of the most influential individuals in the Kubernetes community. He is also an author and a keynote speaker, with a knack for demystifying complex topics, doing live demos and enabling others to succeed. In this insightful conversation, we cover wide-ranging topics from his role at Google to the art of storytelling. We get into some very interesting details of how Kubernetes can break in production and practices that work for Kelsey in being an effective leader. Links: https://twitter.com/kelseyhightower https://github.com/kelseyhightower Music Credits: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en
Transcript
Take your time. This whole becoming an expert in 10 minutes or in 30 days is crazy to me.
I've been using Linux for, what, 15 years?
There is no way I'm still an expert today because it keeps changing and evolving.
And I think with Kubernetes you're going to have to be as patient if you want to become an expert around something like Kubernetes. Welcome to the Software Misadventures podcast,
where we sit down with software and DevOps experts to hear their stories from the trenches about how
software breaks in production. We are your hosts, Ronak, Austin, and Guang. We've seen firsthand how
stressful it is when something breaks in production, but it's the best opportunity to learn about a
system more deeply.
When most of us started in this field, we didn't really know what to expect and wish there were more resources on how veteran engineers overcame the daunting task of debugging complex systems.
In these conversations, we discuss the principles and practical tips to build resilient software, as well as advice to grow as technical leaders.
Hey folks, this is Ronak here. In this episode, we speak with Kelsey Hightower, who is currently a Principal Developer Advocate at Google and one of the most influential individuals in the Kubernetes community. He is also an author and a keynote speaker, with a knack for demystifying complex topics, doing live demos, and enabling others to succeed. In this insightful conversation, we cover wide-ranging topics from his role at
Google to the art of storytelling. We get into some very interesting details of how Kubernetes
can break in production and discuss practices that have worked for Kelsey in being an effective
leader. We had a great time talking to Kelsey,
and we hope that you enjoyed this conversation as much as we did.
And thank you for listening.
All right. Hi, Kelsey.
We're super excited to have you with us today.
Welcome to the show.
Awesome. Thanks for having me.
So, funny fact.
So we were researching for this episode, and we just Googled "Kelsey Hightower podcast" on Google.
And we got about three pages' worth of results, mostly from 2020.
I think we were hoping for just like a handful of things to kind of pick from, but there's definitely a lot there.
So we're going to try to keep this interesting for you. So to kind of start, something that's been interesting,
I think, to quite a few of us is,
so the role of developer advocate
probably means something different
depending on which company you're at.
So for you, what does a day in your life
as a principal developer advocate at Google look like?
Yeah, so this is my first time
ever being in an advocacy role, maybe officially.
Other roles I've had, you know, I worked at Puppet Labs on the engineering team, became
a manager on the engineering team.
But if you look at the activities I did when I was at Puppet Labs in 2012, we spoke at
conferences, we interacted with the open source community because Puppet, the configuration
management tool, was used by this large open source community.
So we would be on chat and we would be viewing their code and trying to figure out what the
right features were by engaging with the community directly.
When you get to a larger company, you tend to start to have a very narrow focus, or the ability to focus on that kind of role specifically.
So at Google, developer advocacy is in engineering.
So we're on the engineering ladder.
If anyone has ever interviewed for developer advocacy,
they'll tell you about maybe the same software engineering,
live coding interviews that they have to do
as if they were interviewing to be a SWE software engineer.
And so for my day in the life,
so I'm a principal developer advocate.
So that means I work at the director level.
So I have some executive responsibilities.
Some of those are going to be for diversity and inclusion, technical sponsor for some
technical projects.
So if you're going to launch a new project, ideally you want a technical sponsor to help
you bring it to market.
And then also working with our larger enterprises where at the exec level, think CIO or VP of
engineering, having those
conversations. But the thing I try to preserve, and you see this out in the open, is staying in
tune. So I write a lot of code every morning. I try to contribute to open source projects,
and I tend to write things down. So I think that's the part of the advocacy work people see,
which is, oh, he threw another tutorial on GitHub, or he's speaking at this conference,
or he's on a podcast
like this. But day to day is very different. A lot of design docs that are being written,
code that's being reviewed. And I just get to be fortunate enough to do that for our customers as
well. That's awesome. Yeah, so I imagine working with a lot of different people, especially at
Google, a large part is different teams with different alignments, different
objectives. And you've talked about an empathy and trying to solve people's problems and working
beyond this kind of like job title. Can you tell us a little bit about how you set clear boundaries
for your time? I've noticed this in the past with other folks like at LinkedIn, including myself,
that we care so much about solving a problem with or for others
that it ends up actually consuming like all of our capacity. And we just kind of
get that tunnel vision. I'm just curious how you handle that.
Yeah. So in most people that have been in engineering for a while, as you kind of grew
up in your engineering career, you may start off with a lot of time on the keyboard,
banging out code, getting those commits super high. And then you go into design
phase where the goal then as maybe as a TL is to, you know, you still write a bit of code,
but you're trying to bring up the other people on the team and be a team leader. And then the higher
you go up from there, you start to do a lot more design work. So in my world, I have to be very
careful about my time, right? Because if you look
at my calendar, there'll be a podcast, there'll be a customer meeting, there'll be a design review,
there will be a P&L, you know, review the health of the business, you know, how's Cloud Run doing
in relation to, you know, our Kubernetes Engine product. So what I tend to do is I block off these
four hour blocks, right? And the four hour block will be, you know, write some code using gRPC and Envoy and evaluate how well does Envoy work with gRPC.
Does it introduce too much latency?
And I need that groundwork because it would influence things like Traffic Director in Google Cloud, which tries to do a service discovery backend for gRPC clients, or our service mesh or Istio. There's a bunch of foundational
knowledge that customers expect me to have, that the community expects me to have. So I have to
approach all of these problems as a customer first. So what does that mean? That means I have my own,
you know, Google Cloud projects or set of projects that are actually expensed through the same way a
customer would do things. So I try to avoid any of the special things that we have internally
just to make sure that I'm always experiencing things as a customer does.
And I bring that insight back in.
So I bring it to my meetings, my product meetings,
and to my feedback loops and everything that I do.
Awesome. Nice.
Yeah, and you recently tweeted something to the effect of, and I think some engineers can probably relate with this, of not wanting to be put into a box in a job. Like some people are just naturally just very good at something. And then it just ends up kind of being like the sole thing that they're going to be doing. So when you were joining Google, is it something, was it something that you were afraid of? Yeah.
Oh, extremely.
I think I was going to go work at NASA, too.
There's a team in Jet Propulsion Labs out in Pasadena.
And JPL, they were kind of focused on like a Mars mission, right? That's a real big, exciting goal.
And the nice thing about that was they were also moving towards this whole cloud native stuff,
looking at containers, looking at how they do observability, just looking at new approaches to doing everything.
And that role was going to be pretty broad to really match my skill set and capabilities.
So whenever you go to a big company, they start to have these well-defined roles like level 8 XYZ.
And it's like, I don't want to just be this thing that HR can fit into a
description or something that lands on a job ladder. Because for a lot of people, that doesn't
represent all of their capabilities. So for me, one particular thing I wanted to do is make sure
like, look, throughout my entire career, I've been a VP of engineering. I've been a director
of engineering. I've been a software engineer, but I've always found a reason or a way to work with sales, right? Like we're building things that can be sold to customers.
So why not have a super educated sales organization that knows how the product works in a very
authentic way. So sometimes that means jumping on calls. So regardless of what my role is, I don't
want the role to be so rigid that I can't go help the business execute by using my other skill sets,
right? Like I have the ability to do public speaking.
I want to be able to use that particular skill.
I still write code and contribute to open source projects.
I don't want to be locked out of being able to do that.
And I also have a close connection with the community,
whether that's via social media or chat room or speaking at a conference.
So I just wanted to make sure that the role didn't confine me to just one set of activities
and also want to be able to be fluid. There are going to be some quarters where it's all about building. There are no talks. We're just coding. We're designing. We're shipping product. I mean, maybe another quarter is going out beyond the job title again, and just being flexible based on what's kind of needed at the time and what you find as the highest priority.
You had hinted at public speaking. So you seem comfortable even doing live demos, which people are just terrified of doing.
And so how did you develop this stage presence and this general art of storytelling?
They're just tools, right? So I remember my very first speaking was in Atlanta, let's say around 2008-ish. Speaking at local meetups, I was really in the Python community at the time. And watching people give talks, I remember going to Georgia Tech, one of the big universities out in the US in Atlanta, and watching people give these talks about Python, some of them were super technical, some of them were funny. Some of them were high level. Some of them had lots of slides. And I remember looking at it and said, wow, you know,
if they can do that, maybe I'll give it a try. And when I first did, I was like, maybe I need
to mimic them. You got to have a lot of slides. You got to talk about something super technical.
And you have to do it a certain way. And I would do that for years. I remember my first kind of
big conference was PuppetConf 2012. It was the first Puppet Conf out in Portland, Oregon. And I remember going there and giving this talk thinking I had to dress a certain way, look a certain way, and feel a certain way. And I remember halfway through that presentation as I was building it up, it just didn't make me feel anything. It just felt like I was presenting information. So I remember throwing in this slide of Samuel Jackson in the background and I made some funny joke to go with it. And
everyone started laughing. And at that moment is when I got comfortable in my own skin. It's like,
you know what? If people are going to let me be Kelsey, oh my God, I can just, I can take all the
crutches off. Sometimes I don't need slides. Sometimes I just want to tell a joke or relate
to some experience that I had in production.
And what I noticed was when I look out into the crowd, when I got to the point where I had the courage to look at people and allow them to look at me back, then it felt very natural.
You know, you've been in the office before with a team around and you're showing them a problem and everyone's laughing and it's hysterical.
How do you recreate that on stage?
And that's what, when you see comfort, that's what you see. There's a point in the talk, and I'll go into some detail for those that, you know, I haven't covered this with before, which is: there have been talks where I totally bailed. I said, you know, I can recall one GopherCon where I'm coming down the elevator, and I was emceeing, and I had a keynote later in the day, but I also had this opening thing I was going to do.
I basically remixed this poem from Maya Angelou,
Still I Rise, and I titled it Gophers Rising.
And I remember, I was like,
I'm not going to go read a poem to open a text conference.
This ain't going to work.
No way am I doing this.
And I called my wife and said,
babe, remember that poem that I spent all this time writing and remixing and making sure it was right?
I'm not going to do that today
because what if it doesn't work
and people look at me weird as hell?
I don't want to go out like that.
So I'm not going to do it.
And I remember calling Brian Ketelsen,
who was one of the people who started GopherCon,
works at Microsoft now.
It's like, hey, I'm not going to do that poem thing.
So let's just scratch that from the agenda.
And I got there and they were just like,
you know, we think it's a good poem.
Like, you should do it if you want to do it.
And I think I called my wife and she was like,
okay, I support you, do what you got to do.
And then as the elevator gets to the ground floor,
she calls back and says, that's unacceptable.
You're always giving this advice for people
to get out of their comfort zone, to try new things. And I was like, damn, my words are coming back to bite me.
And I remember going out and doing that poem, and it was the first time that I took away the technical
crutches. There was no demo, there's no slides. We're not talking about the language. We're not talking about containers.
We're talking about something more, in my opinion, more human.
And the reaction was phenomenal.
And then that was the time it gave me permission to also be able to do that.
So you see elements of that in various talks that you've ever seen me do over time.
That's awesome.
Yeah.
Thanks for that.
So I would like to shift focus to Kubernetes a little bit. It has become the de facto choice
for a container orchestration system. And anyone who has used Kubernetes knows that it
is an amazing piece of software. However, the stack is so wide and deep that it also exposes one to failure modes they didn't even know existed.
At least it's something that I've learned in the last year.
And I've seen that Kubernetes can break in amusing and surprising ways.
I know you've worked with a lot of customers who use Kubernetes for various different use cases.
Can you tell us a little bit about the ways you have
seen Kubernetes break in production? That could either be because something went wrong or it was
just configured in the wrong way. Yeah, so I think it's healthy to remind people, you know,
Kubernetes is kind of in its first trimester, right? This idea was born, what, six, seven years ago. And the pieces are there.
You know, you layer on top of an existing ecosystem, which is Docker. You build in,
you bring in things like etcd, that's your key value store. And then you try to create a world
around it. And a lot of ideas evolve, you know, Red Hat chipped in some great ideas from OpenShift
and lots of vendors, big and small, brought a lot of ideas to the table.
And I think we have the right framework and we have the right ideas around scheduling and describing objects and controllers and admission hooks.
But a lot of those ideas are still fairly fresh, right?
Some of them just went GA.
And the things that got GA, do they actually work well together?
So a lot of the bugs really come into that situation, right?
Like you have this distributed system.
And if you think about like etcd, it doesn't really have a robust indexing system.
So the API server has to make that up.
And then it has to do its own in-memory cache.
And caching is a super hard problem.
And then you have this schedule that's relying on that cache to make scheduling decisions.
And then you have all these other components that are trying to keep that cache in sync about the
health of the containers running all the nodes. So when you take all of this coordination,
especially when you start to have a lot of workloads coming in and out fairly quickly,
CICD jobs, batch jobs, web requests, things are trying to auto scale. All of them are trying to coordinate
on this global set of configuration. So that leaves a lot of room for inconsistencies. And in
any distributed system, when you have inconsistencies, I've seen things like the workload is playing
musical chairs. If you just sit back and watch and you say kubectl get pods, it's like an orchestra,
like just Mozart is like moving your workload.
And you're like, why are things moving around?
It's like, don't do anything.
This is beautiful.
And what you find out is that some of your workloads have CPU limits and requests or memory limits or requests.
And some don't.
And the ones that don't might get evicted from the node.
And they go back through the
scheduler. And since they don't have any CPU or memory requirements, they get scheduled to a node
and then that thing heats up, gets real requests, constrains the system, and gets evicted again.
And if you don't really understand how Kubernetes makes its scheduling decisions or how all of those
components we just talked about work together, you might create this masterpiece by mistake.
You create this world that you don't quite understand.
So I think what people have to understand is
how do you as a user of Kubernetes
communicate your intentions to this big automation machine?
And all of those things like health checks
and readiness probes,
CPU and memory requirements, the size of your container.
All these things are communicating your intent of what you want the system to eventually resolve to.
And when people get that wrong or omit it altogether, that's when chaos ensues.
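To make that concrete, here is a minimal sketch of a Deployment that communicates that intent. The names and numbers are hypothetical, but the fields are the standard Kubernetes ones: resource requests and limits so the pod stops playing musical chairs, plus readiness and liveness probes so the cluster knows when the app is actually healthy.

```yaml
# A minimal sketch of how a workload communicates its intent to the scheduler.
# Names, images, and numbers are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: web
          image: example.com/web:1.0
          resources:
            requests:              # what the scheduler reserves for this pod
              cpu: "250m"
              memory: "256Mi"
            limits:                # the ceiling before throttling and eviction pressure
              cpu: "500m"
              memory: "512Mi"
          readinessProbe:          # don't route traffic until the app says it's ready
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
          livenessProbe:           # restart the container if it stops responding
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
```

Without the requests, the scheduler has nothing to reserve, and the pod becomes one of the first candidates for eviction when a node comes under pressure.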
It's interesting that you mentioned that.
We actually hit this ourselves in our initial days of using Kubernetes.
And it's actually like an orchestra.
So do you remember the last time you spent a significant amount of time
trying to debug an issue with Kubernetes?
Whether it was a problem you're trying to solve for yourself
or just working with a customer?
Yeah, I had a customer text me.
Yeah, I remember this one. The customer is running maybe Ruby on Rails, and, you know, they were just upgrading their app, and they thought that they did something wrong with the app, right? Because, you know, with Ruby on Rails you do like a database migration, you run some Rake tasks, and you don't know if you broke the database. It's just like the wild, wild west. And then, you know, most engineers, you don't want to say nothing until you have to, right?
You've all been at your keyboards like, yo, I can fix this.
I'm not saying, hey, what's going on?
Oh, nothing.
Everything is great.
Nothing to see here.
And eventually there's enough alarms that go off where people start saying, no, there's something wrong.
And you were the last person to do something.
We're going to sit around your desk until we figure this out together. And I remember that togetherness happened, and someone texted me like, hey Kelsey, I have no idea, we have no idea what's going on with our cluster, things are down 100%. And I was like, well, what'd you do? Of course it's their fault. And they're like, no, we just did a new deployment, something, something, something. And it turned out there was a very weird bug in an upgrade of Kubernetes, specifically the network security policy component.
And this component is a controller.
So for those that don't know the way network security policy works, it looks at data inside of Kubernetes and decides which of your applications can talk to each other. And then based on that rule set, it programs IP tables,
you know, to allow or deny things across all of the hosts.
And there was a bug and Calico was the implementation that they were using.
So when Calico's control loop came up, by default, it was set to deny all
and then go get its rule sets from Kubernetes to allow traffic.
Unfortunately, there was a new version of Kubernetes that this version of Calico was not compatible with. Oh, and so what would happen is, when the rolling upgrade stopped, Calico was like, I can't get any rules, therefore I'm going to lock everything out. And so this is a low-level networking issue, because you're not using a service mesh.
There's nothing happening in your load balancer. There is no reason that you should see this. And
you've never seen this before. You've been using Kubernetes for a year and a half.
This has never been something you saw. So you don't even know to go check there.
And I remember I couldn't figure it out. I was doing TCP dumps. I was like, wow,
what the hell is blocking? Who would do this? And of course,
two hours later, you look at IP tables, it's like, what is this? Maybe there's a network policy that
says you don't want any traffic. Why would you do that to yourself? But then you see all these
other policies allowing traffic, and they swear that these policies have always been there.
And it took a long time. And luckily for us, the Google SREs who run GKE found the bug
and said, hey, there is a bug with the way
that controller was pulling its specs from the API server that
was failing.
Here is the patch.
And there was like this quick on the fly patch
that didn't require restarting the whole world
or downgrading Kubernetes.
And the customer was able to get back and running.
But those are the kind of obscure issues that you don't even know are possible because you don't
know what's controlling what. So if you have no experience with KubeProxy or some of these other
agents that supplement KubeProxy at that level, you might be lost. And I was.
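For reference, a default-deny policy like the one Calico fell back to looks roughly like this. Applied with nothing else allowing traffic, it produces exactly the symptom described above: everything in the namespace is suddenly cut off, and nothing in the load balancer or the application itself points at the cause.

```yaml
# A sketch of a default-deny NetworkPolicy. Name and namespace are hypothetical.
# With no other policies allowing traffic, every pod it selects is isolated.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}        # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Listing policies with kubectl get networkpolicy --all-namespaces is a reasonable first check when traffic stops for no apparent reason.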
That's a very interesting story that you mentioned. But I have a follow-up question on that.
Before I get there, you mentioned that you had a customer text you.
Is that something which is normal?
That shouldn't happen.
So this happens when, you know, I meet a lot of people and sometimes you convince them, like, you should really be using Kubernetes.
And they're using Kubernetes. And look, there's great customer support, there's SREs, but they say, hey man, you talked me into this, I've tried everything. Before I call support, since this was your idea, you're going to take this text message. And really, you know what, it also keeps me honest, right? Because I remember those moments
when a customer may reach out.
I mean, of course, these days,
they're just such good support.
They can call support.
But if I know people, they're gonna be like,
hey, Kelsey, what's going on here?
Like, give me something here.
And so I'll take that call.
I see.
Interesting.
I've actually seen some tweets,
people tweeting at you saying
that you just jumped on a Hangout call
with them to debug something. That is incredible. I'm sure it provides a lot of value to them when they're
trying to navigate the situation. And honestly, I'll tell you, I feel the need to have to do that
every once in a while. There are a lot of people who only know Kelsey from 2018, right? And they
say, I only know this guy from talking about this stuff. Has he ever done
this stuff? Is he just someone who read all the documentation and just recites it on stage?
And that's a thing for any engineer, right? We already all have imposter syndrome,
but when other people think you're an imposter, that's even worse. And so sometimes I'll say, you know what, let me show
them. Let me remind people that, you know, I've done all of these things. I've lived in production.
I've been in those change windows that go to 6 a.m. I'm the one who's broken production sometimes.
I'm the one who's fixed production sometimes. And so sometimes it's kind of, you know, every once
in a while you want to come back and just say, hey, and then they say, okay, this guy knows what
he's doing. It just makes me feel good too. And so that's why I continue to do that.
Makes sense. And of course, there are two types of engineers, ones who have broken production and ones who are going to.
So we'll all be there at some point. So talking about networking and Kubernetes, it's something that took me a while to get used to.
And as there are so many agents on machines these days, which are trying to manage the network through iptables and route tables,
that it's not intuitive to just jump on a machine, look at the IP tables and say, who created this specific rule?
And debug problems where the network is behaving weird all of a
sudden and the pods are not getting scheduled or unable to talk to each other despite using the
right configuration. So as you work with a lot of Kubernetes users, do you see them taking a while
to absorb all of this magic and just understand how the network is being managed?
Yeah, so that's the thing about Kubernetes. One thing is early on, we try to make it easy.
And I think in some ways too easy. We try to hide the fact that there is a network underneath,
right? So this whole, and I think kind of Docker started this trend of giving every container its own IP, in some ways treating it as a first class citizen on the network.
But not until recently have pods been a first class citizen on the network, right? Like if you
go to some cloud providers, you might be lucky enough that they have things like IP aliases,
where the IP assigned to the pod comes from the VPC address pool. And the nice thing
about that then is that it is a first class routable IP that you can put in load balancers
and DNS entries, etc. But the truth is most people are not dealing with things that are actually
first class, right? You're dealing with turning the Linux host into a router, you have some bridge,
you have some local route table,
and then you have to go and program the rest of the infrastructure to do some of this L3 routing.
And for people that don't know how any of that is set up, when any of those settings goes away,
for example, someone blows away your route table, everything looks good on the host,
everything looks good in Kubernetes, you have no idea what should even be in the middle.
And so I think this illusion that we try to give that a pod was just like a VM,
and it's self contained and using things like network namespaces. I think in some cases,
like people ask for easy day one, but you better know how it works on day three,
because you need to know what needs to be there. So you can double check. And I think we even made it worse when we started. I don't know if you all remember the age of
container networking. This is when people were doing these demos where it's like, I have a
container running on Amazon. I have a container running on GCP and on my laptop. And I'm going
to use an overlay network to put them all in the same network. What are you doing? You can't have a broadcast domain that big. There's just no way that's going to be reliable in terms of latency. And are you storing the state of the network in etcd, so if etcd falls over, the whole networking thing is down? Whoa, let's stop here. And I think we went way too far in trying to get to this ease of use, that we didn't educate people on what to do. And this is why I wrote Kubernetes The Hard Way.
And I try to call out these small details about what layers you need to think about and what they
do. Yeah, I'm glad you mentioned Kubernetes The Hard Way. It is something that I found to be an extremely valuable resource to
go back and not just look at a tool which says, hey, bring me a Kubernetes cluster that just works,
but actually really understand how each layer works. Not just the networking part, but also
the identity part. Like, which CN or common name should I have in a cert for this component, so that these things don't just magically talk to each other. And the fact that you touched on is, when they're operating something or using something like Kubernetes, it's so easy to get started with something like kubeadm, kops, and a bunch of other tools. Do you have any thoughts or advice for people who
are using it or getting started on the right ways to go deep into the stack so that they actually
understand how this thing works? Yeah, so one thing: I remember learning about the T-shaped engineer.
And, you know, a lot of times we want people to have a very broad skill set. We want them to know
a little bit about security, a little bit about networking, maybe a little
bit about the problem domain and probably the programming language and the patterns
leveraged by that programming language to solve problems in that domain.
And that's a T-shape, right?
That's horizontal.
But in my career, I can remember every couple of years I would go super deep in one of these
areas, right?
So for config management, I worked at Puppet Labs. So I came from being a user to someone helping build the product.
And that was my opportunity to go deep in terms of promise theory, control loops, inversion of
control, infrastructure as code. And so that was my ability to go deep. And the same thing was true
for networking. I spent a lot of time helping get Puppet on Juniper devices and learning about
how you configure a leaf and spine network and how IPv6 translates to IPv4. Again, going super deep.
So I think a lot of people need to figure out when it comes to Kubernetes, I know we like to start
with how do we make it easy? You hear a lot of people, oh, it needs to be easy, easy, easy, easy,
easy. What layer are you talking about?
Because there's certain layers that can be easy.
For example, if someone is running Kubernetes for you
and you just want to have an easy interface to the cluster,
well, that might represent some UI
where you go in and you click around
with some pre-built workflows.
That can be easy,
but it doesn't mean that there's still not YAML
being generated, the kind of assembly language of Kubernetes, and that there's not a kernel and,
you know, a container runtime and orchestrator, CA certs, and all of these things. I don't know
if you can actually make the whole thing transparently easy, but we can make additional
components easy. So what I would advise new people is, sure, get started with
some Minikube running on your laptop, or maybe you click a button in a cloud provider and you
get a cluster provision for you. I think you should start there. Now, if you're in ops,
or you're in DevOps, and people say that you are part of the team responsible for knowing how it
works and how it breaks, then you don't get to get easy, right?
You get to learn how all of these components work over time.
And maybe you start with networking, CNI, IP tables,
think about how service mesh fits in,
or maybe you start on the other end
of how the scheduler works
and why you should have memory and CPU allocations
versus deploying things without those.
And take your time.
That would be the last thing I would say here. Take your time. This whole becoming an expert in 10 minutes
or in 30 days is crazy to me. I've been using Linux for, what, 15 years? There is no way I'm
still an expert today because it keeps changing and evolving. And I think with Kubernetes you're going to
have to be as patient if you want to become an expert around something like Kubernetes.
I love that. And one thing that I would add is that the word expert in the context of Kubernetes
is very interesting because it is like the T-shaped project. It's extremely far and wide.
Every vertical has so much depth that even if one person
becomes an expert in one area, there are so many other things which are constantly moving and
shifting. So talking about experts, you speak with a lot of advanced users of Kubernetes.
And what are some of the pitfalls that you've seen these advanced users fall into while using
Kubernetes? Yeah, so if you think about a Linux distro: if you met an engineer and they said, hey, we compile our kernel from scratch, we layer on the file system, we build ZFS from scratch, Bash and modules and Apache, and we spend,
I don't know, 30 hours compiling our Linux distro. You will look at them and say, I don't know if
that's smart. I don't even know if that's secure, right? There's so many things that can go wrong
in a Linux distribution that for you to have a team full-time focused on it,
means you have to track every CVE, you have to learn how Bash is compatible with this version of X. There's just too many things to worry about. So the biggest thing I see for advanced users is,
oh yeah, we're going to roll our own Kubernetes distro from GitHub. I'm like, okay, let's say
you did that. At best, you'll get 80% of what you need, right? You'll get the kubelet, the scheduler, the controller manager. You may come up with your own sane defaults for the 10,000 flags that you can set on all of those components. Yeah, but then now it's like, okay, what version of etcd are you going to use? How much memory are you going to allocate to it? Are you going to use your own CA? Are you going to use Vault? How much latency will an external CA cause if you're trying to mint certs from the back end? How are you going to sign those JWT tokens? Then what about the load balancer? What's going to be your ingress controller? Are you going to integrate with the cloud provider? Please don't tell me you're going to write your own ingress controller from scratch when there's like 50,000 of them out there. Yeah, and then by the time you're done rolling your own distro,
the whole team is like,
there's a new platform that exists
and Kubernetes is what we were doing in 2020, right?
So now it's 2040, like now I'm ready.
It's like, come on, man.
We're in distro territory.
So I think advanced teams that are doing,
I think, a good job,
they're looking for the gaps.
There are things that
Kubernetes doesn't do well. For example, Kubernetes is not going to defrag your cluster, right? So if
you have nodes that come and go, Kubernetes is not going to figure out the best way to compress
or defrag the workload so that they use as few nodes as possible. Should that be in Kubernetes?
That's debatable. I don't think that's a core component, even though it would be nice to have.
And then you'll see an advanced team go and say, yeah, we built a tool that understands our workloads and their priorities and knows how to take a node away and make sure things get rescheduled into less of the nodes. And this is how we compact our workloads and get our efficiency higher.
That's a very complex thing where Kubernetes can only meet you halfway.
And the last thing I'll say here is people are trying to run stateful workloads on Kubernetes.
And this one makes me smile, borderline laugh, because it's hard.
It's not the fact that Kubernetes can or cannot run stateful workloads.
It's the fact that you probably can't.
You take MySQL, and I told you that one
day someone's going to take your static MySQL server and put it into an environment where
you're playing musical chairs with this magnificent orchestra from time to time.
No way is that going to work out well. So then you have to learn how to use the right parts of
Kubernetes to ensure that doesn't happen. and then prepare yourself for when it does.
Like, do you use a network block device so it can be reattached?
If you're in the cloud, that block device might be only
attachable in certain zones and not regional.
There's so many little nuances to think about
when you try to take a thing that wasn't built for distributed systems
and throw it into a distributed system that you don't know yet,
which is going to be a little bit hard.
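As a rough sketch of what "using the right parts of Kubernetes" can look like for something like MySQL, a StatefulSet with a volumeClaimTemplate puts the data on a network block device that can be reattached when the pod is rescheduled. The names, sizes, and secret below are hypothetical, and the zonal caveat Kelsey mentions still applies to the underlying disk.

```yaml
# Sketch of a stateful workload whose data survives rescheduling: each replica
# gets its own PersistentVolumeClaim backed by a network block device.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql             # expects a headless Service of the same name
  replicas: 1
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql:8.0
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:    # hypothetical secret holding the root password
                  name: mysql-credentials
                  key: root-password
          ports:
            - containerPort: 3306
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]   # attachable to one node at a time
        resources:
          requests:
            storage: 100Gi
```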
So talking about the hard things in general,
one thing you might have seen, at least I have noticed it as an observer,
I'm not sure if it's true, but sometimes I've seen that sentiment of,
we'll rub Kubernetes on this problem and it'll solve all the side effects,
everything that we have to deal with today. Are there scenarios where you've seen this to be true?
And are there scenarios where you've actually said, hey, don't use Kubernetes to solve this problem?
Yeah, I think I was on my way to visit a customer and one of the engineers was riding a bike
and he stopped to wave.
He's like, hey, Kelsey.
And he ran into a wall.
Like I was like, whoa, this is devastating.
And I think he hurt his arm.
Maybe he was bleeding.
And he was like, how do I fix this?
And I took out my magical jar of Kubernetes and rubbed it on his elbow and everything was fine.
No, that doesn't work.
Use a damn Band-Aid if you have a cut.
Kubernetes is not the right solution for that.
Like that was a very obvious one.
But when you get to things like,
I have this workload.
I said, what does the workload do?
Well, this workload needs 128 CPUs,
two terabytes of memory,
and all the GPUs I can get.
And then we're going to run this thing on
this, you know, 50 petabyte data set. And we need this thing to run as fast as possible.
For some people that may not be a distributed systems problem, especially if you have to do
this sequentially for some reason. So in that case, the best thing you can do is get a big
ass VM or big ass bare metal thing and put the workload on it and just run.
You don't need to have Fluentd collecting logs from this thing.
You don't need the kubelet trying to figure out what IP it should have.
You don't need any of that, right?
You just want this machine to not do any additional context switching.
You don't want anything there taking up resources.
That's a simple case of, I wanna go as fast as possible
with nothing there.
Great use case.
There's cases where a company's like,
yeah, we wanna have this super scalable app.
I was like, what does it do?
Oh, it serves up the current time.
Like that's it, it just,
it just serves up the current time.
Yeah, yeah, yeah.
I said, no, you can like use the serverless thing for that. How many requests are you getting per second? Well, we have 100 customers, they check once per day. And damn, if they do the math, it's like one request per 10 minutes. I was like, hey, listen to me. I know you really want to use Kubernetes, right? You know, you want to make sure your LinkedIn profile is looking nice, you want to put cloud native on there for 200 years. I know you want to do that.
But in this case, this is not a reason to get there. You need to just put this on something
like Cloud Run. It's going to be fine. It's going to cost you like $1.25. Great. You're
going to be a hero. But if you go and spend $10,000 on a Kubernetes cluster,
hire eight more people, create an SRE team for one request every 10 minutes,
we're going a little too far.
So I think this is a case
where I have to talk people off the ledge.
I say, hey, look, I'm the Kubernetes guy.
I like Kubernetes a lot.
But for this case, I would not use it for that.
And there'd be other cases where, you know, you'll have some
really critical database that has a lot of kernel settings, you know, you have to think about
IOPS and caches and file buffers. And there's so many things that they're doing to the system,
that when you start to do that inside of Kubernetes, and you start to do that inside
of one big cluster, and you have a couple of nodes who need this kind of special tweaking and tuning.
Now you're starting to ask yourself, should that be either a dedicated cluster, or should that be no Kubernetes at all? So I have to caution people that Kubernetes has a few things that it will do to help you, but it may have a few things that may get in the way of certain types of workloads, and we just got to be smart about it.
Makes sense. So the focus should be on the problem you're trying to solve
and whether something like Kubernetes is the right fit or not.
So taking a step back,
there are a few things that you've mentioned,
which I definitely want to get into.
And one thing that I read about you
was that you wake up at 4:30 a.m. in the morning. Now again, the internet is a funny place.
You cannot believe everything that you read.
So I wanted to ask you, is this true?
Yeah, I would say 4.35.
There's a couple of reasons.
One, my wife wakes up to work out at that time.
That's when the jump roping will start.
So I'm going to wake up whether I like it or not.
But also, it's that time of, that's the peace time.
No one's sending emails.
Nothing's really happening.
You shouldn't be on social media.
At least I shouldn't be.
And there's really no breaking news because they're all waiting until everyone's awake
so they can see it.
And that gives me the time to really think about a problem or sometimes just stop thinking about a problem.
And that gives me a lot of jumpstart on the day
to think about things.
And then if I do wanna write some code,
I know it's gonna be another three hours
before someone sends me an email
or gets some notification about something important.
And so having that quality time in that block
is really, really nice.
And I wouldn't trade it for anything. And I think these days, what I do is I'll try to find more
blocks throughout the day, like lunch. You know, once COVID started, I realized that it was time
to put the lunch block on the calendar because, you know, now you have people from different time
zones, all working at the same time that lunch for you is not lunch for them.
And it may not be clear as they're scheduling out their day.
So I think just having that block.
So that's what that 430 thing is all about, is about me being able to do all the things I want to do, but may not be able to get to during the day where everyone's up and active.
So you are a principal developer advocate at Google, and this position is the engineering
equivalent of a director. Now you mentioned that part of the responsibility is to sponsor certain
projects and identify areas now, whether in the existing products or the new ones that the team
should be building. This would require speaking
with a lot of stakeholders, both within Google and outside, and persuade a lot of teams. Being
on the IC track, I'm actually very curious, what are some of the practices that have worked really
well for you in having this influence without authority? Yeah, so one person, I'm a dotted line report to a person named Eric Brewer.
You probably know him from the CAP theorem.
And he's a VP level.
He's also a fellow and he's a professor at Berkeley.
And he doesn't have any direct reports except, you know, me as this dotted line.
But I learned a lot from him.
So when I joined Google, I had been a VP of engineering before, a director of engineering. I've done all these roles. I've been a product manager when I was at CoreOS, working with the founders on the portfolio there. But when I got to Google, it's one of the rare cases, and you all kind of work at a large company and you know, they have room for this kind of IC role all the way up to the VP level.
And you're right. You have to now try to persuade people. You have to try to influence people,
in some cases, inspire people because no one reports to you. You can't make anyone do anything.
So in those cases, like how do you go about it? So there's a couple of things I do. Like if I'm
with a team, for example, let's
say we have a team working on making it easier to write
Kubernetes controllers, and you have all these ideas.
And I might come to the table and say,
you know what I would like to do as a developer?
How come we don't leverage serverless backends?
How come I can't just come to GCP,
see the objects that are available on Kube, and then route one of those objects to this function.
Right? Imagine that being the experience. And the thing we do
in the middle could be all of that watch logic, the
reconciliation, all that hard stuff that it comes with making
a controller, we can make that world disappear. And sometimes
what I'll do then to inspire would be to show up with that
prototype, come to the meeting and say,
I see the design doc, and I think we're on to something. Let me just give you a five minute
demo real quick. And then I'll have a prototype that says, you know, maybe I make a YAML that
says, this object in Kube goes to this function in GCP, then I apply it, make a change in Kube,
and the function fires off. And they say, well, what's going on? It's not important.
It's a prototype.
But it's real.
It's a working prototype.
But look at the experience.
Imagine someone being able to make a controller by just writing a function.
That's the power of bringing those worlds together.
And those are the kind of ideas that inspire.
And then they influence.
And then it's something that we can ship to customers.
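The YAML in a prototype like that might look something like the following. To be clear, this is purely illustrative: the EventRoute kind, the example.dev API group, and the Cloud Function fields are invented for the sketch, not a real Kubernetes or Google Cloud API.

```yaml
# Purely hypothetical custom resource illustrating the prototype's idea:
# "this object in Kube goes to this function in GCP." Not a real API.
apiVersion: example.dev/v1alpha1
kind: EventRoute
metadata:
  name: configmaps-to-function
spec:
  watch:
    apiVersion: v1
    kind: ConfigMap              # fire whenever a ConfigMap changes...
    namespace: default
  target:
    cloudFunction:               # ...and invoke this function with the object
      project: my-project        # hypothetical GCP project
      region: us-central1
      name: handle-configmap-change
```

The point of the sketch is the experience: the watch logic and reconciliation loop disappear behind a small piece of declarative configuration, and the "controller" is just a function.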
So I think at my level now, not only do I have to study the designs,
I also have to need to know what's possible between all the technologies.
And once I know what's possible, then I can start to build the facades in front of those things
to create these new buildings, these new, in some cases, works of art,
and try to get these experiences that aren't possible unless you know what pieces are available
to you. So that's how I go about it. So that's how I do it on the product side. On the business side,
right, because a lot of this, you know, at this level, you got to understand the business impact
as well. So how do you grow the business? And sometimes you're going to be concerned with
adoption. And so you ask yourself, if we're going to build this feature, or if we've built this
feature, who is it for? How would they use it?
And then once that becomes clear, can you even do that?
And that's where those empathy sessions come from.
It's another way of inspiring people.
So you get all the engineers in a room.
You get SREs from this part of the org.
You get documentation people.
And there's been cases where we even brung in some customers.
And you split them into these teams.
And you say, all right, here's what we want to do. I'm going to take the product, and we're going to try to do the obvious thing that you should be able to do with the product, right? Yes, great, go ahead and try. And no one can do it in an hour, even teams of four Googlers and customers. And it doesn't work. And you say, well, why doesn't it work? Ah, we're missing some of these horizontal integrations. IAM needs to be easier to configure. This needs to be easier to configure. But guess what? Now they have empathy on their own. I didn't say this is exactly how you fix it. They said how we should fix it, through experience. And again, I'm leaving empathy behind. So that's the way I've been successful in
my career. And you can leverage all kinds of skills and techniques, but that's the way I kind of bring my unique skills to the table to persuade people into doing the things that they do.
That's super valuable. So usually when people want to propose something, they
start with the design doc. And you've mentioned that you also go in with a demo when trying to persuade people of a new idea. In your experience, do you think that makes it easier to convey the idea than just sending out a design doc?
They're all effective in many ways, because at Google, design docs are really
great because they capture the timeline of thought, right?
Like, especially if you see one where there's comments.
Hey, I don't know about this piece.
What about this?
What about that?
Then you see that doc evolve over time.
And I think it really captures and serializes all the decisions
that went into maybe the roadmap item
or the things that end up in the backlog to be worked on.
What I do with demos, though, I think of it as a slightly different tool.
Like the demo or the prototype can augment the design because at some point you have to do the design doc for various reasons.
But sometimes, like, for example, if I go to a conference.
I can't pull up the design doc.
You could be like, we don't want you to read this four or five page design doc to us in the audience
right now i could turn it into slides right give you the bullet points from the design doc
and again i'm going to worry about you falling to sleep but if i show you if i show you the design
doc then it gets a little bit more interesting right because then i can use my speech to
articulate what's in the design doc but But when people get to see it,
and I remember doing this for the Kubernetes community, that's like the path to serverless.
Like in Kubernetes, you got to do all of this stuff to get something done. And I remember
building this tool that just extracted the container and the thing inside of the container
and ran it in Lambda. That's the design. I remember someone, I think it was Sebastian. I forget one of the serverless products that he built, but I remember him tweeting that he left the keynote in the first, like, eight minutes. I was like, damn, it was that bad that you left? You left early? Nope, he said he got inspired and he went to go add something to Kubeless, I think it was, you know, the serverless platform he was building on top of Kubernetes, to add Lambda support. Oh wow. Because he saw it on the stage, because you can actually do the Lambda runtime in a way
that works for Kubernetes and works for Lambda in the same custom runtime if you did it right.
So those were the things where I think I was being more impactful than just a design doc.
Makes sense.
In your position, I imagine you mentor a few folks. Do you have any advice for people who are just starting this relationship, either as a mentee or a mentor?
To teach someone, like, I literally will look at a person
and care about them because I know at some point in my own career trajectory, I was where they are.
And you have to be, for me, I'm not saying this is general advice for everyone, but for me,
treating that person as a human first, they're not a software engineer trying to go from L1 to L2 or
L3. Like if you start doing that, then you may not be talking
to the whole person. So for me, I like to try to mentor the entire person because they might be at
a point in their life where they do not need to go and double down learning some programming
language. Maybe they need to take a break. Maybe they need to relax a little bit and enjoy the
last promotion versus thinking about the current promotion or the next one that's coming up.
And I think sometimes if you don't try to look at this from the whole person point of view,
you might miss something. The other part, I think if you're going to be a mentor,
understand that you might also learn something. And honestly, when you do learn something from
someone that you're attempting to mentor, let them know like, hey, pause. I just learned something
from you. I really appreciate it. Here's how I'm going to apply it.
And it shows that you're really listening, not just giving advice. You're also kind of listening
to them. Now let's talk about the flip side real quick. On the flip side, the thing I've seen,
that's tough. For someone who mentors a lot of people, tries to allocate a lot of their time,
it's really hard when someone is not even trying. Right? Hey, mentor me on getting started with Go.
What have you done so far?
Nothing.
I want you to do everything to get me started.
And it's like, whoa, that's not fair, actually.
I need you to try a few things first.
And then maybe we decide through the conversation how I can help you more, answer any questions,
because maybe I'm further along than you are. So I think you've got to make sure you're asking yourself, what am I bringing to the table?
Have I tried anything? Is there any feedback I can give about the getting started experience?
But just have something you've tried. And then what I like to do, if I'm getting advice from
someone, I don't want to have another meeting until I figure out what to do with the first
set of advice. Can I put it to work? Can I put it to action? And if I can't, at least summarize
why I can't. So when I show up for the next one, I can do a quick recap about the last set of
information and then maybe be ready to receive some new information. So I should think about
that holistically. That makes sense. And I have one brief detour on something you'd mentioned before
about learning new things. And you also referred to Service Mesh. And I was looking at your Twitter
profile and you're working on something called Service Mesh the hard way. Can you tell us more
about that? Yeah, so Service Mesh the hard way. So I think I've been, for the last two years,
I would say I've been kind of learning in public.
So when Istio first came out,
I did maybe keynotes on Istio
and Tuckle Deep Dive and Envoy.
I did one of the first prototypes for the,
you know, in Kubernetes,
when you deploy your application,
you may only have one container in there.
But if you want to join the service mesh, then you want to append the Istio or Envoy
so that it can attach to the Istio control plane so it can be the sidecar in your service mesh.
And I remember doing the first prototype for how to use the injection features inside of
Kube to be able to do things like that. And then I will go deeper, right, into like,
how does Istio control plane work?
How do its objects play with Kubernetes?
How does it impact the networking?
And then over time,
I started getting deeper and deeper into things
like open policy agent, right?
How do you delegate authN and authZ?
I even wrote a library recently
for implementing the JOSE standards,
you know, JWT tokens, JWS, that kind of thing.
But then earlier this year, I really started to get deeper into
just Envoy, right? So when you look at the whole service mesh thing, when you are using Istio and
all of these things, one thing, a light bulb that lit up for me maybe a couple of years ago, was: when I look at Istio, I look at it as an Envoy config compiler, right? You give it all of these high-level things like rate limiting and all this other stuff, and at the end, though, it really spits out, you know, whether in sections using the xDS protocol or as one big chunk of things, but at the end of the day, Envoy has to get pretty much everything it needs in order to process traffic using the various filters that it has. So when I started to study the Envoy config language and how Envoy actually uses all of those things, whether you're talking about a vhost, a rate limiting rule, or delegating authZ to some sidecar thing like Open Policy Agent, I was like, wow, now I know enough to start to work on mesh the hard way.
So I didn't wanna do Istio the hard way
because I think Istio makes a lot of assumptions
and it has its own way of thinking about service mesh.
I want to go a little bit lower.
So I want to start with Envoy, no containers,
and then bring in things like Open Policy Agent, Prometheus,
all of the things that you would typically see in a service mesh.
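As an illustration of what "starting with Envoy, no containers" means, a hand-written static config wires a listener to an upstream cluster directly, the same structures a control plane like Istio would otherwise compile and push over xDS. The names, addresses, and ports here are hypothetical.

```yaml
# A minimal, hand-written Envoy config sketch: one listener proxying to one
# upstream cluster. Everything a mesh control plane would normally generate.
static_resources:
  listeners:
    - name: ingress
      address:
        socket_address: { address: 0.0.0.0, port_value: 10000 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: backend            # the "vhost" mentioned above
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: foo_service }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: foo_service
      type: STRICT_DNS
      connect_timeout: 1s
      load_assignment:
        cluster_name: foo_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: foo.internal, port_value: 8080 }
```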
I'm even creating some of the certificates by hand using the SPIFFE protocol.
So for people that don't know what SPIFFE is, how do we give identity to our workloads?
And an IP address is not enough, right?
Because those can change.
So in the SPIFFE world, we want to give an identity.
And typically, we want to have that identity.
So if your service is Foo and you belong to this particular org, well, SPIFFE tries to make that a little simpler, right? You might say
the org is example.com, that's the domain, and then forward slash, maybe the service name. So
forward slash Foo. Or if it's part of a bigger collection of services, maybe it's forward slash
payments forward slash Foo. But either way, that simple URL construct can serve as a spiffy ID. Now it's like,
how do you prove that you're supposed to have that ID? So that's one area where it was super
confusing for me. It's like, how the hell do you do this? And there's different ways of thinking
about this problem. One could be, you can piggyback on your existing identity. So if you're in the
Kubernetes world, that service account that you're given means that
Kubernetes started the process and gave it an identity. You can exchange that identity for one
of these, let's call it an X.509 certificate. And that X.509 certificate can also carry, as part of its
SANs, subject alternative names, the SPIFFE ID. And it's in a standard place, so if I am a web app
or I'm a sidecar proxy, I can use this client-side certificate and present it to some other
endpoint. And it knows where to look to find the identity. And the last thing I'll say here is that
that identity is key to the whole thing. So people say service mesh, but the thing is service mesh
doesn't work without a service identity.
So the reason why I want to do mesh the hard way
is because we need to break things
all the way down to that level of granularity
so we know what's going on.
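As a concrete, if simplified, sketch of where that identity lives: the snippet below, using only the Go standard library, issues a throwaway self-signed certificate whose URI SAN carries a SPIFFE ID, then parses it back the way a receiving endpoint would. In a real mesh the certificate (an SVID) would be issued by the trust domain's CA or workload API, for example SPIRE or Istio's istiod, not self-signed.

```go
// Sketch: put a SPIFFE ID into an X.509 URI SAN, then read it back.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net/url"
	"time"
)

func main() {
	// The SPIFFE ID: trust domain example.com, workload path /payments/foo.
	spiffeID := &url.URL{Scheme: "spiffe", Host: "example.com", Path: "/payments/foo"}

	// Issue a throwaway self-signed cert whose URI SAN carries that ID.
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "payments-foo"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour), // short-lived, like real SVIDs
		URIs:         []*url.URL{spiffeID},      // the identity lives in the SAN
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}

	// On the receiving side: parse the peer cert and look in the standard place.
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	for _, uri := range cert.URIs {
		if uri.Scheme == "spiffe" {
			fmt.Println("peer identity:", uri.String()) // spiffe://example.com/payments/foo
		}
	}
}
```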
I'm certainly looking forward to that.
Awesome.
Yeah, thanks for covering all those details.
So I imagine that you've been exposed
to a lot of new tools
while working in this cloud native ecosystem.
So maybe just for the audience,
what was the last tool
that you discovered that you just really, really liked?
I talk about Open Policy Agent a lot
because it solves this kind of gnarly problem around authZ
and gives you kind of a framework for doing so. But I guess the last tool I saw, I'm an advisor to a startup called Pixie Labs.
And they've done something nice where they've taken this, what I would say, this new platform,
eBPF, you know, the integration you can do with the kernel now, where you now have the ability to
inject code into running processes. You have to be careful about that. Or,
not necessarily into the thing that got compiled, but a way to grab metrics if you wanted to,
maybe even put in a statement that allows you to extract a variable. You can also observe other
activities in the kernel, like, you know, what the network is doing, in a more
programmatic way. And so what I've been seeing now is there's all of these new tools out now where they can do, you know, some people call it agentless, but there is something
listening to those particular sets of messages and doing things. And it's given us this world of
now I can do things like tracing what's going on in the system just by plugging in at that layer.
So when you start to see bigger systems built on this concept, they have so much more power than before. Now, we got to be careful that we don't get into some
security situations. But I've seen things where people say, look, you deployed this Go application,
and you just don't know what the value of this variable is. And you don't have anything in there
where you can turn on that debugger in production. So what do you do? Well, now with this particular method, you can go
in and start to probe this particular process and grab things out of it, I guess, if the symbols are
there, and then actually debug this thing that wasn't previously debuggable. So you don't need
to do another deployment, but you have a much clearer way, a much stronger contract with the
kernel about how to go and probe inside of these processes
that are running. That's going to be a game changer for the next generation of tools that
can give us a way to attach things, right? Because in the Java world you have this; the JVM does a
great job, and it has for many years. But we don't have this level of tooling for all of the languages
that run on a particular machine. So I think this is going to be pretty amazing. There are going to be
security things we'll see.
We already see this at the networking level in Kubernetes.
Just so many things that are going to happen in the next generation of tools because the
kernel now presents what I'm considering a platform at this point.
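To give a feel for what "probing a running process through the kernel" looks like in practice, here is a heavily hedged Go sketch using the github.com/cilium/ebpf library to attach a uprobe to a function symbol in a deployed binary. The object file, program name, binary path, and symbol are all hypothetical, and the real work of compiling the BPF program and reading events back out is omitted.

```go
// Sketch of the attach flow only: load a pre-compiled BPF object and hook a
// uprobe onto a function in a running binary, with no redeploy of the app.
package main

import (
	"log"
	"os"
	"os/signal"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Load a pre-compiled BPF object (hypothetical file and program names).
	coll, err := ebpf.LoadCollection("probe.o")
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	// Open the target binary; its symbols must be present for this to work.
	ex, err := link.OpenExecutable("/usr/local/bin/payments")
	if err != nil {
		log.Fatal(err)
	}

	// Attach the BPF program to a function symbol in that binary. Every call
	// to the function now triggers the probe.
	up, err := ex.Uprobe("main.handlePayment", coll.Programs["trace_fn"], nil)
	if err != nil {
		log.Fatal(err)
	}
	defer up.Close()

	// Keep the probe attached until interrupted; real tools would read events
	// from a ring buffer or perf buffer here.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
}
```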
That's really awesome.
It's really funny to hear.
So a colleague of mine just mentioned eBPF: we looked into what it would look like if we could tap into any sort of process and just know when an exception gets thrown.
Instead of emitting a new metric, where, you know, everybody has to instrument their stuff and standardize across the board,
we just have this agent on the side that taps into it, catches it, and then can say, yeah, we got an error,
throw that out there. And now applications don't have to even think about, you know,
oh, I have to use this library, I have to emit it this way, or whatever; you
make that completely separate. So that's really neat to hear about. Yeah, and I guess,
now that we're wrapping up, so where can other people,
uh, find you on the internet these days? Man, I think for the last maybe five or, you know,
maybe 10 years now, it's Twitter. I try to just keep everything there. I try to be very responsive.
My DMs are open. Uh, and every once in a while I'll advertise a new block of office hours, and
office hours are meant to just give people one-on-one time. You know, I can think back to, you know, some of the books I bought in
the past and I was like, man, it would be nice if I can speak to this person about this particular
section just to get more insight. So I try to do that for the community. And sometimes I'll say,
hey, I got about eight slots open, book some time and we can just do a one-on-one. I can learn from you.
You can learn from me.
But Twitter is where you find me: @kelseyhightower.
Awesome.
Thanks.
And is there anything else that you would like to share with our listeners?
No, I think we shared a lot.
I would probably say, you know, for those thinking about their careers, I would just say, hey, be patient.
You got your whole life literally ahead of you. So just kind of enjoy the things and skills you've already learned.
And there's a lot of people who look at all this cloud native stuff, and maybe they're listening
to this and like, I don't know what they're talking about. And I would encourage you to
find comfort that a lot of the stuff we're talking about, the fundamentals are rooted in the things
you already know. If you know how networking works in the VM or bare metal world, you already know 80% of how networking works in
the Kubernetes world. Everyone talks about these containers and container images. If you know how
RPMs work, you know 90% of how the container world works. So I would ask people like, hey,
don't always think about the things you don't know. Think about the fundamentals that you do and how you map them to the things you don't.
Well, it was a pleasure to talk to you, Kelsey.
We hope we can bring you back in the future as well.
And thank you so much for taking the time.
It was truly a pleasure.
Awesome.
Thanks for having me.
Hey, thanks so much for listening to the show.
You can subscribe wherever you get your podcasts
and learn more about us at softwaremisadventures.com.
You can also write to us at softwaremisadventures at gmail.com.
We would love to hear from you.
Until next time, take care.