Screaming in the Cloud - Mastering Kubernetes for Multi-Cloud Efficiency With Nick Eberts

Episode Date: April 16, 2024

In this episode, Corey chats with Google's Nick Eberts about how Kubernetes helps manage applications across different cloud environments. They cover the benefits and challenges of using Kubernetes, especially in Google's cloud (GKE), and discuss its role in making applications more flexible and scalable. The conversation also touches on how Kubernetes supports a multi-cloud approach, simplifies the deployment process, and can potentially save costs while avoiding being tied down to one cloud provider. They wrap up by talking about best practices in cloud infrastructure and the future of cloud-native technologies.

Show Highlights:
(00:00) - Introduction to the episode
(03:28) - Google Cloud's approach to egress charges and its impact on Kubernetes
(04:33) - Data transfer costs and Kubernetes' verbose telemetry
(07:23) - The nature of Kubernetes and its relationship with cloud-native principles
(11:14) - Challenges Nick faced managing a Kubernetes cluster in a home lab setting
(13:25) - Simplifying Kubernetes with Google's Fleets
(17:34) - Introduction to GKE Fleets for managing Kubernetes clusters
(20:39) - Building Kubernetes-like systems for complex application portfolios
(24:06) - Internal company platforms and the utility of Kubernetes for CI/CD
(27:49) - Challenges and strategies of updating old systems for today's cloud environment
(32:43) - The dividing line between Kubernetes and GKE from a product perspective
(35:07) - Where to find Nick
(36:48) - Closing remarks

About Nick:
Nick is an absolute geek who would prefer to spend his time building systems, but he has succumbed to capitalism and moved into product management at Google. For the last 20 years, he has worked as a systems engineer, solution architect, and outbound product manager. He is currently the product manager for GKE Fleets & Teams, focusing on multi-cluster capabilities that streamline GCP customers' experience while building platforms on GKE.
Links referenced:
Duckbill Group's website: http://www.duckbillgroup.com
Nick on Twitter/X: @nicholaseberts
Nicholas Eberts on Instagram: https://www.instagram.com/neberts1/
Nick on LinkedIn: https://www.linkedin.com/in/nicholaseberts/
Sponsor - Panoptica Academy: https://panoptica.app/lastweekinaws

Transcript
Starting point is 00:00:00 maybe that's where Kubernetes has a strength. Because you get a lot of it for free, it's complicated, but if you figure it out and then create the right abstractions, you can end up being a lot more efficient than trying to manage, you know, a hundred different implementations. Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined today by someone rather exciting.
Starting point is 00:00:26 You don't often get to talk to, you know, paid assassins. But Nick Eberts is a product manager over at Google, so I can only assume that you kill things for a living. Not if we can help it, right? So if you're listening to this and you're using anything that I make, which is long lines of GKE, fleets, multi-cluster stuff, please use it. Otherwise, you're going to make me into a killer. This episode's been sponsored by our friends at Panoptica, part of Cisco. This is one of those real rarities where it's a security product that you can get started with for free, but also scale to enterprise grade.
Starting point is 00:01:07 Take a look. In fact, if you sign up for an enterprise account, they'll even throw you one of the limited, heavily discounted AWS skill builder licenses they got. Because believe it or not, unlike so many companies out there, they do understand AWS. To learn more, please visit panoptica.app slash last week in AWS. That's panoptica.app slash last week in AWS. Exactly. If our customers don't use this, we're going to have to turn it off. That's an amazing shakedown approach. Although, let's be honest, every company implicitly does have that. Like, if we don't make enough money, we're going to go out of business is sort of the
Starting point is 00:01:47 general trend, at least for the small scale companies. And then at some point, it's, we're going to indulge our own corporate ADHD and just lose interest in this thing that we've built and shipped. We'd rather focus on the new things, not the old things. That's boring. But Kubernetes is not boring. I will say that. One of the things that led to this is a few weeks before this recording, I wound up giving a talk at the Southern California
Starting point is 00:02:11 Area Linux Expo called Terrible Ideas in Kubernetes. Because five years ago, I ran my mouth on Twitter, imagine that, and predicted that no one would care about Kubernetes five years from now. It would drift below the surface level of awareness that most people had to think about. I think I'm directionally correct, but I got the timing wrong. I'll blame COVID for it. Why not? And as penance, I installed a Kubernetes of my very own in my spare room on a series of 10 raspberries pie and ran a bunch of local workloads on it for basically fun. And I learned so many things. I want to say about myself, but no, not really. Mostly about how the world thinks about these things
Starting point is 00:02:51 and how what Kubernetes is once you get past conference stage talking points and actually run it yourself. I get the sense you probably know more about this stuff than I do. I would seriously hope anyway. GKE is one of those things where people have said for a long time, the people I trust, most people call them customers, that they have been running Kubernetes in different places and GKE was the most natural expression of it. It didn't feel like you were
Starting point is 00:03:16 effectively fighting upstream trying to work with it. And I want to preface this by saying so far, all of my Kubernetes explorations personally have been in my on-prem environment because given the way that all of the clouds charge for data transfer, I can't necessarily afford to run this thing in a cloud environment, which is sad, but true. On that note specifically, I think maybe you've noted this at other times, Google Cloud stopped charging for egress. You stopped charging for data egress when customers agree to stop using Google Cloud stopped charging for egress. You stopped charging for data egress when customers agree to stop using Google Cloud. All three of the big clouds have done this.
Starting point is 00:03:50 And I think it's genius from the perspective of it's a terrific sales tool. If you don't like it, we won't charge you to get your data back. But what hurts people is not, I want to move the data out permanently. It's the ongoing cost of doing business. Perfect example. I have a 10-node Kubernetes cluster that really isn't doing all that much. It's spitting out over 100
Starting point is 00:04:12 gigabytes of telemetry every month, which gets fairly sizable. It would be the single largest expense of running this in a cloud expense other than the actual raw compute. And it's doing nothing, but it's talking an awful lot. And we've all had co-workers just like that. It's usually not a great experience. So it's the ongoing ebb and flow. And why is it sending all that data? What is in that data? It gets very tricky to understand and articulate that. So the data transfer is interesting. I mean, I'd want to ask you, what metrics or what signals are you sending out to cross a point in which you would get billed? Because that's interesting to me.
Starting point is 00:04:51 I mean, do you not like the in-cloud logging and operations monitoring stuff? Because when we ship metrics there, we're not billing you for it. Now we are billing you for the storage. Sure. And to be fair, storage of metrics has never been something
Starting point is 00:05:05 I found prohibitive on any provider. This is, again, this is running in my spare room. It is just spitting things out. Like, why do you use the in-cloud provided stuff? It's like, well, it's not really a cloud in the traditional sense. And we will come back to that topic in a minute. But I want to get things out somewhere.
Starting point is 00:05:20 In fact, I'm doing this multiple times, which makes this fun. I use Axiom for logs. That's how I tend to think about this. And I've also instrumented it with Honeycomb. Axiom is what told me we're about 250 gigabytes and climbing the last time I looked at it. And it's at least duplicating that, presumably, for what gets sent off to Honeycomb as well. I also run Prometheus and Grafana locally, because I want to have all the cool kids do. And frankly, having a workload that runs Kubernetes
Starting point is 00:05:45 means that I can start actively kicking the tires on other products that really are, it's contrived. You try and shove it into just this thing locally on your laptop or something that, like I've had some of my actual standing applications are for pure serverless build on top of Lambda functions. That gets really weird for some visions
Starting point is 00:06:04 of what observability should be. So I have an actual Kubernetes that now I can throw things at and see what explodes. Now that makes sense. I mean, like, listen, I love Honeycomb and there's a lot of third-party tools out there and providers. And one of the things that we do at Google, probably done across the board, is work with them to provide an endpoint or a data store or an existence of their service that's local within the cloud, right? So if you're using Honeycomb and that Honeycomb instance
Starting point is 00:06:30 that is your SaaS provider actually is an endpoint that's reachable inside of Google Cloud without going out to the network, then you can reduce the cost. So we try to work with them to do things. One example technology we have is Private Service Connect,
Starting point is 00:06:43 which allows you third-party companies to sort of host their endpoint in your VPC with an IP that's inside of your VPC. So then your egress charges are from a node running in a cluster to a private IP not going out through the internet. So we're trying to help because our customers do prefer not to pay large amounts of money to use essentially as a service that's most of these services are running on Google Cloud. I do confess to having a much deeper understanding of the AWS billing architectures and challenges among it. But one of the big challenges I've found, this will lead naturally into this from the overall point that few of us here at the Duckbill Group have made on Twitter, which is how you and I started talking. Specifically, we have made the assertion that Kubernetes is not cloud native, which sounds an awful lot like clickbait, but it is a sincerely held belief.
Starting point is 00:07:38 It's not one of those, somebody needs to pay attention to me. No, no, no. I have better stunts for that. This is based upon a growing conviction that I've seen from the way that large companies are using Kubernetes on top of a cloud provider and how Kubernetes itself works. It sounds weird to say that I have built this on a series of raspberries pie in my spare room. That's not really what it's intended for or designed to do, But I would disagree because what a lot of folks are using
Starting point is 00:08:07 is treating Kubernetes as a multi-cloud API, which I think is not the worst way to think of it. If you have a bunch of servers sitting in a rack somewhere, how are you going to run workloads on it? How are you going to divorce workloads from the underlying hardware platform? How do you start migrating it to handle hardware failures, for example?
Starting point is 00:08:25 Kubernetes seems to be a decent answer on this. It's almost a cloud in and of itself. It's similar to a data center operating system. It's realizing the vision that OpenStack sort of defined but could never realize. No, that's 100% it. And you're not going to get an argument from me there. Kubernetes, running your applications in Kubernetes
Starting point is 00:08:42 do not make them cloud native. One of the problems with this argument in general is that who can agree on what cloud native actually means? It means I have something to sell you in my experience. Right. My interpretation, it sort of adheres to the value prop of what the cloud was when it came out. Flexible, just pay for what you want when you need it,
Starting point is 00:09:00 scale out on demand, these kinds of things. So applications definitely are not immediately cloud native when you put them in Kubernetes. You have to do some work to make them auto scale. You have to do some work to make them stateless, maybe 12 factor, if you will, if you want to go back like a decade. Yeah, you can't take a Windows app, run it on Kubernetes clusters that have Windows Node support that's a monolith and then call it cloud native. Also, not all applications need to be cloud native. That is not the metric that we should be measuring ourselves by. So it's fine. Kubernetes is the lowest common denominator, or it's becoming the lowest common denominator of compute. That's the point.
Starting point is 00:09:35 If you have to build a platform, or you're a business that has several businesses within it, you have to support a portfolio of applications, it's more likely that you'll be able to run a high percentage of them on Kubernetes than you would on some fancy paths. Like that's been the death of all paths. It's like, ooh, this is really cool. I have to rewrite all my applications in order to fit into this paradigm. I built this thing and it's awesome for my use case. And it's awesome right until it gets a second user, at which point the whole premise falls completely to custard. It's a custard.
Starting point is 00:10:08 It's awful. It's a common failure pattern, where anyone can solve something to solve for their own use cases. But how do you make it extensible? How do you make it more universally applicable? And the way that Kubernetes has done this has been to effectively, you're building your own cloud when you're using Kubernetes to no small degree. One of the cracks I made in my talk, for example, was that Google has a somewhat condescending and
Starting point is 00:10:30 difficult engineering interview process. So if you can't pass through it, the consolation prize is you get to cosplay as working at Google by running Kubernetes yourself. And the problem when you start putting these things on top of other cloud provider abstractions is you have a cloud within a cloud and to the to the cloud provider what you've built looks an awful lot like a single tenant app with very weird behavioral characteristics that for all intents and purposes remain non-deterministic so as a result you're staring at this thing that the cloud provider says well you have an app and it's doing some things and the level of native understanding of what your workload looks like from the position of that cloud provider become obfuscated through that level of indirection. It effectively winds up creating a host of problems while solving for others.
Starting point is 00:11:17 As with everything, it's built on trade-offs. Yeah. I mean, not everybody needs a Kubernetes, right? If there's a certain complexity that you have to have of the applications that you need to support, then it's beneficial, right? It's not just immediately beneficial. A lot of the customers that I work with actually too much. I don't want to say dismay, but a little bit. Like they're doing the hybrid cloud thing. I'm running this application across multiple clouds. And Kubernetes helps them there because while it's not identical on every single cloud,
Starting point is 00:11:47 it does take like 80, maybe 85, 90% of the configuration. And the application itself can be treated the same across these three different clouds. There's 10% that's different per cloud provider, but it does help in that degree. We have customers that can hold us accountable. They can say, you know what? This other cloud provider is doing something better or giving it to us cheaper.
Starting point is 00:12:11 And we have a dependency on open source Kubernetes and we built all our own tooling. We can move quickly. And it works for them. That's one of those things that has some significant value for folks. I'm not saying that Kubernetes
Starting point is 00:12:21 is not adding value. And again, nothing is ever an all or nothing approach. But an easy example where I tend to find a number of my customers struggling, most people will build a cluster to span multiple availability zones over an AWS LAN because that is what you are always told. Oh, well, yeah, we can strain blast radiuses, so of course we're going to be able to sustain the loss of an availability zone. So you want to be able to have it flow between those. Great. The problem is, is it costs two cents per gigabyte to have data transfer between availability zones, which means that in many cases, Kubernetes itself is not in any way zone aware. It has no sense of
Starting point is 00:13:00 pricing for that. So it'll just as cheerfully toss something over a two gigabyte link as opposed to the thing, two gigabit link, as opposed to the thing right next to it for free. And it winds up in many cases bloating those costs. It's one of those areas where if the system understood
Starting point is 00:13:16 its environment, the environment understood its system a little bit better, this would not happen. But it does. So I have worked on Amazon. I didn't work for them. I've worked on used EC2 for two or three years. That was my first foray into cloud. I then worked for Microsoft. So I worked
Starting point is 00:13:31 on Azure for five years and I now I've been on Google for a while. So I will say this, I, I, my information with Amazon's a little bit dated, but I can tell you from a Google perspective, like that specific problem you call out, there's at least, there's upstream Kubernetes configurations that can allow you to have affinity with transactions. It's complicated though. It's not easy. We also, so one of the things
Starting point is 00:13:53 that I'm responsible for is building this idea of fleets. This idea of fleets is that you have n number of clusters that you sort of manage together. And not all of those clusters need to be homogenous, but pockets of them are homogenous.
Starting point is 00:14:06 Right? And so one of the patterns that I'm seeing our bigger customers do is create a cluster per zone. And they stitch them together with the fleet, use namespace sameness, treat everything the same across them, slap a load balancer on front, but then silo the transactions in each zone so they can just have an easy and efficient and sure way to ensure that, you know, interzonal costs are not popping up. In many cases, the right approach. I learned this from some former Google engineers back in the noughts, which back when being a Google engineer was a sort of thing where the hush came over the room and everyone leaned in to see what this genius wizard would be able to talk about.
Starting point is 00:14:42 It was a different era on some level. And one of the things I learned was that in almost every scenario, when you start trying to build something for high availability, this was before cross-AZ data transfer was even on anyone's radar, but for availability alone, you have even a phantom router
Starting point is 00:14:57 that was there to take over in case the primary fails. The number one cause of outages, and it wasn't particularly close, was by a failure in the heartbeat protocol or the control handover. So rather than trying to build data center pods that were highly resilient, the approach instead was load balance between a bunch of them and constrain transactions within them, but make sure you can fail over reasonably quickly and effectively and automatically. Because then you can just write off a data center in the middle of the night when it fails, fix it in the morning, and the site continues to remain up. That is a fantastic approach. Again, having built this in my spare room, at the moment, I just have the one. I feel like after this conversation, it may split into two, just on sheer sense of this is what
Starting point is 00:15:42 smart people tend to do at scale. Yeah, it's funny. So when I first joined Google, I was super interested in going through their like SRE program. And so one thing that's great about this company that I work for now is they give you the time and the opportunities. So I wanted to go through what SREs go through when they come on board and train. So I went through the interview process and I believe that process is called hazing, but continue. continue yeah but the funniest thing is so you go through this and you're actually playing with tools and affecting real org cells and using all of the google terms to do things um obviously not in production and then you have like these tests and most of the time the answer the test was hey drain the cell Just turn it off and then turn another one on.
Starting point is 00:16:27 It's the right approach in many cases. That's what I love about the container world is that it becomes ephemeral. That's why observability is important because you better be able to get the telemetry for something that stopped existing 20 minutes ago to diagnose what happened. But once you can do that, it really does free up a lot of things, mostly. But even that I ran into significant challenges with. I come from the world of being a grumpy old sysadmin. And I've worked with data center remote hands employees that were, yeah, let's just say that was a heck of a fun few years.
Starting point is 00:16:57 So the first thing I did once I got this up and running, got a workload on it, is I yanked the power cord out of the back of one of the node members that was running a workload like I was rip-starting a lawnmower enthusiastically at two in the morning, like someone might have done to a core switch once. But yeah, it was, okay, so I'm waiting for the pod to, the cluster to detect the pod is no longer there and reschedule it somewhere else. And it didn't for two and a half days. It's like, because I was under the, and again, there are ways to configure this and you have to make sure the workload is aware of this. But again, to my naive understanding, part of the reason that people go for Kubernetes the way that they do
Starting point is 00:17:31 is that it abstracts the application away from the hardware and you don't have to worry about individual node failures. Well, apparently I have more work to do. These are things that are tunable and configurable. One of the things that we strive for on GKE is to make a lot of these best practices. This would be a best practice,
Starting point is 00:17:50 recovering the node, reducing the amount of time it takes for the disconnection to actually release whatever lease that's holding that pod on that particular node is. We do all this stuff in GKE, and we don't even tell you we're doing it because we just know that this is the way that you do things.
Starting point is 00:18:07 And I hope that other providers are doing something similar just to make it easier. They are. Again, I've only done this in a bare metal sense. I intend to do it on most of the major cloud providers at some point over the next year or two. Few things are better for your career and your company than achieving more expertise in the cloud. Security improves. Compensation goes up. Employee retention skyrockets. are better for your career and your company than achieving more expertise in the cloud. Security improves, compensation goes up, employee retention skyrockets. Panoptica, a cloud security platform from Cisco, has created an academy of free courses just for you. Head on over to academy.panoptica.app to get started. The most common problem I had was all related
Starting point is 00:18:44 to the underlying storage subsystem. Longhorn is what I use. I was going to say, can I give you a fun test? When you're doing this on all the other cloud providers, don't use Rancher or Longhorn. Use their persistent disk option. Oh, absolutely. The reason I'm using Longhorn, to be very clear on this, is that I don't trust individual nodes very well and yeah ebs or any of the providers have a block have a block storage option that is far superior to what i'll be able to achieve with local hardware because i don't happen to have a few spare billion dollars in engineering lying around in order to abstract a really performant really durable block store
Starting point is 00:19:22 and i don't that's not on my list. Well, so I think all the cloud providers have really performant, durable block store that's presented as disk store, right? They all do. But the real test is when you rip out that node or effectively unplug that network interface, how long does it take for their storage system to release the claim on that disk
Starting point is 00:19:44 and allow it to be attached somewhere else? That's the test. Exactly. And that is a great question. And there are ways, of course, to tune all of these things across every provider. I did no tuning, which means the time was effectively infinite, as best I could tell. It wasn't just for this. I had a number of challenges with the storage provider over the course of a couple months. And it's challenging. I mean, there are other options out there that might have worked better. I switched all the nodes that have a backing store over to using relatively fast SSDs because having it on SD cards seemed like it might have been a bottleneck around there. And there were still
Starting point is 00:20:18 challenges on things and in ways I did not inherently expect. That makes sense. So can I ask you a question? Please. If Kubernetes is too complicated, let's just say, okay, it is complicated. It's not good for everything. But most PaaSes are a little bit too constrictive, right? Like their opinions are too strong.
Starting point is 00:20:39 Most of the time, I have to use a very explicit programming model to take advantage of them. That leaves us with VMs in the cloud, really, right? Yes and no. For one workload right now that I'm running in AWS, I've had great luck with ECS, which is, of course, despite their word about ECS anywhere, it is a single cloud option. Let's be clear on this. It is, you are effectively agreeing to a lock-in on some form, but it does have some elegance
Starting point is 00:21:03 because of how it was designed in ways that resonate with the underlying infrastructure in which it operates. Yeah, no, that makes sense. I guess what I was trying to get at, though, is if ECS wasn't an option and you had to build these things, I feel like my experience working with customers, because before I was a PM, I was very much field consultant, customer engineer, solution architect, all those words. Customers just ended up rebuilding Kubernetes. They built something that auto-scaled.
Starting point is 00:21:33 They built something that had service discovery. They built something that repaired itself. They ended up creating a good bit of the API, is what I found. Now, ECS is interesting. It's a little bit hairy when you actually, if you were going to try to implement something that's got smaller services that talk to each other a lot. If you just have one service and you're auto-scaling it behind a load balancer, great. Yeah, they talked about S3 on stage at one point with something like 300 and some odd
Starting point is 00:21:59 microservices that all comprised to make the thing work, which is phenomenal. I'm sure it's the right decision for their workloads and whatnot. I felt like I had to jump on that as soon as it was said, just a warning. This is what works at a global hyperscale centuries-long thing that has to live forever. Your blog does not need to do this. This is not a to-do list. But yeah, back when I was doing this stuff in anger, which is, of course, my name for production for production as opposed to staging environment which is always called theory because it works in theory but not in production exactly back when i was running things in anger it was always it was before containers had gotten big so it was always uh take amis uh to a certain point and then do configuration management and uh code deploys in order to get them to current and yeah then we
Starting point is 00:22:43 bolt on all the things that Kubernetes does offer that any system has to offer. Kubernetes didn't come up with these concepts. The idea of durability, of auto-scaling, of load balancing, of service discovery, those things inherently become a problem that needs to be solved for. Kubernetes has solved some of them in very interesting, very elegant ways. Others of them it has solved for by, oh, you want an answer for that?
Starting point is 00:23:07 Here's 50, pick your favorite. And I think we're still seeing best practices continue to emerge. No, we are. And I did the same thing. Like my first role that I was using cloud, we were rebuilding an actuarial tool on EC2. And the value prop obviously for our customers was like, hey, you don't need to rent
Starting point is 00:23:26 1,000 cores for the whole year from us. You could just use them for the two weeks that you need them. Awesome. That was my first foray into infrastructure code. I was using Python and the Botto SDK and just automating the crap out of everything. And it worked great. But I imagine that if I had stayed on at that role, repeating that process for n number of applications would start to become a burden. So you'd have to build some sort of template, some engine, you'd end up with an API. Once it gets beyond a handful of applications, I think maybe that's where Kubernetes has a strength because you get a lot of it for free. It's complicated. But if you figure it out and then create the right abstractions for the different user types you have, you can end up being a lot more efficient than trying to manage 100 different implementations.
Starting point is 00:24:12 We see the same thing right now. Whenever someone starts their own new open source project or even starts something new within a company, great. The problem I've always found is building the CI-CD process. How do I hook hook it up to GitHub actions or whatever it is to fire off a thing? And until you build sort of what looks like an internal company platform, you're starting effectively at square one each and every time. I think that building an internal company platform at anything beyond giant scale is probably ridiculous, but it is something that people are increasingly excited about. So it could very well be that I'm the one who's wrong on this. I just know that every time I build something new, there's a significant boundary between me being able to YOLO slam this thing into place and having merges into the main branch wind up getting automatically released through a process that has some responsibility to it. Yeah. I mean, there's no perfect answer for everybody,
Starting point is 00:25:09 but I do think, I mean, you'll get to a certain point where the complexity warrants a system like Kubernetes. But also the CICD angle of Kubernetes is not unique to Kubernetes either. I mean, you're just talking about pipelines. We've been using pipelines forever. Oh, absolutely.
Starting point is 00:25:25 And even that doesn't give it to you out of the box. You still have to play around with getting Argo or whatever it is you choose to use set up. Yeah, it's funny, actually. Weird tangent. I have this weird offense when people use the term GitOps like it's new. So first of all, we've all been,
Starting point is 00:25:38 well, as an aged man who's been in this industry for a while, we've been doing GitOps for quite some time. Now, if you're specifically talking about a pull model, fine, that may be unique. But GitOps is simply just, hey, I'm making a change in source control. And then that change is getting reflected in an environment.
Starting point is 00:25:56 That's how I consider it. What do you think? Well, yeah, we store all of our configuration in Git now. It's called GitOps. What were you doing before? Oh yeah, go retrieve the previous copy of what the configuration looked like. It's called copyofcopyofcopyof
Starting point is 00:26:10 thing.back.cjq.usethisone.doc.zip. Yeah, it's great. That's even going further back. Yeah, let me please make a change to something that's in a file store somewhere and copy that down to X amount of VMs
Starting point is 00:26:25 or even hardware machines just running across my data center. And hopefully that configuration change doesn't take something down. Yeah. The idea of blast radius starts to become very interesting in canary deployments.
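The pull model Nick singles out can be sketched as a tiny reconciliation loop: an agent fetches the desired state (as if parsed from manifests in source control), diffs it against what's live, and converges the environment on it. All names and structures below are illustrative, not any real controller's API:

```python
# Toy sketch of pull-based GitOps: diff Git-declared desired state
# against live state, then converge. Resource names and the shape of
# the state dicts are illustrative inventions for this example.

from typing import Dict

def diff(desired: Dict[str, dict], live: Dict[str, dict]) -> dict:
    """Work out what to create, update, or delete to match Git."""
    return {
        "create": [k for k in desired if k not in live],
        "update": [k for k in desired if k in live and desired[k] != live[k]],
        "delete": [k for k in live if k not in desired],
    }

def reconcile(desired: Dict[str, dict], live: Dict[str, dict]) -> dict:
    """Apply the diff so live state converges on the declared state."""
    plan = diff(desired, live)
    for name in plan["create"] + plan["update"]:
        live[name] = dict(desired[name])
    for name in plan["delete"]:
        del live[name]
    return plan

# Desired state, as it would be parsed out of manifests in source control:
desired = {"web": {"replicas": 3}, "worker": {"replicas": 1}}
live = {"web": {"replicas": 2}, "cron": {"replicas": 1}}

plan = reconcile(desired, live)
print(plan)             # {'create': ['worker'], 'update': ['web'], 'delete': ['cron']}
print(live == desired)  # True: the environment now reflects source control
```

The point of the pull model is that this loop runs continuously inside the environment, so a change merged to Git gets reflected without anyone pushing to the machines directly.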
Starting point is 00:26:38 And all the things that you basically rediscover from first principles every time you start building something like this. It feels like Kubernetes gives a bunch of tools that are effective for building a lot of those things. But you still need to make a lot of those choices and implementation decisions yourself. And it feels like whatever you choose is not necessarily going to be what anyone else has chosen.
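The canary logic that teams keep rediscovering from first principles can be sketched in a few lines: widen the share of traffic on the new version only while the observed error rate stays healthy. The stages and threshold below are illustrative choices, not any platform's defaults:

```python
# Sketch of a canary rollout that limits blast radius: traffic to the
# new version increases in stages, and the first unhealthy reading
# halts the rollout. Stage fractions and the error threshold are
# illustrative, not defaults from any real deployment tool.

def canary_rollout(check_errors, stages=(0.01, 0.10, 0.50, 1.0), threshold=0.05):
    """Walk traffic up through `stages`; stop at the first bad reading.

    check_errors(fraction) -> observed error rate with that much traffic
    on the new version. Returns (fraction reached, rollout completed?).
    """
    shipped = 0.0
    for fraction in stages:
        if check_errors(fraction) > threshold:
            return shipped, False   # roll back: blast radius stays small
        shipped = fraction          # healthy: widen the blast radius
    return shipped, True

# A healthy release: errors stay near baseline at every stage.
print(canary_rollout(lambda f: 0.01))                        # (1.0, True)

# A bad release that only falls over under real load at 50% traffic:
print(canary_rollout(lambda f: 0.20 if f >= 0.5 else 0.01))  # (0.1, False)
```

The second case is why staged rollouts matter: the failure was only visible under load, but it hit a tenth of the traffic instead of all of it.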
Starting point is 00:26:59 It seems like it's easy to wind up in unicorn territory fairly quickly. But I just, I don't know. I think as we're thinking about what the alternative to Kubernetes is, or what the alternative to a PaaS is, I don't really see anyone building a platform to run old, shitty apps. Who's going to run that platform? Because that's the, what, 80% of the market of workloads that are out there that need to be improved. So we're either waiting for these companies to rewrite them all, or we're going to make their life better somehow. That's what makes containers so great in so many ways. It's not the best approach,
Starting point is 00:27:32 obviously, but it works. You can just take something that is 20 years old, written in some ancient version of something, shove it into a container as a monolith. Sure, it's an ugly big container, but then you can at least start moving that from place to place and unwinding your dependency rat's nest. That's how I think about it, only because, like I said, I've spent 10, 12 years working with a lot of customers trying to unwind these old applications. And a lot of the times they lose interest in doing it pretty quickly because they're making money and there's not a whole lot of incentive for them to break them up and do anything with them. In fact, I often theorize that whatever their business is, the real catalyst for change is when another startup or another smaller company comes up and does it more cloud natively and beats their pants off in the market, which then forces them to have to adjust. But that kind of stuff doesn't happen in, like, payment transaction companies. There's a heavy price to pay to even be in that business. And so,
Starting point is 00:28:31 what's the incentive for them to change? I think that there's also a desire on the part of technologists many times, and I'm as guilty as anyone of this, to walk in and say, this thing's an ancient piece of crap. What's it doing? And the answer is, like, about 4 billion in revenue. So maybe mind your manners. And yeah, okay. Is this engineeringly optimal? No, but it's kind of load-bearing.
Starting point is 00:28:55 So we need to work with it. People are not still using mainframes because they believe that in 2024, they're going to greenfield something and that's the best they'd be able to come up with. It's because that's what they went with 30, 40 years ago. And there has been so much business process built around its architecture, around its constraints, around its outputs, that unwinding that hairball is impossible. It is a bit impossible. And also, is it even a problem? Those systems are pretty reliable. The only downside is just the cost of whatever IBM is going to charge
Starting point is 00:29:26 you to have support. So we're going to re-architect and then migrate it to the cloud. Yeah, because that'll be less expensive. Good call. It's always a trade-off. And economics are one of those weird things where people like to think in terms of cash dollars they pay vendors as the end-all be-all,
Starting point is 00:29:42 but they forget the most expensive thing that every company has to deal with is its personnel. The payroll costs dwarf cloud infrastructure costs, unless you're doing something truly absurd at very small company scale.
Starting point is 00:29:55 Like, I've never heard of a big company that spends more on cloud than it does on people. Oh, that's an interesting data point. I figured we'd at least have a handful of them, but interesting. I mean, you see it in some very small companies where, like, all right, we're a two-person startup and we're not taking market rate salaries and we're doing a bunch of stuff with AI. And okay, yeah, I can see driving that cost into the stratosphere, but you don't see it at significant scale. In fact, for most companies that are legacy, which is the condescending engineering term for it makes money, which means it was founded more than five years ago,
Starting point is 00:30:29 their number two expense is real estate, more so than it is infrastructure costs. Sometimes, yeah, you can talk about data centers being part of that, but office buildings are very expensive. Then there's a question of, okay, cloud is usually number three, but there are exceptions to that. Because they're public, we can talk about this one. Netflix has said for a long time that their biggest driver, even beyond people, has been content. Licensing all of that content and producing all of that content is not small money. So there are going to be individual weird companies doing strange things. But it's fun. I mean, you also get to this idea as well that, oh, no one can ever run on-prem anymore.
Starting point is 00:31:10 Well, not for nothing. Technically, Google is on-prem. Yeah, and Amazon is on-prem too. They're not just these magic companies that are the only ones that remember how to replace hardware and walk around between racks of servers. It's just, is it economical? When does it make sense to start looking at these things? And even strategically, tying yourself indelibly to a particular vendor, because people remember the mainframe mistake with IBM.
Starting point is 00:31:36 Even if I don't move this off of Google or off of Amazon today, I don't want it to be impossible to do so in the future. Kubernetes does present itself as a reasonable hedge. Yeah, it neutralizes that vendor lock-in if you were to run your own data centers or whatever. But then a lot of the times you end up getting locked into specific hardware,
Starting point is 00:31:56 which is not that different than cloud because I do work with a handful of customers who are sensitive to even very specific versions of chips. They need version N because it gives them 10% more performance. And at the scale they're running, that's something that's very important to them. Yeah. One last question before we wind up calling this an episode
Starting point is 00:32:16 that I'm curious to get your take on, given that you work over in product. Where do you view the dividing line between Kubernetes and GKE? So this is actually a struggle that I have because I am historically much more open source oriented and about the community itself. I think it's our job to bring the community up, to bring Kubernetes up.
Starting point is 00:32:36 But of course it's a business, right? So the dividing line for us that I think about is the cloud provider code, the ways that we can make it work better on Google Cloud without really making the API weird, right? We don't want to run some version of the API that you can't run anywhere else. Yeah, otherwise you'd just roll your own Borg and call it a day.
Starting point is 00:32:56 Yeah, but when you use a load balancer, we want it to be fast, smooth, seamless, and easy. When you use persistent storage, we have persistent storage that automatically replicates the disk across all three zones so that when one thing fails, you go to the other one and it's nice and fast. So these are the little things that we try to do to make it better. Another example that we're working on is fleets. That's specifically the product that I work on, GKE Fleets. And we're working upstream with Cluster Inventory to ensure
Starting point is 00:33:23 that there is a good way for customers to take a dependency on our fleets without getting locked in, right? So we adhere to this open source standard. Third-party tool providers can build fun implementations that then work with fleets. And if other cloud providers decide to take that same dependency on Cluster Inventory, then we've just created another good abstraction for the ecosystem to grow without forcing customers to lock into specific cloud providers to get valuable services.
Starting point is 00:33:50 It's always a tricky balancing act because at some level, being able to integrate with the underlying ecosystem it lives within makes for a better product and a better customer experience. But then you get accused
Starting point is 00:33:59 trying to drive lock-in. The whole, I think the main, if you talk to my skit, Drew Bradstock runs Cloud Container Runtimes for all of Google Cloud. I think he would say, and I agree with him here, that we're trying to get you to come
Starting point is 00:34:14 over to use GKE because it's a great product, not because we want to lock you in. So we're doing all the things to make it easier for you. Because you listed out a whole lot of complexity, we're really trying to remove that complexity. So at least when you're building your platform
Starting point is 00:34:29 on top of Kubernetes, there's maybe, I don't know, 30% less things you have to do when you do it on Google Cloud than other platforms or on-prem. Yeah, it would be nice. But I really want to thank you for taking the time to speak with me.
Starting point is 00:34:42 If people want to learn more, where's the best place for them to find you? If people want to learn more, where's the best place for them to find you? If people want to learn more, I'm active on Twitter. So hopefully you can just add my handle in the show notes. And also just if you're already talking to Google, then feel free to drop my name and bring me into any call you want as a customer.
Starting point is 00:35:02 I'm happy to jump on and help work through things. I have this crazy habit where I can't get rid of old habits. So I don't just come on the calls as a PM and help you. I actually put on my architect consultant hat and I can't turn that part off. I don't understand how people can engage in the world of cloud without that skillset and background personally.
Starting point is 00:35:22 It's so core and fundamental to how I view everything. I mean, I'm sure there are other paths. I just have a hard time seeing it. Yeah, yeah, yeah. It's a lot less about, let me pitch this thing to you and much more about, okay, well, how does this fit into the larger ecosystem of things you're doing, the problems you're solving?
Starting point is 00:35:38 Because, I mean, we didn't get into it on this call and I know it's about to end, so we shouldn't, but Kubernetes is just a runtime. There's like a thousand other things that you have to figure out with an application sometimes, right? Like storage, bucket storage, databases, IAM.
Starting point is 00:35:55 Yeah, that is a whole separate kettle of nonsense. You won't like what I did locally, but that's beside the point. But are you allowing all anonymous? Exactly. The trick is, if you harden the perimeter well enough, then nothing is going to ever get in, so you don't have to worry about it. Let's also be clear, this is running a bunch of very
Starting point is 00:36:12 small-scale stuff. It does use a real certificate authority, but still. I have the most secure Kubernetes cluster of all time running in my house back there. Yeah, it's turned off. Even then, I'd still feel better if it were sunk into concrete and then dropped into a river somewhere. But, you know,
Starting point is 00:36:26 we'll get there. Thank you so much for taking the time to speak with me. I appreciate it. No, I really appreciate your time. This has been fun. You're a legend,
Starting point is 00:36:35 so keep going. I'm something, all right. I think I have to be dead to be a legend. Nick Ebertz, product manager at Google. I'm cloud economist Corey Quinn, and this is Screaming
Starting point is 00:36:45 in the Cloud. If you enjoyed this podcast, please leave a five-star review on your podcast platform of choice or on the YouTubes. Whereas if you hated this podcast, please continue to leave a five-star review on your podcast platform of choice, along with an angry, insulting comment saying that that is absolutely not how Kubernetes and hardware should work. But remember to disclose which large hardware vendor you work for in that response.
