Screaming in the Cloud - The Pros of On-Prem Kubernetes with Justin Garrison
Episode Date: June 6, 2024

Justin Garrison, Director of Developer Relations at Sidero, joins Corey to discuss Justin's experience transitioning from large companies like AWS and Disney to a more agile company like Sidero, the benefits of using simplified Linux distributions like Talos OS for running Kubernetes, and the pros of on-premises setups for certain workloads. The conversation touches upon challenges with cloud provider limitations, the impacts of computing power on both an economic and environmental scale, and Corey and Justin's frustration with businesses touting their use of AI when they've already abandoned those projects.

Show Highlights:
(00:00) - Introduction
(01:09) - Justin's Background and Career Journey
(02:39) - Transition to Sidero
(03:51) - Using Personal Devices for Work
(08:09) - Talos Linux and Kubernetes
(15:19) - Kubernetes Upgrades and On-Prem Challenges
(19:21) - Building Your Own Cloud Platform
(21:52) - Multi-Cloud vs. Hybrid Cloud
(25:15) - Scaling and Resource Management
(28:02) - Gaming and Cloud Bursting
(32:46) - AI and GPU Challenges
(34:54) - Balancing On-Prem and Cloud Solutions
(40:49) - Final Thoughts and Contact

About Justin: Justin is a historian living in the future. Lucky enough to play with cool technologies and hopeful enough to bring others along for the ride.

Links Referenced:
Justin's Website: http://justingarrison.com
Justin on Bluesky: https://bsky.app/profile/justingarrison.com
Justin Garrison on LinkedIn: https://www.linkedin.com/in/justingarrison/

Sponsor:
Panoptica: https://www.panoptica.app/
Transcript
How does this actually work when you're doing this drastically?
And in most cases, it says like, no, actually, you just want to pin.
You want to pin your workload to a type of node and always make sure you have the same amount of cores.
Welcome to Screaming in the Cloud.
I'm Corey Quinn, and I'm joined today by Justin Garrison, who these days is the director of DevRel over at Sidero.
Justin, thank you for joining me.
Thanks for having me, Corey.
This episode's been sponsored by our friends at Panoptica, part of Cisco.
This is one of those real rarities where it's a security product
that you can get started with for free, but also scale to enterprise grade.
Take a look.
In fact, if you sign up for an enterprise account,
they'll even throw you one of the limited,
heavily discounted AWS skill builder licenses they got
because believe it or not,
unlike so many companies out there,
they do understand AWS.
To learn more, please visit panoptica.app
slash last week in AWS.
That's panoptica.app slash last week in AWS. I have to say this. One of
the things I adore about having you on this show is that there's a standard intake questionnaire
of the form I send people when I invite them onto the show. And one of the fields that's there is,
does this need any PR review? Think legal counsel, corporate comms, PR folk, et cetera. And your answer to it
was, ha ha ha, no, in all caps, which is just the best answer I think I've ever
gotten to that. And basically perfectly embodies my philosophy on these things. So thank you for
making me smile. Yeah. You know, I do what I can. I've, I've spent my career mostly at very old or
very large companies. And every company I've worked for before Sidero
was either 100 years old or over 100,000 people. And this is the first time that I'm not
in one of those situations. And it feels really good, actually. You finally get to say what you
want to say, how you want to say it, to whom you wish to say it. And that is no small thing.
I mean, having an opinion that doesn't align directly with the company is a good thing
sometimes. You were recently at AWS. And before that, you were at Disney, both of which are, to
say that they're large companies dramatically understates the situation by a fair bit.
Before I did this, my last job was at BlackRock, and I basically sat there and stewed for almost
a year, unable to say anything that looked like an opinion in public, particularly in
the way that I like to share opinions. I can only imagine, given that you've spent many years in those types of
environments, what shenanigans you're going to get up to now, given that, to my understanding,
Sidero is a little bit smaller than, you know, companies with four commas in their market cap.
Yeah, it's been really different. I've been at Sidero now for four months. And already it's just, I was thinking back to what it was like at Amazon at four months.
I just got my laptop when I started.
It took three months to get a laptop when I started at Amazon.
I was like, oh, okay, this is big company business.
And I remember at Disney, it was about three and a half months before I had my first ticket
assigned to me, my first project.
It was just three months of reading docs and going to
meetings and figuring out what people do. I understand the onboarding and reading docs.
I did a consulting project that lasted six weeks at a big company once, the first four of which
were spent waiting for them to turn on my AD account, which was, okay, big companies are going to big company,
but without a laptop, how could you even read the docs?
I couldn't. I actually had people sending PDFs to me in various ways that weren't supposed to be allowed
because I had nothing to do.
It was actually during OP1 season when I started too.
And if you know OP1 season on Amazon, there's a lot of docs to read.
And so, but it was entertaining because I had a personal Chromebook, and nothing
of theirs supported Chromebooks.
And so it was just like, okay, well, I have a phone and a Chromebook and I can't do what
I'm supposed to be doing. So I spent some time playing with new services in my personal
AWS account to learn what they did. I do a lot of my computing work,
especially on the road, from an iPad, and most things, yeah, don't tend to support that as
a primary means of interaction. All my dev work is done on an EC2 box for a number of reasons,
but it just strikes me as weird in that at big
companies in particular, I would never, as an employee, expect to be able to use my personal
devices for things, because it's always going to come with these restrictions of, great, install
this MDM on your stuff so we can do the corporate management thing. And it's like, how about you
provide the equipment you want to have me engage with your corporate systems, or you leave
me alone.
Like, oh, when you start rolling out MDM for mobile devices where it's great, we want to be
able to wipe your cell phone. My position on that is great. You're going to get me a corporate cell
phone. Alternately, I'll just be in touch when I happen to be on the laptop. It's not a great
position to be in, but I've heard too many horror stories of not just, it's not just malfeasance you
have to worry about. It's accidents.
Someone in Corp IT accidentally wiped away your personal phone while you're on a trip
through no ill intent.
Great, now what?
My entire life lives on that thing.
I was part of the team at Disney Animation
that rolled out MDM for people.
And so I definitely know the struggle of like,
I don't want to do this.
It's not like, but it's my job, right?
Like I have to do this thing.
Even on corporate machines here,
we have a very light touch Jamf profile
that is only on stuff that we own and control.
And all it does is it enforces screensavers,
password strengths, and encrypting the disk.
So I don't have to report a data breach
when you get it swiped out of your car trunk.
Awesome, great.
This is stuff that anyone who wants to, who works here, can pull up at any point. Hey, pull this up and take a look;
I want to make sure you're not doing something underhanded. Absolutely, I'm not. It's very
straightforward. And it's the stuff that I, like my entire philosophy is I will never ask employees
to do something that I won't do myself. And not just because I happen to be the one holding the
switch because those things can
change pretty quickly. Yeah. And I actually took your lead on joining the new company where my,
my primary or only mobile device is an iPad now. And, and partially because they got good enough,
right? Like the software is still in the middle there, but I was like, I need something that I
do a lot of drawing. I like animation. I like doing that side of it. And I like the iPad and
the pencil format. And I do a lot of video editing and DaVinci works on the iPad.
And so like that combo has worked really well for me of just like, hey, I want a single
purpose device.
I do writing and those sorts of things.
And then I have a shell that remotes back to my desktop in my home lab and I can do
the things that I need to do for work.
It's been working really well for me.
We are very different ends of a particular spectrum where my version of a user interface that's good
talks about like the position
of command line arguments in something.
I use VI for most of my writing,
although I do admit for code increasingly,
I'm drifting in a direction toward VS Code
because it's gotten pretty decent.
But yeah, everything I do
is just green screen terminal style stuff.
Oh no, Blink on the iPad is great.
Blink as the terminal is an excellent terminal app on the iPad.
That's where I do all my writing.
It's in Vim.
It's remote back to my box.
And yeah, with Tailscale and a 5G connection, it's awesome.
It sounds like we're using the exact same stack on that.
I've used their VS Code implementation just by typing code and the path to a directory.
I gave up on code.
I was using code at Amazon for... I was like, I want to switch.
I want to switch off Vim.
I'm finally going to go into the GUI world of IDEs.
And I used it for four years at Amazon.
And I'm like, I don't like this anymore.
And I just went back to, actually went to NeoVim with LazyVim as a framework on top.
And it brings a lot of those things that I liked in code automatically, like all the
pop-ups and things that get in the way, all those come with it. It's great. One thing that I've done that I think is
probably just the side of a war crime is I've gotten GitHub Copilot to work in Vim when prompted.
But unlike in a lot of other environments, it doesn't automatically slap the text and it has
to explicitly be asked, which is kind of important. Yeah. It's auto auto-filling for me. So it's
disabled by default. It's just like, it just gets in the way too much. Yeah. When I want a robot's
opinion, I will ask explicitly for it. It's like that old shitposting meme picture of, a robot will
absolutely not speak to me in my holy language. I am a divine being. You are a pile
of bolts. How dare you. Just this over-the-top screaming-at-a-computer thing, which, yeah, that's very
much my bent.
So what are you doing at Sidero these days?
What do you folks do exactly?
At Sidero, we mainly focus on people having problems with on-prem Linux and Kubernetes.
And so it came from actually starting Talos Linux, which was a single-purpose Linux distro.
And in my days at Disney, it was at Disney Animation, I was doing on-prem Kubernetes
and we were using CoreOS and it was like a fantastic, like, oh, simplified the RHEL stack
to be basically just systemd and container runtime enough that I could bash script my
way into running a Kubernetes cluster.
And in around that time, Andrew, the CTO at Sidero, he started this
Talos Linux thing. And it's like, no, like PID1 is just an API. There's no SSH. There's no,
you don't need users. Everything is an API, API driven Linux. And all it does is run Kubernetes.
It does nothing else useful besides run Kubernetes. And I was like, that's an appliance.
That's amazing. And so I remember when he announced it on Hacker News, I emailed him immediately.
I was working at Disney.
I'm like, I want this to exist in the world.
This should be a thing that is available.
And throughout my career, I've stayed in touch.
I've kind of used it here and there.
And I went over to Amazon
and we launched Bottle Rocket
as a direct competitor to Talos Linux.
And then Bottle Rocket kind of went off
and got re-orged a couple of times
and does some other things.
No, you're kidding.
A re-org at Amazon?
No.
It doesn't do the same things that it used to.
And it's not really focused on Kubernetes anymore.
It's like this weird container sort of thing.
And of course, CoreOS got bought twice and it does some other weird things.
And that kind of like shifted off.
And then this Talos Linux thing, it just kind of stayed in this like,
we do Kubernetes and that's it.
And it's everything from bare metal.
You can see on my bench behind me, I know it's an audio podcast, but I have like a Raspberry Pi running it and a laptop running it. Like I'm setting up a lab
for a talk later. And I'm just like, oh yeah, this is like, it runs on very small stuff. And
then very big stuff. We have an Oxide rack that we're testing it with. And so it's like
big metal. And then also like cloud providers, it's just, it runs everywhere. It's Linux and
it just runs Kubernetes. And that's like the really key point of like, we have this streamlined Linux OS because people are like,
you have to learn Linux to know Kubernetes. It's like, actually you can lower the bar.
You can make Linux disappear and just make it an API. You have to dial it in super well to get
there, but I believe it is possible. It's a, that is kind of the dream where you don't have to think
about these things anymore. I mean, I wrote a blog post a couple of years back on the idea that nobody cares
about the operating system anymore,
which of course was dunking on Red Hat at the time
for deciding they're going to basically backtrack
on the CentOS long-term support commitment.
Surprise, you need to pay us now.
Everyone loves hearing that.
And it was a bad move from my perspective
because most companies don't care that much
about the operating system the way that they once had to. Some people need to care very much, but it's not a necessary
prerequisite to build things now. Especially when you have something like a Kubernetes clustering
system on top, you want the OS to disappear. It should not be a hindrance. It should get out of
your way. It should let you debug some things. It should have some access to do like, hey,
what's going on? But beyond that, it should just disappear. And that's exactly what Talos
was meant to do. And
even when I started, I started
digging into it more because I haven't touched it for a little
while at Amazon. And I was like, hey, how
many binaries do we have on the system? And I was like, oh,
I can just list out this directory. I was like, oh, there's 12.
There's 12 total binaries. I was like, what?
How is this possible? What are you doing here?
And like, oh, yeah. And I started counting other Linux
distros. And it's like, you know, RHEL has like 7,000, Ubuntu 6,000,
even the smallest Bottlerocket is like, you know, 1,400, 1,300 or something like that.
I mean, technically you could get a lot of it down to near one with BusyBox. It just
changes its behavior based upon which symlink invokes it.
And there are some symlinks in there. Like if you count all the linked binaries,
there's like 30, but it's like most of them
are LVM, right?
It's like, half of the binaries are for like disk management.
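To make the "no SSH, everything is an API" idea concrete: on Talos you don't log into the node at all; you declare what it should look like and apply that over the API. A minimal machine-config sketch, with field names from memory (check the Talos docs for your release) and a hypothetical hostname and disk:

```yaml
# Minimal Talos machine-config sketch -- field names from memory, hostname and
# disk are hypothetical. It is applied over the Talos API (for example with
# `talosctl apply-config`); there is no shell session to edit anything in place.
machine:
  install:
    disk: /dev/sda            # where Talos installs itself
  network:
    hostname: worker-01       # hypothetical node name
  kubelet:
    extraArgs:
      max-pods: "200"         # kubelet tuning is declared, not edited on the box
cluster:
  network:
    cni:
      name: flannel           # the CNI choice is declared the same way
```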
The thing that drives me nuts, and I've wanted this for years, even the stripped-down minimalist
distros aren't stripped down and minimalist enough, because their argument is, well,
if you get rid of some of these things, it'll be really uncomfortable to work in the environment.
It's production.
It's not supposed to be comfortable.
If you're copying your .files
to customize your shell to production,
you are doing it wrong in almost
every case. Stop that.
I got into Kubernetes finally
for basically losing a bet with the internet earlier
this year and gave a talk, Terrible
Ideas in Kubernetes. To do that, I now have
a 10-node Kubernetes cluster of my
own running in the next room on
a bunch of Raspberries Pi.
And the most annoying part of running the thing, to be very direct with you,
is in fact the underlying operating system.
Having to keep those things patched and updated and caring about those things.
I just want it to basically run Kubernetes as an appliance,
and please shut up and leave me alone for other things that aren't that.
And I can't get there.
That was exactly like at Amazon.
I was helping build the EKS Anywhere, the on-prem EKS version of you should run Kubernetes
in this environment with this set of Linux distro and stuff.
And it was like, it was automated, but it was difficult.
And it was like, oh, we could do this stuff.
And it had this cluster API and all these things that got way too complicated.
And I was like, hey, I need to look at what the competition's doing. Like what are other people doing? And again, I was like, Oh,
I know Talos, like, I'll just spin it up. And I was like, okay, well I get the API, like I can get
the commands. And I actually booted the main product that we sell at Sidero, which is Omni.
And it's a, it's a SaaS version of like, you can run, you can literally put it on a USB drive,
boot a machine and it connects back. You have a Kubernetes cluster, like magic. Like the first
time I did it, like, wait a minute, I missed something. Like there's got to be something I'm doing wrong because this part of it was too easy.
Like I went from booting a USB drive to a Kubernetes API in like two clicks.
And I was like, I don't know what I just did wrong, but let me go try it again.
And I started internally at Amazon, like making like competitive, like, hey, we should look
at what's going on over here.
Like this is not cluster API.
This is not complicated.
Like I didn't have to get the dev team to, like, troubleshoot anything.
It's like an ERP system.
You buy the thing and the real money is for the consultants who wind up spending the next
four years of their lives dialing in all the 100,000 configuration options just right for
you.
I don't think that we're at a point where that needs to be the case.
You can, I imagine they could be tunable in a bunch of different ways if you absolutely
had to be, but the mean path, the common path for most folks should not involve
that level of obscene customization. And absolutely. There's a place for,
I want to learn it. I want to do it the hard way. I want to go through and like, I need,
I want to learn the Linux steps because I need to for a career, whatever. I've done it right.
Like I was doing that. I was part of the initial, like a SIG on-prem inside of Kubernetes.
Like we were like building Kubernetes on-prem.
I was one of the chairs for it
when it originally started.
And it's like, it was really, really hard.
And all of my systemd units
were like curling down hyperkube
and all this stuff.
And it was like, it was automated,
but it was hard.
And now it's like,
this should just disappear.
That whole layer from the TPM,
like you want trusted boot
all the way up to Kubernetes API,
that should just be handled. Like we should be able to not automate that and just abstract it away like Kubernetes does for a lot of other stuff when you run a pod, right? Like I want a deployment.
I just want the service load balancer available. I don't care how you get there.
Few things are better for your career and your company than achieving more expertise in the cloud.
Security improves, compensation goes up, employee retention skyrockets.
Panoptica, a cloud security platform from Cisco,
has created an academy of free courses just for you.
Head on over to academy.panoptica.app to get started.
How do you handle things like Kubernetes version upgrades
when there are no moving parts?
All the Kubernetes components are part of containers that get run, right?
So like we have like system containers that run.
And so it's like you can shift out like I have a, you know, Talos OS version.
I say, oh, upgrade Kubernetes.
And when you have a declarative spec, right, like Kubernetes can say, like, I know how
to roll through upgrades of a service or a pod or whatever.
We can roll through upgrades of the components of a control plane.
We're like, okay, well, etcd goes first.
We know how to make sure etcd is healthy.
We know how to roll that out
if you have a highly available setup.
We do one at a time
and we just do the steps you're supposed to,
just like Kubernetes does
with the application layer.
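For a sense of what "the Kubernetes components are just containers" means in practice on Talos: the versions themselves are declarative fields in the machine config, and the upgrade tooling (something like `talosctl upgrade-k8s --to <version>`) walks each node through bumping them in order. A hedged sketch, with field names from memory and hypothetical versions:

```yaml
# Hedged sketch of the relevant Talos machine-config fields -- names from
# memory, versions hypothetical. An upgrade is just these images changing,
# rolled out one node at a time, rather than packages being patched over SSH.
machine:
  kubelet:
    image: ghcr.io/siderolabs/kubelet:v1.30.1
cluster:
  apiServer:
    image: registry.k8s.io/kube-apiserver:v1.30.1
  controllerManager:
    image: registry.k8s.io/kube-controller-manager:v1.30.1
  scheduler:
    image: registry.k8s.io/kube-scheduler:v1.30.1
```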
So you don't, in this case,
need to do this from the perspective
of someone who is having
to babysit the process through.
You just tell the API to do it
and it does it, a solved problem effectively. It knows how to do it. Just expect that it might be in an
intermediate state for the next period of time until it completes. So maybe this is not the
same time to roll out a major deploy. Yeah, there are things in any infrastructure where you're
like, oh, if I'm going to upgrade this thing, let's hold off and make sure that we're not going
to cross send errors because something else happened, right? Change one thing at a time. So one of the things that I've learned through dealing with the
Kubernetes has been that it's been the first time in a long time that I've done stuff on-prem. I'm
used to doing everything in a cloud provider environment, but I didn't do that with one of
the managed Kubernetes services because I've seen the bills when those things start misbehaving.
And frankly, they are not pleasant. I don't want to spend thousands of dollars for basically a fun
talk that I'm giving
in terms of oops-a-doozy charges. And I'd forgotten on some level just how annoying
some of the aspects of running things in your own environment can be. Deviances between some
bad cables or a bad chip that winds up getting sent out. The joy of running power and the rest.
The challenge inherent in the storage subsystem.
For example, Longhorn is apparently terrible.
It's just the second worst option.
Everything else is tied for first.
And EBS in AWS land largely just works.
So you don't have to think about these things.
And you learn that, oh, just like DNS,
when the storage subsystem starts acting up,
so does everything else.
You will not be going to space today because you have surprise unplanned work. Now, I would never run things like this in
a production style environment. There'd be a lot more care and whatnot, but it just runs a bunch
of home lab stuff that I want to exist. But if it goes down for a day or so, it doesn't destroy my
ability to operate. Yeah, for sure. And I mean, there's always that, like people ask all the time,
like, hey, how do I get my storage inside this Kubernetes cluster? And I'm like,
do you have an external way you run storage today? Right. You don't have to reinvent everything.
Everything should not be a Kubernetes problem, right? That NetApp that you paid for,
keep using the NetApp. It's really good at doing storage, right? Like that's fine. Like don't,
don't think you have to shove everything in here. And I understand people, they want the interface,
they want something that looks familiar. I'm like, no, it's okay to have a team that manages NFS over there.
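In Kubernetes terms, "mount that in" usually means a statically provisioned PersistentVolume pointing at the export the storage team already runs, plus a claim your pods reference. A minimal sketch; the server address, path, and size here are hypothetical:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: team-share
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteMany"]
  nfs:
    server: 10.0.0.50            # the filer the storage team already manages (hypothetical)
    path: /exports/team-share
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: team-share
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""           # bind to the pre-provisioned PV, not a dynamic class
  resources:
    requests:
      storage: 500Gi
```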
And then you mount that in and those things are great. And having those things separate,
but again, the Kubernetes mindset of I have to run everything myself because I can,
is just kind of a way of, we've gotten in trouble with everything in the past,
like config management and everything had to be an Ansible playbook for a long time. I'm like,
actually, no, I don't want to SSH and try
to do this in YAML. I do manage the individual Pis with Ansible because it was either that or
a bunch of shell scripts running in loops, which no thank you. But yeah, it's a neat approach to
doing these things. Now, most of my use cases are very much contrived here in that there are
containers that I want to exist and run. I could run those anywhere, probably Docker Swarm or just running them in Docker Desktop would be more than sufficient for
my use case. But I wanted some workloads that were actually running things I care about.
And these are mostly Singleton style container approaches. A couple of them have services that
have a few containers talking to each other, but it's really not a Kubernetes shaped problem.
I've introduced a raft of unnecessary complexity for what I'm doing.
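On the Ansible remark above: keeping a handful of Pis patched really is about a dozen lines of playbook. A minimal sketch, with a hypothetical inventory group name:

```yaml
# Minimal patch-the-Pis playbook -- the `pis` inventory group is hypothetical.
- name: Patch the Raspberry Pi nodes
  hosts: pis
  become: true
  tasks:
    - name: Update the apt cache and upgrade all packages
      ansible.builtin.apt:
        update_cache: true
        upgrade: dist

    - name: Check whether a reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_flag

    - name: Reboot if the upgrade asked for it
      ansible.builtin.reboot:
      when: reboot_flag.stat.exists
```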
But now that I have it, I find that just spinning up new containers of services here to do things
to be relatively straightforward, I have become something of a reluctant convert to Kubernetes.
Now, I like it more on-prem than I do in a cloud provider environment, which is what
I want to get into with you, because it feels like it's a good way to basically build a
cloud-like platform
of your own. But when you do that on top of AWS, it's like, well, crap, why not? If you want to
go work at a cloud provider so bad, fill out an application. You can go and do it there for real.
If you're going to work in the cloud, use the primitives that they provide for you for your use
and you'll generally wind up happier has been my position on this historically
please fight me. I would fight back on the fact that, uh, for Disney+, Disney+ when I was
there was all built on ECS, and it was one of the larger installations of ECS, and
ECS is a native AWS service, and all of the things that you would want to do are very native to what AWS can do.
And in any, even starting small,
the first AWS native primitive
you're going to find yourself hitting is account limits.
Like that as like a cloud,
as like a thing that happens in the cloud is like,
oh, this artificial thing to protect the service.
I understand why it exists
and why I would want to build it.
But then it's like, I have to do that all the time.
And there were so many account limits.
And that's why people always want to have development done in separate accounts,
because if you don't, you can have all the permissions isolation in the world.
But if you exhaust the rate limit for EC2 DescribeInstances,
well, suddenly things like Auto Scaling groups and load balancers
will not be able to make those calls, and your production environment will once again
not be going to space today. Yeah. In a lot of cases, even at medium size scale,
I just basically say like, you could abandon a lot of your labels in AWS and just go different
accounts, right? Cause like, that's the real limitation factor of AWS is like, oh, this is
that account. And like, they're not talking to each other and you get that separate clean bill.
And you're like, for every, every application you're deploying, every team that wants some
resources, stick them in a new account.
And that becomes easier to a lot of degrees than trying to divvy up things and say like,
oh, well, this is these labels and they shouldn't talk to this other one.
And then you get this IAM nightmare.
Of course, when they have to talk to each other, you get into a lot of other problems. But in a
lot of cases, it was just like, you know what, as separate accounts, as the boundary to just
like a separate Kubernetes cluster, as the boundary for a lot of teams, it just makes a
lot more sense because even though you have RBAC in namespaces and other things, it just gets
cleaner and easier when you just say, this is a different API and you just go over there and you do your own thing. And when you break it and come back to me and say, I need another one, I'm just going to make you another clean one. I don't have to talk to anyone else.
And that was approaching a very specific thing that has been misinterpreted a number of times,
but cool, I'll live with it.
I should have been more clear.
I'll live with that on me.
But my point has been that for most folks,
just setting out to build something from day one
with the idea that it can run seamlessly
and just the same in any or all cloud provider environments
is probably not what you want to be doing
without there being an extenuating circumstance
because rather than embracing a provider,
it matters not which one, and taking advantage of all the things that they
have operated.
Instead,
you're doing things like rolling your own effective load balancers because no
one has a consistent load balancing experience or a provisioned database
service.
That's a lot of undifferentiated heavy lifting that you could be
spending on the thing that your business does.
There are exceptions to all of this,
but that's the general guidance that I've taken on this.
I don't feel the same way about hybrid,
because hybrid I don't think is something that people set out to achieve.
I think it is something that happens to them.
My theory has been that most hybrid environments
are someone trying to do a full-on cloud migration,
running into a snag somewhere along the way,
such as we have a mainframe and there is no AWS/400.
Oops-a-doozy.
Declaring a victory midway through saying we're hybrid now and then going to focus on
something else.
It's cynical, but also directionally accurate from what I've seen.
Agree?
Disagree?
Two parts there of like multiple clouds.
I absolutely agree.
Like you should go into one cloud specifically.
And I call that undifferentiated heavy clouding, when you try to make your own cloud on top of clouds. Any significant-size enterprise is going to have a cloud team that basically abstracts some of that away, of like, oh, we want this to work everywhere, and this Terraform provider should go to every cloud. I'm like, actually, maybe you don't. Maybe you should just keep that as clean, as simple of an interface as possible.
Because again, account limits.
And also, if you're not actively testing the stuff by running it active-active across multiple
providers, it's like a DR plan that isn't being tested.
The next commit after you test and get and validate your DR plan has the potential to
render the DR plan irrelevant.
So it has to be consistently tested.
If you've never done a restore, you don't have backups.
Exactly. No one cares about backups. Everyone cares about the restore. And similarly,
if you are, if you are, there's some people that desperately need certain things to run multi-cloud. If you're a telemetry vendor like Datadog or whatnot, you need that stuff to live
where people are running their workloads. So yeah, that thing needs to live everywhere.
You know what doesn't necessarily though is your account management stuff
or your marketing site
or a bunch of other services.
It would even, like I was a Datadog customer
and even like, does that need to run in AWS?
The only reason that needs to run in AWS
is because there's egress fees, right?
Like that's the, like the business model of the cloud
directly impacts the architecture of someone else.
And that becomes a problem.
And so that for sure is like a thing that exists.
But yeah, you have to be where your customers are because of that stuff.
I actually just saw a headline today that Azure is dropping cross AZ charges.
I did not expect Microsoft to be the leader in that particular direction.
But I usually think of Microsoft when I want someone to basically treat security as a joke,
not to be avant-garde in terms of data transfer pricing.
Yeah, competition is good.
But yeah, to your point,
like you either have to go with Azure
and hope you don't get hacked
or go with Google and hope that they don't shut it down.
Like those are your choices right now for a lot of things.
But back to the hybrid conversation.
Absolutely, like it becomes a thing of like,
is this architected to be hybrid
or is this a failed half attempt of like,
actually that stuff just is going to stay on-prem.
And the successful situations that I've seen are the, we just need
to burst and we have compute, right? And that's like, we want the elasticity of compute and compute
resources in the cloud. But we want to keep that majority; anything that would be a Reserved Instance
or a Savings Plan, that should be on-prem. It's going to be always cheaper and you don't
ever have to worry about cross-AZ because you own the switches, that sort of stuff.
And then like that extra,
anything you would run in Spot,
yeah, go run it in Spot in a cloud provider.
And how you get there becomes a lot harder
for a lot of people.
And they think that Kubernetes is going to solve that problem.
It's like, oh, well, I have a Kubernetes.
So now I can have another Kubernetes
and we can just deploy twice.
I'm like, no, not really,
because there's a lot there, right? Yeah, and no, no two Kubernetes installations are alike. There's
always different prerequisites of what services and how people have approached these things. It
feels like we've come full circle, because this was the guidance people gave back in the
aughts when cloud
own the base, rent the peak. And that's the way that people always approached it. Somehow that
changed to, oh, put everything in the cloud. And there are values and validity to that approach,
especially at small scale. When you're a startup building something, you should not be negotiating
data center leases. At some point of scale, though, the economics start to turn. And ideally,
that's when the scale discounting starts to come into play. But, and I used to be able to say this, I
can't anymore, but I used to say that I don't know too many companies that spend more on cloud
infrastructure than they do on people to operate that cloud infrastructure. And then we had a bunch
of AI companies go and basically spend $100 million that they can't pay on their cloud
bill. Stability, I am looking at news articles about you on this.
And okay, yeah, step one, you scam the VCs out of money.
And then step two is you go buy all the GPUs from cloud providers.
You got your order of operations confused, and now you're having a fire sale.
That's unfortunate.
Sure.
And I mean, to the first point of bursting, peaking into the cloud, the most successful
times I've seen that done is when I was at Disney Animation, like our render jobs were like, Hey, the movie has a date.
We know how long it takes us to render things. We don't have enough computers, right? And like,
we could just do that math. We're like, this is where it lines up. Okay, well, let's just go find
cheap compute somewhere. And then just we'll spin up, you know, more, we'll get more CPUs. We'll
render the movie. And then when it's done, the movie's done and we shut it all down. And literally
we built and we were creating boxed software,
right?
And that sort of business model,
we can shut it all down
as soon as we were done with the movie.
And it was great
because we could just cleanly
turn it off.
We're like, hey,
this is just coming out of the budget
and we got to make it back
in the movie, right?
Like this is how money works.
The other one is like
those big launches
where it's like, oh,
it's a huge peak at the front,
like gaming, right?
Like they're like, oh,
we want on-prem
because we have to be close
to users or whatever. But we just, we don't want people to not be able to
log in on day one. And after two weeks, everything kind of settles down and we can figure out from
there. And those are perfect examples of like, how do you extend that? How do you make a stretched
environment that isn't, you know, completely separate in one area or another. And in a lot
of those cases, you want that, like, consistent scheduling of, like, hey, at Disney Animation, we had a custom scheduler that we would, you know, spin up jobs anywhere
we wanted to. And with Sidero, like, we have this thing called KubeSpan,
which is a WireGuard connection that makes a mesh between your nodes. And it's like,
you join the same cluster. And at that point you have one cluster that you say, hey, I can still
run Cluster Autoscaler or Karpenter or whatever, and say, hey, when I need compute, automatically
grab me a few.
They're not going to live forever.
It's only going to be for a temporary time.
But I'd rather give a user a good experience than bifurcate my deployment engines and run
two things.
And that becomes the really hard thing to maintain.
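For reference, the KubeSpan piece is a Talos machine-config setting rather than a separate product. Roughly this, as a hedged sketch with field names from memory: with it enabled on every node, on-prem and cloud machines form a WireGuard mesh and join one cluster, which is what makes "burst into the cloud without a second deployment pipeline" workable.

```yaml
# Hedged sketch: enable KubeSpan in the Talos machine config (verify field
# names against the Talos docs for your version). KubeSpan relies on the
# discovery service to find peers across networks.
machine:
  network:
    kubespan:
      enabled: true
cluster:
  discovery:
    enabled: true
```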
Gaming companies are fascinating in that space because in many cases, they have to both get
as close to users as physically possible, and they have a very strong ebb and flow throughout the day, and almost a follow the sun type of style.
And I haven't looked into details of it. It depends on a case-by-case basis, but very often it's the evenings that you're following around there because most people don't play video games from work except during a pandemic, but that's neither here nor there. So they basically distill something down
and it just needs to be close to users.
It's highly dynamic throughout the day
and it basically can run anywhere
that can handle a Docker container, give or take.
So you have an x86 or ARM instruction set
that you can run there, great.
You can run the thing that we need to do.
What I love looking at is people
with sustained workloads in clouds.
What is the average lifetime of an
instance? How scaling are you? Are you just treating the cloud like a big dumb data center?
That's not economically great. It does solve a number of problems, don't get me wrong,
and I don't have enough context without more information to say whether this is good or bad,
but when I see something that, oh yeah, we're running heavily on spot, we are scaling up and
down constantly throughout the course of the day, Here are the workloads and here are the patterns where it shifts. Yeah, I'm not
going to suggest those people move to on-prem in almost any case, just because it makes little
sense for them. But that's not everyone, especially at enterprise. There's an awful lot of effectively
permanent pet style EC2 instances living around where if you already have a data center and the
staff able to run it, maybe there's an economic story to not be there.
I spent a lot of time at AWS working with the Karpenter team on the node scheduler in
Kubernetes.
Karpenter is this workload-native thing that can do that dynamic scaling really well.
And I've changed my mind a lot about how autoscaling works in a lot of cases because it just takes
a lot of engineering.
And in a lot of cases, I still prefer like, hey, you know what?
If you're auto scaling, let's say you're going up and down by 30%, which is a pretty big
shift throughout a day, of like, we get 30% more instances and we shut them all down.
And I'm like, actually, again, that steady state of just over provisioning on-prem is
going to be cheaper, not only cheaper, just from the budget perspective of like, hey,
on-prem machines cost less, but also the engineering time of making sure that you're getting instances that you need, that
you're not getting ICE'd from your cloud provider and saying, like, oh, those instances aren't
there.
Or even the performance impacts of a Linux kernel when you're scheduling a process and
that same application lands on a machine that has, say, four cores and lands on a machine
that has 24 cores.
Even if you have reservations set in your application, the performance characteristics
of how the kernel schedules you on the processor is going to be drastically different.
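The "pin your workload to a type of node" fix he describes next looks roughly like this in Kubernetes terms: a nodeSelector keyed on a label you put on one homogeneous pool, plus requests equal to limits so the pod lands in the Guaranteed QoS class (and, with the kubelet's static CPU manager policy, whole-CPU requests can get exclusive cores). The label, image, and sizes are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: latency-sensitive-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: latency-sensitive-app
  template:
    metadata:
      labels:
        app: latency-sensitive-app
    spec:
      nodeSelector:
        node-pool: steady-8core            # hypothetical label on one uniform node pool
      containers:
        - name: app
          image: registry.example.com/app:1.0   # hypothetical image
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              cpu: "4"                     # requests == limits -> Guaranteed QoS
              memory: 8Gi
```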
And we've had a lot of performance analysis of, like, how does this actually work when
you're doing this drastically?
And in most cases, it says like, no, actually, you just want to pin.
You want to pin your workload to a type of node and always make sure you have the same amount of cores. I'm like,
well, then you should just buy a handful of racks of servers and just stick it on there and then
just not use it as much, right? Cause I know that it looks like waste, but really it's not. It's
like, hey, but it's still cheaper than doing the engineering work of all of that effort to say,
I want to scale up and down. Which is cool, you can, it looks really neat. Those graphs are
amazing, but also you wasted a lot of time when you could have just like bought two boxes and
said like, Hey, guess what? We're done. I do find it fun when I'm talking to customers about this,
when they talk about auto scaling, cause it's hard. It's great at giving you the capacity you
need 20 minutes after you needed it. And they talk like, I see some significant peaks in some cases
and was talking to one and I'll change some details a little bit here. Cause you know,
you don't out customers here, but it was effectively the equivalent of, yeah, we do
have to scale massively for incoming rush, but those spikes are also at Formula One races. And
spoiler, they schedule when those things are going to be. We don't wake up at two in the morning,
surprise, impromptu race where we have to scramble to auto scale. So yeah, things like that make a fair
bit of sense. I will say that there is a bit of a countervailing point now, which is the proliferation
of AI, by which I mean the latest scheme people have come up with to sell NVIDIA GPUs. Before this,
it was crypto. Before that, it was gaming. And it's so hard to get...
Exactly. You're holding an NVIDIA GPU.
You know what you're up to.
Increasingly, the hard part has been finding them, particularly at scale.
So it's almost an inversion of data gravity, where people are moving workloads to wherever they can get the requisite GPUs.
They'll eat the data transfer costs.
They just need to be able to get them somewhere.
And looking at the cost that cloud providers charge for these things versus what NVIDIA does, assuming you can get them, yeah, you'll hit break even in a disturbingly
short number of months if you can get them. And right now, that seems to be the hard part.
I mean, I know companies and people that have enough money that they have them and they can't
use them because they don't have a data center that is capable of racking it. Like it's not just buying the thing.
It is prep work of like,
hey, guess what?
At some point that colo
doesn't have the cooling system
and power requirements needed
for those, you know,
A100s, H100s, whatever it is.
That's like, oh yeah,
we have these like giant NVIDIA GPUs.
We want you to put them in a colo.
I'm like, no, actually
you do have to plan something.
You take them home,
you plug one of them in,
they're half rack size things
and suddenly all the lights
on your block go out.
Yeah, I mean, I have,
I got solar power installed in my house,
which has been great,
but like my batteries would drain
in two seconds
if I was like running actual stuff here.
But like, yeah,
there has to be some planning.
And I think in a lot of cases,
this like knee jerk reaction
of catching up
has caught up to too many people
of saying like,
I have to have that now. And it's like, actually let's just take two minutes and do a little bit
of math and see what do you actually need? What are your actual, and I know again, a lot of it's
stock price, a lot of our quarterly earnings, a lot of those things have to be in this framework
of business decisions. But like at some level, someone should be smart enough to say,
okay, where's the math, where do these lines cross? What do I need to do? Hey, guess what? I can build
that in my next budget. That can be a six-month project, not a six-day project. The more I do
this, the more I realize that the common guidance falls apart when it comes to specifics. Because I
talk to very large companies that are doing a lot of stuff on
prem as well as in the cloud.
And I talked to equally large companies that are only doing things in the cloud.
And something I've learned is that neither one of those two company profiles generally
hires idiots to make their decisions for them.
There are contextual reasons in the bounds of what it is that they're doing that makes
sense.
I would not expect anyone at one of those companies
to listen to us talking on this podcast in any direction
and suddenly do a hard pivot in another thing
because I heard this on the podcast.
It doesn't work that way.
There is always going to be nuance in how budgets are done,
expertise that you can have available,
where you can get power,
large, complicated contract commitments
that you have in different directions.
The world is complicated,
and generalized guidance does not substitute for using judgment in the world that you're in.
I feel the need to say that just because periodically I have people come back and say,
well, we're trying to pivot our entire strategy based upon what you said. It's like, ah,
could I have a little more context? Don't put that juju on me. You understand that when I call
it Postgres-squeal or when I say that Route 53 is a database, that is shitposting. You should not actually do it, right? I just want
to make sure that people are taking the right lesson away from these things, which is evaluate
things in a full context of what you're doing, not because some overconfident pair of white dudes on
a podcast had some thoughts. And that's one thing that I've learned a lot over my years of since writing a book
about cloud infrastructure to today.
And most of my time has been working in these,
you know, in a cloud, but also on-prem
and seeing what that balance is.
And it is, it very much depends.
But I think that a lot of times people get blinded
by just forgetting to make the decision of like,
oh, should I not go to the cloud
or should I not be on-prem?
And they just blindly go to one or another
and they make decisions about that environment.
They kind of just forget about
the other side of it completely of like,
oh, what is the math of buying a rack of servers
and putting it in a colo and saying like,
oh, I get infinite bandwidth of those things.
It's like, okay, is that possible?
I don't know.
Should I?
I don't know.
I have to hire people.
What does that mean?
Okay.
What does it mean with depreciation?
Wow.
Okay.
Like that may be the wrong decision, right?
Like if you, if you need GPUs today and you don't want contracts on them, like I love
the cloud.
The cloud is amazing for experimenting.
I've learned a lot and being able to do things and, and built plenty of infrastructure and
plenty of applications that run in the cloud.
And then at the same time, like I have, I love being able to plug a physical cable in and being able to feel the heat from a computer. I still think today that like all of this
greenwashing of like data centers being green and whatnot, and like saying like, oh, we can go to
the cloud because it's, it's renewable energy. And it's like, well, it's kind of buybacks. And
I was like, the more time people spend in a hot aisle of a data center, the more they will realize
what they are doing to the environment. And touching metal and feeling heat from actual machines will make it actually real to you
of saying like, oh, those thousand computers I just created have an impact.
And what does that actually mean for what I'm doing?
Is what I'm doing worth it for the world and for humanity?
All those things, GPUs and AI included.
That's the challenge I have with a
lot of the narrative around this. It feels like it's trying to shift the blame in some ways onto
individual users. Whereas if I'm building some Twitter for pets style thing, and I start Googling
about the most cost-effective way to run my serverless architecture from a green perspective,
I have a bigger carbon footprint for those Google searches than I do for the actual application that
I'm running because it turns out dogs don't tweet or technically they do, but we couldn't find
a way to make them racist enough to bootstrap a Twitter clone.
Especially with AI built into Google now.
Oh my God. Yes. Hey, instead of answers, how do we just make things up? Because it sounds good.
Like have you put any glue on your pizza lately?
Oh, I couldn't believe that when I saw that earlier today, that was, yeah, that strikes me as the kind of thing computers come up with.
And okay, great.
But why are we putting that front and foremost instead of actual expertise from people who
have been there before?
I think a lot of companies have forgotten that they've built trust over years and years
and years.
And that trust can break really quickly with something that was not well thought through. And trust is a very hard thing to keep.
It feels like Amazon earned trust so they can then squander it by telling everyone that they
are leaders in AI and then demonstrating again and again and again just how far from reality that is.
It's frustrating to me.
You don't have to be a leader in AI as long as you're ahead of someone else, right? As long as you're ahead of your customer, like that is the message for a lot of this is like, oh, I'm a
leader because I took one more step before you did. I'm old and conservative when it comes to
particular expressions of technologies, file systems, databases, the stuff where mistakes
will show. I'm one of the last kids in my block to adopt those things. And it seems to work out reasonably well for me.
I don't think that you need to basically pivot your entire company for what right now is
extraordinarily hype driven. Yes, there's value in AI, but no, customers aren't expecting you to find,
unlock it, and then deliver it to them with a gift wrap bow on it. It's too early for that.
But the stock market is, and that's where the...
That's the problem. Remember it was customer obsession and not
market financial analyst obsession?
When the customers are giving you money.
Exactly. Well, that's the trick of it too. When I talk to my customers who are spending,
in some cases, hundreds of millions a year on AWS and half our consulting work is
contract negotiation with them, with AWS. We're not seeing
people do what the narrative would have you believe, which is, well, we're spending a hundred
million a year right now, but on our next commit, let's make it 150 because of all of that gen AI
stuff. It's small-scale experiments in almost every case. And those same small
experiments are trumpeted in keynotes as this company is radically transforming the way that they do business with the power of AI.
Meanwhile, a week beforehand, we're talking to them.
It's, yeah, we've done this thing.
It's an experiment.
Yeah, there's a press release or two in here, but we're winding the effort down because it hasn't been worth the effort and energy and cost it takes to do this.
So we're keeping an eye on it, but it's not substantial to what we're doing.
So it's the intentional misrepresentation of what people are doing with these things
that starts to irk me more and more
because they can't change their minds
and they won't change the subject.
Sorry, I'll rant about this all day if you'll let me.
But I want to thank you for taking the time to speak with me.
If people want to learn more about what you're up to now,
where's the best place for them to find you?
JustinGarrison.com is my website.
I highly encourage people to buy a domain and run their own websites.
I've been pushing that for years now as more and more things shift to platforms.
I love owning a website. I've been running a website for almost 20 years now and blogging
almost monthly for 20 years.
And after 20 years of blogging, it turns out that like I have a lot of bad takes,
but also it's just like the place that you can find me.
And it's the place that I primarily want to put my thoughts
and want people to reach out to me.
That's a really good philosophy.
I've done something very similar historically.
People are like, oh, you have a big Twitter audience.
Yeah, but I have a bigger newsletter audience
because I own that domain.
I have people's email addresses.
I can land in their inbox
whenever I feel I have something to say
because double opt-in and consent are things.
And I don't have to please the whims of a given algorithm
in order to reach out to people who've expressly stated
they want to hear what I have to say from time to time.
There's so much value in that.
Except for I used Revue for my newsletter,
which got shut down when someone decided
to shut it down at X.
Yes, that's why I was talking to some of those folks early on. I built my own custom system.
Never do that. But I can port the thing between any email service provider or, God forbid,
I can go back to the olden days of running my own series of Postfix mail exchangers if it really comes down to it. I'm not saying that migration will be painless and there might not
be a week or two of delayed issues, but it's definitely something that is possible to do because, surprise, I back
up the database from time to time. Owning the stack gives you options. We will, of course,
put a link to that in the show notes. Thank you so much for taking the time to speak with me.
I appreciate it. Thanks, Corey. Justin Garrison, director of DevRel at Sidero. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform
of choice.
Whereas if you've hated this podcast, please leave a five-star review on your podcast platform
of choice, along with an angry, insulting comment disparaging any of the opinions we've
just had.
But be sure to mention which on-prem provider or
which cloud provider you work for so we understand the needless, grievous personal attacks.