Screaming in the Cloud - The Pros of On-Prem Kubernetes with Justin Garrison
Episode Date: June 6, 2024

Justin Garrison, Director of Developer Relations at Sidero, joins Corey to discuss Justin's experience transitioning from large companies like AWS and Disney to a more agile company like Sidero, the benefits of using simplified Linux distributions like Talos OS for running Kubernetes, and the pros of on-premises setups for certain workloads. The conversation touches upon challenges with cloud provider limitations, the impacts of computing power on both an economic and environmental scale, and Corey and Justin's frustration with businesses touting their use of AI when they've already abandoned those projects.

Show Highlights:
(00:00) - Introduction
(01:09) - Justin's Background and Career Journey
(02:39) - Transition to Sidero
(03:51) - Using Personal Devices for Work
(08:09) - Talos Linux and Kubernetes
(15:19) - Kubernetes Upgrades and On-Prem Challenges
(19:21) - Building Your Own Cloud Platform
(21:52) - Multi-Cloud vs. Hybrid Cloud
(25:15) - Scaling and Resource Management
(28:02) - Gaming and Cloud Bursting
(32:46) - AI and GPU Challenges
(34:54) - Balancing On-Prem and Cloud Solutions
(40:49) - Final Thoughts and Contact

About Justin: Justin is a historian living in the future. Lucky enough to play with cool technologies and hopeful enough to bring others along for the ride.

Links Referenced:
Justin's Website: http://justingarrison.com
Justin on Bluesky: https://bsky.app/profile/justingarrison.com
Justin Garrison on LinkedIn: https://www.linkedin.com/in/justingarrison/

Sponsor:
Panoptica: https://www.panoptica.app/
Transcript
How does this actually work when you're doing this drastically?
And in most cases, it says like, no, actually, you just want to pin.
You want to pin your workload to a type of node and always make sure you have the same amount of cores.
Welcome to Screaming in the Cloud.
I'm Corey Quinn, and I'm joined today by Justin Garrison, who these days is the director of DevRel over at Sidero.
Justin, thank you for joining me.
Thanks for having me, Corey.
This episode's been sponsored by our friends at Panoptica, part of Cisco.
This is one of those real rarities where it's a security product
that you can get started with for free, but also scale to enterprise grade.
Take a look.
In fact, if you sign up for an enterprise account,
they'll even throw you one of the limited,
heavily discounted AWS skill builder licenses they got
because believe it or not,
unlike so many companies out there,
they do understand AWS.
To learn more, please visit panoptica.app
slash last week in AWS.
That's panoptica.app slash last week in AWS. I have to say this. One of
the things I adore about having you on this show is that there's a standard intake questionnaire
of the form I send people when I invite them onto the show. And one of the fields that's there is,
does this need any PR review? Think legal counsel, corporate comms, PR folk, et cetera. And your answer to it
was, ha ha ha, no, in all caps, which is just the best answer I think I've ever
gotten to that. And basically perfectly embodies my philosophy on these things. So thank you for
making me smile. Yeah. You know, I do what I can. I've, I've spent my career mostly at very old or
very large companies. And every company I've worked for before Sidero
was either 100 years old or over 100,000 people. And this is the first time that I'm not
in one of those situations. And it feels really good, actually. You finally get to say what you
want to say, how you want to say it, to whom you wish to say it. And that is no small thing.
I mean, having an opinion that doesn't align directly with the company is a good thing
sometimes. You were recently at AWS. And before that, you were at Disney, both of which are, to
say that they're large companies dramatically understates the situation by a fair bit.
Before I did this, my last job was at BlackRock, and I basically sat there and stewed for almost
a year, unable to say anything that looked like an opinion in public, particularly in
the way that I like to share opinions. I can only imagine, given that you've spent many years in those types of
environments, what shenanigans you're going to get up to now, given that, to my understanding,
Sidero is a little bit smaller than, you know, companies with four commas in their market cap.
Yeah, it's been really different. I've been at Sidero now for four months. And already it's just, I was thinking back to what it was like at Amazon at four months.
I just got my laptop when I started.
It took three months to get a laptop when I started at Amazon.
I was like, oh, okay, this is big company business.
And I remember at Disney, it was about three and a half months before I had my first ticket
assigned to me, my first project.
It was just three months of reading docs and going to
meetings and figuring out what people do. I understand the onboarding and reading docs.
I did a consulting project that lasted six weeks at a big company once, the first four of which
were spent waiting for them to turn on my AD account, which was, okay, big companies are going to big company,
but without a laptop, how could you even read the docs?
I couldn't. I actually had people sending PDFs to me in various ways that weren't supposed to be allowed
because I had nothing to do.
It was actually during OP1 season when I started too.
And if you know OP1 season on Amazon, there's a lot of docs to read.
And so, but it was entertaining because I had a personal Chromebook, and nothing
of theirs supported Chromebooks.
And so it was just like, okay, well, I have a phone and a Chromebook and I can't do what
I'm supposed to be doing. So I spent some time playing with new services in my personal
AWS account to learn what they did. I do a lot of my computing work,
especially on the road, from an iPad, and most things, yeah, don't tend to support that as
a primary means of interaction. All my dev work is done on an EC2 box for a number of reasons,
but it just strikes me as weird in that at big
companies in particular, I would never, as an employee, expect to be able to use my personal
devices for things, because it's always going to come with these restrictions of, great, install
this MDM on your stuff so we can do the corporate management thing. And it's like, how about you
provide the equipment you want to have me engage with your corporate systems, or you leave
me alone.
Like, oh, when you start rolling out MDM for mobile devices where it's great, we want to be
able to wipe your cell phone. My position on that is great. You're going to get me a corporate cell
phone. Alternately, I'll just be in touch when I happen to be on the laptop. It's not a great
position to be in, but I've heard too many horror stories of not just, it's not just malfeasance you
have to worry about. It's accidents.
Someone in Corp IT accidentally wiped away your personal phone while you're on a trip
through no ill intent.
Great, now what?
My entire life lives on that thing.
I was part of the team at Disney Animation
that rolled out MDM for people.
And so I definitely know the struggle of like,
I don't want to do this.
It's not like, but it's my job, right?
Like I have to do this thing.
Even on corporate machines here,
we have a very light touch Jamf profile
that is only on stuff that we own and control.
And all it does is it enforces screensavers,
password strengths, and encrypting the disk.
So I don't have to report a data breach
when you get it swiped out of your car trunk.
Awesome, great.
This is stuff that anyone who wants to, who works here, can pull up at any point. Hey, pull this up and take a look;
I want to make sure you're not doing something underhanded. Absolutely, I'm not. It's very
straightforward. And it's the stuff that I, like my entire philosophy is I will never ask employees
to do something that I won't do myself. And not just because I happen to be the one holding the
switch because those things can
change pretty quickly. Yeah. And I actually took your lead on joining the new company where my,
my primary or only mobile device is an iPad now. And, and partially because they got good enough,
right? Like the software is still in the middle there, but I was like, I need something that I
do a lot of drawing. I like animation. I like doing that side of it. And I like the iPad and
the pencil format. And I do a lot of video editing and DaVinci works on the iPad.
And so like that combo has worked really well for me of just like, hey, I want a single
purpose device.
I do writing and those sorts of things.
And then I have a shell that remotes back to my desktop in my home lab and I can do
the things that I need to do for work.
It's been working really well for me.
We are very different ends of a particular spectrum where my version of a user interface that's good
talks about like the position
of command line arguments in something.
I use VI for most of my writing,
although I do admit for code increasingly,
I'm drifting in a direction toward VS Code
because it's gotten pretty decent.
But yeah, everything I do
is just green screen terminal style stuff.
Oh no, Blink on the iPad is great.
Blink as the terminal is an excellent terminal app on the iPad.
That's where I do all my writing.
It's in Vim.
It's remote back to my box.
And yeah, with Tailscale and a 5G connection, it's awesome.
It sounds like we're using the exact same stack on that.
I've used their VS Code implementation just by typing code and the path to a directory.
I gave up on code.
I was using code at Amazon for... I was like, I want to switch.
I want to switch off Vim.
I'm finally going to go into the GUI world of IDEs.
And I used it for four years at Amazon.
And I'm like, I don't like this anymore.
And I just went back to, actually went to NeoVim with LazyVim as a framework on top.
And it brings a lot of those things that I liked in code automatically, like all the
pop-ups and things that get in the way, all those come with it. It's great. One thing that I've done that I think is
probably just the side of a war crime is I've gotten GitHub Copilot to work in Vim when prompted.
But unlike in a lot of other environments, it doesn't automatically slap the text and it has
to explicitly be asked, which is kind of important. Yeah. It's auto auto-filling for me. So it's
disabled by default. It's just like, it just gets in the way too much. Yeah. When I want a robot's
opinion, I will ask explicitly for it. It's like that old shitposting meme picture of, a robot will
absolutely not speak to me in my holy language. I am a divine being. You are a pile
of bolts. How dare you. Just this over-the-top screaming-at-a-computer thing, which, yeah, that's very
much my bent.
So what are you doing at Sidero these days?
What do you folks do exactly?
At Sidero, we mainly focus on people having problems with on-prem Linux and Kubernetes.
And so it came from actually starting Talos Linux, which was a single-purpose Linux distro.
And in my days at Disney, it was at Disney Animation, I was doing on-prem Kubernetes
and we were using CoreOS and it was like a fantastic, like, oh, simplified the RHEL stack
to be basically just systemd and container runtime enough that I could bash script my
way into running a Kubernetes cluster.
And in around that time, Andrew, the CTO at Sidero, he started this
Talos Linux thing. And it's like, no, like PID1 is just an API. There's no SSH. There's no,
you don't need users. Everything is an API, API driven Linux. And all it does is run Kubernetes.
It does nothing else useful besides run Kubernetes. And I was like, that's an appliance.
That's amazing. And so I remember when he announced it on Hacker News, I emailed him immediately.
I was working at Disney.
I'm like, I want this to exist in the world.
This should be a thing that is available.
And throughout my career, I've stayed in touch.
I've kind of used it here and there.
And I went over to Amazon
and we launched Bottle Rocket
as a direct competitor to Talos Linux.
And then Bottle Rocket kind of went off
and got re-orged a couple of times
and does some other things.
No, you're kidding.
A re-org at Amazon?
No.
It doesn't do the same things that it used to.
And it's not really focused on Kubernetes anymore.
It's like this weird container sort of thing.
And of course, CoreOS got bought twice and it does some other weird things.
And that kind of like shifted off.
And then this Talos Linux thing, it just kind of stayed in this like,
we do Kubernetes and that's it.
And it's everything from bare metal.
You can see on my bench behind me, I know it's an audio podcast, but I have like a Raspberry Pi running it and a laptop running it. Like I'm setting up a lab
for a talk later. And I'm just like, oh yeah, this is like, it runs on very small stuff. And
then very big stuff. We have an Oxide rack that we're testing it with. And so it's like
big metal. And then also like cloud providers, it's just, it runs everywhere. It's Linux and
it just runs Kubernetes. And that's like the really key point of like, we have this streamlined Linux OS because people are like,
you have to learn Linux to know Kubernetes. It's like, actually you can lower the bar.
You can make Linux disappear and just make it an API. You have to dial it in super well to get
there, but I believe it is possible. It's a, that is kind of the dream where you don't have to think
about these things anymore. I mean, I wrote a blog post a couple of years back on the idea that nobody cares
about the operating system anymore,
which of course was dunking on Red Hat at the time
for deciding they're going to basically backtrack
on the CentOS long-term support commitment.
Surprise, you need to pay us now.
Everyone loves hearing that.
And it was a bad move from my perspective
because most companies don't care that much
about the operating system the way that they once had to. Some people need to care very much, but it's not a necessary
prerequisite to build things now. Especially when you have something like a Kubernetes clustering
system on top, you want the OS to disappear. It should not be a hindrance. It should get out of
your way. It should let you debug some things. It should have some access to do like, hey,
what's going on? But beyond that, it should just disappear. And that's exactly what Talos
was meant to do. And
even when I started, I started
digging into it more because I haven't touched it for a little
while at Amazon. And I was like, hey, how
many binaries do we have on the system? And I was like, oh,
I can just list out this directory. I was like, oh, there's 12.
There's 12 total binaries. I was like, what?
How is this possible? What are you doing here?
And like, oh, yeah. And I started counting other Linux
distros. And it's like, you know, RHEL has like 7,000, Ubuntu 6,000,
even the smallest Bottlerocket is like, you know, 1,400, 1,300 or something like that.
I mean, technically you could get a lot of it down to near one with BusyBox. It just
changes its behavior based upon which symlink invokes it.
And there are some symlinks in there. Like if you count all the linked binaries,
there's like 30, but it's like most of them
are LVM, right?
It's like, half of the binaries are for like disk management.
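To make the "no SSH, everything is an API" idea concrete: on Talos you don't log into the node at all; you declare what it should look like and apply that over the API. A minimal machine-config sketch, with field names from memory (check the Talos docs for your release) and a hypothetical hostname and disk:

```yaml
# Minimal Talos machine-config sketch -- field names from memory, hostname and
# disk are hypothetical. It is applied over the Talos API (for example with
# `talosctl apply-config`); there is no shell session to edit anything in place.
machine:
  install:
    disk: /dev/sda            # where Talos installs itself
  network:
    hostname: worker-01       # hypothetical node name
  kubelet:
    extraArgs:
      max-pods: "200"         # kubelet tuning is declared, not edited on the box
cluster:
  network:
    cni:
      name: flannel           # the CNI choice is declared the same way
```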
The thing that drives me nuts, and I've wanted this for years, even the stripped-down minimalist
distros aren't stripped down and minimalist enough, because their argument is, well,
if you get rid of some of these things, it'll be really uncomfortable to work in the environment.
It's production.
It's not supposed to be comfortable.
If you're copying your .files
to customize your shell to production,
you are doing it wrong in almost
every case. Stop that.
I got into Kubernetes finally
for basically losing a bet with the internet earlier
this year and gave a talk, Terrible
Ideas in Kubernetes. To do that, I now have
a 10-node Kubernetes cluster of my
own running in the next room on
a bunch of Raspberries Pi.
And the most annoying part of running the thing, to be very direct with you,
is in fact the underlying operating system.
Having to keep those things patched and updated and caring about those things.
I just want it to basically run Kubernetes as an appliance,
and please shut up and leave me alone for other things that aren't that.
And I can't get there.
That was exactly like at Amazon.
I was helping build the EKS Anywhere, the on-prem EKS version of you should run Kubernetes
in this environment with this set of Linux distro and stuff.
And it was like, it was automated, but it was difficult.
And it was like, oh, we could do this stuff.
And it had this cluster API and all these things that got way too complicated.
And I was like, hey, I need to look at what the competition's doing. Like what are other people doing? And again, I was like, Oh,
I know Talos, like, I'll just spin it up. And I was like, okay, well I get the API, like I can get
the commands. And I actually booted the main product that we sell at Sidero, which is Omni.
And it's a, it's a SaaS version of like, you can run, you can literally put it on a USB drive,
boot a machine and it connects back. You have a Kubernetes cluster, like magic. Like the first
time I did it, like, wait a minute, I missed something. Like there's got to be something I'm doing wrong because this part of it was too easy.
Like I went from booting a USB drive to a Kubernetes API in like two clicks.
And I was like, I don't know what I just did wrong, but let me go try it again.
And I started internally at Amazon, like making like competitive, like, hey, we should look
at what's going on over here.
Like this is not cluster API.
This is not complicated.
Like I didn't have to get the dev team to, like, troubleshoot anything.
It's like an ERP system.
You buy the thing and the real money is for the consultants who wind up spending the next
four years of their lives dialing in all the 100,000 configuration options just right for
you.
I don't think that we're at a point where that needs to be the case.
You can, I imagine they could be tunable in a bunch of different ways if you absolutely
had to be, but the mean path, the common path for most folks should not involve
that level of obscene customization. And absolutely. There's a place for,
I want to learn it. I want to do it the hard way. I want to go through and like, I need,
I want to learn the Linux steps because I need to for a career, whatever. I've done it right.
Like I was doing that. I was part of the initial, like a SIG on-prem inside of Kubernetes.
Like we were like building Kubernetes on-prem.
I was one of the chairs for it
when it originally started.
And it's like, it was really, really hard.
And all of my systemd units
were like curling down hyperkube
and all this stuff.
And it was like, it was automated,
but it was hard.
And now it's like,
this should just disappear.
That whole layer from the TPM,
like you want trusted boot
all the way up to Kubernetes API,
that should just be handled. Like we should be able to not automate that and just abstract it away like Kubernetes does for a lot of other stuff when you run a pod, right? Like I want a deployment.
I just want the service load balancer available. I don't care how you get there.
Few things are better for your career and your company than achieving more expertise in the cloud.
Security improves, compensation goes up, employee retention skyrockets.
Panoptica, a cloud security platform from Cisco,
has created an academy of free courses just for you.
Head on over to academy.panoptica.app to get started.
How do you handle things like Kubernetes version upgrades
when there are no moving parts?
All the Kubernetes components are part of containers that get run, right?
So like we have like system containers that run.
And so it's like you can shift out like I have a, you know, Talos OS version.
I say, oh, upgrade Kubernetes.
And when you have a declarative spec, right, like Kubernetes can say, like, I know how
to roll through upgrades of a service or a pod or whatever.
We can roll through upgrades of the components of a control plane.
We're like, okay, well, etcd goes first.
We know how to make sure etcd is healthy.
We know how to roll that out
if you have a highly available setup.
We do one at a time
and we just do the steps you're supposed to,
just like Kubernetes does
with the application layer.
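For a sense of what "the Kubernetes components are just containers" means in practice on Talos: the versions themselves are declarative fields in the machine config, and the upgrade tooling (something like `talosctl upgrade-k8s --to <version>`) walks each node through bumping them in order. A hedged sketch, with field names from memory and hypothetical versions:

```yaml
# Hedged sketch of the relevant Talos machine-config fields -- names from
# memory, versions hypothetical. An upgrade is just these images changing,
# rolled out one node at a time, rather than packages being patched over SSH.
machine:
  kubelet:
    image: ghcr.io/siderolabs/kubelet:v1.30.1
cluster:
  apiServer:
    image: registry.k8s.io/kube-apiserver:v1.30.1
  controllerManager:
    image: registry.k8s.io/kube-controller-manager:v1.30.1
  scheduler:
    image: registry.k8s.io/kube-scheduler:v1.30.1
```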
So you don't, in this case,
need to do this from the perspective
of someone who is having
to babysit the process through.
You just tell the API to do it
and it does it, a solved problem effectively. It knows how to do it. Just expect that it might be in an
intermediate state for the next period of time until it completes. So maybe this is not the
same time to roll out a major deploy. Yeah, there are things in any infrastructure where you're
like, oh, if I'm going to upgrade this thing, let's hold off and make sure that we're not going
to cross send errors because something else happened, right? Change one thing at a time. So one of the things that I've learned through dealing with the
Kubernetes has been that it's been the first time in a long time that I've done stuff on-prem. I'm
used to doing everything in a cloud provider environment, but I didn't do that with one of
the managed Kubernetes services because I've seen the bills when those things start misbehaving.
And frankly, they are not pleasant. I don't want to spend thousands of dollars for basically a fun
talk that I'm giving
in terms of oops-a-doozy charges. And I'd forgotten on some level just how annoying
some of the aspects of running things in your own environment can be. Deviances between some
bad cables or a bad chip that winds up getting sent out. The joy of running power and the rest.
The challenge inherent in the storage subsystem.
For example, Longhorn is apparently terrible.
It's just the second worst option.
Everything else is tied for first.
And EBS in AWS land largely just works.
So you don't have to think about these things.
And you learn that, oh, just like DNS,
when the storage subsystem starts acting up,
so does everything else.
You will not be going to space today because you have surprise unplanned work. Now, I would never run things like this in
a production style environment. There'd be a lot more care and whatnot, but it just runs a bunch
of home lab stuff that I want to exist. But if it goes down for a day or so, it doesn't destroy my
ability to operate. Yeah, for sure. And I mean, there's always that, like people ask all the time,
like, hey, how do I get my storage inside this Kubernetes cluster? And I'm like,
do you have an external way you run storage today? Right. You don't have to reinvent everything.
Everything should not be a Kubernetes problem, right? That NetApp that you paid for,
keep using the NetApp. It's really good at doing storage, right? Like that's fine. Like don't,
don't think you have to shove everything in here. And I understand people, they want the interface,
they want something that looks familiar. I'm like, no, it's okay to have a team that manages NFS over there.
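In Kubernetes terms, "mount that in" usually means a statically provisioned PersistentVolume pointing at the export the storage team already runs, plus a claim your pods reference. A minimal sketch; the server address, path, and size here are hypothetical:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: team-share
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteMany"]
  nfs:
    server: 10.0.0.50            # the filer the storage team already manages (hypothetical)
    path: /exports/team-share
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: team-share
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""           # bind to the pre-provisioned PV, not a dynamic class
  resources:
    requests:
      storage: 500Gi
```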
And then you mount that in and those things are great. And having those things separate,
but again, the Kubernetes mindset of I have to run everything myself because I can,
is just kind of a way of, we've gotten in trouble with everything in the past,
like config management and everything had to be an Ansible playbook for a long time. I'm like,
actually, no, I don't want to SSH and try
to do this in YAML. I do manage the individual Pis with Ansible because it was either that or
a bunch of shell scripts running in loops, which no thank you. But yeah, it's a neat approach to
doing these things. Now, most of my use cases are very much contrived here in that there are
containers that I want to exist and run. I could run those anywhere, probably Docker Swarm or just running them in Docker Desktop would be more than sufficient for
my use case. But I wanted some workloads that were actually running things I care about.
And these are mostly Singleton style container approaches. A couple of them have services that
have a few containers talking to each other, but it's really not a Kubernetes shaped problem.
I've introduced a raft of unnecessary complexity for what I'm doing.
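On the Ansible remark above: keeping a handful of Pis patched really is about a dozen lines of playbook. A minimal sketch, with a hypothetical inventory group name:

```yaml
# Minimal patch-the-Pis playbook -- the `pis` inventory group is hypothetical.
- name: Patch the Raspberry Pi nodes
  hosts: pis
  become: true
  tasks:
    - name: Update the apt cache and upgrade all packages
      ansible.builtin.apt:
        update_cache: true
        upgrade: dist

    - name: Check whether a reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_flag

    - name: Reboot if the upgrade asked for it
      ansible.builtin.reboot:
      when: reboot_flag.stat.exists
```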
But now that I have it, I find that just spinning up new containers of services here to do things
to be relatively straightforward, I have become something of a reluctant convert to Kubernetes.
Now, I like it more on-prem than I do in a cloud provider environment, which is what
I want to get into with you, because it feels like it's a good way to basically build a
cloud-like platform
of your own. But when you do that on top of AWS, it's like, well, crap, why not? If you want to
go work at a cloud provider so bad, fill out an application. You can go and do it there for real.
If you're going to work in the cloud, use the primitives that they provide for you for your use
and you'll generally wind up happier has been my position on this historically
please fight me. I would fight back on the fact that, uh, for Disney+, Disney+ when I was
there was all built on ECS, and it was one of the larger installations of ECS, and
ECS is a native AWS service, and all of the things that you would want to do are very native to what AWS can do.
And in any, even starting small,
the first AWS native primitive
you're going to find yourself hitting is account limits.
Like that as like a cloud,
as like a thing that happens in the cloud is like,
oh, this artificial thing to protect the service.
I understand why it exists
and why I would want to build it.
But then it's like, I have to do that all the time.
And there were so many account limits.
And that's why people always want to have development done in separate accounts,
because if you don't, you can have all the permissions isolation in the world.
But if you exhaust the rate limit for EC2 DescribeInstances,
well, suddenly things like Auto Scaling groups and load balancers
will not be able to make those calls, and your production environment will once again
not be going to space today. Yeah. In a lot of cases, even at medium size scale,
I just basically say like, you could abandon a lot of your labels in AWS and just go different
accounts, right? Cause like, that's the real limitation factor of AWS is like, oh, this is
that account. And like, they're not talking to each other and you get that separate clean bill.
And you're like, for every, every application you're deploying, every team that wants some
resources, stick them in a new account.
And that becomes easier to a lot of degrees than trying to divvy up things and say like,
oh, well, this is these labels and they shouldn't talk to this other one.
And then you get this IAM nightmare.
Of course, when they have to talk to each other, you get into a lot of other problems. But in a
lot of cases, it was just like, you know what, as separate accounts, as the boundary to just
like a separate Kubernetes cluster, as the boundary for a lot of teams, it just makes a
lot more sense because even though you have RBAC in namespaces and other things, it just gets
cleaner and easier when you just say, this is a different API and you just go over there and you do your own thing. And when you break it and come back to me and say, I need another one, I'm just going to make you another clean one. I don't have to talk to anyone else.
And that was approaching a very specific thing that has been misinterpreted a number of times,
but cool, I'll live with it.
I should have been more clear.
I'll live with that on me.
But my point has been that for most folks,
just setting out to build something from day one
with the idea that it can run seamlessly
and just the same in any or all cloud provider environments
is probably not what you want to be doing
without there being an extenuating circumstance
because rather than embracing a provider,
it matters not which one, and taking advantage of all the things that they
have operated.
Instead,
you're doing things like rolling your own effective load balancers because no
one has a consistent load balancing experience or a provisioned database
service.
That's a lot of undifferentiated heavy lifting that you could be
spending on the thing that your business does.
There are exceptions to all of this,
but that's the general guidance that I've taken on this.
I don't feel the same way about hybrid,
because hybrid I don't think is something that people set out to achieve.
I think it is something that happens to them.
My theory has been that most hybrid environments
are someone trying to do a full-on cloud migration,
running into a snag somewhere along the way,
such as we have a mainframe and there is no AWS/400.
Oops-a-doozy.
Declaring a victory midway through saying we're hybrid now and then going to focus on
something else.
It's cynical, but also directionally accurate from what I've seen.
Agree?
Disagree?
Two parts there of like multiple clouds.
I absolutely agree.
Like you should go into one cloud specifically.
And I call that undifferentiated heavy clouding, when you try to make your own cloud on top of clouds. Any significant-size enterprise is going to have a cloud team that basically abstracts some of that away, of like, oh, we want this to work everywhere, and this Terraform provider should go to every cloud. I'm like, actually, maybe you don't. Maybe you should just keep that as clean, as simple of an interface as possible.
Because again, account limits.
And also, if you're not actively testing the stuff by running it active-active across multiple
providers, it's like a DR plan that isn't being tested.
The next commit after you test and get and validate your DR plan has the potential to
render the DR plan irrelevant.
So it has to be consistently tested.
If you've never done a restore, you don't have backups.
Exactly. No one cares about backups. Everyone cares about the restore. And similarly,
if you are, if you are, there's some people that desperately need certain things to run multi-cloud. If you're a telemetry vendor like Datadog or whatnot, you need that stuff to live
where people are running their workloads. So yeah, that thing needs to live everywhere.
You know what doesn't necessarily though is your account management stuff
or your marketing site
or a bunch of other services.
It would even, like I was a Datadog customer
and even like, does that need to run in AWS?
The only reason that needs to run in AWS
is because there's egress fees, right?
Like that's the, like the business model of the cloud
directly impacts the architecture of someone else.
And that becomes a problem.
And so that for sure is like a thing that exists.
But yeah, you have to be where your customers are because of that stuff.
I actually just saw a headline today that Azure is dropping cross AZ charges.
I did not expect Microsoft to be the leader in that particular direction.
But I usually think of Microsoft when I want someone to basically treat security as a joke,
not to be avant-garde in terms of data transfer pricing.
Yeah, competition is good.
But yeah, to your point,
like you either have to go with Azure
and hope you don't get hacked
or go with Google and hope that they don't shut it down.
Like those are your choices right now for a lot of things.
But back to the hybrid conversation.
Absolutely, like it becomes a thing of like,
is this architected to be hybrid
or is this a failed half attempt of like,
actually that stuff just is going to stay on-prem.
And the successful situations that I've seen are the, we just need
to burst and we have compute, right? And that's like, we want the elasticity of compute and compute
resources in the cloud. But we want to keep that majority; anything that would be a Reserved Instance
or a Savings Plan, that should be on-prem. It's going to be always cheaper and you don't
ever have to worry about cross-AZ because you own the switches, that sort of stuff.
And then like that extra,
anything you would run in Spot,
yeah, go run it in Spot in a cloud provider.
And how you get there becomes a lot harder
for a lot of people.
And they think that Kubernetes is going to solve that problem.
It's like, oh, well, I have a Kubernetes.
So now I can have another Kubernetes
and we can just deploy twice.
I'm like, no, not really,
because there's a lot there, right? Yeah, and no, no two Kubernetes installations are alike. There's
always different prerequisites of what services and how people have approached these things. It
feels like we've come full circle, because this was the guidance people gave back in the
aughts when cloud
own the base, rent the peak. And that's the way that people always approached it. Somehow that
changed to, oh, put everything in the cloud. And there are values and validity to that approach,
especially at small scale. When you're a startup building something, you should not be negotiating
data center leases. At some point of scale, though, the economics start to turn. And ideally,
that's when the scale discounting starts to come into play. But, and I used to be able to say this, I
can't anymore, but I used to say that I don't know too many companies that spend more on cloud
infrastructure than they do on people to operate that cloud infrastructure. And then we had a bunch
of AI companies go and basically spend $100 million that they can't pay on their cloud
bill. Stability, I am looking at news articles about you on this.
And okay, yeah, step one, you scam the VCs out of money.
And then step two is you go buy all the GPUs from cloud providers.
You got your order of operations confused, and now you're having a fire sale.
That's unfortunate.
Sure.
And I mean, to the first point of bursting, peaking into the cloud, the most successful
times I've seen that done is when I was at Disney Animation, like our render jobs were like, Hey, the movie has a date.
We know how long it takes us to render things. We don't have enough computers, right? And like,
we could just do that math. We're like, this is where it lines up. Okay, well, let's just go find
cheap compute somewhere. And then just we'll spin up, you know, more, we'll get more CPUs. We'll
render the movie. And then when it's done, the movie's done and we shut it all down. And literally
we built and we were creating boxed software,
right?
And that sort of business model,
we can shut it all down
as soon as we were done with the movie.
And it was great
because we could just cleanly
turn it off.
We're like, hey,
this is just coming out of the budget
and we got to make it back
in the movie, right?
Like this is how money works.
The other one is like
those big launches
where it's like, oh,
it's a huge peak at the front,
like gaming, right?
Like they're like, oh,
we want on-prem
because we have to be close
to users or whatever. But we just, we don't want people to not be able to
log in on day one. And after two weeks, everything kind of settles down and we can figure out from
there. And those are perfect examples of like, how do you extend that? How do you make a stretched
environment that isn't, you know, completely separate in one area or another. And in a lot
of those cases, you want that, like, consistent scheduling of, like, hey, at Disney Animation, we had a custom scheduler that we would, you know, spin up jobs anywhere
we wanted to. And with Sidero, like, we have this thing called KubeSpan,
which is a WireGuard connection that makes a mesh between your nodes. And it's like,
you join the same cluster. And at that point you have one cluster that you say, hey, I can still
run Cluster Autoscaler or Karpenter or whatever, and say, hey, when I need compute, automatically
grab me a few.
They're not going to live forever.
It's only going to be for a temporary time.
But I'd rather give a user a good experience than bifurcate my deployment engines and run
two things.
And that becomes the really hard thing to maintain.
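For reference, the KubeSpan piece is a Talos machine-config setting rather than a separate product. Roughly this, as a hedged sketch with field names from memory: with it enabled on every node, on-prem and cloud machines form a WireGuard mesh and join one cluster, which is what makes "burst into the cloud without a second deployment pipeline" workable.

```yaml
# Hedged sketch: enable KubeSpan in the Talos machine config (verify field
# names against the Talos docs for your version). KubeSpan relies on the
# discovery service to find peers across networks.
machine:
  network:
    kubespan:
      enabled: true
cluster:
  discovery:
    enabled: true
```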
Gaming companies are fascinating in that space because in many cases, they have to both get
as close to users as physically possible, and they have a very strong ebb and flow throughout the day, and almost a follow the sun type of style.
And I haven't looked into details of it. It depends on a case-by-case basis, but very often it's the evenings that you're following around there because most people don't play video games from work except during a pandemic, but that's neither here nor there. So they basically distill something down
and it just needs to be close to users.
It's highly dynamic throughout the day
and it basically can run anywhere
that can handle a Docker container, give or take.
So you have an x86 or ARM instruction set
that you can run there, great.
You can run the thing that we need to do.
What I love looking at is people
with sustained workloads in clouds.
What is the average lifetime of an
instance? How scaling are you? Are you just treating the cloud like a big dumb data center?
That's not economically great. It does solve a number of problems, don't get me wrong,
and I don't have enough context without more information to say whether this is good or bad,
but when I see something that, oh yeah, we're running heavily on spot, we are scaling up and
down constantly throughout the course of the day, Here are the workloads and here are the patterns where it shifts. Yeah, I'm not
going to suggest those people move to on-prem in almost any case, just because it makes little
sense for them. But that's not everyone, especially at enterprise. There's an awful lot of effectively
permanent pet style EC2 instances living around where if you already have a data center and the
staff able to run it, maybe there's an economic story to not be there.
I spent a lot of time at AWS working with the Karpenter team on the node scheduler in
Kubernetes.
Karpenter is this workload-native thing that can do that dynamic scaling really well.
And I've changed my mind a lot about how autoscaling works in a lot of cases because it just takes
a lot of engineering.
And in a lot of cases, I still prefer like, hey, you know what?
If you're auto scaling, let's say you're going up and down by 30%, which is a pretty big
shift throughout a day, of like, we get 30% more instances and we shut them all down.
And I'm like, actually, again, that steady state of just over provisioning on-prem is
going to be cheaper, not only cheaper, just from the budget perspective of like, hey,
on-prem machines cost less, but also the engineering time of making sure that you're getting instances that you need, that
you're not getting ICE'd from your cloud provider and saying, like, oh, those instances aren't
there.
Or even the performance impacts of a Linux kernel when you're scheduling a process and
that same application lands on a machine that has, say, four cores and lands on a machine
that has 24 cores.
Even if you have reservations set in your application, the performance characteristics
of how the kernel schedules you on the processor is going to be drastically different.
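The "pin your workload to a type of node" fix he describes next looks roughly like this in Kubernetes terms: a nodeSelector keyed on a label you put on one homogeneous pool, plus requests equal to limits so the pod lands in the Guaranteed QoS class (and, with the kubelet's static CPU manager policy, whole-CPU requests can get exclusive cores). The label, image, and sizes are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: latency-sensitive-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: latency-sensitive-app
  template:
    metadata:
      labels:
        app: latency-sensitive-app
    spec:
      nodeSelector:
        node-pool: steady-8core            # hypothetical label on one uniform node pool
      containers:
        - name: app
          image: registry.example.com/app:1.0   # hypothetical image
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              cpu: "4"                     # requests == limits -> Guaranteed QoS
              memory: 8Gi
```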
And we've had a lot of performance analysis of, like, how does this actually work when
you're doing this drastically?
And in most cases, it says like, no, actually, you just want to pin.
You want to pin your workload to a type of node and always make sure you have the same amount of cores. I'm like,
well, then you should just buy a handful of racks of servers and just stick it on there and then
just not use it as much, right? Cause I know that it looks like waste, but really it's not. It's
like, hey, but it's still cheaper than doing the engineering work of all of that effort to say,
I want to scale up and down. Which is cool, you can, it looks really neat. Those graphs are
amazing, but also you wasted a lot of time when you could have just like bought two boxes and
said like, Hey, guess what? We're done. I do find it fun when I'm talking to customers about this,
when they talk about auto scaling, cause it's hard. It's great at giving you the capacity you
need 20 minutes after you needed it. And they talk like, I see some significant peaks in some cases
and was talking to one and I'll change some details a little bit here. Cause you know,
you don't out customers here, but it was effectively the equivalent of, yeah, we do
have to scale massively for incoming rush, but those spikes are also at Formula One races. And
spoiler, they schedule when those things are going to be. We don't wake up at two in the morning,
surprise, impromptu race where we have to scramble to auto scale. So yeah, things like that make a fair
bit of sense. I will say that there is a bit of a countervailing point now, which is the proliferation
of AI, by which I mean the latest scheme people have come up with to sell NVIDIA GPUs. Before this,
it was crypto. Before that, it was gaming. And it's so hard to get...
Exactly. You're holding an NVIDIA GPU.
You know what you're up to.
Increasingly, the hard part has been finding them, particularly at scale.
So it's almost an inversion of data gravity, where people are moving workloads to wherever they can get the requisite GPUs.
They'll eat the data transfer costs.
They just need to be able to get them somewhere.
And looking at the cost that cloud providers charge for these things versus what NVIDIA does, assuming you can get them, yeah, you'll hit break even in a disturbingly
short number of months if you can get them. And right now, that seems to be the hard part.
I mean, I know companies and people that have enough money that they have them and they can't
use them because they don't have a data center that is capable of racking it. Like it's not just buying the thing.
It is prep work of like,
hey, guess what?
At some point that colo
doesn't have the cooling system
and power requirements needed
for those, you know,
A100s, H100s, whatever it is.
That's like, oh yeah,
we have these like giant NVIDIA GPUs.
We want you to put them in a colo.
I'm like, no, actually
you do have to plan something.
You take them home,
you plug one of them in,
they're half rack size things
and suddenly all the lights
on your block go out.
Yeah, I mean, I have,
I got solar power installed in my house,
which has been great,
but like my batteries would drain
in two seconds
if I was like running actual stuff here.
But like, yeah,
there has to be some planning.
And I think in a lot of cases,
this like knee jerk reaction
of catching up
has caught up to too many people
of saying like,
I have to have that now. And it's like, actually let's just take two minutes and do a little bit
of math and see what do you actually need? What are your actual, and I know again, a lot of it's
stock price, a lot of our quarterly earnings, a lot of those things have to be in this framework
of business decisions. But like at some level, someone should be smart enough to say,
okay, where's the math, where do these lines cross? What do I need to do? Hey, guess what? I can build
that in my next budget. That can be a six-month project, not a six-day project. The more I do
this, the more I realize that the common guidance falls apart when it comes to specifics. Because I
talk to very large companies that are doing a lot of stuff on
prem as well as in the cloud.
And I talked to equally large companies that are only doing things in the cloud.
And something I've learned is that neither one of those two company profiles generally
hires idiots to make their decisions for them.
There are contextual reasons in the bounds of what it is that they're doing that makes
sense.
I would not expect anyone at one of those companies
to listen to us talking on this podcast in any direction
and suddenly do a hard pivot in another thing
because I heard this on the podcast.
It doesn't work that way.
There is always going to be nuance in how budgets are done,
expertise that you can have available,
where you can get power,
large, complicated contract commitments
that you have in different directions.
The world is complicated,
and generalized guidance does not substitute for using judgment in the world that you're in.
I feel the need to say that just because periodically I have people come back and say,
well, we're trying to pivot our entire strategy based upon what you said. It's like, ah,
could I have a little more context? Don't put that juju on me. You understand that when I call
it Postgres-squeal or when I say that Route 53 is a database, that is shitposting. You should not actually do it, right? I just want
to make sure that people are taking the right lesson away from these things, which is evaluate
things in a full context of what you're doing, not because some overconfident pair of white dudes on
a podcast had some thoughts. And that's one thing that I've learned a lot over my years of since writing a book
about cloud infrastructure to today.
And most of my time has been working in these,
you know, in a cloud, but also on-prem
and seeing what that balance is.
And it is, it very much depends.
But I think that a lot of times people get blinded
by just forgetting to make the decision of like,
oh, should I not go to the cloud
or should I not be on-prem?
And they just blindly go to one or another
and they make decisions about that environment.
They kind of just forget about
the other side of it completely of like,
oh, what is the math of buying a rack of servers
and putting it in a colo and saying like,
oh, I get infinite bandwidth of those things.
It's like, okay, is that possible?
I don't know.
Should I?
I don't know.
I have to hire people.
What does that mean?
Okay.
What does it mean with depreciation?
Wow.
Okay.
Like that may be the wrong decision, right?
Like if you, if you need GPUs today and you don't want contracts on them, like I love
the cloud.
The cloud is amazing for experimenting.
I've learned a lot and being able to do things and, and built plenty of infrastructure and
plenty of applications that run in the cloud.
And then at the same time, like I have, I love being able to plug a physical cable in and being able to feel the heat from a computer. I still think today that like all of this
greenwashing of like data centers being green and whatnot, and like saying like, oh, we can go to
the cloud because it's, it's renewable energy. And it's like, well, it's kind of buybacks. And
I was like, the more time people spend in a hot aisle of a data center, the more they will realize
what they are doing to the environment. And touching metal and feeling heat from actual machines will make it actually real to you
of saying like, oh, those thousand computers I just created have an impact.
And what does that actually mean for what I'm doing?
Is what I'm doing worth it for the world and for humanity?
All those things, GPUs and AI included.
That's the challenge I have with a
lot of the narrative around this. It feels like it's trying to shift the blame in some ways onto
individual users. Whereas if I'm building some Twitter for pets style thing, and I start Googling
about the most cost-effective way to run my serverless architecture from a green perspective,
I have a bigger carbon footprint for those Google searches than I do for the actual application that
I'm running because it turns out dogs don't tweet or technically they do, but we couldn't find
a way to make them racist enough to bootstrap a Twitter clone.
Especially with AI built into Google now.
Oh my God. Yes. Hey, instead of answers, how do we just make things up? Because it sounds good.
Like have you put any glue on your pizza lately?
Oh, I couldn't believe that when I saw that earlier today, that was, yeah, that strikes me as the kind of thing computers come up with.
And okay, great.
But why are we putting that front and foremost instead of actual expertise from people who
have been there before?
I think a lot of companies have forgotten that they've built trust over years and years
and years.
And that trust can break really quickly with something that was not well thought through. And trust is a very hard thing to keep.
It feels like Amazon earned trust so they can then squander it by telling everyone that they
are leaders in AI and then demonstrating again and again and again just how far from reality that is.
It's frustrating to me.
You don't have to be a leader in AI as long as you're ahead of someone else, right? As long as you're ahead of your customer, like that is the message for a lot of this is like, oh, I'm a
leader because I took one more step before you did. I'm old and conservative when it comes to
particular expressions of technologies, file systems, databases, the stuff where mistakes
will show. I'm one of the last kids in my block to adopt those things. And it seems to work out reasonably well for me.
I don't think that you need to basically pivot your entire company for what right now is
extraordinarily hype driven. Yes, there's value in AI, but no, customers aren't expecting you to find,
unlock it, and then deliver it to them with a gift wrap bow on it. It's too early for that.
But the stock market is, and that's where the...
That's the problem. Remember it was customer obsession and not
market financial analyst obsession?
When the customers are giving you money.
Exactly. Well, that's the trick of it too. When I talk to my customers who are spending,
in some cases, hundreds of millions a year on AWS and half our consulting work is
contract negotiation with them, with AWS. We're not seeing
people do what the narrative would have you believe, which is, well, we're spending a hundred
million a year right now, but on our next commit, let's make it 150 because of all of that gen AI
stuff. It's small-scale experiments in almost every case. And those same small
experiments are trumpeted in keynotes as this company is radically transforming the way that they do business with the power of AI.
Meanwhile, a week beforehand, we're talking to them.
It's, yeah, we've done this thing.
It's an experiment.
Yeah, there's a press release or two in here, but we're winding the effort down because it hasn't been worth the effort and energy and cost it takes to do this.
So we're keeping an eye on it, but it's not substantial to what we're doing.
So it's the intentional misrepresentation of what people are doing with these things
that starts to irk me more and more
because they can't change their minds
and they won't change the subject.
Sorry, I'll rant about this all day if you'll let me.
But I want to thank you for taking the time to speak with me.
If people want to learn more about what you're up to now,
where's the best place for them to find you?
JustinGarrison.com is my website.
I highly encourage people to buy a domain and run their own websites.
I've been pushing that for years now as more and more things shift to platforms.
I love owning a website. I've been running a website for almost 20 years now and blogging
almost monthly for 20 years.
And after 20 years of blogging, it turns out that like I have a lot of bad takes,
but also it's just like the place that you can find me.
And it's the place that I primarily want to put my thoughts
and want people to reach out to me.
That's a really good philosophy.
I've done something very similar historically.
People are like, oh, you have a big Twitter audience.
Yeah, but I have a bigger newsletter audience
because I own that domain.
I have people's email addresses.
I can land in their inbox
whenever I feel I have something to say
because double opt-in and consent are things.
And I don't have to please the whims of a given algorithm
in order to reach out to people who've expressly stated
they want to hear what I have to say from time to time.
There's so much value in that.
Except for I used Revue for my newsletter,
which got shut down when someone decided
to shut it down at X.
Yes, that's why I was talking to some of those folks early on. I built my own custom system.
Never do that. But I can port the thing between any email service provider or, God forbid,
I can go back to the olden days of running my own series of Postfix mail exchangers if it really comes down to it. I'm not saying that migration will be painless and there might not
be a week or two of delayed issues, but it's definitely something that is possible to do because, surprise, I back
up the database from time to time. Owning the stack gives you options. We will, of course,
put a link to that in the show notes. Thank you so much for taking the time to speak with me.
I appreciate it. Thanks, Corey. Justin Garrison, director of DevRel at Sidero. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform
of choice.
Whereas if you've hated this podcast, please leave a five-star review on your podcast platform
of choice, along with an angry, insulting comment disparaging any of the opinions we've
just had.
But be sure to mention which on-prem provider or
which cloud provider you work for so we understand the needless, grievous personal attacks.