Screaming in the Cloud - Mastering Kubernetes for Multi-Cloud Efficiency With Nick Eberts
Episode Date: April 16, 2024

In this episode, Corey chats with Google's Nick Eberts about how Kubernetes helps manage applications across different cloud environments. They cover the benefits and challenges of using Kubernetes, especially in Google's cloud (GKE), and discuss its role in making applications more flexible and scalable. The conversation also touches on how Kubernetes supports a multi-cloud approach, simplifies the deployment process, and can potentially save costs while avoiding being tied down to one cloud provider. They wrap up by talking about best practices in cloud infrastructure and the future of cloud-native technologies.

Show Highlights:
(00:00) - Introduction to the episode
(03:28) - Google Cloud's approach to egress charges and its impact on Kubernetes
(04:33) - Data transfer costs and Kubernetes' verbose telemetry
(07:23) - The nature of Kubernetes and its relationship with cloud-native principles
(11:14) - Challenges Nick faced managing a Kubernetes cluster in a home lab setting
(13:25) - Simplifying Kubernetes with Google's Fleets
(17:34) - Introduction to GKE Fleets for managing Kubernetes clusters
(20:39) - Building Kubernetes-like systems for complex application portfolios
(24:06) - Internal company platforms and the utility of Kubernetes for CI/CD
(27:49) - Challenges and strategies of updating old systems for today's cloud environment
(32:43) - The dividing line between Kubernetes and GKE from a product perspective
(35:07) - Where to find Nick
(36:48) - Closing remarks

About Nick:
Nick is an absolute geek who would prefer to spend his time building systems, but he has succumbed to capitalism and moved into product management at Google. For the last 20 years, he has worked as a systems engineer, solution architect, and outbound product manager. He is currently the product manager for GKE Fleets & Teams, focusing on multi-cluster capabilities that streamline GCP customers' experience while building platforms on GKE.

Links referenced:
The Duckbill Group's website: http://www.duckbillgroup.com
Nick on Twitter/X: @nicholaseberts
Nicholas Eberts on Instagram: https://www.instagram.com/neberts1/
Nick on LinkedIn: https://www.linkedin.com/in/nicholaseberts/

Sponsor:
Panoptica Academy: https://panoptica.app/lastweekinaws
Transcript
maybe that's where Kubernetes has a strength.
Because you get a lot of it for free, it's complicated,
but if you figure it out and then create the right abstractions,
you can end up being a lot more efficient
than trying to manage, you know, a hundred different implementations.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
I'm joined today by someone rather exciting.
You don't often get to talk to, you know, paid assassins.
But Nick Eberts is a product manager over at Google,
so I can only assume that you kill things for a living.
Not if we can help it, right?
So if you're listening to this and you're using anything that I make,
which is along the lines of GKE,
fleets, multi-cluster stuff, please use it. Otherwise, you're going to make me into a killer.
This episode's been sponsored by our friends at Panoptica, part of Cisco. This is one of those real rarities where it's a security product that you can get started with for free, but also scale to enterprise grade.
Take a look.
In fact, if you sign up for an enterprise account, they'll even throw you one of the
limited, heavily discounted AWS skill builder licenses they got.
Because believe it or not, unlike so many companies out there, they do understand AWS.
To learn more, please visit panoptica.app slash last week in AWS.
That's panoptica.app slash last week in AWS. Exactly. If our customers don't use this,
we're going to have to turn it off. That's an amazing shakedown approach. Although,
let's be honest, every company implicitly does have that. Like, if we don't make enough money, we're going to go out of business is sort of the
general trend, at least for the small scale companies.
And then at some point, it's, we're going to indulge our own corporate ADHD and just
lose interest in this thing that we've built and shipped.
We'd rather focus on the new things, not the old things.
That's boring.
But Kubernetes is not boring.
I will say that. One of the things that
led to this is a few weeks before this recording, I wound up giving a talk at the Southern California
Area Linux Expo called Terrible Ideas in Kubernetes. Because five years ago, I ran my mouth
on Twitter, imagine that, and predicted that no one would care about Kubernetes five years from
now. It would drift below the surface level of awareness that most people had to think about. I think I'm directionally correct, but I got the
timing wrong. I'll blame COVID for it. Why not? And as penance, I installed a Kubernetes of my
very own in my spare room on a series of 10 Raspberries Pi and ran a bunch of local workloads
on it for basically fun. And I learned so many things.
I want to say about myself, but no, not really. Mostly about how the world thinks about these things, and about what Kubernetes actually is once you get past conference-stage talking points and actually run it yourself.
I get the sense you probably know more about this stuff than I do.
I would seriously hope anyway.
GKE is one of those things where people have said for a long
time, the people I trust, most people call them customers, that they have been running Kubernetes
in different places and GKE was the most natural expression of it. It didn't feel like you were
effectively fighting upstream trying to work with it. And I want to preface this by saying so far,
all of my Kubernetes explorations personally have been in my on-prem environment because given the way that all of the clouds charge for data transfer,
I can't necessarily afford to run this thing in a cloud environment, which is sad, but true.
On that note specifically, I think maybe you've noted this at other times,
Google Cloud stopped charging for egress.
You stopped charging for data egress when customers agree to stop using Google Cloud.
All three of the big clouds have done this.
And I think it's genius from the perspective of
it's a terrific sales tool.
If you don't like it,
we won't charge you to get your data back.
But what hurts people is not,
I want to move the data out permanently.
It's the ongoing cost of doing business. Perfect example. I have
a 10-node Kubernetes cluster that really isn't doing all that much. It's spitting out over 100
gigabytes of telemetry every month, which gets fairly sizable. It would be the single largest expense of running this in a cloud, other than the actual raw compute. And it's doing
nothing, but it's talking an awful lot. And we've all had
co-workers just like that. It's usually not a great experience. So it's the ongoing ebb and
flow. And why is it sending all that data? What is in that data? It gets very tricky to understand
and articulate that. So the data transfer is interesting. I mean, I'd want to ask you, what metrics or what signals are you sending out
that cross a point at which you would get billed?
Because that's interesting to me.
I mean, do you not like the in-cloud logging
and operations monitoring stuff?
Because when we ship metrics there,
we're not billing you for it.
Now we are billing you for the storage.
Sure.
And to be fair,
storage of metrics has never been something
I found prohibitive on any provider.
This is, again, this is running in my spare room.
It is just spitting things out.
Like, why don't you use the in-cloud-provided stuff?
It's like, well, it's not really a cloud
in the traditional sense.
And we will come back to that topic in a minute.
But I want to get things out somewhere.
In fact, I'm doing this multiple times,
which makes this fun.
I use Axiom for logs.
That's how I tend to think about this. And I've also instrumented it with Honeycomb.
Axiom is what told me we're about 250 gigabytes and climbing the last time I looked at it.
And it's at least duplicating that, presumably, for what gets sent off to Honeycomb as well.
I also run Prometheus and Grafana locally, because I want to do what all the cool kids do.
And frankly, having a workload that runs Kubernetes
means that I can start actively kicking the tires
on other products. Otherwise it really is contrived: you try to shove them onto just this thing running locally on your laptop or something. Some of my actual standing applications are pure serverless, built on top of Lambda functions.
That gets really weird for some visions
of what observability should be.
So I have an actual Kubernetes that now I can throw things at and see what explodes.
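(For readers who want to picture the setup Corey is describing, here is a minimal sketch of an OpenTelemetry Collector pipeline that fans the same cluster telemetry out to two external backends. The endpoints, tokens, and dataset names below are placeholders rather than anything from his actual lab; the point is that every byte exported off-cluster this way becomes data transfer you pay for once the cluster lives in a cloud.)

```yaml
# Sketch only: fan the same telemetry out to two hosted backends.
# Endpoints, tokens, and dataset names are placeholders.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp/logs_backend:
    endpoint: https://api.example-logs.io        # placeholder log backend
    headers:
      Authorization: Bearer ${env:LOGS_TOKEN}    # supplied via environment
      X-Dataset: k8s-homelab                     # hypothetical dataset name
  otlp/traces_backend:
    endpoint: traces.example.io:443              # placeholder trace backend
    headers:
      x-api-key: ${env:TRACES_KEY}

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [otlphttp/logs_backend]
    traces:
      receivers: [otlp]
      exporters: [otlp/traces_backend]
```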
Now that makes sense. I mean, like, listen, I love Honeycomb and there's a lot of third-party
tools out there and providers. And one of the things that we do at Google, probably done across
the board, is work with them to provide an endpoint or a data store or an existence of their service
that's local within the cloud, right?
So if you're using Honeycomb
and that Honeycomb instance
that is your SaaS provider
actually is an endpoint
that's reachable inside of Google Cloud
without going out to the network,
then you can reduce the cost.
So we try to work with them to do things.
One example technology we have
is Private Service Connect,
which allows third-party companies to sort of host their endpoint in your VPC, with an IP that's inside of your VPC. So then your egress charges are from a node running in a cluster to a private IP, not going out through the internet. So we're trying to help, because our customers do prefer not to pay large amounts of money to use what is essentially a service that, most of the time, is itself running on Google Cloud.
I do confess to having a much deeper understanding of the AWS billing
architectures and challenges among it. But one of the big challenges I've found, and this leads naturally into the overall point that a few of us here at the Duckbill Group have made on Twitter, which is how you and I started talking.
Specifically, we have made the assertion that Kubernetes is not cloud native, which sounds an awful lot like clickbait, but it is a sincerely held belief.
It's not one of those, somebody needs to pay attention to me.
No, no, no.
I have better stunts for that. This is based upon a growing
conviction that I've seen from the way that large companies are using Kubernetes on top of a cloud
provider and how Kubernetes itself works. It sounds weird to say that I have built this on
a series of Raspberries Pi in my spare room. That's not really what it's intended for or designed to do. But I would disagree, because what a lot of folks are doing is treating Kubernetes as a multi-cloud API,
which I think is not the worst way to think of it.
If you have a bunch of servers sitting in a rack somewhere,
how are you going to run workloads on it?
How are you going to divorce workloads
from the underlying hardware platform?
How do you start migrating it
to handle hardware failures, for example?
Kubernetes seems to be a decent answer on this.
It's almost a cloud in and of itself.
It's similar to a data center operating system.
It's realizing the vision that OpenStack
sort of defined but could never realize.
No, that's 100% it.
And you're not going to get an argument from me there.
Kubernetes, running your applications in Kubernetes, does not make them cloud native.
One of the problems with this argument in general
is that nobody can agree on what cloud native actually means.
It means I have something to sell you in my experience.
Right.
My interpretation, it sort of adheres to the value prop
of what the cloud was when it came out.
Flexible, just pay for what you want when you need it,
scale out on demand, these kinds of things.
So applications definitely are not immediately cloud native when you put them in Kubernetes. You have to do some work to make them auto scale.
You have to do some work to make them stateless, maybe 12 factor, if you will, if you want to go
back like a decade. Yeah, you can't take a Windows app that's a monolith, run it on Kubernetes clusters that have Windows node support, and then call it cloud native. Also, not all applications
need to be cloud native. That is not the metric that
we should be measuring ourselves by. So it's fine. Kubernetes is the lowest common denominator,
or it's becoming the lowest common denominator of compute. That's the point.
If you have to build a platform, or you're a business that has several businesses within it,
you have to support a portfolio of applications, it's more likely that you'll
be able to run a high percentage of them on Kubernetes than you would on some fancy PaaS. Like that's been the death of every PaaS. It's like, ooh, this is really cool. I have to rewrite
all my applications in order to fit into this paradigm.
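(As a concrete illustration of the "work to make them auto scale" Nick mentions, here is a minimal sketch of a HorizontalPodAutoscaler against a hypothetical Deployment; the app still has to be stateless enough that adding and removing replicas is safe.)

```yaml
# Minimal autoscaling example; "web" is a hypothetical workload name.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas when average CPU passes 70%
```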
I built this thing and it's awesome for my use case. And it's awesome right until it gets a
second user, at which point the whole premise falls completely to custard.
It's a custard.
It's awful.
It's a common failure pattern,
where anyone can build something to solve for their own use cases.
But how do you make it extensible?
How do you make it more universally applicable?
And the way that Kubernetes has done this means that, to no small degree, you're building your own cloud when you're using Kubernetes.
One of the cracks I made in my talk, for example, was that Google has a somewhat condescending and
difficult engineering interview process. So if you can't pass through it, the consolation prize
is you get to cosplay as working at Google by running Kubernetes yourself. And the problem when
you start putting these things on top of other cloud provider abstractions is you have a cloud within a cloud. And to the cloud provider, what you've built looks an awful lot like a single-tenant app with very weird behavioral characteristics that, for all intents and purposes, remain non-deterministic. So as a result, you're staring at this thing that the cloud provider says, well, you have an app and it's doing some things, and the level of native understanding of what your workload looks like from the position of that cloud provider becomes obfuscated through that level of indirection.
It effectively winds up creating a host of problems while solving for others.
As with everything, it's built on trade-offs.
Yeah.
I mean, not everybody needs a Kubernetes, right? There's a certain complexity you have to have in the applications you need to support before it's beneficial, right? It's not just immediately beneficial. A lot of the customers that I work with are actually, somewhat to my, I don't want to say dismay, but a little bit, doing the hybrid cloud thing.
I'm running this application across multiple clouds. And Kubernetes helps them there because while it's not identical on every single cloud,
it does cover like 80, maybe 85, 90% of the configuration.
And the application itself can be treated the same across these three different clouds.
There's 10% that's different per cloud provider, but it does help in that degree.
We have customers that can hold us accountable.
They can say, you know what?
This other cloud provider
is doing something better
or giving it to us cheaper.
And we have a dependency
on open source Kubernetes
and we built all our own tooling.
We can move quickly.
And it works for them.
That's one of those things
that has some significant value for folks.
I'm not saying that Kubernetes
is not adding value.
And again, nothing is ever
an all or nothing approach. But an easy example where I tend to find a number of my customers
struggling: most people will build a cluster to span multiple availability zones in AWS-land, because that is what you are always told. Oh, well, yeah, we can constrain blast radii, so of course we're going to be able to sustain the loss of an availability zone. So you want to be able to have it flow between those. Great. The problem is that it costs two cents per gigabyte to have data transfer between availability zones, and in many cases Kubernetes itself is not in any way zone aware. It has no sense of pricing for that. So it'll just as cheerfully toss something over a two-cents-per-gigabyte link as opposed to the thing right next to it for free.
And it winds up in many cases
bloating those costs.
It's one of those areas
where if the system understood
its environment,
the environment understood
its system a little bit better,
this would not happen.
But it does.
So I have worked on Amazon.
I didn't work for them; I used EC2 for two or three years. That was my first foray into cloud. I then worked for Microsoft, so I worked on Azure for five years, and now I've been on Google for a while. So I will say this:
my information with Amazon's a little bit dated, but I can tell you from a Google perspective,
like that specific problem you call out, there are at least upstream Kubernetes configurations
that can allow you to have affinity
with transactions.
It's complicated though.
It's not easy.
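(A hedged sketch of the kind of upstream configuration Nick is alluding to: spread the replicas across zones, then ask the Service to prefer in-zone endpoints so cross-zone transfer charges don't pile up. All names and images are hypothetical, and the topology-aware routing annotation depends on the Kubernetes version in use.)

```yaml
# Spread pods evenly across zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: example.com/api:latest   # placeholder image
---
# Ask the Service to keep traffic within the caller's zone where it can.
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    service.kubernetes.io/topology-mode: Auto   # version-dependent feature
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080
```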
We also, so one of the things
that I'm responsible for
is building this idea of fleets.
This idea of fleets is that you have
n number of clusters
that you sort of manage together.
And not all of those clusters
need to be homogenous,
but pockets of them are homogenous.
Right? And so one of the patterns that I'm seeing our bigger customers do is create a cluster
per zone. And they stitch them together with the fleet, use namespace sameness,
treat everything the same across them, slap a load balancer in front, but then silo the
transactions in each zone so they can just have an easy and efficient and
sure way to ensure that, you know, interzonal costs are not popping up.
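(The per-cluster half of that pattern is just applying identical manifests to every zonal cluster in the fleet, which is what "namespace sameness" amounts to; registering the clusters into a fleet and putting the multi-cluster load balancer in front happens outside these manifests. A minimal, entirely hypothetical example:)

```yaml
# Applied identically to each zonal cluster in the fleet, so "checkout"
# means the same thing everywhere. Names and images are hypothetical.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  namespace: checkout
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: example.com/checkout:1.4.2   # placeholder image
          ports:
            - containerPort: 8080
```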
In many cases, that's the right approach. I learned this from some former Google engineers back in the noughts, back when being a Google engineer was the sort of thing where a hush came over the room and everyone leaned in to see what this genius wizard would be able to talk about.
It was a different era on some level.
And one of the things I learned was that in almost every scenario,
when you start trying to build something
for high availability,
this was before cross-AZ data transfer
was even on anyone's radar,
but for availability alone,
you would have, say, a phantom router that was there to take over in case the primary failed. The number one cause of outages, and it wasn't particularly close, was a failure in the heartbeat protocol or the control handover.
So rather than trying to build data center pods that were highly resilient, the approach instead was load balance between a bunch of them and constrain transactions within them, but make sure you can fail over reasonably quickly and effectively and automatically.
Because then you can just write off a data center in the middle of the night when it fails, fix it in the morning, and the site continues to remain up. That is a
fantastic approach. Again, having built this in my spare room, at the moment, I just have the one.
I feel like after this conversation, it may split into two, just on sheer sense of this is what
smart people tend to do at scale.
Yeah, it's funny. So when I first joined Google, I was super interested in going through their like SRE program. And so one thing that's great about this company that I work for now is
they give you the time and the opportunities. So I wanted to go through what SREs go through when
they come on board and train. So I went through the interview process.
And I believe that process is called hazing, but continue.
Yeah. But the funniest thing is, you go through this and you're actually playing with tools and affecting real org cells and using all of the Google terms to do things, obviously not in production. And then you have these tests, and most of the time the answer to the test was, hey, drain the cell. Just turn it off and then turn another one on.
It's the right approach in many cases.
That's what I love about the container world is that it becomes ephemeral.
That's why observability is important because you better be able to get the telemetry for
something that stopped existing 20 minutes ago to diagnose what happened.
But once you can do that, it really does free up a lot of things, mostly.
But even that I ran into significant challenges with.
I come from the world of being a grumpy old sysadmin.
And I've worked with data center remote hands employees that were, yeah, let's just say that was a heck of a fun few years.
So the first thing I did once I got this up and running, got a workload on it, is I yanked the power cord out of the back of one of the node members that was running a workload like I was rip-starting a lawnmower enthusiastically at two in the morning, like
someone might have done to a core switch once. But yeah, it was, okay, so I'm waiting for the cluster to detect that the pod is no longer there and reschedule it somewhere else. And it didn't, for two and a half days. And again, there are ways to configure this, and you have to make sure the workload is aware of this.
But again, to my naive understanding,
part of the reason that people go for Kubernetes
the way that they do
is that it abstracts the application
away from the hardware
and you don't have to worry about individual node failures.
Well, apparently I have more work to do.
These are things that are tunable and configurable.
One of the things that we strive for on GKE is to bake in a lot of these best practices. This would be a best practice: recovering the node, reducing the amount of time it takes for the disconnection to actually release whatever lease is holding that pod on that particular node.
We do all this stuff in GKE,
and we don't even tell you we're doing it
because we just know that this is the way that you do things.
And I hope that other providers are doing something similar just to make it easier.
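(One of the plain-Kubernetes tunables behind the behavior Corey hit: by default, a pod tolerates an unreachable or not-ready node for five minutes before it is evicted and rescheduled. Here is a hedged sketch of shortening that per workload, with hypothetical names; stateful workloads with attached volumes add their own detach delays on top of this.)

```yaml
# Shorten how long pods ride out a dead node before being rescheduled.
# Default is 300 seconds; shorter values trade faster failover for more
# churn during brief network blips. All names here are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      tolerations:
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 30   # evict after 30s instead of 300s
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 30
      containers:
        - name: demo
          image: example.com/demo:latest   # placeholder image
```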
They are.
Again, I've only done this in a bare metal sense.
I intend to do it on most of the major cloud providers at some point over the next year or two.
Few things are better for your career and your company than achieving more expertise in the cloud.
Security improves, compensation goes up, employee retention skyrockets. Panoptica,
a cloud security platform from Cisco, has created an academy of free courses just for you. Head on over to academy.panoptica.app to get started. The most common problem I had was all related
to the underlying storage
subsystem. Longhorn is what I use. I was going to say, can I give you a fun test?
When you're doing this on all the other cloud providers, don't use Rancher or Longhorn.
Use their persistent disk option. Oh, absolutely. The reason I'm using Longhorn,
to be very clear on this, is that I don't trust individual nodes very much. And yeah, EBS or any of the providers have a block storage option that is far superior to what I'll be able to achieve with local hardware, because I don't happen to have a few spare billion dollars in engineering lying around in order to abstract a really performant, really durable block store. That's not on my list.
Well, so I think all the cloud providers
have really performant, durable block store
that's presented as disk store, right?
They all do.
But the real test is when you rip out that node
or effectively unplug that network interface,
how long does it take for their storage system
to release the claim on that disk
and allow it to be attached
somewhere else? That's the test. Exactly. And that is a great question. And there are ways,
of course, to tune all of these things across every provider. I did no tuning, which means
the time was effectively infinite, as best I could tell. It wasn't just for this. I had a
number of challenges with the storage provider over the course of a couple months. And it's
challenging. I mean, there are other options out there that might have worked better. I switched
all the nodes that have a backing store over to using relatively fast SSDs because having it on
SD cards seemed like it might have been a bottleneck around there. And there were still
challenges on things and in ways I did not inherently expect. That makes sense. So can I
ask you a question?
Please.
If Kubernetes is too complicated,
let's just say, okay, it is complicated.
It's not good for everything.
But most PaaSes are a little bit too constrictive, right?
Like their opinions are too strong.
Most of the time,
I have to use a very explicit programming model to take advantage of them.
That leaves us with VMs in the cloud, really, right?
Yes and no.
For one workload right now that I'm running in AWS, I've had great luck with ECS, which
is, of course, despite their word about ECS Anywhere, a single-cloud option. Let's be clear on this. You are effectively agreeing to lock-in in some form, but it does have some elegance
because of how it
was designed in ways that resonate with the underlying infrastructure in which it operates.
Yeah, no, that makes sense. I guess what I was trying to get at, though, is if ECS wasn't an
option and you had to build these things, I feel like my experience working with customers,
because before I was a PM, I was very much field consultant, customer engineer,
solution architect, all those words.
Customers just ended up rebuilding Kubernetes.
They built something that auto-scaled.
They built something that had service discovery.
They built something that repaired itself.
They ended up creating a good bit of the API, is what I found.
Now, ECS is interesting.
It's a little bit hairy when you
actually, if you were going to try to implement something that's got smaller services that talk
to each other a lot. If you just have one service and you're auto-scaling it behind a load balancer,
great. Yeah, they talked about S3 on stage at one point with something like 300 and some odd
microservices that all combine to make the thing work, which is phenomenal. I'm sure it's the right
decision for their workloads and whatnot. I felt like I had to jump on that as soon as it was said,
just a warning. This is what works at a global hyperscale centuries-long thing that has to live
forever. Your blog does not need to do this. This is not a to-do list. But yeah, back when I was
doing this stuff in anger, which is, of course, my name for production, as opposed to the staging environment, which is always called theory, because it works in theory but not in production.
Exactly.
Back when I was running things in anger, it was before containers had gotten big, so it was always: take AMIs to a certain point, and then do configuration management and code deploys in order to get them to current. And yeah, then we
bolt on all the things that Kubernetes does offer that any system has
to offer.
Kubernetes didn't come up with these concepts.
The idea of durability, of auto-scaling, of load balancing, of service discovery, those
things inherently become a problem that needs to be solved for.
Kubernetes has solved some of them in very interesting, very elegant ways.
Others of them it has solved for by,
oh, you want an answer for that?
Here's 50, pick your favorite.
And I think we're still seeing best practices continue to emerge.
No, we are.
And I did the same thing.
Like, in my first role where I was using cloud,
we were rebuilding an actuarial tool on EC2.
And the value prop obviously for our customers was like,
hey, you don't need to rent
1,000 cores for the whole year from us. You could just use them for the two weeks that you need them.
Awesome. That was my first foray into infrastructure as code. I was using Python and the Boto SDK and just automating the crap out of everything. And it worked great.
But I imagine that if I had stayed on at that role, repeating that process for n number of applications
would start to become a burden. So you'd have to build some sort of template, some engine,
you'd end up with an API. Once it gets beyond a handful of applications, I think maybe that's
where Kubernetes has a strength because you get a lot of it for free. It's complicated. But if you
figure it out and then create the right abstractions for the different user types you have, you can end up being a lot more efficient than trying to manage 100 different implementations.
We see the same thing right now. Whenever someone starts their own new open source project or even starts something new within a company, great.
The problem I've always found is building the CI/CD process. How do I hook it up to GitHub Actions or whatever it is to fire off a thing? And until you build sort of what looks like an internal company
platform, you're starting effectively at square one each and every time. I think that building
an internal company platform at anything beyond giant scale is probably ridiculous, but it is
something that people are increasingly excited about. So it could very well be that I'm the one
who's wrong on this. I just know that every time I build something new, there's a significant boundary
between me being able to YOLO slam this thing into place and having merges into the main branch
wind up getting automatically released through a process that has some responsibility to it. Yeah. I mean, there's no perfect answer for everybody,
but I do think, I mean,
you'll get to a certain point
where the complexity warrants a system like Kubernetes.
But also the CI/CD angle of Kubernetes
is not unique to Kubernetes either.
I mean, you're just talking about pipelines.
We've been using pipelines forever.
Oh, absolutely.
And even that doesn't give it to you out of the box.
You still have to play around with getting Argo
or whatever it is you choose to use set up.
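(For the pull-model GitOps wiring being discussed, a minimal Argo CD Application that keeps a namespace synced to a path in Git looks roughly like this; the repository URL, path, and names are placeholders.)

```yaml
# Sketch of an Argo CD Application: the cluster continuously pulls whatever
# is at this Git path and reconciles the namespace to match it.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config.git   # placeholder repo
    targetRevision: main
    path: apps/my-service/overlays/prod
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: my-service
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to what Git says
    syncOptions:
      - CreateNamespace=true
```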
Yeah, it's funny, actually.
Weird tangent.
I have this weird offense
when people use the term GitOps like it's new.
So first of all, we've all been,
well, as an aged man
who's been in this industry for a while,
we've been doing GitOps for quite some time.
Now, if you're specifically talking about a pull model,
fine, that may be unique.
But GitOps is simply just,
hey, I'm making a change in source control.
And then that change is getting reflected in an environment.
That's how I consider it.
What do you think?
Well, yeah, we store all of our configuration in Git now.
It's called GitOps.
What were you doing before?
Oh yeah, go retrieve the
previous copy of what the configuration looked like.
It's called copyofcopyofcopyof
thing.back.cjq.usethisone.doc.zip.
Yeah, it's great.
That's even going
further back. Yeah, let me please
make a change to something that's
in a file store somewhere and
copy that down to X amount of
VMs
or even hardware machines
just running across my data center.
And hopefully that configuration change
doesn't take something down.
Yeah.
The idea of blast radius
starts to become very interesting
in canary deployments.
And all the things
that you basically rediscover
from first principles
every time you start building
something like this.
It feels like Kubernetes gives a bunch of tools that are effective for building a lot of those things.
But you still need to make a lot of those choices and implementation decisions yourself.
And it feels like whatever you choose is not necessarily going to be what anyone else has chosen.
It seems like it's easy to wind up in unicorn territory fairly quickly. But I just, I don't know. I think as we're thinking about what the alternative for a Kubernetes is
or what the alternative for a PaaS is,
no one, I don't really see anyone building a platform to run old shitty apps.
Who's going to run that platform?
Because that's the, what, 80% of the market of workloads
that are out there that need to be improved.
So we're either waiting for these companies to rewrite them all, or we're going to make their life better somehow.
That's what makes containers so great in so many ways. It's not the best approach,
obviously, but it works. You can just take something that is 20 years old,
written in some ancient version of something, shove it into a container as a monolith.
Sure, it's an ugly big container, but then you can at least start moving that from place to place
and unwinding your dependency rat's nest.
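(What "shove it into a container as a monolith" can look like once an image exists: a single-replica Deployment with generous resources and its state on a persistent volume. Not cloud native, but now it can be scheduled, moved, and observed like everything else. Everything here is hypothetical, and building the image itself is left out.)

```yaml
# One big replica of a legacy monolith, lifted into a container as-is.
# All names, sizes, and the existing PVC are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-erp
spec:
  replicas: 1                # the app was never built to run as copies
  strategy:
    type: Recreate           # avoid two instances fighting over state
  selector:
    matchLabels:
      app: legacy-erp
  template:
    metadata:
      labels:
        app: legacy-erp
    spec:
      containers:
        - name: legacy-erp
          image: registry.example.com/legacy-erp:2004-vintage   # placeholder
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
          volumeMounts:
            - name: data
              mountPath: /var/lib/erp
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: legacy-erp-data   # assumes an existing PVC
```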
That's how I think about it, only because, like I said, I've spent 10, 12 years working with a lot of customers trying to unwind these old applications.
And a lot of the times they lose interest in doing it pretty quickly because they're making money and there's not a whole lot of incentive for them to break them up and do anything with them. In fact, I often theorize that whatever their business is, the real catalyst for change is when another startup or another smaller company comes up and does it more cloud natively and beats their pants off in the market, which then forces them to have to adjust.
But that kind of stuff doesn't happen in like payment transaction companies. Like no one,
like you have, there's a heavy price to pay to even be in that business. And so like,
what's the incentive for them to change? I think that there's also a desire on the
part of technologists many times, and I'm as guilty as any one of this to walk in and say,
this thing's an ancient piece of crap. What's it doing? And the answer is like,
about 4 billion in revenue.
So maybe mind your manners.
And yeah, okay.
Is this engineeringly optimal?
No, but it's kind of load-bearing.
So we need to work with it.
People are not still using mainframes because they believe that in 2024,
they're going to greenfield something
and that's the best they'd be able to come up with.
It's because that's what they went with 30, 40 years ago. And there has been so much business process built around its
architecture, around its constraints, around its outputs, that unwinding that hairball is impossible.
It is a bit impossible. And also, is it that bad? Those systems are pretty reliable. The only downside is just the cost of whatever IBM is going to charge
you to have support.
So we're going to re-architect and then migrate it to the
cloud. Yeah, because that'll be less expensive.
Good call.
It's always a trade-off. And economics are one of those weird things where people like to think of the cash dollars they pay vendors as the end-all be-all, but they forget that the most expensive thing every company has to deal with is its personnel. The payroll costs dwarf cloud infrastructure costs, unless you're doing something truly absurd at a very small scale of company.
Like, I've never heard of a big company
that spends more on cloud
than it does on people.
Oh, that's an interesting data point.
I figured we'd at least have a handful of them,
but interesting.
I mean, you see it in some very small companies where like, all right, we're a two-person startup and we're not taking market rate salaries and we're doing a bunch of stuff with AI.
And OK, yeah, I can see driving that cost into the stratosphere, but you don't see it at significant scale. In fact, for most companies that are legacy, which is the condescending engineering term for it makes money, which means in terms it was founded before five years ago,
a lot of companies' number two expense is real estate, more so than it is infrastructure costs. Sometimes, yeah, you can talk about data centers being part of that,
but office buildings are very expensive. Then there's a question of, okay, cloud is usually
number three, but there are exceptions for that. Because they're public, we can talk about this
one. Netflix has said for a long time that their biggest driver, even beyond people, has been
content. Licensing all of the content and producing all of that content is not small money. So there
are going to be individual weird companies doing strange things. But it's fun.
I mean, you also get to this idea as well that, oh, no one can ever run on-prem anymore.
Well, not for nothing.
Technically, Google is on-prem.
Yeah, so is Amazon.
These aren't just magic companies that are the only ones who can remember how to replace hardware and walk around between racks of servers. It's just, is it economical? When does it make sense to start looking at these things?
And even strategically, tying yourself indelibly to a particular vendor,
because people remember the mainframe mistake with IBM.
Even if I don't move this off of Google or off of Amazon today,
I don't want it to be impossible to do so in the future.
Kubernetes does present itself as a reasonable hedge.
Yeah, it neutralizes that lock-in with vendor
if you are to run your own data centers or whatever.
But then you're locked into,
a lot of the times you end up getting locked
into specific hardware,
which is not that different than cloud
because I do work with a handful of customers
who are sensitive to even very specific versions of chips.
They need version N because it gives them 10% more performance.
And at the scale they're running,
that's something that's very important to them.
Yeah.
One last question before we wind up calling this an episode
that I'm curious to get your take on,
given that you work over in product.
Where do you view the dividing line between Kubernetes and GKE?
So this is actually a struggle that I have
because I am historically much more open source oriented
and about the community itself.
I think it's our job to bring the community up,
to bring Kubernetes up.
But of course it's a business, right?
So the dividing line for us,
that I think about is the cloud provider code,
the ways that we can make it work better on Google Cloud
without really making the API weird, right?
I don't want to, we don't want to run some version of the API
that you can't run anywhere.
Yeah, otherwise you just roll a Borg and call it a day.
Yeah, but when you use a load balancer,
we want it to be like fast, smooth, seamless, and easy.
When you use persistent storage,
like we have persistent storage
that automatically replicates the disk across all three zones so that when one thing fails,
you go to the other one and it's nice and fast. So these are the little things that
we try to do to make it better. Another example that we're working on is fleets. That's specifically
the product that I work on, GKE Fleets. And we're working upstream with cluster inventory to ensure
that there is a good way for
customers to take a dependency on our fleets without getting locked in, right? So we adhere
to this open source standard. Third-party tool providers can build fun implementations that then
work with fleets. And if other cloud providers decide to take that same dependency on cluster
inventory, then we've just created another good abstraction for the ecosystem to grow
without forcing customers
to lock into specific cloud providers
to get valuable services.
It's always a tricky balancing act
because at some level,
being able to integrate
with the underlying ecosystem
it lives within
makes for a better product
and a better customer experience.
But then you get accused
of trying to drive lock-in.
The main thing, I think,
if you talk to my skip,
Drew Bradstock, he runs
Cloud Container Runtimes for
all of Google Cloud. I think he
would say, and I agree with him here,
that we're trying to get you to come
over to use GKE
because it's a great product, not because
we want to lock you in. So we're
doing all the things to make it easier for you
because you listed out a whole
lot of complexity.
We're really trying to remove that complexity.
So at least when you're building your platform
on top of Kubernetes,
there's maybe, I don't know,
30% less things you have to do
when you do it on Google Cloud
than other platforms or on-prem.
Yeah, it would be nice.
But I really want to thank you
for taking the time to speak with me.
If people want to learn more,
where's the best place for them to find you?
I'm active on Twitter.
So hopefully you can just add my handle in the show notes.
And also just if you're already talking to Google,
then feel free to drop my name
and bring me into any call you want as a customer.
I'm happy to jump on and help work through things.
I have this crazy habit where I can't get rid of old habits.
So I don't just come on the calls as a PM and help you.
I actually put on my architect consultant hat
and I can't turn that part off.
I don't understand how people can engage
in the world of cloud
without that skillset and background personally.
It's so core and fundamental to how I view everything.
I mean, I'm sure there are other paths.
I just have a hard time seeing it.
Yeah, yeah, yeah.
It's a lot less about, let me pitch this thing to you
and much more about, okay, well, how does this fit
into the larger ecosystem of things you're doing,
the problems you're solving?
Because, I mean, we didn't get into it on this call
and I know it's about to end, so we shouldn't,
but Kubernetes is just a runtime. There's
like a thousand other things that you have to figure out
with an application sometimes, right?
Like storage, bucket storage,
databases, IAM,
IAM.
Yeah, that is a whole separate kettle of nonsense.
You won't like what I did locally, but that's beside
the point. But are you allowing
all anonymous? Exactly.
The trick is, if you harden the perimeter
well enough, then nothing is going to ever
get in, so you don't have to worry about it. Let's also
be clear, this is running a bunch of very
small-scale stuff. It does use a real
certificate authority, but still. I have the
most secure Kubernetes cluster of all time running
in my house back there. Yeah, it's
turned off. Even then,
I'd still feel better if it were sunk into concrete and then dropped into a river somewhere.
But, you know,
we'll get there.
Thank you so much
for taking the time
to speak with me.
I appreciate it.
No, I really appreciate your time.
This has been fun.
You're a legend,
so keep going.
I'm something, all right.
I think I have to be dead
to be a legend.
Nick Eberts,
product manager at Google.
I'm cloud economist Corey Quinn,
and this is Screaming
in the Cloud. If you enjoyed this podcast, please leave a five-star review on your podcast platform
of choice or on the YouTubes. Whereas if you hated this podcast, please continue to leave a
five-star review on your podcast platform of choice, along with an angry, insulting comment
saying that that is absolutely not how Kubernetes and hardware should work. But remember to disclose
which large hardware vendor you work for in that response.