PurePerformance - A Minimalistic Approach to Kubernetes with Kelsey Hightower
Episode Date: March 2, 2020

We're back to our regularly scheduled show! Kelsey Hightower (@kelseyhightower) has worn many hats, as it says on his bio, but we also learned from him that he probably doesn't have that many hats at home, as he has been living a minimalist life over the past couple of years. A philosophy, as we learn in this podcast, that also goes well when it comes to building your next platform on Kubernetes.

In this podcast we learn about the do's and don'ts, how you should plan and test for k8s upgrades, which tradeoffs you have to make when it comes to performance, how to think about developer productivity on k8s, and why it is important to read up on security as it relates to the software we build, deploy and run on our k8s clusters.

Thank you Kelsey for supporting our community with your time and expertise. Hope to have you back in the future!

https://twitter.com/kelseyhightower
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance. It's definitely been a while for our audience. We had our last official podcast before Perform with Abigail Wilson, no relation.
And then we had all those Perform podcasts, and I want to thank everybody for, you know,
I know those are a lot more Dynatrace-centric, more about our user conference,
but thank you for making it through those if you're not interested in the Dynatrace stuff. We're back to regular programming now, and I know there were some issues with some of the uploads,
so those should be fixed by now. But yeah, it was really busy at Perform and got to
see you there, which was always nice. Yeah, even though it was only for a short while. Yeah, it was
only for a short while because I've been running around and you guys did all the work. And I just felt
like I didn't really contribute a whole lot. You contributed a ton, Andy, all those interviews you did with the PMs.
Don't sell yourself short there.
That was amazing.
Amazing work you get done for us.
Yeah, it was only a short while,
but it felt like forever.
They really run us ragged.
But it was great to see all of our customers
and vendors and everyone else over at Perform.
And hopefully we'll be seeing everyone again next year.
But we're back to our normal show, right?
Exactly.
And we have a very exciting guest today.
I say that often, I know.
Everyone's a very exciting guest.
Anybody who's willing to come on the show is exciting for me.
But to me, I've been following this person,
checking out things they've been doing.
I've enjoyed the no-code repository.
So do you want to go ahead and introduce our guest here, Andy,
or anything else you want to contribute before we head in?
No, I think you're right.
I mean, it's amazing to have a guest like him on the show today.
And the only thing I want to say,
I want to read out his tagline on his Twitter profile,
and it just says, minimalist.
That's all it says.
And I actually want to hand over the token
to Kelsey Hightower, who is there with us today.
I'm not sure where he is today.
He officially lives in Portland, Oregon,
but I know he's traveling a lot.
So Kelsey, where are you today?
And can you please let us know what a minimalist is
and why this is driving you?
Yeah, so I actually live in Washington state now.
Okay.
And I work from home, so I'm at home now, so I'm not actually traveling today.
But I do travel quite a bit for our customers and, you know, some community work or engineering work at one of the Google offices.
And a minimalist, I started doing, well, I adopted the term because, you know, it was the best way to describe some of the life changes I probably made maybe 15, 20 years ago, which is, you know,
living debt-free, all of my clothes fit in probably one bag. I own very little items,
but I really appreciate the items that I do own. And I kind of keep this minimalist philosophy
in terms of the things that I need versus the things that I want. And I keep that balance.
And I guess the term minimalist
fits the bill. That's awesome. You know, it's funny when you talk about the clothes,
I know you travel a lot. So that requires having enough clean clothes to travel with. But I was
just thinking like, you know, since I started working from home, the amount of clothes I buy
is very minimal, not quite minimalist, but you, but I wear the same jeans all week long,
if I wear pants at all.
So, yeah, I get it. I think that's really awesome.
It's something I would love to do, but never really had the drive to do,
or the push to do.
So, it's great that you're pulling that off.
So Kelsey, you are working for Google
for those people that kind of managed to escape
who you actually are.
I think if you are Googling your name
or if you Google Kubernetes,
then your name comes up all the time
because you're doing a fantastic job
in going out there and showing people
what the technology that,
in this case, Google and Kubernetes brings to the table, how this can change our life
and how we can use it, you know, in the best ways to, I think, in the end, deliver better
software value faster to our end users.
So I need to get this out there because even though I think a lot of people know Kubernetes
but I still can you give me
can you give me your view on where are we right now
with Kubernetes I know it's a hot topic and it seems
it's you know it's been widely adopted
and more and more people that we talk to have already
started using Kubernetes also in production or are
about to.
But can you let us quickly know what's the state of Kubernetes and what does Kubernetes give you?
But also maybe what does Kubernetes not give you?
What else do you need on top of Kubernetes in order to become successful with deploying
high-scaling, performing applications, which is still to the heart of our listeners?
Wow.
So I think Kubernetes is, you know, it's definitely an application platform. Most people see it for what it is on the surface, right? You have a bunch of containers, you have a bunch of
either bare metal or virtual machines, and you want something to orchestrate the lifecycle
of your applications that happen to be packaged in containers, right? This is how most
people look at Kubernetes. This is how most people leverage Kubernetes today. So, and this is just a
continuation on the good work that Docker started, right? This idea that you would package your
applications in a standard format. You could then share and distribute those using standard runtimes.
And I think that's just kind of where most people are today. If you're an advanced user, though, you've probably
progressed well past that. You're probably using Kubernetes to build your own automation tools.
So this concept of an operator or people using CRDs, these are custom resource definitions,
and they've taken this whole declarative approach to application management
and spread it across other automation tasks
like creating CICD pipelines
or machine learning pipelines.
So I think this is kind of where the industry is
depending on when you got started with Kubernetes.
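As a concrete illustration of the CRD mechanism Kelsey mentions, here is a minimal sketch of a custom resource definition; the `Pipeline` kind and `example.com` group are invented for illustration, not from any real project:

```yaml
# Sketch: teach the API server a new declarative resource type.
# Group, kind, and field names here are invented for illustration.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: pipelines.example.com   # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: pipelines
    singular: pipeline
    kind: Pipeline
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              stages:
                type: array
                items:
                  type: string
```

Once a CRD like this is applied, `kubectl apply` can manage `Pipeline` objects the same declarative way it manages pods, and an operator watches those objects to do the actual work.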
Now, I think you bring up a great point
and I just want to bring up one project
that we are actively working on. As you
said, Kubernetes is a great application platform. You deploy a container. Kubernetes takes care of,
you know, managing the health, spinning up the containers in case they're failing. I mean,
that's all taken care of. But what we have also realized, and we kind of maybe are in the next
mature state as an engineering organization, is we need something to better manage the lifecycle
of an artifact as it gets kind of deployed into a stage.
Then it needs to be properly tested.
It needs to be evaluated and then promoted into the next stage,
kind of an event-driven model
for managing the lifecycle of an artifact
all the way through into production
and also managing, you know, problems that may come up in production. So the thing we built on top of Kubernetes, and finally we're also part of the CNCF landscape now, is an open source project called Keptn that we also are happy to give back to the world, where we are, you know, automating some of these tasks on top that, quote unquote,
bare metal Kubernetes doesn't give us.
Now, what we have seen,
and maybe you see this as well,
we have a lot of people
that are just getting started with Kubernetes.
And then we have these
that have already been working with it for years and years
and are really kind of far beyond what
everybody that just starts can imagine. Is there any advice that you can give people that are just
getting started and there's still a lot that are getting started and getting their hands on?
What are the things people want to make sure that they are addressing and what are the things that
you say, you know, please don't make this mistake because it's been made all over again, so avoid these things?
Oh, man.
I think the biggest one is using Kubernetes for everything.
Okay.
I think that's a thing where most people
look at Kubernetes and say,
wow, I can probably use Kubernetes
to recreate every automation tool
I've ever built in my life
because it's the new hotness.
There's declarative configs.
You can extend it.
And I think people are just getting a little carried away with that.
It sits on top of infrastructure, right?
So Kubernetes assumes you have infrastructure that you can use, right?
So I think that's the best place to start
in terms of putting Kubernetes as a layer
to orchestrate the things below it.
Some people want to recreate all the things below it,
like build a cloud platform on top of Kubernetes itself.
That's going to be very challenging
if you don't have a lot of experience on how to do that.
These are kind of separate concerns,
like IaaS, infrastructure as a service,
the thing you get from a cloud provider, what the right extensions and add-ons you may get from
tools like VMware. But Kubernetes doesn't really operate at that layer. It actually
leverages that lower layer to do its bidding. So without that in place, Kubernetes is really,
I don't know if it's going to be as effective. So I think the idea to
keep in mind is Kubernetes has a purpose. It's definitely great at what it does. You can extend
it to build other platforms, but just understand there's a big difference between building platforms
and building your own platform as a service, if that's your thing, and using Kubernetes. So I
think you just have to be fairly clear about your intentions. Andy, this calls to mind a conversation we had a while back.
And all the terms and the guests are eluding me right now.
But it had to do with Wardley Maps, right?
And there was this idea, and Kelsey, I was curious on your perspective on this.
There was the idea, it sounds like what you're saying is people are, you know, since Kubernetes is extensible, since you can kind of make it do a lot of things,
the question is what you're proposing is what should you be making it do
and what shouldn't you be making it do?
And I'm not asking for any kind of a vendor endorsement in this question,
but more of the question of it seems to, if there are products and services out there that do things
very well, and it's not in the core scope of Kubernetes, then leverage those external products.
Find out what you should build in-house or make on your own, and then also identify what already
exists that can handle the things that you need
and use them because it's a lot easier to have a dedicated company
or a dedicated set of resources to manage your IaaS components
or maybe manage your monitoring or maybe manage all these other components.
So where do you see that case that you're talking about falling into the
go to a third party versus
build your own? The way I like to think about this is, let's say Kubernetes didn't exist.
What decision would you make? You know, what logging stack would you pick? Would you pick
Splunk? Why did you pick Splunk? All of those decisions are probably still good decisions. Just because Kubernetes has a default logging tool,
Fluentd in this case,
doesn't mean you have to throw away Splunk, right?
It may give you an option.
Maybe there's a lower cost there,
but it doesn't mean that your previous good decision
is no longer valid.
And I see that happen a lot.
Same thing for your load balancer.
One thing you don't want to see is, let's say you're using Nginx, and you throw it away for a whole class of reasons that have nothing to do with Kubernetes or the ecosystem, introducing a different take on load balancing just because Kubernetes is the new hotness. So I think a lot of times you've got to make sure that you're making
decisions that are just kind of good on their own and not making decisions just because you have Kubernetes. I like that. Hey, so I know you've been talking a lot
about all sorts of aspects on Kubernetes. One aspect that we are talking a lot about is
performance and talking to some of our performance engineers that we have in-house and to some of the performance engineers that Brian and I often meet because that's also our background.
We have a couple of friends around the world that have been doing performance engineering for 20, 30 years, and they're always skeptical about, you know, the next new platform, especially as it comes to the hype around it.
And then really, does it really perform as well?
Does it really scale as well?
As you said, underneath, it's still just hardware
where you have just more layers of abstraction on top of it.
And therefore, everything has to be slower by default
because you have so many levels of abstractions.
Can you talk a little bit about what you have seen
from a performance and scalability perspective?
What are things performance engineers especially have to consider when they are dealing with Kubernetes?
Are there any new patterns we have to be aware of?
Are there any new metrics that are interesting when it comes to monitoring the performance of the Kubernetes platform itself? Is there anything that you say, well, the way you did
it on the classical infrastructure doesn't make sense anymore in Kubernetes and here is why. So
especially speaking to those folks that have a long history with performance engineering and
now moving to Kubernetes, can you talk a little bit about to kind of spread their doubts,
but also give them some practical advice?
Yeah, so if you're going to be measuring Kubernetes, right,
this is a new system.
And most people that are measuring,
whether it's scheduler latency, the metrics,
how many pods did it schedule,
how much memory is the Kubernetes API server using.
A lot of times the API server does serve as a cache,
so it's expected to use a lot of memory
to make a performance trade-off in terms of API calls.
So when you say kubectl apply and you deploy your application,
you want that interaction to be fast as possible.
So a way of doing that is Kubernetes does attempt
to use a lot of memory on the API
server to protect the database etcd.
And that's a big performance trade-off, right?
So some people will look at that and say, wow, this API server takes up way too much
memory.
But again, that's by design.
So just like any new system, you really need to understand why it's designed the way it
is, what trade-offs is it
making for the sake of performance, before you can really start to measure and understanding
what your measurements are telling you. So that's step one. So when measuring Kubernetes,
it is a different system. There is a control plane: the scheduler, the API server, and all the things that make that control plane work. And there's a data plane. There's a couple of data planes, if you want to think about it,
like the kubelet, the thing that takes from the control plane what to run.
That thing lives on the node,
and it has its own set of metrics that you may care about.
So that's Kubernetes itself.
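For listeners who want to poke at those control-plane metrics themselves, the API server exposes them in Prometheus text format. A minimal sketch follows; the captured sample below is invented, and in a real cluster you would pull live data with `kubectl get --raw /metrics`:

```shell
# Sketch: inspect API server metrics like the memory usage Kelsey mentions.
# In a real cluster you would fetch the live data with:
#   kubectl get --raw /metrics
# Here we parse a small invented sample to show the shape of the data.
sample='# HELP process_resident_memory_bytes Resident memory size in bytes.
process_resident_memory_bytes 1.073741824e+09
apiserver_request_total{code="200",verb="GET"} 15234'

# Pull out the API server resident memory (the "by design" memory trade-off).
mem=$(printf '%s\n' "$sample" | awk '/^process_resident_memory_bytes/ {print $2}')
echo "API server resident memory: $mem bytes"
```

The same approach works for scheduler and kubelet metrics; each component serves its own `/metrics` endpoint.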
The biggest gotcha I see from people who have experience is when they're really thinking about the performance of the application.
The defaults are the things where I think a lot of people
who have been doing this for a very long time
tend to forget the defaults that are on a virtual machine, right?
So if you take a VM or bare metal server,
you're probably running usually as root,
even though people won't admit it.
You're usually running this root.
You have access to all the memory, all the CPUs.
You have no isolation. You have no namespaces limiting what you can do with the memory or CPU.
So the whole machine is yours. And if you're used to a performance benchmark under those conditions,
you're going to be really surprised when I take your application. And it's not the fact that I
put it in a tarball, aka container image, is where the impact comes from.
It comes from the fact that you're going to be running
under a container runtime that by default
will try to isolate that process
and also limit how that process gets to interact
with the number of CPUs available.
So one concrete example would be
like a Java application, right?
You're running on the JVM
and if I give you a 16 node application, right? You're running on the JVM. And if I give
you a 16 node box, right? 16 CPUs, typically, depending on how you've configured your JVM,
it's going to see those 16 CPUs and just grab all of them, right? Like I'm Java,
take all the memory, takes all the CPU, and that may become your new performance benchmark.
Now, let's say you have Kubernetes and you take the same application and you do something
like, you know, you may limit the CPUs to two because you've copied and pasted from
Stack Overflow.
You're not quite sure what you're doing in that pod specification when you deploy the
app.
The problem, though, is even though you told Kubernetes to only run you on a machine that has two CPU free,
and you may even set some limits, like no more than four CPUs of utilization. That sounds about
right. And then you run on the machine and the JVM is still going to see 16 CPUs. And it actually
will not necessarily take a cue from your configuration. So the scheduler and the kubelet and the container
runtime will limit you, but the JVM will still run as if it was still on the VM and that will
manifest itself in thread contention, all kinds of things that will look like the app is now running
quote unquote slow. So then is this problem not solved by any of the JVM vendors so that JVMs understand in which runtime environment they really run
and therefore automatically adjust the defaults?
And is something like this available or not available?
I'm going to say maybe, but like all JVM problems,
there's probably like 200 additional switches that you can do, right?
So there's already like 10,000 switches
and most people I know don't really know which ones do what.
So most people tend to run with like a default configuration.
I think your audience is going to be a little bit different.
They're probably used to really tuning the GC
or tuning various aspects of the JVM.
But again, even with that knowledge,
you still may not know the relationship
to your
current configuration and the limitations being imposed by the container runtime.
So I think those are the things that you're just going to observe that, oh, the kernel
is throttling the threads that you're operating on. You would have to understand what Docker is
doing to you. So I haven't seen any JVMs that just auto-detect the fact that they're running inside of a container runtime.
There's environment variables that you can pass to the JVM, right?
So if you can take your configuration and you can say, hey, this thing only has this many CPUs.
I want to limit my thread pool based on that number.
So you mentioned a lot of this comes from, you know, you're running Kube as root and all this.
Is there something during the initial setup that you would do differently to be able to contain this?
Or does it all come into not just copying and pasting from Stack Overflow,
but understanding all of the settings that you're going to put into your containers and everything to limit this?
How does one control this besides that environmental variable for the threads or limit it to two CPUs specifically?
So there's two things at play here.
One, most people are coming from a world
where they just had full ownership
of the machine without isolation,
running possibly as root.
It's probably less about running as root
as it is about having access
to the entire machine, right?
So that's the baseline
most people are tracking, right?
I had the whole machine to myself,
very few agents or daemons running
except for the ones that I put there.
And that becomes your new baseline,
especially if you've been running like that for years.
Then you move into Kubernetes world.
So whether you're running as root or not,
there are ways to mitigate the user ID that this thing runs as.
But the biggest challenge, I think, is communicating
between the runtime configuration, using network namespaces and cgroups for the first time that
may introduce some overhead that your runtime isn't used to in case of the JVM. And the way
you communicate that to the JVM is using what we call the downward API. So in Kubernetes at runtime,
you can actually gather some information
about some of these limitations
that you're putting in place
and use that to actually configure the JVM
as your process is being launched, right?
So either do it as part of your wrapper script
where you're about to launch the app
and you want to set your JVM switches up ahead of time.
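Kelsey's downward API approach can be sketched as a pod spec. This is a hedged illustration, not his exact setup: the image name and the 2-CPU figure are invented, and `-XX:ActiveProcessorCount` is one JVM flag that accepts such a value:

```yaml
# Sketch: use the downward API to tell the JVM about its CPU limit.
# Image name and the 2-CPU limit are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: jvm-app
spec:
  containers:
  - name: app
    image: example/jvm-app:latest
    resources:
      requests:
        cpu: "2"
      limits:
        cpu: "2"
    env:
    # Downward API: expose the CPU limit as an environment variable.
    - name: CPU_LIMIT
      valueFrom:
        resourceFieldRef:
          resource: limits.cpu
    # Dependent env var expansion lets the limit flow into the JVM flags,
    # so the JVM sizes its thread pools from 2 CPUs instead of all 16.
    - name: JAVA_TOOL_OPTIONS
      value: "-XX:ActiveProcessorCount=$(CPU_LIMIT)"
```

A wrapper script launching the app can read the same `CPU_LIMIT` variable and build its own flags, which is the pattern Kelsey describes.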
But that's the problem there. The biggest challenge is communicating the limitations,
because what the JVM will see will be largely different from the constraints that are
put around it. So, I mean, if I think back the way we started our talk and we said you are
minimalist, isn't that the best advice? If you actually tell folks that used to run their JVMs
on a I own the world kind of environment to,
hey, you may want to start thinking about
what's the minimum environment that your application needs
in order to still run as expected from a performance perspective?
I think that's generally great advice then.
Stay minimum.
And that means you can then really truly also leverage the scalability options that Kubernetes gives you by then spinning up maybe more pods in order to scale.
Obviously, assuming your architecture allows for that.
But would that be a good kind of advice?
Stay minimalistic?
Yeah, I think even before Kubernetes,
I think this idea of need versus want.
So if you look at most people,
if we continue with the JVM,
if you look at your class path,
do you really need all of those jars in there?
And I think a lot of times, sometimes those jars increase the startup time, they use more memory. So if you can clean the class path out and just leave what you need,
you're probably going to have a much better time in terms of, you know, compute utilization and requirements and also understanding what flags do you need. I've seen a lot of people who have like
30 flags set. I'm like, why are these flags being set? He's like, I don't know, man. It worked at some point, so we don't change it. And when you
go into a new platform, not understanding what flags are critical and what they do,
you never can really get to the point of having only the flags you need to tune the behavior that
you want. And going back to the idea of the jars running, sorry for picking your brain on this so
much, but curious, from your point of view:
a developer might know which ones they're using,
which ones they're not.
But let's say you're talking to someone
from a performance engineering team, or an SRE team.
Is there a way with any tools or anything else
that they can detect which ones are utilized
and which ones aren't
so that they can make a recommendation to say,
hey, we noticed these ones aren't being leveraged.
Maybe you should consider removing them.
You know what?
I haven't worked with the JVM extensively for a long time,
but I'll tell you what I used to do,
just to be very transparent.
From my experience, no one knows what they actually need.
Because if I ask a lot of developers,
what are your dependencies?
And most of them will say, I don't know.
I need Red Hat version X. I need JVM version X. And then ideally everything
in the class path is what I need because it is working. As you remove something, it may stop
working. So don't remove anything. That's usually the status quo that I've seen. Now, if you're
lucky, one thing I used to do is just look at the startup process. Like, it's tedious as hell. But you have to be careful because you may have library paths that are only
exercised in very unique scenarios. So you don't want to go remove something because you haven't
seen it be used in a long time. For example, if you have a Postgres jar, you may say, oh,
this app doesn't call Postgres. Well, it may under a reporting condition
that runs only once per month. So you kind of need to have a negotiation. So what I like to do is
sit side by side with the team, figure out what parts of this is being used. And then what is it
being used for? Why is this Postgres jar here at all if we're using Oracle? And someone will say,
oh, we'll use it for reporting. So, okay, it gets to stay. And then I tried to remove one by one until we cut too deep and we have to add back.
But to me, it was just more of a trial and error sitting next to the experts. And it just takes
time to get to a point where you have that discipline knowing what needs to be there.
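Kelsey's look-at-the-startup-process approach can be sketched like this. The log lines below are fabricated for illustration, and the `[class,load]` format shown is the JDK 9+ unified logging style produced by `-verbose:class`:

```shell
# Sketch: find which jars actually get loaded at runtime.
# Run the app once with class-loading logging, e.g.:
#   java -verbose:class -jar app.jar > classload.log
# JDK 9+ lines look like:
#   [0.5s][info][class,load] com.example.Main source: file:/app/lib/app.jar
# The sample below is invented to show the parsing step.
sample='[0.5s][info][class,load] com.example.Main source: file:/app/lib/app.jar
[0.6s][info][class,load] org.postgresql.Driver source: file:/app/lib/postgresql.jar
[0.7s][info][class,load] com.example.Util source: file:/app/lib/app.jar'

# Extract the distinct jar files that were actually touched.
used=$(printf '%s\n' "$sample" | grep -o '[^/ ]*\.jar' | sort -u)
printf '%s\n' "$used"
```

As Kelsey warns, absence from one run proves little: a jar exercised only by a monthly reporting job won't show up, so treat the list as a starting point for the conversation, not a deletion list.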
Hey, Kelsey, I got another question for you. So one of the things that we get asked a lot about, so yes, we have,
we understand we are going, we're going towards containers, we're using Kubernetes.
Shall we treat our pre-prod environment and our production environment in a way that we have
different Kubernetes clusters, or shall we use the isolation that namespaces give us
to isolate the stages, pre-prod and prod,
or shall we even go the next level and say,
well, there should only be prod.
We don't need any pre-prod anymore
because if we're doing it right,
we just deploy into prod.
And if we have a new version,
we just do canary deployments
or we use feature flags.
So I know this is a long stretch question,
but the question is,
what are,
if you talk with an organization
that is trying to figure out
how does this work for them?
Do they need multiple Kubernetes clusters
for the different environments
or is namespace isolation enough
or should we go all in and say we only do production
and then use these new deployment options that we have?
What is your experience?
Where does it work?
Where does it not work?
Because I'm sure it depends also on the application
and the maturity level of the organization.
Yes.
All of those things.
This has been a question for, like, 20 years. Should you have one VMware
cluster for everything? Should you have one Amazon account or one GCP account for everything?
All of this starts to break down really quickly when you start to think about security,
software upgrades, mistakes, blast radius. There are so many things around maturity
that have nothing to do with the happy state
of you getting everything right all the time.
So I think as humans and human nature,
mistakes are going to happen.
That's okay.
So here's the thing.
Let's just talk about those scenarios really quickly.
Let's say you have one Kubernetes cluster for production.
Just one big
one. You have 5,000 nodes in there and you're going to use namespaces to create environments.
Great. This sounds like a great idea. So in production, your third-party extensions are
really locked into Kubernetes 1.14. Just 1.14. Some developer tried their best and they coded around 1.14. 1.16 comes out. They didn't test those existing extensions with 1.16 because there was nothing to test against. Turns out 1.16 is
incompatible with those third-party extensions that you come to rely on. You upgrade the cluster.
For QA purposes, QA wants to test, but there's only one control plane.
Even though you have multiple namespaces, there's only one control plane. You upgrade to 1.16.
QA is fine. Production is down. The first thing you're going to do is say, hmm,
maybe we should have more than one Kubernetes cluster.
Now, if you feel that more than one could be in production,
so you have cluster A and cluster B,
and now you have 2,500 nodes in one, 2,500 nodes in the other,
and now you have a much easier way of rolling out a Kubernetes control plane upgrade.
You can upgrade one cluster,
and if a percentage of your applications fall over,
then you know not to touch the other cluster where production traffic is still serving.
So I think that's going to be your biggest decision factor, is how much of a blast radius
do you want? The other thing that I see is maybe you're doing this performance testing,
and you really need to blast this cluster. Okay, if you only have one cluster where all of your stuff is being shared, either you're
going to over-provision that cluster to accommodate for big performance tests or what we see a
lot in the cloud, especially when you have more tools to manage a cluster, there's almost
no reason to put all your eggs in one basket. So let's say
your steady state in production is 2,000 nodes. If I want to do a performance test, I can just
create another 2,000 node cluster, run my performance test, and delete the entire cluster
instead of trying to maintain a maybe 3,000 node cluster that auto scales up and down playing musical chairs. So I think there's
just so many more options that you really don't need to think about one big cluster. Now this is
slightly different if you're like on-premise where you don't necessarily have the ability to do
elastic compute where you can scale up and down easily. So you may get forced into over-provisioning
which we see a lot when it comes to
on premise. But even in that scenario, you can add and remove nodes to an existing cluster. So
your automation is going to be different. But I strongly recommend people just for the sake of
Kubernetes upgrades, and thinking about your third-party extensions and dependencies that may not be so mature: you may not have the ability to have the QA, dev, performance, and production control planes be at the same version
during all the test phases.
This is fascinating. Thank you so much.
So just to be clear, when we talk about these third-party extensions,
we're primarily talking about operators
that you have running in your Kubernetes cluster
that you may have written yourself or that you brought in from somebody else.
What other extensions are there for those people that are not that familiar with Kubernetes
that we need to take care of?
So I'll tell you one that I saw break in production for real production people,
and I had to help troubleshoot.
So think about the network policy extension.
Very great company, Calico,
and they have a really nice extension
that really implements
Kubernetes network security policies well.
And the way that works is
there's an agent that runs on every node,
and you can put in Kubernetes network policy.
So this works like a traditional firewall.
It leverages IP tables.
And what it does
is it watches the Kubernetes API. And as it gets configurations about how to configure that firewall,
it then goes and updates IP tables. Okay. So a new version of Kubernetes comes out.
And to be fair, there was no way for them to test future compatibility. So let's say there's a small bug in either the Kubernetes control plane, so the way it does its configuration or the way it sends its changes may have a bug, or maybe it just changes.
Or there's a bug in the controller itself
that's watching Kubernetes.
That actually changed and all of the firewall rules closed
and they failed open or they failed close.
That means all traffic in the entire cluster just stopped, period.
At that point, you are now hard down completely.
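For listeners who haven't seen one, the object such an agent watches for is a standard Kubernetes NetworkPolicy. A minimal example (the namespace and label names here are made up for illustration):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: demo
spec:
  podSelector:
    matchLabels:
      app: backend          # the pods this policy protects
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080
```

The agent on each node translates objects like this into iptables rules, which is exactly why a control-plane change it can't handle is so dangerous.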
But you could have tested this if you had a separate cluster.
You could have upgraded the Kubernetes version, saw everything, stopped and said,
all right, we're not moving this to the other cluster until we resolve it.
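That "don't promote until we've tested it" gate can be sketched in a few lines. This is an illustrative sketch, not a real tool; the version strings are invented, and the rule (compatibility usually breaks per minor version) is a simplifying assumption:

```python
# Sketch of a pre-upgrade gate: refuse to promote a control-plane version
# to the next cluster until that minor version has been exercised in a
# lower environment. Version strings below are illustrative.
def minor(version: str) -> tuple:
    """'1.17.3' -> (1, 17); extension compatibility tends to break per minor."""
    major, minor_version = version.split(".")[:2]
    return (int(major), int(minor_version))

def safe_to_promote(target: str, tested_versions: set) -> bool:
    """True if some already-tested version shares the target's minor version."""
    return minor(target) in {minor(v) for v in tested_versions}

tested_in_qa = {"1.17.2", "1.17.3"}
print(safe_to_promote("1.17.4", tested_in_qa))  # True: 1.17 was exercised in QA
print(safe_to_promote("1.18.0", tested_in_qa))  # False: hold until QA runs 1.18
```

The same check works whether the gate lives in a pipeline script or just in a runbook.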
This stuff is all very advanced Kubernetes, right?
A lot of times people start by just, we're going to spin it up,
we're going to start running some things.
And probably until they hit their first hiccups,
they're not thinking about these things.
For people who want to try to get ahead of this,
besides DMing you and trying to take your time,
which I know is not reasonable.
Yeah, don't do that. Yeah, don't do that, exactly.
But what resources are out there
to support people understanding?
Besides going to like KubeCon,
besides listening to like watching some of,
you know, your presentations and other presentations,
is there any sort of organized
or semi-organized best practices
or maybe things to consider,
stuff going on out there at this point?
Or is it all kind of trial and error
and looking up scenarios and doing a bunch of searches
and trying to figure out how to do this?
Yeah, I think that's true of all tech, right?
There's no substitute for experience.
With that said, though, the nice thing is
there is an official certification
for a lot of people that are, you know, trying to become Kubernetes administrators so they can
at least learn some of this baseline knowledge about what these components are and how they
fit together. KubeCon, I mean, there are just so many talks over the years that cover everything
from security to performance tuning and, you know, horror stories like the ones I just mentioned in production from real companies.
So I encourage people to learn from others.
So watch a couple of those talks, watch a couple of those videos, and then give yourself some headroom.
Having more than one cluster will give you the time to make those mistakes, but then limit the impact of those mistakes.
And I think that's all we're trying to do is have this delicate balance in tech where we're all kind of learning as we go,
especially when there's new tools like this. We can leverage the people who've already gone
through this, listen to them talk, read those blog posts as inputs for our own education,
and then leaving ourselves headroom in case we get it wrong.
I imagine a lot of people might think, or maybe you tell me if you come across this,
I would imagine some people or at least organizations might think,
well, if I'm using GKE or EKS or anything else like that,
I don't have to worry about this stuff so much.
But I would imagine that the proper advice is,
no, you still have to worry about it
because they're just providing the framework there
or basically the hardware for you,
you still need to know what you're doing on a very deep level.
Don't rely on the cloud vendors to take care of all this for you
because they won't.
They're just going to give you what you ask for.
Yeah, I remember my first encounter with true automation.
I'd always written scripts and things like that,
but I remember getting configuration management
for the first time with Puppet.
And I was like, look, I can update all 1,000 servers in two or three minutes. Let me show you.
Bam. All the servers are updated. It's like, oh, so we don't even have to worry about deployments
anymore. I'm like, oh, no, that's a thing of the past. Now, the problem with that is
when you have a wrong configuration or the wrong version of the app, you will totally
roll out the wrong version of the app in minutes and not days, which is really, really bad. So when
it comes to Kubernetes, even if you have a fully managed Kubernetes cluster, it still doesn't save
you from the issue we talked about earlier, which is an upgrade of a version of Kubernetes that your
extensions can't
handle. It's just going to happen faster. So when you click on that auto upgrade button,
it's just that at some random point in time, you may be a victim of an auto upgrade. So you still
need to be cautious. And I think the same discipline applies even when the cluster is fully managed. Hey, Kelsey, I think earlier you mentioned that,
you know, one of the big beneficiaries,
the group of beneficiaries of Kubernetes
are the application developers.
And at least that's the way I remember you saying it.
And that's obviously true because it's easy to, you know,
deploy your containers.
And then thanks to, you know, service meshes like Istio,
you can easily deploy it side by side with the previous version.
And then it just, you know, do things like a canary deployment.
Have you seen, with the people you've worked with in the past, any kind of, let's say, automation evolving around this?
Because knowing that I can deploy my container side by side by using Istio, by defining my Helm charts correctly, is great if I know how to do this. But with more and more application developers kind of hopping on to Kubernetes,
is there something where we can help them
instead of having to learn Helm,
instead of having them to learn
how to configure Istio correctly?
Are there things available,
especially for application developers,
that allow them to deploy easier
with kind of like a self-service experience,
I would almost say, to really truly leverage
the power that Kubernetes gives them.
But obviously something is needed on top.
Yeah, so it's funny.
This question comes up every five to seven years.
New thing shows up, we leak it to the developers, right?
Hey, virtualization.
Hey, everyone gets to log into the VMware console.
Hey, Vagrant, everyone gets VMs on their laptops.
Hey, Cloud, everyone gets an Amazon or Google account.
Hey, Docker, everyone gets a Docker daemon.
Oh, what's that? Is that Kubernetes?
Everyone gets kubectl.
It's like, what are you doing?
If you keep doing that, you're going to keep asking the same question every single time.
There's really, really one serious workflow that you really, really want.
I, as a developer, check in my code. There's a way to build my application to produce an artifact. Jar, war, binary, doesn't matter. You can package that in an RPM or a container. Doesn't
matter to me either. I need that thing to be running on some compute. That's kind of what
we were talking about here. Now, I may need to give you some additional metadata so you know where you want me to run it or how much memory or CPU it
needs. That's usually all you need, regardless of the system, VMs, Kubernetes, containers, it just
doesn't matter. The problem is we haven't had a lot of time to create those other systems.
There are systems like this, like Cloud Foundry, OpenShift, Heroku, App Engine.
These are very opinionated ways to take a subset of the things we just described, some artifact
and some metadata, and to deploy your application, right? That's kind of been the whole goal of these
PaaSes. So when you look at Kubernetes, you look at Istio, these are just platform components for you to build your own PaaS. So if you're an operator and you have a bunch of virtual machines or bare metal machines, you could spend almost a lifetime at this point trying to create all the things that Kubernetes has and all the things that Istio has to build your own deployment system for your team or the people you're trying to support with your infrastructure. Or you can download or use components like Istio and Kubernetes and then layer on your
opinionated workflow at the top.
So this all boils down to what workflow do you want your developers to have?
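For reference, the raw "implementation detail" that any such workflow ultimately produces is something like a plain Kubernetes Deployment. The image name, registry, and resource numbers below are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: registry.example.com/example-app:1.0.0  # the built artifact
          resources:
            requests:
              cpu: 250m      # the "how much CPU it needs" metadata
              memory: 256Mi
```

An opinionated platform can generate all of this from far less input; the question is whether developers should have to author it by hand.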
I'm going to use one more analogy here.
It's like the internet.
The current workflow for most consumers of the internet,
you go to somewhere like Best Buy and you buy a modem and it says, this will get you on the
internet. It's all right, great. You buy the modem and you open the box and there's a cable in it
and you twist it in the wall and lights start flashing. You're probably online at that point.
That's it. And then you just get online and if things don't work,
you look at the lights
and then you unplug it and plug it back in.
End of story.
That is the interface.
That's the workflow.
What we do in the Kubernetes world,
we say, no, we can't be having that.
We need everyone to know how to create a Cat5 cable.
So here's the chart of how to twist the pairs of wires, and here's how you clip the other end. So let's give every developer or every internet consumer a Cat5 kit so they know how to
create cables. And next, we're going to teach them how BGP works, right? Because that's the backbone
of the internet and we're doing DevOps. We don't want people just using the internet. We want them
to know how it actually works. It's like, come on, you got to be realistic. So focus on the workflow.
Kubernetes and Istio are implementation details
and they give you a lot of platform features,
but don't make the mistake of leaking all of those details
as if that's the end game.
I love your analogy.
And I may, if it's okay, steal slash borrow it at some of the presentations that I'm doing.
And I completely agree with you, right?
This is the challenge that we also faced. And with we, I mean our own organization. When we started with Kubernetes, everyone that tried to use it was, to bring in your analogy, kind of building all these cables and trying to figure out how TCP/IP works: figuring out how to deploy, figuring out how to configure Istio.
And we realized this was greatly slowing us down in the way we could actually deliver value to our customers, because we had a lot of developers just purely spending time trying to figure out how this big thing underneath actually works.
And this is also when we started building
and maybe call it an opinionated way
of enabling our developers to just,
hey, give me your artifact.
And then we are kind of managing the lifecycle of that artifact
by automatically deploying it, automatically configuring Istio,
automatically monitoring your SLIs and SLOs
that the developer also specified as metadata,
and then giving the developer feedback
and giving them the option to promote it into the next stage.
I want to be clear here. I'm not saying that everyone in the world who downloads Kubernetes for the first time
will automatically have the time to build this perfect workflow.
What I'm saying is there's a couple of things.
In the very beginning, your workflow may very well be
people using kubectl apply and pushing YAMLs at the API server.
That data definition is also the fundamental element
that no matter what workflow you build,
it will always create those data elements,
those configurations you give to Kubernetes.
So the nice thing about this is it becomes your escape hatch.
So let's say you build a workflow that says,
if you check in some code, we build the container,
we deploy to Kubernetes.
That's not very advanced.
But the nice thing about that is you may not have any troubleshooting tools in that workflow either.
But your pipeline will just create the same artifacts that developers were creating with
kubectl. Bonus points if you check them in before you deploy them. Now troubleshooting can just be
done on the backend using kubectl until you decide if you want to have higher
level debugging tools to go along with that workflow. So keep that in mind that I'm not
expecting perfection day one, but just make sure you understand that there is work to do
based on what you get out of the box. And I think that also ties into the concepts
of what we see even in our own organization where operations becomes more of
moving from the old-fashioned operations type of team
to the pipeline and requirements team to say to the developers,
check in your code with these additional artifacts,
but it's a subset of the artifacts that they need to know about.
They don't need to know every single component. But if the operations team can
outline and define what is required along with a code check-in, then the rest of the pipeline
will pick everything else up based on the conformity to what's out there and push it.
And the developers don't have to think about all the rest of that. Because again, you don't want
developers to have to understand all the underlying components.
You want them to write code
and you want them to write code that works.
So the more you can abstract away from them
and put that into the hands of,
maybe it's your operations team,
maybe it's another team,
but someone else defines all that
and someone gives them a gated amount
of additional components they check in,
it becomes a lot more manageable.
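A sketch of what such a gated, developer-facing check-in could look like. This is not a real schema; every field name below is invented to illustrate the "subset of artifacts" idea:

```yaml
# Hypothetical developer-facing manifest: the developer supplies only this,
# and the platform team's pipeline derives the Deployment, Istio routing,
# and monitoring from it. All field names are illustrative.
app: checkout-service
artifact: registry.example.com/checkout-service:1.4.2
resources:
  cpu: 500m
  memory: 512Mi
slos:
  - indicator: response_time_p95
    threshold: 300ms
  - indicator: error_rate
    threshold: 1%
promotion:
  strategy: canary
  stages: [dev, staging, production]
```

The operations team owns what fields exist and what the pipeline does with them; the developer only fills them in.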
Well, probably we could talk about performance.
And I also really like the way you kind of explained it.
Think about what's the workflow
you want your developers to have, right?
And how can you make this more efficient,
more streamlined, more easy to use?
How can we offer things like this as a self-service?
But on the other end, and Kelsey, probably you'll agree with me,
whether this is Kubernetes or whether this used to be in the past
a system where you were deploying to a big Java application server,
in the end, we should have always thought about that way.
How can we make the lives of developers so easy
that they can really focus on delivering code
that then gets automatically deployed in the right way
without having to think about all the plumbing underneath.
But I guess that's just new.
Exactly. Happy paths with escape hatches.
Yeah, exactly. I like that.
Yeah, let's jump over to the topic
that Brian just brought up, right?
So security is a big topic of yours.
And I think your roles, if you look at some of the meetups you've done in the recent weeks,
I think you're scheduled for some of those.
If I look at your Twitter feed and also the podcast you've done, then security is a big
topic in the Kubernetes space.
Can you tell us a little bit more, particularly what are the big topics that people need to be aware of and things that people
have to think of? Yeah, so right now, I mean, it's just been amazing. Like the last five or six years,
the number of platforms we have, whether it's Lambda, Cloud Run, Heroku, Cloud Foundry,
raw Kubernetes, or just Docker.
All of those are now getting us to the point where we can put our software
wherever we want it, when we want it.
This is a great place to be in.
Now the security question comes in,
how do you lock down those platforms?
There's tons of discussions around
how do you secure those platforms
to only run the things that you want to run.
Now we're turning our attention to what are we running?
So if you build, let's say, a Go application
and you may have some OS dependencies,
something like FFmpeg or something like that,
that will come from your operating system.
Typically, if you're using something like Red Hat,
they'll sign some packages, track the CVEs.
You can run yum update and maybe rebuild your container or rebuild or update
the server that the app is running on and do dynamic linking. Okay, we kind of have a handle
somewhat on that. But where there's a blind spot in the industry is our third party dependencies
that happen to be software related. So let's say someone creates a go package or go module
that you just import to do some basic functionality.
Who is tracking the CVE of that little code snippet that you're getting from somewhere like
GitHub? It's not an OS package. It may not have the same scrutiny or visibility that things like
FFmpeg have. So now you have this blind spot. So when we package up our applications, let's say in
containers,
we have to worry about the OS level dependencies. And we also have to worry about the third party dependencies that get compiled in our binary. So this end-to-end software pipeline or governance
or chain of trust, how do we establish that chain of trust? And when it's broken, meaning there's a
vulnerability in one of those third party dependencies, how do we identify who has it and how do we redeploy when that takes place?
So how do we do this? I mean, how do we... No one knows because...
It seems that obviously depending on where you are in the delivery pipeline, when you're, I mean, there's obviously ways where you can probably include code scanners
and to figure out, do you just copy paste it something that potentially is vulnerable, right?
And then as you're then linking to other libraries, is this, I mean, as you said,
maybe nobody knows yet where it all fits in, but there's probably multiple phases in the end-to-end delivery pipeline where you need to think of security and how to make sure you're not including or introducing existing vulnerabilities, or you need to take care of logging on what you include so that later on, in case there is a new vulnerability,
you know that you have this vulnerability currently deployed.
Yeah, I think people are trying to take an approach that you see in the auto industry,
right?
If you build a car, the car has a VIN number, has a make and a model, has a color.
And typically, you can trace back a lot of the components like the alternator, the engine,
even the airbag to that VIN number. So you have a bill of goods, and you can think of the VIN number as the signature for
the car. It's probably not a perfect analogy, but now you have a way to trace back. So when there's
a recall on a car, you can use the DMV registration to identify everyone who has that car and almost
pinpoint the problem down to one manufacturer and say, hey, return your car and we will fix this component.
We don't have that for software.
So in the software world, you don't really know who's importing your library.
You don't know what version or what checksum of that library that they have.
So one way we're trying to solve this is by starting at the very start.
You check in your code.
Ideally, people are signing their commits so we can trace back
where these things were introduced by the various authors. People like me, we check in our
dependencies. So even though we have a third-party dependency, I like to check in that dependency so
I can have it with me to prevent someone from replacing a 1.0 and me just re-downloading it in
my next build. So I like to kind of check in the things that I'm using
and then rely on the checksum, not just the version,
to really tell me what I have.
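That vendoring-plus-checksum habit can be sketched in a few lines of Python. The file name and contents here are stand-ins for a real vendored tarball:

```python
# Sketch: verify a vendored third-party dependency against the checksum
# recorded when it was checked in, so a silently replaced "1.0" is caught
# before the next build. File names and contents are illustrative.
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in chunks so large tarballs are fine."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_dependency(path: Path, pinned: str) -> bool:
    """True only if the on-disk artifact matches the digest recorded at vendoring time."""
    return sha256_of(path) == pinned

# Demonstration with a throwaway file standing in for a vendored tarball.
dep = Path(tempfile.mkdtemp()) / "somelib-1.0.tar.gz"
dep.write_bytes(b"pretend this is the 1.0 release")
pinned = sha256_of(dep)                     # recorded when the dependency was vendored
print(verify_dependency(dep, pinned))       # True
dep.write_bytes(b"silently replaced 1.0")   # upstream re-publishes under the same version
print(verify_dependency(dep, pinned))       # False
```

The point is the digest, not the version string: two artifacts can both call themselves 1.0.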
You can analyze that code.
There's static analysis that you can use.
There's linters and there's also reputation.
Do I trust this person?
Therefore, I may have a head start on trusting this code.
And then we have to trust the build system, right?
So we're not going to talk about recreating the universe from scratch. We're just going to zoom forward to the point where
I trust my build system and the hardware that it runs on.
I trust my compiler.
And once you have all those things,
we want to sign the results
and then carry around that signature
as kind of proof that we know where this thing came from.
I'm not saying it's going to prevent problems,
but it's going to let us identify when problems do crop up and we can trace it all the way back
to how that software got introduced. I like the term signature, by the way,
because it reminds me of a term that we've been using, which is a performance signature.
So we look at different performance characteristics of a system, and then this kind of makes up the signature.
And then if that signature changes, let's say from build to build, from configuration change to configuration change, or from workload to workload, then it's a way for us to detect that something is wrong.
And then we can typically also pinpoint it to the area of where that signature changes.
So I like the term.
So feel free to keep using it.
And what you just described
is a very much a missing component
to this software delivery pipeline.
Lots of people are thinking about this
from the QA standpoint.
So a QA team runs some quality assurance test
and they may sign it with their key.
We used to do this with RPMs back in the day.
And that means QA has approved to this
to go to the next environment.
We see this with security scanning tools that will look for vulnerabilities, and they may also
attach an additional signature saying, hey, I assert that this is free of any vulnerabilities
that I know about. So it would attach a signature. But you're right. For people in the performance
community, there could develop like a set of performance baselines and attach yet another signature to say,
this piece of software also conforms to our benchmark targets for what it means to be performance software.
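As a sketch of that idea, each pipeline stage could attach a signature over the artifact's digest. HMAC with a per-stage key is used here purely as a stand-in for real asymmetric signing, and the stage names and keys are invented:

```python
# Sketch: each pipeline stage (QA, security scan, performance baseline)
# attests to an artifact by signing a claim bound to the artifact's digest.
# HMAC stands in for real signing; keys and stage names are illustrative.
import hashlib
import hmac

def artifact_digest(artifact: bytes) -> str:
    """Content-address the artifact, like an image digest."""
    return hashlib.sha256(artifact).hexdigest()

def attest(stage_key: bytes, digest: str, claim: str) -> str:
    """Signature binding this stage's claim to exactly this artifact."""
    return hmac.new(stage_key, f"{claim}:{digest}".encode(), hashlib.sha256).hexdigest()

def verify(stage_key: bytes, digest: str, claim: str, signature: str) -> bool:
    return hmac.compare_digest(attest(stage_key, digest, claim), signature)

artifact = b"my-app-1.0 container image bytes"
digest = artifact_digest(artifact)
qa_sig = attest(b"qa-team-key", digest, "passed-qa")
perf_sig = attest(b"perf-team-key", digest, "meets-performance-baseline")
print(verify(b"qa-team-key", digest, "passed-qa", qa_sig))                        # True
print(verify(b"qa-team-key", artifact_digest(b"tampered"), "passed-qa", qa_sig))  # False
```

A performance-baseline signature would slot in as just one more attestation alongside QA and security.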
Yeah, that's awesome.
Hey, Kelsey, I know we've been taking up a lot of your time already.
And it's really amazing that knowing, at least
seeing based on social media, what you're doing, it's great that you could really take out that
amount of time off your busy schedule. I know there's a lot of more stuff we can talk about
when it comes to Kubernetes, which is obviously dear to your heart. We can, I'm sure, talk more
about what it means to be living like a minimalist, and how we can apply this to other parts of our life as well. Typically, and I know this, Brian, is something you always say, we kind of wrap it up in a little summary that I give. But is there any other topic that is also dear to you, any other thing we're missing, maybe something we can also cover in a later episode?
I mean, we talked about performance.
We talked about people that get started,
the whole myth of,
should we build one big Kubernetes cluster or just smaller ones for the upgrades?
We talked about security.
Is there any other area and topic
that we must not forget about?
Yeah, so I think service mesh is a really big component where we're starting to move a lot
of smarts into the network layer. So people are taking sidecars like Envoy and control planes like
Istio. And one, I think, big topic of performance is if we start to move things like TLS mutual auth
to be mandatory. So imagine a set of microservices, thousands of them.
You're going to introduce a real overhead in terms of performance to encrypt all traffic locally.
And that trade-off between security and performance is a big one because costs also go up.
I think there needs to be more exploration there around that trade-off between security and performance because if the system is so slow that it's
unusable, then people will, by default, more than likely turn off that security.
So I think we get more professionals who are looking at this from a performance standpoint,
not only identifying the performance bottleneck, but also helping people solve it.
Maybe there's kernel tunables we could use.
There are different techniques we can do in these proxies to speed things up quite a bit without turning them off.
I think this is really a big area
because I think right now,
most people are just taking the other trade-off,
which is turn off the security bits
because they can't afford the performance overhead.
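As a back-of-the-envelope illustration of that trade-off, per-hop sidecar and mTLS costs compound across a serial microservice call chain. All numbers below are invented; measure your own mesh before trusting any of them:

```python
# Sketch: how per-hop proxy and mTLS overhead compounds across a serial
# call chain of microservices. The millisecond figures are made up.
def chain_latency_ms(hops: int, service_ms: float, proxy_ms: float, mtls_ms: float) -> float:
    """Total latency when every hop pays the sidecar-proxy and mTLS tax."""
    return hops * (service_ms + proxy_ms + mtls_ms)

plain = chain_latency_ms(hops=5, service_ms=10.0, proxy_ms=0.0, mtls_ms=0.0)
meshed = chain_latency_ms(hops=5, service_ms=10.0, proxy_ms=1.5, mtls_ms=0.8)
print(f"plain: {plain:.1f} ms, meshed: {meshed:.1f} ms, "
      f"overhead: {(meshed / plain - 1):.0%}")
```

The deeper the call chain, the bigger the absolute cost, which is why a thousand-service mesh feels mandatory-mTLS overhead far more than a monolith would.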
But isn't that also an argument for,
I know there's a lot of people that say,
well, first of all, microservices don't solve
every problem, but then you get exactly the argument from people, well, you wouldn't have this problem if you were smarter about architecting your software and putting things together
that belong together without kind of breaking every little component, every little function
out in its own service.
And then you have to deal with the overhead of, first of all,
transportation, communication, and security. Well, so the thing is, very rarely have I seen microservices solve a performance problem. They solve an organizational problem, where you may have a large organization that needs to work together without tripping over each other's toes. I can see it from a big organizational structure standpoint, but to believe that we can go from
from a big organization structure, but to believe that we can go from
in-memory communication to over the network and somehow solve a performance problem,
that one's hard to swallow because I've never seen it, right? I think there are techniques you
can do with caching to close the gap, but it's very hard to get back to data locality. I think
the reason why that one's a little bit more up in the air is because you don't have to adopt microservices. You can choose an architectural pattern that works for you. You
can evaluate different designs, but security is one where you almost can't make the same
trade-off. You need security at all levels. How much you can have is really impacted by performance, because I think a lot of times
a lot of security tools don't think about performance. They think about security as the end-all be-all. But the thing is, if it's too slow, it gets turned off. And I think that one
is where I think we need more attention. I think microservices have been explored extensively. It's
kind of philosophical at this point. But the security one really needs people to really look
at it because
if you make the wrong trade-off here, it doesn't matter if you have a monolith or microservices,
it's just insecure.
Do we have, from a performance perspective, do these service meshes that, you know, handle
the secure communication at least, are there benchmarks, are there metrics that actually tell us
how much overhead is involved in securing the communication
and encrypting, decrypting?
Are you aware of that?
Yeah, there's people who are measuring that kind of thing.
I don't know if we have a great baseline
of acceptable performance yet.
I mean, I think there's some things
that have been published about the cost,
but I don't know if we understand what's acceptable, meaning if you hit this threshold, we should live with that. So I think there's room for a little bit more awareness there. It's like, in the performance space, we often talk about a performance budget. That means, you know, what's your performance end goal,
and it means you know how much time can you spend on each individual layer of your stack in order to fulfill that budget, in order to end up with, let's say, a certain response time, a certain throughput.
I would just assume that security, the overhead of security, the quote-unquote performance overhead of security, would just have to be factored in into the performance budget. That means where along the end-to-end transaction
can you trade off,
how much time can you leverage
or use for security-related purposes
in order to not impact your performance goals in the end?
One would hope.
One would hope, exactly.
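The budget idea from the exchange above is simple enough to sketch: security overhead becomes one explicit line item among the layers. Every number below is made up for illustration; a real budget comes from measuring your own stack:

```python
# Sketch: security overhead as an explicit line item in a performance
# budget. All figures are invented for illustration.
BUDGET_MS = 200.0  # end-to-end response-time goal for the transaction

layers_ms = {
    "load balancer": 5.0,
    "service mesh proxy + mTLS": 12.0,  # the security cost, budgeted explicitly
    "application logic": 120.0,
    "database": 45.0,
}

spent = sum(layers_ms.values())
headroom = BUDGET_MS - spent
for layer, cost in layers_ms.items():
    print(f"{layer:28s} {cost:6.1f} ms")
print(f"{'remaining headroom':28s} {headroom:6.1f} ms")
assert headroom >= 0, "budget blown: revisit the security/performance trade-off"
```

If the headroom goes negative, the honest move is to re-negotiate the budget or the security posture, rather than silently turning the security off.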
All right. Well, we
thank you very much for taking some time with
us today and
look forward to everything
else that's going to be coming out with you in the
future. People can follow you at Twitter
at, what is it again?
My Twitter handle
is just Kelsey Hightower. I pretty
much keep my DMs open. So if you have a question, I was going to say a good question,
but I think all questions can be a good question.
So you can feel free to DM me there.
And yeah, I'll be on Twitter
trying to interact and share as I learn.
And I can confirm that because Kelsey,
thank you so much for responding to my DM.
So you really are a man of your word.
So thank you so much for that and thank you
for being inspiring to so many people
especially with your demos that you do on stage
and with all the content that you put
out there. It's really phenomenal.
Keep up inspiring more people
and let's make the world a better place. Let's end it with this.
Awesome. Thanks for having me
and thanks for the kind words.