PurePerformance - How to successfully run k8s software in SaaS and on-premise with Marc Campbell
Episode Date: December 28, 2020

K8s enables organizations to more easily deploy their containerized solutions, as it takes away a lot of the operational tasks that are built into k8s itself. This in theory means that you can run your software anywhere and provide it as a SaaS offering or deploy it behind corporate firewalls for those customers that demand an on-premise installation.

In this episode we have Marc Campbell, Founder and CTO of Replicated, where they help the k8s community to deliver and manage apps on k8s anywhere. For anyone looking into running their apps on k8s, you will learn the challenges of Day 1 (delivery, install) and Day 2 (operation, monitoring, troubleshooting) operations. Marc shares common performance and scalability challenges and how to prepare for them during development.

https://www.linkedin.com/in/campbe79/
https://www.replicated.com/
https://www.heavybit.com/library/podcasts/the-kubelist-podcast/ep-7-keptn-with-andreas-grabner-of-dynatrace/
https://troubleshoot.sh/
https://kots.io/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always, my wonderful host Andy Grabner is here.
I hope, Andy, are you here?
Where else would I be? I'm still locked up at home in this pandemic.
I don't know, I thought you might be out somewhere learning how to do your own separate audio recording this year by yourself, and keeping it secret from me this whole time.
I know, you just learned something. But actually, I have the power of
recording stuff locally. I'm sorry that I never told you, but I guess you never really asked, or
maybe I just ignored it when you asked years ago. Yes, yes, yes. So I'll just get this part out of the way
because I like to shame.
Sorry about the audio quality this week, folks.
We've been having more and more problems with Ringer.
So Ringer, if you're listening, let this be a fair warning.
But we're backing up to a Zoom recording,
which gives us pretty low resolution.
So hopefully it's not too bad.
Hopefully everyone will continue listening because we have quite a wonderful show today.
Andy, why don't you introduce the show topic and then our wonderful guest?
Definitely.
Well, I think if the quality of the sound is not that good, we just make up for it with the quality of the content.
That's the way we do it here.
All right.
So today I was actually introduced to our guest today through his podcast,
Kubelist.
And I just did an episode recording around Keptn.
And the guest, therefore, is the person who runs the Kubelist podcast.
I'm pretty sure he can tell us much more than I can about
himself, about the podcast, and about his background. And this is why I want to introduce
Marc Campbell, founder and CTO at replicated.com, and probably doing a lot of other things that
we don't know about. But Marc, welcome to the show. And let us let the audience know who you
are and what you're doing. Great. Yeah, thanks for having me, Andy and Brian.
It's great to be here.
Yeah, Andy, so my name is Marc.
I'm the CTO and co-founder of a company called Replicated.
We've been around for about five and a half years,
basically helping SaaS companies and software companies
ship their software to their largest enterprise customers
behind the firewall.
So we work with companies like Puppet and HashiCorp
and CircleCI, and they take the software
that they're running in a SaaS product
and package it up for Kubernetes,
ship it behind the firewall so their customers can run it.
So that's obviously extremely interesting for, first of all, us, Dynatrace.
I mean, Brian and I, we work for Dynatrace,
and we have the same deployment model.
We have a SaaS solution, but we ship it to many of our customers,
also on-premise.
We call it the Dynatrace Managed Offering.
We are shipping it through our means that we've built over the years. But I clearly
see that, obviously, this is a very important deployment model because, unfortunately, not
everybody can always use a software-as-a-service offering. Now, Marc, before we go into the topic,
because we asked you to come on the show and really talk about, first of all, I want to just get some additional
background on why people are deciding to do this, to move from SaaS to on-premise, even
though I think it's clear, but still just some intro for people that never thought about
that.
And then the other thing is we really want to talk about what are challenges when running
software on-premise when you obviously know
how it runs well in the SaaS environment that you've developed it for?
But can you, before we dive into this topic,
I mean, how often do you see that organizations become very successful with SaaS and then realize, oh my gosh, how can I
do this now on-premise because all of a sudden I get a demand? So is this the common use case,
or do people from the start, when they think about building SaaS, already think about SaaS
and on-premise? What's the most common use case here?
Yeah, that's a great question. So when we started Replicated, we thought the primary use case was going to be an existing SaaS product, met a large sales opportunity,
and that opportunity required them to ship it on-prem behind the firewall. And just to clarify,
when I say on-prem, it doesn't often mean like bare metal server
sitting in a server closet somewhere.
It's like, it's the customer's AWS account
or the customer's Azure account these days.
But yeah, we thought that that was the primary use case
and that is a big use case,
but there's also a lot of traditional on-prem software
that's being rewritten to be microservice, Kubernetes-based.
And the deployment model that they've been running with for years has been to ship a JAR or a WAR file to their customer, let them spin up a database, and then run it in Tomcat or whatever the web server is that they choose. And now the developers, you know, have chosen Kubernetes to write the software because it's like, they get a lot of benefits
from this platform and that just increased the complexity of that deployment to those customers.
And so they need a way to help solve some of that. So it's like, yeah, a lot of it
is SaaS that is growing from, you know, a startup into more
enterprise sales, but it's also just, you know, the proliferation of Kubernetes has created
this challenge too.
That's actually interesting. So if I understand this correctly, traditional
software has been containerized and maybe put into Kubernetes, obviously, right, that's the way to go.
But then people realizing instead of running it on-premise,
pushing the complexity of Kubernetes to the customers,
then why not just take it and then also provide it as SaaS?
So I think that's interesting that obviously these two models
or these two directions come up.
Because initially, I guess just as you,
I would have thought that somebody comes up with a great new idea,
obviously builds things, containerized, run it somewhere
in a Kubernetes probably managed offering in the cloud
or a SaaS offering, and then kind of being challenged with,
well, how can we now not only run this,
but I think the biggest challenge is really how do we distribute it also behind the firewall,
right?
Because that's probably a big challenge too, because if we have these air-gapped systems,
you cannot just upload it to Docker Hub or somewhere else and then people download it
from there.
Yeah, exactly.
I mean, we look at the problem as there's really two problems to solve when running software behind the firewall. Day 1 is successfully getting it running in their environment.
But then down the road, how are they going to operate it?
Six months from now, it's not performing well, or it's crashing, or they need to upgrade
it.
The shape of their enterprise has changed a little bit, and their usage patterns have.
And it's in an air-gapped, totally sealed-off environment. And you as the ISV, how are you going to understand that
and be able to help them troubleshoot it?
Because they don't have that operational expertise
in the database components, the queue components,
all of the various technical components
that you're shipping with your application.
Now, that's an interesting question.
So if you think about a completely
air-gapped system and something happens, do you have any kind of quote-unquote best practices
or are there any frameworks, any other tools out there that provide, I don't know, some support for,
I don't know, collecting the right log files, collecting monitoring data, collecting whatever needs to be collected,
environment information,
so that this information can then be somehow shipped over
or sent over to the ISV to then have a look at it.
Are there any standards maybe even,
or is there any type of tooling around
that has become a de facto standard for that?
How does that work?
Yeah.
So Kubernetes definitely makes it easier, right?
Before Kubernetes, you would have had to know the location of log files
and know what specifically to look for, collect it all, and send it back.
And with Kubernetes, at least now we can run kubectl logs
and kubectl describe pod.
And whatever it is that you want to run.
You can generally access them all through a common CLI now.
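As a rough illustration of the common-CLI point Marc is making, the kind of information that used to be scattered across log files can now be pulled with a handful of kubectl commands; the deployment, pod, and namespace names below are made up for the example:

```shell
# Tail recent logs from the pods behind a (hypothetical) Deployment
kubectl logs deployment/my-app --namespace my-ns --tail=200

# Show events, probe failures, and resource limits for a misbehaving pod
kubectl describe pod my-app-6d4f9c7b8-x2k4q --namespace my-ns

# Snapshot cluster-level state that often explains app-level symptoms
kubectl get nodes -o wide
kubectl get events --namespace my-ns --sort-by=.metadata.creationTimestamp
```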
But as part of like,
that's a lot of the functionality that replicated
actually does provide.
And we actually released an open source tool
just called troubleshoot.
And it is a way to declaratively in a YAML file
describe what you want to collect and then provide automatic redaction.
Because generally, if you're going to collect log files and pod...
Sensitive information, I'm thinking of that.
Exactly. There's passwords. There's PII.
And so it redacts it all.
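The automatic redaction Marc describes can be sketched in a few lines of Python; the patterns and mask below are illustrative stand-ins, not the rules Troubleshoot actually ships with:

```python
import re

# Illustrative patterns: anything that looks like a credential assignment,
# plus email-style PII. A real redactor would carry a much longer list.
PATTERNS = [
    re.compile(r"(password|token|secret)\s*[=:]\s*\S+", re.IGNORECASE),
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email addresses
]

def redact(line, mask="***REDACTED***"):
    """Replace every match of a sensitive pattern with a fixed mask."""
    for pattern in PATTERNS:
        line = pattern.sub(mask, line)
    return line
```

Running every collected log line through a pass like this before the support bundle leaves the customer's environment is what makes it safe to ship back to the vendor.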
And this is just a totally open source component called Replicated Troubleshoot.
You can use it without even talking to us or signing up.
It's not tied to any service that we have.
It allows you just to collect all that.
And it eliminates the slow asynchronous kind of back and forth troubleshooting that you otherwise would have had to do.
Because if you have a customer who's having a problem and you need to collect logs, you'll just say, great, can you give me the logs of this pod or follow this deployment log for a little while and grab them?
And then you may say, great, now I actually need to go describe the service to see how it's configured.
And that asynchronous process just becomes tedious in those air-gapped environments.
And so collecting it all at once, getting it all into a giant tar.gz,
a support bundle type thing, is super useful.
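The declarative YAML Marc mentions looks roughly like the fragment below; the field names follow the troubleshoot.sh v1beta2 documentation, while the spec name, app label, and namespace are hypothetical:

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: example-collectors
spec:
  collectors:
    - clusterInfo: {}       # Kubernetes version, platform details
    - clusterResources: {}  # nodes, pods, deployments, events
    - logs:
        namespace: my-ns    # hypothetical namespace
        selector:
          - app=my-app      # hypothetical label
        limits:
          maxLines: 10000
```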
Yeah. I mean, Brian, this reminds me a lot of what we are obviously doing
with Dynatrace, even back in the AppMon days when we said,
we call it a support archive, which was basically a way to collect
all sorts of logs and config files that were relevant for us to understand
what the environment looks like and
where the problem might be and then ship it or
send it to our
support team.
Archie.
Exactly. Archie, right? Yeah.
Yeah, Archie. So, Marc,
this might be a strange name.
It might be a good thing to hear about,
actually, because it could come in handy.
Yeah, so maybe this is something for you too
to think about. So, we collect something that we call a support archive.
As you said, collecting all the logs in Config
and then zipping it up.
And then when it was uploaded,
we were using Jira for support tickets.
And if they were uploaded,
then the engineers typically unzip it
and then they look for patterns.
And so our engineering team said,
you know what, we can automate that.
So they developed a tool that they called Archie
for Archive Analyzer, basically.
And that Archie was triggered
every time somebody uploaded
one of these support archives to Jira.
And then our developers could actually also write
some rules that were automatically parsing or scanning for certain log patterns and for certain things that they already knew.
And so it automated a lot of the initial troubleshooting steps.
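The Archie-style analyzer Andy describes boils down to scanning an uploaded archive for known log patterns and attaching the advice an engineer would otherwise give by hand. A minimal sketch, with invented patterns and messages:

```python
import re

# Each rule pairs a known log pattern with canned advice.
# Patterns and advice strings here are invented examples.
RULES = [
    (re.compile(r"OutOfMemoryError"),
     "Heap exhausted: raise the JVM -Xmx setting."),
    (re.compile(r"connection refused.*:5432"),
     "Database unreachable: check the Postgres service."),
]

def analyze(log_text):
    """Return the advice for every rule whose pattern appears in the logs."""
    findings = []
    for pattern, advice in RULES:
        if pattern.search(log_text):
            findings.append(advice)
    return findings
```

Hooking a pass like this into the ticket upload path is what turns a pile of archives into automated first-line triage.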
Yeah, actually, that's exactly what the troubleshoot product tries to do, too.
We talked about the collection.
You can write a custom resource, which is a kind, you know, collector or whatever, and that defines everything to collect. And then, you know, you can also write these analyzers. And like, to your point, the automation of that is great when you can say, you know, oh, it is running on this version of Kubernetes, and it has this ingress controller, and this shows up in a log. Here's a link to a support article. And what we try to
think about is really how to, when a third party is running that piece of software, how can we push
that like self-service and that remediation capabilities all the way down into their
cluster, into their hands so that they're not, you know, if Dynatrace were to ship software behind
the firewall like that, you know, you don't want your customers to be completely dependent and
relying on you and your support team every time something goes wrong. You want to give them the
tools, especially if they're Kubernetes experts, to be able to, you know, understand how to
troubleshoot the system and get it back up. Yeah. Very cool. Now, let me ask you a question. So we
talk about Kubernetes here, right? And I mean, when I started my journey with Kubernetes about two years ago,
I was under the impression that every Kubernetes was made equal.
And if I run my pods somewhere, then I can run it anywhere.
But obviously, well, this is not the case,
because otherwise we wouldn't have this conversation probably.
So why is it still after so many years
of Kubernetes development and obviously the product
or the platform and all the community
and the ecosystem around it,
obviously investing a lot in, I think, in standards
and then making the product more solid
or the platform more solid.
Why is it still the case that we don't have a situation
where I can run my application anywhere
regardless of whether this is in Azure or in AWS
or wherever I run it?
Yeah, I mean, I ask that all the time too.
It's a good question.
I think, you know, Kubernetes is
that abstraction and that common API. And, you know, the way that we look at it is like the real
value of Kubernetes is all of this complexity that you had in your application around deployments,
around rolling updates, around, you know, networking and storage. It's pushed that
complexity down into this platform. So you don't have to think about it at the application layer anymore. The platform is now responsible for it. But there's little
variations on that platform. And you start to realize it when you go into unknown or potentially
even hostile Kubernetes environments. You might be used to running on, you know, GKE or EKS that has, you know, like an EBS PVC provisioner and everything's working great.
You know, you have EFS behind it.
And some, you know, classic examples here that we've run into is, you know, your application may expect to be able to provision a ReadWriteMany PVC in Kubernetes.
And not all Kubernetes clusters make that guarantee that you can do that.
And so that doesn't work.
You might expect that there's an ingress controller that you can have a load balancer type because
the cloud provider you're deploying to supports that.
And that's not true all the time.
Disk IO, network IO is, like, super, super dependent. And at scale and at load, Kubernetes really relies on
that etcd behind it to perform. And on, you know, an under-provisioned server, and in
different cloud providers, just the disk IO performance is going to be a little bit different
and unpredictable. Which, you know, you talk about this a lot on the podcast here
on other episodes, where that just, you know, propagates up into unexpected types of failures
and you have to understand how to run those.
Yeah, just the cloud providers are all different.
And then, you know, on a totally different layer, there's just like flavors of Kubernetes
that you need to think about.
Open shift being a big one, right?
You might be totally, you know, your app runs great in Kubernetes.
You've met all the requirements and everything,
but you have decided that when you're running it
in your Kubernetes cluster, you understand the risks
and you're managing that,
but it's okay to run as root in this one pod.
But as soon as you deploy it to a customer
and they're in a sensitive, regulated industry
and they don't want anything running as root,
they're running OpenShift, and it's just going to fail to run at that point.
Yeah, actually, it's funny that you mentioned this example
because I just got off the call earlier,
and one of our users of the open source project, Keptn,
that we are building, they had the same thing
with their OpenShift environment.
Some of our containers that require root access were not able
to run. So we worked with them and also with Red Hat to make this work.
That makes a lot of sense. And Andy, I have to say I love the
idea of a hostile Kubernetes deployment. It makes me think of
if you're familiar with the Hitchhiker's Guide books, Marvin,
the paranoid android. You could have Oscar the Grouchy
Kubernetes cluster.
Every time you type in a
kubectl, come on, what do you want?
Just seeing a
personified Kubernetes cluster, that would be
really awesome if it... Your YAML
is wrong!
Sorry.
I just got hung up on that idea.
I had to bring it up.
But Marc, if that is the case, then I wonder
what have we gained from all of Kubernetes?
Because in the end, if I think back
of the days prior
to Kubernetes, when we wrote software,
we were basically saying,
what are the system requirements? We said, this particular
version of Windows or Linux
and these hardware requirements are supported because they're also tested.
So does this mean in the Kubernetes world, actually nothing has changed, that we still
have to say our software requires exactly these hardware combinations and software combinations
because this is what we've tested for?
Or at least that's what it sounds to me.
No, I mean, I think that's fair.
I don't think that that's true.
I think my message is more like it's continuing to evolve.
We've made huge, huge strides with Kubernetes,
and you don't have to rely on that anymore.
An example that I like to talk about here is, you know, traditionally,
you know, SAP HANA is a piece of on-prem software that enterprises may be familiar with.
And to deploy SAP HANA, it literally was a 95-page manual that you needed to get it running.
And you would have to put a team of like, you know, six, 10 engineers together and spend,
you know, a year provisioning hardware and getting that all up and running.
And it's just not like that anymore. Kubernetes has definitely, definitely solved that problem.
And its day one operations are really easy. You know, we have tools for it, Helm,
Kubernetes operators work, you know, like it's really easy. I think my message is really
like around that long tail of expectations. You know, if you are shipping into similar Kubernetes environments, similar now is a lot
bigger than it ever used to be. Kubernetes has leveled that field a lot. There is still this
long tail that you have to be aware of around little differences. And just go into it with
eyes wide open when you're shipping behind the firewall and understand how are you going to troubleshoot that?
And how are you going to like think about performance and load testing and chaos, you know, being prepared for like understanding what these problems are going to be?
Like, but like Kubernetes has definitely normalized that both in like a technical perspective and really in like a people and an operations perspective. You know, before, if you shipped software as a JAR file, the example I gave earlier,
you kind of expected that, you know, maybe your customer had a little bit of expertise
with, you know, Java runtime and Java, you know, being able to like get the JRE running
or you had Elasticsearch as a requirement and they would have to configure Elasticsearch.
And now it's just Kubernetes deployments and Kubernetes config maps and Kubernetes secrets.
And so you can rely on this level of common expertise, and they can run a Kubernetes cluster for you.
And it's not just running a Linux server that you're just going to be super unfamiliar with the distribution of Linux and the kernel that's on it.
Yeah, now I get your point.
I think that's fair.
Obviously, we get this additional level of abstraction that at least takes away the unknown
things underneath the Kubernetes platform.
So the platform itself should be kind of at least standard.
But still, let me ask you a question.
In your experience with the people you work with,
how often do you see that the software that is then getting deployed
on-premise runs on existing Kubernetes platforms
or Kubernetes clusters that the customer already provides
and where they maybe have some other software running?
Or how often does it happen, and maybe it should happen, that you say, well, I need a dedicated Kubernetes cluster in order
for us to perform as expected?
Yeah, I mean, so, you know, we solve for both of those problems.
You know, our goal is really to let you as an ISV ship your Kubernetes application and target either of those
customers, anywhere on that spectrum, you know, from zero Kubernetes expertise to, you know,
a super modern enterprise who has a great Kubernetes story, they're running it in production,
or anywhere along the way. But to answer the question, I think, you know, today, I think
there's still a lot of enterprises that choose to kind of treat a Kubernetes application as an appliance.
And it might be like, you know, 60-40 or 70-30, where the larger number is the folks who are just coming with some Linux boxes and they want to run, you know, a curl pipe bash command or they want to spin up an AMI.
And the fact that it's Kubernetes underneath that doesn't mean anything to them.
It just happens to be that way. But we're definitely
building for this future that we're seeing. And Kubernetes adoption is crazy. It's growing so
fast right now. And we're building towards this world where that's going to be the minority and
eventually gone. Every enterprise is going to have Kubernetes and have operational expertise in Kubernetes.
And, you know, if you're running on Azure, if you're a large bank and you decide, hey, we're going to run everything on Azure, it's pretty easy just to spin up an AKS cluster and deploy an application to it.
You don't need these snowflakes and like, you know, special environments.
It just helps you be able to manage it more.
So I think, really, it's pretty quick.
Every month that goes by right now,
more and more enterprises have Kubernetes
that they're willing to deploy to.
Now, let me ask you another question.
So we have with, you know,
I've in the last two years worked a lot with Kubernetes,
especially in our open source project,
Keptn, and have installed it or tried to help people install it
in their environments.
And we very often ran into things you mentioned earlier,
either special restrictions in their environment.
Typically, we run into some network issues.
Sometimes we run into storage issues.
Or like these air-gapped systems where, you
know, they're not allowed to download anything. So you get them to download
the containers, put them into their registry, through the security scans, and then eventually,
step by step, you actually find out which containers you actually need
to run it, and also what you're all depending on. But is there, I don't know, a
list or a good guide for developers or organizations that are building software now that they want to
run on-premise later on? Like a checklist saying, hey, here are the things you should know and
you should test for? You mentioned chaos engineering earlier.
Are there maybe some environments
or some environment combinations
that everyone should test for to really make sure
that they at least have 80 to 90% of common problems
kind of ruled out before they start deploying on-premise?
Yeah, I mean, so I think generally,
if you're building as vanilla Kubernetes as possible,
it's going to work. And you have to really think about security also. If you're deploying a
Kubernetes application, but you're going to rely on a dozen CRDs or operators, that's just going
to require cluster-admin level RBAC permissions to do that installation, and it may cause some challenges. So definitely focus on security. And, you know,
run load tests, definitely run load tests against your application. You know, k6 is pretty popular
right now and works really well. Chaos, kind of in two different ways. Run chaos against both the underlying platform, right? So go, you know, pick a chaos platform to run on
and then be able to like stress test Kubernetes,
stress test, you know, what happens when etcd
and core DNS are under load
and how does your application perform there?
But then also be able to like run chaos in the application.
You know, generally, you know,
if you're shipping your product to a, you know, into a multi-tenant SaaS environment, you understand how your application is going to perform when a pod restarts or is not available or it's doing a rolling update.
It's really just the difference in like that, the underlying platform though, when Kubernetes is misbehaving or it's like a, you know, something unexpected there. And just really focus on like, you know, what happens when there's like high load or partial outages in the platform, like slow disks
or high packet loss and like the CNI providers, like not working as great as you want it to be.
The other thing, too, is really just to survey the customers. You know,
if you're going to ship software behind the firewall, one of the things that we've built over
the years is really, it's kind of a manual process, but for the first few customers you
ship to, you know, send them a survey where you're asking them everything from, what OS
are you going to run this on? What version of Kubernetes, what distribution? How do you
update the Kubernetes? What CNI provider, what CSI provider, what container runtime are you running? And just understand all this, because, you know, there is
this long tail, but there's absolutely a light at the end of the
tunnel here, right? Each install is easier and easier and easier. So the first couple are
going to be a little bit unknown for you, and, you know, go through those. But once you do,
you can definitely scale this and operationalize it and just rely on those
patterns over and over again and rely on the tests.
And then bake,
the last thing is really just bake all of the lessons that you learned.
You know, if you have a customer who has a problem running your software,
like go back and take the support archive that you have and enhance it.
If you had to do a one-off, you know, can you grab the logs of this?
Don't forget to go bake that back into the archive collection.
So it's always in there and then go write an analyzer rule for it.
So the next customer doesn't have to go through that manual process and continue to evolve
and get the product more like self-service.
And, you know, you don't, you don't need to pull in the senior engineer
in order to troubleshoot that cluster anymore.
This all sounds very familiar to me in a way, right?
And I may be oversimplifying,
but I'm curious to get your take on this.
Everything that you're talking about now
and some things that came up earlier about
all the differences between different Kubernetes rollouts
and having to manage.
Isn't this what Cloud Foundry was all about,
of having an opinionated platform and making it
so all you have to do is push your code?
I mean, I guess the main problem with Cloud Foundry
is you have this behemoth mass of servers
that you need to run it on.
But again, I may be being too simplistic here,
but wasn't that the goal of Cloud Foundry to do everything that we're trying to recreate through Kubernetes now by trying to get as many things standardized?
Yeah, I mean, I think it was. I think, you know, one of the benefits of Kubernetes, though, is really that it's vendor agnostic now, right? And, yeah, if I want to run, you know, Tanzu from VMware,
and I have a contract with them, I can. But if I don't and I just want, like, a simple k3s cluster,
I mean, obviously depending on, you know, the expectations of that application, I might be
able to do that. And, you know, all of the cloud providers have managed Kubernetes. It's just
ubiquitous. Kubernetes is everywhere. And nothing's really been at that level yet until now.
Yeah, it just seems like everyone's trying
to get to where Cloud Foundry was, but not through a vendor or massive hardware-scaled component.
Which, yeah, it's just always odd when you see someone come out with a good idea. But maybe,
again, I'm not saying Cloud Foundry is dead in any way, but obviously it didn't take off to the level
that some people, I guess, hoped it would.
Yeah, it's just funny.
I'm hearing all this and thinking.
Sometimes people come along a little too early, I think.
Marc, how do you deal with scaling on-premise?
Meaning, if you run in SaaS,
you obviously have your monitoring,
you know how you can scale your resources
with increasing demand.
How do you deal with this on the on-premise side?
Do you just constantly try to get some monitoring data
from the on-premise installation
and based on that, give new recommendations
on how to scale or how does this work?
Yeah, that's definitely an area that's like,
there's a lot of variety, I think, today
in how you ship to enterprise
because it's going to be dependent on the enterprise.
It's like in a perfect world,
you could rely on all of your customers
to have Kubernetes set up with Prometheus, both horizontal and vertical autoscaling, and everything
is just going to handle that completely automatically. You know, but, you know,
realistically, that's just not always there. And some folks are running this as appliances and they
have, you know, a three-node cluster and how do they handle that scale? And scale is also just
interesting because, you know, an enterprise customer, you know,
it may be that your largest enterprise customer starts to rival the size of your entire multi-tenant
SaaS offering or getting close to it.
So, like, there can be, like, legitimately real scale there.
And, like, it's going to be dependent on the applications.
You know, a lot of enterprise applications aren't going to have a very spiky workload, where it's like, oh, it's Black Friday or holiday sales, and now we need to scale this thing up 10x. It's predictable, linear scale and growth. So definitely, monitoring is key. Hook that in, and in documentation make sure you're really clear about the minimum requirements.
A pattern that we've seen some companies do really well, too, is to say, I think GitLab does this in their documentation, where they have a config for a 100-user, a config for a 1,000-user, and a config for a 10,000-user workload. And they're different, but you just kind of pick that bucket that you want to go into and make sure you're picking the right configuration, the right hardware to start with.
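When customers do run full Kubernetes, the horizontal auto-scaling Marc mentions earlier is typically expressed as a HorizontalPodAutoscaler. A minimal sketch; the deployment name and thresholds here are hypothetical:

```yaml
# Scales a (hypothetical) "api" Deployment between 3 and 10 replicas
# based on average CPU utilization across its pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Documenting a manifest like this per sizing bucket (100-user, 1,000-user, 10,000-user) is one way to make those configurations concrete for customers.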
Honestly, Andy, one of the biggest challenges that we see when we troubleshoot and support these systems is just under-provisioned hardware. It's crazy. Just put the right size hardware in there and Kubernetes generally does a pretty good job.
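Under-provisioned hardware is exactly the kind of thing a preflight check can catch before install. A hedged sketch using the open-source troubleshoot.sh project Marc's team maintains; the version and CPU thresholds are made up for illustration:

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: sizing-check
spec:
  analyzers:
    - clusterVersion:
        outcomes:
          - fail:
              when: "< 1.19.0"
              message: This application requires Kubernetes 1.19.0 or later.
          - pass:
              message: The Kubernetes version is supported.
    - nodeResources:
        checkName: Total cluster CPU
        outcomes:
          - fail:
              when: "sum(cpuCapacity) < 8"
              message: The cluster must have at least 8 CPU cores in total.
          - pass:
              message: The cluster has enough CPU capacity.
```

Running a spec like this before install turns "we put it on a spare machine somewhere" into an explicit, early failure instead of a support call.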
Yeah, yeah.
Yeah, so what's funny is, just this week we're kind of in the process of shipping a new piece of software on-premise as well. And we're just making these baby steps actually to deploy new software on Kubernetes on the hardware of our customers.
And one of the things we ran into this week
because I'm kind of leading the project
is exactly undersized hardware.
We ran into file system issues.
Not only, I mean, we ran into all sorts
of interesting things we'd never thought of, right?
So like, because we've been doing all sorts of testing
and in our environment, everything worked as expected.
And then the first install,
they had an old version of CentOS
with a certain file system flag that was not turned on.
Or then we had, you know, very strange network policies,
or one environment was just a completely undersized environment
because they were just saying,
yeah, we just have a spare machine somewhere.
You can use that.
And, right, I mean, that's, yeah.
Yeah, and I mean, we've recently learned that this one,
we were troubleshooting a customer on Azure recently
and Kubernetes, you have to have swap disabled on the operating system.
When you're deploying Kubernetes, it doesn't work well if you have swap enabled.
And so, you know, we have a way to turn off swap.
You would expect that to be super consistent across every Linux flavor, every distribution, every kernel. And it is, except, you know, this Azure image ignores that and runs a script every time you reboot an instance that turns swap back on.
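For reference, the standard fix is straightforward; this is a hedged ops sketch, and the waagent setting is what we understand Azure's Linux agent uses to control swap, so check your image's documentation:

```sh
# Turn off swap immediately (the kubelet refuses to run with swap
# enabled by default).
sudo swapoff -a

# Keep it off across reboots by commenting out swap entries in /etc/fstab.
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab

# On Azure images running the Linux agent (waagent), swap can be
# re-created from the resource disk on boot; disable that in
# /etc/waagent.conf:
#   ResourceDisk.EnableSwap=n
```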
And troubleshooting that, it's this long tail that we talk about, right? There are these problems, they're unexpected, and I think the good news is: it seems like there's an insurmountable number of these permutations that you're going to run into.
But in reality, there's not.
There's not infinite cloud providers.
There's not an infinite number of Kubernetes distributions out there.
And they're not widespread, too.
They're clustered together.
And a lot of your customers are going to have the same configuration. So once you kind of figure out how to solve this (we've been solving this for five or six years, and finding somebody who can help you solve this problem can accelerate that a little bit), it's pretty quick, and then stuff just runs. Kubernetes just solves this problem so well. And as long as you know what you're getting into and you can manage it, stuff's going to be super reliable running in those customer environments.
Now, one more question that I have. Let's assume you have a SaaS business
and then you have a handful of managed customers. With managed, I always say managed because in the Dynatrace world, Dynatrace Managed means on-premise.
So if you have a handful of people
that run your software on-premise
and that number grows,
how do you keep track
and how do you control and automate
the rollout of the right version
to all of your different remote on-premise customers?
How do you do this, right?
Because I would assume SaaS is always kind of the latest version that you have, and then
you allow your on-premise customers to then say, do they also want to be on the latest?
Well, maybe they want to be on a different update schedule.
So my question is, first of all: is this the case, as I explained, that most companies run the latest version in SaaS and then roll it out to their on-premise customers, but maybe also with a slightly different schedule, depending on the customer's needs?
And the second question is, if this is the case, you will end up with different versions of that software
being distributed across the globe in SaaS and on-premise.
How do you manage that?
How do you manage version hell?
Because every installation is one version.
But how do you keep an oversight of all this?
Yeah.
No, that's actually a great question.
That is actually relatively common, right?
Like you're probably doing continuous delivery to your SaaS environment.
A lot of what Replicated does is to help solve that problem.
And I'm not going to like kind of dive into those details here.
But like, you know, at a high level, you know, the way that we see the problem is really, think about release channels. Even among your current customers running the on-prem distribution, they're not all going to be on that same cadence. You may have large enterprises who say, we only want to update every three or six months. And you might have other customers, or even that large enterprise may have two installations of it, one where they're on the nightly or the weekly build.
So we look at it like this: you should run continuous delivery into the on-prem product and make those releases available, but then have release channels.
And you might just do something as simple as nightly, beta, and stable, but you can also have a quarterly or whatever release channels you want, and allow your customer to subscribe to those. And then they can get the
updates that meet their requirements. You know, large enterprises, especially regulated ones, are going to have a process to ingest software, and it can be a little bit of a heavy process. And so they just don't want to receive software every day from you; it's too time consuming for them. And so they want slower, more reliable, stable, and tested versions. But I also do think you can't allow that proliferation of versions
just to go unchecked out there.
Otherwise your software is going to be super, super hard to support.
And then, you know, your engineers are going to be thinking, great, we're building for Kubernetes 1.20 and we expect to be able to ship this new custom resource that depends on something that was introduced, or doesn't depend on something that was deprecated. And you obviously don't want to be supporting customers running Kubernetes 1.13 today.
And so we think about these low waterlines: maybe you set a policy and you tell your customers, you have to stay within two versions of our stable release channel, or, you know, 12 months.
And that's a long time, but, you know, large enterprises do want these LTS, long-term supported versions, and they want to keep them up and running for a while, just for those reasons.
But yeah, you definitely have to think about both how you're going to update them: when it's the customer's choice to update because they want the latest version, and when you're out of the support window and we can't support that version anymore unless you do update.
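The "stay within two versions of our stable channel" policy Marc describes is easy to automate. A minimal sketch; the semantic-version format and the policy numbers are assumptions for illustration:

```python
def within_support_window(installed: str, current_stable: str,
                          max_minor_lag: int = 2) -> bool:
    """Return True if `installed` is at most `max_minor_lag` minor
    versions behind `current_stable` (same major version required).

    Versions are assumed to look like "MAJOR.MINOR.PATCH".
    """
    inst_major, inst_minor = (int(p) for p in installed.split(".")[:2])
    cur_major, cur_minor = (int(p) for p in current_stable.split(".")[:2])
    if inst_major != cur_major:
        # A major-version jump is outside any minor-lag policy.
        return False
    return cur_minor - inst_minor <= max_minor_lag

# A customer two minor versions behind is still supported...
print(within_support_window("1.18.4", "1.20.1"))  # True
# ...but three behind is out of the window.
print(within_support_window("1.17.0", "1.20.1"))  # False
```

A vendor could run a check like this against the versions reported back from installations to flag which customers must update before their next support request.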
And that makes a lot of sense.
And I think it's also very aligned with what we are doing here on the Dynatrace side
when we deploy our software to our on-premise customers.
Now, I mean, we've built over the years our own deployment models and deployment pipelines.
So our team, fortunately, has been doing this for a while
and they've optimized it so they know exactly
what is deployed where.
But for folks that you see out there,
especially people that are starting,
what type of tools would they use to really keep an overview or really understand what is running where?
Or is this something that you also have in your replicated solution?
Yeah.
So I guess I would kind of divide the problem into two: one is internet-connected installations and one is total air-gapped installations. For internet-connected ones, those are constantly checking for updates and syncing the license and stuff like this. And so we definitely do report back what version they're running.
And so you can see, you know, the adoption rate of the latest version and drill into that. For air-gapped ones, we definitely take a lot of effort and put a lot of work into making sure that air-gapped installations don't attempt to even reach out to the internet. It's not that you can attempt to reach out and just fail silently; you'll fail audits if you attempt to reach out at all. A true air-gapped environment really can't even attempt to ping the internet, can't attempt to do a check. And so for those, you have to really rely on a couple of other
options. One would just be, you know, you can see when they've downloaded the latest version.
There's no guarantees that that means they've installed it or are successfully running it.
You can collect one of those support archives, a support bundle, and in there report the version
that's running and the status of it all.
So, you know, they redact all the sensitive information, and you can do a periodic review where they send that to you, and you're checking to make sure everything's running okay. And you also get a little bit of information about what version they're running.
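As a concrete illustration, that support-archive approach can be declared with the open-source troubleshoot.sh support-bundle spec. A hedged sketch; the collector choice and the namespace/label names are illustrative:

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: periodic-review
spec:
  collectors:
    # Cluster metadata, including the Kubernetes version, which is
    # enough to tell what is actually running at the customer site.
    - clusterInfo: {}
    - clusterResources: {}
    # Logs from a (hypothetical) application namespace.
    - logs:
        namespace: myapp
        selector:
          - app=myapp
```

The customer runs this inside the air gap, built-in redaction strips sensitive values, and the resulting archive is what gets handed over for the periodic review.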
But, you know, it's definitely one of the challenges of running software behind the firewall in total air-gapped environments: there's just no way you can guarantee a lot of visibility into those. Cool. Hey, Mark, this was extremely insightful
for me that, you know, I know we initially, when we kind of set up this podcast recording,
we wanted to primarily focus on performance.
And I think we covered all this, but I think in general,
this whole discussion about how we, in our industry,
deliver software that maybe runs in SaaS and now on-premise,
or maybe it starts on-premise and then goes SaaS
and then kind of lives in both worlds.
All these considerations, all the things that can go wrong.
I think you gave a lot of great insights for engineers,
for people that architect these systems
on what they need to understand,
what can be different in these environments.
Thankfully, at least that's what I hear,
is Kubernetes definitely makes it easier
to kind of expect a certain standard,
yet there are still small nuances that might be different
depending on the target OS, the target environment.
Are there any closing remarks that you want to make sure reach the audience listening in, who may be contemplating right now, hey, our SaaS software would be great to run on-premise too, because we have these sales opportunities?
Anything else, any links to literature, anything we can pass on?
Yeah, no, for sure.
I think, you know, first thing is, like, I know we talk a lot about the scary, hard problems
to solve, but like, it's easier than it's ever been.
Kubernetes makes it super easy, and I definitely don't want to scare anybody away. I just want to say, hey, there's no guarantee that just because your software works in Kubernetes on your laptop, you're going to ship it to an enterprise who's going to have success running it at scale, right? You have to be a little bit thoughtful about that, and, you know, that's what we do. I think we're also,
Andy, you and I talked a little while ago on the Kubelist podcast about Keptn, and this whole world of building for SLIs and SLOs and moving towards fully autonomous operations is super cool.
And I think that's still, you know, something we're starting to see a little bit in Kubernetes. The more autonomous operations that you can package into your application, relying on an Elasticsearch operator instead of manually running Elasticsearch, so that the operator can run it for you, just provides more reliability and more baked-in operational expertise.
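As an example of that operator pattern, with the official Elastic operator (ECK) installed in a cluster, running Elasticsearch becomes declaring a custom resource rather than operating the cluster by hand. A sketch; the name, node count, and version are illustrative:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: app-search
spec:
  version: 8.13.0
  nodeSets:
    - name: default
      count: 3
      config:
        # Common setting for environments where mmap limits
        # cannot be raised on the host.
        node.store.allow_mmap: false
```

The operator then handles provisioning, upgrades, and recovery, which is the "baked-in operational expertise" Marc is describing.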
And so it's just going to work better and better and better. You know, I think it's easier than ever. And yeah, there are links. Obviously Replicated is a platform; this is what we do, what our team does every day, and we've been doing it for five and a half years. But there are open source projects too. We've created troubleshoot.sh, which is, you know, the build-your-own support bundle. Run it, you don't have to talk to us, it's just an open source project. Add analyzers to do all this.
There's shared Slack channels inside the Kubernetes Slack for the Troubleshoot project.
And we're happy to talk.
And honestly, I'm always happy just to chat about shipping software on-prem, regardless of how you're doing it, to understand the whole ecosystem better.
Very cool. Yeah, thank you so much for being on the show today. Brian, is there anything else from your end that you wanted to add? Uh, no, nothing on my end, it was all very, very interesting. And if you think back to the earlier conversations we had about Kubernetes, Andy, a couple years ago, it's crazy how far it's come and how much it's dominated.
Just mind-blowing.
So this is a very interesting conversation just to be listening to because it's so far from where we were.
Well, thanks everyone for listening.
Mark, thanks again for joining us.
It's been really great having you on.
If anybody has any questions or comments,
you can reach us at @pure_DT on Twitter, or you can send an email to pureperformance@dynatrace.com.
Mark, where can people go to find out more about you
and everything related to the wonderful goodness of Replicated?
Yeah, so I mean, we're at replicated.com, or we have our KOTS product, which, you know, we didn't talk about by name, but this is Kubernetes Off-The-Shelf and it's at kots.io, K-O-T-S dot io.
Awesome. Okay. We'll put some of those links out in the show notes as well. Thank you everyone for listening and we'll see you next episode. Bye-bye.