PurePerformance - How to successfully run k8s software in SaaS and on-premise with Marc Campbell

Episode Date: December 28, 2020

K8s enables organizations to more easily deploy their containerized solutions, as many of the operational tasks are handled by capabilities built into k8s itself. In theory this means that you can run your software anywhere and provide it as a SaaS offering, or deploy it behind corporate firewalls for those customers that demand an on-premise installation.

In this episode we have Marc Campbell, Founder and CTO of Replicated, where they help the k8s community deliver and manage apps on k8s anywhere. For anyone looking into running their apps on k8s, you will learn the challenges of Day 1 (delivery, install) and Day 2 (operation, monitoring, troubleshooting) operations. Marc shares common performance and scalability challenges and how to prepare for them during development.

https://www.linkedin.com/in/campbe79/
https://www.replicated.com/
https://www.heavybit.com/library/podcasts/the-kubelist-podcast/ep-7-keptn-with-andreas-grabner-of-dynatrace/
https://troubleshoot.sh/
https://kots.io/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always, my wonderful host Andy Grabner is here. I hope... Andy, are you here? Where else would I be while still locked up at home in this pandemic? I don't know, I thought you might be out somewhere learning how to do your own separate audio recording this year, by yourself, and keeping it secret from me this whole time. I know, you just learned something. But actually, I do have the power of recording stuff locally. I'm sorry that I never told you, but I guess you never really asked, or
Starting point is 00:01:00 maybe I just ignored it when you asked years ago. Yes, yes, yes. So I'll just get this part out of the way, because I like to shame: sorry about the audio quality this week, folks. We've been having more and more problems with Ringer. So Ringer, if you're listening, let this be a fair warning. But we're backing up to a Zoom recording, which gives us pretty low resolution. So hopefully it's not too bad.
Starting point is 00:01:24 Hopefully everyone will continue listening because we have quite a wonderful show today. Andy, why don't you introduce the show topic and then our wonderful guest? Definitely. Well, I think if the quality of the sound is not that good, we just make up for it with the quality of the content. That's the way we do it here. All right. So today I was actually introduced to our guest today through his podcast, Cubelist.
Starting point is 00:01:53 And I just did an episode recording around Captain. And the guest, therefore, is the person who runs the Cubelist podcast. I'm pretty sure he can tell us much more than I can about himself, about the podcast, and about his background. And this is why I want to introduce Mark Kemple, founder and CTO at replicated.com, and probably doing a lot of other things that we don't know about. But Mark, welcome to the show. And let us let the audience know who you are and what you're doing. Great. Yeah, thanks for having me, Andy and Brian. It's great to be here.
Starting point is 00:02:28 Yeah, Andy, so my name is Mark. I'm the CTO and co-founder of a company called Replicated. We've been around for about five and a half years, basically helping SaaS companies and software companies ship their software to their largest enterprise customers behind the firewall. So we work with companies like Puppet and HashiCorp and CircleCI, and they take the software
Starting point is 00:02:53 that they're running in a SaaS product and package it up for Kubernetes, ship it behind the firewall so their customers can run it. So that's obviously extremely interesting for, first of all, us, Dynatrace. I mean, Brian and I, we work for Dynatrace, and we have the same deployment model. We have a SaaS solution, but we ship it to many of our customers, also on-premise.
Starting point is 00:03:18 We call it the Dynatrace Managed Offering. We are shipping it through our means that we've built over the years. But I clearly see that, obviously, this is a very important deployment model because, unfortunately, not everybody can always use software, as it says, offering. Now, Mark, before we go into the topic, because we asked you to come on the show and really talk about, first of all, I want to just get some additional background on why people are deciding to do this, to move from SaaS to on-premise, even though I think it's clear, but still just some intro for people that never thought about that.
Starting point is 00:03:56 And then the other thing is we really want to talk about what are challenges when running software on-premise that you obviously work, you know how it runs well in the SaaS environment that you've developed it for? But can you, before we dive into this topic, I mean, how often do you see that organizations become very successful with SaaS and then realize, oh my gosh, how can I do this now on-premise because all of a sudden I get a demand? So is this the common use case, or do people from the start, when they think about building SaaS, already think about SaaS and on-premise? What's the most common use case here or the most common, I thought?
Starting point is 00:04:50 Yeah, that's a great question. So when we started Replicated, we thought the primary use case was going to be an existing SaaS product, met a large sales opportunity, and that opportunity required them to ship it on-prem behind the firewall. And just to clarify, when I say on-prem, it doesn't often mean like bare metal server sitting in a server closet somewhere. It's like, it's the customer's AWS account or the customer's Azure account these days. But yeah, we thought that that was the primary use case and that is a big use case,
Starting point is 00:05:19 but there's also a lot of traditional on-prem software that's being rewritten to be microservice, Kubernetes-based. And the deployment model that they've been running with for years has been to ship a JAR or a WAR file to their customer, let them spin up a database, and then run it in Tomcat or whatever the web server is that they choose. And now the developers, you know, have chosen Kubernetes to write the software because it's like, they get a lot of benefits from this platform and that just increased the complexity of that deployment to those customers. And so they need a, they need a way to help solve some of that. So it's like, yeah, like a lot of it is SaaS that is growing from a, us you know a startup into like more enterprise sales but it's also just you know the the proliferation of kubernetes has created this this challenge too that's actually interesting so if i if i understand this correctly traditional
Starting point is 00:06:16 software has been containerized and maybe put into kubernetes obviously, right, that's the way to go. But then people realizing instead of running it on-premise, pushing the complexity of Kubernetes to the customers, then why not just take it and then also provide it as SaaS? So I think that's interesting that obviously these two models or these two directions come up. Because initially, I guess just as you, I would have thought that somebody comes up with a great new idea,
Starting point is 00:06:52 obviously builds things, containerized, run it somewhere in a Kubernetes probably managed offering in the cloud or a SaaS offering, and then kind of being challenged with, well, how can we now not only run this, but I think the biggest challenge is really how do we distribute it also behind the firewall, right? Because that's probably a big challenge too, because if we have these air-gapped systems, you cannot just upload it to Docker Hub or somewhere else and then people download it
Starting point is 00:07:19 from there. Yeah, exactly. I mean, we look at the problem as there's really two problems to solve when running software behind the firewall. Day're successfully getting it running in their environment. But then down the road, how are they going to operate it? Six months from now, it's not performing well, or it's crashing, or they need to upgrade it. The shape of their enterprise has changed a little bit, and their usage patterns have. And it's in an air-gapped, totally sealed-off environment. And you as the ISV, how are you going to understand that
Starting point is 00:08:08 and be able to help them troubleshoot it? Because they don't have that operational expertise in the database components, the queue components, all of the various technical components that you're shipping with your application. Now, Ardir, that's an interesting question. So if you think about a completely archived system and something happens, do you have any kind of quote-unquote best practices
Starting point is 00:08:32 or are there any frameworks, any other tools out there that provide, I don't know, some support for, I don't know, collecting the right log files, collecting monitoring data, collecting whatever needs to be collected, environment information, so that this information can then be somehow shipped over or sent over to the ISV to then have a look at it. Are there any standards maybe even, or is there any type of tooling around that has become a de facto standard for that?
Starting point is 00:09:03 How does that work? Yeah. So Kubernetes definitely makes it easier, right? Before Kubernetes, you would have had to know the location of log files and know what specifically to look for, collect it all, and send it back. And with Kubernetes, at least now we can run kubectl logs and kubectl describe pod. And whatever it is that you want to run.
Starting point is 00:09:25 You can generally access them all through a common CLI now. But as part of like, that's a lot of the functionality that replicated actually does provide. And we actually released an open source tool just called troubleshoot. And it is a way to declaratively in a YAML file describe what you want to collect and then provide automatic redaction.
Starting point is 00:09:50 Because generally, if you're going to collect log files and pod... Content-sensitive information, I'm thinking of that. Exactly. There's passwords. There's PII. And so it redacts it all. And this is just a total open source component called replicated troubleshoot. You can use it without even talking to us, signing up. And this is just a total open source component called replicated troubleshoot. It's not, you can use it without even talking to us, signing up. It's not tied to any service that we have.
Starting point is 00:10:12 It allows you just to collect all that. And it eliminates the slow asynchronous kind of back and forth troubleshooting that you otherwise would have had to do. Because if you have a customer who's having a problem and you need to collect logs, you'll just say, great, can you give me the logs of this pod or follow this deployment log for a little while and grab them? And then that may just say, great, now I actually need to go describe the service to see how it's configured. And that asynchronous process just becomes tedious in those air-gapped environments. And so collecting it all at once, getting it all into giant targies, a support bundle type thing, is super useful. Yeah. I mean, Brian, this reminds me a lot of what we are obviously doing
Starting point is 00:10:51 with DynamoTrace, even back in the AppMon days when we said, we call it a support archive, which was basically a way to collect all sorts of logs and config files that were relevant for us to understand what the environment looks like and where the problem might be and then ship it or send it to our support team. Archie. Exactly. Archie, right? Yeah.
Starting point is 00:11:13 Yeah, Archie. So, Margie, this might be a strange name. It might be a good thing to hear about actually because it could come handy. Yeah, so maybe this is something for you too to think about. So, we have we collect something that we call a support archive. As you said, collecting all the logs in Config and then zipping it up.
Starting point is 00:11:33 And then when it was uploaded, we were using Jira for support tickets. And if they were uploaded, then the engineers typically unzip it and then they look for patterns. And so our engineering team said, you know what, we can automate that. So they developed a tool that they called Archie
Starting point is 00:11:51 for Archive Analyzer, basically. And that Archie was triggered every time somebody uploaded one of these support archives to Jira. And then our developers could actually also write some rules that were automatically parsing or scanning for certain log patterns and for certain things that they already knew. And so it automated a lot of the initial troubleshooting steps. Yeah, actually, that's exactly what the troubleshoot product tries to do, too.
Starting point is 00:12:19 We talked about the collection. You can write a custom resource, which is a kind, you know, collector or whatever, and that defines everything to collect. And then, you know, you can also write these analyzers. And like, to your point, the automation of that is great when you can say, you know, oh, it is running on this version of Kubernetes, and it has this ingress controller, and this shows up in a log. Here's a link to a support article. And what we try to think about is really how to, when a third party is running that piece of software, how can we push that like self-service and that remediation capabilities all the way down into their cluster, into their hands so that they're not, you know, if Dynatrace were to ship software behind the firewall like that, you know, you don't want your customers to be completely dependent and relying on you and your support team every time something goes wrong. You want to give them the tools, especially if they're Kubernetes experts, to be able to, you know, understand how to
Starting point is 00:13:13 troubleshoot the system and get it back up. Yeah. Very cool. Now, let me ask you a question. So we talk about Kubernetes here, right? And I mean, when I started my journey with Kubernetes about two years ago, I was under the impression that every Kubernetes was made equal. And if I run my pods somewhere, then I can run it anywhere. But obviously, well, this is not the case, because otherwise we wouldn't have this conversation probably. So why is it still after so many years of Kubernetes development and obviously the product
Starting point is 00:13:53 or the platform and all the community and the ecosystem around it, obviously investing a lot in, I think, in standards and then making the product more solid or the platform more solid. Why is it still the case that we don't have a situation where I can run my application anywhere regardless of whether this is in Azure or in AWS
Starting point is 00:14:16 or wherever I run it? Yeah, I mean, I ask that all the time too. It's a good question. I think Kubernetes is know, Kubernetes is that abstraction and that common API. And, you know, the way that we look at it is like the real value of Kubernetes is all of this complexity that you had in your application around deployments, around rolling updates, around, you know, networking and storage. It's pushed that complexity down into this platform. So you don't have to think about it at the application layer anymore. The platform is now responsible for it. But there's little
Starting point is 00:14:50 variations on that platform. And you start to realize it when you go into unknown or potentially even hostile style Kubernetes environments. You might be used to running on, you know, GKE or EKS that has, you know, like an EBS PVC provisioner and everything's working great. You know, you have EFS behind it. And some, you know, classic examples here that we've run into is, you know, your application may expect to be able to provision or rewrite many PVC in Kubernetes. And not all Kubernetes clusters make that guarantee that you can do that. And so that doesn't work. You might expect that there's an ingress controller that you can have a load balancer type because the cloud provider you're deploying to supports that.
Starting point is 00:15:34 And that's not true all the time. Disk IO, network IO is like super, super dependent and like at scale and at load kubernetes really relies on that etcd behind it to to to perform and on you know an under provisions server and like you in different cloud providers just the disk io performance is going to be a little bit different um in in unpredictable which you know like you talk about this a lot on your on the podcast here on other episodes where like that that just you know propagates up into like you know, you talk about this a lot on the podcast here on other episodes where like that just, you know, propagates up into like, you know, unexpected types of failures and you have to understand how to run those. Yeah, just the cloud providers are all different.
Starting point is 00:16:15 And then, you know, on a totally different layer, there's just like flavors of Kubernetes that you need to think about. Open shift being a big one, right? You might be totally, you know, your app runs great in Kubernetes. You've met all the requirements and everything, but you have decided that when you're running it in your Kubernetes cluster, you understand the risks and you're managing that,
Starting point is 00:16:37 but it's okay to run as root in this one pod. But as soon as you deploy it to a customer and they're in a sensitive, regulated industry and they don't want anything running as root, they're running it as OpenShift, and it's just going to fail to run at that point. Yeah, actually, it's funny that you mentioned this example because I just got off the call earlier, and one of our users of the open source project, Kip,
Starting point is 00:16:59 that we are building, they had the same thing with their OpenShift environment. Some of our containers that require root access were not able to run. So we worked with them and also with RedHeads to make this work. That makes a lot of sense. And Andy, I have to say I love the idea of a hostile Kubernetes deployment. It makes me think of if you're familiar with the Hitchhiker's Guide books, Marvin, the paranoid android, you could're familiar with the Hitchhiker's Guide books, Marvin, the paranoid
Starting point is 00:17:25 android, you could have Oscar the Grouchy Kubernetes cluster. Every time you type in a kubectl, come on, what do you want? Just seeing a personified Kubernetes cluster, that would be really awesome if it... Your YAML is wrong!
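Two of the portability pitfalls from this exchange, expressed as plain Kubernetes manifests. The names here are hypothetical; the fields are standard Kubernetes API.

```yaml
# Pitfall 1: assuming every cluster can provision shared storage.
# This claim only binds on clusters with an RWX-capable provisioner;
# elsewhere it can sit Pending forever.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany              # not guaranteed by every cluster
  resources:
    requests:
      storage: 10Gi
---
# Pitfall 2: running as root. Declaring non-root up front (and building
# the image to run as a non-root user) avoids the OpenShift-style
# failure, where restricted security policies refuse root containers.
apiVersion: v1
kind: Pod
metadata:
  name: example-api              # hypothetical
spec:
  securityContext:
    runAsNonRoot: true           # kubelet refuses to start root containers
  containers:
    - name: api
      image: example/api:1.0.0   # hypothetical image
      ports:
        - containerPort: 8080
```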
Starting point is 00:17:43 Sorry. I just got hung up on that idea. I had to bring it up. Mark, if that is the case Sorry. I just got hung up on that idea. I had to bring it up. But Mark, if that is the case, then I wonder what have we gained from all of Kubernetes? Because in the end, if I think back of the days prior to Kubernetes, when we wrote software,
Starting point is 00:17:58 we were basically saying, what are the system requirements? We said, this particular version of Windows or Linux and these hardware requirements are supported because they're also tested. So does this mean in the Kubernetes world, actually nothing has changed, that we still have to say our software requires exactly these hardware combinations and software combinations because this is what we've tested for? Or at least that's what it sounds to me.
Starting point is 00:18:27 No, I mean, I think that's fair. I don't think that that's true. I think my message is more like it's continuing to evolve. We've made huge, huge strides with Kubernetes, and you don't have to rely on that anymore. An example that I like to talk about here is, you know, traditionally, you know, SAP HANA is a piece of on-prem software that enterprises may be familiar with. And to deploy SAP HANA, it literally was a 95-page manual that you needed to get it running.
Starting point is 00:18:57 And you would have to put a team of like, you know, six, 10 engineers together and spend, you know, a year provisioning hardware and getting that all up and running. And it's just not that Kubernetes is definitely, definitely solve that problem. And it's day one operations are really easy. You know, we have a tool for it, Helm, Kubernetes operators work, you know, like it's, it's really easy. I think my, my message is really like around like that long tail of expectations. You know, you, if, if you are shipping into similar Kubernetes environments, similar now is a lot bigger than it ever used to be. Kubernetes has leveled that field a lot. There is still this long tail that you have to be aware of around little differences. And just go into it with
Starting point is 00:19:42 eyes wide open when you're shipping behind the firewall and understand how are you going to troubleshoot that? And how are you going to like think about performance and load testing and chaos, you know, being prepared for like understanding what these problems are going to be? Like, but like Kubernetes has definitely normalized that both in like a technical perspective and really in like a people and an operations perspective. You know, before, if you shipped software as a JAR file, the example I gave earlier, you kind of expected that, you know, maybe your customer had a little bit of expertise with, you know, Java runtime and Java, you know, being able to like get the JRE running or you had Elasticsearch as a requirement and they would have to configure Elasticsearch. And now it's just Kubernetes deployments and Kubernetes config maps and Kubernetes secrets. And so you can rely on this level of common expertise, and they can run a Kubernetes cluster for you.
Starting point is 00:20:36 And it's not just running a Linux server that you're just going to be super unfamiliar with the distribution of Linux and the kernel that's on it. Yeah, now I get your point. I think that's fair. Obviously, we get this additional level of abstraction that at least takes away the unknown things underneath the Kubernetes platform. So the platform itself should be kind of at least standard. But still, let me ask you a question. In your experience with the people you work with,
Starting point is 00:21:06 how often do you see that the software that is then getting deployed on-premise runs on existing Kubernetes platforms or Kubernetes clusters that the customer already provides and where they maybe have some other software running? Or how often does it happen happen and maybe it should happen that you say well i need a dedicated kubernetes cluster in order for us to perform as expected yeah i mean so you know we we solve for both of those problems we you know our goal is really to let you as an isV ship your Kubernetes application and target either of those customers that either under that spectrum, you know, zero Kubernetes expertise to, you know,
Starting point is 00:21:50 like a super modern enterprise who has a great Kubernetes story, they're running it in production or anywhere along the way. But to answer the question, I think, you know, it's today, I think there's still a lot of enterprises that choose to kind of treat a Kubernetes application as an appliance. And it might be like, you know, 60-40 or 70-30, where the larger number is the folks who are just coming with some Linux boxes and they want to run, you know, a curl pipe bash command or they want to spin up an AMI. And the fact that it's Kubernetes underneath that doesn't mean anything to them. It just happens to be that way. But we're definitely building for this future that we're seeing. And Kubernetes adoption is crazy. It's growing so fast right now. And we're building towards this world where that's going to be the minority and
Starting point is 00:22:38 eventually gone. Every enterprise is going to have Kubernetes and have operational expertise in Kubernetes. And, you know, if you're running on Azure, if you're a large bank and you decide, hey, we're going to run everything on Azure, it's pretty easy just to spin up an AKS cluster and deploy an application to it. You don't need these snowflakes and like, you know, special environments. It just helps you be able to manage it more. So I think every, really every, it's pretty quick. Every month that goes by right now, more and more enterprises have Kubernetes that they're willing to deploy to.
Starting point is 00:23:15 Now, let me ask you another question. So we have with, you know, I've in the last two years worked a lot with Kubernetes, especially in our open source project, kept and have installed it or tried to help people install it in their environments. And we very often ran into things you mentioned earlier, either special restrictions in their environment.
Starting point is 00:23:37 Typically, we run into some network issues. Sometimes we run into storage issues. Or like these air- air kept systems where you know they're not allowed to download anything so you download it somehow you get them to download the containers put them into their registry through the security scans and then eventually uh you step by step you actually find out which which containers and you actually need and to run it and then also what you're all depending on but um is there a i don't know a list or a good guide for developers or organizations that are building software now that they want to
Starting point is 00:24:15 run on premise later on like a checklist saying hey so here are the things you should know and you should test for you You mentioned chaos engineering earlier. Are there maybe some environments or some environment combinations that everyone should test for to really make sure that they at least have 80 to 90% of common problems kind of ruled out before they start deploying on-premise? Yeah, I mean, so I think generally,
Starting point is 00:24:44 if you're building as vanilla Kubernetes as possible, it's going to work. And you have to really think about security also. If you're deploying a Kubernetes application, but you're going to rely on a dozen CRDs or operators, that's just going to require cluster admin level RBAC permissions to do that installation and it may may cause some challenges so definitely focus on security um you know run run load tests definitely run load tests against your application you know k6 is pretty popular right now and works really well chaos kind of in two different ways um Run chaos against both the underlying platform, right? So go to, you know, pick a chaos platform to run on and then be able to like stress test Kubernetes, stress test, you know, what happens when etcd
Starting point is 00:25:33 and core DNS are under load and how does your application perform there? But then also be able to like run chaos in the application. You know, generally, you know, if you're shipping your product to a, you know, into a multi-tenant SaaS environment, you understand how your application is going to perform when a pod restarts or is not available or it's doing a rolling update. It's really just the difference in like that, the underlying platform though, when Kubernetes is misbehaving or it's like a, you know, something unexpected there. And just really focus on like, you know, what happens when there's like high load or partial outages in the platform, like slow disks or high packet loss and like the CNI providers, like not working as great as you want it to be. The thing too, is really just, you know, we survey the customers, you know, when you're going to like,
Starting point is 00:26:22 if you're going to ship software behind the firewall, like, you know, we survey the customers, you know, when you're going to like, if you're going to ship software behind the firewall, like, you know, one of the things that, that, that we've built over years is really, it's kind of a manual process, but it's just like the first few customers you ship to, you know, send them a survey where you're asking them everything from like, what, what OS are you going to run this on? Like what version of Kubernetes, what distribution, like how do you update the Kubernetes? What CNI provider, what provider what csi provider what what container runtime are you running um and just understand all this because like the you know the there is this long tail but like the there's a there's there's absolutely a light at the end of the tunnel here right like like each install is easier and easier and easier so the first couple like are going to be a little bit unknown for you um you know, and, you know, go through those. But like, once you, once you,
Starting point is 00:27:07 you can definitely scale this and operationalize it and just rely on those, like those, those patterns over and over again and rely on the tests. And then bake, the last thing is really just bake all of the lessons that you learned. You know, if you have a customer who has a problem running your software, like go back and take the support archive that you have and enhance. If you had to do a one-off, you know, can you grab the logs of this? Don't forget to go bake that back into the archive collection.
Starting point is 00:27:32 So it's always in there and then go write an analyzer rule for it. So the next customer doesn't have to go through that manual process and continue to evolve and get the product more like self-service. And, you know, you don't, you don't need to pull in the senior engineer in order to troubleshoot that cluster anymore. This all sounds very familiar to me in a way, right? And I may be oversimplifying, but I'm curious to get your take on this.
Starting point is 00:27:58 Everything that you're talking about now and some things that came up earlier about all the differences between different Kubernetes rollouts and having to manage. Isn't this what Cloud Foundry was all about, of having an opinionated platform and making it so all you have to do is push your code? I mean, I guess the main problem with Cloud Foundry
Starting point is 00:28:15 is you have this behemoth mass of servers that you need to run it on. But again, I may be being too simplistic here, but wasn't that the goal of Cloud Foundry to do everything that we're trying to recreate through Kubernetes now by trying to get as many things standardized? Yeah, I mean, I think it was. I think, you know, one of the benefits of Kubernetes, though, is really it's like it's vendor agnostic now right and like yeah if i want to run you know tanzu from vmware like and i have a contract with them i can but if i don't and i just want like a simple k3s cluster i mean obviously depending on you know the expectations of that application um i might be able to to do that and you know like all of the cloud providers have managed kubernetes it's just
Starting point is 00:29:01 it's it's right ubiquitous kubernetes is everywhere and like like nothing's really been at that level yet until now yeah it just seems like everyone's trying to get to where cloud foundry was but not through a vendor or massive hardware scaled component which just yeah it's just always odd when you see someone come out with a good idea but maybe again i'm not saying cloud foundry is dead in any way, but obviously it didn't take off to the level that some people, I guess, hoped it would. Yeah, it's just funny. I'm hearing all this and thinking. Sometimes people come along a little too early, I think.
Starting point is 00:29:37 Mark, how do you deal with scaling on-premise? Meaning, if you run in SaaS, you obviously have your monitoring, you know how you can scale your resources with increasing demand. How do you deal with this on the on-premise side? Do you just constantly try to get some monitoring data from the on-premise installation
Starting point is 00:30:02 and based on that, give new recommendations on how to scale? Or how does this work? Yeah, that's definitely an area where there's a lot of variety today in how you ship to enterprise, because it's going to be dependent on the enterprise. In a perfect world, you could rely on all of your customers
Starting point is 00:30:22 to have Kubernetes set up with Prometheus, both horizontal and vertical auto-scaling groups and everything is just going to handle that completely automatically. You know, but, you know, realistically, that's just not always there. And some folks are running this as appliances and they have, you know, a three-node cluster and how do they handle that scale? And scale is also just interesting because, you know, an enterprise customer, you know, a three-node cluster and how do they handle that scale? And scale is also just interesting because, you know, an enterprise customer, you know, it may be that your largest enterprise customer starts to rival the size of your entire multi-tenant SaaS offering or getting close to it. So, like, there can be, like, legitimately real scale there.
Starting point is 00:30:58 And, like, it's going to be dependent on the applications. You know, a lot of enterprise applications aren't going to have a very, like, you know, spiky workload where it's like, oh, it's, you know, Black Friday or holiday sales. And so now it's like, we need to scale this thing up 10x. It's predictable, it's linear scale and growth. So like definitely, you know, monitoring is key. Hook that in like documentation of like, how are you going to like, make sure you're really clear about the minimum requirements, a pattern that we've seen, you know, some companies do really well too is to say, you know, like, I think GitLab does this on their documentation where they have a config for,
Starting point is 00:31:35 you know, a 100 user, a config for a 1000 user and a config for a 10,000 user workload. And like, they're, they're different, but you just kind of pick that bucket that you want to go into and just make sure you're like picking the right configuration, the right hardware to start with. Honestly, Andy, like one of the biggest challenges that we see in like when we troubleshoot
Starting point is 00:31:56 and support these systems is just under-provisioned hardware and just like, it's crazy. Just put the right size hardware in there and Kubernetes generally does a pretty good job. Yeah, yeah. Yeah, so what's funny is just this week, we like, we kind of in the process of shipping
Starting point is 00:32:18 a new piece of software on-premise as well. And we're just making these baby steps actually to deploy a new software on Kubernetes on the hardwareise as well. And we're just making these baby steps actually to deploy a new software on Kubernetes on the hardware of our customers. And one of the things we ran into this week because I'm kind of leading the project is exactly undersized hardware. We ran into file system issues.
Starting point is 00:32:39 Not only, I mean, we ran into all sorts of interesting things we'd never thought of, right? So like, because we've been doing all sorts of testing and in our environment, everything worked as expected. And then the first install, they had an old version of CentOS with a certain file system flag that was not turned on. And, or then we had, you know, very strange network policies
Starting point is 00:33:04 or one environment was just a completely undersized environment because they were just saying, yeah, we just have a spare machine somewhere. You can use that. And, right, I mean, that's, yeah. Yeah, and I mean, we've recently learned that this one, we were troubleshooting a customer on Azure recently and Kubernetes, you have to have swap disabled on the operating system.
Starting point is 00:33:28 When you're deploying Kubernetes, it doesn't work well if you have swap enabled. And so, you know, we have a way to turn off swap. You would expect that to be super consistent across every, you know, Linux flavor, every distribution, every kernel and it is except you know this azure um ignores that and runs a script every time you reboot an instance that turns swap back on and like you know troubleshooting that it's like it's like this that's this long tail that we talk about right where like the you know there are these problems they're unexpected and i think you know it's good news is it just it seems like there's's an insurmountable number of these permutations that you're going to run into. But in reality, there's not.
Starting point is 00:34:09 There's not infinite cloud providers. There's not an infinite number of Kubernetes distributions out there. And they're not widespread, too. They're clustered together. And a lot of your customers are going to like have the same configuration so once you kind of figure out how to solve this or you know like we've been solving this for five or six years and like you know finding somebody who can help you solve this problem you know to accelerate that a little bit but um you know like it's it's pretty quick and then like stuff just runs like
Starting point is 00:34:39 kubernetes it just solves this problem so well and you know as long as you know what you're getting into and you can manage it, it's stuff's going to be super reliable running in those customer environments. Now, one more question that I have on, let's assume you have a SaaS business and then you have a handful number of managed customers or with managed, I always say managed
Starting point is 00:35:03 because in the damageless world, damageless managed means on-premise. So if you have a handful of people that run your software on-premise and that number grows, how do you keep track and how do you control and automate the rollout of the right version
Starting point is 00:35:20 to all of your different remote on-premise customers? Is this, how do you do this, right? Because I would assume SaaS is always kind of the latest version that you have, and then you allow your on-premise customers to then say, do they also want to be on the latest? Well, maybe they want to be on a different update schedule. So my question is, first of all, is this the case? Just as an error that I explained, that most companies run the latest version in SaaS, and then they roll it out to their on-premise customers, but maybe also with a slightly
Starting point is 00:35:59 different schedule, depending on the customer's need. And the second question is, if this is the case, you will end up with different versions of that software being distributed across the globe in SaaS and on-premise. How do you manage that? How do you manage version hell again? Or how do you manage version hell? Because every installation is one version. But how do you keep an oversight of all this?
Starting point is 00:36:23 Yeah. No, that's actually a great question. That is actually relatively common, right? Like you're probably doing continuous delivery to your SaaS environment. A lot of like what Replicated does is to help solve that problem. And I'm not going to like kind of dive into those details here. But like, you know, at a high level, you know, the way that we see the problem is really, think about release channels. Even among your current customers running the on-prem distribution, they're not all going to be on that same cadence. You may have large enterprises who say, we only want to update every three or six months. And you might have other customers, or even that large enterprise may have two installations of it, one where they're on the nightly or the weekly build. So we look at like, you should run continuous delivery
Starting point is 00:37:10 into these customer, into the on-prem product and make those available, but then have release channels. And you might just do something as simple as like nightly beta and like stable, but you can also have like a quarterly or like whatever release channels you want and have like a quarterly or, you know, like whatever release channels you want and allow your customer to subscribe to those. And then they can get the updates that meet their requirements. You know, like large enterprises, especially regulatory regulated ones are going to have like a process to ingest software and they have to like,
Starting point is 00:37:41 and it could be a little bit of a heavy process. And so they just, they, they don't want to receive software every day from you. Like it's just, it's too time consuming for them. And so they want like, you know, more slow, like in reliable and stable and tested, um, versions. Um, but like, I, I also do think like you can't allow that proliferation of versions just to go unchecked out there. Otherwise your software is going to be super, super hard to support. And, you know, then you, you know,
Starting point is 00:38:09 your engineers are going to be thinking, great, we're building for Kubernetes one 20 and, you know, we expect to be able to ship, you know, this new, you know, custom resource that depends on something that was, you know, introduced or doesn't depend on something that was deprecated. And, you know, you don't, you obviously don't want to be supporting customers running Kubernetes 113 today. And so we think about these low waterlines
Starting point is 00:38:32 where you have to really be able to have, maybe you set a policy and you tell your customers, you have to stay within two versions of our stable release channel or, you know, 12 months. And that's a long time, but like, you know, like large enterprises do want these like LTS long-term supported versions and they want to keep them up and running for a while just for, for those reasons. But like, yeah, you definitely have to think about both how you're going to update them
Starting point is 00:39:01 and like when it's the customer's choice to update because they want the latest version and when you're out of the support window and we can't support that version anymore unless you do update. And that makes a lot of sense. And I think it's also very aligned with what we are doing here on the Dynatrix side when we deploy our software to our on-premise customers. Now, I mean, we've built over the years our own deployment models and deployment pipelines. So our team, fortunately, has been doing this for a while
Starting point is 00:39:35 and they've optimized it so they know exactly what is deployed where. But for folks that you see out there, especially people that are starting, what type of tools would they use to really keep an overview or really understand what is running where? Or is this something that you also have in your replicated solution? Yeah. So for non-air-gapped installation, the problem, I guess I would kind of divide into two problems.
Starting point is 00:40:03 One is internet-connected installations installations and one is like total air gapped installations. So for internet connected ones, yeah. Like we, you know, those are constantly checking for updates and syncing the license and stuff like this. And so we definitely do report back what version they're running. And so you can see, you know, adoption rate of the latest version and like, and drill into that for air gap ones, you know, like, you know,
Starting point is 00:40:24 we definitely take a lot of effort and put a lot of work into making sure that air gap installations don't attempt to even reach out to the internet. Not like, you know, you'll fail audits if you attempt to reach out, but like fail silently, like a true air gap environment just really can't even attempt to ping the internet, can't attempt to do a check. And so for those, you know, you have to really rely on like, you know, a couple of other options. One would just be, you know, you can see when they've downloaded the latest version. There's no guarantees that that means they've installed it or are successfully running it. You can collect one of those support archives, a support bundle, and in there report the version that's running and the status of it all.
Starting point is 00:41:05 So, you know, they redact all the sensitive information and, you know, you can, you know, do a periodic review where they send that to you. And you're like, you know, checking to make sure everything's running okay. And also you get a little bit of information about what version they're running. But, you know, it's definitely one of the challenges
Starting point is 00:41:22 of running software behind the firewall in total air-gapped environments is you just, like, there's no way that you can guarantee a lot of visibility into those. Cool. Hey, Mark, this was extremely insightful for me that, you know, I know we initially, when we kind of set up this podcast recording, we wanted to primarily focus on performance. And I think we covered all this, but I think in general, this whole discussion about how we, in our industry, deliver software that maybe runs in SaaS and now on-premise, or maybe it starts on-premise and then goes SaaS
Starting point is 00:42:02 and then kind of lives in both worlds. All these considerations, all the things that can go wrong. I think you gave a lot of great insights for engineers, for people that architect these systems on what they need to understand, what can be different in these environments. Thankfully, at least that's what I hear, is Kubernetes definitely makes it easier
Starting point is 00:42:24 to kind of expect a certain standard, yet there are still small nuances that might be different depending on the target OS, the target environment. Is there anything else in closing remarks that you want to make sure that the audience who is listening in that may contemplate right now, hey, our SaaS software would be great to run on-premise too, because we have these sales opportunities.
Starting point is 00:42:51 Anything else, any links to literature, anything we can pass on? Yeah, no, for sure. I think, you know, first thing is, like, I know we talk a lot about the scary, hard problems to solve, but like, it's easier than it's ever been. Kubernetes makes it super easy. And like, I definitely don't want to scare anybody away. I just want to like, you know, like, hey, it's not, there's no guarantee that just because your software works in Kubernetes on your laptop, that means you're going to ship it to an enterprise who's going to have success running it at scale, right? Like you have to like be a little bit thoughtful about that. And like,
Starting point is 00:43:22 you know, that's what, that's what we do. I think, you know, we're, we're also, you know, Andy you and I talked a little while ago on the Kublai's podcast about Captain and like the, this whole world of like, you know, building for SLIs, SLOs and moving towards like fully autonomous operations is like super cool. And I think that's still, you know, something we're starting to see a little bit in Kubernetes and the more autonomous operations
Starting point is 00:43:48 that you can package into your application, you know, relying on, you know, an Elasticsearch operator instead of like manually running Elasticsearch. And now that operator can run that just provides more reliability and more like, you know, baked in, you know, operational expertise into those. And so it's just going to work better and better and better. You know, I think, you know, it's, it's, it's easier than
Starting point is 00:44:09 ever. And like, yeah, there are links, you know, like, you know, obviously replicated as a platform, you know, this is what we do is what our team does every day. We've been doing it for five and a half years, but like there's open source projects, you know, we use, you know, we've created troubleshoot dot SH, which is like the, you know the build your own support bundle, run it. You don't have to talk to us. It's just an open source project. Ad analyzers do all this. There's shared Slack channels inside the Kubernetes Slack for the Troubleshoot project.
Starting point is 00:44:38 And we're happy to talk. And honestly, I'm always happy just to chat about shipping software onprem regardless of like how you're doing it to understand the whole ecosystem better very cool yeah thank you so much for for being on the show today brian is there anything else from your end that you wanted to add uh no nothing on my end it was all very very interesting and it's if you think back to the earlier conversations we had about Kubernetes Andy a couple years ago, it's crazy how far it's come and how much it's dominated. Just mind-blowing. So this is a very interesting conversation just to be listening to because it's so far from where we were. Well, thanks everyone for listening.
Starting point is 00:45:21 Mark, thanks again for joining us. It's been really great having you on. If anybody has any questions or comments, you can reach us at pure underscore DT at Twitter, or you can send an email at pureperformance.dynatrace.com. Mark, where can people go to find out more about you and everything related to the wonderful goodness of Replicated? Yeah, so I mean, we're at replicated,
Starting point is 00:45:45 you know, dot com, um, or we have, uh, our COTS product, which is, you know, we didn't talk about that by name,
Starting point is 00:45:51 but this is Kubernetes off the shelf and it's at COTS.io, K-O-T-S.io. Awesome. Okay. We'll put some of those links out in the, uh, show notes as well. Thank you everyone for listening and,
Starting point is 00:46:02 uh, we'll see you next episode. Bye-bye episode bye-bye
