Software at Scale - Software at Scale 55 - Troubleshooting and Operating K8s with Ben Ofiri

Episode Date: March 15, 2023

Ben Ofiri is the CEO and Co-Founder of Komodor, a Kubernetes troubleshooting platform.

Apple Podcasts | Spotify | Google Podcasts

We had an episode with the other founder of Komodor, Itiel, in 2021..., and I thought it would be fun to revisit the topic.

Highlights (ChatGPT Generated)

[0:00] Introduction to the Software At Scale podcast and the guest speaker, Ben Ofiri, CEO and co-founder of Komodor.
- Discussion of why Ben decided to work on a Kubernetes platform and the potential impact of Kubernetes becoming the standard for managing microservices.
- Reasons why companies are interested in adopting Kubernetes, including the ability to scale quickly and cost-effectively, and the enterprise-ready features it offers.
- The different ways companies migrate to Kubernetes: either starting from a small team and gradually increasing usage, or a strategic decision from the top down.
- The flexibility of Kubernetes is its strength, but it also comes with complexity that can lead to increased time spent on alerts and managing incidents.
- The learning curve for developers to efficiently troubleshoot and operate Kubernetes can be steep and is a concern for many organizations.

[8:17] Tools for Managing Kubernetes.
- The challenges that arise when trying to operate and manage Kubernetes.
- DevOps and SRE teams become the bottleneck due to their expertise in managing Kubernetes, leading to frustration for other teams.
- A report by a cloud native observability organization found that one out of five developers felt frustrated enough to want to quit their job due to friction between different teams.
- Ben's idea for Komodor was to take the knowledge and expertise of the DevOps and SRE teams and democratize it to the entire organization.
- The platform simplifies the operation, management, and troubleshooting aspects of Kubernetes for every engineer in the company, from junior developers to the head of engineering.
- One of the most frustrating issues for customers is identifying which teams should care about which issues in Kubernetes, which Komodor helps solve with automated checks and reports that indicate whether the problem is an infrastructure or application issue, among other things.
- Komodor provides suggestions for actions to take but leaves the decision-making and responsibility for taking the action to the users.
- The platform allows users to track how many times they take an action and how useful it is, allowing for optimization over time.

[12:03] The Challenge of Balancing Standardization and Flexibility.
- Kubernetes provides a lot of flexibility, but this can lead to fragmented infrastructure and inconsistent usage patterns.
- Komodor aims to strike a balance between standardization and flexibility, allowing best practices and guidelines to be established while still allowing for customization and unique needs.

[16:14] Using Data to Improve Kubernetes Management.
- The platform tracks user actions and the effectiveness of those actions to make suggestions and fine-tune recommendations over time.
- The goal is to build a machine that knows what actions to take for almost all scenarios in Kubernetes, providing maximum benefit to customers.

[20:40] Why Kubernetes Doesn't Include All Management Functionality.
- Kubernetes is an open-source project with many different directions it can go in terms of adding functionality.
- Reliability, observability, and operational functionality are typically provided by vendors or cloud providers and not organically by the Kubernetes community.
- Different players in the ecosystem contribute different pieces to create a comprehensive experience for the end user.

[25:05] Keeping Up with Kubernetes Development and Adoption.
- The team is data-driven and closely tracks user feedback and needs, as well as new developments and changes in the ecosystem.
- The use and adoption of custom resources is a constantly evolving and rapidly changing area, requiring quick research and translation into product specs.
- The company hires deeply technical people, including those with backgrounds in DevOps and SRE, to ensure a deep understanding of the complex problem they are trying to solve.

[32:12] The Effects of the Economy Pivot on Komodor.
- Companies must be more cost-efficient, leading to increased interest in Kubernetes and tools like Komodor.
- The pandemic has also highlighted the need for remote work and cloud-based infrastructure, further fueling demand.
- Komodor has seen growth as a result of these factors and believes it is well-positioned for continued success.

[36:17] The Future of Kubernetes and Komodor.
- Kubernetes will continue to evolve and be adopted more widely by organizations of all sizes and industries.
- The team is excited about the potential of rule engines and other tools to improve management and automation within Kubernetes.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

Transcript
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Hey, welcome to another episode of the Software at Scale podcast. Joining me here today is Ben Ofiri, the CEO and co-founder of Komodor, a Kubernetes platform in many ways. Ben was previously a software engineer and product manager at Google. Ben, welcome to the show. Hey, thank you so much. Glad to be here. Yeah, so to start off with, I want to understand something from your background and Komodor more specifically.
Starting point is 00:00:42 Why work on a Kubernetes platform? What gets you excited about Kubernetes so much that you decide, you know what, this job at Google is interesting, but I'm going to work on a company that makes managing Kubernetes easier. Why do that? Yeah, well, it takes me back like three years ago. It looks right now so far away, but three years ago, obviously, Kubernetes was already very hyped. Obviously, a lot of companies talked about Kubernetes. Google, obviously, internally using solely Kubernetes, we call it internally Borg, but basically, it's the internal name of
Starting point is 00:01:14 what today is known as Kubernetes. But it was very in the beginning of the journey of real mass adoption from enterprises to use Kubernetes. When Etienne, my partner, and I talked about what we see currently that is happening and how we see the future, we had this feeling that microservices is going to be the fact of the standard architecture for running application and managing applications. And we saw the overhead and the complexity of managing by yourself, the deployment, you know, the resource management, the options to scale, et cetera, for environments that have hundreds or thousands of microservices. And for us, we kind of intuitively knew that Kubernetes have a very high odds of eventually becoming the standard or de facto, the obvious choice to manage microservices. And for us, it meant basically that
Starting point is 00:02:06 it can become something so significant that eventually it can become known as, you know, the operating system for cloud applications. And we basically took a bet saying, you know, if it's going to happen, if indeed this is the case, if indeed Kubernetes is going to become what, you know, Linux did for operating system for servers, but for applications in the cloud, this can be interesting, right? Like the challenges and therefore the opportunities that will be derived from that can be huge. And both of us wanted to be working on something that will create real impact. And we figured out that we didn't know exactly what to do there, but we figured out that there are going to be a lot of opportunities and challenges if indeed Kubernetes will become the obvious choice and the standard. And, you know, three years later, we just read a report a couple of days back that
Starting point is 00:02:54 more than 80% of the companies, not only cloud native companies, but overall companies in the USA are planning to adopt Kubernetes in the next three years. So for us, you know, at least in retrospect, the bet now looks much more reasonable than it was a couple of years back. No, I certainly would say in practice that's worked out. The question that comes to mind from even that report is why? Like, why are people interested in adopting Kubernetes, right? Like, or why have a container orchestration platform in the first place?
Starting point is 00:03:25 Like, what is your perspective? Yeah, so I will say that like everything else in life, there is the human factor, the psychological factors, and the technical aspects. In terms of the human factor, there is no doubt that today Kubernetes is hyped and today Kubernetes is, let's call it the obvious choice or the standard choice when it comes to where should I and how should I and how
Starting point is 00:03:45 should I deploy and run my application. So once companies like Google, Netflix, Amazon, etc. are using it and are sponsoring it in a way, it does make every other company think twice if they're doing something wrong, if they're not using the best in class tools that those companies are using. So there is, I think, the human factor that made at least some waves and caused some trends to happen a couple of years back. Obviously, when it comes to the technical aspects, Kubernetes does come with a huge promise, right? Basically, you know, it takes a lot of the complexity of managing your own deployment,
Starting point is 00:04:22 secret, you know, scale up, scale down, and resource management around Dockerfiles. And, you know, basically it really allows companies to scale very, very, very fast. Now for companies that have hyper growth, which was all of the companies up until a couple of months back, right? That the growth was the first thing they focused on most of the companies.
Starting point is 00:04:42 Kubernetes is allowing those companies with not a lot of overhead to be able to grow very fast and to scale very fast without the need either to pay to specialists, etc. Or to change your hardware and software, you know, every other week. So being able to scale in a cost effective way, I think it's something that is very appealing for most companies. And I think Kubernetes comes with this promise and in most cases, even able to fulfill this promise, right? And we also saw that in the last couple of years, it became very enterprise-ready with security features
Starting point is 00:05:17 and compliance features baked in, et cetera, which made it also appealing for the Fortune 100 companies. And today I can say that we're also working with huge banks and airline companies that five years ago, I think it was a dream for them to adopt Kubernetes. But today, all of them are either migrating to Kubernetes or just finalized the migration process. So it is quite amazing to see how common and how adopted Kubernetes became in the last
Starting point is 00:05:43 three to five years. Yeah, that makes a lot of sense to me. And what are you seeing? Are you seeing customers using Amazon EKS, like the managed Kubernetes platforms that cloud providers are providing? Are they rolling their own? What have you seen? So it is a mixture.
Starting point is 00:05:59 I would say that definitely, at least from our customers, the segment that we're approaching, the most common use case is a managed service from Amazon to obviously Microsoft and Google. We do have customers that are running Kubernetes on-prem and we also have customers that have hybrid cloud or sometimes even hybrid between on-prem and cloud. From all these different reasons, disaster recovery, the disaster recovery and, you know, supporting different regions and compliance and regulations. So we actually see all of the above, but I think at least for us,
Starting point is 00:06:32 the most common one is managed services from the cloud providers. And last question on this thread, when do you see someone, you know, break out from, oh, I used to run on like bare metal or like EC2 or like directly on like a container orchestration platform to i should be using kubernetes like when do you see people like deciding that that migration is worth it for them i'm wondering if there's like you know
Starting point is 00:06:57 a tipping point and you know just inflexibility or some kind of issue yeah it's a good point i think you know from what we see, at least, there are like two different paths for companies to migrate to Kubernetes. The first path is more of like, you know, a commando way, like, you know, a small unit, small team started to run their own internal tools or their new project with Kubernetes.
Starting point is 00:07:20 It doesn't affect the legacy. It doesn't affect the other teams. It doesn't affect the other services. But it's just like, you know, small thing at the beginning that runs on Kubernetes and eventually, you know, it gradually increases and more and more, you know, internal tools and more and, hey, let's run this on Kubernetes as well because we already have some good experience with that. And it gradually increases to some tipping point where the organization makes a strategic decision to migrate all of the legacy now to Kubernetes or to at least to decide that from now on everything new is being developed, we'll run on Kubernetes, right? So this is like one motion that we see that that it basically starts from like a very small,
Starting point is 00:08:06 scrappy team and evolves. The second motion that we see is that a VP of infrastructure, a VP of platform, or sometimes even the C-level in some organization, decided from cost-effective reasons, decide from security reasons, decide from budget reasons sometimes, that they need to choose other platform. And then when they examine the alternatives,
Starting point is 00:08:29 Kubernetes is usually the most common one or the most standard run to use right now. If you want to migrate from your old, you know, VMware or even, you know, on-prem machines and servers to the cloud. So it can either come like bottom up or top down. But I think that in both motions today, Kubernetes is like, let's say, not only the top choice, but also I think it's also like started to be the top choice by far from the other alternatives out there.
Starting point is 00:08:57 So this is at least from what we see. I assume there are tons of different ways to evolve to Kubernetes, but from our experience is what we see. No, that makes sense. And like where I'm seeing the, really the benefit of this platform is that it's just so flexible,
Starting point is 00:09:12 but it works at scale. It feels like that's how you can summarize it. Like whatever company you are, you can use a managed service, you can use on-prem, you have all of these different needs, you want to put sidecars, this, that. It really does allow you to do anything that you would need
Starting point is 00:09:28 from an orchestration platform. Is that fair to say? Well, yeah, it is very flexible. This is one of the things that I think make it so common. But the flip side of being flexible is being complex, right? So I think it does come with a price, and we can talk about the price later on, but I definitely think that it's flexibility. And as you mentioned, you can run it almost anywhere, right? Like in the edge, in the cloud, on-prem, et cetera.
Starting point is 00:09:55 It's what made it so, so common and so widespread across different organizations. Yeah. I'd love to talk about the price right now. What is the price of the flexibility and what have you seen? So, look, you know, we're a huge fan of Kubernetes, obviously, right? But I think, you know, as we talked, most companies at the end of the day adopting microservices and Kubernetes, you know, from the promise of eventually running faster, right? Deploying more, creating even bigger competitive advantage, being able to move faster, creating less dependency between the teams, right? Deploying more, creating even bigger competitive advantage, being able to move faster, creating less dependency between the teams, right? And in the end of the day, once companies are finishing the migration to Kubernetes and really try to utilize Kubernetes to get all of those traits, in some cases, they're hitting a wall. In some cases, it's not only that
Starting point is 00:10:43 they're not getting what they wanted, they're even degrad wall. In some cases, it's not only that they're not getting what they wanted, they're even degrading their original situation. And we saw an interesting report from one of the incident management tools lately that the time spent on alerts and managing incidents and, you know, the day-to-day operations around your infrastructure, et cetera, in the last three years actually tripled since organization, you know, moved to this microservices environments and Kubernetes architecture. Now, you know, if you're spending three times more on those things, you probably don't move faster, probably not being more cost effective, right?
Starting point is 00:11:20 So sometimes an organization, you know, finish the migration and really try, you know, to now enjoy the fruits of that. It doesn't really work. And I think the reason for that is Kubernetes is a very complex system, right? It's very distributed. It's very scattered. It is composed of thousands of different resources and events that fortunately or unfortunately changed constantly. And they have those non-trivial relations and connections, right? And being able to understand how all of those things work
Starting point is 00:11:51 is mandatory when you're the person that needs to manage operating troubleshoot Kubernetes, right? So, you know, if you're the person, if you're the team that basically needs not only to deploy code, but actually to own your application end-to-end, and this is what is expected for most organizations when it comes to their R&D, you are expecting your developers, you're expecting your team leads to be able to own it end-to-end. They need to have a very vast understanding and experience and know-how around how to do those things, right? And in most cases, they don't have this information.
Starting point is 00:12:26 They don't have this knowledge. They don't have those even tools to do it efficiently. And I think I saw a survey recently that the number one concern about Kubernetes for organization who just adopted Kubernetes is the steep learning curve they're facing. So, you know, we adopted Kubernetes or we migrated to Kubernetes, great.
Starting point is 00:12:46 We have 5% to 10% of our workforce that is familiar with Kubernetes, that knows how to do those complex things in Kubernetes. But we have 90% of our R&D who doesn't really know what Kubernetes is and definitely don't know how to do efficiently the things that, you know, require to do when you have an issue, when you have an incident, when something is not
Starting point is 00:13:08 working, right? And this dissonance between, okay, we migrated to Kubernetes to, okay, we're really utilizing Kubernetes, I think for a lot of organizations is very painful. And, you know, in Commodore, we're trying to help to mitigate this pain, obviously, right? But I think it is a very painful area for a lot of organizations. And if we're talking about the price, think about it that, you know, from the developer perspective, all of a sudden they have this new infrastructure they rely on. They didn't choose it know, had any say to that. And all of a sudden they're required to manage Kubernetes, to operate Kubernetes, to troubleshoot Kubernetes, to use different tools and new processes to do that. And sometimes no one really taught them how to do that. Right. No one really gave them the tools, the knowledge to do that, or even the time to get to know those new technologies and new ways to debug and new ways to understand what's going on, etc. So for them, it can be very frustrating.
Starting point is 00:14:07 But also think about the platform teams, the DevOps and the SREs. Since they are the ones that know how to do those things, immediately, in most cases, they become the bottleneck, right? They become the firefighters. They become the people who do all of the tickets. They become the people that everything is being escalated to. And those people that have tons of other initiatives to do are sometimes, you know, feel that, you know, they are the firefighters of the organizations. So we have this, you know, two different groups in your R&D that are currently, you know,
Starting point is 00:14:37 not being cost effective, right? Not doing their best job. And just a couple of days back, we saw a report by the cloud native observability organization that one out of five developers actually want to quit their job because of this frustration and because of those friction between the developers, the DevOps, the operations, the incident response, and so on. So we definitely see that there is a price, right? It's not something that is going smooth and easy for most organizations. Yeah, that certainly tracks, right? And then when you speak about a steep learning curve, when you're trying to adopt a tool like Kubernetes, like how shared is that learning curve across organizations? And what I'm trying to ask is,
Starting point is 00:15:22 do you have to learn different aspects of like Kubernetes and like, does it break in different ways at different organizations? Or is it mostly, once you understand how Kubernetes works at place A, you probably understand like 50 to 60% of how it works in place B. Like what kind of distinction do you see there?
Starting point is 00:15:40 Yeah, I think that there is like an 80-20 rule here. I think that 80% of the things around Kubernetes are similar from one organization to another. People are doing the same mistakes. People are following the same best practices. Obviously, you do need to blend it and to mix it with your own business logic. Obviously, we'll use different configuration
Starting point is 00:16:04 and different architecture and different deployment strategy, et cetera, et cetera. But I would say that if you have, you know, five to seven years of experience operating and managing Kubernetes, it won't be impossible for you, you know, to onboard in a new organization, understand what's going on there, and be able to operate things efficiently after a couple of weeks, definitely months. I will say that for a developer that, you know, never heard about Kubernetes and they used to always, you know, one huge monolith that is being pushed or released, you know, every quarter to production on top of like EC2 machines, for those people all of a sudden to be able to, you know, being efficient in operating and managing their application on top of Kubernetes, this can be very, very painful and very tough and sometimes very frustrating thing to do.
Starting point is 00:16:55 Yeah. And that's where, like, you can encode information about how Kubernetes works in a platform, right? Just thinking through what you're saying, a developer with five or seven years of Kubernetes experience, probably they're a hot commodity. Everyone wants to hire them. And just from all the surveys, it seems like that person will have job security for life. But you could also encode a lot of that information into a platform. And is that kind of what you envisioned with Commodore?
Starting point is 00:17:22 That why don't we platformize all of the ways that makes the system hard to operationalize? Exactly. Yeah. So we believe that, as you mentioned, the DevOps, the SREs, the people that have, you know, the right experience and expertise. Obviously, you can also try to improve them and, you know, to give them tools to make them more efficient. But to be honest, it will take them maybe from 90 or even 95 to, you know, 96 or 97, right? So, like, the impact you can create is not that significant. And also, product-wise, it will probably be very, very hard to make those people more efficient, right?
Starting point is 00:17:58 And to give them more insight or more automation to really, you know really significantly help them. Having said that, when we look on the developers or the data engineers or everyone that is using Kubernetes as a customer, right? They're building their application or pipelines on top of Kubernetes. Those people, we believe that we can take their efficiency and expertise in Kubernetes from a scale from zero to 100, let's say from 20 to 60 or 70,
Starting point is 00:18:26 by giving them all of the automation and all of the tools baked in in the platform that basically simplified significantly the operation and management and troubleshooting aspects for their applications. So our thesis was, when we started the company, what if we can take the knowledge that already exists, right, for the DevOps and the senior SREs, and we can take their knowledge and expertise, and we can basically democratize what they're doing to the entire organization. So every developer will be able to easily detect issues in Kubernetes, investigate them, and even remediate them independently and efficiently without the need to escalate to the DevOps or to the SRE, right? So basically, this is the entire reason why we founded Commodore.
Starting point is 00:19:13 And this is basically what we do today. So we're actually offering an end-to-end platform for every engineer in the company, from the most junior developer to the head of engineering to be able to do same things on top of Kubernetes, even though they don't have the same amount of expertise and knowledge around Kubernetes, but we mitigate it using our methods and platform to significantly simplify those things for them. Is there a one-wow-moment kind of workflow
Starting point is 00:19:43 that you see in 20% to 30% of demos or when people use the product for the first time? It's like this one thing that keeps annoying them about their Kubernetes workflows that your system just shows or automates. I'm wondering if there's something like that that you all have noticed. Yeah. So, obviously, it's hard to say about one thing because there are so many different things we built in the last couple of years.
Starting point is 00:20:07 But I think one of the most annoying issues that is very repetitive across different customers is that who should care about which issue in Kubernetes? I will give you like the simplest example. I have a pod that is doing restarts, right? Everybody knows that like something that happens on a daily basis, unfortunately. Now, we don't know if this is, you know, an infrastructure issue in the node, we don't know if it's an application issue in the
Starting point is 00:20:33 specific deployment that tries to run this pod, etc. So we need to start, you know, triaging the issue, right? You need to go one by one to different checks or different tools to try to understand if it's an application issue, if it's an infrastructure issue. And you do all of that just to understand if you or your team are the owners for that, or maybe it's another team, right? Like the DevOps or the SREs, if it's like a node issue, for example. So what we do in Commodore, it might sound very
Starting point is 00:21:00 simple, but it saves actually a lot of efforts and resources. Whenever there is a problem in the pod, we run a bunch of checks that are predefined to try to identify why this pod is restarting. Is it a problem that is an application issue, an infrastructure issue? If it's like a node issue, if it is a node issue, why it's a node issue, what happened to the node, right? And we do all of those checks for our users. And then basically we send them an automatic report saying, you have a restarting pod. The reason it's in a crash loop back off, for example. The reason for the crash loop back off is that you did a bad deploy a couple of minutes earlier. Click here to roll it back.
Starting point is 00:21:36 Or if it's a node issue, we'll say, hey, you have a restarting pod. The pod that is, you know, doing the restart, the node that is using is malfunctioning. All of the pods that are using these nodes are restarting around the same time. You should probably escalate to your infrastructure team or to your DevOps team, right? So in one click, they can see not only that they have a pod doing the restart,
Starting point is 00:21:58 but also, is it something that I should care about because it's an application issue, or is this something my DevOps or my SRE should care about because this's an application issue? Or is this something my DevOps or my SRE should care about because this is a node issue, right? And I think when they see it in the same timeline, right, and they get like a single report saying, you know, this is the symptom and this is basically the root cause, then it clicks like, okay, now I understand like how I can significantly improve my day-to-day, how I can significantly reduce the number of tickets I open, how I can significantly reduce the number of escalations
Starting point is 00:22:31 and time being spent on similar issues, right? So for them, it's like seeing it in a single timeline with our report on top of it that can actually like make them understand how much time they can save or how many resources they can save using a tool like us. That would be so helpful for my life as well. We have probably the flakiest set of alerts are just the container restart ones.
Starting point is 00:22:54 And right now we haven't even tuned them well enough to not fire when we do deployments. So we have an alert that says, oh, pod is restarting too frequently in the last 30 minutes. But what ends up happening is you increase the number of pods in a certain deployment because that service needs more capacity and the alerts are not tuned. If we had something that could tell us, this is normal, this happens every deployment, we should just fix it.
Starting point is 00:23:18 Or, oh, it looks like it's an actual application issue. That's when it's in a crash loop. It would make my life easier. Yeah. So I think we spent like, I don't know, overall, maybe 10 to 15 engineering, like years solving only this. Like,
Starting point is 00:23:33 you know, one of the challenges in Kubernetes is that, you know, basically it's, you know, it's a lot of resources and each resource has a status,
Starting point is 00:23:41 right? And one of the challenges is to create a state out of it. So in Commodore, one of the cool things is that you can log into Commodore. Commodore will basically not only tell you what the current statuses are, but basically what's healthy and what's not healthy in each one of your services or resources and how it evolved or changed over time, right? So we actually create a state out of all of the different statuses and events that we read from the QAPI itself. So basically, you know, if you just look on a pod and its status is not ready, right? It's something else, it's pending, it's whatever. Is it an issue or not?
Starting point is 00:24:17 You don't know, right? You need to check if it's in the middle of a deploy. You need to check if maybe, you know, maybe we're using spot instances and something is up and down. You need to check if maybe we're using spot instances and something is up and down. You need to check maybe after a couple of seconds was it resolved? Maybe it was not resolved. Some things are transient. Some things are not transient. There are so many different nuances for each one of the different resources
Starting point is 00:24:37 in typing Kubernetes. So if you just give it to someone who don't really know what Kubernetes is, his chances to being able to resolve an issue are very, very low without all of this understanding, right? Having said that, if you send it to the most senior DevOps in the team, he will tell you, oh, yeah, this is how Kubernetes act. You know, this alert, as you mentioned, is just because, you know, the deployment is being running right now. Let's wait five minutes and it will automatically be resolved, right? But you definitely don't want to escalate every time there is an issue for this very, very expensive and, you know, busy DevOps, right?
Starting point is 00:25:14 So this is exactly what we try. We try only to monitor on the real issues. or we help you to investigate also why it happened so we can actually fix it or remediate it via Commodore, hopefully in one to two clicks, rather than, you know, hours of investigation sometimes. How do you automate the fix, right? And is that scary for engineers? Like one thing that I would be very worried about is some platform acting on my orchestration platform.
Starting point is 00:25:41 Like how do you kind of build that trust? First of all all what you described sounds awful i will never let the vendor to automate anything on my production system right so yeah we're being very very very careful around that what we're doing we're starting with suggestions right like hey we notice that it can be caused by x y and z here is how we can solve x here is how we can solve y here is how we can solve y here is how we can solve z right so we started with suggestions once we saw our suggestions you know they make sense like users are try to follow them we started to add actually actions from our platform, everything that we added is being guarded behind our back.
Starting point is 00:26:28 So actually like the platform engineers or the dev engineers can actually choose which teams can take which actions on which resources, right? So for example, you're probably okay with the developers that own some service to be able to restart the pod, right? Or even to do a pod right or even
Starting point is 00:26:45 to do a revert on staging okay but you're probably definitely not okay with you know letting the developers to drain a node in production right so like you can actually choose which teams can do what and using commodore we are basically suggesting them what to do, giving them the functionality to do that, but they need to take the ownership and responsibility to actually make the action. We don't believe that the right move here is go to the auto-healing motion and try to automate the things for users. We know that it is hard enough for most organizations to trust themselves to make actions but we try to take all of the heavy lifting part from it we're trying to be this single place that showing you the symptom allowing you to explore and to find the root cause suggesting you what to do with it
Starting point is 00:27:39 allowing you to take the action to solve it. Everything is audited. Everything is under Auerbach and SSO. But eventually you're the person that needs to decide this is the right thing to do. Click on the button and hopefully to resolve the issue. And you can also probably click and track how many times are people actually clicking the button. And you can even see how useful your remediations are and optimize that metric over time.
Starting point is 00:28:05 That is the ideal place, right? You ping people when you know you should be pinging people. You let them take the action and you can track if they're taking that action over time. Exactly. The cool thing around it is that, as I mentioned, we never just show actions arbitrary. So every time a user takes action in Commodore is because we gave them some context that they decided that the right thing to do
Starting point is 00:28:29 is take this action, right? So we can actually see when we were right and we can actually fine tune which suggestions we're providing our users in which kind of scenario. And sometimes, you know, we understand that what we did doesn't make sense. And sometimes we see that
Starting point is 00:28:44 what we did totally makes sense. And then obviously, you know, we can that what we did doesn't make sense. And sometimes we see that what we did totally makes sense. And then obviously, you know, we can improve and iterate on top of it. So in the end of the day, we do want to build this machine, right? That for almost all of the scenarios in Kubernetes already knows what are the actions or steps you should take to resolve it, right? And if we can get to this point, our customers will even benefit more from our capabilities. So we're definitely building our brain in a way that is data-driven, and we do hope to improve our suggestions
Starting point is 00:29:15 and our methods around it over time. Yeah. Maybe a dumb question, but why doesn't Kubernetes have all of this stuff inbuilt? I think it's a fair question. I think some you know, some of the reason is, you know, it's an open source project that, you know, basically it can go to so many different directions, you know, from adding more resources to adding more abstraction on top of it, to adding more security functionality. And, you know, there are so many different things to do. And I think specifically the layer that is around reliability, observability, you know,
Starting point is 00:29:53 operational is usually being done historically by either vendors or, you know, the cloud providers themselves. So we don't see it organically coming from the core community of Kubernetes. Definitely, you know, might be a good addition. I do see different releases from the cloud providers you know, tries to take their own perspective and their own thesis on what exactly is missing and how to solve it. And I think this is one of the things that make this project so, you know, so fascinating, right? Like how different companies, different players, different, you know, layers in the ecosystem,
Starting point is 00:30:39 each one contributes something else to create this, you know, eventually amazing, amazing experience for the end user. Yeah, that certainly tracks, right? It's open source. And then, but the fact that you have all of these like open layers, lets multiple people build and make improvements. How do you keep track of, you know, adoption of new features and making sure that Commodore
Starting point is 00:31:03 is satisfying the needs of customers using, like, new features that Kubernetes has? Like, is it kind of like a, you see what Kubernetes is doing, you see what problems there are coming. Based on that, you add extra stuff to Commodore. Like, how does that stuff work? How quickly does Kubernetes actually ship out new things?
Starting point is 00:31:20 And, like, how much of a race is it to, like, just keep up with what's happening in the ecosystem yeah that's a good question first of all i am ex-googler and my partner is from ebay so like we are very data driven so we're keeping a lot of track on what our users are doing what are they doing in the platform you know which features they're finding more useful we're having tons of conversations with our users we try to be very close to our users. You know, in that way, we're very like product-led company and not very like sales-led company.
Starting point is 00:31:54 So we're having, you know, interviews with end users, I think, on a daily basis. So we do try to keep track and to hear their pains, their insights, you know, their feedback, not only on our tool, but also on different tools in the ecosystem, say CD tools, monitoring tools, and definitely on Kubernetes itself. Now, I will say that in Kubernetes, obviously, it is changing quite a lot and adding new capabilities, et cetera.
Starting point is 00:32:19 But what I think is very interesting is the usage and the adoption of custom resources, of crds and this is i think the equivalent of like a wild wild west right because everyone can create a new crd it can all of a sudden become super popular and you know it can change and you can add more functionality there so i think that we see a lot of new CRDs all of a sudden becoming from zero popularity to becoming very, very common in a matter of weeks or months. And then obviously us as a vendor, like as a solution that tries to simplify Kubernetes operation, we need very quickly to understand, okay, what does this CRD does? You know, when it breaks, what do you need to do? What do you need to check?
Starting point is 00:33:06 How does it interact with your other Kubernetes resources? And then basically we have some kind of a research department that does all of those things very, very quickly, translate it into product spec, and then we ship it into the product. So those things we do, I would say, even like on a weekly basis sometimes, but we are trying to respond very fast
Starting point is 00:33:25 to changes in the ecosystem. And on that note of the research department, I would imagine the hiring for your company has to be of just deeply technical people who know what they're doing because this is like a deeply technical product, right? Like you have to know you're aiming for the right things. You have to think about how like DevOps engineers
Starting point is 00:33:42 and SREs are thinking. Is that the primary population of where you can hire from definitely so we have a bunch of people coming from different backgrounds all of our product managers for example we're platform engineers system engineers or sometimes even sres right so, you know, it's super important to hire people that have firsthand experience with the problem we're trying to solve. It is a very technical problem that does require a lot of understanding. If you want to simplify something very complex, you do need to be an expert in this specific complexity, right?
Starting point is 00:34:20 So it is hard for us, you know, to find the right people. And I would say that the bar, you know, walking for Commodore is quite high for that reason. But I think this is what makes sometimes walking here a good opportunity because all of the people here are obsessed on Kubernetes, on cloud native tools, on solving complex issues, on simplifying how to solve complex issues. And, you know, when this is something that fascinates you, when this is something that you find interesting, you can create awesome things together, right? So, you know, even our designers are people that are, you know, fascinated by solving complex issues and simplifying complex tasks. And, you know, working walking together software engineers, designers, product managers that find this, you know, these kinds of challenges interesting, the end result can sometimes be, you know, groundbreaking and, you know, non-orthodox in a lot of cases.
Starting point is 00:35:16 So it is obviously a challenge to find those people, but then, you know, when you do find them, I think, you know, it can create an amazing environment to work together on solving those issues. Yeah. And perhaps a similar note, not exactly the same. How has the economy pivot really affected your team? You can say in whatever detail you'd like to see, like, what are you seeing from, you know, prospect of customers?
Starting point is 00:35:41 Is the adoption of tools like Kubernetes like slowing down? Is the adoption of like vendors kubernetes like slowing down is the adoption of like vendors on top of kubernetes slowing down like what are you seeing generally yeah so i do see two different trends that are opposite to each other this is only obviously our own experience right i'm not claiming to say that no it's the entire landscape but obviously the biggest you know mindset right now for most companies is to do more with less right like a year ago it was growth at any cost right it's just grow grow grow and today what we're hearing is you know we need to do more but with less right and this is a big, big change, right? From going to, from growth at any cost to, you know, to do more with less.
Starting point is 00:36:29 And obviously it translates to sometimes cutting spends, cutting budget, letting people go, right? Very talented people sometimes. And it definitely also affects us, right? I can tell you that we had POCs that the entire team used the tool. All of a sudden, you know, got the letter that they're being laid off from the company. So obviously, it also kills the POC, right? We reached with some companies to commercial agreement. And, you know, for some reason, now the C-level needed to approve it. And, you know, they said just,
Starting point is 00:37:02 how? No, like, we're not going to approve new tools in this calendar year at all. It doesn't matter which tool it is. We're only going to use what we already have. So obviously in some cases we saw the, you know, the effect of what's happening in the market. I would say the second trend we're seeing is that because of companies trying to do more with less and because of the reason that, you know, those DevOps and SREs are so expensive and you have so little of them, you definitely need to guard their time
Starting point is 00:37:34 and to make sure that you're not wasting their time on, you know, escalations, tickets and firefighting. And you do need to do more with less. So we try to understand how we can get more from what you have, right? More from my developers, more from my DevOps, more from my SREs and knock. And then a tool like Commodore that basically automates a lot of the, you know, manual operations, management, and troubleshooting aspects that are, you know are happening on a daily basis for most organizations, we can actually save a significant amount of time,
Starting point is 00:38:10 not only from the developers, but also from the DevOps and the SREs. And if indeed you can prove that you can save X amount of times, X amount of hours every week by automating those tasks, for those organizations, this is actually a good story for them to buy a tool like Commodore, right? So we can actually win, and we saw it in the last couple of months in some POCs that we thought, you know, they're going to decide not because of what's happening in the market, but actually they told us that the management is convinced that with a tool like Commodore, they can actually reduce labor hours, they can actually
Starting point is 00:38:45 reduce the toil, and actually can focus on the things that actually can move the needle for the business and not focus on Kubernetes troubleshooting and Kubernetes operations. So I think overall, we see that for tools like us that can actually save time and automate, you know, IT tasks, companies still have, you know, let's say, big appetite to try to purchase, but definitely these are not easy times, right? So overall, as I mentioned, like I think they opposite to each other,
Starting point is 00:39:17 those two trends. So far we are, you know, we as a company are doing okay, but we're definitely not being cocky or anything. And we will definitely try to understand where the market is going in 23, which kind of trends are going to happen. And as a company, we need to make sure that we can thrive
Starting point is 00:39:36 and succeed even in tough times. And hopefully things will, in a couple of months, will be stabilized and improved, but we're definitely ready for each option. Yeah. The idea of those two trends that are in opposite directions makes a lot of sense to me. Finally, as a set of, you know, just as a technologist, right, what gets you excited
Starting point is 00:39:58 about the next year, right? Are there, you know, some platform improvements that are coming to Commodore? Are there some capabilities that the Kubernetes API is going to provide that's going to expand what comodore can do is there anything like that that gets you excited from like a technology perspective wow so many i would say that i think what i'm mostly excited is our plans to expand beyond only troubleshooting. You know, when we started to talk about troubleshooting, today we're allowing ourselves to say that we, you know, we're also helping to manage, to operate, and to troubleshoot the Kubernetes clusters.
Starting point is 00:40:37 What we want to do by the end of the year is to be very proactive in allowing the end users, the developers and the DevOps, basically not only to manage and to operate, but also to make sure they comply with best practices, to make sure that they're running the most cost-effective way, to make sure that they're deploying in a way that is healthy in the first place to their clusters so they will have less issues. So if we started from a bit of a reactive place of, okay, you have an issue, how to solve it, what we want to do in the next couple of months is to start being more proactive and actually suggest and enforce different best practices, et cetera. So your system will be more
Starting point is 00:41:19 healthy and more reliable and more cost-effective in the first place. So we have tons of new capabilities and features that we plan to roll out in the first place. So we have tons of new capabilities and features that we plan to roll out in the next couple of weeks and months to start and try it out. But we already got a very, very positive feedback from our beta customers that, you know, using our suggestions, best practices, and enforcement, they actually managed to reduce number of escalations,
Starting point is 00:41:43 reduce number of issues, reduce spend using Commodore. So this is definitely something we're all super excited about. And it will be, I think, a pretty significant change and expansion for us as a company. But, you know, we can talk in a year and I can tell you how it went. Some kind of like service guardrails or like that's the kind of stuff you're talking about. Like every service should make sure they have like at least X nodes running
Starting point is 00:42:10 for like availability. Is that kind of an example? There are some things that are like can be done in the static level, but by reading all of the configurations in YAML and see, you know, if they're complying with best practices. But think about like all of the dynamic stuff, right?
Starting point is 00:42:24 Can you identify idle resources, right? Like, you know, you deploy the PVC, it's deployed correctly, but no one is using it, right? It's just out there. Same goes for like, you know, a config map, right? Like you just have a config that no one reads and no one updated for a month, right? Can you just kill it, right? Is it a good usage of your resources? Is it a good usage of your time? So there are so many things from the static point of view, but also from the dynamic point of view, that if you can combine them both and better understand what is actually deployed
Starting point is 00:42:53 and what is actually being used and if it's being used correctly or not, and then you can proactively suggest what to do with it, either to change some configuration to kill this resource or to run it in a different way, right? So we're taking it from like different perspective and different approaches. And yeah, eventually it can be, you know, symbolized as a scorecard and things like that. But I think, you know, the rule engine behind the scene is something that get us very excited lately. Especially like if it's very easy to add new rules and see how that rule is going to affect or
Starting point is 00:43:27 how it operates on your organization, how easy it is to burn that rule down in terms of doing the migration to fix what that rule says. That is exciting. Well, thank you so much for being a guest, Ben. This was a lot of fun. And yes, I hope to have you or someone from Commodore again in a year to see how all of this goes. I can tell you that if everything goes great, it will be me.
Starting point is 00:43:48 And if not, I will send my partner. Yeah, that'd be great. I'd like talking to both of you. Awesome. Bye.
