Software at Scale - Software at Scale 55 - Troubleshooting and Operating K8s with Ben Ofiri
Episode Date: March 15, 2023

Ben Ofiri is the CEO and Co-Founder of Komodor, a Kubernetes troubleshooting platform. Apple Podcasts | Spotify | Google Podcasts

We had an episode with the other founder of Komodor, Itiel, in 2021, and I thought it would be fun to revisit the topic.

Highlights (ChatGPT Generated)

[0:00] Introduction to the Software At Scale podcast and the guest speaker, Ben Ofiri, CEO and co-founder of Komodor.
- Discussion of why Ben decided to work on a Kubernetes platform and the potential impact of Kubernetes becoming the standard for managing microservices.
- Reasons why companies are interested in adopting Kubernetes, including the ability to scale quickly and cost-effectively, and the enterprise-ready features it offers.
- The different ways companies migrate to Kubernetes: either starting from a small team and gradually increasing usage, or via a strategic decision from the top down.
- The flexibility of Kubernetes is its strength, but it also comes with complexity that can lead to increased time spent on alerts and managing incidents.
- The learning curve for developers to efficiently troubleshoot and operate Kubernetes can be steep and is a concern for many organizations.

[8:17] Tools for Managing Kubernetes.
- The challenges that arise when trying to operate and manage Kubernetes.
- DevOps and SRE teams become the bottleneck due to their expertise in managing Kubernetes, leading to frustration for other teams.
- A report by a cloud native observability organization found that one out of five developers felt frustrated enough to want to quit their job due to friction between different teams.
- Ben's idea for Komodor was to take the knowledge and expertise of the DevOps and SRE teams and democratize it to the entire organization.
- The platform simplifies the operation, management, and troubleshooting aspects of Kubernetes for every engineer in the company, from junior developers to the head of engineering.
- One of the most frustrating issues for customers is identifying which teams should care about which issues in Kubernetes, which Komodor helps solve with automated checks and reports that indicate whether the problem is an infrastructure or application issue, among other things.
- Komodor provides suggestions for actions to take but leaves the decision-making and responsibility for taking the action to the users.
- The platform allows users to track how many times they take an action and how useful it is, allowing for optimization over time.

[12:03] The Challenge of Balancing Standardization and Flexibility.
- Kubernetes provides a lot of flexibility, but this can lead to fragmented infrastructure and inconsistent usage patterns.
- Komodor aims to strike a balance between standardization and flexibility, allowing for best practices and guidelines to be established while still allowing for customization and unique needs.

[16:14] Using Data to Improve Kubernetes Management.
- The platform tracks user actions and the effectiveness of those actions to make suggestions and fine-tune recommendations over time.
- The goal is to build a machine that knows what actions to take for almost all scenarios in Kubernetes, providing maximum benefit to customers.

[20:40] Why Kubernetes Doesn't Include All Management Functionality.
- Kubernetes is an open-source project with many different directions it can go in terms of adding functionality.
- Reliability, observability, and operational functionality are typically provided by vendors or cloud providers rather than organically by the Kubernetes community.
- Different players in the ecosystem contribute different pieces to create a comprehensive experience for the end user.

[25:05] Keeping Up with Kubernetes Development and Adoption.
- The team is data-driven and closely tracks user feedback and needs, as well as new developments and changes in the ecosystem.
- The use and adoption of custom resources is a constantly evolving and rapidly changing area, requiring quick research and translation into product specs.
- The company hires deeply technical people, including those with backgrounds in DevOps and SRE, to ensure a deep understanding of the complex problem they are trying to solve.

[32:12] The Effects of the Economy on Komodor.
- Companies must be more cost-efficient, leading to increased interest in Kubernetes and tools like Komodor.
- The pandemic has also highlighted the need for remote work and cloud-based infrastructure, further fueling demand.
- Komodor has seen growth as a result of these factors and believes it is well-positioned for continued success.

[36:17] The Future of Kubernetes and Komodor.
- Kubernetes will continue to evolve and be adopted more widely by organizations of all sizes and industries.
- The team is excited about the potential of rule engines and other tools to improve management and automation within Kubernetes.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey, welcome to another episode of the Software at Scale podcast.
Joining me here today is Ben Ofiri, the CEO and co-founder of Komodor, a Kubernetes platform in many ways.
Ben was previously a software engineer and product manager at Google.
Ben, welcome to the show.
Hey, thank you so much. Glad to be here.
Yeah, so to start off with, I want to understand something from your background and Komodor more specifically.
Why work on a Kubernetes platform?
What gets you excited about Kubernetes so much that you decide, you know what, this job at Google is interesting,
but I'm going to work on a company that makes managing Kubernetes easier.
Why do that?
Yeah, well, it takes me back like three years ago.
It looks right now so far away, but three years ago, obviously, Kubernetes was already very hyped.
Obviously, a lot of companies talked about Kubernetes. Google, obviously, internally
using solely Kubernetes, we call it internally Borg, but basically, it's the internal name of
what today is known as Kubernetes. But it was very in the beginning of the journey of real
mass adoption from enterprises to use Kubernetes. When Itiel, my partner, and I talked about what we saw happening
and how we saw the future, we had this feeling that microservices was going to be the de facto
standard architecture for running and managing applications.
And we saw the overhead and the complexity of managing by yourself, the deployment, you know, the resource management,
the options to scale, et cetera, for environments that have hundreds or thousands of microservices.
And for us, we kind of intuitively knew that Kubernetes had very high odds of eventually
becoming the standard, the de facto, obvious choice to manage microservices. And for us, it meant basically that
it can become something so significant that eventually it can become known as, you know,
the operating system for cloud applications. And we basically took a bet saying, you know,
if it's going to happen, if indeed this is the case, if indeed Kubernetes is going to become
what, you know, Linux did for operating system for servers, but for applications in the cloud, this can be interesting, right?
Like the challenges and therefore the opportunities that will be derived from that can be huge.
And both of us wanted to be working on something that will create real impact.
And we figured out that we didn't know exactly what to do there, but we figured out that there are going to be a lot of opportunities and challenges if indeed Kubernetes will become the obvious choice
and the standard. And, you know, three years later, we just read a report a couple of days back that
more than 80% of the companies, not only cloud native companies, but overall companies in the
USA are planning to adopt Kubernetes in the next three years. So for us, you know, at least in
retrospect,
the bet now looks much more reasonable than it was a couple of years back.
No, I certainly would say in practice that's worked out.
The question that comes to mind from even that report is why?
Like, why are people interested in adopting Kubernetes, right?
Like, or why have a container orchestration platform in the first place?
Like, what is your perspective?
Yeah, so I will say that like everything else in life,
there is the human factor, the psychological factors,
and the technical aspects.
In terms of the human factor,
there is no doubt that today Kubernetes is hyped, and today Kubernetes is, let's call it, the obvious choice
or the standard choice when it comes to where
and how I should deploy and run my application. So once companies like Google, Netflix, Amazon, etc.
are using it and are sponsoring it in a way, it does make every other company think twice
if they're doing something wrong, if they're not using the best in class tools that those
companies are using. So there is, I think, the human factor that made at least some waves
and caused some trends to happen a couple of years back.
Obviously, when it comes to the technical aspects,
Kubernetes does come with a huge promise, right?
Basically, you know, it takes away a lot of the complexity of managing your own deployments,
secrets, you know, scale up, scale down,
and resource management around Dockerfiles.
And, you know, basically it really allows companies to scale very, very, very fast.
Now for companies that have hyper growth,
which was all of the companies
up until a couple of months back, right?
That the growth was the first thing
they focused on most of the companies.
Kubernetes allows those companies, without a lot of overhead, to grow very fast and to scale very fast without the need either to pay specialists, etc.,
or to change their hardware and software, you know, every other week.
So being able to scale in a cost effective way, I think it's something that is very appealing for most companies.
And I think Kubernetes comes with this promise and in most cases,
even able to fulfill this promise, right?
And we also saw that in the last couple of years,
it became very enterprise-ready
with security features
and compliance features baked in, et cetera,
which made it also appealing
for the Fortune 100 companies.
And today I can say that we're also working with huge banks and airline companies that
five years ago, I think it was a dream for them to adopt Kubernetes.
But today, all of them are either migrating to Kubernetes or just finalized the migration
process.
So it is quite amazing to see how common and how adopted Kubernetes became in the last
three to five years.
Yeah, that makes a lot of sense to me.
And what are you seeing?
Are you seeing customers using Amazon EKS,
like the managed Kubernetes platforms that cloud providers are providing?
Are they rolling their own?
What have you seen?
So it is a mixture.
I would say that definitely, at least from our customers,
the segment that we're approaching,
the most common use case is a managed service from Amazon to obviously Microsoft and Google.
We do have customers that are running Kubernetes on-prem and we also have customers that have hybrid cloud or sometimes even hybrid between on-prem and cloud.
For all these different reasons: disaster recovery and, you know, supporting different regions
and compliance and regulations.
So we actually see all of the above,
but I think at least for us,
the most common one is managed services
from the cloud providers.
And last question on this thread,
when do you see someone, you know,
break out from, oh, I used to run on like bare metal
or like EC2 or like directly
on a container orchestration platform, to, I should be using Kubernetes? Like, when do you see
people deciding that that migration is worth it for them? I'm wondering if there's, you know,
a tipping point, or just inflexibility, or some kind of issue. Yeah, it's a good point. I think,
you know from what we see, at least,
there are like two different paths
for companies to migrate to Kubernetes.
The first path is more of like, you know, a commando way,
like, you know, a small unit, small team
started to run their own internal tools
or their new project with Kubernetes.
It doesn't affect the legacy.
It doesn't affect the other teams.
It doesn't affect the other services.
But it's just, you know, a small thing at the beginning that runs on Kubernetes, and eventually it gradually increases: more and more internal tools, and then, hey, let's run this on Kubernetes as well, because we already have some good experience with that.
And it gradually increases to some tipping point where the organization makes a strategic
decision to migrate all of the legacy now to Kubernetes or to at least to decide that
from now on everything new is being developed, we'll run on Kubernetes, right?
So this is like one motion that we see that that it basically starts from like a very small,
scrappy team and evolves.
The second motion that we see is that a VP of infrastructure,
a VP of platform, or sometimes even the C-level
in some organization decides, for cost-efficiency reasons,
for security reasons,
or sometimes for budget reasons,
that they need to choose another platform.
And then when they examine the alternatives,
Kubernetes is usually the most common one, or the most standard one to use right now,
If you want to migrate from your old, you know, VMware or even, you know,
on-prem machines and servers to the cloud.
So it can either come like bottom up or top down.
But I think that in both motions today,
Kubernetes is like, let's say, not only the top choice,
but also I think it's also like started to be the top choice
by far from the other alternatives out there.
So this is at least from what we see.
I assume there are tons of different ways
to evolve to Kubernetes,
but from our experience is what we see.
No, that makes sense.
And like where I'm seeing the,
really the benefit of this platform
is that it's just so flexible,
but it works at scale.
It feels like that's how you can summarize it.
Like whatever company you are,
you can use a managed service,
you can use on-prem,
you have all of these different needs,
you want to put sidecars, this, that.
It really does allow you to do anything that you would need
from an orchestration platform.
Is that fair to say?
Well, yeah, it is very flexible.
This is one of the things that I think make it so common.
But the flip side of being flexible is being complex, right?
So I think it does come with a price,
and we can talk about the price later on, but I definitely think it's the flexibility. And as you mentioned,
you can run it almost anywhere, right? Like in the edge, in the cloud, on-prem, et cetera.
It's what made it so, so common and so widespread across different organizations.
Yeah. I'd love to talk about the price right now. What is the price of the flexibility and what have you seen?
So, look, you know, we're a huge fan of Kubernetes, obviously, right?
But I think, you know, as we talked about, most companies at the end of the day are adopting microservices and Kubernetes, you know, for the promise of eventually running faster, right?
Deploying more, creating an even bigger competitive advantage, being able to move faster, creating less dependency between the teams, right? And at the end of the day,
once companies are finishing the migration to Kubernetes and really try to utilize Kubernetes
to get all of those traits, in some cases, they're hitting a wall. In some cases, it's not only that
they're not getting what they wanted,
they're even degrading their original situation. And we saw an interesting report from one of the
incident management tools lately that the time spent on alerts and managing incidents and, you
know, the day-to-day operations around your infrastructure, et cetera, in the last three years actually
tripled since organizations, you know, moved to these microservices environments and Kubernetes
architecture.
Now, you know, if you're spending three times more on those things, you're probably not moving
faster, and probably not being more cost-effective, right?
So sometimes an organization, you know, finishes the migration and really tries, you know, to now enjoy the fruits of that,
and it doesn't really work.
And I think the reason for that is Kubernetes is a very complex system, right?
It's very distributed.
It's very scattered.
It is composed of thousands of different resources and events that fortunately or unfortunately changed constantly.
And they have those non-trivial relations and connections, right?
And being able to understand how all of those things work
is mandatory when you're the person that needs to
manage, operate, and troubleshoot Kubernetes, right?
So, you know, if you're the person,
if you're the team that basically needs not only to deploy code,
but actually to own your application end-to-end, and this is what is expected for most organizations when it comes
to their R&D, you are expecting your developers, you're expecting your team leads to be able to
own it end-to-end. They need to have a very vast understanding and experience and know-how around
how to do those things, right? And in most cases, they don't have this information.
They don't have this knowledge.
They don't have those even tools to do it efficiently.
And I think I saw a survey recently
that the number one concern about Kubernetes
for organizations that just adopted Kubernetes
is the steep learning curve they're facing.
So, you know, we adopted Kubernetes
or we migrated to Kubernetes, great.
We have 5% to 10% of our workforce
that is familiar with Kubernetes,
that knows how to do those complex things in Kubernetes.
But we have 90% of our R&D
who doesn't really know what Kubernetes is
and definitely don't know how to do efficiently
the things that, you know,
are required when you have an issue, when you have an incident, when something is not
working, right?
And this dissonance between, okay, we migrated to Kubernetes to, okay, we're really utilizing
Kubernetes, I think for a lot of organizations is very painful.
And, you know, at Komodor, we're trying to help mitigate this pain, obviously,
right?
But I think it is a very painful area for a lot of organizations.
And if we're talking about the price, think about it that, you know, from the developer perspective, all of a sudden they have this new infrastructure they rely on.
They didn't choose it, you know, or had any say in that. And all of a sudden they're required to manage Kubernetes, to operate Kubernetes, to troubleshoot Kubernetes, to use different tools and new processes to do that. And sometimes no one really taught them how to do that, right? No one really gave them the tools, the knowledge to do that, or even the time to get to know those new technologies and new ways to debug and new ways to understand what's going on, etc. So for them, it can be very frustrating.
But also think about the platform teams, the DevOps and the SREs.
Since they are the ones that know how to do those things, immediately, in most cases,
they become the bottleneck, right?
They become the firefighters.
They become the people who do all of the tickets.
They become the people that everything is being escalated to. And those people, who have tons of other initiatives to do, sometimes, you know, feel that
they are the firefighters of the organization.
So we have this, you know, two different groups in your R&D that are currently, you know,
not being cost effective, right?
Not doing their best job.
And just a couple of days back, we saw a report by the cloud native observability organization that one out of five developers actually want to quit their job because of this frustration and because of those friction between the developers, the DevOps, the operations, the incident response, and so on.
So we definitely see that there is a price, right?
It's not something that
is going smooth and easy for most organizations. Yeah, that certainly tracks, right? And then when
you speak about a steep learning curve, when you're trying to adopt a tool like Kubernetes,
like how shared is that learning curve across organizations? And what I'm trying to ask is,
do you have to learn different aspects of like Kubernetes
and like, does it break in different ways
at different organizations?
Or is it mostly,
once you understand how Kubernetes works at place A,
you probably understand like 50 to 60%
of how it works in place B.
Like what kind of distinction do you see there?
Yeah, I think that there is like an 80-20 rule here.
I think that 80% of the things around Kubernetes
are similar from one organization to another.
People are doing the same mistakes.
People are following the same best practices.
Obviously, you do need to blend it
and to mix it with your own business logic.
Obviously, we'll use different configuration
and different architecture and different deployment strategy, et cetera, et cetera.
But I would say that if you have, you know, five to seven years of experience operating and managing Kubernetes,
it won't be impossible for you, you know, to onboard in a new organization, understand what's going on there,
and be able to operate things efficiently after a couple of weeks, definitely months.
I will say that for a developer that, you know, never heard about Kubernetes and was
always used to, you know, one huge monolith that is being pushed or released, you know,
every quarter to production on top of, like, EC2 machines, for those people all of a sudden
to be able to, you know, being efficient in operating and managing their application on top of Kubernetes, this can be very, very painful and very tough and sometimes very frustrating thing to do.
Yeah. And that's where, like, you can encode information about how Kubernetes works in a platform, right?
Just thinking through what you're saying, a developer with five or seven years of Kubernetes experience,
probably they're a hot commodity.
Everyone wants to hire them.
And just from all the surveys,
it seems like that person will have job security for life.
But you could also encode a lot of that information into a platform.
And is that kind of what you envisioned with Komodor?
Like, why don't we platformize all of the ways that make the system hard to operationalize?
Exactly.
Yeah.
So we believe that, as you mentioned, the DevOps, the SREs, the people that have, you know, the right experience and expertise.
Obviously, you can also try to improve them and, you know, to give them tools to make them more efficient.
But to be honest, it will take them maybe from 90 or even 95 to, you know, 96 or 97, right?
So, like, the impact you can create is not that significant.
And also, product-wise, it will probably be very, very hard to make those people more efficient, right?
And to give them more insight or more automation to really, you know, significantly help them. Having said that, when we look at the developers
or the data engineers
or everyone that is using Kubernetes as a customer, right?
They're building their application
or pipelines on top of Kubernetes.
Those people, we believe that we can take their efficiency
and expertise in Kubernetes from a scale from zero to 100,
let's say from 20 to 60 or 70,
by giving them all of the automation and all of the tools baked in in the platform
that basically simplified significantly the operation and management
and troubleshooting aspects for their applications.
So our thesis was, when we started the company,
what if we can take the knowledge that already exists, right, for the DevOps and the senior SREs, and we can take their knowledge and expertise, and we can basically democratize what they're doing to the entire organization.
So every developer will be able to easily detect issues in Kubernetes, investigate them, and even remediate them independently and efficiently
without the need to escalate to the DevOps or to the SRE, right?
So basically, this is the entire reason why we founded Komodor.
And this is basically what we do today.
So we're actually offering an end-to-end platform for every engineer in the company,
from the most junior developer to the head of engineering, to be able to do the same things on top of Kubernetes,
even though they don't have the same amount of expertise
and knowledge around Kubernetes,
but we mitigate it using our methods and platform
to significantly simplify those things for them.
Is there a one-wow-moment kind of workflow
that you see in 20% to 30% of demos
or when people use the product for the first time?
It's like this one thing that keeps annoying them about their Kubernetes workflows
that your system just shows or automates.
I'm wondering if there's something like that that you all have noticed.
Yeah.
So, obviously, it's hard to say about one thing
because there are so many different things we built in the last couple of years.
But I think one of the most annoying issues that is very repetitive across different customers is that who should care about which issue in Kubernetes?
I will give you like the simplest example.
I have a pod that is doing restarts, right?
Everybody knows that; it's, like, something that happens on
a daily basis, unfortunately.
Now, we don't know if this is, you know,
an infrastructure issue in the node, we don't know
if it's an application issue in the
specific deployment that tries to run this pod,
etc. So we need to start,
you know, triaging the issue, right? You need to
go one by one to different checks or
different tools to try to understand if it's an
application issue, if it's an infrastructure issue. And you do all of that just to understand if you or your
team are the owners for that, or maybe it's another team, right? Like the DevOps or the
SREs, if it's like a node issue, for example. So what we do in Komodor, it might sound very
simple, but it saves actually a lot of efforts and resources. Whenever there is a problem in the
pod, we run a bunch of checks that are predefined to try to identify why this pod is restarting.
Is it a problem that is an application issue, an infrastructure issue? If it's like a node issue,
if it is a node issue, why it's a node issue, what happened to the node, right? And we do all
of those checks for our users. And then basically we send them an automatic report saying, you have a restarting pod.
The reason: it's in a CrashLoopBackOff, for example.
The reason for the CrashLoopBackOff is that you did a bad deploy a couple of minutes earlier.
Click here to roll it back.
Or if it's a node issue, we'll say, hey, you have a restarting pod.
The pod that is, you know, doing the restart, the node that it is using is malfunctioning.
All of the pods that are using this node
are restarting around the same time.
You should probably escalate to your infrastructure team
or to your DevOps team, right?
So in one click, they can see
not only that they have a pod doing the restart,
but also, is it something that I should care about
because it's an application issue,
or is this something my DevOps or my SRE
should care about because this is a node issue, right? And I think when they see it in the same
timeline, right, and they get like a single report saying, you know, this is the symptom
and this is basically the root cause, then it clicks like, okay, now I understand like how I
can significantly improve my day-to-day, how I can significantly reduce the number of tickets I open,
how I can significantly reduce the number of escalations
and time being spent on similar issues, right?
So for them, it's like seeing it in a single timeline
with our report on top of it that can actually like make them understand
how much time they can save or how many resources they can save
using a tool like us.
That would be so helpful for my life as well.
Probably the flakiest set of alerts we have
are the container restart ones.
And right now we haven't even tuned them well enough
to not fire when we do deployments.
So we have an alert that says,
oh, pod is restarting too frequently in the last 30 minutes.
But what ends up happening is you increase the number of pods in a certain deployment
because that service needs more capacity and the alerts are not tuned.
If we had something that could tell us, this is normal, this happens every deployment,
we should just fix it.
Or, oh, it looks like it's an actual application issue.
That's when it's in a crash loop.
It would make my life easier.
Yeah. So I think we spent, like, I don't know, overall, maybe 10 to 15 engineering years solving only this.
Like, you know, one of the challenges in Kubernetes is that, you know, basically it's a lot of resources, and each resource has a status, right?
And one of the challenges is to create a state out of it.
So in Komodor, one of the cool things is that you can log into Komodor.
Komodor will basically not only tell you what the current statuses are, but basically what's healthy and what's not healthy in each one of your services or resources, and how it evolved or changed over time, right?
So we actually create a state out of all of the different statuses and events that
we read from the Kube API itself. So basically, you know, if you just look at a pod and its status
is not ready, right? It's something else, it's pending, it's whatever. Is it an issue or not?
You don't know, right? You need to check if it's in the middle of a deploy. You need to check if
maybe, you know, we're using spot instances and something is up and down.
You need to check whether maybe after a couple of seconds
it was resolved, or maybe it was not resolved.
Some things are transient. Some things
are not transient. There are so
many different nuances
for each one of the different resources
in Kubernetes. So if you
just give it to someone who doesn't really know
what Kubernetes is, their chances
of being able to resolve an issue are very, very low without all of this understanding, right?
Having said that, if you send it to the most senior DevOps on the team, he will tell you, oh, yeah, this is how Kubernetes acts.
You know, this alert, as you mentioned, is just because, you know, a deployment is running right now.
Let's wait five minutes and it will automatically be resolved, right?
But you definitely don't want to escalate every time there is an issue for this very, very expensive and, you know, busy DevOps, right?
So this is exactly what we try to do.
We try to alert only on the real issues, and we help you to investigate why it happened, so you can actually fix it or remediate it via Komodor,
hopefully in one to two clicks,
rather than, you know, hours of investigation sometimes.
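The "state out of statuses" idea, including ignoring transient blips like a pod that is briefly not ready mid-rollout, can be sketched as a toy reduction. The status strings, grace period, and return values here are assumptions for illustration, not Komodor's real logic.

```python
from datetime import datetime, timedelta

def health_state(status_history: list[tuple[datetime, str]],
                 deploying: bool,
                 now: datetime,
                 grace: timedelta = timedelta(seconds=60)) -> str:
    """Reduce a pod's raw status history into a single health state.

    A pod that is briefly not Running (a rollout, a spot node being
    replaced) shouldn't page anyone; only a *sustained* bad status counts.
    """
    if not status_history:
        return "unknown"
    history = sorted(status_history)          # oldest first
    _, last_status = history[-1]
    if last_status == "Running":
        return "healthy"
    if deploying:
        return "healthy (rollout in progress)"
    # Find how long the pod has continuously been in a non-Running status.
    bad_since = history[-1][0]
    for t, s in reversed(history[:-1]):
        if s == "Running":
            break
        bad_since = t
    if now - bad_since < grace:
        return "degraded (transient, give it a moment)"
    return "unhealthy"
```

A pod that went Pending thirty seconds ago comes back "degraded (transient)", while one stuck in CrashLoopBackOff for three minutes comes back "unhealthy", which is the distinction between noise and a real issue that the conversation is about.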
How do you automate the fix, right?
And is that scary for engineers?
Like one thing that I would be very worried about
is some platform acting on my orchestration platform.
Like how do you kind of build that trust?
First of all, what you described
sounds awful. I will never let a vendor automate anything on my production system, right?
So yeah, we're being very, very, very careful around that. What we're doing is we're starting with
suggestions, right? Like, hey, we noticed that it can be caused by X, Y, and Z; here is how we can solve
X, here is how we can solve Y, here is how we can
solve Z, right? So we started with suggestions. Once we saw our suggestions, you know, make sense,
like users actually try to follow them, we started to add actions from our platform. Everything that we added is guarded behind RBAC.
So actually like the platform engineers
or the dev engineers can actually choose
which teams can take which actions
on which resources, right?
So for example, you're probably okay
with the developers that own some service
to be able to restart the pod, right?
Or even to do a pod right or even
to do a revert on staging okay but you're probably definitely not okay with you know letting the
developers to drain a node in production right so like you can actually choose which teams can do
what and using commodore we are basically suggesting them what to do, giving them the functionality to do that, but they need to take the ownership and responsibility to actually make the action.
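The per-team, per-resource guardrails described here map naturally onto Kubernetes' built-in RBAC. As an illustrative example (the namespace and role names are made up, and this is not Komodor's configuration), a Role like the following would let a service team delete, and therefore effectively restart, pods in its own namespace, while node-level actions such as draining stay off-limits because nodes are cluster-scoped and not granted here:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: checkout        # hypothetical team namespace
  name: pod-restarter
rules:
- apiGroups: [""]            # "" means the core API group
  resources: ["pods"]
  verbs: ["get", "list", "delete"]  # deleting a pod lets its controller recreate it
```

A RoleBinding would then attach this Role to the team's group or service account; draining a node would additionally require cluster-scoped permissions on nodes and pod eviction, which this Role deliberately does not grant.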
We don't believe that the right move here is go to the auto-healing motion
and try to automate the things for users.
We know that it is hard enough for most organizations to trust themselves to make actions, but we try to
take all of the heavy-lifting part from it. We're trying to be this single place that shows you
the symptom, allows you to explore and find the root cause, suggests what to do with it,
and allows you to take the action to solve it. Everything is audited. Everything is under RBAC and SSO.
But eventually you're the person that needs to decide
this is the right thing to do.
Click on the button and hopefully to resolve the issue.
And you can also probably click and track
how many times people are actually clicking the button.
And you can even see how useful your remediations are
and optimize that metric over time.
That is the ideal place, right?
You ping people when you know you should be pinging people.
You let them take the action and you can track if they're taking that action over time.
Exactly.
The cool thing around it is that, as I mentioned, we never just show actions arbitrarily.
So every time a user takes an action in Komodor,
is because we gave them some context
that they decided that the right thing to do
is take this action, right?
So we can actually see when we were right
and we can actually fine tune
which suggestions we're providing our users
in which kind of scenario.
And sometimes, you know,
we understand that what we did doesn't make sense.
And sometimes we see that
what we did totally makes sense.
And then obviously, you know, we can improve and iterate on top of it.
So at the end of the day, we do want to build this machine, right?
That for almost all of the scenarios in Kubernetes already knows what actions or steps you should take to resolve it, right?
And if we can get to this point, our customers will even benefit more from our capabilities.
So we're definitely building our brain in a way
that is data-driven,
and we do hope to improve our suggestions
and our methods around it over time.
Yeah. Maybe a dumb question,
but why doesn't Kubernetes have all of this stuff inbuilt?
I think it's a fair question. I think some of the reason is, you know, it's an open source project
that, you know, basically can go in so many different directions, you know, from adding more
resources, to adding more abstraction on top of it, to adding more security functionality. And,
you know, there are so many different things to do.
And I think specifically the layer that is around reliability, observability, you know,
operational is usually being done historically by either vendors or, you know, the cloud
providers themselves.
So we don't see it organically coming from the core community of Kubernetes.
Definitely, you know, it might be a good addition. I do see different releases from the cloud providers, you know, each trying to take their own perspective and their own thesis on what exactly is missing and how to solve it.
And I think this is one of the things that make this project so,
you know, so fascinating, right?
Like how different companies, different players,
different, you know, layers in the ecosystem,
each one contributes something else to create this, you know,
eventually amazing, amazing experience
for the end user.
Yeah, that certainly tracks, right?
It's open source.
But the fact that you have all of these, like, open layers lets multiple people
build and make improvements.
How do you keep track of, you know, adoption of new features and make sure that Komodor
is satisfying the needs of customers using, like,
new features that Kubernetes has?
Like, is it kind of like:
you see what Kubernetes is doing,
you see what problems are coming,
Based on that, you add extra stuff to Komodor.
Like, how does that stuff work?
How quickly does Kubernetes actually ship out new things?
And, like, how much of a race is it
to, like, just keep up with what's happening in the ecosystem
Yeah, that's a good question. First of all, I am an ex-Googler and my partner is from eBay, so, like,
we are very data-driven. So we keep a lot of track of what our users are doing, what they're
doing in the platform, you know, which features they're finding more useful. We're having tons
of conversations with our users. We try to be very close to our users.
You know, in that way, we're very like product-led company
and not very like sales-led company.
So we're having, you know, interviews with end users,
I think, on a daily basis.
So we do try to keep track and to hear their pains,
their insights, you know, their feedback,
not only on our tool, but also on different tools in the ecosystem,
say CD tools, monitoring tools, and definitely on Kubernetes itself.
Now, I will say that in Kubernetes, obviously, it is changing quite a lot
and adding new capabilities, et cetera.
But what I think is very interesting is the usage and the adoption
of custom resources, of CRDs.
And this is, I think, the equivalent of, like, the Wild West, right? Because everyone can create a new CRD,
it can all of a sudden become super popular, and, you know, it can change and you can add more
functionality there. So I think we see a lot of new CRDs going from zero popularity to becoming very, very common in a matter of weeks or months.
And then obviously us as a vendor, like, as a solution that tries to simplify Kubernetes operations, we need to understand very quickly, okay, what does this CRD do?
You know, when it breaks, what do you need to do?
What do you need to check?
How does it interact with your other Kubernetes resources?
And then basically we have some kind of a research department
that does all of those things very, very quickly,
translates it into a product spec,
and then we ship it into the product.
So those things we do, I would say,
even like on a weekly basis sometimes,
but we are trying to respond very fast
to changes in the ecosystem.
And on that note of the research department,
I would imagine the hiring for your company
has to be of just deeply technical people
who know what they're doing
because this is like a deeply technical product, right?
Like you have to know you're aiming for the right things.
You have to think about how like DevOps engineers
and SREs are thinking.
Is that the primary population where you hire from? Definitely. So we have a bunch of people coming
from different backgrounds. All of our product managers, for example, were platform engineers,
system engineers, or sometimes even SREs, right? So, you know, it's super important to hire people
that have firsthand experience with the problem we're trying to solve.
It is a very technical problem that does require a lot of understanding.
If you want to simplify something very complex,
you do need to be an expert in this specific complexity, right?
So it is hard for us, you know, to find the right people.
And I would say that the bar, you know, for working at Komodor is quite high for that reason. But I think this is what makes working here a good opportunity, because all of the people here are obsessed with Kubernetes, with cloud native tools, with solving complex issues, with simplifying how to solve complex issues. And, you know, when this is something that fascinates you, when this is something that
you find interesting, you can create awesome things together, right?
So, you know, even our designers are people that are, you know, fascinated by solving
complex issues and simplifying complex tasks.
And, you know, working together, software engineers, designers,
product managers that find this, you know, these kinds of challenges interesting, the
end result can sometimes be, you know, groundbreaking and, you know, non-orthodox in a lot of cases.
So it is obviously a challenge to find those people, but then, you know, when you do find
them, I think, you know, it can create an amazing environment to work together on
solving those issues.
Yeah.
And perhaps a similar note, not exactly the same.
How has the economic pivot really affected your team?
You can answer in whatever detail you'd like. Like, what are you seeing from, you
know, prospective customers?
Is the adoption of tools like Kubernetes slowing down? Is the adoption
of vendors on top of Kubernetes slowing down? Like, what are you seeing generally?
Yeah, so I do see two different trends that are opposite to each other. This is only, obviously, our own experience,
right? I'm not claiming to say it's the entire landscape. But obviously the biggest,
you know, mindset right now for most companies is to do more with less, right? Like, a year ago it was
growth at any cost, right? Just grow, grow, grow. And today what we're hearing is, you know,
we need to do more but with less, right? And this is a big, big change, right? From growth at any cost to, you know, doing more with less.
And obviously it translates to sometimes cutting spends, cutting budget, letting people go,
right?
Very talented people sometimes.
And it definitely also affects us, right?
I can tell you that we had POCs where the entire team used the tool, and all of a
sudden, you know, they got the letter that they're being laid off from the company. So obviously,
it also kills the POC, right? We reached with some companies to commercial agreement. And,
you know, for some reason, now the C-level needed to approve it. And, you know, they just said,
no, like, we're not going to approve new tools in this
calendar year at all. It doesn't matter which tool it is. We're only going to use what we already
have. So obviously in some cases we saw the, you know, the effect of what's happening in the market.
I would say the second trend we're seeing is that because of companies trying to do more with less
and because of the reason that, you know,
those DevOps and SREs are so expensive
and you have so little of them,
you definitely need to guard their time
and to make sure that you're not wasting their time
on, you know, escalations, tickets and firefighting.
And you do need to do more with less.
So we try to understand how we can get more from what you have, right?
More from my developers, more from my DevOps, more from my SREs and NOC.
And then a tool like Komodor that basically automates a lot of the, you know, manual operations,
management, and troubleshooting aspects that are, you know are happening on a daily basis for most organizations,
we can actually save a significant amount of time,
not only from the developers, but also from the DevOps and the SREs.
And if indeed you can prove that you can save X amount of time,
X amount of hours every week by automating those tasks,
for those organizations, this is actually a good story for them to buy a tool like Komodor,
right? So we can actually win, and we saw it in the last couple of months in some POCs that we
thought, you know, were going to say no because of what's happening in the market. But
actually, they told us that the management is convinced that with a tool like Komodor,
they can actually reduce labor hours, they can actually
reduce the toil, and actually can focus on the things that actually can move the needle for the
business and not focus on Kubernetes troubleshooting and Kubernetes operations. So I think overall,
we see that for tools like us that can actually save time and automate, you know, IT tasks,
companies still have, you know,
let's say, big appetite to try to purchase,
but definitely these are not easy times, right?
So overall, as I mentioned,
I think those two trends
are opposite to each other.
So far we are, you know,
we as a company are doing okay,
but we're definitely not being cocky or anything.
And we will definitely try to understand where the market is going in '23,
which kind of trends are going to happen.
And as a company,
we need to make sure that we can thrive
and succeed even in tough times.
And hopefully,
in a couple of months,
things will stabilize and improve,
but we're definitely ready for either option.
Yeah.
The idea of those two trends that are in opposite directions makes a lot of sense to me.
Finally, you know, just as a technologist, right, what gets you excited
about the next year?
Are there, you know, some platform improvements that are coming to Komodor?
Are there some capabilities that the Kubernetes API is going to provide that's going to expand what Komodor can do? Is there anything like that that gets you excited from, like, a technology
perspective? Wow, so many. I would say that I think what I'm mostly excited about is our plans to expand beyond only troubleshooting.
You know, when we started to talk about troubleshooting,
today we're allowing ourselves to say that we, you know,
we're also helping to manage, to operate,
and to troubleshoot the Kubernetes clusters.
What we want to do by the end of the year is to be very proactive
in allowing the end users, the developers and the DevOps,
basically not only to manage and to operate, but also to make sure they comply with best practices, to make sure
that they're running the most cost-effective way, to make sure that they're deploying in a way that
is healthy in the first place to their clusters so they will have less issues. So if we started
from a bit of a reactive place of, okay, you have an issue,
how to solve it, what we want to do in the next couple of months is to start being more proactive
and actually suggest and enforce different best practices, et cetera. So your system will be more
healthy and more reliable and more cost-effective in the first place. So we have tons of new
capabilities and features
that we plan to roll out in the next couple of weeks and months
to start and try out.
But we already got a very, very positive feedback
from our beta customers that, you know,
using our suggestions, best practices, and enforcement,
they actually managed to reduce the number of escalations,
reduce the number of issues, and reduce spend using
Komodor.
So this is definitely something we're all super excited about.
And it will be, I think, a pretty significant change and expansion for us as a company.
But, you know, we can talk in a year and I can tell you how it went.
Some kind of, like, service guardrails? That's the kind of stuff you're talking about?
Like every service should make sure
they have like at least X nodes running
for like availability.
Is that kind of an example?
There are some things that, like,
can be done at the static level,
by reading all of the configurations in YAML
and seeing, you know,
if they're complying with best practices.
But think about like all of the dynamic stuff, right?
Can you identify idle resources, right? Like, you know, you deploy the PVC, it's deployed correctly,
but no one is using it, right? It's just out there. Same goes for like, you know, a config
map, right? Like you just have a config that no one reads and no one updated for a month, right?
Can you just kill it, right? Is it a good usage of your resources? Is it a good usage of your time?
So there are so many things from the static point of view,
but also from the dynamic point of view,
that if you can combine them both
and better understand what is actually deployed
and what is actually being used
and if it's being used correctly or not,
and then you can proactively suggest what to do with it,
either to change some configuration, to kill this resource, or to
run it in a different way, right? So we're taking it from, like, different perspectives and different
approaches. And yeah, eventually it can be, you know, symbolized as a scorecard and things like
that. But I think, you know, the rule engine behind the scenes is something that gets us
very excited lately. Especially, like, if it's very easy to add new rules and see how that rule is going to affect or
how it operates on your organization, how easy it is to burn that rule down in terms
of doing the migration to fix what that rule says.
That is exciting.
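The "dynamic" idle-resource check described above, a ConfigMap that nothing reads, boils down to cross-referencing resources against the workloads that could consume them. A minimal illustrative sketch follows; the data shapes are simplified stand-ins for what the Kubernetes API returns, and this is not Komodor's actual logic:

```python
def find_idle_configmaps(configmap_names, pods):
    """Return ConfigMaps that no pod mounts as a volume or pulls env vars from."""
    referenced = set()
    for pod in pods:
        # ConfigMaps mounted as volumes.
        for vol in pod.get("volumes", []):
            if "configMap" in vol:
                referenced.add(vol["configMap"]["name"])
        # ConfigMaps injected wholesale as environment variables via envFrom.
        for source in pod.get("envFrom", []):
            if "configMapRef" in source:
                referenced.add(source["configMapRef"]["name"])
    return [name for name in configmap_names if name not in referenced]
```

A production check would also need to cover per-variable `valueFrom` references, projected volumes, and consumers outside pods, which is part of why, as Ben says, combining the static and dynamic views is the hard part.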
Well, thank you so much for being a guest, Ben.
This was a lot of fun.
And yes, I hope to have you or someone from Komodor again in a year to see how all of
this goes.
I can tell you that if everything goes great, it will be me.
And if not, I will send my partner.
Yeah, that'd be great.
I'd like talking to both of you.
Awesome. Bye.