Screaming in the Cloud - Unpacking the Costs and Value of Observability with Martin Mao
Episode Date: July 18, 2023

Martin Mao, CEO & Co-Founder at Chronosphere, joins Corey on Screaming in the Cloud to discuss the trends he sees in the observability industry. Martin explains why he feels measuring observability costs isn't nearly as important as understanding the velocity at which observability costs are increasing, and why he feels efficiency is something that has to be built into processes as companies scale new functionality. Corey and Martin also explore how observability can now be used by business executives to provide top-line visibility and value, as opposed to just seeing observability as a necessary cost.

About Martin

Martin is a technologist with a history of solving problems at the largest scale in the world and is passionate about helping enterprises use cloud native observability and open source technologies to succeed on their cloud native journey. He's now the Co-Founder & CEO of Chronosphere, a Series C startup with $255M in funding, backed by Greylock, Lux Capital, General Atlantic, Addition, and Founders Fund. He was previously at Uber, where he led the development and SRE teams that created and operated M3. Previously, he worked at AWS, Microsoft, and Google. He and his family are based in the Seattle area, and he enjoys playing soccer and eating meat pies in his spare time.

Links Referenced:
Chronosphere: https://chronosphere.io/
LinkedIn: https://www.linkedin.com/in/martinmao/
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Human scale teams use Tailscale to build trusted networks.
Tailscale Funnel is a great way to share a local service with your team for collaboration, testing, and experimentation.
Funnel securely exposes your dev environment at a stable URL, complete with
auto-provisioned TLS certificates. Use it from the command line or the new VS Code extensions.
In a few keystrokes, you can securely expose a local port to the internet right from the IDE.
I did this in a talk I gave at Tailscale Up, their inaugural developer conference.
I used it to present my slides and
only revealed that that's what I was doing at the end of it. It's awesome. It works. Check it out.
Their free plan now includes three users and 100 devices. Try it out at snark.cloud slash
tailscale scream. Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted guest
episode is brought to us by our friends at
Chronosphere. It's been a couple of years since I got to talk to their CEO and co-founder,
Martin Mao, who is kind enough to subject himself to my slings and arrows today.
Martin, great to talk to you. Great to talk to you again, Corey, and looking forward to it.
I should probably disclose that I did run into you at Monitorama a week before this recording.
So that was an awful lot of fun to just catch up and see people in person again.
But one thing that they started off the conference with in the welcome to the show style of talk was the question about benchmarking what observability spend should be as a percentage of your infrastructure spend.
And from my perspective, that really feels a lot like a question that looks like,
well, how long should a piece of string be?
It's always highly contextual.
Agree, disagree, or are you hopelessly compromised?
Because you are, in fact, an observability vendor,
and it should always be more than it is today.
I would say, definitely agree with you from an exact number perspective. I don't think there is a magic number like 13.82% that this should be. It definitely depends on the context of how
observability is used within a company. And really, ultimately, just like anything else you pay for,
it really gets derived from the value you get out of it. So I feel like if you feel like you're getting the value
out of it, it's sort of worth the dollars that you put in. I do see why a lot of companies out
there and people are interested because they're trying to benchmark to try to see, am I doing
best practice? So I do think that there are probably some best-practice ranges that we see most typical organizations out there fall into, is one thing I would say.
The other thing I would say when it comes to observability costs is, one of the concerns we've seen talking with companies is that if the relative cost of observability is growing faster than infrastructure and you extrapolate that out
a few years, then the direction in which this is going is bad.
So it's probably more the velocity of growth than the absolute number that folks should
be worried about.
I think that that is probably a fair assessment.
I get it all the time, at least in years past, where companies will say,
for every thousand daily active users, what should it cost to service them? And I finally snapped in one of my talks that I gave at DevOps Enterprise Summit and said, I think it was something like
$7.34. It's an arbitrary number that has no context in your business, regardless of whether
those users are Twitter users or large banks you have partnerships with.
But now you have something to cite. Does it help you? Not really. But will it get people to leave you alone and stop asking you awkward questions? Also, not really. But at least now you have a
number. Yeah, 100%. And I'm glad our magic numbers weren't too far away from each other. But yeah, I mean, there's no exact number there for sure.
One pattern I've been seeing more recently is like rather than asking for the number,
there's been a lot more clarity in companies on figuring out, well, okay, before I even
pick what the target should be, how much am I spending on this per whatever the unit of efficiency
is, right?
And generally that unit of efficiency, I've actually seen it be mapped more to the business
side of things.
So perhaps to the number of customers, or to customer transactions and whatnot. And those things are generally perhaps easier to model out and easier to justify, as opposed to purely, you know, the number of seats or the number of end users. But I've seen a lot more companies at least focus on the measurement of things. And again, it's been more about, rather than the absolute number, the relative change in number.
Because I think a lot of these companies are trying to figure out, is my business scaling
in a linear fashion or a sub-linear fashion, or perhaps an exponential fashion?
If the cost, you know, you can imagine, is growing exponentially, that's a really bad thing that you want to get ahead of.
That, I think, is probably the real question people are getting at: it seems like this number only really goes up and to the right. It's not something that we have any real visibility into. And in many cases, it's beating up your CloudWatch API charges all the time on this other side as well. And data egress is not free, surprise,
surprise. So it's the direct costs, it's the indirect costs. And the thing people never talk
about, of course, is the cost of people to feed and maintain these systems. Yeah, 100%. You're
spot on. There's the direct costs, there's the indirect costs, like you mentioned, in observability. Network egress is a huge indirect cost. There's the people that you mentioned that need to maintain these systems.
And I think those are things that companies definitely should take into account when they
think about the total cost of ownership there.
What's more, in observability, and this is perhaps a hard thing to measure as well, often we ask companies, well, what is the cost of downtime? Right? Like, if your business is impacted and your customers are impacted and you're down, what is the cost of each additional minute of downtime, perhaps? And then the effectiveness of the tool can be evaluated against that. Because, you know, observability is not just any other developer tool. It's the thing that's giving you insight into, is my business or my product or my service operating in the way that I intend? And, you know, is my infrastructure up, for example, as well, right? So I think there's also the piece of, what is the tool really doing in terms of lost revenue or brand impact? Those are often things that are quite easily overlooked as well.
I am curious to see whether you have noticed a shifting in the narrative lately,
where as someone who sells AWS cost optimization
consulting as a service,
something that I've noticed is that
until about a year ago,
no one really seemed to care overly much
about what the AWS bill was.
And suddenly my phone's been ringing off the
hook. Have you found that the same is true in the observability space where no one really cared what
the observability costs until suddenly recently everyone does, or has this been simmering for a
while? We have found that exact same phenomenon. And what I tell most companies out there is
we provide an observability platform that's targeted at cloud-native platforms.
So if you have a cloud-native architecture, if you're running a microservices-oriented architecture on containers,
that's a type of architecture that we've optimized our solution for.
And historically, we've always done two things to try to differentiate.
One is provide a better tool to solve that particular problem in that particular
architecture. And the second one is to be a more cost-efficient solution in doing so. And not just
cost-efficient, but a tool that shows you the cost and the value of the data that you're storing.
So we've always had both sides of that equation. And to your point, in conversations in the past
years, they've generally been led with, look, I'm looking for a better solution. If you just happen to be cheaper, great. That's a nice
cherry on top. Whereas this year, the conversations have flipped 180, in which case most companies are
looking for a more cost-efficient solution. If you just happen to be a better tool at the same time,
that's more of a nice to have than anything else. So that conversation has definitely flipped 180 for us. And we found a pretty similar experience to what you've been
seeing out in the market right now. Which makes a tremendous amount of sense.
I think that there's an awful lot of, we'll just call it strangeness. I think that's probably the
best way to think about it in terms of people waking up to the grim reality that not caring
about your bills
was functionally a zero interest rate phenomenon in the corporate sense.
Now, suddenly everyone really has to think about this in some unfortunate and
some would say displeasing ways.
Yeah, 100%. And it was a great environment for tech for over a decade, right? So it was an
environment that I think a lot of companies and a lot of individuals got
used to. And perhaps a lot of folks that have entered the market in the last decade don't know
of another situation or another set of conditions where efficiency and cost really do matter. So
it's definitely top of mind. And I do think it's top of mind for good reason. I do think
a lot of companies got fairly inefficient over the last few years chasing that top line growth. Yeah, that has been, I think it makes sense in the context
with which people were operating because before a lot of that wound up hitting, it was, well,
grow, grow, grow at all costs. What do you mean? You're not doing that right now. You should be
doing that right now. Are you being irresponsible? Do we need to come down there and talk to you?
A hundred percent. Yeah, so eat your vegetables. Now it's time to start paying attention to this.
Yeah, a hundred percent. It's always a trade-off, right? It's like an individual company,
an individual team, you only have so many resources and prioritization. And I do think,
to your point, in a zero-interest environment, trying to grow that top line was the main thing
to do. And hence, everything was pushed on how quickly can we deliver new functionality,
new features to grow that top line. Whereas the efficiency is always something I think a
lot of companies looked at as something I can go deal with later on and go fix. And I feel like
that time has now just come. I will say that I somewhat recently had the distinct privilege of
working with a company whose observability story was effectively, we wait for customers to call
and tell us there's a problem, and then we go looking into it.
And on the one hand, my immediate former-SRE reflexes kicked in and I recoiled.
But this company has been in this industry longer than I have.
They clearly have a model that is working for them and for their customers.
It's not the way I would build something, but it does seem that for some use cases, you absolutely are going to be okay
with something like that. And I should probably point out, they were not, for example, a bank
where, yeah, you kind of want to get some early warning on things that could destabilize the
economy. Right, right. I mean, to your point, depending on the context and the company,
it could definitely make sense and depending on how they execute as well, right?
So, you know, you called out an example already: if they were a bank, or if the correctness or timeliness of a response was important to that business, it's perhaps not the best thing to have your customers find out, especially if you have a ton of customers at the same time. However, if it's a different type of business
where the responses are perhaps more asynchronous
or you don't have a lot of users encountering at the same time,
or perhaps you have a great A-B experimentation platform,
testing platform,
there are definitely conditions in which that could be
potentially a viable option,
especially when you weigh up the cost and the benefit.
If the cost of having a few customers have a bad experience is not that much to the business,
and the benefit is that you don't have to spend a ton on observability, perhaps that's a trade-off
that the company is willing to make. In most of the businesses that we've been working with,
I would say that's probably not been the case, but I do think that there's probably some bias and some skew there in the sense that you can imagine a company that cares about these things perhaps is more likely to talk to an observability vendor like us to try to fix these problems.
When we spoke a few years back, you definitely were focused on the large, one would say almost hyperscale style of cloud-native build-out.
Is that still accurate,
or has the bar to entry changed since we last spoke? I know you've raised an awful lot of money,
which, good for you, it's a sign of a healthy, robust VC ecosystem. But the counterpoint to that
is they're probably not investing in a company whose total addressable market is like 15 companies
that must be at least this big. 100%, 100%. So I would say that the bar to entry definitely has changed,
but it's not due to a business decision on our end. If you think about how we started and the
focus area, we're really targeting accounts that are adopting cloud-native technology.
And it just so happens that the large tech decacorns and the hyperscalers were the earliest adopters
of cloud-native, so containerization or microservices.
They were the earliest adopters of that.
So hence, there was a high correlation
in the companies that had that problem
and the companies that we could serve.
Luckily for us, the trend has been that
more of the rest of the industry
has gone down this route as well.
And it's not just new startups. You can imagine any new startup these days probably starts off
cloud native from day one. But what we're finding is the more established larger enterprises are
doing this shift as well. And I think the folks out there like Gartner have studied this and
predicted that by about 2028, I believe was the date,
about 95% of applications are going to be containerized in large enterprises. So it's
definitely a trend that the rest of the industry will go on. And as they continue down that trend,
that's when sort of our addressable market will grow because the amount of use cases
where our technology shines will grow along with that as well.
I'm also curious about your description of being aimed at cloud-native companies.
You gave one example of microservices powered by containers, if I understood correctly.
What are the prerequisites for this?
When you say that, it almost sounds like you're trying to avoid defining a specific architecture
that you don't want to deal well with or don't want to support
for a variety of reasons. Is that what it is? Or is there certain you must be built in these ways
or the product does not work super well for you? What is it you're trying to say with that is what
I'm trying to get at here. Yeah, 100%. If you look at the founding story here, it's really that my co-founder and I found Uber going through this transition of both a new architecture, in the sense that, you know, they were going to containers,
they were building microservices-oriented architecture there,
but also adopting a DevOps mentality as well.
So it was just a new way of building software almost.
And what we found is that when you develop software in this particular way,
so you can imagine when you're developing a tiny piece of functionality as a microservice and you're an individual developer and you can imagine rolling
that out into production multiple times a day. In that way of developing software, what we found
was that the traditional tools, the application performance monitoring tools, the IT monitoring
tools that existed before this architecture and way of developing software just weren't a good fit. So the whole reason we exist is that we had to figure
out a better way of solving this particular problem for the way that Uber built software,
which was more of a cloud native approach. And again, it just so happens that the rest of the
industry is moving down this path as well.
And hence, that problem is larger for a larger portion of the companies out there.
I'd say some of the things when you look into why the existing solutions can't solve these problems well,
if you look at an application performance monitoring tool, an APM tool,
it's really focused on introspecting into that application and its
interaction with the operating system or the underlying hardware. And yet these days, that
is less important when you're running inside a container. Perhaps you don't even have access to
the underlying hardware or the operating system. And what you care about, you can imagine, is how
that piece of functionality interacts with all the other pieces of functionality out there over a network call.
So just the architecture and the conditions ask for a different type of observability, a different type of monitoring.
And hence, you just need a different type of solution to go solve for this new world.
Along with this, which is sort of related to the cost as well, is that, you know, as we go from virtual machines onto containers, you can imagine the sheer volume
of data that gets produced now because everything is much smaller than it was before and a lot more
ephemeral than it was before. And hence, every small piece of infrastructure, every small piece
of code, you can imagine still needs as much monitoring and observability as it did before
as well. So just the sheer volume of data is so much larger
for the same amount of infrastructure,
for the same amount of hardware that you used to have.
And that's really driving a huge problem
in terms of being able to scale for it
and also being able to pay for these systems as well.
Tired of Apache Kafka's complexity
making your AWS bill look like a phone number?
Enter Red Panda.
You get 10x your streaming data performance without having to rob a bank. Visit go.redpanda.com slash duckbill. Red Panda, because Kafka shouldn't
cause you nightmares. I think that there's a common misconception in the industry that
people are going to either have ancient servers rotting away in racks, or they're going to build
something greenfield the way that we see done on keynote stages all the time of companies that have been around with this architecture for less than 18 months. In practice, I find it's
awfully frequent that this is much more of a spectrum and a case-by-case per workload basis.
I haven't met too many data center companies where everything's a disaster that the cloud
companies like to paint it as. And vice versa, I also have never yet seen an architecture that really existed as described in a keynote presentation.
I 100% agree with you there.
And, you know, it's not clean cut from that perspective.
And also, you're also forgetting the messy middle as well, right?
Like often what happens is there's a transition.
If you don't start off cloud native from day one, you do need to transition there from your monolithic
applications, from your VM-based architectures. And often the use case can't transform over
perfectly. What ends up happening is you start moving some functionality and containerizing
some functionality, and that still has dependencies between the old architecture and the new
architecture. And companies have to live in this middle state, perhaps,
for a very long time. So it's definitely true. It's not a clean-cut transition.
But you can think about that middle state as actually one that a lot of companies struggle
with because all of a sudden, you only have a partial view of the world or what's happening
with your old tools. They're not well-suited for the new environments. Perhaps you've got to start
bringing new tools and new ways of doing things in your new environments, and they're not perhaps
the best suited for the old environments as well. So you do actually end up in this middle state
where you need a good solution that can really handle both because there are a lot of interdependencies
between the two. And it's actually one of the things that we strive to do here at Chronosphere
is to help companies through that transition. So it's not just all of your new use cases, and it's not just all of your new environments; actually helping companies through this transition is pretty critical as well. My question
for you is that given that you have a, I don't want to say a preordained architecture that your
customers have to use, but there are certain assumptions you've made based upon both their scale and the environment in which they're operating. How heavy of a lift
is it for them to wind up getting Chronosphere into their environments? Just because it seems
to me that it's not that hard to design an architecture on a whiteboard that can meet
almost any requirement. The messy part is figuring out how to get something that resembles that into place on a pre-existing extant architecture.
Yeah, I would say it's something we've spent a lot of time on.
The good thing for the industry overall, for the observability industry, is that open source standards have now been created and exist where they didn't before. So if you look at the APM-based view, it was all proprietary agents producing
the data themselves that would only really work with one vended product. Whereas if you look at
a modern environment, the production of the data has actually been shifted from the vendor down to
the companies themselves. And they'll be producing these pieces of data in open source standard
formats like OpenTelemetry for distributed traces, or perhaps Prometheus for metrics. So the good thing is that for all of your new environments, there's a standard way to produce all of this data, and you can send all that data to whichever vendor you want on the back end. So it just makes the implementation for the new environments so much easier. Now, for the legacy environments, or if a company is shifting over from an existing tool, there is actually a messy migration there, because often you're trying to replace proprietary formats and proprietary ways of producing data with open source standard ones.
So that's just something that we as Chronosphere come in and view as a particular problem that we need to solve. And we take the responsibility of solving it for a company, because what we're trying to sell companies is not just a tool; we're really trying to sell them a solution to the problem. And the problem is they need an observability solution end to end.
So this often involves us coming in and helping them, you can imagine, not just convert the data
types over, but also move over existing dashboards, existing alerts. There's a huge piece of lift at the end
that perhaps every developer in a company would have to do
if we didn't come in and do it on behalf of those companies.
So it's just an additional responsibility.
It's not an easy thing to do.
We've built some tooling that helps with it
and we just spend a lot of manual hours going through this,
but it's a necessary one in order to help a company transition.
Now, the good thing is once they have transitioned into the new way of doing things and they are dependent on open source standard formats, they are no longer locked in.
So, you know, you can imagine future transitions will be much easier.
However, the current one does have to go through a little bit of effort.
I think that's probably fair. And then
there's no such thing in my experience as an easy deployment for something that is large enough to
matter. And let's be clear, people are not going to be deploying something as large scale as
Chronosphere on a lark. This is going to be when they have a serious application with serious
observability challenges. So it feels like on some level that even doing a POC is a tricky proposition just due to the instrumentation part of it.
Something I've seen is that very often enterprise sales teams will decide that by the time that they can get someone to successfully pull off a POC,
at that point, the deal win rate is something like 95%, just because no one wants to try that in a bake-off with something else.
Yeah, I'd say that we do see high pilot conversion rates, to your point. For us, it's perhaps a
little bit easier than other solutions out there in the sense that I think with our type of
observability tooling, the good thing is an individual team could pick this up for their one
use case, and they could get value out of it.
It's not that every team across an environment
or every team in an organization needs to adopt it.
So while generally we do see that a company would want to pilot
and it's not something you can play around online with by yourself
because it does need a particular deployment,
it does need a little bit of setup,
generally one single team can come and perform that
and see value out of the tool.
And that sort of value can be extrapolated
and applied to all the other teams as well.
So you're correct, but it hasn't been a huge lift.
And these processes end-to-end,
we've seen be as short as perhaps 30-something days end-to-end,
which is generally a pretty fast
moving process there. I guess on some level, I'm still trying to wrap my head around the idea of
the scale that you operate at, just because as you mentioned, this came out of Uber, which is
beyond imagining for most people. And you take a look at a wide variety of different use cases.
And in my experience, it's never been, holy crap,
we have no observability and we need to fix that. It's there are a variety of systems in place that
just are not living up to the hopes, dreams, and potential that they had when they were originally
deployed, either due to growth or due to lack of product fit or the fact that it turns out in a
post zero interest rate world,
most people don't want to have a pipeline of 20 discrete observability tools.
Yep. Yep. A hundred percent. And to your point there, ultimately, that's our goal. And, you know, in many companies, we're replacing up to six to eight tools with a single platform. It's always great to do that, but it definitely doesn't happen overnight; it takes time. You know, you can imagine in a pilot, when you're looking at it, we're picking a few of the use cases to demonstrate what our tool could do across many other use cases. And then generally during the onboarding time, or perhaps over a period of months, or perhaps even a year plus, we then go onboard these use cases piece by piece.
So it's definitely not a quick overnight process there.
But you can imagine something that can help each end developer
in a particular company be more effective
and something that can really help move the bottom line
in terms of far better price efficiency.
These things are generally not things that are quick fixes.
These are generally things that do take some time and a little bit of investment to achieve
the results.
So a question I do have for you, given that I just watched an awful lot of people talking
about observability for three days at Monitorama, what are people not talking about?
What did you not see discussed that you think should be?
Yeah, one thing I think often
gets overlooked, and especially in today's climate, is I think observability gets relegated to a cost
center. It's something that every company must have, every company has today. And it's often
looked at as a tool that gives you insights about your infrastructure and your applications. And
it's a backend tool, something you have to have, something you have to pay for, and it doesn't really move the direct needle for the business
top line. And I think that's often something that companies don't talk about enough. And,
you know, from our experience at Uber and through most of the companies that we work with here at
Chronosphere, yes, there are infrastructure problems and application level problems that
we help companies solve. But
ultimately, the more mature organizations, at least when it comes to observability, are often starting to
get real-time insights into the business more than the application layer and the infrastructure layer.
And if you think about it, for companies that are cloud-native architected, there's not one single
endpoint or one single application that fulfills a single customer
request. So even if you could look at all the individual pieces, the actual work we have to do for customers in our products and services spans across so many of them that often you need
to introduce a new view, a view that's just focused on your customers, just focused on the business
and sort of apply the same type of techniques to your business as you do for your back-end infrastructure. Now, this isn't a replacement for your BI tools. You
still need those. But what we find is that BI tools are more used for longer term strategic
decisions, whereas you may need to do a lot of more tactical business operational
functions based on having a live
view of your business. So what we find is often observability is only ever thought about for
infrastructure. It's only ever thought about as a cost center. But ultimately, observability tooling can actually add a lot directly to your top line by giving you visibility into the products and services that make up that top line. And I would say the more mature organizations that we work with here at
Chronosphere all have their executives looking at, you know, monitoring dashboards to really
get a good sense of what's happening in their business in real time. So I think that's something
that hopefully a lot more companies evolve into over time and they really see the full
benefit of observability and what it can do
to a business's top line. I think that's probably a fair way of approaching it. It seems similar in
some respects to what I tend to see over in the cloud cost optimization space. People often want
to have something prescriptive of do this, do that, do the other thing. But it depends entirely on
what the needs of the business are internally. It depends upon the stories that they wind up working with. It depends really on what their constraints are, what their architectures are
doing. Very often it's a, let's look and figure out what's going on. And accidentally they discover
they can blow 40% off their spend by just deleting things that aren't in use anymore.
That becomes increasingly uncommon with scale, but it's still one of those questions of
what do we do here and how?
Yep. A hundred percent.
I really want to thank you for taking the time to speak with me today about what you're seeing.
If people want to learn more, where's the best place for them to find you?
Yeah, the best place is probably going to our website, chronosphere.io to find out more about the company. Or if you want to chat with me directly, LinkedIn is probably the best place
to come find me via my name. And we will, of course, put links to both of those things in
the show notes. Thank you so much for suffering the slings and arrows I was able to throw at you
today. Thank you for having me, Corey. Always a pleasure to speak with you and looking forward
to our next conversation. Likewise. Martin Mao, CEO and co-founder of Chronosphere,
this promoted guest episode has been brought to us by Chronosphere here on Screaming in the Cloud. And I'm cloud economist, Corey Quinn. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment that I will never notice because I have an observability gap.
If your AWS bill keeps rising and your blood pressure is doing the same, then you need
the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business,
and we get to the point.
Visit duckbillgroup.com to get started.