Screaming in the Cloud - Open Core, Real-Time Observability Born in the Cloud with Martin Mao
Episode Date: June 22, 2021
About Martin: Martin Mao is the co-founder and CEO of Chronosphere. He was previously at Uber, where he led the development and SRE teams that created and operated M3. Prior to that, he was a technical lead on the EC2 team at AWS and has also worked for Microsoft and Google. He and his family are based in our Seattle hub and he enjoys playing soccer and eating meat pies in his spare time.
Links:
Chronosphere: https://chronosphere.io/
Email: contact@chronosphere.io
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode is sponsored in part by Thinkst.
This is going to take a minute to explain, so bear with me.
I linked against an early version of their tool, canarytokens.org, in the very early days of my newsletter.
And what it does is relatively simple and straightforward. It winds up embedding credentials, files, that sort of thing in various parts of your environment, wherever you
want to. It gives you fake AWS API credentials, for example. And the only thing that these things do
is alert you whenever someone attempts to use those things. It's an awesome approach. I've used
something similar for years. Check them out. But wait, there's more.
They also have an enterprise option that you should be very much aware of. Canary.tools.
You can take a look at this, but what it does is it provides an enterprise approach to drive
these things throughout your entire environment. You can get a physical device that hangs out on
your network and impersonates whatever you want to. When it gets NMAP scanned, or someone attempts to log into it or access files on it, you get instant alerts. It's awesome.
If you don't do something like this, you're likely to find out that you've gotten breached the hard
way. Take a look at this. It's one of those few things that I look at and say, wow, that is an
amazing idea. I love it. That's canarytokens.org and canary.tools. The first one is free. The second
one is enterprise-y. Take a look. I'm a big fan of this. More from them in the coming weeks.
If your mean time to WTF for a security alert is more than a minute, it's time to look at
Lacework. Lacework will help you get your security act together for everything from compliance service configurations
to container app relationships, all without the need for PhDs in AWS to write the rules.
If you're building a secure business on AWS with compliance requirements, you don't really have
time to choose between antivirus or firewall companies to help you secure your stack. That's
why Lacework is built from the ground up for the cloud. Low effort, high visibility, and detection. To learn more,
visit lacework.com. Welcome to Screaming in the Cloud. I'm Corey Quinn. I've often talked about
observability, or as I tend to think of it when people aren't listening, hipster monitoring.
Today, we have a promoted episode from a company called Chronosphere,
and I'm joined today by Martin Mao, their CEO and co-founder. Martin, thank you for coming on the
show and suffering my slings and arrows. Thanks for having me on the show, Corey, and looking
forward to our conversation today. So before we dive into what you're doing now, I'm always a big
sucker for origin stories. Historically,
you worked at Microsoft and Google, but then you really sort of entered my sphere of things that I
find myself having to care about when I'm lying awake at night and the power goes out
by working on the EC2 team over at AWS. Tell me a little bit about that. You've hit the big three
cloud providers at this point. What was that like?
Yeah, it was an amazing experience. I was a technical lead on one of the EC2 teams. And I think when an opportunity like that comes up on such a core foundational project for the cloud,
you take it. So it was an amazing opportunity to be a part of leading that team at a fairly early stage of AWS
and also helping them create a brand new service from scratch,
which was AWS Systems Manager,
which was targeted at fleet-wide management of EC2 instances.
So I'm a tremendous fan of Systems Manager,
but I'm still looking for the person
who named Systems Manager Session Manager
because at this point,
I'm about to put a bounty out on them.
Wonderful service, terrible name. That was not me. So yes, but yeah, no, it was a great experience
for sure. And I think, you know, just seeing how AWS operated from the inside was an amazing
learning experience for me and being able to create sort of foundational pieces for the cloud
was also an amazing experience. So only good things to say about my time at AWS.
And then after that, you left and you went to Uber where you led development and SRE teams that
created and operated something called M3. Alternatively, I'm misreading your bio,
and you bought an M3 from BMW and went to drive for Uber. Which is it?
I wish it was the second one, but unfortunately it is the first one. So
yes, I did leave AWS and joined Uber in 2015 to lead a core part of their monitoring and eventually
larger observability team. And that team did go on to build open source projects such as M3,
which perhaps we should have thought about the name and the conflict with the car when we named
it at the time, and other projects such as Jaeger for distributed tracing as well, and a logging backend system too. So
I definitely spent many years there building out their observability stack.
We're going to tie a theme together here. You were at Microsoft, you were at Google,
you were at AWS, you were at Uber, and you look at all of this and decide, all right,
my entire career has been spent in large companies doing massive globally scaled things. I'm going to go build a small startup. What made you decide
that, all right, this is something I'm going to pursue? So definitely never part of the plan,
as you mentioned, a lot of big tech companies. And I think I always got a lot of joy building
large distributed systems, handling lots of load and solving problems at a really
grand scale. And I think the reason for doing a startup was really the situation that we were in. So at Uber, as I mentioned, myself and my co-founder led the core part of the observability team. Then we were lucky to happen to solve the problem not just for Uber, but for the broader community, and especially the community adopting cloud-native architecture.
And it just so happened that we were solving the problem for Uber in 2015,
but the rest of the industry sort of has similar problems today.
So it was almost the perfect opportunity to solve this now for a broader range of companies out there.
And we already had a lot of the core technology built and open source as well.
So it was more of an opportunity rather
than a long-term plan or anything of that sort, Corey. So before we dive into the intricacies of
what you've built, I always like to ask people this question because it turns out that the only
thing that everyone agrees on is that everyone else is wrong. What is the dividing line, if any,
between monitoring and observability?
That's a great question. And I don't know if there's an easy answer.
I mean, my cynical approach is that, well, if you call it monitoring, you don't get to bring in SRE-style salaries. Call it observability, and no one knows what the hell we're talking about, so sure, it's a blank check at that point. It's cynical and probably not entirely correct.
So I'm curious to get your take on it.
Yeah, for sure. So, you know, there's definitely a lot of overlap there and it's not really two
separate things. In my mind, at least monitoring, which has been around for a very long time,
has always been around notification and having visibility into your systems. And then as the
systems got more complex over time, being able to sort of understand that
and not just have visibility into it, but understand it a little bit more sort of required
perhaps additional new data types to go and solve those problems. And that's how, in my mind,
monitoring sort of morphed into observability. So perhaps one is a subset of the other and they're
not competing concepts there.
But at least that's my opinion.
I'm sure there are plenty out there that would perhaps disagree with that.
On some level, it almost gets at the adage that, at a certain point of scale with distributed systems, it's never a question of "is the app up or down?"
It's more a question of "how down is it?"
At least that's how it was explained to me at one point.
And it was someone who was incredibly convincing.
So I smiled, nodded, and never really thought to question it any deeper than that.
But I look back at the large-scale environments I've been in, and yeah, things are always
on fire on some level.
And ideally, there are ways to handle and mitigate that.
Past a certain point, the approach of small-scale systems stops working at large scale.
I mean, I see that over in the costing world, where people will put tools up on GitHub of,
hey, I ran this script, and it worked super well on my 10 instances.
And then you try and run the thing on 10,000 instances, and the thing melts into the floor,
hits rate limits left and right, because people don't think in terms of those scales.
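The "works on 10 instances, melts at 10,000" failure Corey describes usually comes down to unbounded fan-out against a rate-limited API. A hedged sketch of the standard fix, bounded concurrency plus jittered exponential backoff; the function and error names here are invented placeholders, not any real SDK:

```python
# Sketch: the difference between "works on 10 instances" and "works on
# 10,000" is usually bounded concurrency plus backoff on throttling.
# describe_instance is a stand-in for any rate-limited cloud API call.
import random
import time
from concurrent.futures import ThreadPoolExecutor

def with_backoff(call, max_tries=5):
    """Retry a call with jittered exponential backoff on throttle errors."""
    for attempt in range(max_tries):
        try:
            return call()
        except RuntimeError:                                 # stand-in throttle error
            time.sleep(min(2 ** attempt, 30) * random.random())  # jittered wait
    raise RuntimeError("still throttled after retries")

def describe_instance(instance_id: str) -> str:
    return f"{instance_id}: ok"                              # placeholder API call

ids = [f"i-{n:05d}" for n in range(10_000)]
# Cap the fan-out at 16 in-flight calls instead of firing 10,000 at once.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(lambda i: with_backoff(lambda: describe_instance(i)), ids))
```

The worker cap is the part most quick scripts skip: without it, every retry storm just re-hits the rate limit at full concurrency.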
So it seems like you're sort of going from the opposite end, where, well, this is how
we know things work at large scale.
Let's go ahead and build that out as an initially smaller team.
Because I'm going to assume, not knowing much about Chronosphere yet, that it's the sort of thing that will help a
company before they get to the hyperscaler stage. 100%. And you're spot on there, Corey. And it's
not even just a company going from small stage, small scale, simple systems to more complicated
ones. Actually, if you think about this shift in the cloud right now, it's really going from cloud to cloud native, right? So going
from VMs to containers on the infrastructure tier, and going from monoliths to microservices. So
it's not even the growth of the company necessarily, or the growth of the load that
the system has to handle. But this sort of this shift to containers and microservices
heavily accelerates the growth of the amount of data that gets produced.
And that is causing a lot of these problems.
So Uber was famous for disrupting effectively the taxi market.
What made you folks decide, I know we're going to reinvent observability slash monitoring while we're at it too?
What was it about existing
approaches that fell down and I guess necessitated you folks to build your own? Yeah, great question,
Corey. And actually it goes to the first part. We were disrupting the taxi industry. And I think
the ability for Uber to iterate extremely fast and respond as a business to changing market
conditions was key to that disruption. So
monitoring and observability was a key part of that because you can imagine it was providing
all of the real-time visibility to not only what was happening in our infrastructure and
applications, but the business as well. So it really came out of a necessity more than anything
else. We found that in order to be more competitive, we had to adopt what is probably today known as
cloud-native architecture, adopt running on containers and microservices so that we can
move faster. And along with that, we found that all of the existing monitoring tools we were using
weren't really built for this type of environment. And it was that that was the forcing function
for us to create our own technologies that were really purpose-built for this modern type of environment that gave us the visibility we needed to be competitive as a company and a business.
So talk to me a little bit more about what observability is.
I hear people talking about it, to be frank, in a bunch of ways so that they're trying to, I guess, appropriate the term to cover what they already are doing or selling because changing vocabulary is easier than changing an entire product philosophy.
What is it?
Yeah, we actually had a very similar view on observability.
And originally, you know, we thought that it is a combination of metrics, logs, and traces. And that's a very common view: you have the three pillars. It's almost like three checkboxes; you tick them off and you have, quote-unquote, observability. And that's actually how we looked at the problem at Uber, and we built solutions for each one of those, and we checked all three boxes. What we've come to realize since then is perhaps that was not the best way to look at it, because we had all three.
But what we realized is that actually just having all three doesn't really help you with the ultimate goal of what you want from this platform.
And having more of each of the types of data didn't really help us with that either.
So, you know, taking a step back from there, when we really looked at it, the lesson that we learned in our view on observability is really more from an end-user perspective, rather than a data type or data input perspective. And really, from an end-user perspective, if you think about why you want to use your monitoring tool or your observability tool, you really want to be notified of issues and remediate them as quickly as possible. And to do that, it really just comes down to answering three questions.
Can I get notified when something is wrong?
Yes or no?
Do I even know something is wrong?
The second question is, can I triage it quickly to know what the impact is?
Do I know if it's impacting all of my customers or just the subset of them?
And how bad is the issue?
Can I go back to sleep if I'm being paged at two o'clock in the morning?
And the third one is, can I figure out the underlying root cause of the problem and go and actually fix it? So this is how we think about the problem now: from the end-user perspective. And it's not that you don't need metrics, logs, or distributed traces to solve the problem, but we are now orienting our solution around solving the problem for the end user, as opposed to just orienting our solution around the three data types per se.
I'm going to self-admit to a fun billing experience I had once
with a different monitoring vendor whom I will not name, because it turns out you can tell stories,
you can name names, but doing both gets you in trouble. It was a more traditional approach in a
simpler time. And they wound up
sending me a message saying, oh, we're hitting rate limits on CloudWatch. Go ahead and open a
ticket asking for them to raise it. And in a rare display of foresight, AWS responded to my ticket
with a, we can do this, but understand at this level of concurrency, it will cost something like $90,000 a month on increased charges with that frequency for that many metrics.
And that was roughly twice what our AWS bill was in those days.
So I'm curious as to how you can offer predictable pricing when you can have things that emit so much data so quickly.
I believe you when you say you can do it. I'm
just trying to understand the philosophy of how that works. As I said earlier, we started to
approach this by trying to solve it in a very engineering fashion where we just wanted to create
more efficient backend technology so that it would be cheaper for the increased amount of data.
What we realized over time is that no matter
how much cheaper we make it, the amount of data being produced, especially from monitoring and
observability, kept increasing, not even in a linear fashion, but in an exponential fashion.
And because of that, it really changed the focus of the problem from how efficiently we can store this data to how users are using this data, and whether they even understand the data that's being produced. So in addition to the couple of properties I mentioned earlier, around cost accounting and rate limiting, those are definitely required, the other thing we try to make available for our end users is introspection tooling, so that they understand the type of data that's being produced. It's actually very easy in the monitoring and observability world to write a single line of code that produces a lot of data, and most developers don't understand that that single line of code produces so much data. So our approach to this is to provide a tool so that developers can introspect and understand what is produced on the back-end side, not what is being inputted from their code. And then not only have an understanding of that, but also dynamic ways to deal with it, so that, you know, again, when they hit the rate limit, they don't just have to monitor less. They understand that, oh, I inserted this particular label, and now I have 20 times the amount of data that I needed before; do I really need that particular label in there? And if not, perhaps dropping it dynamically on the server side is a much better way of dealing with that problem than having to roll back your code and change your metric instrumentation. So for us, you know, the way to deal with it is not just to make the back end even more efficient, but really to have end users understand the data that they're producing, and make decisions on which parts of it are really useful and which parts of it they perhaps don't want, or perhaps want to retain for shorter periods of time, for example, and then allow them to actually implement those changes on that data on the back end. And that is really how the end users control the bills and the cost themselves.
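The "single line of code that produces a lot of data" problem Martin describes is label cardinality: in a Prometheus-style system, every distinct combination of label values on a metric becomes its own time series on the back end. A toy model of the effect (plain Python with invented label sets, not Chronosphere's accounting):

```python
# Toy model of Prometheus-style series creation: each distinct
# (metric name, label values) combination is a separate time series.
from itertools import product

def series_count(metric: str, label_values: dict) -> int:
    """Number of distinct series one instrumented metric can produce."""
    return sum(1 for _ in product(*label_values.values()))

# One counter with two low-cardinality labels: 3 * 4 = 12 series.
base = series_count("http_requests_total",
                    {"method": ["GET", "POST", "PUT"],
                     "status": ["200", "400", "404", "500"]})

# The same counter after one extra 20-value label (say, a per-endpoint
# dimension someone added in a single line of code): 12 * 20 = 240 series.
grown = series_count("http_requests_total",
                     {"method": ["GET", "POST", "PUT"],
                      "status": ["200", "400", "404", "500"],
                      "endpoint": [f"/api/v1/resource/{i}" for i in range(20)]})

print(base, grown, grown // base)
```

Because the multiplier lives in the label values, dropping or aggregating away that one label on the server side undoes the 20x without touching the instrumentation code, which is the dynamic-drop approach described above.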
So there are a number of different companies in the observability space that have different
approaches to what they solve for. In some cases, to be very honest, it seems like, well,
I have 15 different observability and monitoring tools. Which ones do you replace? And the answer
is, oh, we're number 16. And it's easy to be cynical and
down on that entire approach, but then you start digging into it and they're actually right. I
didn't expect that to be the case. What was your perspective that made you look around the, let's
be honest, fairly crowded landscape of observability companies, tools that gave insight into the health
status and well-being
of various applications in different ways and say, you know, no one's quite gotten this right yet.
I have a better idea. Yeah, you're completely correct. And perhaps the previous environments
that everybody was operating in, there were a lot of different tools for different purposes, right?
A company would purchase an infrastructure monitoring tool, perhaps even a network monitoring tool, and then they would have perhaps an APM solution for the applications, and then perhaps some BI tools for the business. So there was always, historically, a collection of different tools to go and solve this problem. And I think, again, what has really happened with this recent shift to cloud native is that the need for a lot of this data to be in a single tool has become more important than ever. So if you think about your microservices running on a single container today: if a single container dies in isolation, without knowing perhaps which microservice was running on it, it doesn't mean very much, and just having that visibility is not going to be enough. Just like if you don't know which business use case that microservice was serving, that's not going to be very useful for you either. So with cloud-native architecture, there is more of a need to have all of this data and visibility in a single tool, which hasn't historically happened. And also, none of the existing tools today were built for it: if you think about both the existing APM solutions out there and the existing hosted solutions that exist in the world today, none of them were really built for a cloud-native environment, because you can think about even the timing that these companies were created at. You know, back in the early 2010s, Kubernetes and containers weren't really a thing.
So a lot of these tools weren't really built for the modern architecture that we see most
companies shifting towards.
So the opportunity was really to build something for where we think the industry and everyone's
technology stack was going to be as opposed to where the technology stack has been in the past
before. And that was really the opportunity there. And it just so happened that we had built a lot of
these solutions for a similar type of environment for Uber many years before. So
leveraging a lot of our lessons learned there put us in a good spot to build a new solution
that we believe is fairly different from everything else that exists today in the market.
And it's going to be a good fit for companies moving forward.
So on your website, one of the things that you, I assume, put up there just to pick a fight,
because if there's one thing these people love,
it's fighting, is a use case is outgrowing Prometheus. The entire story behind Prometheus
is, oh, it scales forever. It's what the hyperscalers would use. This came out of the
way that Google does things. And everyone talks about Google as if it's this mythical Valhalla
place where everything is amazing and nothing ever goes wrong. I've seen the conference docs and that's great. What does outgrowing Prometheus look like? Yeah, it's a great question, Corey.
So if you look at Prometheus and, you know, it is the graduated and the recommended monitoring tool
for cloud-native environments. If you look at it and the way it scales, actually, it's a single-binary solution, which is great because it's really easy to get started: you deploy a single instance, and you have sort of ingestion, storage, visibility, dashboarding, and alerting all packaged together into one solution. That's definitely great, and it can scale sort of by itself to a certain point, and it is definitely the recommended starting point. But as you really start to grow your business, increase your cluster sizes, and increase the number of applications you have, it actually isn't a great fit for horizontal scale. There isn't really high availability or horizontal scale built into Prometheus by default. And that's why other projects in the CNCF, such as Cortex and
Thanos, were created to solve some of these problems. So we sort of looked at the problem
in a similar fashion. And when we created M3, the open source metrics platform that came out of Uber,
it was also sort of approaching it from this different perspective where we built it to be
horizontally scalable and highly reliable from the beginning. But yet we don't really want it to be a
competing project with Prometheus. So it is actually something that works in tandem with Prometheus in the sense that it can ingest Prometheus metrics
and you can issue Prometheus query language queries against it and it will fulfill those.
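For anyone wanting to reproduce the tandem setup Martin describes, the usual wiring is Prometheus's remote-write feature pointed at an M3 Coordinator. A minimal sketch of a prometheus.yml fragment; the hostname is a placeholder, and the port and paths are M3's documented defaults, so check them against your own deployment:

```yaml
# prometheus.yml (fragment): Prometheus keeps scraping as usual, but
# forwards every sample to M3 for horizontally scalable long-term storage.
remote_write:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/write"

# Optional: let Prometheus read back the data that M3 stores.
remote_read:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/read"
```

PromQL queries can also be issued directly against the M3 Coordinator's query endpoints, which is what fulfilling "Prometheus query language queries" refers to here.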
But it is really built for a more scalable environment. And I would say that once a company starts to grow, they run into some of these pain points. And these pain points are surrounding how reliable a Prometheus instance is, and how you can scale it up beyond just, you know, giving it more resources on the VM that it runs on; vertical scale, you know, runs out at a certain point. Those are some of the pain points that, you know, a lot of companies do run into and need to solve eventually. And there are various solutions out there, both in the open source and in the commercial world, that are designed to solve those pain points: M3 being one of the open source ones, and of course, Chronosphere being one of the commercial ones.
This episode is sponsored in part by Salesforce. Salesforce
invites you to Salesforce and AWS: What's Ahead, for architects, admins, and developers, on June 24th at 10 a.m. Pacific Time. It's a virtual event where you'll get a first look
at the latest innovations from the Salesforce
and AWS partnership
and have an opportunity to have your questions answered.
Plus, you'll get to enjoy an exclusive performance
from Grammy Award-winning artist The Roots.
I think they're talking about a band,
not people with super user access to a system.
Registration is free at salesforce.com
slash what's ahead. Now, you've also gone ahead and more or less dangled raw meat in front of a
tiger in some respects here, because one of the things that you wind up saying on your site of
why people would go with Chronosphere is, ah, this doesn't allow for bill-spike overages as far as
what the Chronosphere bill is.
And that's awesome. I love predictable pricing. It's sort of the antithesis of cloud bills.
But there is the counter argument too, which is with many approaches to monitoring,
I don't actually care what my monitoring vendor is going to charge me because they wind up
costing me five times more just in terms of CloudWatch charges. How does your billing work? And how do
you avoid causing problems for me on the AWS side or other cloud provider? I mean, again,
GCP and Azure are not immune from this. So if you look at the built-in solutions by the cloud
providers, a lot of those metrics and monitoring you get from CloudWatch or Stackdriver,
a lot of it you get sort of included for free with your AWS bill already.
It's only if you want additional data and additional retention
do you choose to pay more there.
So I think a lot of companies do use those solutions
for the default set of monitoring that they want,
especially for the AWS services.
But generally, a lot of companies have sort of custom monitoring requirements
outside of that in the application tier,
or even more detailed monitoring
in the infrastructure that is required,
especially if you think about Kubernetes.
Oh, yeah.
And then I see people using CloudWatch
as basically a monitoring or metric or log router,
which at its price point, don't do that.
It doesn't end well for anyone involved.
100%. So our solution and our approach is a little bit different. So it doesn't actually
go through CloudWatch or any of these other inbuilt cloud-hosted solutions as a router,
because to your point, there's a lot of costs there as well. It actually goes and collects
the data from the infrastructure tier or the applications. And what we have found is that
not only does the bill for monitoring climb exponentially, not just as you grow, but especially as you shift towards cloud-native architecture. Our very first take on solving that problem was to make the back end a lot more efficient than before, so it just is cheaper overall. And we approached it that way at Uber, and we had great results there. When we created M3, originally, before M3, 8% of Uber's infrastructure bill was spent on monitoring all that infrastructure and the applications. And by the time we were done with M3, the cost was a little over 1%. So the very first solution was just to make it more efficient. And that worked for a
while. But what we saw is that over time, this grew again, and there wasn't any more efficiency
we can crank out of the backend storage system.
There's only so much optimization you can do to the compression algorithms in the backend
and how much you can get there.
So what we realized the problem shifted towards was not, can we store this data more efficiently?
Because we're already reaching sort of limitations there.
And what we noticed is more towards getting the users of this data,
so individual developers themselves,
to start to understand what data is being produced,
how they're using it, whether it's even useful,
and then taking control from that perspective.
And this is not a problem isolated to the SRE team
or the observability team anymore.
If you think about modern DevOps practices,
every developer needs to take control of monitoring their own applications, right? So
this responsibility is really in the hands of the developers. And the way we approach this
from a Chronosphere perspective is really in four steps. The first one is that we have cost
accounting so that every developer and every team and the central observability team know
how much data
is being produced because it's actually a hard thing to measure, especially in the monitoring
world.
Oh yeah, even AWS bills get this wrong.
If you're sending data between one availability zone to another in the same region, it charges
a penny to leave an AZ and a penny to enter an AZ in that scenario.
And the way that they reflect this on the bill is they double it.
So if you're sending one gigabyte across an AZ link in a month,
you'll see two gigabytes on the bill, and that's how it's reflected.
And that is just a glimpse of the monstrosity that is the AWS billing system.
But yeah, exposing that to folks so they can understand how much data their application's spitting off,
forget it. That never happens.
Right, right. And it's not even exposing it to the company as a whole. It's to each use case,
right, to each developer. So they know how much data they are producing themselves.
They know how much of the bill is being consumed. And then the second step in that is to put up
bumper lanes to that so that, you know, once you hit the limit, you don't just get a surprise bill
at the end of the month. When each developer hits that limit, they rate limit themselves and they only impact
their own data.
There's no impact to the other developers or to the other teams or to the rest of the
company.
So we found that those two were necessary initial steps.
And then there were additional steps beyond that to help deal with this problem.
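The two mechanisms described here, per-team cost accounting plus per-team bumper lanes, can be sketched in a few lines. This is a hypothetical illustration of the idea, not Chronosphere's actual implementation, and the team names and quota numbers are invented:

```python
# Sketch: attribute incoming samples to a team, and rate-limit only the
# team that exceeds its quota. Hypothetical illustration, not real code.
from collections import defaultdict

class TeamQuotas:
    def __init__(self, quotas: dict):
        self.quotas = quotas          # team -> allowed samples per interval
        self.used = defaultdict(int)  # running cost accounting per team

    def ingest(self, team: str, samples: int) -> bool:
        """Accept the batch if the team is under quota; otherwise drop
        only this team's data. Other teams are unaffected."""
        if self.used[team] + samples > self.quotas.get(team, 0):
            return False              # rate-limited: no surprise bill
        self.used[team] += samples
        return True

q = TeamQuotas({"payments": 1000, "search": 100})
assert q.ingest("payments", 800)      # under quota: accepted
assert not q.ingest("search", 150)    # over quota: only 'search' is limited
assert q.ingest("payments", 150)      # 'payments' is unaffected
```

The key property is isolation: the `used` accounting is per team, so one team blowing through its limit never degrades anyone else's data or inflates the shared bill.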
So in order for this to work without a multi-day lag, in some cases, it's a near certainty
that you're looking at what is happening
and what expense is being incurred in real time,
not waiting for it to pass its way
through the AWS billing system
and then do some tag attribution back.
A hundred percent.
It's in real time for the stream of data.
And as I mentioned earlier,
for the monitoring data we are collecting,
it goes straight from the customer environment to our backend. So we're not waiting for it to
be routed through the cloud providers because rightly so, there is a multi-day or multi-hour
delay there. So as the data is coming straight to our backend, we are actively in real time
measuring that and cost accounting it to each individual team. And in real time, if the usage goes above what is allocated,
we'll actually limit that particular team or that particular developer
and prevent them by default from using more.
And with that mechanism, you can imagine, that's how the bill is controlled,
and controlled in real time.
So help me understand on some level, is your architecture then agent-based?
Is it a library that gets included in the application code itself, all of the above
and more, something else entirely?
Or is this just such a ridiculous question that you can't believe that no one has ever
asked it before?
No, it's a great question, Corey, and we'd love to give some more insight there.
So it is an agent that runs in the customer environment because it does need to be something
there that goes and collects all the data we're interested in to send it to the back end. This agent is unlike
a lot of APM agents out there that do sort of introspection, things like that. We really
believe in the power of the open source community and in particular open source standards like the
Prometheus format for metrics. So what this agent does, it actually goes and discovers Prometheus endpoints
exposed by the infrastructure and applications
and scrapes those endpoints
to collect the monitoring data to send to the backend.
And that is the only piece of software
that runs in our customer environments.
And then from that point on,
all of the data is in our backend
and that's where we go and process it
and give visibility into the end users
as well as store it and make it available for alerting and dashboarding purposes as well.
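The scrape model described here pulls the Prometheus text exposition format over HTTP. To illustrate what such a payload looks like, here is a minimal parser for the simple cases; a real collector such as the Prometheus server or a vendor agent also handles histograms, timestamps, and label-value escaping:

```python
# Minimal parser for simple cases of the Prometheus text exposition
# format, the kind of payload a /metrics endpoint returns when scraped.
def parse_metrics(payload: str) -> dict:
    samples = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):       # skip HELP/TYPE comments
            continue
        name_and_labels, _, value = line.rpartition(" ")
        samples[name_and_labels] = float(value)
    return samples

# Example payload, as an application's /metrics endpoint might expose it.
scrape = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="500"} 3
process_cpu_seconds_total 12.5
"""

metrics = parse_metrics(scrape)
print(metrics['http_requests_total{method="GET",status="200"}'])  # 1027.0
```

Because the format is an open standard, anything that exposes such an endpoint, an application, a node exporter, a Kubernetes component, can be scraped by any compliant collector without a proprietary in-process agent.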
So when did you found Chronosphere? I know that you folks recently raised a series B,
congratulations on that, by the way, that generally means at least if I understand the VC world
correctly, that you've established product market fit. And now we're talking about,
let's scale this thing. My experience in startup land was, oh, we've raised a series B. That means it's
probably time to bring in the first DevOps hire. And that was invariably me. And I wound up
screaming and freaking out for three months and then things were better. So that was my exposure
to series B. But it seems like given what you do, you probably had a few SRE folks kicking around,
even on the product team.
Because everything you're saying so far absolutely resonates with the experience of someone who has run these large-scale things in production.
No big surprise there.
Is that where you are?
I mean, how long have you been around?
Yeah, so we've been around for a couple of years thus far.
So still a relatively new company for sure.
A lot of the core team were the team that both built the underlying technology and also
ran it in production for many years at Uber.
And that team is now here at Chronosphere.
So you can imagine from very beginning, we had DevOps and SREs running this hosted platform
for us.
And it's the folks that actually built the technology and ran it for years, running it again outside of Uber now. And then to your first question, yes, we did
establish product-market fit fairly early on. And I think that is also because we could leverage a lot of the
technology that we had built at Uber. And it sort of gave us a boost to have a product ready for the
market much faster. And what we're seeing in the industry right now is, you know, the adoption of cloud native is so fast that it's sort of accelerating the need
for a new monitoring solution, as, you know, historical solutions perhaps cannot handle a lot
of the use cases there. It's a new architecture, it's a new technology stack, and we have the
solution purpose built for that particular stack. So, you know, we are seeing a fairly
fast acceleration and adoption of our product right now.
One problem that an awful lot of monitoring slash observability companies have gotten into
in the last few years, at least it feels this way, and maybe I'm wildly incorrect,
is that it seems that the target market is the Ubers of the world, the hyperscalers,
where once you're at that scale, then you need a tool like this. But if you're just building a standard three-tier web
app, oh, you're nowhere near that level of scale. And the problem with go-to-market in those stories
inherently seems to be that by the time you are a hyperscaler, you have already built a somewhat
significant observability apparatus. Otherwise, you would not have survived or stayed up long enough to become a hyperscaler. How do you find that the on-ramp
looks? I mean, your website does talk about when you outgrow Prometheus, is there a certain point
of scale that customers should be at before they start looking at things like Chronosphere?
I think if you think about the companies that are born in the cloud today and how quickly they are running and they are iterating their technology stack, monitoring is so critical to that, right?
The real-time visibility into these changes that are going out multiple times a day is critical to the success and the growth of a lot of new companies. And because of how critical that piece is, we're finding that you
don't have to be a giant hyperscaler like Uber to need technology like this. And as you rightly
pointed out, you sort of need technology like this as you scale up. And what we're finding is that,
while a lot of large tech companies can invest a lot of resources into hiring these teams and
building out custom software themselves, generally
it's not a great investment on their behalf because those are not companies that are selling
monitoring technology as their core business. So generally what we find is that it is better for
companies to perhaps outsource or purchase or at least use open source solutions to solve some of
these problems rather than custom build in-house. And we're finding that earlier and earlier on in a company's life cycle, they're needing
technology like this.
Part of the problem I always ran into was, again, I come from the old world of grumpy
Unix sysadmins.
For me, using Nagios was my approach to monitoring.
And that's great when you have a persistent, stateful single node or a couple of single nodes, and then you outgrow it because, well, now everything's ephemeral. And by the time you realize that there's an outage or an issue with a container, the container hasn't existed for 20 minutes. And you better have good telemetry into what's going on and how your application behaves, especially at scale, because at that point, edge cases, one in a million events happen multiple times
a second, depending upon scale.
And that's a different way of thinking.
I've been somewhat fortunate in that, in my experience at least, I've not usually had
to go through those transformative leaps.
I've worked with Prometheus, I've worked with Nagios, but never in the same shop.
That's the joy of being a consultant.
You go into one environment, you see what they're doing, and you take notes on what works and what doesn't.
You move on to the next one.
And it's clear that there's a definite defined benefit
to approaching observability in a more modern way.
But I despair at the idea of trying to go from one to the other.
And maybe that just speaks to a lack of vision for me.
No, I don't think that's the case at all, Corey.
I think we are seeing a lot of companies do this transition. I don't think a lot of companies go and ditch everything that
they've done and things that they put years of investment into. There's definitely a gradual
migration process here. And what we're seeing is that a lot of the newer projects, newer environments,
newer efforts that have been kicked off are being monitored and observed using modern technology
like Prometheus. And then there are also a lot of legacy systems and legacy processes, which are still going to be around for a very long time. It's actually
something we had to deal with at Uber as well. We were actually using Nagios and a StatsD/Graphite
stack for a very long time before switching over to a more modern tag-based system like Prometheus.
So modern Nagios, what was it?
Icinga.
That's what it was.
Yes.
Yes.
It was actually the system that we were using at Uber.
So and I think, you know, for us, it's not just about ditching all of that investment.
It's really about supporting this migration as well.
And this is why, in the open source technology M3,
we actually support both the more legacy data types, like StatsD and the Graphite query language,
as well as the more modern types, like Prometheus and PromQL. And having support for both allows
for a migration and a transition, and not even a complete transition. I'm sure there will always be
StatsD and Graphite data in a lot of these companies, because they're just legacy applications that nobody owns or touches anymore.
And they're just going to be lying around for a long time.
So it's actually something that we proactively get ahead of and ensure that we can support
both use cases, even though we see a lot of companies trending towards the modern technology
solutions for sure.
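One way to picture that dual StatsD/Graphite and Prometheus support is the mapping from legacy dotted metric names to tagged series, which is what makes a tag-based query language usable over the old data. The positional label names below (path0, path1, and so on) are an illustrative convention for this sketch, not M3's actual internal mapping:

```python
# Hedged sketch: bridging legacy Graphite dotted names to tagged,
# Prometheus-style label sets. Label naming here is illustrative only.
def graphite_to_tags(dotted_name):
    """Split 'servers.web01.cpu.load' into positional path labels."""
    parts = dotted_name.split(".")
    return {f"path{i}": part for i, part in enumerate(parts)}


print(graphite_to_tags("servers.web01.cpu.load"))
# {'path0': 'servers', 'path1': 'web01', 'path2': 'cpu', 'path3': 'load'}
```

With a mapping like this, legacy dotted series and native tagged series can live in one store, so teams migrate query by query instead of all at once.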
The last point I want to raise has always been a personal, I guess, area of focus for me. I allude
to it sometimes. I've done a Twitter thread or two on it. But on your website, you say something
that completely resonates with my entire philosophy. And to be blunt, is why in many cases,
I'm down on an awful lot of vendor tooling across a wide variety of disciplines. On the open source
page on your site, near the bottom, you say,
and I quote, we want our end users to build transferable skills that are not vendor or
product specific. And I don't think I've ever seen a vendor come out and say something like that.
Where did that come from? Yeah, if you look at the core of the company, it is built on top of open source technology, right? So it is a very open core company here at Chronosphere. And we really believe in the power of the open source community. And in particular, perhaps not even individual projects, but industry standards and open standards. why we don't have a proprietary protocol or proprietary agent or,
you know,
proprietary query language in our product,
because we truly believe in allowing our end users to build these transferable
skills and industry standard skills,
right?
And right now that is using Prometheus as the client library for monitoring and
PromQL as the query language.
And I think it's not just a transferable skill that you can bring with you across multiple companies. It's also the power of that broader community. So you can
imagine now that there is a lot more sharing of, "Hey, I am monitoring, for example, MongoDB. How should
I best do that?" Those skills can be shared, because the common language that they're all speaking,
the queries that everybody is sharing with each other, the dashboards everybody's sharing with each other, are all sort of open source standards now.
And we really believe in the power of that. We really do everything we can to promote that.
And that is why, in our product, there isn't any proprietary query language or definitions of
dashboarding or alerting or anything like that. So yeah, it is definitely just a core tenet of
the company, I would say. It's really something that I think is admirable. I've known too many people who wind up,
I guess, stuck in various environments where the thing that they work on is an internal application
to the company and nothing else like it exists anywhere else. So if they ever want to change
jobs, they effectively have a black hole on their resume for a number of years. This speaks directly
to the opposite.
It seems like it's not built on a lock-in story.
It's built around actually solving problems.
And I'm a little ashamed to say how refreshing that is,
just based upon what that says about our industry.
Yeah, Corey.
And I think what we're seeing is actually the power of these open source standards.
Prometheus, let's say, is actually having
effects on the broader industry, which I think is great for everybody. So, you know, while a company
like Chronosphere has been supporting these from day one, you see how pervasive the Prometheus protocol
and the query language are. Actually, all of these probably more traditional vendors providing
proprietary protocols and proprietary query languages all actually have to have Prometheus, well, not have to have, but we're seeing that
more and more of them are having Prometheus compatibility as well. And I think that just
speaks to the power of the industry and it really benefits all of the end users in the industry as a
whole, as opposed to the vendors, which, you know, we are really happy to be supporters of.
Thank you so much for taking the time to speak with me today. If people want to learn more about
what you're up to, how you're thinking about these things, where can they find you? And I'm
going to go out on a limb and assume you're also hiring. We are definitely hiring right now. And
you can find us on our website at chronosphere.io, or feel free to shoot me an email directly. My email is martin
at chronosphere.io. Definitely massively hiring right now. And also if you do have problems trying
to monitor your cloud native environment, please come check out our website and our product.
And we will of course include links to that in the show notes. Thank you so much for taking the
time to speak with me today. I really appreciate it. Thanks a lot for having me, Corey. I really enjoyed this.
Martin Mao, CEO and co-founder of Chronosphere. I'm cloud economist Corey Quinn, and this is
Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on
your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your
podcast platform of choice, along with an insulting comment speculating about how long it took to
convince Martin not to name the company Observability Manager Chronosphere Manager.
If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill
Group works for you, not AWS. We tailor recommendations to your business and we get
to the point. Visit duckbillgroup.com to get started.
This has been a HumblePod production.
Stay humble.