Screaming in the Cloud - Open Core, Real-Time Observability Born in the Cloud with Martin Mao
Episode Date: June 22, 2021
About Martin: Martin Mao is the co-founder and CEO of Chronosphere. He was previously at Uber, where he led the development and SRE teams that created and operated M3. Prior to that, he was a technical lead on the EC2 team at AWS and has also worked for Microsoft and Google. He and his family are based in our Seattle hub and he enjoys playing soccer and eating meat pies in his spare time.
Links:
Chronosphere: https://chronosphere.io/
Email: contact@chronosphere.io
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode is sponsored in part by Thinkst.
This is going to take a minute to explain, so bear with me.
I linked against an early version of their tool, canarytokens.org, in the very early days of my newsletter.
And what it does is relatively simple and straightforward. It winds up embedding credentials, files, that sort of thing in various parts of your environment, wherever you
want to. It gives you fake AWS API credentials, for example. And the only thing that these things do
is alert you whenever someone attempts to use those things. It's an awesome approach. I've used
something similar for years. Check them out. But wait, there's more.
They also have an enterprise option that you should be very much aware of. Canary.tools.
You can take a look at this, but what it does is it provides an enterprise approach to drive
these things throughout your entire environment. You can get a physical device that hangs out on
your network and impersonates whatever you want to. When it gets NMAP scanned, or someone attempts to log into it or access files on it, you get instant alerts. It's awesome.
If you don't do something like this, you're likely to find out that you've gotten breached the hard
way. Take a look at this. It's one of those few things that I look at and say, wow, that is an
amazing idea. I love it. That's canarytokens.org and canary.tools. The first one is free. The second
one is enterprise-y. Take a look. I'm a big fan of this. More from them in the coming weeks.
If your mean time to WTF for a security alert is more than a minute, it's time to look at
Lacework. Lacework will help you get your security act together for everything from compliance service configurations
to container app relationships, all without the need for PhDs in AWS to write the rules.
If you're building a secure business on AWS with compliance requirements, you don't really have
time to choose between antivirus or firewall companies to help you secure your stack. That's
why Lacework is built from the ground up for the cloud. Low effort, high visibility, and detection. To learn more,
visit lacework.com. Welcome to Screaming in the Cloud. I'm Corey Quinn. I've often talked about
observability, or as I tend to think of it when people aren't listening, hipster monitoring.
Today, we have a promoted episode from a company called Chronosphere,
and I'm joined today by Martin Mao, their CEO and co-founder. Martin, thank you for coming on the
show and suffering my slings and arrows. Thanks for having me on the show, Corey, and looking
forward to our conversation today. So before we dive into what you're doing now, I'm always a big
sucker for origin stories. Historically,
you worked at Microsoft and Google, but then you really sort of entered my sphere of things that I
find myself having to care about when I'm lying awake at night and the power goes out
by working on the EC2 team over at AWS. Tell me a little bit about that. You've hit the big three
cloud providers at this point. What was that like?
Yeah, it was an amazing experience. I was a technical lead on one of the EC2 teams. And I think when an opportunity like that comes up on such a core foundational project for the cloud,
you take it. So it was an amazing opportunity to be a part of leading that team at a fairly early stage of AWS
and also helping them create a brand new service from scratch,
which was AWS Systems Manager,
which was targeted at fleet-wide management of EC2 instances.
So I'm a tremendous fan of Systems Manager,
but I'm still looking for the person
who named Systems Manager Session Manager
because at this point,
I'm about to put a bounty out on them.
Wonderful service, terrible name. That was not me. So yes, but yeah, no, it was a great experience
for sure. And I think, you know, just seeing how AWS operated from the inside was an amazing
learning experience for me and being able to create sort of foundational pieces for the cloud
was also an amazing experience. So only good things to say about my time at AWS.
And then after that, you left and you went to Uber where you led development and SRE teams that
created and operated something called M3. Alternatively, I'm misreading your bio,
and you bought an M3 from BMW and went to drive for Uber. Which is it?
I wish it was the second one, but unfortunately it is the first one. So
yes, I did leave AWS and joined Uber in 2015 to lead a core part of their monitoring and eventually
larger observability team. And that team did go on to build open source projects such as M3,
which perhaps we should have thought about the name and the conflict with the car when we named
it at the time, and other projects such as Jaeger for distributed tracing as well, and a logging backend system too. So
I definitely spent many years there building out their observability stack.
We're going to tie a theme together here. You were at Microsoft, you were at Google,
you were at AWS, you were at Uber, and you look at all of this and decide, all right,
my entire career has been spent in large companies doing massive globally scaled things. I'm going to go build a small startup. What made you decide
that, all right, this is something I'm going to pursue? So definitely never part of the plan,
as you mentioned, a lot of big tech companies. And I think I always got a lot of joy building
large distributed systems, handling lots of load and solving problems at a really
grand scale. And I think the reason for doing a startup was really the situation that we were in. So at Uber, as I mentioned, myself and my co-founder led the core part of the observability team. Then we were lucky to happen to solve the problem not just for Uber, but for the broader community, and especially the community adopting cloud-native architecture.
And it just so happened that we were solving the problem for Uber in 2015,
but the rest of the industry sort of has similar problems today.
So it was almost the perfect opportunity to solve this now for a broader range of companies out there.
And we already had a lot of the core technology built and open source as well.
So it was more of an opportunity rather
than a long-term plan or anything of that sort, Corey. So before we dive into the intricacies of
what you've built, I always like to ask people this question because it turns out that the only
thing that everyone agrees on is that everyone else is wrong. What is the dividing line, if any,
between monitoring and observability?
That's a great question. And I don't know if there's an easy answer.
I mean, my cynical approach is that, well, if you call it monitoring, you don't get to bring in SRE-style salaries. Call it observability, and no one knows what the hell we're talking about, so sure, it's a blank check at that point. It's cynical and probably not entirely correct.
So I'm curious to get your take on it.
Yeah, for sure. So, you know, there's definitely a lot of overlap there and it's not really two
separate things. In my mind, at least monitoring, which has been around for a very long time,
has always been around notification and having visibility into your systems. And then as the
systems got more complex over time, being able to sort of understand that
and not just have visibility into it, but understand it a little bit more sort of required
perhaps additional new data types to go and solve those problems. And that's how, in my mind,
monitoring sort of morphed into observability. So perhaps one is a subset of the other and they're
not competing concepts there.
But at least that's my opinion.
I'm sure there are plenty out there that would perhaps disagree with that.
On some level, it almost gets at the adage that, at a certain point of scale with distributed systems, it's never a question of "is the app up or down?"
It's more a question of "how down is it?"
At least that's how it was explained to me at one point.
And it was someone who was incredibly convincing.
So I smiled, nodded, and never really thought to question it any deeper than that.
But I look back at the large-scale environments I've been in, and yeah, things are always
on fire on some level.
And ideally, there are ways to handle and mitigate that.
Past a certain point, the approach of small-scale systems stops working at large scale.
I mean, I see that over in the costing world, where people will put tools up on GitHub of,
hey, I ran this script, and it worked super well on my 10 instances.
And then you try and run the thing on 10,000 instances, and the thing melts into the floor,
hits rate limits left and right, because people don't think in terms of those scales.
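The "works on 10 instances, melts at 10,000" failure Corey describes usually comes down to unbounded fan-out against a rate-limited API. A hedged sketch of the standard fix, bounded concurrency plus jittered exponential backoff; the function and error names here are invented placeholders, not any real SDK:

```python
# Sketch: the difference between "works on 10 instances" and "works on
# 10,000" is usually bounded concurrency plus backoff on throttling.
# describe_instance is a stand-in for any rate-limited cloud API call.
import random
import time
from concurrent.futures import ThreadPoolExecutor

def with_backoff(call, max_tries=5):
    """Retry a call with jittered exponential backoff on throttle errors."""
    for attempt in range(max_tries):
        try:
            return call()
        except RuntimeError:                                 # stand-in throttle error
            time.sleep(min(2 ** attempt, 30) * random.random())  # jittered wait
    raise RuntimeError("still throttled after retries")

def describe_instance(instance_id: str) -> str:
    return f"{instance_id}: ok"                              # placeholder API call

ids = [f"i-{n:05d}" for n in range(10_000)]
# Cap the fan-out at 16 in-flight calls instead of firing 10,000 at once.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(lambda i: with_backoff(lambda: describe_instance(i)), ids))
```

The worker cap is the part most quick scripts skip: without it, every retry storm just re-hits the rate limit at full concurrency.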
So it seems like you're sort of going from the opposite end, where, well, this is how
we know things work at large scale.
Let's go ahead and build that out as an initially smaller team.
Because I'm going to assume, not knowing much about Chronosphere yet, that it's the sort of thing that will help a
company before they get to the hyperscaler stage. 100%. And you're spot on there, Corey. And it's
not even just a company going from small stage, small scale, simple systems to more complicated
ones. Actually, if you think about this shift in the cloud right now, it's really going from cloud to cloud native, right? So going
from VMs to containers on the infrastructure tier, and going from monoliths to microservices. So
it's not even the growth of the company necessarily, or the growth of the load that
the system has to handle. But this sort of this shift to containers and microservices
heavily accelerates the growth of the amount of data that gets produced.
And that is causing a lot of these problems.
So Uber was famous for disrupting effectively the taxi market.
What made you folks decide, I know we're going to reinvent observability slash monitoring while we're at it too?
What was it about existing
approaches that fell down and I guess necessitated you folks to build your own? Yeah, great question,
Corey. And actually it goes to the first part. We were disrupting the taxi industry. And I think
the ability for Uber to iterate extremely fast and respond as a business to changing market
conditions was key to that disruption. So
monitoring and observability was a key part of that because you can imagine it was providing
all of the real-time visibility to not only what was happening in our infrastructure and
applications, but the business as well. So it really came out of a necessity more than anything
else. We found that in order to be more competitive, we had to adopt what is probably today known as
cloud-native architecture, adopt running on containers and microservices so that we can
move faster. And along with that, we found that all of the existing monitoring tools we were using
weren't really built for this type of environment. And it was that that was the forcing function
for us to create our own technologies that were really purpose-built for this modern type of environment that gave us the visibility we needed to be competitive as a company and a business.
So talk to me a little bit more about what observability is.
I hear people talking about it, to be frank, in a bunch of ways so that they're trying to, I guess, appropriate the term to cover what they already are doing or selling because changing vocabulary is easier than changing an entire product philosophy.
What is it?
Yeah, we actually had a very similar view on observability.
And originally, you know, we thought that it is a combination of metrics, logs, and traces. And that's a very common view: you have the three pillars. It's almost like three checkboxes; you tick them off and you have, quote-unquote, observability. And that's actually how we looked at the problem at Uber, and we built solutions for each one of those, and we checked all three boxes. What we've come to realize since then is perhaps that was not the best way to look at it, because we had all three.
But what we realized is that actually just having all three doesn't really help you with the ultimate goal of what you want from this platform.
And having more of each of the types of data didn't really help us with that either.
So, you know, taking a step back from there, when we really looked at it, the lesson that we learned in our view on observability is really more from an end-user perspective, rather than a data type or data input perspective. And really, from an end-user perspective, if you think about why you want to use your monitoring tool or your observability tool, you really want to be notified of issues and remediate them as quickly as possible. And to do that, it really just comes down to answering three questions.
Can I get notified when something is wrong?
Yes or no?
Do I even know something is wrong?
The second question is, can I triage it quickly to know what the impact is?
Do I know if it's impacting all of my customers or just the subset of them?
And how bad is the issue?
Can I go back to sleep if I'm being paged at two o'clock in the morning?
And the third one is, can I figure out the underlying root cause of the problem and go and actually fix it? So this is how we think about the problem now: from the end-user perspective. And it's not that you don't need metrics, logs, or distributed traces to solve the problem, but we are now orienting our solution around solving the problem for the end user, as opposed to just orienting our solution around the three data types per se.
I'm going to self-admit to a fun billing experience I had once
with a different monitoring vendor whom I will not name, because it turns out you can tell stories,
you can name names, but doing both gets you in trouble. It was a more traditional approach in a
simpler time. And they wound up
sending me a message saying, oh, we're hitting rate limits on CloudWatch. Go ahead and open a
ticket asking for them to raise it. And in a rare display of foresight, AWS responded to my ticket
with a, we can do this, but understand at this level of concurrency, it will cost something like $90,000 a month on increased charges with that frequency for that many metrics.
And that was roughly twice what our AWS bill was in those days.
So I'm curious as to how you can offer predictable pricing when you can have things that emit so much data so quickly.
I believe you when you say you can do it. I'm
just trying to understand the philosophy of how that works. As I said earlier, we started to
approach this by trying to solve it in a very engineering fashion where we just wanted to create
more efficient backend technology so that it would be cheaper for the increased amount of data.
What we realized over time is that no matter
how much cheaper we make it, the amount of data being produced, especially from monitoring and
observability, kept increasing, not even in a linear fashion, but in an exponential fashion.
And because of that, it really changed the focus of the problem from how efficiently we can store this data to how users are using this data, and whether they even understand the data that's being produced. So in addition to the couple of properties I mentioned earlier, around cost accounting and rate limiting, those are definitely required, the other thing we try to make available for our end users is introspection tooling, so that they understand the type of data that's being produced. It's actually very easy in the monitoring and observability world to write a single line of code that produces a lot of data, and most developers don't understand that that single line of code produces so much data. So our approach to this is to provide a tool so that developers can introspect and understand what is produced on the back-end side, not what is being inputted from their code. And then not only have an understanding of that, but also dynamic ways to deal with it, so that, you know, again, when they hit the rate limit, they don't just have to monitor less. They understand that, oh, I inserted this particular label, and now I have 20 times the amount of data that I needed before; do I really need that particular label in there? And if not, perhaps dropping it dynamically on the server side is a much better way of dealing with that problem than having to roll back your code and change your metric instrumentation. So for us, you know, the way to deal with it is not just to make the back end even more efficient, but really to have end users understand the data that they're producing, and make decisions on which parts of it are really useful and which parts of it they perhaps don't want, or perhaps want to retain for shorter periods of time, for example, and then allow them to actually implement those changes on that data on the back end. And that is really how the end users control the bills and the cost themselves.
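The "single line of code that produces a lot of data" problem Martin describes is label cardinality: in a Prometheus-style system, every distinct combination of label values on a metric becomes its own time series on the back end. A toy model of the effect (plain Python with invented label sets, not Chronosphere's accounting):

```python
# Toy model of Prometheus-style series creation: each distinct
# (metric name, label values) combination is a separate time series.
from itertools import product

def series_count(metric: str, label_values: dict) -> int:
    """Number of distinct series one instrumented metric can produce."""
    return sum(1 for _ in product(*label_values.values()))

# One counter with two low-cardinality labels: 3 * 4 = 12 series.
base = series_count("http_requests_total",
                    {"method": ["GET", "POST", "PUT"],
                     "status": ["200", "400", "404", "500"]})

# The same counter after one extra 20-value label (say, a per-endpoint
# dimension someone added in a single line of code): 12 * 20 = 240 series.
grown = series_count("http_requests_total",
                     {"method": ["GET", "POST", "PUT"],
                      "status": ["200", "400", "404", "500"],
                      "endpoint": [f"/api/v1/resource/{i}" for i in range(20)]})

print(base, grown, grown // base)
```

Because the multiplier lives in the label values, dropping or aggregating away that one label on the server side undoes the 20x without touching the instrumentation code, which is the dynamic-drop approach described above.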
So there are a number of different companies in the observability space that have different
approaches to what they solve for. In some cases, to be very honest, it seems like, well,
I have 15 different observability and monitoring tools. Which ones do you replace? And the answer
is, oh, we're number 16. And it's easy to be cynical and
down on that entire approach, but then you start digging into it and they're actually right. I
didn't expect that to be the case. What was your perspective that made you look around the, let's
be honest, fairly crowded landscape of observability companies, tools that gave insight into the health
status and well-being
of various applications in different ways and say, you know, no one's quite gotten this right yet.
I have a better idea. Yeah, you're completely correct. And perhaps the previous environments
that everybody was operating in, there were a lot of different tools for different purposes, right?
A company would purchase an infrastructure monitoring tool, perhaps even a network monitoring tool, and then they would have perhaps an APM solution for the applications, and then perhaps some BI tools for the business. So there was always, historically, a collection of different tools to go and solve this problem. And I think, again, what has really happened with this recent shift to cloud native is that the need for a lot of this data to be in a single tool has become more important than ever. So if you think about your microservices running on a single container today: if a single container dies in isolation, without knowing perhaps which microservice was running on it, it doesn't mean very much, and just having that visibility is not going to be enough. Just like if you don't know which business use case that microservice was serving, that's not going to be very useful for you either. So with cloud-native architecture, there is more of a need to have all of this data and visibility in a single tool, which hasn't historically happened. And also, none of the existing tools today were built for it: if you think about both the existing APM solutions out there and the existing hosted solutions that exist in the world today, none of them were really built for a cloud-native environment, because you can think about even the timing that these companies were created at. You know, back in the early 2010s, Kubernetes and containers weren't really a thing.
So a lot of these tools weren't really built for the modern architecture that we see most
companies shifting towards.
So the opportunity was really to build something for where we think the industry and everyone's
technology stack was going to be as opposed to where the technology stack has been in the past
before. And that was really the opportunity there. And it just so happened that we had built a lot of
these solutions for a similar type of environment for Uber many years before. So
leveraging a lot of our lessons learned there put us in a good spot to build a new solution
that we believe is fairly different from everything else that exists today in the market.
And it's going to be a good fit for companies moving forward.
So on your website, one of the things that you, I assume, put up there just to pick a fight,
because if there's one thing these people love,
it's fighting, is a use case is outgrowing Prometheus. The entire story behind Prometheus
is, oh, it scales forever. It's what the hyperscalers would use. This came out of the
way that Google does things. And everyone talks about Google as if it's this mythical Valhalla
place where everything is amazing and nothing ever goes wrong. I've seen the conference docs and that's great. What does outgrowing Prometheus look like? Yeah, it's a great question, Corey.
So if you look at Prometheus and, you know, it is the graduated and the recommended monitoring tool
for cloud-native environments. If you look at it and the way it scales, actually, it's a single-binary solution, which is great because it's really easy to get started: you deploy a single instance, and you have sort of ingestion, storage, visibility, dashboarding, and alerting all packaged together into one solution. That's definitely great, and it can scale sort of by itself to a certain point, and it is definitely the recommended starting point. But as you really start to grow your business, increase your cluster sizes, and increase the number of applications you have, it actually isn't a great fit for horizontal scale. There isn't really high availability or horizontal scale built into Prometheus by default. And that's why other projects in the CNCF, such as Cortex and
Thanos, were created to solve some of these problems. So we sort of looked at the problem
in a similar fashion. And when we created M3, the open source metrics platform that came out of Uber,
it was also sort of approaching it from this different perspective where we built it to be
horizontally scalable and highly reliable from the beginning. But yet we don't really want it to be a
competing project with Prometheus. So it is actually something that works in tandem with Prometheus in the sense that it can ingest Prometheus metrics
and you can issue Prometheus query language queries against it and it will fulfill those.
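For anyone wanting to reproduce the tandem setup Martin describes, the usual wiring is Prometheus's remote-write feature pointed at an M3 Coordinator. A minimal sketch of a prometheus.yml fragment; the hostname is a placeholder, and the port and paths are M3's documented defaults, so check them against your own deployment:

```yaml
# prometheus.yml (fragment): Prometheus keeps scraping as usual, but
# forwards every sample to M3 for horizontally scalable long-term storage.
remote_write:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/write"

# Optional: let Prometheus read back the data that M3 stores.
remote_read:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/read"
```

PromQL queries can also be issued directly against the M3 Coordinator's query endpoints, which is what fulfilling "Prometheus query language queries" refers to here.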
But it is really built for a more scalable environment. And I would say that once a company starts to grow, they run into some of these pain points. And these pain points are surrounding how reliable a Prometheus instance is, and how you can scale it up beyond just, you know, giving it more resources on the VM that it runs on; vertical scale, you know, runs out at a certain point. Those are some of the pain points that, you know, a lot of companies do run into and need to solve eventually. And there are various solutions out there, both in the open source and in the commercial world, that are designed to solve those pain points: M3 being one of the open source ones, and of course, Chronosphere being one of the commercial ones.
This episode is sponsored in part by Salesforce. Salesforce
invites you to Salesforce and AWS: What's Ahead, for architects, admins, and developers, on June 24th at 10 a.m. Pacific Time. It's a virtual event where you'll get a first look
at the latest innovations from the Salesforce
and AWS partnership
and have an opportunity to have your questions answered.
Plus, you'll get to enjoy an exclusive performance
from Grammy Award-winning artist The Roots.
I think they're talking about a band,
not people with super user access to a system.
Registration is free at salesforce.com
slash what's ahead. Now, you've also gone ahead and more or less dangled raw meat in front of a
tiger in some respects here, because one of the things that you wind up saying on your site of
why people would go with Chronosphere is, ah, this doesn't allow for bill-spike overages as far as
what the Chronosphere bill is.
And that's awesome. I love predictable pricing. It's sort of the antithesis of cloud bills.
But there is the counter argument too, which is with many approaches to monitoring,
I don't actually care what my monitoring vendor is going to charge me because they wind up
costing me five times more just in terms of CloudWatch charges. How does your billing work? And how do
you avoid causing problems for me on the AWS side or other cloud provider? I mean, again,
GCP and Azure are not immune from this. So if you look at the built-in solutions by the cloud
providers, a lot of those metrics and monitoring you get from CloudWatch or Stackdriver,
a lot of it you get sort of included for free with your AWS bill already.
It's only if you want additional data and additional retention
do you choose to pay more there.
So I think a lot of companies do use those solutions
for the default set of monitoring that they want,
especially for the AWS services.
But generally, a lot of companies have sort of custom monitoring requirements
outside of that in the application tier,
or even more detailed monitoring
in the infrastructure that is required,
especially if you think about Kubernetes.
Oh, yeah.
And then I see people using CloudWatch
as basically a monitoring or metric or log router,
which at its price point, don't do that.
It doesn't end well for anyone involved.
100%. So our solution and our approach is a little bit different. So it doesn't actually
go through CloudWatch or any of these other inbuilt cloud-hosted solutions as a router,
because to your point, there's a lot of costs there as well. It actually goes and collects
the data from the infrastructure tier or the applications. And what we have found is that
not only does the bill for monitoring climb exponentially, not just as you grow, but especially as you shift towards cloud-native architecture. Our very first take on solving that problem was to make the back end a lot more efficient than before, so it just is cheaper overall. And we approached it that way at Uber, and we had great results there. When we created M3, originally, before M3, 8% of Uber's infrastructure bill was spent on monitoring all that infrastructure and the applications. And by the time we were done with M3, the cost was a little over 1%. So the very first solution was just to make it more efficient. And that worked for a
while. But what we saw is that over time, this grew again, and there wasn't any more efficiency
we can crank out of the backend storage system.
There's only so much optimization you can do to the compression algorithms in the backend
and how much you can get there.
So what we realized the problem shifted towards was not, can we store this data more efficiently?
Because we're already reaching sort of limitations there.
And what we noticed is more towards getting the users of this data,
so individual developers themselves,
to start to understand what data is being produced,
how they're using it, whether it's even useful,
and then taking control from that perspective.
And this is not a problem isolated to the SRE team
or the observability team anymore.
If you think about modern DevOps practices,
every developer needs to take control of monitoring their own applications, right? So
this responsibility is really in the hands of the developers. And the way we approach this
from a Chronosphere perspective is really in four steps. The first one is that we have cost
accounting so that every developer and every team and the central observability team know
how much data
is being produced because it's actually a hard thing to measure, especially in the monitoring
world.
Oh yeah, even AWS bills get this wrong.
If you're sending data between one availability zone to another in the same region, it charges
a penny to leave an AZ and a penny to enter an AZ in that scenario.
And the way that they reflect this on the bill is they double it.
So if you're sending one gigabyte across an AZ link in a month,
you'll see two gigabytes on the bill, and that's how it's reflected.
And that is just a glimpse of the monstrosity that is the AWS billing system.
But yeah, exposing that to folks so they can understand how much data their application's spitting off,
forget it. That never happens.
Right, right. And it's not even exposing it to the company as a whole. It's to each use case,
right, to each developer. So they know how much data they are producing themselves.
They know how much of the bill is being consumed. And then the second step in that is to put up
bumper lanes to that so that, you know, once you hit the limit, you don't just get a surprise bill
at the end of the month. When each developer hits that limit, they rate limit themselves and they only impact
their own data.
There's no impact to the other developers or to the other teams or to the rest of the
company.
So we found that those two were necessary initial steps.
And then there were additional steps beyond that to help deal with this problem.
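The two mechanisms described here, per-team cost accounting plus per-team bumper lanes, can be sketched in a few lines. This is a hypothetical illustration of the idea, not Chronosphere's actual implementation, and the team names and quota numbers are invented:

```python
# Sketch: attribute incoming samples to a team, and rate-limit only the
# team that exceeds its quota. Hypothetical illustration, not real code.
from collections import defaultdict

class TeamQuotas:
    def __init__(self, quotas: dict):
        self.quotas = quotas          # team -> allowed samples per interval
        self.used = defaultdict(int)  # running cost accounting per team

    def ingest(self, team: str, samples: int) -> bool:
        """Accept the batch if the team is under quota; otherwise drop
        only this team's data. Other teams are unaffected."""
        if self.used[team] + samples > self.quotas.get(team, 0):
            return False              # rate-limited: no surprise bill
        self.used[team] += samples
        return True

q = TeamQuotas({"payments": 1000, "search": 100})
assert q.ingest("payments", 800)      # under quota: accepted
assert not q.ingest("search", 150)    # over quota: only 'search' is limited
assert q.ingest("payments", 150)      # 'payments' is unaffected
```

The key property is isolation: the `used` accounting is per team, so one team blowing through its limit never degrades anyone else's data or inflates the shared bill.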
So in order for this to work without a multi-day lag, in some cases, it's a near certainty
that you're looking at what is happening
and what expense is being incurred in real time,
not waiting for it to pass its way
through the AWS billing system
and then do some tag attribution back.
A hundred percent.
It's in real time for the stream of data.
And as I mentioned earlier,
for the monitoring data we are collecting,
it goes straight from the customer environment to our backend. So we're not waiting for it to
be routed through the cloud providers because rightly so, there is a multi-day or multi-hour
delay there. So as the data is coming straight to our backend, we are actively in real time
measuring that and cost accounting it to each individual team. And in real time, if the usage goes above what is allocated,
we'll actually limit that particular team or that particular developer
and prevent them by default from using more.
And with that mechanism, you can imagine, that's how the bill is controlled,
and controlled in real time.
So help me understand on some level, is your architecture then agent-based?
Is it a library that gets included in the application code itself, all of the above
and more, something else entirely?
Or is this just such a ridiculous question that you can't believe that no one has ever
asked it before?
No, it's a great question, Corey, and we'd love to give some more insight there.
So it is an agent that runs in the customer environment because it does need to be something
there that goes and collects all the data we're interested in to send it to the back end. This agent is unlike
a lot of APM agents out there that do sort of introspection, things like that. We really
believe in the power of the open source community and in particular open source standards like the
Prometheus format for metrics. So what this agent does, it actually goes and discovers Prometheus endpoints
exposed by the infrastructure and applications
and scrapes those endpoints
to collect the monitoring data to send to the backend.
And that is the only piece of software
that runs in our customer environments.
And then from that point on,
all of the data is in our backend
and that's where we go and process it
and give visibility into the end users
as well as store it and make it available for alerting and dashboarding purposes as well.
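The scrape model described here pulls the Prometheus text exposition format over HTTP. To illustrate what such a payload looks like, here is a minimal parser for the simple cases; a real collector such as the Prometheus server or a vendor agent also handles histograms, timestamps, and label-value escaping:

```python
# Minimal parser for simple cases of the Prometheus text exposition
# format, the kind of payload a /metrics endpoint returns when scraped.
def parse_metrics(payload: str) -> dict:
    samples = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):       # skip HELP/TYPE comments
            continue
        name_and_labels, _, value = line.rpartition(" ")
        samples[name_and_labels] = float(value)
    return samples

# Example payload, as an application's /metrics endpoint might expose it.
scrape = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="500"} 3
process_cpu_seconds_total 12.5
"""

metrics = parse_metrics(scrape)
print(metrics['http_requests_total{method="GET",status="200"}'])  # 1027.0
```

Because the format is an open standard, anything that exposes such an endpoint, an application, a node exporter, a Kubernetes component, can be scraped by any compliant collector without a proprietary in-process agent.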
So when did you found Chronosphere? I know that you folks recently raised a series B,
congratulations on that, by the way, that generally means at least if I understand the VC world
correctly, that you've established product market fit. And now we're talking about,
let's scale this thing. My experience in startup land was, oh, we've raised a series B. That means it's
probably time to bring in the first DevOps hire. And that was invariably me. And I wound up
screaming and freaking out for three months and then things were better. So that was my exposure
to series B. But it seems like given what you do, you probably had a few SRE folks kicking around,
even on the product team.
Because everything you're saying so far absolutely resonates with the experience of someone who has run these large-scale things in production.
No big surprise there.
Is that where you are?
I mean, how long have you been around?
Yeah, so we've been around for a couple of years thus far.
So still a relatively new company for sure.
A lot of the core team were the team that both built the underlying technology and also
ran it in production for many years at Uber.
And that team is now here at Chronosphere.
So you can imagine from very beginning, we had DevOps and SREs running this hosted platform
for us.
And it's the folks that actually built the technology and ran it for years, running it again outside of Uber now. And then to your first question, yes, we did
establish product-market fit fairly early on. And I think that is also because we could leverage a lot of the
technology that we had built at Uber. And it sort of gave us a boost to have a product ready for the
market much faster. And what we're seeing in the industry right now is, you know, the adoption of cloud native is so fast that it's sort of accelerating the need
for a new monitoring solution, as, you know, historical solutions perhaps cannot handle a lot
of the use cases there. It's a new architecture, it's a new technology stack, and we have the
solution purpose built for that particular stack. So, you know, we are seeing a fairly
fast acceleration and adoption of our product right now.
One problem that an awful lot of monitoring slash observability companies have gotten into
in the last few years, at least it feels this way, and maybe I'm wildly incorrect,
is that it seems that the target market is the Ubers of the world, the hyperscalers,
where once you're at that scale, then you need a tool like this. But if you're just building a standard three-tier web
app, oh, you're nowhere near that level of scale. And the problem with go-to-market in those stories
inherently seems to be that by the time you are a hyperscaler, you have already built a somewhat
significant observability apparatus. Otherwise, you would not have survived or stayed up long enough to become a hyperscaler. How do you find that the on-ramp
looks? I mean, your website does talk about when you outgrow Prometheus, is there a certain point
of scale that customers should be at before they start looking at things like Chronosphere?
I think if you think about the companies that are born in the cloud today and how quickly they are running and they are iterating their technology stack, monitoring is so critical to that, right?
The real-time visibility into these changes that are going out multiple times a day is critical to the success and the growth of a lot of new companies. And because of how critical that piece is, we're finding that you
don't have to be a giant hyperscaler like Uber to need technology like this. And as you rightly
pointed out, you sort of need technology like this as you scale up. And what we're finding is that,
while a lot of large tech companies can invest a lot of resources into hiring these teams and
building out custom software themselves, generally
it's not a great investment on their behalf because those are not companies that are selling
monitoring technology as their core business. So generally what we find is that it is better for
companies to perhaps outsource or purchase or at least use open source solutions to solve some of
these problems rather than custom build in-house. And we're finding that earlier and earlier on in a company's life cycle, they're needing
technology like this.
Part of the problem I always ran into was, again, I come from the old world of grumpy
Unix sysadmins.
For me, using Nagios was my approach to monitoring.
And that's great when you have a persistent, stateful single node or a couple of single nodes, and then you outgrow it because, well, now everything's ephemeral. And by the time you realize that there's an outage or an issue with a container, the container hasn't existed for 20 minutes. And you better have good telemetry into what's going on and how your application behaves, especially at scale, because at that point, edge cases, one in a million events happen multiple times
a second, depending upon scale.
And that's a different way of thinking.
I've been somewhat fortunate in that, in my experience at least, I've not usually had
to go through those transformative leaps.
I've worked with Prometheus, I've worked with Nagios, but never in the same shop.
That's the joy of being a consultant.
You go into one environment, you see what they're doing, and you take notes on what works and what doesn't.
You move on to the next one.
And it's clear that there's a definite defined benefit
to approaching observability in a more modern way.
But I despair at the idea of trying to go from one to the other.
And maybe that just speaks to a lack of vision for me.
No, I don't think that's the case at all, Corey.
I think we are seeing a lot of companies do this transition. I don't think a lot of companies go and ditch everything that
they've done and things that they put years of investment into. There's definitely a gradual
migration process here. And what we're seeing is that a lot of the newer projects, newer environments,
newer efforts that have been kicked off are being monitored and observed using modern technology
like Prometheus. And then there are also a lot of legacy systems and legacy processes, which are still going to be around for a very long time. It's actually
something we had to deal with at Uber as well. We were actually using Nagios and a StatsD/Graphite
stack for a very long time before switching over to a more modern tag-based system like Prometheus.
So modern Nagios, what was it?
Icinga.
That's what it was.
Yes.
Yes.
It was actually the system that we were using at Uber.
So and I think, you know, for us, it's not just about ditching all of that investment.
It's really about supporting this migration as well.
And this is why, in the open source technology M3,
we actually support both the more legacy data types, like StatsD and the Graphite query language,
as well as the more modern types, like Prometheus and PromQL. And having support for both allows
for a migration and a transition, and not even a complete transition. I'm sure there will always be
StatsD and Graphite data in a lot of these companies, because they're just legacy applications that nobody owns or touches anymore.
And they're just going to be lying around for a long time.
So it's actually something that we proactively get ahead of and ensure that we can support
both use cases, even though we see a lot of companies trending towards the modern technology
solutions for sure.
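One way to picture that dual StatsD/Graphite and Prometheus support is the mapping from legacy dotted metric names to tagged series, which is what makes a tag-based query language usable over the old data. The positional label names below (path0, path1, and so on) are an illustrative convention for this sketch, not M3's actual internal mapping:

```python
# Hedged sketch: bridging legacy Graphite dotted names to tagged,
# Prometheus-style label sets. Label naming here is illustrative only.
def graphite_to_tags(dotted_name):
    """Split 'servers.web01.cpu.load' into positional path labels."""
    parts = dotted_name.split(".")
    return {f"path{i}": part for i, part in enumerate(parts)}


print(graphite_to_tags("servers.web01.cpu.load"))
# {'path0': 'servers', 'path1': 'web01', 'path2': 'cpu', 'path3': 'load'}
```

With a mapping like this, legacy dotted series and native tagged series can live in one store, so teams migrate query by query instead of all at once.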
The last point I want to raise has always been a personal, I guess, area of focus for me. I allude
to it sometimes. I've done a Twitter thread or two on it. But on your website, you say something
that completely resonates with my entire philosophy. And to be blunt, is why in many cases,
I'm down on an awful lot of vendor tooling across a wide variety of disciplines. On the open source
page on your site, near the bottom, you say,
and I quote, we want our end users to build transferable skills that are not vendor or
product specific. And I don't think I've ever seen a vendor come out and say something like that.
Where did that come from? Yeah, if you look at the core of the company, it is built on top of open source technology, right? So it is a very open core company here at Chronosphere. And we really believe in the power of the open source community. And in particular, perhaps not even individual projects, but industry standards and open standards. why we don't have a proprietary protocol or proprietary agent or,
you know,
proprietary query language in our product,
because we truly believe in allowing our end users to build these transferable
skills and industry standard skills,
right?
And right now that is using Prometheus as the client library for monitoring and
PromQL as the query language.
And I think it's not just a transferable skill that you can bring with you across multiple companies. It's also the power of that broader community. So you can
imagine now that there is a lot more sharing of, "Hey, I am monitoring, for example, MongoDB. How should
I best do that?" Those skills can be shared, because the common language that they're all speaking,
the queries that everybody is sharing with each other, the dashboards everybody's sharing with each other, are all sort of open source standards now.
And we really believe in the power of that. We really do everything we can to promote that.
And that is why, in our product, there isn't any proprietary query language or definitions of
dashboarding or alerting or anything like that. So yeah, it is definitely just a core tenet of
the company, I would say. It's really something that I think is admirable. I've known too many people who wind up,
I guess, stuck in various environments where the thing that they work on is an internal application
to the company and nothing else like it exists anywhere else. So if they ever want to change
jobs, they effectively have a black hole on their resume for a number of years. This speaks directly
to the opposite.
It seems like it's not built on a lock-in story.
It's built around actually solving problems.
And I'm a little ashamed to say how refreshing that is,
just based upon what that says about our industry.
Yeah, Corey.
And I think what we're seeing is actually the power of these open source standards.
Prometheus, let's say, is actually having
effects on the broader industry, which I think is great for everybody. So, you know, while a company
like Chronosphere has been supporting these from day one, you see how pervasive the Prometheus protocol
and the query language are. Actually, all of these probably more traditional vendors providing
proprietary protocols and proprietary query languages all actually have to have Prometheus, well, not have to have, but we're seeing that
more and more of them are having Prometheus compatibility as well. And I think that just
speaks to the power of the industry and it really benefits all of the end users in the industry as a
whole, as opposed to the vendors, which, you know, we are really happy to be supporters of.
Thank you so much for taking the time to speak with me today. If people want to learn more about
what you're up to, how you're thinking about these things, where can they find you? And I'm
going to go out on a limb and assume you're also hiring. We are definitely hiring right now. And
you can find us on our website at chronosphere.io, or feel free to shoot me an email directly. My email is martin
at chronosphere.io. Definitely massively hiring right now. And also if you do have problems trying
to monitor your cloud native environment, please come check out our website and our product.
And we will of course include links to that in the show notes. Thank you so much for taking the
time to speak with me today. I really appreciate it. Thanks a lot for having me, Corey. I really enjoyed this.
Martin Mao, CEO and co-founder of Chronosphere. I'm cloud economist Corey Quinn, and this is
Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on
your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your
podcast platform of choice, along with an insulting comment speculating about how long it took to
convince Martin not to name the company Observability Manager Chronosphere Manager.
If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill
Group works for you, not AWS. We tailor recommendations to your business and we get
to the point. Visit duckbillgroup.com to get started.
This has been a HumblePod production.
Stay humble.