Screaming in the Cloud - Unpacking the Costs and Value of Observability with Martin Mao
Episode Date: July 18, 2023

Martin Mao, CEO & Co-Founder at Chronosphere, joins Corey on Screaming in the Cloud to discuss the trends he sees in the observability industry. Martin explains why he feels measuring observability costs isn't nearly as important as understanding the velocity at which observability costs are increasing, and why he feels efficiency is something that has to be built into processes as companies scale new functionality. Corey and Martin also explore how observability can now be used by business executives to provide top-line visibility and value, as opposed to just seeing observability as a necessary cost.

About Martin

Martin is a technologist with a history of solving problems at the largest scale in the world and is passionate about helping enterprises use cloud native observability and open source technologies to succeed on their cloud native journey. He's now the Co-Founder & CEO of Chronosphere, a Series C startup with $255M in funding, backed by Greylock, Lux Capital, General Atlantic, Addition, and Founders Fund. He was previously at Uber, where he led the development and SRE teams that created and operated M3. Previously, he worked at AWS, Microsoft, and Google. He and his family are based in the Seattle area, and he enjoys playing soccer and eating meat pies in his spare time.

Links Referenced:
Chronosphere: https://chronosphere.io/
LinkedIn: https://www.linkedin.com/in/martinmao/
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Human scale teams use Tailscale to build trusted networks.
Tailscale Funnel is a great way to share a local service with your team for collaboration, testing, and experimentation.
Funnel securely exposes your dev environment at a stable URL, complete with
auto-provisioned TLS certificates. Use it from the command line or the new VS Code extensions.
In a few keystrokes, you can securely expose a local port to the internet right from the IDE.
I did this in a talk I gave at Tailscale Up, their inaugural developer conference.
I used it to present my slides and
only revealed that that's what I was doing at the end of it. It's awesome. It works. Check it out.
Their free plan now includes three users and 100 devices. Try it out at snark.cloud slash
tailscale scream. Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted guest
episode is brought to us by our friends at
Chronosphere. It's been a couple of years since I got to talk to their CEO and co-founder,
Martin Mao, who is kind enough to subject himself to my slings and arrows today.
Martin, great to talk to you. Great to talk to you again, Corey, and looking forward to it.
I should probably disclose that I did run into you at Monitorama a week before this recording.
So that was an awful lot of fun to just catch up and see people in person again.
But one thing that they started off the conference with in the welcome to the show style of talk was the question about benchmarking what observability spend should be as a percentage of your infrastructure spend.
And from my perspective, that really feels a lot like a question that looks like,
well, how long should a piece of string be?
It's always highly contextual.
Agree, disagree, or are you hopelessly compromised?
Because you are, in fact, an observability vendor,
and it should always be more than it is today.
I would say, definitely agree with you from an exact number perspective. I don't think there is a magic number like 13.82% that this should be. It definitely depends on the context of how
observability is used within a company. And really, ultimately, just like anything else you pay for,
it really gets derived from the value you get out of it. So I feel like if you feel like you're getting the value
out of it, it's sort of worth the dollars that you put in. I do see why a lot of companies out
there and people are interested because they're trying to benchmark to try to see, am I doing
best practice? So I do think that there are probably some best-practice ranges that we see most typical organizations out there fall into, is one thing I would say.
The other thing I would say when it comes to observability costs is, one of the concerns we've seen talking with companies is that if the relative cost of observability is growing faster than infrastructure and you extrapolate that out
a few years, then the direction in which this is going is bad.
So it's probably more the velocity of growth than the absolute number that folks should
be worried about.
I think that that is probably a fair assessment.
I get it all the time, at least in years past, where companies will say,
for every thousand daily active users, what should it cost to service them? And I finally snapped in one of my talks that I gave at DevOps Enterprise Summit and said, I think it was something like
$7.34. It's an arbitrary number that has no context in your business, regardless of whether
those users are Twitter users or large banks you have partnerships with.
But now you have something to cite. Does it help you? Not really. But will it get people to leave you alone and stop asking you awkward questions? Also, not really. But at least now you have a
number. Yeah, 100%. And I'm glad our magic numbers weren't too far away from each other. But yeah, I mean, there's no exact number there for sure.
One pattern I've been seeing more recently is like rather than asking for the number,
there's been a lot more clarity in companies on figuring out, well, okay, before I even
pick what the target should be, how much am I spending on this per whatever the unit of efficiency
is, right?
And generally that unit of efficiency, I've actually seen it be mapped more to the business
side of things.
So perhaps to the number of customers, or to customer transactions and whatnot. And those things are generally perhaps easier to model out and easier to justify, as opposed to purely, you know, the number of seats or the number of end users. But I've seen a lot more companies at least focus on the measurement of things. And again, it's been more about, rather than the absolute number, the relative change in number.
Because I think a lot of these companies are trying to figure out, is my business scaling
in a linear fashion or a sub-linear fashion, or perhaps an exponential fashion?
If the cost, you know, you can imagine, is growing exponentially, that's a really bad thing that you want to get ahead of.
That, I think, is probably the real question people are getting at: it seems like this number only really goes up and to the right. It's not something that we have any real visibility into. And in many cases, it's beating up your CloudWatch API charges all the time on this other side as well. And data egress is not free, surprise,
surprise. So it's the direct costs, it's the indirect costs. And the thing people never talk
about, of course, is the cost of people to feed and maintain these systems. Yeah, 100%. You're
spot on. There's the direct costs, there's the indirect costs, like you mentioned, in observability. Network egress is a huge indirect cost. There's the people that you mentioned that need to maintain these systems.
And I think those are things that companies definitely should take into account when they
think about the total cost of ownership there.
What's more, in observability, and this is perhaps a hard thing to measure as well, often we ask companies, well, what is the cost of downtime? Right? Like, if your business is impacted and your customers are impacted and you're down, what is the cost of each additional minute of downtime, perhaps? And then the effectiveness of the tool can be evaluated against that. Because, you know, observability is not just any other developer tool. It's the thing that's giving you insight into, is my business or my product or my service operating in the way that I intend? And, you know, is my infrastructure up, for example, as well, right? So I think there's also the piece of, what is the tool really doing in terms of lost revenue or brand impact? Those are often things that are quite easily overlooked as well.
I am curious to see whether you have noticed a shifting in the narrative lately,
where as someone who sells AWS cost optimization
consulting as a service,
something that I've noticed is that
until about a year ago,
no one really seemed to care overly much
about what the AWS bill was.
And suddenly my phone's been ringing off the
hook. Have you found that the same is true in the observability space where no one really cared what
the observability costs until suddenly recently everyone does, or has this been simmering for a
while? We have found that exact same phenomenon. And what I tell most companies out there is
we provide an observability platform that's targeted at cloud-native platforms.
So if you have a cloud-native architecture, if you're running a microservices-oriented architecture on containers,
that's a type of architecture that we've optimized our solution for.
And historically, we've always done two things to try to differentiate.
One is provide a better tool to solve that particular problem in that particular
architecture. And the second one is to be a more cost-efficient solution in doing so. And not just
cost-efficient, but a tool that shows you the cost and the value of the data that you're storing.
So we've always had both sides of that equation. And to your point, in conversations in the past
years, they've generally been led with, look, I'm looking for a better solution. If you just happen to be cheaper, great. That's a nice
cherry on top. Whereas this year, the conversations have flipped 180, in which case most companies are
looking for a more cost-efficient solution. If you just happen to be a better tool at the same time,
that's more of a nice to have than anything else. So that conversation has definitely flipped 180 for us. And we found a pretty similar experience to what you've been
seeing out in the market right now. Which makes a tremendous amount of sense.
I think that there's an awful lot of, we'll just call it strangeness. I think that's probably the
best way to think about it in terms of people waking up to the grim reality that not caring
about your bills
was functionally a zero interest rate phenomenon in the corporate sense.
Now, suddenly everyone really has to think about this in some unfortunate and
some would say displeasing ways.
Yeah, 100%. And it was a great environment for tech for over a decade, right? So it was an
environment that I think a lot of companies and a lot of individuals got
used to. And perhaps a lot of folks that have entered the market in the last decade don't know
of another situation or another set of conditions where efficiency and cost really do matter. So
it's definitely top of mind. And I do think it's top of mind for good reason. I do think
a lot of companies got fairly inefficient over the last few years chasing that top line growth. Yeah, that has been, I think it makes sense in the context
with which people were operating because before a lot of that wound up hitting, it was, well,
grow, grow, grow at all costs. What do you mean? You're not doing that right now. You should be
doing that right now. Are you being irresponsible? Do we need to come down there and talk to you?
A hundred percent. Yeah, so eat your vegetables. Now it's time to start paying attention to this.
Yeah, a hundred percent. It's always a trade-off, right? It's like an individual company,
an individual team, you only have so many resources and prioritization. And I do think,
to your point, in a zero-interest environment, trying to grow that top line was the main thing
to do. And hence, everything was pushed on how quickly can we deliver new functionality,
new features to grow that top line. Whereas the efficiency is always something I think a
lot of companies looked at as something I can go deal with later on and go fix. And I feel like
that time has now just come. I will say that I somewhat recently had the distinct privilege of
working with a company whose observability story was effectively, we wait for customers to call
and tell us there's a problem, and then we go looking into it.
And on the one hand, my immediate former-SRE reflexes kicked in and I recoiled.
But this company has been in this industry longer than I have.
They clearly have a model that is working for them and for their customers.
It's not the way I would build something, but it does seem that for some use cases, you absolutely are going to be okay
with something like that. And I should probably point out, they were not, for example, a bank
where, yeah, you kind of want to get some early warning on things that could destabilize the
economy. Right, right. I mean, to your point, depending on the context and the company,
it could definitely make sense and depending on how they execute as well, right?
So, you know, you called out an example already: if they were a bank, or if the correctness or timeliness of a response was important to that business, it's perhaps not the best thing to have your customers find out, especially if you have a ton of customers at the same time. However, if it's a different type of business
where the responses are perhaps more asynchronous
or you don't have a lot of users encountering at the same time,
or perhaps you have a great A-B experimentation platform,
testing platform,
there are definitely conditions in which that could be
potentially a viable option,
especially when you weigh up the cost and the benefit.
If the cost of having a few customers have a bad experience is not that much to the business,
and the benefit is that you don't have to spend a ton on observability, perhaps that's a trade-off
that the company is willing to make. In most of the businesses that we've been working with,
I would say that's probably not been the case, but I do think that there's probably some bias and some skew there in the sense that you can imagine a company that cares about these things perhaps is more likely to talk to an observability vendor like us to try to fix these problems.
When we spoke a few years back, you definitely were focused on the large, one would say almost hyperscale style of cloud-native build-out.
Is that still accurate,
or has the bar to entry changed since we last spoke? I know you've raised an awful lot of money,
which, good for you, it's a sign of a healthy, robust VC ecosystem. But the counterpoint to that
is they're probably not investing in a company whose total addressable market is like 15 companies
that must be at least this big. 100%, 100%. So I would say that the bar to entry definitely has changed,
but it's not due to a business decision on our end. If you think about how we started and the
focus area, we're really targeting accounts that are adopting cloud-native technology.
And it just so happens that the large tech decacorns and the hyperscalers were the earliest adopters
of cloud-native, so containerization or microservices.
They were the earliest adopters of that.
So hence, there was a high correlation
in the companies that had that problem
and the companies that we could serve.
Luckily for us, the trend has been that
more of the rest of the industry
has gone down this route as well.
And it's not just new startups. You can imagine any new startup these days probably starts off
cloud native from day one. But what we're finding is the more established larger enterprises are
doing this shift as well. And I think the folks out there like Gartner have studied this and
predicted that by about 2028, I believe was the date,
about 95% of applications are going to be containerized in large enterprises. So it's
definitely a trend that the rest of the industry will go on. And as they continue down that trend,
that's when sort of our addressable market will grow because the amount of use cases
where our technology shines will grow along with that as well.
I'm also curious about your description of being aimed at cloud-native companies.
You gave one example of microservices powered by containers, if I understood correctly.
What are the prerequisites for this?
When you say that, it almost sounds like you're trying to avoid defining a specific architecture
that you don't want to deal well with or don't want to support
for a variety of reasons. Is that what it is? Or is there certain you must be built in these ways
or the product does not work super well for you? What is it you're trying to say with that is what
I'm trying to get at here. Yeah, 100%. If you look at the founding story here, it's really that my co-founder and I found Uber going through this transition of both a new architecture, in the sense that, you know, they were going to containers,
they were building microservices-oriented architecture there,
but also adopting a DevOps mentality as well.
So it was just a new way of building software almost.
And what we found is that when you develop software in this particular way,
so you can imagine when you're developing a tiny piece of functionality as a microservice and you're an individual developer and you can imagine rolling
that out into production multiple times a day. In that way of developing software, what we found
was that the traditional tools, the application performance monitoring tools, the IT monitoring
tools that existed before this architecture and way of developing software just weren't a good fit. So the whole reason we exist is that we had to figure
out a better way of solving this particular problem for the way that Uber built software,
which was more of a cloud native approach. And again, it just so happens that the rest of the
industry is moving down this path as well.
And hence, that problem is larger for a larger portion of the companies out there.
I'd say some of the things when you look into why the existing solutions can't solve these problems well,
if you look at an application performance monitoring tool, an APM tool,
it's really focused on introspecting into that application and its
interaction with the operating system or the underlying hardware. And yet these days, that
is less important when you're running inside a container. Perhaps you don't even have access to
the underlying hardware or the operating system. And what you care about, you can imagine, is how
that piece of functionality interacts with all the other pieces of functionality out there over a network call.
So just the architecture and the conditions ask for a different type of observability, a different type of monitoring.
And hence, you just need a different type of solution to go solve for this new world.
Along with this, which is sort of related to the cost as well, is that, you know, as we go from virtual machines onto containers, you can imagine the sheer volume
of data that gets produced now because everything is much smaller than it was before and a lot more
ephemeral than it was before. And hence, every small piece of infrastructure, every small piece
of code, you can imagine still needs as much monitoring and observability as it did before
as well. So just the sheer volume of data is so much larger
for the same amount of infrastructure,
for the same amount of hardware that you used to have.
And that's really driving a huge problem
in terms of being able to scale for it
and also being able to pay for these systems as well.
Tired of Apache Kafka's complexity
making your AWS bill look like a phone number?
Enter Red Panda.
You get 10x your streaming data performance without having to rob a bank. Visit go.redpanda.com slash duckbill. Red Panda, because Kafka shouldn't
cause you nightmares. I think that there's a common misconception in the industry that
people are going to either have ancient servers rotting away in racks, or they're going to build
something greenfield the way that we see done on keynote stages all the time of companies that have been around with this architecture for less than 18 months. In practice, I find it's
awfully frequent that this is much more of a spectrum and a case-by-case per workload basis.
I haven't met too many data center companies where everything's a disaster that the cloud
companies like to paint it as. And vice versa, I also have never yet seen an architecture that really existed as described in a keynote presentation.
I 100% agree with you there.
And, you know, it's not clean cut from that perspective.
And also, you're also forgetting the messy middle as well, right?
Like often what happens is there's a transition.
If you don't start off cloud native from day one, you do need to transition there from your monolithic
applications, from your VM-based architectures. And often the use case can't transform over
perfectly. What ends up happening is you start moving some functionality and containerizing
some functionality, and that still has dependencies between the old architecture and the new
architecture. And companies have to live in this middle state, perhaps,
for a very long time. So it's definitely true. It's not a clean-cut transition.
But you can think about that middle state as actually one that a lot of companies struggle
with because all of a sudden, you only have a partial view of the world or what's happening
with your old tools. They're not well-suited for the new environments. Perhaps you've got to start
bringing new tools and new ways of doing things in your new environments, and they're not perhaps
the best suited for the old environments as well. So you do actually end up in this middle state
where you need a good solution that can really handle both because there are a lot of interdependencies
between the two. And it's actually one of the things that we strive to do here at Chronosphere
is to help companies through that transition. So it's not just all of your new use cases, and it's not just all of your new environments; actually helping companies through this transition is pretty critical as well. My question
for you is that given that you have a, I don't want to say a preordained architecture that your
customers have to use, but there are certain assumptions you've made based upon both their scale and the environment in which they're operating. How heavy of a lift
is it for them to wind up getting Chronosphere into their environments? Just because it seems
to me that it's not that hard to design an architecture on a whiteboard that can meet
almost any requirement. The messy part is figuring out how to get something that resembles that into place on a pre-existing extant architecture.
Yeah, I would say it's something we've spent a lot of time on.
The good thing for the industry overall, for the observability industry, is that open source standards have now been created and exist where they didn't before. So if you look at the APM-based view, it was all proprietary agents producing
the data themselves that would only really work with one vended product. Whereas if you look at
a modern environment, the production of the data has actually been shifted from the vendor down to
the companies themselves. And they'll be producing these pieces of data in open source standard
formats like OpenTelemetry for distributed traces, or perhaps Prometheus for metrics. So the good thing is that for all of your new environments, there's a standard way to produce all of this data, and you can send all that data to whichever vendor you want on the back end. So it just makes the implementation for the new environments so much easier. Now, for the legacy environments, or if a company is shifting over from an existing tool, there is actually a messy migration there, because often you're trying to replace proprietary formats and proprietary ways of producing data with open source standard ones.
So that's just something that we as Chronosphere come in and view as a particular problem that we need to solve. And we take the responsibility of solving it for a company, because what we're trying to sell companies is not just a tool; we're really trying to sell them a solution to the problem. And the problem is they need an observability solution end to end.
So this often involves us coming in and helping them, you can imagine, not just convert the data
types over, but also move over existing dashboards, existing alerts. There's a huge piece of lift at the end
that perhaps every developer in a company would have to do
if we didn't come in and do it on behalf of those companies.
So it's just an additional responsibility.
It's not an easy thing to do.
We've built some tooling that helps with it
and we just spend a lot of manual hours going through this,
but it's a necessary one in order to help a company transition.
Now, the good thing is once they have transitioned into the new way of doing things and they are dependent on open source standard formats, they are no longer locked in.
So, you know, you can imagine future transitions will be much easier.
However, the current one does have to go through a little bit of effort.
I think that's probably fair. And then
there's no such thing in my experience as an easy deployment for something that is large enough to
matter. And let's be clear, people are not going to be deploying something as large scale as
Chronosphere on a lark. This is going to be when they have a serious application with serious
observability challenges. So it feels like on some level that even doing a POC is a tricky proposition just due to the instrumentation part of it.
Something I've seen is that very often enterprise sales teams will decide that by the time that they can get someone to successfully pull off a POC,
at that point, the deal win rate is something like 95%, just because no one wants to try that in a bake-off with something else.
Yeah, I'd say that we do see high pilot conversion rates, to your point. For us, it's perhaps a
little bit easier than other solutions out there in the sense that I think with our type of
observability tooling, the good thing is an individual team could pick this up for their one
use case, and they could get value out of it.
It's not that every team across an environment
or every team in an organization needs to adopt it.
So while generally we do see that a company would want to pilot
and it's not something you can play around online with by yourself
because it does need a particular deployment,
it does need a little bit of setup,
generally one single team can come and perform that
and see value out of the tool.
And that sort of value can be extrapolated
and applied to all the other teams as well.
So you're correct, but it hasn't been a huge lift.
And these processes end-to-end,
we've seen be as short as perhaps 30-something days end-to-end,
which is generally a pretty fast
moving process there. I guess on some level, I'm still trying to wrap my head around the idea of
the scale that you operate at, just because as you mentioned, this came out of Uber, which is
beyond imagining for most people. And you take a look at a wide variety of different use cases.
And in my experience, it's never been, holy crap,
we have no observability and we need to fix that. It's there are a variety of systems in place that
just are not living up to the hopes, dreams, and potential that they had when they were originally
deployed, either due to growth or due to lack of product fit or the fact that it turns out in a
post zero interest rate world,
most people don't want to have a pipeline of 20 discrete observability tools.
Yep. Yep. A hundred percent. And to your point there, ultimately, that's our goal. And, you know, in many companies, we're replacing up to six to eight tools with a single platform. It's always great to do that, but it definitely doesn't happen overnight; it takes time. You know, you can imagine in a pilot, when you're looking at it, we're picking a few of the use cases to demonstrate what our tool could do across many other use cases. And then generally during the onboarding time, or perhaps over a period of months, or perhaps even a year plus, we then go onboard these use cases piece by piece.
So it's definitely not a quick overnight process there.
But you can imagine something that can help each end developer
in a particular company be more effective
and something that can really help move the bottom line
in terms of far better price efficiency.
These things are generally not things that are quick fixes.
These are generally things that do take some time and a little bit of investment to achieve
the results.
So a question I do have for you, given that I just watched an awful lot of people talking
about observability for three days at Monitorama, what are people not talking about?
What did you not see discussed that you think should be?
Yeah, one thing I think often
gets overlooked, and especially in today's climate, is I think observability gets relegated to a cost
center. It's something that every company must have, every company has today. And it's often
looked at as a tool that gives you insights about your infrastructure and your applications. And
it's a backend tool, something you have to have, something you have to pay for, and it doesn't really move the direct needle for the business
top line. And I think that's often something that companies don't talk about enough. And,
you know, from our experience at Uber and through most of the companies that we work with here at
Chronosphere, yes, there are infrastructure problems and application level problems that
we help companies solve. But
ultimately, the more mature organizations, at least when it comes to observability, are often starting to
get real-time insights into the business more than the application layer and the infrastructure layer.
And if you think about it, for companies that are cloud-native architected, there's not one single
endpoint or one single application that fulfills a single customer
request. So even if you could look at all the individual pieces, the actual work we have to do for customers in our products and services spans across so many of them that often you need
to introduce a new view, a view that's just focused on your customers, just focused on the business
and sort of apply the same type of techniques to your business as you do for your back-end infrastructure. Now, this isn't a replacement for your BI tools. You
still need those. But what we find is that BI tools are more used for longer term strategic
decisions, whereas you may need to do a lot of more tactical business operational
functions based on having a live
view of your business. So what we find is often observability is only ever thought about for
infrastructure. It's only ever thought about as a cost center. But ultimately, observability tooling can actually add a lot directly to your top line by giving you visibility into the products and services that make up that top line. And I would say the more mature organizations that we work with here at
Chronosphere all have their executives looking at, you know, monitoring dashboards to really
get a good sense of what's happening in their business in real time. So I think that's something
that hopefully a lot more companies evolve into over time and they really see the full
benefit of observability and what it can do
to a business's top line. I think that's probably a fair way of approaching it. It seems similar in
some respects to what I tend to see over in the cloud cost optimization space. People often want
to have something prescriptive of do this, do that, do the other thing. But it depends entirely on
what the needs of the business are internally. It depends upon the stories that they wind up working with. It depends really on what their constraints are, what their architectures are
doing. Very often it's a, let's look and figure out what's going on. And accidentally they discover
they can blow 40% off their spend by just deleting things that aren't in use anymore.
That becomes increasingly uncommon with scale, but it's still one of those questions of
what do we do here and how?
Yep. A hundred percent.
I really want to thank you for taking the time to speak with me today about what you're seeing.
If people want to learn more, where's the best place for them to find you?
Yeah, the best place is probably going to our website, chronosphere.io to find out more about the company. Or if you want to chat with me directly, LinkedIn is probably the best place
to come find me via my name. And we will, of course, put links to both of those things in
the show notes. Thank you so much for suffering the slings and arrows I was able to throw at you
today. Thank you for having me, Corey. Always a pleasure to speak with you and looking forward
to our next conversation. Likewise. Martin Mao, CEO and co-founder of Chronosphere,
this promoted guest episode has been brought to us by Chronosphere here on Screaming in the Cloud. And I'm cloud economist, Corey Quinn. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment that I will never notice because I have an observability gap.
If your AWS bill keeps rising and your blood pressure is doing the same, then you need
the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business,
and we get to the point.
Visit duckbillgroup.com to get started.