Screaming in the Cloud - Episode 15: Nagios was the Original Call of Duty
Episode Date: June 20, 2018

Let's chat about the Cloud and everything in between. People in this world are pretty comfortable with not running physical servers on their own, and instead trusting someone else to run them. Yet people still suffer from the psychological barrier of thinking they need to build, design, and run their own monitoring system. Fortunately, more companies are turning to Datadog. Today, we're talking to Ilan Rabinovitch, Datadog's vice president of product and community. He spends his days diving into container monitoring metrics, collaborating with Datadog's open source community, and evangelizing observability best practices. Previously, Ilan led infrastructure and reliability engineering teams at various organizations, including Ooyala and Edmunds.com. He's active in the open source and DevOps communities, where he is a co-organizer of events such as SCALE and Texas Linux Fest.

Some of the highlights of the show include:

Datadog is well known, especially because it is a frequent sponsor
More organizations recognize that their core competency is not monitoring or managing servers
Monitoring and metrics are a big data problem; Datadog takes monitoring off your plate
Alternate ways, other than using Nagios, to monitor instances and regenerate configurations
Datadog is well positioned to identify patterns when there is a widespread underlying infrastructure issue
Trends of moving from on-premises to cloud; serverless is on the horizon
How trends shape the evolution of Datadog; adjusting tools to monitor customers' environments
Datadog's scope is enormous; the company tries to present relevant information as the scale of what it's watching continues to grow
Datadog's pricing is straightforward and simple to understand; what cloud providers charge for the API calls and metrics that monitoring pulls is less clear
Single pane of glass: too much data to fit into small areas (dashboards)
"Why didn't monitoring catch this?" Alerts need to be actionable and relevant
How to use Datadog's workflow for setting alerts and work metrics
Datadog's first Dash user conference will be held in July in New York; it addresses how to solve real business problems and how to scale and speed up your organization

Links:

Ilan Rabinovitch on Twitter
Datadog Docker Adoption Survey Results
Rubric for Setting Alerts/Work Metrics
Dash Conference
re:Invent
Nagios
Transcript
Hello, and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn. I'm joined today
by Ilan Rabinovitch of Datadog, where he's the VP of product and community. Welcome to the show,
Ilan. Thanks for having me, Corey. No, a pleasure. Before we dive in, I want to call out that Datadog
is relatively well known in the, I guess, operational space, in no small part due to
the fact that you folks sponsor an awful
lot of things. I want to be very clear, this is not an episode that you are sponsoring. This is
having a conversation with you. It is not pay to play. It always feels a little weird to have folks
who have sponsored things that I've worked on on the show and not call that out. So thank you for
your support, but that's not what's going on here.
Yeah, I mean, we always enjoy the newsletter
and all the Corey talks,
but yeah, this just sounded like
a fun time to chat with you
about the cloud and everything in between.
Absolutely.
So let's start with that.
If you take a look at the history of monitoring
or observability or whatever it is
we're calling it this hour,
the world has more or less grown fairly comfortable with the idea of not running physical servers themselves and trusting someone else to run them, be it one of the large cloud providers, another of the large cloud providers, etc.
But there still seems to be a bit of a psychological barrier where people will say things like, oh, I'm absolutely not going to run my own servers.
That's lunacy.
And then immediately follow it up with, but we absolutely have to build, design, and run our own monitoring system.
How do you see that evolving?
And how do you, I guess, combat that, frankly, ridiculous perspective?
Surprisingly, it's actually not that big of a challenge these days
to get folks to do that.
I think we're now at a spot in our industry
where more and more organizations are realizing
that their core competency is not monitoring,
it's not managing servers,
it's not necessarily, you know,
installing and racking switches in a data center.
You're focused on something else, right?
If your customers are consuming your chat platform,
then you want to build the best chat platform ever.
And for that, you want to make sure
that you have the best monitoring system ever
so that you can ensure you're addressing the infrastructure
or code challenges that you may be encountering
that you have data to make your decisions based on.
And it turns out that Datadog has an amazing monitoring product
and we don't have a lot of challenges getting customers to use us.
Even when they have on-prem servers, they're happy to take advantage of that.
Monitoring and metrics, it's a big data problem.
Whether it's metrics or tracing or logs.
These are difficult problems. If you're having to run an indexing system for your logs,
and your logs are generating terabytes or petabytes or whatever it might be a day,
that's a really complex database that may be as difficult for you to interact with as the clickstream logs from your consumer
website, for example.
And so why would you want your teams focusing on that part when they could be focusing on
building out the platforms that your customers actually consume?
Similarly, if you're talking about storing metrics, whether it's columnar data stores
or time series databases or whatever else you might be coming up with, in some cases
the size of your monitoring data is bigger than the size of the data
that your customers actually interact with. These are difficult problems
and we focus on them every day and so our customers are willing to let us
take that burden off their plate and specialize in making monitoring great.
Which makes an awful lot of sense.
I started off my career as a systems administrator in on-prem data centers, and monitoring was sort of one of the things that fell to me.
It was always either it's invisible, or I'm in the doghouse because, surprise, something broke and I didn't think to monitor the thing that needed monitoring.
When I started moving to the cloud, it was, okay, we're going to take that same model and move it forward. So there I am,
spending my time building an AWS environment, which I was doing at the time. And all right, let's roll out
Nagios to monitor my instances. Oh, and we're in an auto-scaling group. And then I'm researching,
okay, how quickly can I regenerate the Nagios configuration when the auto-scaling group scales? And I realized midway through,
oh, I'm stupid. Wonderful. And that's sort of what opened my eyes to the idea of there
being alternate ways to do this. Yeah. I mean, you know, I was a customer of Datadog long before I
was an employee. And I tried my damnedest to automate my Nagios configs as fast as I could,
whether it was things polling the AWS API
and trying to update configs as fast as they could be
or Chef recipes or whatever it might be.
And it turns out, regardless of how tight that loop is,
Amazon or Google or Azure,
they're going to destroy a server faster than you can reload that Nagios config. And sometimes it's, you know, those configs take forever to reload, especially at scale.
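For anyone who never lived that loop, here is a minimal sketch of what it tends to look like, assuming boto3 credentials are already configured; the config path and reload command are made-up placeholders, and the point is how quickly this falls behind an auto-scaling group rather than how to do it well.

```python
# A minimal sketch of the "poll the AWS API and regenerate Nagios configs" loop
# described above. It illustrates the approach being replaced, not anything
# Datadog-specific. The config path and reload command are assumptions.
import subprocess

import boto3

HOST_TEMPLATE = """define host {{
    use        linux-server
    host_name  {name}
    address    {ip}
}}
"""


def running_instances(region="us-east-1"):
    """Yield (instance_id, private_ip) for every running EC2 instance."""
    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                yield instance["InstanceId"], instance.get("PrivateIpAddress")


def regenerate_nagios_config(path="/etc/nagios/conf.d/aws_hosts.cfg"):
    blocks = [
        HOST_TEMPLATE.format(name=instance_id, ip=ip)
        for instance_id, ip in running_instances()
        if ip
    ]
    with open(path, "w") as config_file:
        config_file.write("\n".join(blocks))
    # The reload is the slow, racy part: an auto-scaling group can add or
    # remove instances faster than this completes, which is the gap Corey
    # and Ilan are describing.
    subprocess.run(["systemctl", "reload", "nagios"], check=True)


if __name__ == "__main__":
    regenerate_nagios_config()
```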
So it's interesting.
Speaking to that end, one of the interesting challenges we see with very large services that everyone starts to use is there is a monitoring gap,
not necessarily on our own infrastructure,
but on seeing what the underlying platform is doing. We've all had times where two in the
morning we've woken up because our pagers are going off. It's not entirely clear what's broken.
And we effectively all prairie dog onto DevOps Twitter. And is your stuff broken? Is your stuff
broken? Oh, great. Effectively, Nagios
has become the original Call of Duty. And that's sort of a terrible pattern to fall into, because
status pages for these providers never update in as responsive a way as we might like them to.
There needs to be confirmation, there's process on their side. I'm not blaming them for this,
but it occurs to me that the monitoring companies like Datadog
are probably almost uniquely positioned to know almost before anyone else on the planet when
there's a widespread underlying infrastructure issue. And I'm not necessarily blaming cloud
providers alone. I'm talking about things like routing flaps. We're waiting for BGP reconvergence.
Have you seen that? And is there any, I guess, effort underway to start surfacing that data in a sanitized, safe way, without either exposing your customers or irritating large providers?
Yeah, we definitely are able to see those patterns in near real time.
When something breaks in one of the cloud providers or a popular CDN goes down or what have you, we definitely see those patterns amongst our customers.
There's a lot of work that needs to go into things like anonymizing that data, and being mindful that not every customer is willing to share that type of data, etc.
But there's definitely some patterns there.
We've done a little bit more work on this on the side of technologies folks use.
So earlier this week on Wednesday, we released our annual Docker adoption report.
And so one of the things that we looked at there every year is sort of how are folks using
these technologies? What are they running in containers? How long are those containers living
for? Which orchestrators are popular, etc. And it's been interesting to be able to look at that and see the trends of our space at such a large scale in near real time.
We've seen Docker go from pre-1.0 in 2014 to where we are now: they just released Docker Enterprise 2.0,
I believe, at DockerCon this week.
And seeing the adoption trends around that skyrocket has been interesting.
You can also very clearly see on graphs, for example, Kubernetes hit 1.0 here.
All of a sudden, you know, containers skyrocket even further into popularity.
So we've similarly done things for other technologies,
orchestrators like ECS and Kubernetes
and Mesos around events like that.
So it's something that we're interested in diving more into,
both in terms of monitoring those cloud providers
where we're already pulling in all the metrics,
and that's from CDNs and caches and IaaS providers, as well as the
technologies that folks run on their VMs. There's some interesting trends there.
Right. It's always delicate to wind up presenting that data in a way that isn't
naming and shaming. "Ha, Twitter for Pets is crappy" is not a terrific narrative for that
to wind up turning into. But to your point of being a trends observer, there was a giant shift as the
world started moving away from on-premise into cloud. Same with, okay, taking long-running
instances and replacing those with ephemeral nodes. Then you saw the container revolution
that we're in the midst of, and now people are talking frantically about serverless.
And in the eight years Datadog has been around, we've seen a number of these giant shifts in industry. How does seeing these trends emerge, I guess, shape the direction and the evolution of Datadog,
the service slash product? I mean, as a product manager, for me, a lot of these studies
that we run actually start off as questions internally: what
are our customers doing, and what do I need to build
for them to be successful in their migration from VMs to containers, or from their on-prem
environment into the cloud?
What types of queries are they going to want to look at?
What types of metrics should we be pulling?
What integrations need more investment from me?
And so any product manager is going to be looking for that kind of data and studying
that.
It just turns out that in some cases, this becomes interesting for our external customers as well,
as we turn these into studies or into reports or blog posts about how to best monitor a technology or how to best take advantage of it.
The big thing that we've seen is just the fast rate at which things are turning.
If we look back on our studies, even from year to year on hosts and containers,
just a year ago we were seeing containers living around two days at a time
and VMs having mean lifetimes of 23 to 30 days,
depending on the environment and what have you.
We're now seeing containers,
if they're orchestrated,
having lifetimes of less than half a day in some cases.
So that changes a lot of how you would define normal and how you'd want to define normal in
your environment and how you want to monitor things. It also changes on how you want to
manage them. So making sure that we're adjusting our tools based on all that is important so that
our customers continue to be able to rely on us to monitor their environments.
That makes an awful lot of sense.
The challenge, of course, is you don't want to be the first person to support something new and find out you spent a lot of time and effort
diving into what the next big thing is going to be and then do a swing and a miss.
But you also don't want to be a trailing indicator and lagging.
It's interesting. From that perspective, I signed up for a Datadog account
somewhat recently. I am probably one of the smallest, crappiest customers you can possibly
imagine. I have a few Lambda functions, an API gateway, and an AWS bill that I obsessively watch,
and that's about it. So when I look inside of Datadog, the product, at those aspects of it, it feels like I'm just barely scratching
the surface of what it is that the product is capable of doing. I mean, the product is great,
don't get me wrong, but do you feel that it's challenging to both present information in a
relevant way to what someone's looking for, as well as not, I guess, overwhelming people as
they're coming in with a somewhat
naive perspective of, well, I just have these two hosts I want to monitor. What is all of this?
So, you know, from our perspective, the goal is to make it easy to monitor what you have and
identify what's important to you. So that may be making it point and click easy to enable a bunch
of integrations for the technologies you do care about.
It may mean using our machine learning capabilities around forecasting and anomaly detection to help you discover things before you realize that they were problems.
Or to help you do that without having to set a bunch of thresholds yourself.
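As a hedged sketch of what that looks like in practice, not an official recipe: Datadog's monitor query language has an anomalies() function, and the datadogpy client can create a monitor around it. The metric name, the notification handle, and the exact option values below are illustrative assumptions, worth checking against the current Monitor API docs.

```python
# Sketch: an anomaly-detection monitor created via Datadog's public API with
# datadogpy. The metric (shop.checkouts.completed), the @pagerduty handle, and
# the window/option values are placeholders; verify against current docs.
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")  # placeholder keys

api.Monitor.create(
    type="query alert",
    # anomalies() compares the series to its learned baseline instead of a
    # hand-picked static threshold.
    query="avg(last_4h):anomalies(avg:shop.checkouts.completed{*}, 'basic', 2) >= 1",
    name="Checkout volume looks anomalous",
    message="Checkouts are deviating from their usual pattern. @pagerduty-ecommerce",
    tags=["team:ecommerce"],
    options={
        "thresholds": {"critical": 1.0},
        # Anomaly monitors evaluate over trigger/recovery windows.
        "threshold_windows": {
            "trigger_window": "last_15m",
            "recovery_window": "last_15m",
        },
        "notify_no_data": False,
    },
)
```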
You know, with over 300 integrations out of the box right now, it's a little hard to say that every single one is going to be relevant to every single person.
But what's important to us is that when you do adopt a technology, we're already there to support you.
So last week, EKS launched.
On launch day, at their ecosystem day, we were there launching our EKS support.
Back in November, at re:Invent, Amazon announced Fargate.
We were working with them as launch partners to get that out the door and make sure that there was monitoring capabilities for it.
So, like you said, there's a lot in the platform, and maybe not every single integration or every metric there is for everybody.
But the last thing you want to do is be in a spot where you have picked a new system and you don't have a way to monitor it. Or worse yet, you don't have the data when you're
trying to resolve an incident or you're trying to work on a postmortem and figure out what went
wrong. And so we like to say that collecting the data is cheap. Not having the data when you need
it is the expensive part.
And I like that approach a fair bit.
The challenge, of course, on the other side,
is not even the cost of the service itself,
but in some ways the costs the service can incur.
An example of this is years ago,
I was working with a non-Datadog monitoring system,
but this is not any monitoring system's fault,
where I was hitting rate limits,
pulling data out of an AWS environment.
So, hey, if you want your data sooner,
go ahead and increase the API rate limit
was the automated notice we got.
Terrific, great.
So we reached out to AWS support,
and to their credit, they warned us:
we're willing to do this,
but at this rate, that's going to turn into something that winds up costing you a couple orders of magnitude more than the monitoring system does.
Are you sure?
And that's a difficult challenge where it's not just the cost of Datadog, which I will point out, is very straightforward and easy to understand at a glance.
It's the, what other things is this going to incur on the part of the cloud provider,
whose pricing is generally pretty close to inscrutable?
That's definitely a balancing act. I think we have knobs to help customers
address that challenge. We have some customers that want to grab every metric
as it drops into CloudWatch at the very second
that it showed up there at the finest granularity available,
and they want it now.
And we can do that.
We can turn that knob all the way up to 11
and basically poll CloudWatch all the time.
There are costs there from CloudWatch for doing so.
Other cloud providers have similar cost structures.
We also have the
ability, if there's a particular resource or a particular namespace you don't want
to monitor as much, to dial that one back. And so these are trade-offs. You have to choose
between the frequency at which we collect data and latency. And over time, hopefully some of the
costing models around how cloud providers expose those metrics may change.
But this is a choice that each person has to make for themselves.
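To make that knob concrete from the consumer side, here is a rough sketch of the tradeoff using boto3, not anything Datadog-internal; the instance ID is a placeholder.

```python
# Sketch of the granularity/cost knob when pulling metrics out of CloudWatch.
# Every GetMetricStatistics call is billed by AWS, so Period and how often you
# run this are direct cost levers. The instance ID is a placeholder.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")


def fetch_cpu(instance_id, period_seconds=300, lookback_minutes=60):
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(minutes=lookback_minutes),
        EndTime=end,
        # Turning the knob "up to 11" means a small Period (60s, or finer with
        # detailed monitoring) plus a tight polling loop: fresher data, more
        # billed API calls. Period=300 is the cheaper, coarser default.
        Period=period_seconds,
        Statistics=["Average"],
    )
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


if __name__ == "__main__":
    for point in fetch_cpu("i-0123456789abcdef0", period_seconds=60):
        print(point["Timestamp"], point["Average"])
```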
The nice thing is that a lot of the metrics that we gather within Datadog, they duplicate a lot of the metrics that are available from the cloud providers.
Are you interested in what your cloud provider thinks you're using CPU-wise?
Or are you interested in the actual CPU that your VM is seeing and
memory and network traffic and seeing that by process or by container? We can probably offer
you the visibility that you're looking for directly from within your host using our agent,
and you may not necessarily need some of that cloud data if you don't want it. It also is nice
to have it and be able to tie the two together if you're able to
do that. Of course, that's not possible with the PaaS-type services, whether it
be Redshift or ELBs or some other component. The only way to get at that data is CloudWatch.
And so we'll want to pull that data from there. Yeah, I think you're right. There's only so much
that you're going to be able to do without having the platform that has generated the metrics working hand-in-hand with your system.
If you're looking at this from an observer perspective, you're not going to be able to change everything about it. You're limited inherently to what is given to you.
To that end, something that often seems to arise every time I talk to someone about what they want from a monitoring system, the same phrase comes up all the time, which is a single pane of glass.
Great. Awesome.
But if you take a look at even a small environment in something like Datadog, where you can look at this from a lot of different axes, in order to gather all of that data onto a single pane of glass, terrific. You're turning an entire wall
of your office into a television that had better have retina capability, because the dashboards are going to be really
small to fit all of that there. How do you find that that winds up turning into something
that can be reasonably answered when customers ask about it? I mean, it sounds like on the one hand,
it's like arguing with Hacker News. Oh, that doesn't sound hard. I could build that in a weekend and come to find out it's a little
more complex than that. So I don't think that dashboards are the answer to everything, right?
At least not having every metric that you could possibly look at on the dashboard above your head
in a virtual NOC or on your extra monitor. You're not necessarily looking to have that all
there right now.
What you want to have above your head or on the dashboards in your NOC or in your office are the key metrics that tell you whether or not your customers are happy and whether
or not you're serving them well.
So if you're an e-commerce site, it might be, how many checkouts have
we had this hour, this second, what have you.
And these are what we call your work metrics.
These are the things that your customers are paying you for. And these are very good indicators as to whether or not your service
is working right now. Something may change, though. There might be an event, like maybe you
got a Super Bowl ad, maybe you went on Screaming in the Cloud, and now everybody wants to buy some
Datadog monitoring, and your usage jumps or it drops. And you're going to want to dig into that.
And that's where you're going to want to have additional dashboards and other things that you can query and tease out of your monitoring system.
And you're going to want to have all that data there.
But the idea that I'm going to have up on a single dashboard,
every single metric that I collect,
and I'm going to look at all of it in my NOC,
I don't think that that's reasonable.
You want systems like Datadog to be able to make it easy to explore that data,
make it easy to raise it up for you when something changes. So whether it's our anomaly detection or
other ML type capabilities that we use to quickly identify things that are changing,
that's what you're going to want to focus on. You want your systems to be able to
raise that for you. Hopefully that answers the question.
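For the "checkouts this hour" work metric mentioned above, the instrumentation side can be as small as the following sketch, assuming the Datadog agent's DogStatsD endpoint is listening locally; the metric names and the checkout function are invented for illustration.

```python
# Sketch: emitting a "work metric" (completed checkouts) via DogStatsD using
# datadogpy, assuming a local Datadog agent. Metric names and the checkout
# function are invented for illustration.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)


@statsd.timed("shop.checkout.duration", tags=["service:checkout"])
def complete_checkout(cart_total_cents: int) -> None:
    # ... payment and fulfillment would happen here ...
    # The count of successful checkouts is the number customers actually pay
    # you for: the kind of metric that belongs on the wall dashboard, and the
    # kind an anomaly or forecast monitor would watch.
    statsd.increment("shop.checkout.completed", tags=["service:checkout"])
    statsd.histogram(
        "shop.checkout.value_cents", cart_total_cents, tags=["service:checkout"]
    )


if __name__ == "__main__":
    complete_checkout(4_999)
```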
No, it does. But it also opens another one in the sense of when I was running ops teams, monitoring systems always felt like a relatively
thankless thing to work on because invariably people tended to ignore it and never look at
the dashboards until there was an issue where something broke. And the question was always
raised after the fact of, well, why didn't monitoring catch this? So you're always building new checks and new alarms that alert when particular
patterns hit, and you're persistently fighting the last war when that happens. And if you continue
following that to its logical conclusion, well, we'll just alert on everything. Great. Now, in a
typical day, you're getting paged 4,000 times. That is not going to make anyone happy.
Their cell phones are running out of power after four hours.
How do you wind up scaling it back?
And this may not even be a product question.
This may be a philosophy of monitoring question.
But I'm curious as to how you see that.
So I definitely think it's a philosophy of monitoring question.
I've lived through that approach in my career as well, right?
Every time something
breaks, let's create an alert for that. And now we're alerting people on every NTP time skew on
every machine because one time it caused an issue for us. You want to make sure that your alerts
are actionable. And so I think starting with those work metrics, the ones that are actually
relevant to your customers and to the services you provide, and figuring out how those systems behave, that's going to be your first step.
And it's also important to clean that up fairly regularly over time.
If you're seeing something noisy, get rid of it.
If you're seeing something cause issues repeatedly, it's not just create an alert for it.
It's probably also fix it so it's not happening as frequently. I think it's on monitoring systems like Datadog,
and others in the space, to try to make that a less manual,
less human process. We should be looking at your metrics and identifying things for you
as they happen and raising them for you, so that you're not in this never-ending battle
to create alerts for every single metric every time.
I also think that in some cases,
a lot of this data doesn't need to be alerted on,
but you do want to have it.
So collecting it is one thing,
alerting on it is something else.
But you never want to be that team that's getting alerts
just to prove that the data is flowing.
One of the things I used to do
when I was more in the operations space,
before I joined Datadog,
was consult with the teams in my organization
and say, this week you had the largest number of alerts
across all the other teams in the organization.
Let's sit down for an hour or two.
Let's look at what you're paging on.
Let's look at your systems
and see how we can make them either more resilient, or let's look at your monitoring and see how we can make
it more actionable. I once went to a team that had gotten 10,000 pages
in a week. There is no way that they are sleeping if they are responding to every one of those.
And more likely, they're just ignoring the pager under their pillow.
And in their case, what they had done is, in many cases, these alerts were more of a heartbeat,
a the-system-is-alive signal. They were not actionable, and so that
was a problem. We were able to sit down and clean that up and sort of flip
things around and get them to a more manageable spot. But again, a lot of this is around that
alerting and monitoring philosophy. It's not necessarily about the
tooling; it's about deciding on what you care about.
And you're right.
The counterpoint is that when you have an outage,
you didn't know you cared about a thing
until right after you really could have used an alert on this.
An example would be if your site slams to a halt one day
and there's an incident and the investigation determines,
oh, it's because the primary database had
its disk fill up.
And then you pull up the graph and, for the past number of months, you see the line
getting closer and closer and closer to the top of the graph, and then it hits, and then
the incident is triggered.
It's not the most defensible thing to have on a screen in an after-action report while doing
a postmortem of why the site went
down, and you have a bunch of executives and partners who are very upset by that.
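The arithmetic behind catching that ahead of time is almost embarrassingly simple; here is a toy sketch with invented sample data (Datadog's forecast() monitors do a more robust version of this).

```python
# Toy sketch of "the line has been creeping toward the top of the graph":
# fit a line to recent disk-usage samples and estimate when it crosses 100%.
# The sample data is invented; forecast() monitors do this more robustly.
import numpy as np


def days_until_full(day_index, used_pct):
    """Linear extrapolation; returns None if usage is flat or shrinking."""
    slope, _intercept = np.polyfit(day_index, used_pct, 1)  # percent per day
    if slope <= 0:
        return None
    return (100.0 - used_pct[-1]) / slope


if __name__ == "__main__":
    days = np.arange(30)  # last 30 daily samples
    used = 60 + 0.9 * days + np.random.normal(0, 0.5, size=30)  # ~0.9%/day growth
    remaining = days_until_full(days, used)
    if remaining is not None and remaining < 14:
        print(f"Primary DB disk projected full in ~{remaining:.0f} days; page someone now.")
```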
Yeah. So we have a sort of a rubric or a mental model that we suggest you go through
that we've written about on the Datadog site. I'll send a link over for the show notes.
And again, as I mentioned before, we tend to suggest taking a look at
those work metrics and then working your way backwards.
So you're an e-commerce website, and the metric you
care about is things ending up in shopping carts and things actually getting checked
out of those shopping carts.
That's a top level metric that you want to alert on and probably that you want to have
on your dashboards because it determines the health of your business, right? Are you making
money today or not? Are your customers actually happy or not? Great. And now work backwards from
that and figure out what are the resources that go into making that. And if you do that for each
of your systems as you're building them, you're going to get to the point where you're like,
oh, I have a database. What does that database depend on? Oh, it depends on disk. That's not to say you're never going to miss
anything, but that workflow is pretty helpful for figuring out what data would be actionable and
when. And the thing is, in most cases, you don't have just one person on a team doing these things,
right? And it's not just one person on call. Each team has their own work metrics, right?
The team that's running storage for your
underlying databases,
their work metric is going to be around IOPS
and how much storage is available. If they alerted
on that, you probably would have avoided
that outage you just talked about.
Your database team, their work
metric is how many queries per second are they returning
and how long each of those queries is taking.
If they alert on that, they're going to notice
that, hey, inserts are failing right now.
We should catch that incident before it happens.
We should fix this before it impacts our users.
And so if you work your way down the stack that way,
you're going to catch the big things that are important.
And those are the areas that you really want to focus on.
Everything else, I think, is data that you want to have around
for troubleshooting purposes.
But I don't care if CPU is at 90% if my site's still working. That's the most useless
thing to page somebody on in the middle of the night. Absolutely. That's right up there with
load average is high. There are 15 different factors that weigh into that. Great. Tell me
the real world impact. And it's on one system, and I have 200 of those. Maybe I don't care about that
particular cattle hanging out in that environment.
One other thing that's coming up, I believe,
in a month or so is your Dash conference.
Yeah.
So Dash is our first user conference for Datadog.
And it's coming up on July 11th and 12th in New York.
And so if folks are in the area,
we'd love to have you join us.
We have some great presentations
from folks like Shopify and Google
and DraftKings
and a number of other organizations
talking about how they're scaling up
and speeding up their infrastructure,
their teams and their applications.
And so this is not two days
of how do I monitor X,
but rather an opportunity to learn about
how folks are solving real business problems.
So whether it be Shopify talking about
how to scale up their infrastructure 3X
while also moving it to GCP at the same time
and containerizing it,
or the folks at Segment talking about
how they've built a culture of shared outages
within their organization, and how they're taking a lot of the challenges Corey mentioned earlier, around what you alert on and how you prevent problems from recurring, and building that into their processes within the organization.
You know, there's a lot of opportunities to learn about how to, again, scale up and speed up your organization.
And, you know, there should be some fun news from Datadog on various
features as well. So if you're in the area, or want to travel out, join us in New York this summer,
July 11th and 12th. But yeah, hope to see you all in New York. And thanks for having us.
No, thank you very much for taking the time to speak with me. This has been Ilan Rabinovitch
of Datadog. My name's Corey Quinn, and this is Screaming in the Cloud.
This has been this week's episode of Screaming in the Cloud.
You can also find more
Corey at screaminginthecloud.com
or wherever fine snark is sold.