Screaming in the Cloud - Episode 15: Nagios was the Original Call of Duty
Episode Date: June 20, 2018

Let's chat about the Cloud and everything in between. People in this world are pretty comfortable with not running physical servers on their own, and instead trusting someone else to run them. Yet people still suffer from the psychological barrier of thinking they need to build, design, and run their own monitoring system. Fortunately, more companies are turning to Datadog. Today, we're talking to Ilan Rabinovitch, Datadog's vice president of product and community. He spends his days diving into container monitoring metrics, collaborating with Datadog's open source community, and evangelizing observability best practices. Previously, Ilan led infrastructure and reliability engineering teams at various organizations, including Ooyala and Edmunds.com. He's active in the open source and DevOps communities, where he is a co-organizer of events such as SCALE and Texas Linux Fest.

Some of the highlights of the show include:

Datadog is well known, especially because it is a frequent sponsor
More organizations recognize that their core competency is not monitoring or managing servers
Monitoring and metrics are a big data problem; Datadog takes monitoring off your plate
Alternate ways, other than using Nagios, to monitor instances and regenerate configurations
Datadog is well positioned to identify patterns when there is a widespread underlying infrastructure issue
Trends of moving from on-premises to cloud; serverless is on the horizon
How trends shape the evolution of Datadog; adjusting tools to monitor customers' environments
Datadog's scope is enormous; the company tries to present relevant information as the scale of what it's watching continues to grow
Datadog's pricing is straightforward and simple to understand; what cloud providers charge for the API calls and metrics that monitoring pulls is less clear
Single pane of glass: too much data to fit into small areas (dashboards)
"Why didn't monitoring catch this?" Alerts need to be actionable and relevant
How to use Datadog's workflow for setting alerts and work metrics
Datadog's first Dash user conference will be held in July in New York; it addresses how to solve real business problems and how to scale and speed up your organization

Links:

Ilan Rabinovitch on Twitter
Datadog Docker Adoption Survey Results
Rubric for Setting Alerts/Work Metrics
Dash Conference
re:Invent
Nagios
Transcript
Hello, and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn. I'm joined today
by Ilan Rabinovitch of Datadog, where he's the VP of product and community. Welcome to the show,
Ilan. Thanks for having me, Corey. No, a pleasure. Before we dive in, I want to call out that Datadog
is relatively well known in the, I guess, operational space, in no small part due to
the fact that you folks sponsor an awful
lot of things. I want to be very clear, this is not an episode that you are sponsoring. This is
having a conversation with you. It is not pay to play. It always feels a little weird to have folks
who have sponsored things that I've worked on on the show and not call that out. So thank you for
your support, but that's not what's going on here.
Yeah, I mean, we always enjoy the newsletter
and all the Corey talks,
but yeah, this just sounded like
a fun time to chat with you
about the cloud and everything in between.
Absolutely.
So let's start with that.
If you take a look at the history of monitoring
or observability or whatever it is
we're calling it this hour,
the world has more or less grown fairly comfortable with the idea of not running physical servers themselves and trusting someone else to run them, be it one of the large cloud providers, another of the large cloud providers, etc.
But there still seems to be a bit of a psychological barrier where people will say things like, oh, I'm absolutely not going to run my own servers.
That's lunacy.
And then immediately follow it up with, but we absolutely have to build, design, and run our own monitoring system.
How do you see that evolving?
And how do you, I guess, combat that, frankly, ridiculous perspective?
Surprisingly, it's actually not that big of a challenge these days
to get folks to do that.
I think we're now at a spot in our industry
where more and more organizations are realizing
that their core competency is not monitoring,
it's not managing servers,
it's not necessarily, you know,
installing and racking switches in a data center.
You're focused on something else, right?
If your customers are consuming your chat platform,
then you want to build the best chat platform ever.
And for that, you want to make sure
that you have the best monitoring system ever
so that you can ensure you're addressing the infrastructure
or code challenges that you may be encountering
that you have data to make your decisions based on.
And it turns out that Datadog has an amazing monitoring product
and we don't have a lot of challenges getting customers to use us.
Even when they have on-prem servers, they're happy to take advantage of that.
Monitoring and metrics, it's a big data problem.
Whether it's metrics or tracing or logs.
These are difficult problems. If you're having to run an indexing system for your logs,
and your logs are generating terabytes or petabytes or whatever it might be a day,
that's a really complex database that may be as difficult for you to interact with as the clickstream logs from your consumer
website, for example.
And so why would you want your teams focusing on that part when they could be focusing on
building out the platforms that your customers actually consume?
Similarly, if you're talking about storing metrics, whether it's columnar data stores
or time series databases or whatever else you might be coming up with, in some cases
the size of your monitoring data is bigger than the size of the data
that your customers actually interact with. These are difficult problems
and we focus on them every day and so our customers are willing to let us
take that burden off their plate and specialize in making monitoring great.
Which makes an awful lot of sense.
I started off my career as a systems administrator in on-prem data centers, and monitoring was sort of one of the things that fell to me.
It was always either it's invisible, or I'm in the doghouse because, surprise, something broke and I didn't think to monitor the thing that needed monitoring.
When I started moving to the cloud, it was, okay, we're going to take that same model and move it forward. So there I am,
spending my time building an AWS environment, which I was doing at the time. And all right, let's roll out
Nagios to monitor my instances. Oh, and we're in an auto-scaling group. And then I'm researching,
okay, how quickly can I regenerate the Nagios configuration when the auto-scaling group scales? And I realized midway through,
oh, I'm stupid. Wonderful. And that's sort of what opened my eyes to the idea of there
being alternate ways to do this. Yeah. I mean, you know, I was a customer of Datadog long before I
was an employee. And I tried my damnedest to automate my Nagios configs as fast as I could,
whether it was things polling the AWS API
and trying to update configs as fast as they could be
or Chef recipes or whatever it might be.
And it turns out, regardless of how tight that loop is,
Amazon or Google or Azure,
they're going to destroy a server faster than you can reload that Nagios config. And sometimes it's, you know, those configs take forever to reload, especially at scale.
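For anyone who never lived that loop, here is a minimal sketch of what it tends to look like, assuming boto3 credentials are already configured; the config path and reload command are made-up placeholders, and the point is how quickly this falls behind an auto-scaling group rather than how to do it well.

```python
# A minimal sketch of the "poll the AWS API and regenerate Nagios configs" loop
# described above. It illustrates the approach being replaced, not anything
# Datadog-specific. The config path and reload command are assumptions.
import subprocess

import boto3

HOST_TEMPLATE = """define host {{
    use        linux-server
    host_name  {name}
    address    {ip}
}}
"""


def running_instances(region="us-east-1"):
    """Yield (instance_id, private_ip) for every running EC2 instance."""
    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                yield instance["InstanceId"], instance.get("PrivateIpAddress")


def regenerate_nagios_config(path="/etc/nagios/conf.d/aws_hosts.cfg"):
    blocks = [
        HOST_TEMPLATE.format(name=instance_id, ip=ip)
        for instance_id, ip in running_instances()
        if ip
    ]
    with open(path, "w") as config_file:
        config_file.write("\n".join(blocks))
    # The reload is the slow, racy part: an auto-scaling group can add or
    # remove instances faster than this completes, which is the gap Corey
    # and Ilan are describing.
    subprocess.run(["systemctl", "reload", "nagios"], check=True)


if __name__ == "__main__":
    regenerate_nagios_config()
```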
So it's interesting.
Speaking to that end, one of the interesting challenges we see with very large services that everyone starts to use is there is a monitoring gap,
not necessarily on our own infrastructure,
but on seeing what the underlying platform is doing. We've all had times where two in the
morning we've woken up because our pagers are going off. It's not entirely clear what's broken.
And we effectively all prairie dog onto DevOps Twitter. And is your stuff broken? Is your stuff
broken? Oh, great. Effectively, Nagios
has become the original Call of Duty. And that's sort of a terrible pattern to fall into, because
status pages for these providers never update in as responsive a way as we might like them to.
There needs to be confirmation, there's process on their side. I'm not blaming them for this,
but it occurs to me that the monitoring companies like Datadog
are probably almost uniquely positioned to know almost before anyone else on the planet when
there's a widespread underlying infrastructure issue. And I'm not necessarily blaming cloud
providers alone. I'm talking about things like routing flaps. We're waiting for BGP reconvergence.
Have you seen that? And is there any, I guess, effort underway to start surfacing that data in a sanitized, safe way, without either exposing your customers or irritating large providers?
Yeah, we definitely are able to see those patterns in near real time.
When something breaks in one of the cloud providers or a popular CDN goes down or what have you, we definitely see those patterns amongst our customers.
There's a lot of work that needs to go into things like anonymizing that data, and being mindful that not every customer is willing to share that type of data, etc.
But there's definitely some patterns there.
We've done a little bit more work on this on the side of technologies folks use.
So earlier this week on Wednesday, we released our annual Docker adoption report.
And so one of the things that we looked at there every year is sort of how are folks using
these technologies? What are they running in containers? How long are those containers living
for? Which orchestrators are popular, etc. And it's been interesting to be able to look at that and see the trends of our space at such a large scale in near real time.
We've seen Docker go from pre-1.0 in 2014 to where we are now: they just released Docker Enterprise 2.0,
I believe, at DockerCon this week.
And seeing the adoption trends around that skyrocket has been interesting.
You can also very clearly see on graphs, for example, Kubernetes hit 1.0 here.
All of a sudden, you know, containers skyrocket even further into popularity.
So we've similarly done things for other technologies,
orchestrators like ECS and Kubernetes
and Mesos around events like that.
So it's something that we're interested in diving more into,
both in terms of monitoring those cloud providers
where we're already pulling in all the metrics,
and that's from CDNs and caches and IaaS providers, as well as the
technologies that folks run on their VMs. There's some interesting trends there.
Right. It's always delicate to wind up presenting that data in a way that isn't
naming and shaming. "Ha, Twitter for Pets is crappy" is not a terrific narrative for that
to wind up turning into. But to your point of being a trends observer, there was a giant shift as the
world started moving away from on-premise into cloud. Same with, okay, taking long-running
instances and replacing those with ephemeral nodes. Then you saw the container revolution
that we're in the midst of, and now people are talking frantically about serverless.
And in the eight years Datadog has been around, we've seen a number of these giant shifts in industry. How does seeing these trends emerge, I guess, shape the direction and the evolution of Datadog,
the service slash product? I mean, as a product manager, for me, a lot of these studies
that we run actually start off as questions internally: what
are our customers doing, and what do I need to build
for them to be successful in their migration from VMs to containers, or from their on-prem
environment into the cloud?
What types of queries are they going to want to look at?
What types of metrics should we be pulling?
What integrations need more investment from me?
And so any product manager is going to be looking for that kind of data and studying
that.
It just turns out that in some cases, this becomes interesting for our external customers as well,
as we turn these into studies or into reports or blog posts about how to best monitor a technology or how to best take advantage of it.
The big thing that we've seen is just the fast rate at which things are turning.
If we look back on our studies, even from year to year on hosts and containers,
just a year ago we were seeing containers living around two days at a time
and VMs having mean lifetimes of 23 to 30 days,
depending on the environment and what have you.
We're now seeing containers,
if they're orchestrated,
having lifetimes of less than half a day in some cases.
So that changes a lot of how you would define normal and how you'd want to define normal in
your environment and how you want to monitor things. It also changes on how you want to
manage them. So making sure that we're adjusting our tools based on all that is important so that
our customers continue to be able to rely on us to monitor their environments.
That makes an awful lot of sense.
The challenge, of course, is you don't want to be the first person to support something new and find out you spent a lot of time and effort
diving into what the next big thing is going to be and then do a swing and a miss.
But you also don't want to be a trailing indicator and lagging.
It's interesting. From that perspective, I signed up for a Datadog account
somewhat recently. I am probably one of the smallest, crappiest customers you can possibly
imagine. I have a few Lambda functions, an API gateway, and an AWS bill that I obsessively watch,
and that's about it. So when I look inside of Datadog, the product, at those aspects of it, it feels like I'm just barely scratching
the surface of what it is that the product is capable of doing. I mean, the product is great,
don't get me wrong, but do you feel that it's challenging to both present information in a
relevant way to what someone's looking for, as well as not, I guess, overwhelming people as
they're coming in with a somewhat
naive perspective of, well, I just have these two hosts I want to monitor. What is all of this?
So, you know, from our perspective, the goal is to make it easy to monitor what you have and
identify what's important to you. So that may be making it point and click easy to enable a bunch
of integrations for the technologies you do care about.
It may mean using our machine learning capabilities around forecasting and anomaly detection to help you discover things before you realize that they were problems.
Or to help you do that without having to set a bunch of thresholds yourself.
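As a hedged sketch of what that looks like in practice, not an official recipe: Datadog's monitor query language has an anomalies() function, and the datadogpy client can create a monitor around it. The metric name, the notification handle, and the exact option values below are illustrative assumptions, worth checking against the current Monitor API docs.

```python
# Sketch: an anomaly-detection monitor created via Datadog's public API with
# datadogpy. The metric (shop.checkouts.completed), the @pagerduty handle, and
# the window/option values are placeholders; verify against current docs.
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")  # placeholder keys

api.Monitor.create(
    type="query alert",
    # anomalies() compares the series to its learned baseline instead of a
    # hand-picked static threshold.
    query="avg(last_4h):anomalies(avg:shop.checkouts.completed{*}, 'basic', 2) >= 1",
    name="Checkout volume looks anomalous",
    message="Checkouts are deviating from their usual pattern. @pagerduty-ecommerce",
    tags=["team:ecommerce"],
    options={
        "thresholds": {"critical": 1.0},
        # Anomaly monitors evaluate over trigger/recovery windows.
        "threshold_windows": {
            "trigger_window": "last_15m",
            "recovery_window": "last_15m",
        },
        "notify_no_data": False,
    },
)
```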
You know, with over 300 integrations out of the box right now, it's a little hard to say that every single one is going to be relevant to every single person.
But what's important to us is that when you do adopt a technology, we're already there to support you.
So last week, EKS launched.
On launch day, at their ecosystem day, we were there launching our EKS support.
Back in November, at re:Invent, Amazon announced Fargate.
We were working with them as launch partners to get that out the door and make sure that there was monitoring capabilities for it.
So, like you said, there's a lot in the platform, and maybe not every single integration or every metric there is for everybody.
But the last thing you want to do is be in a spot where you have picked a new system and you don't have a way to monitor it. Or worse yet, you don't have the data when you're
trying to resolve an incident or you're trying to work on a postmortem and figure out what went
wrong. And so we like to say that collecting the data is cheap. Not having the data when you need
it is the expensive part.
And I like that approach a fair bit.
The challenge, of course, on the other side,
is not even the cost of the service itself,
but in some ways the costs the service can incur.
An example of this is years ago,
I was working with a non-Datadog monitoring system,
but this is not any monitoring system's fault,
where I was hitting rate limits,
pulling data out of an AWS environment.
So, hey, if you want your data sooner,
go ahead and increase the API rate limit
was the automated notice we got.
Terrific, great.
So we reached out to AWS support,
and to their credit, they warned us:
we're willing to do this,
but at this rate, that's going to turn into something that winds up costing you a couple orders of magnitude more than the monitoring system does.
Are you sure?
And that's a difficult challenge where it's not just the cost of Datadog, which I will point out, is very straightforward and easy to understand at a glance.
It's the, what other things is this going to incur on the part of the cloud provider,
whose pricing is generally pretty close to inscrutable?
That's definitely a balancing act. I think we have knobs to help customers
address that challenge. We have some customers that want to grab every metric
as it drops into CloudWatch at the very second
that it showed up there at the finest granularity available,
and they want it now.
And we can do that.
We can turn that knob all the way up to 11
and basically poll CloudWatch all the time.
There are costs there from CloudWatch for doing so.
Other cloud providers have similar cost structures.
We also have the
ability, if there's a particular resource or a particular namespace you don't want
to monitor as much, to dial that one back. And so these are trade-offs. You have to choose
between the frequency at which we collect data and latency. And over time, hopefully some of the
costing models around how cloud providers expose those metrics may change.
But this is a choice that each person has to make for themselves.
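To make that knob concrete from the consumer side, here is a rough sketch of the tradeoff using boto3, not anything Datadog-internal; the instance ID is a placeholder.

```python
# Sketch of the granularity/cost knob when pulling metrics out of CloudWatch.
# Every GetMetricStatistics call is billed by AWS, so Period and how often you
# run this are direct cost levers. The instance ID is a placeholder.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")


def fetch_cpu(instance_id, period_seconds=300, lookback_minutes=60):
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(minutes=lookback_minutes),
        EndTime=end,
        # Turning the knob "up to 11" means a small Period (60s, or finer with
        # detailed monitoring) plus a tight polling loop: fresher data, more
        # billed API calls. Period=300 is the cheaper, coarser default.
        Period=period_seconds,
        Statistics=["Average"],
    )
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


if __name__ == "__main__":
    for point in fetch_cpu("i-0123456789abcdef0", period_seconds=60):
        print(point["Timestamp"], point["Average"])
```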
The nice thing is that a lot of the metrics that we gather within Datadog, they duplicate a lot of the metrics that are available from the cloud providers.
Are you interested in what your cloud provider thinks you're using CPU-wise?
Or are you interested in the actual CPU that your VM is seeing and
memory and network traffic and seeing that by process or by container? We can probably offer
you the visibility that you're looking for directly from within your host using our agent,
and you may not necessarily need some of that cloud data if you don't want it. It also is nice
to have it and be able to tie the two together if you're able to
do that. Of course, that's not possible with the PaaS-type services, whether it
be Redshift or ELBs or some other component. The only way to get at that data is CloudWatch.
And so we'll want to pull that data from there. Yeah, I think you're right. There's only so much
that you're going to be able to do without having the platform that has generated the metrics working hand-in-hand with your system.
If you're looking at this from an observer perspective, you're not going to be able to change everything about it. You're limited inherently to what is given to you.
To that end, something that often seems to arise every time I talk to someone about what they want from a monitoring system, the same phrase comes up all the time, which is a single pane of glass.
Great. Awesome.
But if you take a look at even a small environment in something like Datadog, where you can look at this from a lot of different axes, in order to gather all of that data onto a single pane of glass, terrific. You're turning an entire wall
of your office into a television that had better have retina capability, because the dashboards are going to be really
small to fit all of that there. How do you find that that winds up turning into something
that can be reasonably answered when customers ask about it? I mean, it sounds like on the one hand,
it's like arguing with Hacker News. Oh, that doesn't sound hard. I could build that in a weekend and come to find out it's a little
more complex than that. So I don't think that dashboards are the answer to everything, right?
At least not having every metric that you could possibly look at on the dashboard above your head
in a virtual NOC or on your extra monitor. You're not necessarily looking to have that all
there right now.
What you want to have above your head or on the dashboards in your NOC or in your office are the key metrics that tell you whether or not your customers are happy and whether
or not you're serving them well.
So if you're an e-commerce site, it might be, how many checkouts have
we had this hour, this second, what have you.
And these are what we call your work metrics.
These are the things that your customers are paying you for. And these are very good indicators as to whether or not your service
is working right now. Something may change, though. There might be an event, like maybe you
got a Super Bowl ad, maybe you went on Screaming in the Cloud, and now everybody wants to buy some
Datadog monitoring, and your usage jumps or it drops. And you're going to want to dig into that.
And that's where you're going to want to have additional dashboards and other things that you can query and tease out of your monitoring system.
And you're going to want to have all that data there.
But the idea that I'm going to have up on a single dashboard,
every single metric that I collect,
and I'm going to look at all of it in my NOC,
I don't think that that's reasonable.
You want systems like Datadog to be able to make it easy to explore that data,
make it easy to raise it up for you when something changes. So whether it's our anomaly detection or
other ML type capabilities that we use to quickly identify things that are changing,
that's what you're going to want to focus on. You want your systems to be able to
raise that for you. Hopefully that answers the question.
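For the "checkouts this hour" work metric mentioned above, the instrumentation side can be as small as the following sketch, assuming the Datadog agent's DogStatsD endpoint is listening locally; the metric names and the checkout function are invented for illustration.

```python
# Sketch: emitting a "work metric" (completed checkouts) via DogStatsD using
# datadogpy, assuming a local Datadog agent. Metric names and the checkout
# function are invented for illustration.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)


@statsd.timed("shop.checkout.duration", tags=["service:checkout"])
def complete_checkout(cart_total_cents: int) -> None:
    # ... payment and fulfillment would happen here ...
    # The count of successful checkouts is the number customers actually pay
    # you for: the kind of metric that belongs on the wall dashboard, and the
    # kind an anomaly or forecast monitor would watch.
    statsd.increment("shop.checkout.completed", tags=["service:checkout"])
    statsd.histogram(
        "shop.checkout.value_cents", cart_total_cents, tags=["service:checkout"]
    )


if __name__ == "__main__":
    complete_checkout(4_999)
```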
No, it does. But it also opens another one in the sense of when I was running ops teams, monitoring systems always felt like a relatively
thankless thing to work on because invariably people tended to ignore it and never look at
the dashboards until there was an issue where something broke. And the question was always
raised after the fact of, well, why didn't monitoring catch this? So you're always building new checks and new alarms that alert when particular
patterns hit, and you're persistently fighting the last war when that happens. And if you continue
following that to its logical conclusion, well, we'll just alert on everything. Great. Now, in a
typical day, you're getting paged 4,000 times. That is not going to make anyone happy.
Their cell phones are running out of power after four hours.
How do you wind up scaling it back?
And this may not even be a product question.
This may be a philosophy of monitoring question.
But I'm curious as to how you see that.
So I definitely think it's a philosophy of monitoring question.
I've lived through that approach in my career as well, right?
Every time something
breaks, let's create an alert for that. And now we're alerting people on every NTP time skew on
every machine because one time it caused an issue for us. You want to make sure that your alerts
are actionable. And so I think starting with those work metrics, the ones that are actually
relevant to your customers and to the services you provide, and figuring out how those systems behave, that's going to be your first step.
And it's also important to clean that up fairly regularly over time.
If you're seeing something noisy, get rid of it.
If you're seeing something cause issues repeatedly, it's not just create an alert for it.
It's probably also fix it so it's not happening as frequently. I think it's on monitoring systems like Datadog,
and others in the space, to try to make that a less manual,
less human process. We should be looking at your metrics and identifying things for you
as they happen and raising them for you, so that you're not in this never-ending battle
to create alerts for every single metric every time.
I also think that in some cases,
a lot of this data doesn't need to be alerted on,
but you do want to have it.
So collecting it is one thing,
alerting on it is something else.
But you never want to be that team that's getting alerts
just to prove that the data is flowing.
One of the things I used to do
when I was more in the operations space,
before I joined Datadog,
was consult with the teams in my organization
and say, this week you had the largest number of alerts
across all the other teams in the organization.
Let's sit down for an hour or two.
Let's look at what you're paging on.
Let's look at your systems
and see how we can make them either more resilient, or let's look at your monitoring and see how we can make
it more actionable. I once went to a team that had gotten 10,000 pages
in a week. There is no way that they are sleeping if they are responding to every one of those.
And more likely, they're just ignoring the pager under their pillow.
And in their case, what they had done is, in many cases, these alerts were more of a heartbeat,
a the-system-is-alive signal. They were not actionable, and so that
was a problem. We were able to sit down and clean that up and sort of flip
things around and get them to a more manageable spot. But again, a lot of this is around that
alerting and monitoring philosophy. It's not necessarily about the
tooling; it's about deciding on what you care about.
And you're right.
The counterpoint is that when you have an outage,
you didn't know you cared about a thing
until right after you really could have used an alert on this.
An example would be if your site slams to a halt one day
and there's an incident and the investigation determines,
oh, it's because the primary database had
its disk fill up.
And then you pull up the graph and, for the past number of months, you see the line
getting closer and closer and closer to the top of the graph, and then it hits, and then
the incident is triggered.
It's not the most defensible thing to have on a screen in an after-action report while doing
a postmortem of why the site went
down, and you have a bunch of executives and partners who are very upset by that.
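The arithmetic behind catching that ahead of time is almost embarrassingly simple; here is a toy sketch with invented sample data (Datadog's forecast() monitors do a more robust version of this).

```python
# Toy sketch of "the line has been creeping toward the top of the graph":
# fit a line to recent disk-usage samples and estimate when it crosses 100%.
# The sample data is invented; forecast() monitors do this more robustly.
import numpy as np


def days_until_full(day_index, used_pct):
    """Linear extrapolation; returns None if usage is flat or shrinking."""
    slope, _intercept = np.polyfit(day_index, used_pct, 1)  # percent per day
    if slope <= 0:
        return None
    return (100.0 - used_pct[-1]) / slope


if __name__ == "__main__":
    days = np.arange(30)  # last 30 daily samples
    used = 60 + 0.9 * days + np.random.normal(0, 0.5, size=30)  # ~0.9%/day growth
    remaining = days_until_full(days, used)
    if remaining is not None and remaining < 14:
        print(f"Primary DB disk projected full in ~{remaining:.0f} days; page someone now.")
```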
Yeah. So we have a sort of a rubric or a mental model that we suggest you go through
that we've written about on the Datadog site. I'll send a link over for the show notes.
And again, as I mentioned before, we tend to suggest taking a look at
those work metrics and then working your way backwards.
So you're an e-commerce website, and the metric you
care about is things ending up in shopping carts and things actually getting checked
out of those shopping carts.
That's a top level metric that you want to alert on and probably that you want to have
on your dashboards because it determines the health of your business, right? Are you making
money today or not? Are your customers actually happy or not? Great. And now work backwards from
that and figure out what are the resources that go into making that. And if you do that for each
of your systems as you're building them, you're going to get to the point where you're like,
oh, I have a database. What does that database depend on? Oh, it depends on disk. That's not to say you're never going to miss
anything, but that workflow is pretty helpful for figuring out what data would be actionable and
when. And the thing is, in most cases, you don't have just one person on a team doing these things,
right? And it's not just one person on call. Each team has their own work metrics, right?
The team that's running storage for your
underlying databases,
their work metric is going to be around IOPS
and how much storage is available. If they alerted
on that, you probably would have avoided
that outage you just talked about.
Your database team, their work
metric is how many queries per second are they returning
and how long each of those queries is taking.
If they alert on that, they're going to notice
that, hey, inserts are failing right now.
We should catch that incident before it happens.
We should fix this before it impacts our users.
And so if you work your way down the stack that way,
you're going to catch the big things that are important.
And those are the areas that you really want to focus on.
Everything else, I think, is data that you want to have around
for troubleshooting purposes.
But I don't care if CPU is at 90% if my site's still working. That's the most useless
thing to page somebody on in the middle of the night. Absolutely. That's right up there with
load average is high. There are 15 different factors that weigh into that. Great. Tell me
the real world impact. And it's on one system, and I have 200 of those. Maybe I don't care about that
particular cattle hanging out in that environment.
One other thing that's coming up, I believe,
in a month or so is your Dash conference.
Yeah.
So Dash is our first user conference for Datadog.
And it's coming up on July 11th and 12th in New York.
And so if folks are in the area,
we'd love to have you join us.
We have some great presentations
from folks like Shopify and Google
and DraftKings
and a number of other organizations
talking about how they're scaling up
and speeding up their infrastructure,
their teams and their applications.
And so this is not two days
of how do I monitor X,
but rather an opportunity to learn about
how folks are solving real business problems.
So whether it be Shopify talking about
how to scale up their infrastructure 3X
while also moving it to GCP at the same time
and containerizing it,
or the folks at Segment talking about
how they've built a culture of shared outages
within their organization, and how they're taking a lot of the challenges Corey mentioned earlier, around what you alert on and how you prevent problems from recurring, and building that into their processes within the organization.
You know, there's a lot of opportunities to learn about how to, again, scale up and speed up your organization.
And, you know, there should be some fun news from Datadog on various
features as well. So if you're in the area, or want to travel out, join us in New York this summer,
July 11th and 12th. But yeah, hope to see you all in New York. And thanks for having us.
No, thank you very much for taking the time to speak with me. This has been Ilan Rabinovitch
of Datadog. My name's Corey Quinn, and this is Screaming in the Cloud.
This has been this week's episode of Screaming in the Cloud.
You can also find more
Corey at screaminginthecloud.com
or wherever fine snark is sold.