PurePerformance - 055 Monitoring in the Time of Cloud Native with James Turnbull
Episode Date: February 12, 2018. James Turnbull ( https://jamesturnbull.net/ ) is the author of 10 books on topics like Docker, Packer, Terraform, Monitoring, and more, and is currently writing a book on Monitoring with Prometheus: https://prometheusbook.com/ . We got to chat about what modern monitoring approaches look like, how to pull in developers to start building monitoring into their systems, and how to bridge the gap between monitoring for operations vs monitoring for business. Having a monitoring expert like James who knows many tools in the space was great to validate what we at Dynatrace have been doing to solve modern monitoring problems. We learned a lot about key monitoring capabilities such as capturing data vs capturing information, providing just nice dashboards vs providing answers to known and unknown questions, and making monitoring easily accessible so that it can benefit business, operations, and developers alike. We hope you enjoy the conversation and learn as much as we did. A blog we referenced several times during the talk was this one from Cindy Sridharan on Monitoring in the time of Cloud Native: https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello, everybody, and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always we have with us Andy Grabner.
Andy, how are you doing today?
Almost freezing. It's getting cold up here.
Well, not now. It's above freezing.
I think 6 or 7 degrees Celsius.
But Boston is getting a little colder over the next couple of days.
But otherwise, I'm good.
Yeah, we just started getting a cold front in here finally.
Last week in Denver, we were in the 70s, and now we're finally getting cold.
And it's rightfully so, right?
So this episode will be airing in 2018.
But we're wrapping up the year of 2017.
We have a few more podcasts to record, and we're just fresh off of Black Friday and Cyber Monday,
and hopefully everyone survived that well.
So for everybody who had to go through all that fun retail support and keeping the systems up and lights on, we salute you.
And hopefully you're getting to take it a little bit easier now, but it never gets easy, does it?
Hopefully the wallets also survived and the credit cards, your personal credit cards survived it.
Yeah, that's it.
It's not a big thing, right?
Right.
So, Andy, we've got a very special guest, as always.
All of our guests are always very special.
And I just want to, before we introduce him, definitely recommend checking out his blogs,
checking out his books, a lot of great stuff.
So I wanted to say that ahead of time because I think a lot of the writing he's doing, and the links to other things that he's used as inspiration, it's quite amazing stuff.
And I literally just discovered it two days ago.
Well, I won't say the name, but I've heard of our guest before.
But I just, you know, being so busy, never really had a chance to dive deep into it.
So it's really exciting for me.
So, Andy, why don't you take it?
Sure.
So actually, when I'm on the plane, I listen to podcasts and I listen to the DevOps Cafe.
And one of the episodes that I recently listened to, which made me reach out to James, James Turnbull, was about the art of monitoring. And it was recorded about a year ago.
And it was very fascinating what James and the host had to say.
And I reached out to James and James immediately
replied, very well performing, right? James, and this is why we have you on board, because I'm
sure there's a lot of stuff that happened in the last year since you did that podcast, you did a
lot of work. But before we get started, first of all, hi, and maybe you want to tell the audience
who you are for those that have never come across your name. Hi, thanks for having me on. I'm James Turnbull.
I've been an engineer for 25 years this year, I think,
and primarily sort of
doing a bunch of infrastructure stuff, and then more recent last few years doing sort of
product and leadership sort of things. I'm currently the CTO
at a not-for-profit called Empatico.
We connect classrooms together globally to help elementary school students
develop empathy skills.
Prior to that, I was the CTO at Kickstarter.
I've worked on Venmo, which is a payments platform.
I was one of the early engineers at Docker after it flipped over from dotCloud to Docker.
And I was one of the early engineers who worked on Puppet as well.
So a long history in infrastructure software.
And in my spare time, copious spare time, I write technical books.
I've written 10 technical books largely about infrastructure software and engineering practice. And I do want to say, as we mentioned earlier,
I am using one of your technical books right now, the Docker book.
So thank you for writing that.
It's been very helpful as I'm going through learning Docker.
Oh, awesome. Glad you liked it.
Yeah.
And I think you also said you're currently writing a book.
I think it's on Prometheus.
Yes. I'm writing a book called Monitoring with Prometheus,
which is – I'm particularly – obviously, being on the podcast,
I'm particularly fascinated by monitoring and the monitoring landscape.
And Prometheus has obviously had a quick rise,
very closely associated with Kubernetes
and with the changes in architecture and containerization.
So I thought it was worth doing a deep dive into that
and writing a book to cover off, sort of give folks who might not have heard of Prometheus, or want a place to start, an introduction.
The URL is prometheusbook.com, so it's pretty easy to find.
And we'll link to that in the description on the page too as well.
Well, it's a great coincidence that you are into monitoring, and obviously Brian and I are big into monitoring, I mean, we're working for Dynatrace, even though we try to keep this podcast kind of tool neutral,
even though I think today
we'll definitely go into tools
and it will be very interesting
for me now to understand.
So first of all,
what has changed since the last year
when you gave the DevOps Cafe interview?
What is going on,
especially in the areas
you mentioned earlier
when it comes to building
these new, very dynamic applications with containers using Kubernetes.
Also, when we had the email exchange prior to this podcast, I brought up serverless monitoring.
Is this any different or not?
And I think you have your opinion on that.
So I was just wondering, what gets you excited about monitoring these days and what are kind of the capabilities that people need to look for when monitoring their new systems?
And maybe not only their new systems, because what I see, yes, it's great that we can build new cool applications and new cool architectures.
But the reality is that most people do not only have the new cool stuff, but they still have a lot of legacy systems that also interact with the new cool stuff.
So kind of how can we bridge the gap?
And maybe let's get started with what has changed and what gets you excited these days
and what do people need to look for?
And I hope you kept track of all that.
I know.
Hopefully you can remind me of some of those questions.
Yeah.
So the funny thing about our industry is a year doesn't seem like a long time, but in fact, sometimes it quite is. A year ago there were pockets of the community of our industry where the topic of things like containerization and microservices was sort of only on the edge of the horizon.
I think by this year – and it's obviously hard to tell.
I work in New York and I work with a lot of people in the Valley,
so there's the hype sort of window there.
But I think in the last sort of 12 months, it's become pretty clear that we're on the path to a pretty fundamental change in the way that we build applications and the way that we manage infrastructure, something that previously I think we were sort of seeing the bleeding edge of, but now it's more definite.
And, you know, probably 12 or 18 months ago, I would have asked, what are the workloads that are going to get flipped over to, say, containerization, or over to public cloud?
Now I'm more and more convinced that only a very small group of people, particularly those with large brownfield installations and those with sort of regulatory obligations, in industries that are not moving forward, will be the sort of last remaining groups who have infrastructure on premise in data centers and infrastructure that is running on physical servers.
So that means, if we trust that, and I totally agree with you, right?
I mean, I was at AWS re:Invent last week, and I think Werner Vogels in his keynote basically said, well, developers, go build, we take care of the rest, and basically move everything to the cloud.
I mean, there's obviously different flavors
and different service they provide.
But if we're all moving to the cloud
and using cloud services,
does that mean monitoring is completely changing?
Because obviously we're no longer
owning the underlying infrastructure.
Should we still worry about the underlying infrastructure?
Should we solely focus on the business value we create,
which is what is my code actually doing?
What is my user experience?
Or do we still need to tie all the knots kind of together
and also make sure that we're not blindly trusting
the underlying infrastructure?
Well, I think that the future is not evenly distributed across various tools.
So in some circumstances, a service is very much a black box in some regards,
particularly with the new fairway, particularly with the new – sorry, the new –
The Fargate.
Fargate, sorry.
I was thinking golf references for some reason.
Yeah, that's good.
The Fargate infrastructure where you're not even managing the instance that Docker is running on,
that heads to a place where you've got to ask yourself, how much investment would I make in
monitoring that? But there are other services. RDS is a good example where you are essentially
getting a database that's running on top of an instance. You have some input into the nature of that instance, but your
application's behavior is still going to be reflected in metrics on that database server,
and identifying issues and problems and bottlenecks and challenges with, say, a query still requires
you to instrument that service. So I don't think it's entirely a black box, and we're not just simply talking about: if I instrument my code and generate metrics around that, that's going to solve my problems.
And the other area where the complexity is increasing rapidly is tracing.
So if you are consuming multiple services, all of which, even if they're provided by a single provider like Amazon or Azure, are essentially siloed, have their own APIs, their own reporting, their own metrics.
It is often hard to trace the path of your transactions or your customer experience through that maze of events.
And you often need to have an overlay, which, you know, could either be a monitoring system or some sort of observability system over the top of those services to provide you with the sort of coherent viewpoint of a customer's experience.
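The kind of end-to-end overlay James describes here is usually built with distributed tracing instrumentation. As a purely illustrative aside (OpenTelemetry is not a tool named in this episode, and the service and span names are hypothetical), here is a minimal sketch of what that looks like in application code:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for the sketch; a real setup would send them to a
# tracing backend instead.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_checkout(order_id: str) -> None:
    # Parent span covers the whole customer-facing transaction.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        # Child spans cover calls into other, possibly managed, services.
        with tracer.start_as_current_span("payment-provider-call"):
            pass  # call the payment API here
        with tracer.start_as_current_span("database-query"):
            pass  # run the database query here

handle_checkout("12345")
```

Each service propagates the trace context to the next, so a single customer transaction can be followed across otherwise siloed services, whichever backend ends up storing the spans.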
And in this house – so obviously you're excited about Prometheus.
So how does Prometheus – what does Prometheus provide to solve this problem?
What are the best practices for people that build these new systems?
Is there something where you need to obviously think about observability and about monitoring
when you build these systems to make it easier for you, whatever monitoring solution you
pick, to be actually to be able to monitor all that?
Or how would you go about that?
I think the biggest change, and I think people are still fundamentally coming to grips with this change, is that monitoring was always very siloed.
So infrastructure people monitored operating systems, and maybe developers used something like New Relic or the like, which provided them with sort of views on how code was performing, or possibly transactions and things like that in the stack.
So, you know, I think the biggest change we're seeing is that you can no longer maintain that siloed environment.
Everybody needs to have a view from end to end, which means that when you start thinking about monitoring,
you need to think about, you know, where am I going to start?
So I need to start in the code base,
and I need to ensure that my code is instrumented
from top to bottom,
both in terms of the performance of the code itself and in terms of information that is useful to the business that consumes or uses those applications.
And then all the way along the life cycle, I need to ensure that the right groups are involved in determining whether it be database performance or middleware performance or the security compliance of the platform or the operating system performance or metrics around things like deployment, as well as things like uptime and latency, the sort of core things that measure the performance of an application and directly tie into customer satisfaction with that service or application.
So I think that, yeah, that's what I think about as the sort of basis of that change
and monitoring.
Yeah, and I think that's interesting because I think it's, I mean, I'm not sure how much
you are familiar with what we are doing at Dynatrace, but I think we see the same challenges
and also the same requirements.
So monitoring everything from the end user through your services containers
all the way down to the infrastructure if possible, and then pulling more information
in from your cloud providers, from your services to make sense, right?
To actually understand, is there a business impact right now with what we are running?
Because I think that's the bottom line.
The bottom line is, and I think you actually are, you put this very well in the podcast last year.
You said, if an organization is kind of like in the Stone Age, what would you tell them to do as a next step if they kind of wake up in 2017 and have no clue what's going on?
And then you said, well, you know, pick the business metric that is most important for you and then figure out how you can monitor that business metric and everything that correlates to it so that you know what's actually going on.
And I think that's – if I hear this correctly, even though the technology changes, but in the end, it's really – it's very much about what do we actually do to support our business? And that could be something for e-commerce, whether it's an order rate or the number of items sold or for insurance companies, how many claims are opened
and how fast is the processing of claims. So I think it's the business aspect, but then obviously
figuring out how can we monitor the key important pieces of the underlying application to figure out
in case there is a problem, how to address this problem? Am I kind of getting into the right direction here?
Yeah, I think so.
I think that if I look at that, you know, essentially we build products and services because they have customers.
You know, in order to validate to our customers that they're getting what they paid for
and to validate and to be able to provide them with an insight into, you know, if I change this environment or I change this application,
this has a corresponding impact on my customers with their satisfaction or their churn.
You know, we have that obligation to provide that sort of data to people.
And I think the other part of this is that having that end-to-end view means that all of a sudden the stakeholders in the monitoring process are more than just operations; they span every single business level, but have individual levels of granularity that they care about.
And it starts to present that sort of – some challenges around how do we handle granularity?
The concept of a single pane of glass is kind of laughable now.
Like every audience has a different pane of glass they would like to see
and at a different level of resolution.
And we also start to see some movement around the fact that, you know,
instead of perceiving an application or a service as being, you know, its code, its machines, its services, its middleware, we are seeing them as a coherent whole.
And as a result, particularly in the case of distributed systems,
the path through that is long and torturous,
and we need to have something where we start to think about
correlation of events and tracing of events across multiple systems and multiple resolutions
and potentially in multiple geographies and time zones, et cetera, et cetera.
And so do you – does Prometheus provide that?
Again, coming back to that book that you're writing and the tool, because I want to understand what your solution is.
So is there a tool like Prometheus that can help us here?
I mean, again, from a Dynatrace perspective, we also believe we're going in that direction, and I think we're addressing all of these aspects, but I just want to understand. Obviously you have more expertise on the Prometheus side, so help me understand if you think Prometheus can help here and how it actually does it.
Sure.
So I'm generally fairly vendor agnostic.
I'm also the maintainer of Riemann, which is an event monitoring system.
So I don't have a particular axe to grind in the sort of tool sense,
and I firmly believe people should choose the tools
that work best to solve their problems
and are easily consumable by their colleagues
and easily manageable and maintainable.
I think the really interesting thing about Prometheus
is that it does emerge out of the, you know,
there's been a lot of talk in the last few years
about Google, Google's tooling, the Google SRE culture,
and particularly around tools like Borg and Borgmon,
Borg being the sort of the internal Google tool that Kubernetes is modeled on, and Borgmon being
the monitoring tool that monitors that. So Prometheus has a heritage that comes out of
that community. So the original engineers who worked on Prometheus at SoundCloud, where it was
first open sourced, are ex-Google SREs. And they took the heritage of Borgmon and built a tool
that reflected that heritage. But in my view, it's somewhat easier to consume and somewhat easier to use than possibly an internal
Google tool with 10 years worth of heritage and a bunch of different systems.
And I think the thing they were looking at primarily is they were attempting to address
the fact that we live in this dynamic world and we live in this world where hosts and
services and jobs appear and disappear quite
rapidly and we need to be able to manage those and monitor them in a coherent kind of way.
So what I find interesting about Prometheus is that it's very much aimed to have, you know,
you are a department or a group of people, you manage a service.
Prometheus is provided as a, let's say, a service on demand to a team of engineers working
on a distributed system.
They can expose the metrics that they feel are important to their group or important to them, or that roll up into top-level metrics that maybe are cared about in a federated way or from a business perspective.
They can expose those metrics really easily and they can point Prometheus or have the
teams that manage Prometheus point Prometheus at their services and consume those metrics.
I think that in the sort of bad old days, we'll call it the sort of older
environments, monitoring was very much an afterthought because it really was about,
you know, we launched a new system, now let's monitor it. And people would say, okay, well,
we've got operating system level monitoring on the host with Nagios, and maybe we've got an APM plugin of some kind, and those events are going over there.
And more often than not, things like alerts or concerns about performance were raised as a result of an incident.
So there were post facto sort of implementations. A tool like Prometheus makes
it easy for a team to say, well, I can embed my monitoring from day one. I can expose the metrics
that are important to me that I've built based on my design considerations or my business
requirements. And I can then have a team acquire those metrics and present me with a dashboard or
an aggregation or together with
the metrics from the other parts of the service that allow me to see it as a holistic view.
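As a concrete illustration of what "expose the metrics that are important to me and let Prometheus scrape them" can look like, here is a minimal sketch using the official Prometheus Python client; the metric names and port are hypothetical examples, not anything prescribed in the episode:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics this (hypothetical) team cares about for its payment service.
REQUESTS = Counter("payment_requests_total",
                   "Total payment requests handled by this service")
LATENCY = Histogram("payment_request_duration_seconds",
                    "Time spent handling a payment request")

@LATENCY.time()              # observe how long each call takes
def handle_payment():
    REQUESTS.inc()           # count every request
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    # Expose the metrics at http://localhost:8000/metrics for Prometheus to scrape.
    start_http_server(8000)
    while True:
        handle_payment()
```

The team that runs Prometheus then only needs a scrape target pointing at that endpoint, which is the "point Prometheus at their services and consume those metrics" step James mentions.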
Cool. Yeah, I think that's, I mean, I love it because this is kind of the same strategy that
we've been following. We also strongly believe that monitoring has to start on the dev side.
When we talk about DevOps and pipelines, then monitoring has to be something that is following the code from the workstation through code checking, through a CI, CD, all the way into production.
And you want to use monitoring to get early feedback, performance feedback, resource consumption feedback through the monitoring tools when you run your unit test,
your integration and performance tests, right?
This is something we've been promoting and also the industry,
and I think most people call it shift left.
But I totally agree.
And I think we also see a shift in our,
in the people we work with
that monitoring used to be seen as something that helps us in case we have problems in production, and then we turn it on, and we're also willing to pay some money, obviously, for APM tools as well to kind of keep the lights on.
But I think the industry has shifted
and I think in a big way that they understand
that monitoring has to be something
that is seen holistically.
Monitoring has to be a feature that comes,
or a capability that comes with the code that you're deploying.
I believe so.
And it's got to be baked in, right?
It's got to be baked in.
Yeah, exactly.
And whether developers build it in directly into their code,
or if it's part of the platform that they're using,
whether it's a PaaS platform or anything that comes with the cloud vendor that you are using.
But I totally agree with you, James.
I mean, it's something that has to be part of the development culture, right?
I mean, you should not release code that doesn't expose any type of metrics so that we know how it is behaving.
Otherwise, you're flying blind.
Yeah, and I think that there's another really interesting side effect
of this that a lot of application developers haven't thought about
is that a significant amount of engineering leadership overhead consists of measuring velocity, like how much have we shipped.
And a lot of that is related to how many stories did we complete, or how many story points did we ship, or how many features did we ship, or did we meet the roadmap on the features?
I mean, that measures some success, but if you actually ask the business, well, they're actually not necessarily interested in how many stories you finished.
They're interested in the impact on their customers.
So by instrumenting code early and by instrumenting applications and systems early,
and particularly by measuring things like latency and performance from an end-to-end point of view,
you're able to provide the business with, okay, we invested in refactoring this subsystem.
Maybe it's a payment subsystem.
And we cut 30 microseconds
off every transaction. We can see customers going through the system at a faster rate,
and we can see the growth of customers and customer transactions as a result.
And we can see return business based on that experience. And I think that's really valuable
to engineering leaders who are attempting to essentially validate their existence in a way that the business understands.
Because the business doesn't care about my database transaction was 30 microseconds faster.
What they care about is their customer sat numbers where customers are now 75% happy with our service, whereas last quarter they were 70% happy with our service. So if you can tie those business metrics back to the changes you
made in the environment, then you validate your existence in such a way that's positive
and guarantees your continued employment and hopefully your promotion and bonus and
the value of your stock. And I think that's very much a continuation of monitoring. If you're
starting your monitoring and your metric collection all the way into the development side, it doesn't end when you push your code out. It
has to continue out into production. And besides collecting business metrics that the business team
is using, that has to go back to the product owners. That has to go back to the developers
to understand, number one, we put out code, but was that code of quality and of use to the end users? Are the end users using that
code? Is it having a positive or negative impact on the users and their experience? And most
importantly, whatever we put out, let's say it's a new feature, monitoring if users are using it,
and if they're not using it, is it maybe because the performance of it is slow? You can look at
and say, hey, if we speed that up, as you're saying on that back end, do we get greater adoption
or does it not have an impact on adoption?
And maybe this new feature that we put out because we've extended
that monitoring into the actual customer,
we can then suddenly start cutting out features that are not being used.
So I think that's – you talked about silos earlier between operations
and the application teams.
I think there's that other silo that I don't know if we really mentioned back then of that production feedback into this whole cycle of things.
Yeah.
Yeah.
And Brian, to add to this, I mean, this is what I love about kind of our transformation story that happened within our organization, where developers are now actually responsible for production, which means they make the conscious decision when they deploy something in production, but they're also responsible for dealing with the impact.
And their number one impact now, as you said, James, is no longer that a database statement is slow.
The thing that they're looking at now as well is, hey, how many people are now actually using that feature, is it breaking for them or is it usable for them, and, very important for them, how many people are opening up a support ticket afterwards because something breaks.
And so we also found it very useful to kind of interpret DevOps as we're all in one boat, right?
We're all an engineering organization and we need to provide the benefit to the business.
And therefore, developers also have to, you know, be responsible for what they do out there that impacts the business.
And once they started to look at the business metrics that they defined, obviously,
in combination with the business people, within our case, with our product managers, and then looking at them, they saw immediately what type of impact they have.
But what I thought was so cool about it, and I have to quote one of our engineers, and he said, Finally, I can be proud of my features because I immediately see when I push the deploy button how many people are using that feature.
I do not only get feedback from customers when they open up a support ticket and write nasty comments about the bad quality of the code.
I thought that was actually pretty cool.
It only obviously works because we care about monitoring and we monitor the end user and then bring it back all the way to engineering.
So that was pretty cool.
And one thing I want to just take a step back on as well, James: as I was reading your Prometheus article, you had referenced what your inspiration for looking into Prometheus was, which was this post by Sridharan, I hope I'm getting that right. So I went back and read
that as well, which is a fascinating read. We'll put a link up to that as well. But
I wanted to touch slightly upon the difference between data and information, which that kind
of touches upon. Because I think that's an important thing to point out. I think a lot
of us take for
granted information and might not even think of it as information. But, you know, as it was defined
in that article, we're talking about, you know, data being just simply facts and figures. It's
just the data that you're collecting. It's the numbers, it's the metrics, but they're meaningless
unless you're structuring them, unless you're doing something with them to present them to make them meaningful and actionable. So I just want to get your thoughts
on, you know, in terms of, let's say, Prometheus, or in general, what a lot of, you know, I'm not
sure if you're interacting with people, or if you're seeing a lot of the state of monitoring
out there these days. Do you think people are falling into the trap of collecting data?
Or do you think a lot of people, as we're seeing this is becoming more important, are they focusing on information and taking that data that they're collecting and transforming
it into something useful?
Yeah, I think there was an interesting transitionary phase.
Four or five years ago, I think somebody,
I can't remember who it was, somebody at Etsy described
the Etsy monitoring environment as the church of graphs
because they collected a lot of data.
Like pretty much any bit of infrastructure or code, you poked it a little bit, and if it didn't move, you stuck some instrumentation in there and would collect from it.
But I think that for a lot of people, when they sort of cargo-culted that or they sort of modeled that behavior, they were like, I'm going to need to collect everything.
And they didn't make the next logical conclusion, which was to ask themselves, why is Etsy collecting that?
What are they hoping to learn from it?
Which, as you describe, is sort of the information side of the equation. And I think that what's really interestingly happened is that this is actually something that, and I'd be curious to ask a Google SRE who was in the early days,
whether they had this sort of endorsement, but this is something that comes out of the data
management, sort of data analysis world. And Google is obviously well known for the fact that they do a lot of poking at the performance of their platforms, AdWords being a prime example here. They have some fairly complex algorithms, collecting all sorts of information, that help them understand: what is this click worth? Did it reach the right demographic? Have we presented the right ad to the right person? And I think about the sort of data
information question in the same way, in the same way that a data engineer does or a data scientist
does. And that is that, you know, I need to have all of this information, but I can't be buried in
it. I can't be in analysis paralysis. I need to have some
really good questions. Like I need to have some questions that demonstrate to me, like, you know,
what am I trying to understand here? And then one of those might be, you know, how successful are our customers at using our product? Or what is the average sale price for a particular checkout?
How many checkouts are discarded and why?
You know, questions like that that actually sort of directly tie to the success of your business.
And then piece together the right bits of data to say, okay, I can make a hypothesis based on this piece of data. And I've turned it into information, which I can then answer questions about
and then provide to,
whether it be a product community,
an engineering community,
or the business community about like,
these are the sort of decisions strategically
and tactically we should make
to make our business more successful.
And do you see with that, right?
There's a lot of data.
We've heard of ideas of data scientists recently.
And as you mentioned earlier,
you were the CTO for a few companies there. And the role of this information and data,
oftentimes people turn to it when a problem arises. But would you see, or if you were going
to go back into, you know, that full-on tech company CTO side of things again, would you be able to see or justify
hiring people whose job is just really making sense of this information outside of negative
effects going on? Like to say, hey, we have all this information, we should be mining it for
everything we can to find optimizations, to find, you know, obviously to find problems and
resolve issues that are going on. But would it make sense, or would it be justifiable financially, for an organization just to have somebody consuming this data to see what they can mine out of it?
I think so very much so.
I mean, I was previously at Kickstarter.
We had a data team of five people, including a VP of data.
And the reason we did that was because we collected everything, events, we tracked users
through the system, we understood what they visited and how they visited it, how they got
their path from finding a project they liked to backing it to finding another project and so on
and so forth. Understanding that experience and understanding the underlying experiences,
like how long it took them to, say, complete a payment
or how responsive the website was or how fast search results returned,
that piece of data put together allowed us to ask questions like,
you know, what should we work on next?
Is making a recommendation engine better something that is a valuable investment for the engineering and product teams to make?
So I very much think that it's not only justified, but I think that particularly in environments
where you are customer facing and you previously relied on, say, support tickets or outbound product marketing feedback to make product decisions,
that it is much more viable and much more valuable to put together a team of folks whose job it is to look at that data,
reach conclusions, and make recommendations to leadership about what to do.
Great.
That's pretty cool.
Hey, coming back from the end user and the behavior analytics
and to make the business decisions, coming back to data versus information,
what can we do from a monitoring side to not only collect data
but actually to add more context to it?
I know in the blog posts that you referenced and that you wrote,
we talk about the combination of monitoring, log analytics, and tracing to, I think, get more context into data,
like knowing the relationship of the response time of a web service with the disk utilization on a machine it maybe, or hopefully probably, depends on.
Are there any best practices on how we as developers can add more metadata to it, and what do the monitoring tools need to do to actually get more meaning, to transform data into information from the way we collect the data?
Yeah, I think – and the blog post talks a little about this, but taxonomies are important, particularly if we're trying to break down those silos.
So if I have a metric or a log event, I need to know where it comes from. I need to know, when you refer to something, whether you're calling this a payment transaction, or you're generating a rate of some kind, or you're providing me with some piece of information like CPU or memory, that I understand the resolution or the granularity that you're collecting it at, I understand where it comes from, and I understand what systems it will impact, whether that's by attaching metadata or flagging events in a particular way or aggregating events together.
I need to have that sort of consistent view
across my environment.
So someone needs to own that sort of taxonomy of like,
this is what a system looks like from a monitoring perspective.
Here are the things that we follow, the guidelines.
Think of it as the RFC for instrumenting your application, for monitoring your host or your application or middleware or database or whatever it is: the high-level taxonomy that allows us to say, okay, I'm a developer, I care about this particular feature.
How do I aggregate together the data I have to be able to answer the questions I have about, you know, how this feature is performing, or what is the impact of changing this feature or changing the performance of this feature?
And so coming back to Prometheus or tools of that like, does this mean as a developer,
I have to obviously figure out how to collect this data and where to collect it from?
Or is there some smart tooling already out there that does some of the legwork for me? So for instance, why do I as a developer need to figure out how I can correlate something that happens in my microservice to something that happens on a machine I depend on and call because the database sits on there?
Is this something that modern tooling should take care of automatically?
Or is this still something where whoever takes care
of implementing monitoring, giving recommendations,
that these people have to put it into their monitoring strategy?
I think this is a collaborative sort of effort.
Like I don't expect an application developer
who works on a product engine of some kind
to be deeply interested in
the work that an SRE or an ops person does. But I think they need to understand the constraints
of those people and need to understand somewhat of the view of the world of those folks,
the worldview, I guess. And so I think that modern tooling needs to be able to say, okay,
we can impose a taxonomy of kinds, which means that, you know,
I can say that it might be as simple as saying these methods live inside this
service, which, you know, performs this function and is grouped,
you know, it rolls up into this business application of some kind.
And as long as everybody is aggregating their information
or labelling their information in the same way, you know,
we're already a significant step further down the path of being able to say,
ah, these things are interconnected versus the sort of siloed world
where it's like, okay, I'm collecting this piece of information at this granularity,
and I think that some stuff runs on top of this,
like some services and stuff, but I don't have anything to do with that.
That gets deployed by the release team, and, yeah,
occasionally they might call me when I need to kick the box to do something,
but I don't really understand what's happening.
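A sketch of what the shared taxonomy being discussed could look like in practice: every group attaches the same agreed-upon labels to whatever it exposes, so data from different silos can later be grouped and correlated. The label names below are hypothetical examples, not a scheme from the episode or the book:

```python
from prometheus_client import Counter

# One agreed label set used by every team: which business application the
# metric rolls up into, which service and feature it belongs to, and where it runs.
ERRORS = Counter(
    "request_errors_total",
    "Errors observed, labelled with the shared taxonomy",
    ["business_app", "service", "feature", "environment"],
)

# A product team records an error against its feature...
ERRORS.labels(business_app="storefront", service="recommendations",
              feature="related-items", environment="production").inc()

# ...and an infrastructure-owned exporter uses the same label set, so both
# roll up under the same business application when queried or dashboarded.
ERRORS.labels(business_app="storefront", service="postgres",
              feature="none", environment="production").inc()
```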
So, I mean, the reason why I bring up all these questions is
I'm just trying to validate
the way we try to solve the problem with our products, if it is the right approach or not.
And I believe the more I listen to you, I believe we are going down the right path.
And I'm not sure if you know what we are doing exactly these days at Dynatrace. But one way we
try to solve the problem, and I think we solved the problem, is that we have a single agent that we now install on the host level, and that agent automatically monitors every single network connection on that machine, so we know all the dependencies of that host to all the other hosts, but we also know which processes are opening up these connections.
But then we also combine that with our distributed tracing.
So we automatically instrument the application processes.
which is traditionally what APM tools, I think, have been doing pretty well over the last couple of years.
And then we also pull in log information that we capture from the host,
knowing which process at which time wrote which log message.
And on top of that, also adding configuration change events
or any events we can either automatically detect on that machine.
So like something was deployed on that process,
a new Docker container came up,
or something where you can also use our REST API
to tell us what you have just changed.
And that is typically then integrated
with your deployment pipeline.
And I think this helps a lot of our users
to really just,
to not have to think about
how can I correlate all this data?
How can I build up this dependency map?
Because this is what we try to automate.
And so I want to bring this up here.
I'm not sure if you're following what we've been doing.
I just wanted to see, because you are an expert in this space
and you've been working with large-scale production environments
for quite a while.
And I just wanted to get some validation if what we are doing
actually makes sense and if it is the right approach to solving the challenge of modern application monitoring.
Obviously, I don't have a huge amount of insight into how the product works.
But I think broadly speaking, that approach is correct. I think from the point of view of sort of monitoring concerns
and I guess more accurately observability concerns,
understanding what's running on that system,
being able to correlate those events together,
being able to trace those actions through that environment,
and then be able to see factors that influence that host,
as you said, deploying something or upgrading some software
or changing some setting is pretty crucial to sort of providing
that sort of full set of data that allows us to ask those interesting questions.
Cool. And I assume you're mentioning, you're talking about these things also in your book, in the new Prometheus book, how to implement these things?
I'm talking about it obviously at a reasonably high level.
I do cover monitoring architecture and some suggestions around monitoring architecture, and then use Prometheus as an example of how to implement that architecture. I don't prescribe what exactly, in certain circumstances and certain combinations of services or tools, you should be monitoring, because I feel that's something that is a bit more subjective to the environment you work in.
But I'm sort of providing an overview of how do I put these pieces together, and how do I at the end of it come away with a system or a platform that allows me to make those choices and build those sort of monitoring tools and get that sort of understanding of what I need to know about my environment?
That's pretty cool.
Hey, I want to ping you on one more topic because I know I wrote this in the email and then you have an interesting answer.
So I talked about I would like to cover monitoring serverless.
And then you said, well, I don't know why serverless is special, but happy to discuss why it isn't anything different.
And I actually like that.
So can you fill me in a little bit about what your thoughts are on serverless? Well, we talked earlier about sort of the level of abstraction of various services that we might consume.
And Lambda or Azure Functions and their Google equivalent are good examples of this.
Essentially, claiming it's serverless is kind of a misnomer. There are
obviously servers underneath there somewhere. But the level of abstraction that we're seeing
is essentially a thing that we load some code on and we ping transactions at. So to me, I can't
see inside what's happening underneath that box, and maybe I don't care, because what I do see is the
performance of my transactions or the latency of my serverless functions. And that's all I really
need to see. So, you know, to some extent, you know, I can determine whether, you know, I've
optimized the code and I see this particular latency response, like, I don't know why I need special
tools to do that, since I should be measuring that same thing on the top of any of my applications,
even my legacy applications. So to me, serverless just means that certain parts of the system
are not exposed to me. That means I make certain assumptions about that black box,
and some of those assumptions may be bad. I think that a lack of granularity into certain systems is not necessarily always awesome.
But if I'm prepared to adopt that constraint and I'm prepared to say,
assuming that the black box underneath performs in a manner that I'm comfortable with,
then all I need to care about is the layer above that, how that performs, what I want to know about.
And I don't think monitoring tools, sort of modern monitoring tools,
require any special magic to be able to do that.
Yeah, that's correct.
The only thing we see is, I mean, talking about Lambda, for instance, the AWS version
of serverless, the only real way of monitoring Lambda was kind of through CloudWatch, getting
the metrics out there, as you mentioned, throughput, latency, response time.
And it was a little challenging to get end-to-end tracing in, for instance,
because typically Lambda functions are sometimes part of a distributed activity
or business process.
And I think that's some of the technical challenges
the tool vendors face.
How can we make it easy for users of Lambda,
developers that write code,
to get more insight into what's actually really going on and where time is spent
and to which external services they reach out.
And maybe the external services actually add all the latency
or they have problem patterns in their code
that are maybe making too many calls to external services
and therefore just extending the runtime of their function
and therefore price obviously goes up
because we are charged by the execution time of Lambda functions
by Amazon and Microsoft and the like.
So I think what we've seen as a vendor,
we're just trying to figure out how to build tooling
to circumvent some of the technical constraints
to get more detailed information out of these systems?
Yeah, look, I don't disagree.
I think there's an element here of monitoring something by its absence.
Like, you know, if you can trace the transaction
as far as whatever serverless function you're calling
and then trace it coming out the other side
and you can identify that you don't have a latency problem at the beginning and you don't have a latency problem at the end, then probably somewhere in the middle is your latency problem.
Now, obviously, that's not ideal, and you'd love to be more granular and have a better idea of what's happening.
But, you know, to some extent you're constrained by the fact that the public cloud provider is a walled garden; they do want you to play in their garden.
And that means it's in their best interest to have you instrument their services using the monitoring tooling they recommend for their community and their customers. That's not always ideal, particularly if you're looking at things that need a high level of sophistication.
CloudWatch, I'm not a huge fan. I think it could be far more sophisticated than it
is. I recently did a deep dive into CloudWatch Logs. You know, I look at it and
I look at the functionality around it and it's sort of like log processing 1.0, whereas something
like Logstash or Splunk is sort of log processing 5.0.
You know, that doesn't mean that Amazon won't improve on that. And, you know, I'm certainly
one of the people that's given them feedback on that and particularly how it integrates with
things like ECS and probably in future how it will integrate with their Kubernetes service.
You know, I presume that for more sophisticated customers,
particularly as they're focused on the enterprise,
they will take that feedback and either improve those services
or make it easier for customers who fit a certain profile
to be able to get the information they require.
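As an aside on the CloudWatch metrics mentioned above: pulling the high-level Lambda numbers (duration, invocations) out programmatically is straightforward. A minimal sketch with boto3, where the function name is hypothetical and error handling and pagination are omitted:

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

def lambda_duration_stats(function_name: str, hours: int = 1):
    """Fetch average and maximum Duration for one Lambda function."""
    now = datetime.utcnow()
    return cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="Duration",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=300,                      # 5-minute buckets
        Statistics=["Average", "Maximum"],
        Unit="Milliseconds",
    )

print(lambda_duration_stats("checkout-handler"))  # hypothetical function name
```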
Going back to this idea of, and I'll use serverless as the example,
I just wanted to know if you had any thoughts on this.
Obviously, we know for serverless, CloudWatch is going to expose some of those high-level metrics about your processes or your functions that are being run.
And the idea here is for us to just trust Amazon or trust whatever cloud provider you're using to say that the servers that are actually running this code and everything is running fine, right?
It's almost similar to the CDN trust, right? You're supposed to trust your CDN to perform well.
And sometimes they do. And sometimes they don't. Obviously, a company like Amazon or Azure or
Google, that's funny, I just called Microsoft Azure, that's kind of where it's going, isn't it?
Those companies are staking their reputations on that performance. However, it's not even the type of black box where we can
do black box monitoring to see, are those systems up and running? If we see a slowdown in our
function, there is currently no way to find out, is there an infrastructure issue going on at Amazon
that might be impacting that? So with that in context, I mean, there's some information they expose in certain services that you're using and all, but not everything.
In that context, do you think the cloud vendors should have a bit more openness to the monitoring of those functions or those components that are just supposed to be on and performing well, you know, more visibility into that for their customers? Is that important or is it not, or is it just, trust that they're not going to screw it up? But I know it's kind of like black or white or somewhere in between.
It's probably a spectrum here. Like, I'm a big believer in trust but verify.
And I'm also someone who works heavily in the sort of open source world.
I don't believe vendors who say that they have a product or tool that will solve all of your problems.
I think that that trivializes people's problems
and also trivializes the complexity in people's environments,
particularly as we're distributing more complex applications.
So most people I see are consuming services from multiple vendors.
And, you know, to some extent, it is antithetical to those vendors to want to make it easier for people to consume more than one thing, something else in addition to their product.
So, you know, obviously there's some friction involved in being more open, because it obviously doesn't feel like something that, as a first-order concern, should incentivize your product team.
But I do think that ultimately most of those vendors
will have to either mature their internal solutions
to provide the information that customers require or mature the APIs
and software development kits around that infrastructure to allow customers to be able
to make their own choices about how they choose to consume that data or how they choose to
monitor that service.
I suspect that the path of least resistance is the former, but I think Amazon
is aware of the fact that, and so is Azure and Google, that their communities are still heavily
not enterprise customers. They're still heavily folks who write their own, who are software
engineers, and that they need to produce APIs and software development kits that support
the needs of those users.
Great.
Cool.
Hey, James, do you have time for one excursion, one more topic?
Sure.
Cool.
And even though we touched base on it, talking about containers and orchestration of containers: if I go to traditional infrastructure monitoring, right, people worried about, you know, how are the servers doing?
Are they up and running?
And then they alert in case systems fail.
Obviously, in the world of containers where containers come and go, this is no longer the metric.
I think looking at the number of containers that are running is probably not a metric that necessarily makes a lot of sense.
But what actually makes sense?
Can you give me a little insight on what you recommend
on what we actually need to monitor
when we talk about containerized applications?
Yeah, like I think counts of containers is probably fairly pointless.
I don't understand why that would be a viable metric.
Again, sort of with the availability of containers, the model has changed: you no longer measure the availability of individual hosts, but more the availability of a service.
So I think the abstraction has moved up.
So I generally start with looking at, you know,
what does this service do?
What is the prime function of it?
Let's instrument that and measure it.
And then if I identify that there are hiccups in that performance,
I dig down and say, okay, you know,
let's look at this aggregate group of containers.
What's happening here?
Oh, wow, okay.
Memory is exhausted on all of these containers.
We need to double the amount of memory each of these have.
Or you can see that it's constrained by CPU or by disk,
and therefore I need to change the profile of the group of containers
or the definition of the container or the pod in the Kubernetes world
that this service runs on.
And that, to me, is the sort of appropriate level.
So you have the sort of proactive thing,
which is monitoring the high-level performance of the service.
And then you have the reactive thing, which is if you identify a problem,
you have the data that you can dig into to be able to say,
ah, here is the fault or here is the issue and here is a path to resolution.
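A sketch of the two levels James just described, expressed as PromQL queries sent to the Prometheus HTTP API from Python: watch the service-level signal proactively, then dig into the aggregate group of containers when it hiccups. The request metric and its labels are hypothetical, and the container metrics assume the standard cAdvisor names, so treat the exact expressions as illustrations:

```python
import requests

PROM = "http://prometheus:9090"  # hypothetical Prometheus address

def query(expr: str):
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# 1. Proactive: watch what the service does for its users (error ratio).
service_errors = query(
    'sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{service="checkout"}[5m]))'
)

# 2. Reactive: if that hiccups, dig into the aggregate group of containers,
#    e.g. memory and CPU across all containers backing the service.
memory = query('sum(container_memory_usage_bytes{pod=~"checkout-.*"})')
cpu = query('sum(rate(container_cpu_usage_seconds_total{pod=~"checkout-.*"}[5m]))')

print(service_errors, memory, cpu)
```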
And I assume when we talk about performance of a service,
let's say a service endpoint, we're not only talking about response time,
but actually the resource consumption of that service,
meaning how many CPU cycles, how much IO,
because essentially if I deploy a service on a self-scaling environment,
then if the environment just scales up,
that means response time may always stay stable,
but I'm actually throwing more virtual resources on it.
So I guess when we talk about performance,
we really talk about resource consumption
and obviously response time and throughput, correct?
Yeah, I think so.
I think this is like the classic,
I think disk monitoring is the classic sort of metaphor I'd use here.
It's like classic threshold, static threshold-based disk monitoring,
which is like disk reaches 80%, triggers warning alert,
disk reaches 90%, triggers critical alert.
The major fallacy there was that the time it's going to take to actually exhaust the disk is more interesting than whether the disk has reached the threshold.
And I think that same principle applies to service monitoring
on virtual and containerized environments,
is that you have an upward threshold,
which is the capacity or some cost constraint,
and then you are watching auto-scaling happen
or watching resources get consumed,
and you are able to say, okay, over the last 24
hours, I've consumed resources at this rate. By this time tomorrow, I'm going to look like this,
or this time next month, it's going to look like this. That has this cost implication or this
particular consumption implication. I need to make some decisions about what I do next. And that's also cool metrics, actually.
If you then look at resource consumption per throughput,
if you are deploying new versions
or if you're doing some canary releases,
then you can immediately see if a new update
has any resource constraints
or let's say resource impact, right?
Yes, we pushed an update.
Performance-wise, it is still performing the same,
but it consumes that many more resources.
So it's probably too costly to really run this
and roll it out to everyone.
So I think that makes a lot of sense.
That's pretty cool.
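Two small illustrations of the points just made, again as PromQL sent to the Prometheus HTTP API: James' time-to-exhaustion view of disk monitoring, which is typically written with predict_linear, and Andy's resource-consumption-per-throughput comparison for canary releases. The metric names assume node_exporter and cAdvisor, and the version label and thresholds are assumptions for illustration:

```python
import requests

PROM = "http://prometheus:9090"  # hypothetical Prometheus address

def query(expr: str):
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Instead of "disk is at 80%", ask: based on the last 6 hours of growth, will
# the filesystem run out of space within the next 4 hours?
disk_will_fill = query(
    'predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0'
)

# CPU seconds spent per request, broken down by deployment version: a jump for
# the canary version flags a resource regression even if response times look the same.
cpu_per_request = query(
    'sum by (version) (rate(container_cpu_usage_seconds_total{pod=~"checkout-.*"}[5m]))'
    ' / sum by (version) (rate(http_requests_total{service="checkout"}[5m]))'
)

print(disk_will_fill, cpu_per_request)
```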
All right.
Did we miss anything else, James, that you wanted to touch upon? Anything important?
No, I don't think so. I would strongly recommend the blog post we've been talking about. I think it's interesting to more than just SREs and ops people. It really sort of explains to people in the engineering space, you know,
what is monitoring, what is observability, why is it valuable,
what is data, what is information, how to ask these questions.
And just a quick skim of that is well worthwhile.
And I'd also plug Jason Dixon's Monitorama conference.
It has a number of great speakers and all of the videos are up online.
If you are interested in monitoring and observability, it's a great event.
It's being run again in Portland, Oregon in the middle of next year.
And I can't recommend it enough if you're interested in the topic.
Jason's also a really lovely chap, wrote a great book about Graphite.
And certainly, you know, if you're playing in this space, it's worth going to.
And I guess my last plug is I'm also the co-chair of O'Reilly's Velocity Conference.
We run three conferences in San Jose, New York, and London.
But one of the sort of prime tracks we run
is a monitoring and observability track.
We also, you know, focused heavily on distributed systems,
tracing and understanding how to manage these sort of complex systems.
So if you're interested in that sort of stuff,
it's also an event that I would thoroughly recommend.
Yeah, I think we've been at velocity for the last couple of years.
It's really, I can just echo what you said.
A very good event to meet a lot of people that are interested in monitoring,
building resilient systems.
Really cool.
So, Andy, shall we summon the Submariner?
Sure, summon the Submariner.
Let's do it.
All right. So I think, I mean,
it's amazing how many different areas
we covered today.
I think definitely what I learned
is that developers need to take charge
of monitoring, right?
Monitoring is no longer
just something we do in production,
but we need to do it holistically,
end-to-end.
And there's a lot of great tools out there
and a lot of great ways for developers to capture more data. But not only to capture data: I believe what we also learned today is about converting that data into meaningful information by augmenting it with a bit more metadata, by understanding dependencies.
We can highly recommend the blog post, which we will be linking to, to read up on the difference between data and information, as well as the difference between monitoring and observability.
Thanks a lot, James, for all of your insight, also when it comes to the different approaches of, let's say, more modern monitoring, like what we do with container monitoring: monitoring just the existence of a resource and alerting when it goes away is obviously no longer what we should do; instead we should focus on the actual services that deliver business value to our end users.
And I believe to come back to what we mentioned in the very beginning,
the bottom line is we're all building software that typically services our customers. And whether that customer is a real end user sitting in front of a browser or a mobile
app, or whether the customer is using one of our REST APIs, it is a customer.
And if that bottom line is impacted, then we need to make sure we have the right data,
hopefully proactively, to figure out what's wrong.
And the last aspect, I like what you also said about the reason why data analysts are so important: you have to collect a lot of data.
And the reason is we don't yet know all the questions we want to answer with all the data we have.
But if we have more data,
we can sit down and actually ask the right questions
to drive the business,
making the next best business decision,
like which features are we going to implement, what to do next. And I think that's also very important.
Excellent, Andy. And I would add to that, again, just thanking James for taking the time to be
with us today. I probably had one of the most enjoyable show preps that I've had, between reading your blogs and reading some of the other data
that you referenced. It's just been great reading for me. You know, we work in monitoring all the
time, but to see some of these things written out the way they are was just really, really fun for
me. I'd also suggest, a lot of times Andy and I talk about the idea of shifting left and leveling up, not just the shifting left, but more of the leveling up. Obviously you have operations teams and you have the development teams and they do a lot of this intense work. And the important people that you have in the middle who sometimes
feel left out and lost are the testers, whether or not they're the functional testers or the
performance testers or whatever those roles they might play in,
a lot of times the question is, well, what do I do next? How do I level up? How do I
improve what I'm doing and make myself more valuable to the company? And I think getting
into all this monitoring is one of those key areas that you can go into. You know,
just background on myself, James, I was a performance tester before I got into all this.
And to me, that avenue of leveling up has always been the monitoring, taking all this data,
turning it into information and figuring out how to get insights into the stuff early,
how to share this with the other teams. If you're learning as much as you can about monitoring and
observability and the different metrics and what they mean, these are things you can bring back to the other team
members to try to collaborate and build a better infrastructure of monitoring for the entire
organization. So I definitely recommend anybody who has not started looking into this stuff yet
to really just start diving deep into it, because I think, you know, performance monitoring and all this other kind of monitoring is really coming into its own. It's already very important, but I think it's really taking the spotlight these days. And it's really exciting to see that happening.
So again, thanks, James, for being with us today. Awesome. Thank you so much for
having me. All right. Any final words, anybody have anything else? So we're going to put the links to your blogs, your book, some other things on the website.
And obviously you have Twitter. We'll put your Twitter handle up there.
If anybody has any feedback or any questions for us, you can reach us at pureperformance at dynatrace.com
or you can tweet us at pure underscore DT.
And I guess that's it.
Thank you, James.
Thanks.