PurePerformance - 055 Monitoring in the Time of Cloud Native with James Turnbull
Episode Date: February 12, 2018. James Turnbull ( https://jamesturnbull.net/ ) is the author of 10 books on topics like Docker, Packer, Terraform, Monitoring, and more, and is currently writing a book on Monitoring with Prometheus: https://prometheusbook.com/ . We got to chat about what modern monitoring approaches look like, how to pull in developers to start building monitoring into their systems, and how to bridge the gap between monitoring for operations vs monitoring for business. Having a monitoring expert like James who knows many tools in the space was great to validate what we at Dynatrace have been doing to solve modern monitoring problems. We learned a lot about key monitoring capabilities such as capturing data vs capturing information, providing just nice dashboards vs providing answers to known and unknown questions, and making monitoring easily accessible so that it can benefit business, operations, and developers alike. We hope you enjoy the conversation and learn as much as we did. A blog we referenced several times during the talk was this one from Cindy Sridharan on Monitoring in the time of Cloud Native: https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello, everybody, and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always we have with us Andy Grabner.
Andy, how are you doing today?
Almost freezing. It's getting cold up here.
Well, not now. It's above freezing.
I think 6 or 7 degrees Celsius.
But Boston is getting a little colder over the next couple of days.
But otherwise, I'm good.
Yeah, we just started getting a cold front in here finally.
Last week in Denver, we were in the 70s, and now we're finally getting cold.
And it's rightfully so, right?
So this episode will be airing in 2018.
But we're wrapping up the year of 2017.
We have a few more podcasts to record, and we're just fresh off of Black Friday and Cyber Monday,
and hopefully everyone survived that well.
So for everybody who had to go through all that fun retail support and keeping the systems up and lights on, we salute you.
And hopefully you're getting to take it a little bit easier now, but it never gets easy, does it?
Hopefully the wallets also survived and the credit cards, your personal credit cards survived it.
Yeah, that's it.
It's not a big thing, right?
Right.
So, Andy, we've got a very special guest, as always.
All of our guests are always very special.
And I just want to, before we introduce him, definitely recommend checking out his blogs,
checking out his books, a lot of great stuff.
So I wanted to say that ahead of time because I think a lot of the writing he's doing, and the links to other things that he's used as inspiration, it's quite amazing stuff.
And I literally just discovered it two days ago.
Well, I won't say the name, but I've heard of our guest before.
But I just, you know, being so busy, never really had a chance to dive deep into it.
So it's really exciting for me.
So, Andy, why don't you take it?
Sure.
So actually, when I'm on the plane, I listen to podcasts and I listen to the DevOps Cafe.
And one of the episodes that I recently listened to, which made me reach out to James, James Turnbull, was about the art of monitoring. And it was recorded about a year ago.
And it was very fascinating what James and the host had to say.
And I reached out to James and James immediately
replied, very well performing, right? James, and this is why we have you on board, because I'm
sure there's a lot of stuff that happened in the last year since you did that podcast, you did a
lot of work. But before we get started, first of all, hi, and maybe you want to tell the audience
who you are for those that have never come across your name. Hi, thanks for having me on. I'm James Turnbull.
I've been an engineer for 25 years this year, I think,
and primarily sort of
doing a bunch of infrastructure stuff, and then more recent last few years doing sort of
product and leadership sort of things. I'm currently the CTO
at a not-for-profit called Empatico.
We connect classrooms together globally to help elementary school students
develop empathy skills.
Prior to that, I was the CTO at Kickstarter.
I've worked on Venmo, which is a payments platform.
I was one of the early engineers at Docker after it flipped over from dotCloud to Docker.
And I was one of the early engineers who worked on Puppet as well.
So a long history in infrastructure software.
And in my spare time, copious spare time, I write technical books.
I've written 10 technical books largely about infrastructure software and engineering practice. And I do want to say, as we mentioned earlier,
I am using one of your technical books right now, the Docker book.
So thank you for writing that.
It's been very helpful as I'm going through learning Docker.
Oh, awesome. Glad you liked it.
Yeah.
And I think you also said you're currently writing a book.
I think it's on Prometheus.
Yes. I'm writing a book called Monitoring with Prometheus,
which is – I'm particularly – obviously, being on the podcast,
I'm particularly fascinated by monitoring and the monitoring landscape.
And Prometheus has obviously had a quick rise,
very closely associated with Kubernetes
and with the changes in architecture and containerization.
So I thought it was worth doing a deep dive into that
and writing a book to cover off, sort of give folks who might not have heard of Prometheus, or want a place to start, an introduction.
The URL is prometheusbook.com, so it's pretty easy to find.
And we'll link to that in the description on the page too as well.
Well, it's a great coincidence that you are into monitoring, and obviously Brian and I are big into monitoring, I mean, we're working for Dynatrace, even though we try to keep this podcast kind of tool neutral,
even though I think today
we'll definitely go into tools
and it will be very interesting
for me now to understand.
So first of all,
what has changed since the last year
when you gave the DevOps Cafe interview?
What is going on,
especially in the areas
you mentioned earlier
when it comes to building
these new, very dynamic applications with containers using Kubernetes.
Also, when we had the email exchange prior to this podcast, I brought up serverless monitoring.
Is this any different or not?
And I think you have your opinion on that.
So I was just wondering, what gets you excited about monitoring these days and what are kind of the capabilities that people need to look for when monitoring their new systems?
And maybe not only their new systems, because what I see, yes, it's great that we can build new cool applications and new cool architectures.
But the reality is that most people do not only have the new cool stuff, but they still have a lot of legacy systems that also interact with the new cool stuff.
So kind of how can we bridge the gap?
And maybe let's get started with what has changed and what gets you excited these days
and what do people need to look for?
And I hope you kept track of all that.
I know.
Hopefully you can remind me of some of those questions.
Yeah.
So the funny thing about our industry is a year doesn't seem like a long time, but in fact, sometimes it quite is. A year ago there were pockets of the community of our industry where the topic of things like containerization and microservices was sort of only on the edge of the horizon.
I think by this year – and it's obviously hard to tell.
I work in New York and I work with a lot of people in the Valley,
so there's the hype sort of window there.
But I think in the last sort of 12 months, it's become pretty clear that we're on the path to a pretty fundamental change in the way that we build applications and the way that we manage infrastructure, something that previously I think we were sort of seeing the bleeding edge of, but now it's more definite.
And, you know, probably 12 or 18 months ago, I would have asked, what are the workloads that are going to get flipped over to, say, containerization, or over to public cloud?
Now I'm more and more convinced that only a very small group of people, particularly those with large brownfield installations and those with sort of regulatory obligations, in industries that are not moving forward, will be the sort of last remaining groups who have infrastructure on premise in data centers and infrastructure that is running on physical servers.
So that means, if we trust that, and I totally agree with you, right?
I mean, I was at AWS re:Invent last week, and I think Werner Vogels in his keynote basically said, well, developers, go build, we take care of the rest, and basically move everything to the cloud.
I mean, there's obviously different flavors
and different service they provide.
But if we're all moving to the cloud
and using cloud services,
does that mean monitoring is completely changing?
Because obviously we're no longer
owning the underlying infrastructure.
Should we still worry about the underlying infrastructure?
Should we solely focus on the business value we create,
which is what is my code actually doing?
What is my user experience?
Or do we still need to tie all the knots kind of together
and also make sure that we're not blindly trusting
the underlying infrastructure?
Well, I think that the future is not evenly distributed across various tools.
So in some circumstances, a service is very much a black box in some regards,
particularly with the new fairway, particularly with the new – sorry, the new –
The Fargate.
Fargate, sorry.
I was thinking golf references for some reason.
Yeah, that's good.
The Fargate infrastructure where you're not even managing the instance that Docker is running on,
that heads to a place where you've got to ask yourself, how much investment would I make in
monitoring that? But there are other services. RDS is a good example where you are essentially
getting a database that's running on top of an instance. You have some input into the nature of that instance, but your
application's behavior is still going to be reflected in metrics on that database server,
and identifying issues and problems and bottlenecks and challenges with, say, a query still requires
you to instrument that service. So I don't think it's entirely a black box, and we're not just simply talking about: if I instrument my code and generate metrics around that, that's going to solve my problems.
And the other area where the complexity is increasing rapidly is tracing.
So if you are consuming multiple services, all of which, even if they're provided by a single provider like Amazon or Azure, are essentially siloed, have their own APIs, their own reporting, their own metrics.
It is often hard to trace the path of your transactions or your customer experience through that maze of events.
And you often need to have an overlay, which, you know, could either be a monitoring system or some sort of observability system over the top of those services to provide you with the sort of coherent viewpoint of a customer's experience.
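The kind of end-to-end overlay James describes here is usually built with distributed tracing instrumentation. As a purely illustrative aside (OpenTelemetry is not a tool named in this episode, and the service and span names are hypothetical), here is a minimal sketch of what that looks like in application code:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for the sketch; a real setup would send them to a
# tracing backend instead.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_checkout(order_id: str) -> None:
    # Parent span covers the whole customer-facing transaction.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        # Child spans cover calls into other, possibly managed, services.
        with tracer.start_as_current_span("payment-provider-call"):
            pass  # call the payment API here
        with tracer.start_as_current_span("database-query"):
            pass  # run the database query here

handle_checkout("12345")
```

Each service propagates the trace context to the next, so a single customer transaction can be followed across otherwise siloed services, whichever backend ends up storing the spans.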
And in this house – so obviously you're excited about Prometheus.
So how does Prometheus – what does Prometheus provide to solve this problem?
What are the best practices for people that build these new systems?
Is there something where you need to obviously think about observability and about monitoring
when you build these systems to make it easier for you, whatever monitoring solution you
pick, to be actually to be able to monitor all that?
Or how would you go about that?
I think the biggest change, and I think people are still fundamentally coming to grips with this change, is that monitoring was always very siloed.
So infrastructure people monitored operating systems, and maybe developers used something like New Relic or the like, which provided them with sort of views on how code was performing, or possibly transactions and things like that in the stack.
So, you know, I think the biggest change we're seeing is that you can no longer maintain that siloed environment.
Everybody needs to have a view from end to end, which means that when you start thinking about monitoring,
you need to think about, you know, where am I going to start?
So I need to start in the code base,
and I need to ensure that my code is instrumented
from top to bottom,
both in terms of the performance of the code itself and in terms of information that is useful to the business that consumes or uses those applications.
And then all the way along the life cycle, I need to ensure that the right groups are involved in determining whether it be database performance or middleware performance or the security compliance of the platform or the operating system performance or metrics around things like deployment, as well as things like uptime and latency, the sort of core things that measure the performance of an application and directly tie into customer satisfaction with that service or application.
So I think that, yeah, that's what I think about as the sort of basis of that change
and monitoring.
Yeah, and I think that's interesting because I think it's, I mean, I'm not sure how much
you are familiar with what we are doing at Dynatrace, but I think we see the same challenges
and also the same requirements.
So monitoring everything from the end user through your services containers
all the way down to the infrastructure if possible, and then pulling more information
in from your cloud providers, from your services to make sense, right?
To actually understand, is there a business impact right now with what we are running?
Because I think that's the bottom line.
The bottom line is, and I think you actually are, you put this very well in the podcast last year.
You said, if an organization is kind of like in the Stone Age, what would you tell them to do as a next step if they kind of wake up in 2017 and have no clue what's going on?
And then you said, well, you know, pick the business metric that is most important for you and then figure out how you can monitor that business metric and everything that correlates to it so that you know what's actually going on.
And I think that's – if I hear this correctly, even though the technology changes, but in the end, it's really – it's very much about what do we actually do to support our business? And that could be something for e-commerce, whether it's an order rate or the number of items sold or for insurance companies, how many claims are opened
and how fast is the processing of claims. So I think it's the business aspect, but then obviously
figuring out how can we monitor the key important pieces of the underlying application to figure out
in case there is a problem, how to address this problem? Am I kind of getting into the right direction here?
Yeah, I think so.
I think that if I look at that, you know, essentially we build products and services because they have customers.
You know, in order to validate to our customers that they're getting what they paid for
and to validate and to be able to provide them with an insight into, you know, if I change this environment or I change this application,
this has a corresponding impact on my customers with their satisfaction or their churn.
You know, we have that obligation to provide that sort of data to people.
And I think the other part of this is that having that end-to-end view means that all of a sudden the stakeholders in the monitoring process are more than just operations; they span every single business level, but have individual levels of granularity that they care about.
And it starts to present that sort of – some challenges around how do we handle granularity?
The concept of a single pane of glass is kind of laughable now.
Like every audience has a different pane of glass they would like to see
and at a different level of resolution.
And we also start to see some movement around the fact that, you know,
instead of perceiving an application or a service as being, you know, its code, its machines, its services, its middleware, we are seeing them as a coherent whole.
And as a result, particularly in the case of distributed systems,
the path through that is long and torturous,
and we need to have something where we start to think about
correlation of events and tracing of events across multiple systems and multiple resolutions
and potentially in multiple geographies and time zones, et cetera, et cetera.
And so do you – does Prometheus provide that?
Again, coming back to that book that you're writing and the tool, because I want to understand what your solution is.
So is there a tool like Prometheus that can help us here?
I mean, again, from a Dynatrace perspective, we also believe we're going in that direction, and I think we're addressing all of these aspects, but I just want to understand. Obviously you have more expertise on the Prometheus side, so help me understand if you think Prometheus can help here and how it actually does it.
Sure.
So I'm generally fairly vendor agnostic.
I'm also the maintainer of Riemann, which is an event monitoring system.
So I don't have a particular axe to grind in the sort of tool sense,
and I firmly believe people should choose the tools
that work best to solve their problems
and are easily consumable by their colleagues
and easily manageable and maintainable.
I think the really interesting thing about Prometheus
is that it does emerge out of the, you know,
there's been a lot of talk in the last few years
about Google, Google's tooling, the Google SRE culture,
and particularly around tools like Borg and Borgmon,
Borg being the sort of the internal Google tool that Kubernetes is modeled on, and Borgmon being
the monitoring tool that monitors that. So Prometheus has a heritage that comes out of
that community. So the original engineers who worked on Prometheus at SoundCloud, where it was
first open sourced, are ex-Google SREs. And they took the heritage of Borgmon and built a tool
that reflected that heritage. But in my view, it's somewhat easier to consume and somewhat easier to use than possibly an internal
Google tool with 10 years worth of heritage and a bunch of different systems.
And I think the thing they were looking at primarily is they were attempting to address
the fact that we live in this dynamic world and we live in this world where hosts and
services and jobs appear and disappear quite
rapidly and we need to be able to manage those and monitor them in a coherent kind of way.
So what I find interesting about Prometheus is that it's very much aimed to have, you know,
you are a department or a group of people, you manage a service.
Prometheus is provided as a, let's say, a service on demand to a team of engineers working
on a distributed system.
They can expose the metrics that they feel are important to their group or important to them, or that roll up into top-level metrics that maybe are cared about in a federated way or from a business perspective.
They can expose those metrics really easily and they can point Prometheus or have the
teams that manage Prometheus point Prometheus at their services and consume those metrics.
I think that in the sort of bad old days, we'll call it the sort of older
environments, monitoring was very much an afterthought because it really was about,
you know, we launched a new system, now let's monitor it. And people would say, okay, well,
we've got operating system level monitoring on the host with Nagios, and maybe we've got an APM plugin of some kind, and those events are going over there.
And more often than not, things like alerts or concerns about performance were raised as a result of an incident.
So there were post facto sort of implementations. A tool like Prometheus makes
it easy for a team to say, well, I can embed my monitoring from day one. I can expose the metrics
that are important to me that I've built based on my design considerations or my business
requirements. And I can then have a team acquire those metrics and present me with a dashboard or
an aggregation or together with
the metrics from the other parts of the service that allow me to see it as a holistic view.
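As a concrete illustration of what "expose the metrics that are important to me and let Prometheus scrape them" can look like, here is a minimal sketch using the official Prometheus Python client; the metric names and port are hypothetical examples, not anything prescribed in the episode:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics this (hypothetical) team cares about for its payment service.
REQUESTS = Counter("payment_requests_total",
                   "Total payment requests handled by this service")
LATENCY = Histogram("payment_request_duration_seconds",
                    "Time spent handling a payment request")

@LATENCY.time()              # observe how long each call takes
def handle_payment():
    REQUESTS.inc()           # count every request
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    # Expose the metrics at http://localhost:8000/metrics for Prometheus to scrape.
    start_http_server(8000)
    while True:
        handle_payment()
```

The team that runs Prometheus then only needs a scrape target pointing at that endpoint, which is the "point Prometheus at their services and consume those metrics" step James mentions.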
Cool. Yeah, I think that's, I mean, I love it because this is kind of the same strategy that
we've been following. We also strongly believe that monitoring has to start on the dev side.
When we talk about DevOps and pipelines, then monitoring has to be something that is following the code from the workstation through code checking, through a CI, CD, all the way into production.
And you want to use monitoring to get early feedback, performance feedback, resource consumption feedback through the monitoring tools when you run your unit test,
your integration and performance tests, right?
This is something we've been promoting and also the industry,
and I think most people call it shift left.
But I totally agree.
And I think we also see a shift in our,
in the people we work with
that monitoring used to be seen as something that helps us in case we have problems in production, and then we turn it on, and we're also willing to pay some money, obviously, for APM tools as well to kind of keep the lights on.
But I think the industry has shifted
and I think in a big way that they understand
that monitoring has to be something
that is seen holistically.
Monitoring has to be a feature that comes,
or a capability that comes with the code that you're deploying.
I believe so.
And it's got to be baked in, right?
It's got to be baked in.
Yeah, exactly.
And whether developers build it in directly into their code,
or if it's part of the platform that they're using,
whether it's a PaaS platform or anything that comes with the cloud vendor that you are using.
But I totally agree with you, James.
I mean, it's something that has to be part of the development culture, right?
I mean, you should not release code that doesn't expose any type of metrics so that we know how it is behaving.
Otherwise, you're flying blind.
Yeah, and I think that there's another really interesting side effect
of this that a lot of application developers haven't thought about
is that a significant amount of engineering leadership overhead consists of measuring velocity, like how much have we shipped.
And a lot of that is related to how many stories did we complete, or how many story points did we ship, or how many features did we ship, or did we meet the roadmap on the features?
I mean, that measures some success, but if you actually ask the business, well, they're actually not necessarily interested in how many stories you finished.
They're interested in the impact on their customers.
So by instrumenting code early and by instrumenting applications and systems early,
and particularly by measuring things like latency and performance from an end-to-end point of view,
you're able to provide the business with, okay, we invested in refactoring this subsystem.
Maybe it's a payment subsystem.
And we cut 30 microseconds
off every transaction. We can see customers going through the system at a faster rate,
and we can see the growth of customers and customer transactions as a result.
And we can see return business based on that experience. And I think that's really valuable
to engineering leaders who are attempting to essentially validate their existence in a way that the business understands.
Because the business doesn't care about my database transaction was 30 microseconds faster.
What they care about is their customer sat numbers where customers are now 75% happy with our service, whereas last quarter they were 70% happy with our service. So if you can tie those business metrics back to the changes you
made in the environment, then you validate your existence in such a way that's positive
and guarantees your continued employment and hopefully your promotion and bonus and
the value of your stock. And I think that's very much a continuation of monitoring. If you're
starting your monitoring and your metric collection all the way into the development side, it doesn't end when you push your code out. It
has to continue out into production. And besides collecting business metrics that the business team
is using, that has to go back to the product owners. That has to go back to the developers
to understand, number one, we put out code, but was that code of quality and of use to the end users? Are the end users using that
code? Is it having a positive or negative impact on the users and their experience? And most
importantly, whatever we put out, let's say it's a new feature, monitoring if users are using it,
and if they're not using it, is it maybe because the performance of it is slow? You can look at
and say, hey, if we speed that up, as you're saying on that back end, do we get greater adoption
or does it not have an impact on adoption?
And maybe this new feature that we put out because we've extended
that monitoring into the actual customer,
we can then suddenly start cutting out features that are not being used.
So I think that's – you talked about silos earlier between operations
and the application teams.
I think there's that other silo that I don't know if we really mentioned back then of that production feedback into this whole cycle of things.
Yeah.
Yeah.
And Brian, to add to this, I mean, this is what I love about kind of our transformation story that happened within our organization, where developers are now actually responsible for production, which means they make the conscious decision when they deploy something in production, but they're also responsible for dealing with the impact.
And their number one impact now, as you said, James, is no longer that a database statement is slow.
The thing that they're looking at now as well is, hey, how many people are now actually using that feature, is it breaking for them or is it usable for them, and, very important for them, how many people are opening up a support ticket afterwards because something breaks.
And so we also found it very useful to kind of interpret DevOps as we're all in one boat, right?
We're all an engineering organization and we need to provide the benefit to the business.
And therefore, developers also have to, you know, be responsible for what they do out there that impacts the business.
And once they started to look at the business metrics that they defined, obviously,
in combination with the business people, within our case, with our product managers, and then looking at them, they saw immediately what type of impact they have.
But what I thought was so cool about it, and I have to quote one of our engineers, and he said, Finally, I can be proud of my features because I immediately see when I push the deploy button how many people are using that feature.
I do not only get feedback from customers when they open up a support ticket and write nasty comments about the bad quality of the code.
I thought that was actually pretty cool.
It only obviously works because we care about monitoring and we monitor the end user and then bring it back all the way to engineering.
So that was pretty cool.
And one thing I want to just take a step back on as well, James: as I was reading your Prometheus article, you had referenced what your inspiration for looking into Prometheus was, which was this post by Sridharan, I hope I'm getting that right. So I went back and read
that as well, which is a fascinating read. We'll put a link up to that as well. But
I wanted to touch slightly upon the difference between data and information, which that kind
of touches upon. Because I think that's an important thing to point out. I think a lot
of us take for
granted information and might not even think of it as information. But, you know, as it was defined
in that article, we're talking about, you know, data being just simply facts and figures. It's
just the data that you're collecting. It's the numbers, it's the metrics, but they're meaningless
unless you're structuring them, unless you're doing something with them to present them to make them meaningful and actionable. So I just want to get your thoughts
on, you know, in terms of, let's say, Prometheus, or in general, what a lot of, you know, I'm not
sure if you're interacting with people, or if you're seeing a lot of the state of monitoring
out there these days. Do you think people are falling into the trap of collecting data?
Or do you think a lot of people, as we're seeing this is becoming more important, are they focusing on information and taking that data that they're collecting and transforming
it into something useful?
Yeah, I think there was an interesting transitionary phase.
Four or five years ago, I think somebody,
I can't remember who it was, somebody at Etsy described
the Etsy monitoring environment as the church of graphs
because they collected a lot of data.
Like pretty much any bit of infrastructure or code, you poked it a little bit, and if it didn't move, you stuck some instrumentation in there and would collect from it.
But I think that for a lot of people, when they sort of cargo-culted that or they sort of modeled that behavior, they were like, I'm going to need to collect everything.
And they didn't make the next logical conclusion, which was to ask themselves, why is Etsy collecting that?
What are they hoping to learn from it?
Which, as you describe, is sort of the information side of the equation. And I think that what's really interestingly happened is that this is actually something that, and I'd be curious to ask a Google SRE who was in the early days,
whether they had this sort of endorsement, but this is something that comes out of the data
management, sort of data analysis world. And Google is obviously well known for the fact that they do a lot of poking at the performance of their platforms, AdWords being a prime example here. They have some fairly complex algorithms, collecting all sorts of information, that help them understand: what is this click worth? Did it reach the right demographic? Have we presented the right ad to the right person? And I think about the sort of data
information question in the same way, in the same way that a data engineer does or a data scientist
does. And that is that, you know, I need to have all of this information, but I can't be buried in
it. I can't be in analysis paralysis. I need to have some
really good questions. Like I need to have some questions that demonstrate to me, like, you know,
what am I trying to understand here? And then one of those might be, you know, how successful are our customers at using our product? Or what is the average sale price for a particular checkout?
How many checkouts are discarded and why?
You know, questions like that that actually sort of directly tie to the success of your business.
And then piece together the right bits of data to say, okay, I can make a hypothesis based on this piece of data. And I've turned it into information, which I can then answer questions about
and then provide to,
whether it be a product community,
an engineering community,
or the business community about like,
these are the sort of decisions strategically
and tactically we should make
to make our business more successful.
And do you see with that, right?
There's a lot of data.
We've heard of ideas of data scientists recently.
And as you mentioned earlier,
you were the CTO for a few companies there. And the role of this information and data,
oftentimes people turn to it when a problem arises. But would you see, or if you were going
to go back into, you know, that full-on tech company CTO side of things again, would you be able to see or justify
hiring people whose job is just really making sense of this information outside of negative
effects going on? Like to say, hey, we have all this information, we should be mining it for
everything we can to find optimizations, to find, you know, obviously to find problems and
resolve issues that are going on. But would it make sense, or would it be justifiable financially, for an organization just to have somebody consuming this data to see what they can mine out of it?
I think so very much so.
I mean, I was previously at Kickstarter.
We had a data team of five people, including a VP of data.
And the reason we did that was because we collected everything, events, we tracked users
through the system, we understood what they visited and how they visited it, how they got
their path from finding a project they liked to backing it to finding another project and so on
and so forth. Understanding that experience and understanding the underlying experiences,
like how long it took them to, say, complete a payment
or how responsive the website was or how fast search results returned,
that piece of data put together allowed us to ask questions like,
you know, what should we work on next?
Is making a recommendation engine better something that is a valuable investment for the engineering and product teams to make?
So I very much think that it's not only justified, but I think that particularly in environments
where you are customer facing and you previously relied on, say, support tickets or outbound product marketing feedback to make product decisions,
that it is much more viable and much more valuable to put together a team of folks whose job it is to look at that data,
reach conclusions, and make recommendations to leadership about what to do.
Great.
That's pretty cool.
Hey, coming back from the end user and the behavior analytics
and to make the business decisions, coming back to data versus information,
what can we do from a monitoring side to not only collect data
but actually to add more context to it?
I know in the blog posts that you referenced and that you wrote,
we talk about the combination of monitoring, log analytics, and tracing to, I think, get more context into data,
like knowing the relationship of the response time of a web service with the disk utilization on a machine it maybe, or hopefully probably, depends on.
Are there any best practices on how we as developers can add more metadata to it, and what do the monitoring tools need to do to actually get more meaning, to transform data into information from the way we collect the data?
Yeah, I think – and the blog post talks a little about this, but taxonomies are important, particularly if we're trying to break down those silos.
So if I have a metric or a log event, I need to know where it comes from. I need to know, when you refer to something, whether you're calling this a payment transaction, or you're generating a rate of some kind, or you're providing me with some piece of information like CPU or memory, that I understand the resolution or the granularity that you're collecting it at, I understand where it comes from, and I understand what systems it will impact, whether that's by attaching metadata or flagging events in a particular way or aggregating events together.
I need to have that sort of consistent view
across my environment.
So someone needs to own that sort of taxonomy of like,
this is what a system looks like from a monitoring perspective.
Here are the things that we follow, the guidelines.
Think of it as the RFC for instrumenting your application, for monitoring your host or your application or middleware or database or whatever it is: the high-level taxonomy that allows us to say, okay, I'm a developer, I care about this particular feature.
How do I aggregate together the data I have to be able to answer the questions I have about, you know, how this feature is performing, or what is the impact of changing this feature or changing the performance of this feature?
And so coming back to Prometheus or tools of that like, does this mean as a developer,
I have to obviously figure out how to collect this data and where to collect it from?
Or is there some smart tooling already out there that does some of the legwork for me? So for instance, why do I as a developer need to figure out how I can correlate something that happens in my microservice to something that happens on a machine I depend on and call because the database sits on there?
Is this something that modern tooling should take care of automatically?
Or is this still something where whoever takes care
of implementing monitoring, giving recommendations,
that these people have to put it into their monitoring strategy?
I think this is a collaborative sort of effort.
Like I don't expect an application developer
who works on a product engine of some kind
to be deeply interested in
the work that an SRE or an ops person does. But I think they need to understand the constraints
of those people and need to understand somewhat of the view of the world of those folks,
the worldview, I guess. And so I think that modern tooling needs to be able to say, okay,
we can impose a taxonomy of kinds, which means that, you know,
I can say that it might be as simple as saying these methods live inside this
service, which, you know, performs this function and is grouped,
you know, it rolls up into this business application of some kind.
And as long as everybody is aggregating their information
or labelling their information in the same way, you know,
we're already a significant step further down the path of being able to say,
ah, these things are interconnected versus the sort of siloed world
where it's like, okay, I'm collecting this piece of information at this granularity,
and I think that some stuff runs on top of this,
like some services and stuff, but I don't have anything to do with that.
That gets deployed by the release team, and, yeah,
occasionally they might call me when I need to kick the box to do something,
but I don't really understand what's happening.
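A sketch of what the shared taxonomy being discussed could look like in practice: every group attaches the same agreed-upon labels to whatever it exposes, so data from different silos can later be grouped and correlated. The label names below are hypothetical examples, not a scheme from the episode or the book:

```python
from prometheus_client import Counter

# One agreed label set used by every team: which business application the
# metric rolls up into, which service and feature it belongs to, and where it runs.
ERRORS = Counter(
    "request_errors_total",
    "Errors observed, labelled with the shared taxonomy",
    ["business_app", "service", "feature", "environment"],
)

# A product team records an error against its feature...
ERRORS.labels(business_app="storefront", service="recommendations",
              feature="related-items", environment="production").inc()

# ...and an infrastructure-owned exporter uses the same label set, so both
# roll up under the same business application when queried or dashboarded.
ERRORS.labels(business_app="storefront", service="postgres",
              feature="none", environment="production").inc()
```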
So, I mean, the reason why I bring up all these questions is
I'm just trying to validate
the way we try to solve the problem with our products, if it is the right approach or not.
And I believe the more I listen to you, I believe we are going down the right path.
And I'm not sure if you know what we are doing exactly these days at Dynatrace. But one way we
try to solve the problem, and I think we solved the problem, is that we have a single agent that we now install on the host level, and that agent automatically monitors every single network connection on that machine, so we know all the dependencies of that host to all the other hosts, but we also know which processes are opening up these connections.
But then we also combine that with our distributed tracing.
So we automatically instrument the application processes.
which is traditionally what APM tools, I think, have been doing pretty well over the last couple of years.
And then we also pull in log information that we capture from the host,
knowing which process at which time wrote which log message.
And on top of that, also adding configuration change events
or any events we can either automatically detect on that machine.
So like something was deployed on that process,
a new Docker container came up,
or something where you can also use our REST API
to tell us what you have just changed.
And that is typically then integrated
with your deployment pipeline.
And I think this helps a lot of our users
to really just,
to not have to think about
how can I correlate all this data?
How can I build up this dependency map?
Because this is what we try to automate.
And so I want to bring this up here.
I'm not sure if you're following what we've been doing.
I just wanted to see, because you are an expert in this space
and you've been working with large-scale production environments
for quite a while.
And I just wanted to get some validation if what we are doing
actually makes sense and if it is the right approach to solving the challenge of modern application monitoring.
Obviously, I don't have a huge amount of insight into how the product works.
But I think broadly speaking, that approach is correct. I think from the point of view of sort of monitoring concerns
and I guess more accurately observability concerns,
understanding what's running on that system,
being able to correlate those events together,
being able to trace those actions through that environment,
and then be able to see factors that influence that host,
as you said, deploying something or upgrading some software
or changing some setting is pretty crucial to sort of providing
that sort of full set of data that allows us to ask those interesting questions.
Cool. And I assume you're mentioning, you're talking about these things also in your book, in the new Prometheus book, how to implement these things?
I'm talking about it obviously at a reasonably high level.
I do cover monitoring architecture and some suggestions around monitoring architecture, and then use Prometheus as an example of how to implement that architecture. I don't prescribe what exactly, in certain circumstances and certain combinations of services or tools, you should be monitoring, because I feel that's something that is a bit more subjective to the environment you work in.
But I'm sort of providing an overview of how do I put these pieces together, and how do I at the end of it come away with a system or a platform that allows me to make those choices and build those sort of monitoring tools and get that sort of understanding of what I need to know about my environment?
That's pretty cool.
Hey, I want to ping you on one more topic because I know I wrote this in the email and then you have an interesting answer.
So I talked about I would like to cover monitoring serverless.
And then you said, well, I don't know why serverless is special, but happy to discuss why it isn't anything different.
And I actually like that.
So can you fill me in a little bit about what your thoughts are on serverless? Well, we talked earlier about sort of the level of abstraction of various services that we might consume.
And Lambda or Azure Functions and their Google equivalent are good examples of this.
Essentially, claiming it's serverless is kind of a misnomer. There are
obviously servers underneath there somewhere. But the level of abstraction that we're seeing
is essentially a thing that we load some code on and we ping transactions at. So to me, I can't
see inside what's happening underneath that box, and maybe I don't care, because what I do see is the
performance of my transactions or the latency of my serverless functions. And that's all I really
need to see. So, you know, to some extent, you know, I can determine whether, you know, I've
optimized the code and I see this particular latency response, like, I don't know why I need special
tools to do that, since I should be measuring that same thing on the top of any of my applications,
even my legacy applications. So to me, serverless just means that certain parts of the system
are not exposed to me. That means I make certain assumptions about that black box,
and some of those assumptions may be bad. I think that a lack of granularity into certain systems is not necessarily always awesome.
But if I'm prepared to adopt that constraint and I'm prepared to say,
assuming that the black box underneath performs in a manner that I'm comfortable with,
then all I need to care about is the layer above that, how that performs, what I want to know about.
And I don't think monitoring tools, sort of modern monitoring tools,
require any special magic to be able to do that.
Yeah, that's correct.
The only thing we see is, I mean, talking about Lambda, for instance, the AWS version
of serverless, the only real way of monitoring Lambda was kind of through CloudWatch, getting
the metrics out there, as you mentioned, throughput, latency, response time.
And it was a little challenging to get end-to-end tracing in, for instance,
because typically Lambda functions are sometimes part of a distributed activity
or business process.
And I think that's some of the technical challenges
the tool vendors face.
How can we make it easy for users of Lambda,
developers that write code,
to get more insight into what's actually really going on and where time is spent
and to which external services they reach out.
And maybe the external services actually add all the latency
or they have problem patterns in their code
that are maybe making too many calls to external services
and therefore just extending the runtime of their function
and therefore price obviously goes up
because we are charged by the execution time of Lambda functions
by Amazon and Microsoft and the like.
So I think what we've seen as a vendor,
we're just trying to figure out how to build tooling
to circumvent some of the technical constraints
to get more detailed information out of these systems?
Yeah, look, I don't disagree.
I think there's an element here of monitoring something by its absence.
Like, you know, if you can trace the transaction
as far as whatever serverless function you're calling
and then trace it coming out the other side
and you can identify that you don't have a latency problem at the beginning and you don't have a latency problem at the end, then probably somewhere in the middle is your latency problem.
Now, obviously, that's not ideal, and you'd love to be more granular and have a better idea of what's happening.
But, you know, to some extent you're constrained by the fact that the public cloud provider is a walled garden; they do want you to play in their garden.
And that means it's in their best interest to have you instrument their services using the monitoring tooling they recommend for their community and their customers. That's not always ideal, particularly if you're looking at things that need a high level of sophistication.
CloudWatch, I'm not a huge fan. I think it could be far more sophisticated than it
is. I recently did a deep dive into CloudWatch Logs. You know, I look at it and
I look at the functionality around it and it's sort of like log processing 1.0, whereas something
like Logstash or Splunk is sort of log processing 5.0.
You know, that doesn't mean that Amazon won't improve on that. And, you know, I'm certainly
one of the people that's given them feedback on that and particularly how it integrates with
things like ECS and probably in future how it will integrate with their Kubernetes service.
You know, I presume that for more sophisticated customers,
particularly as they're focused on the enterprise,
they will take that feedback and either improve those services
or make it easier for customers who fit a certain profile
to be able to get the information they require.
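As an aside on the CloudWatch metrics mentioned above: pulling the high-level Lambda numbers (duration, invocations) out programmatically is straightforward. A minimal sketch with boto3, where the function name is hypothetical and error handling and pagination are omitted:

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

def lambda_duration_stats(function_name: str, hours: int = 1):
    """Fetch average and maximum Duration for one Lambda function."""
    now = datetime.utcnow()
    return cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="Duration",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=300,                      # 5-minute buckets
        Statistics=["Average", "Maximum"],
        Unit="Milliseconds",
    )

print(lambda_duration_stats("checkout-handler"))  # hypothetical function name
```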
Going back to this idea of, and I'll use serverless as the example,
I just wanted to know if you had any thoughts on this.
Obviously, we know for serverless, CloudWatch is going to expose some of those high-level metrics about your processes or your functions that are being run.
And the idea here is for us to just trust Amazon or trust whatever cloud provider you're using to say that the servers that are actually running this code and everything is running fine, right?
It's almost similar to the CDN trust, right? You're supposed to trust your CDN to perform well.
And sometimes they do. And sometimes they don't. Obviously, a company like Amazon or Azure or
Google, that's funny, I just called Microsoft Azure, that's kind of where it's going, isn't it?
Those companies are staking their reputations on that performance. However, it's not even the type of black box where we can
do black box monitoring to see, are those systems up and running? If we see a slowdown in our
function, there is currently no way to find out, is there an infrastructure issue going on at Amazon
that might be impacting that? So with that in context, I mean, there's some information they expose in certain services that you're using and all, but not everything.
In that context, do you think the cloud vendors should have a bit more openness to the monitoring of those functions or those components that are just supposed to be on and performing well, you know, more visibility into that for their customers? Is that important or is it not, or is it just, trust that they're not going to screw it up? But I know it's kind of like black or white or somewhere in between.
It's probably a spectrum here. Like, I'm a big believer in trust but verify.
And I'm also someone who works heavily in the sort of open source world.
I don't believe vendors who say that they have a product or tool that will solve all of your problems.
I think that that trivializes people's problems
and also trivializes the complexity in people's environments,
particularly as we're distributing more complex applications.
So most people I see are consuming services from multiple vendors.
And, you know, to some extent, it is antithetical to those vendors to want to make it easier for people to consume more than one thing, something else in addition to their product.
So, you know, obviously there's some friction involved in being more open, because it obviously doesn't feel like something that, as a first-order concern, should incentivize your product team.
But I do think that ultimately most of those vendors
will have to either mature their internal solutions
to provide the information that customers require or mature the APIs
and software development kits around that infrastructure to allow customers to be able
to make their own choices about how they choose to consume that data or how they choose to
monitor that service.
I suspect that the path of least resistance is the former, but I think Amazon
is aware of the fact that, and so is Azure and Google, that their communities are still heavily
not enterprise customers. They're still heavily folks who write their own, who are software
engineers, and that they need to produce APIs and software development kits that support
the needs of those users.
Great.
Cool.
Hey, James, do you have time for one excursion, one more topic?
Sure.
Cool.
And even though we touched base on it, talking about containers and orchestration of containers: if I go to traditional infrastructure monitoring, right, people worried about, you know, how are the servers doing?
Are they up and running?
And then they alert in case systems fail.
Obviously, in the world of containers where containers come and go, this is no longer the metric.
I think looking at the number of containers that are running is probably not a metric that necessarily makes a lot of sense.
But what actually makes sense?
Can you give me a little insight on what you recommend
on what we actually need to monitor
when we talk about containerized applications?
Yeah, like I think counts of containers is probably fairly pointless.
I don't understand why that would be a viable metric.
Again, sort of with the availability of containers, the model has changed: you no longer measure the availability of individual hosts, but more the availability of a service.
So I think the abstraction has moved up.
So I generally start with looking at, you know,
what does this service do?
What is the prime function of it?
Let's instrument that and measure it.
And then if I identify that there are hiccups in that performance,
I dig down and say, okay, you know,
let's look at this aggregate group of containers.
What's happening here?
Oh, wow, okay.
Memory is exhausted on all of these containers.
We need to double the amount of memory each of these have.
Or you can see that it's constrained by CPU or by disk,
and therefore I need to change the profile of the group of containers
or the definition of the container or the pod in the Kubernetes world
that this service runs on.
And that, to me, is the sort of appropriate level.
So you have the sort of proactive thing,
which is monitoring the high-level performance of the service.
And then you have the reactive thing, which is if you identify a problem,
you have the data that you can dig into to be able to say,
ah, here is the fault or here is the issue and here is a path to resolution.
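A sketch of the two levels James just described, expressed as PromQL queries sent to the Prometheus HTTP API from Python: watch the service-level signal proactively, then dig into the aggregate group of containers when it hiccups. The request metric and its labels are hypothetical, and the container metrics assume the standard cAdvisor names, so treat the exact expressions as illustrations:

```python
import requests

PROM = "http://prometheus:9090"  # hypothetical Prometheus address

def query(expr: str):
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# 1. Proactive: watch what the service does for its users (error ratio).
service_errors = query(
    'sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{service="checkout"}[5m]))'
)

# 2. Reactive: if that hiccups, dig into the aggregate group of containers,
#    e.g. memory and CPU across all containers backing the service.
memory = query('sum(container_memory_usage_bytes{pod=~"checkout-.*"})')
cpu = query('sum(rate(container_cpu_usage_seconds_total{pod=~"checkout-.*"}[5m]))')

print(service_errors, memory, cpu)
```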
And I assume when we talk about performance of a service,
let's say a service endpoint, we're not only talking about response time,
but actually the resource consumption of that service,
meaning how many CPU cycles, how much IO,
because essentially if I deploy a service on a self-scaling environment,
then if the environment just scales up,
that means response time may always stay stable,
but I'm actually throwing more virtual resources on it.
So I guess when we talk about performance,
we really talk about resource consumption
and obviously response time and throughput, correct?
Yeah, I think so.
I think this is like the classic,
I think disk monitoring is the classic sort of metaphor I'd use here.
It's like classic threshold, static threshold-based disk monitoring,
which is like disk reaches 80%, triggers warning alert,
disk reaches 90%, triggers critical alert.
The major fallacy there was that the time it's going to take to actually exhaust the disk is more interesting than whether the disk has reached the threshold.
And I think that same principle applies to service monitoring
on virtual and containerized environments,
is that you have an upward threshold,
which is the capacity or some cost constraint,
and then you are watching auto-scaling happen
or watching resources get consumed,
and you are able to say, okay, over the last 24
hours, I've consumed resources at this rate. By this time tomorrow, I'm going to look like this,
or this time next month, it's going to look like this. That has this cost implication or this
particular consumption implication. I need to make some decisions about what I do next. And that's also cool metrics, actually.
If you then look at resource consumption per throughput,
if you are deploying new versions
or if you're doing some canary releases,
then you can immediately see if a new update
has any resource constraints
or let's say resource impact, right?
Yes, we pushed an update.
Performance-wise, it is still performing the same,
but it consumes that many more resources.
So it's probably too costly to really run this
and roll it out to everyone.
So I think that makes a lot of sense.
That's pretty cool.
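Two small illustrations of the points just made, again as PromQL sent to the Prometheus HTTP API: James' time-to-exhaustion view of disk monitoring, which is typically written with predict_linear, and Andy's resource-consumption-per-throughput comparison for canary releases. The metric names assume node_exporter and cAdvisor, and the version label and thresholds are assumptions for illustration:

```python
import requests

PROM = "http://prometheus:9090"  # hypothetical Prometheus address

def query(expr: str):
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Instead of "disk is at 80%", ask: based on the last 6 hours of growth, will
# the filesystem run out of space within the next 4 hours?
disk_will_fill = query(
    'predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0'
)

# CPU seconds spent per request, broken down by deployment version: a jump for
# the canary version flags a resource regression even if response times look the same.
cpu_per_request = query(
    'sum by (version) (rate(container_cpu_usage_seconds_total{pod=~"checkout-.*"}[5m]))'
    ' / sum by (version) (rate(http_requests_total{service="checkout"}[5m]))'
)

print(disk_will_fill, cpu_per_request)
```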
All right.
Did we miss anything else, James, that you wanted to touch upon? Anything important?
No, I don't think so. I would strongly recommend the blog post we've been talking about. I think it's interesting to more than just SREs and ops people. It really sort of explains to people in the engineering space, you know,
what is monitoring, what is observability, why is it valuable,
what is data, what is information, how to ask these questions.
And just a quick skim of that is well worthwhile.
And I'd also plug Jason Dixon's Monitorama conference.
It has a number of great speakers and all of the videos are up online.
If you are interested in monitoring and observability, it's a great event.
It's being run again in Portland, Oregon in the middle of next year.
And I can't recommend it enough if you're interested in the topic.
Jason's also a really lovely chap, wrote a great book about Graphite.
And certainly, you know, if you're playing in this space, it's worth going to.
And I guess my last plug is I'm also the co-chair of O'Reilly's Velocity Conference.
We run three conferences in San Jose, New York, and London.
But one of the sort of prime tracks we run
is a monitoring and observability track.
We also, you know, focused heavily on distributed systems,
tracing and understanding how to manage these sort of complex systems.
So if you're interested in that sort of stuff,
it's also an event that I would thoroughly recommend.
Yeah, I think we've been at velocity for the last couple of years.
It's really, I can just echo what you said.
A very good event to meet a lot of people that are interested in monitoring,
building resilient systems.
Really cool.
So, Andy, shall we summon the Submariner?
Sure, summon the Submariner.
Let's do it.
All right. So I think, I mean,
it's amazing how many different areas
we covered today.
I think definitely what I learned
is that developers need to take charge
of monitoring, right?
Monitoring is no longer
just something we do in production,
but we need to do it holistically,
end-to-end.
And there's a lot of great tools out there
and a lot of great ways for developers to capture more data. But not only to capture data: I believe what we also learned today is about converting that data into meaningful information by augmenting it with a bit more metadata, by understanding dependencies.
We can highly recommend the blog post, which we will be linking to, to read up on the difference between data and information, as well as the difference between monitoring and observability.
Thanks a lot, James, for all of your insight, also when it comes to the different approaches of, let's say, more modern monitoring, like what we do with container monitoring: monitoring just the existence of a resource and alerting when it goes away is obviously no longer what we should do; instead we should focus on the actual services that deliver business value to our end users.
And I believe to come back to what we mentioned in the very beginning,
the bottom line is we're all building software that typically services our customers. And whether that customer is a real end user sitting in front of a browser or a mobile
app, or whether the customer is using one of our REST APIs, it is a customer.
And if that bottom line is impacted, then we need to make sure we have the right data,
hopefully proactively, to figure out what's wrong.
And the last aspect, I like what you also said about the reason why data analysts are so important: you have to collect a lot of data.
And the reason is we don't yet know all the questions we want to answer with all the data we have.
But if we have more data,
we can sit down and actually ask the right questions
to drive the business,
making the next best business decision,
like which features are we going to implement, what to do next. And I think that's also very important.
Excellent, Andy. And I would add to that, again, just thanking James for taking the time to be
with us today. I probably had one of the most enjoyable show preps that I've had, between reading your blogs and reading some of the other data
that you referenced. It's just been great reading for me. You know, we work in monitoring all the
time, but to see some of these things written out the way they are was just really, really fun for
me. I'd also suggest, a lot of times Andy and I talk about the idea of shifting left and leveling up, not just the shifting left, but more of the leveling up. Obviously you have operations teams and you have the development teams and they do a lot of this intense work. And the important people that you have in the middle who sometimes
feel left out and lost are the testers, whether or not they're the functional testers or the
performance testers or whatever those roles they might play in,
a lot of times the question is, well, what do I do next? How do I level up? How do I
improve what I'm doing and make myself more valuable to the company? And I think getting
into all this monitoring is one of those key areas that you can go into. You know,
just background on myself, James, I was a performance tester before I got into all this.
And to me, that avenue of leveling up has always been the monitoring, taking all this data,
turning it into information and figuring out how to get insights into the stuff early,
how to share this with the other teams. If you're learning as much as you can about monitoring and
observability and the different metrics and what they mean, these are things you can bring back to the other team
members to try to collaborate and build a better infrastructure of monitoring for the entire
organization. So I definitely recommend anybody who has not started looking into this stuff yet
to really just start diving deep into it, because I think, you know, performance monitoring and all this other kind of monitoring is really coming into its own. It's already very important, but I think it's really taking the spotlight these days. And it's really exciting to see that happening.
So again, thanks, James, for being with us today. Awesome. Thank you so much for
having me. All right. Any final words, anybody have anything else? So we're going to put the links to your blogs, your book, some other things on the website.
And obviously you have Twitter. We'll put your Twitter handle up there.
If anybody has any feedback or any questions for us, you can reach us at pureperformance at dynatrace.com
or you can tweet us at pure underscore DT.
And I guess that's it.
Thank you, James.
Thanks.