PurePerformance - 063 Discussing the Unbreakable Delivery Pipeline with Donovan Brown

Episode Date: June 4, 2018

Donovan Brown, Principal DevOps Manager at Microsoft, is back for a second episode on CI/CD & DevOps. We started our discussion around “The role of Monitoring in Continuous Delivery & DevOps” but soon transferred over to our most recent favorite topic, “The Unbreakable Delivery Pipeline”. Listen in and learn more about how monitoring, monitoring as code and automated quality gates can give developers faster and more reliable feedback on the code changes they want to push into production. Also make sure to follow up on Donovan’s road show, where he shows Java developers how to build an end-to-end delivery pipeline in 4 minutes. And let’s all make sure to remind him about the promise he made during the podcast: building a Dynatrace integration into TFS and adopting the “Monitoring as Code” principle.

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's another episode of Pure Performance. My name is Brian Wilson and as always I have with me Andy Grabner, my co-host. Andy, how are you doing today? Pretty good, I'm really good. Sitting here in lovely Boston again, still waiting for spring. It hasn't shown up yet, so, uh, well, it's the same, I keep repeating it. Yeah, if you put some boots on you can at least have a Spring Boot. Bad joke. Um, hey, you know what, you mentioned, you sent an email earlier to me, Andy, and I had no, I
Starting point is 00:00:58 wasn't even paying attention uh by the time this episode's air airs, dear listeners, we'll be beyond our two-year anniversary. Yeah, exactly. Yeah, I was looking back at the speaker content. It looks like the first episode aired on May 5th of 2016. So I had no idea. And I'm really surprised that I haven't said something to get me fired yet. So I'm really happy that we're still here. Yeah.
Starting point is 00:01:23 And not only are we here, but we also have Donovan Brown back on the show. Hey, Donovan, are you with us? Yes, I am here. I'm hearing about how it feels in Boston. And it's miserable here, too, in Houston. It's 75 degrees, and I've had enough of this temperature. Yeah, yeah, yeah. Rub it in.
Starting point is 00:01:38 That's all. It's gorgeous here. It is sunny. It is 75, and it's just beautiful. So, yeah, you know there's places in the United States where it doesn't snow, right? Yeah, I heard about them. Yeah, you can live there, too. We're only hitting the upper 60s today in Denver, so it's, you know.
Starting point is 00:01:57 It's not too bad. No, not too bad. I can live with it. Yeah, for sure. 60s are not too savvy. 75, though, I'll take it. So, Donovan, the reason why we obviously wanted to record more episodes with you is because you are the DevOps master from Microsoft. DevOps advocate, DevOps ninja, DevOps blackbird man, right? And one big piece, I mean, last time we heard from Microsoft DevOps Transformation, and that was pretty cool.
Starting point is 00:02:26 Today, we want to talk more about continuous delivery, building pipelines. And then one thing that is very dear to our heart, especially Brian and mine, is using monitoring earlier in the pipeline and how we can leverage monitoring for early feedback. And I think what people call shift left or maybe they're out of terms. But I think that's what I want to discuss and also get your perspective and opinion on it, how maybe you at Microsoft, you're doing it, but in general, what you're advocating for when it comes to building delivery pipelines with monitoring baked in. Well, I think it's crucial that you have the monitoring baked into your pipeline because the whole goal is to deliver value.
Starting point is 00:03:08 And you can't just add a new feature and assume that you've delivered value. We're always monitoring, even if you're not doing it very well. A lot of us monitor just our bottom line. Did we make money or did we not make money? And they assume that if they made money, then that they're doing things really well. But I think it's more important to understand what were the actions that you took that allowed you to make more money. You need to dig a little bit deeper than just the bottom line. And that's where monitoring comes in.
Starting point is 00:03:33 It allows you to look at the path that the user took through your application. Yeah, maybe you did make a lot more money this quarter, but was it because of the feature that you added? Was it a promo that you were running? Was it you're now higher in search results than you were a month ago? If you don't understand what caused that movement, you can't go do more of that, right? And you're just guessing that, oh yeah, it must've been that cool new feature I told you to add. See, I told you it was a good idea and realizing that absolutely nothing to do with the movement of your bottom line. So I think monitoring is crucial to make sure that you understand and can quantify what was it that we did that made this improvement so that we can
Starting point is 00:04:11 go do more of that. And also answer the question, was the feature that we just delivered, which hopefully if you're doing Agile and Scrum correctly, you're working on the most important thing first. Was it truly the most important thing? Did it really have an impact on our development? And if you don't monitor your application, once it's deployed into production, you're just guessing that it was. But now there's no reason to guess. I mean, you could actually see that, yes, 95% of the people that visited our website used that new feature and that turned into revenue because they added items to their basket that were on that page. And you can start to make sense of the movement and not just looking at the bottom line. So I think it's crucial to be successful. Yeah. So but in this case, you're talking about in obviously monitoring how users react to your new feature, as you said, right? Instead of having an anecdotal fact, well,
Starting point is 00:05:02 instead of having anecdotes, you're actually basing it on facts i think that's what what the key thing is but what about monitoring even earlier before things hit production so one thing that we've been advocating for is actually using monitoring as part of your cicd meaning when you are pushing a new code change through the pipeline already using the same metrics, the same monitoring that it would use later on in production to figure out how heavy is my feature on resource consumption, whether it's memory, whether it is how does garbage collection change, how many log files are we generating, how many database statements are we executing. So looking at some of these metrics early on to give a developer already feedback minutes after he committed the
Starting point is 00:05:51 code to say, hey, your code change has potential impacts because you just increased the number of round trips to the database by 10%. And then combining that with production data where we can see, hey, your feature is actually used by 90% of the people. And if you're now increasing the database round trips by 10%, that translates to that many more round trips to the database. Are you aware of that? Is this something you're also advocating for? Absolutely.
Starting point is 00:06:20 And it's interesting because I don't want to do anything in production for the first time. It should have been tested in QA. It should have been tested in staging. It should be tested in dev. And that includes the telemetry that we're going to collect. I can't assume or ignore it in dev and QA and assume that I'm going to get good numbers out in production. Because I also think of it from a developer's perspective of custom telemetry that you're going to put in as well. That's telemetry that is literally coded into the product. When this action happens,
Starting point is 00:06:46 please send me some type of information that I can aggregate later. I have to test in dev that I'm actually getting the numbers out of the system. And I'm going to distribute that application to my beta testers and my QA testers and have them use the app and then go review those numbers.
Starting point is 00:07:00 It's like, yep, those are the buckets I thought the numbers would be going into. Based on your uses, these numbers look accurate. Perfect. We can now push that out into production. So anything that's going to happen in our production, I don't care if it's monitoring or performance tuning or bug fixes needs to be tested throughout every stage of your pipeline. So I violently agree with what you're saying. You know what I would add to that one too, in terms of testing your telemetry in that development phase, testing the validity of your telemetry in that phase.
Starting point is 00:07:30 So go ahead and collect the data that you think you want to collect. And besides figuring out, as you mentioned, is it collecting the data and is the data telling me what I thought it was going to tell me? Finding out if those metrics are actually useful in helping you make decisions. No, that's great because you have to have a question that you're trying to answer. That's the reason for the telemetry. And a lot of times I get asked a very generic question. Okay, Donovan, we're all excited about telemetry monitoring.
Starting point is 00:07:57 Where should we put it? And I just shrug at them like, I don't know where you should put it. I know where we put it because we were looking to answer a particular question. But don't ask me where the telemetry needs to go. Ask yourself, what is it that we're struggling with it? What is it that we need to know about our system? And then that will tell you what to monitor, how often to monitor it, and where to put that custom telemetry. So I think that's a very personal question. It's like, what should we monitor? Well, it depends on what it is that you're looking for. And you should be looking for something, in in my opinion you shouldn't just put monitoring everywhere because eventually you have
Starting point is 00:08:27 so much data you don't even know what you're looking at or looking for at that point right and then testing that that data is actually useful before you go to production so that you're not sitting there wasting your time with a bunch of metrics and data that aren't adding to you know the success of the project and you can you can test your testing data in a way. Yeah, because you produce, well, we produce petabytes of data. And you don't want that to be wasted space, because that's literally what it is. It's taking up space somewhere, and it's going to make combing through that data even that much more difficult when the volume continues to increase. So you want to collect what you need to be able to answer that question
Starting point is 00:09:04 quickly and efficiently. And you don't want to just be collecting data just for data's sake. You should be looking for something. Yeah, but you could just put it in the cloud. It's free. Yeah, sure. Okay. As long as it's Azure, knock yourself out. So I like that. So basically what you're saying is when it comes to monitoring as a development team, you want to actually define what is relevant for you to have, what piece of information is it that you want to have in the downstream environment, in that is actually valid data and that it's actionable. So that's one aspect of putting kind of making sure that everything is monitored correctly, kind of defining what you want to see, maybe putting in some custom telemetry and then maybe defining the dashboards and educating people downstream. Hey, here are some new things that I want you to monitor. Now, what about another concept that I call it monitoring as code, but I'm not sure if this is some term that I just came up with or maybe somebody else did too. For me,
Starting point is 00:10:12 it is something where if I'm a developer and I build a new feature and then my business team says, well, this feature has to respond in a certain time. It should only cost us so much in terms of how much does it cost to run it on a certain infrastructure. Then these are actually requirements, not functional requirements, but more performance and resource consumption requirements that I could potentially put into a config file, like a JSON file, a YAML file, or even my code, right? And then what I've been advocating, and Donovan, this is where I want to get your feedback on, every time I push a build through the pipeline, and I know that this particular
Starting point is 00:10:55 service that I'm pushing through should be able to respond within 100 milliseconds when it's been hit by 50 TPS, transactions per second, and it should be able to run on, let's say, a particular container with a particular size. And if I have specified this in my config files, then I can take this config file and put in a quality gate in my pipeline and say, okay, every time Andy is pushing a code change, I'm deploying it, I'm running the test,
Starting point is 00:11:23 so similar to 50 TPS, and then I'm looking at the monitoring data and say, okay, how many resources do we need? And what's the response time? And what's the failure rate? And do we have, how many dependencies do we have to other services? So this is what I've been advocating for monitoring as code. So as a developer, I check in these specifications with my source code, and then it can be automatically validated in the pipeline. Is this something that you've also seen other people do? Is this something that actually makes sense, or is this too early in the stage?
Starting point is 00:11:52 What's your take on that? Now, it's interesting hearing you describe that because what it sounded like you were describing at first was just what we normally do with performance testing, right? There's a bottleneck or there's an SLA that we have to adhere to. So we basically stand up a test rig that can generate that load and then verify that we can meet the SLA that we've actually put in place. But you've taken it or it sounds like you're trying to take it a step even further than that, because the building of the rig and the configuring of the thresholds and the alerts would all happen external to the piece of software. I would have another piece of software where I'm going to be doing my performance and load testing. I would set the thresholds. I would configure the test to run, and then I would just go beat this poor little app to death. And then
Starting point is 00:12:31 I would go watch the metrics from that device to determine if I met my SLA. But it sounds like you're trying to put that in code somehow. So then my question would be, what application is reading that config file and then configuring your test rig to then go generate the appropriate load and watch the current metric. So it sounds like an interesting idea, but technically having built rigs before, I'm thinking, okay, I don't know what app you're using
Starting point is 00:12:55 that can read that file and then configure itself to not only generate the load, but then read the right perf mons off of the machine to be able to know, am I looking at database connections? Am I looking at database round trips? Am I looking at CPU utilization? Am I looking at disk utilization? There's so many metrics you have to look at and configure as part of your testing that has nothing to do with the app, right? To set those thresholds. So I'm just kind of curious of what is the product that you're using that can then read from source control the definition of the configuration and test
Starting point is 00:13:26 and then go generate that test for you? Yeah, I mean, in my case, for monitoring, I mean, obviously we are using Dynatrace and I've been using my own JSON file format that I call mon spec monitoring as code and my pipeline itself. So into pipeline, I wrote these integrations now with Node.js. So I
Starting point is 00:13:47 wrote a little Node.js function that is basically reading that property file. And then it is reaching out to the monitoring tool and say, hey, we just deployed this particular version of the app into the into the test environment. We're currently running tests against it. So I'm not standing up the test exactly. I'm not I'm not yet generating the tests but that would actually be the next thing but what i have in this config file as a developer i can say here is my service you can detect this service in the different stages by looking at this metadata so every time when we deploy a service into a different environment whether it's dev test or, we can pass metadata like the stage name or the service name,
Starting point is 00:14:29 and all this gets picked up as a tag, as metadata. So I can actually ask the monitoring tool, give me the response time, the CPU utilization, the number of database queries from this particular service that is running in this particular environment and give it to me from the last 30 minutes when I knew I ran some tests. And then I'm using this and then validate it against what my developer wants me to validate it against. And I can actually either specify, let's say, hard-coded SLA. So if I really have a hard limit, but I can also say,
Starting point is 00:15:02 compare it to a different environment. So for instance, compare it to production. Because if I'm building a continuous delivery pipeline, my point of view is I never want to push something through the pipeline that is resulting in a worse state than we currently have in production. Production should always be my golden standard. So everything I do should be at least the same or improving production. So when I push something through, I can also say, hey, we're pushing it into a testing environment that is on the load and look at the values from this current test and compare it with what's happening currently in production or with a representative timeframe in production and then tell me,
Starting point is 00:15:42 are we getting better or worse? And if it's worse, then stop the pipeline and throw it back to the developers. Man, I love that idea. So I don't know anyone that's doing that right now, but I love that idea. And obviously, I'm starting to see, like, where do I put that in my pipeline? How do I actually configure that? I obviously want an extension in VSTS that can read that config file, you know, and just kind of wire that up for us. So again, to make sure I understand it, the tests are the test. Those have been run and defined outside of this entire environment. There are metrics that you know that you can monitor already from the monitoring tool of choice. And in our case, it's Dynatrace so that you know you have access to the CPU. You know you have access to the memory. You can count the round
Starting point is 00:16:23 chips to the database and the latency and things like that. So you know that that exists. And all you're doing in this config file is saying, I've deployed a new version, I want you to watch these metrics, and these are the thresholds on those metrics. Yeah, or the thresholds, or you can say compare it with a baseline, and the baseline can come from a different environment. So perfect also for blue-green deployments or canary releases. You can say, I just deployed a canary. I let it run for, keep an eye on my canary and only let it in there or tell me how the canary is comparing itself with my current production. And right now, I understand it's a config file, just a JSON format that you're using
Starting point is 00:17:06 to define the metrics, but how are the results being then displayed to the end user? Am I going to a dashboard? Is it part of my CICD summary page? How am I seeing the results?
Starting point is 00:17:18 Yeah, in my case, and it would be great if you actually volunteer to put it for TFS. So I have two implementations, one for one of your competitors they start with an a and with ws i know and there i put the i put the
Starting point is 00:17:34 results in a dynamo db table and then i have a little dashboard on top but also with links back to the dynatrace dashboards if you want to have all the details behind the metrics and then i also just built the same thing for Jenkins where the results will just be a build artifact in Jenkins. Nice. No, yeah, we need to talk about that when I'm there for Dev1 because I would like to actually see that. And, yeah, that's really cool.
Starting point is 00:18:01 That's like to the point where we need to make sure that that works inside of VSTS because that is just, has my brain running right now of all the cool stuff. And we have, we actually have what we call delivery gates built inside of our
Starting point is 00:18:13 release management product. And they can be custom gates that literally will run for as long as you tell them to run, validating whatever you tell them to validate. And if and only if this stays true, will it then say,
Starting point is 00:18:23 okay, it's safe to go to the next environment. Because we deploy, we use safe deployment for release management. So I think we talked about this a little bit in the last show. It goes through several different rings. And historically, we sit it for 24 to 48 hours in each ring as we monitor the things that we find important and if and only if they're good. And what we've done with release gates is we've automated safe deployment because instead of a human being having to go run a query to see if any new bugs have been logged in the last 48 hours, we literally have our tool go run that query for us and see if there's
Starting point is 00:18:53 been any new bugs. You can run arbitrary functions inside of Azure. You can run REST API calls. And we could also wire in something like what you just described. I want you to go run for the next 24 hours this in production and make sure that we don't break any of these SLAs that we have guaranteed for usage. And then if and only if they're green, give us a signal that it's safe to go to QA and staging and all that good stuff. Yeah. So the way when I present this, and I did a meetup this week
Starting point is 00:19:22 in Boston and I did some other presentations, I said as you just said normally we have somebody that knows there's a new build i need to run my tests at the end of the test to look at the dashboards from my let's say uh visual studio load or from my g meter or from my getling or from my new list and then i look at the dashboards and i compare it but we can automate all of that because, you know, we are in 2018. I mean, why do I need to look at dashboards? And instead of knowing, I need to look at response time and failure rate and CPU consumption. I can put these metrics into a config file. And that's what I'm doing.
Starting point is 00:19:56 That's what we automated. And, yeah, check it out. I will, you know, we talk anyway. So I'll show you more, and then hopefully we have you on another episode where you show us or talk about how you integrated the whole thing with TFS. No, no, I think it would be fantastic. And once I know the plumbing, I'm already – in the back of my mind, I'm already teeing up people I'm going to have write the extension for us. So this has really gotten me thinking I really like this idea a lot. So that means what people, the listeners will now know, if they want to get anything done on the Microsoft product side, get Donovan on a podcast.
Starting point is 00:20:33 Get him excited. Get him excited about the idea. I will find the resources to go get it for you. That is a true fact. I'm about to do the same thing for two database deployment technologies that we currently don't support. When I got wind of who they were and what they did, I'm like, holy crap, we need that in VSTS. And we have a group of people called the ALM Rangers that are – they don't work for Microsoft, but they're big Microsoft fans. They're very influential in the community, and they're all technical.
Starting point is 00:21:01 And they will come and fill these kind of gaps for us so if you can get me excited and i can get the rangers excited we can write a lot of cool extensions to make vsts do whatever we wanted to do awesome it's really and i want to and i want to tell you one additional thought i think it's not only about the classical metrics that we look at what i've been advocating for in the um i call it the unbreakable pipeline so the idea is you cannot push something through the pipeline to actually break the user experience of your customers in production. So we break the pipeline somewhere. One thing that I've also been advocating is looking at the number of dependencies of your services. So if I want to treat my microservice like I treat my LinkedIn profile, right?
Starting point is 00:21:42 If I look at my LinkedIn profile, I know how many connections I have. And if I post a link on LinkedIn or share a link, I always get to see how many people viewed this in my first generation of connections and in my second generation or first grade and second grade. I think that's what they call it. So I want to do the same thing for microservices. If I push a microservice through a pipeline and I know that this microservice in the previous builds had one dependency in first grade and this one dependency translated to two dependencies in the second grade, then this is my baseline. Every time I push a change through because I make a co-configuration change, I add in a new third-party library, I a new third-party library i updated a third-party library and all of a sudden the number of dependencies goes from one to two in my first grade and these two translates into 10 in second grade then i should flag this configuration change or code change because maybe this change came in through an unconscious decision, right? We know this. So that's also why I'm stopping the pipeline in case an unintentional change results in more dependencies. No, more dependencies that you're taking on, right?
Starting point is 00:22:55 For example, you added a new NPM package or a new NuGet package, which then had additional dependencies. Is that the number you're trying to track there? I'm trying to track the additional dependencies. Is that the number you're trying to track there? I'm trying to track the dynamic dependencies. So if, you know, I mean, obviously I refer it back to the data that we have on the Dynatrace side, and we see if a service calls another service, if a service calls a database, if a service puts something into a queue,
Starting point is 00:23:17 if a service makes a call to an external service. So these are the dependencies that I'm talking about. Okay. And if I, let's say I'm adding a new third-party library and that third-party library all of a sudden makes remote calls to a new backend service or a database or has an additional round trip to a database that we didn't have before,
Starting point is 00:23:36 then this is an additional dependency that we include. I see. So I'm basically looking at the actual dependencies between two services and how many interactions go on between them for a particular use case. I got you. I got you. Okay. Yeah, because I obviously see every hop adds latency, right? I mean, everyone thinks microservices are this silver bullet that come with no cons, but that's not true.
Starting point is 00:23:58 And by adding – for you taking on another dependency and not realizing that that actually is five more dependencies in the hops that you're taking inside of your microservices infrastructure, you might not be realizing the latency that you're actually adding to your application. Yeah. Now, Andy, in your situation there, let's say you are under the impression that your new third party will add one more dependency. Is that something you would be able to define in your JSON file so that when you're checking, it would see, okay, one new dependency added. That's what we were expecting, so I won't break the pipeline. Or is this something you... So the way I see monitoring as code right now, you have two options.
Starting point is 00:24:39 You can always say compare my current metrics with something else with some something else right with let's say the previous build then it will automatically flag it but you can also say uh i want i have a hard-coded number let's say two as dependencies so then the the pipeline will be green if i have two dependencies but if it's not two if it's one or or three, then it would raise a flag. So yes, you can, you can hard, if you know what it is, how many dependencies you expect, then you can put it in, uh, or you can compare with a different environment or with a baseline. Yeah. But I think that, that, that begs another question because I, let's say I want to compare it to production, but I know that I'm adding a dependency. Production will still be at two.
Starting point is 00:25:24 I've added a third that I wanted to add. It'll fail right how do i get it into production well that's that's why in the in your file you can say i'm actually expecting one one additional so when you when you compare this the dependency numbers we're actually going to accept three or a deviation i see so it's either or but when i'm adding a new so what i would have to do is that like a two-phase deployment or i wouldn't be able to go back to comparing to production until I've already deployed with a specific number as in my config file. I would say no longer compare to production because I know I'm about to break that rule. But I do want to add one more. So to say, okay, you're allowed to have three.
Starting point is 00:25:57 You only have two now. Now you have a third. I'm going to push you into production. And then the next deployment you could switch that flag back to now compare me to production because I'd never expect to go above that. Yeah. Got it. Okay. Got it.
Starting point is 00:26:09 Let me ask another question then in terms of the monitoring. I don't know if this fits into the pipeline build or not, but this came up in a discussion yesterday. I was at a DevOps conference, and I think it's something that monitoring helps with. It fits in somewhere, but maybe not quite sure where. Let's say you have, you know, whatever service you're writing and you test it, it runs well. You put it in production and at a certain point you need more instances of it. So you start spinning up new instances of your service on a specified size VM or whatever it might be. Now, that's all going to work very nicely.
Starting point is 00:26:50 However, the big question comes up, too, is what size instance versus, so you're going to pay for whatever instance size you choose. Your function or your service has a response time as a performance profile. So when in the cycle should we be testing what is the optimal size instance to run your service on for both best performance and best cost so that you can determine to say, hey, we're always going to run it on a medium-sized instance and we'll spin up those ones instead of running it on an extra large every time or something. You'd have to look historically at your, this isn't like a
Starting point is 00:27:34 just a wild swag, right? This is something where you're going to obviously do some performance tuning in the previous environments. Because again, this is where the load testing and performance testing comes in. At some point, you should have some SLA that you're trying to adhere to, right? We want to be able to have a thousand simultaneous users. And you're going to pick a size of a machine that you think is going to work, and then you're going to put it in QA, you're going to run load tests on it, and you're either going to find out that it does or does not do a thousand simultaneous users. And that's where you're going to be able to turn those dials on. Okay, let's try a bigger VM, let's try more memory, let's try faster SSDs instead. And you can play with tweaking that image
Starting point is 00:28:09 or the profile of what you're going to be running, scaling out, not scaling up when you're going to need more load. And then you get to determine, so what's that threshold? When we get to, if we want to do a thousand simultaneous users, when do we spin up that second instance? When we get to 700 current ones or when we go over that threshold? And those are the kind of numbers you start to work on, because again, scaling out is completely different than scaling up. And what you just asked on is, when do we scale up the machine or scale down the machine versus scaling in or out our infrastructure, right? Right. And it's twofold. Point number one from a performance point of view, but point number two from a cost point of view and i guess from from what you're describing it really doesn't sound
Starting point is 00:28:47 like it's part of the the pipeline it's more of the specialized testing in in something like loader performance that you'd be tackling that situation that's what i would historically be doing and i'm kind of interested with with andy's kind of ahead of the curve there on the way that he's comparing some of his other metrics i'm not i'm curious of what you've done in this area as well, because that's something that I would test out in an environment as close to production as I can get. I would know what numbers and targets I'm trying to hit, and then I would go turn the dials until I felt comfortable that I could hit that and scale up and scale out at the appropriate time to make sure I don't drop any users or have a bad user experience. Then I would probably run forward with that until I learned that that was simply no longer sustainable
Starting point is 00:29:27 or our load is so much more drastic and we hit 1,000 so often that we're constantly having this accordion scaling out and scaling in. Maybe it'll be quicker for us to just go ahead and scale up an instance so that we don't do that as often. There's all sorts of different questions I have to ask. And again, there are some monetary concerns there, obviously, because you don't want to be running a ginormous machine that only is doing 100 users for the majority of its life. And then it only spikes every once in a while. So I would look at our user patterns and determine what's the best use of our money and then either scale out
Starting point is 00:29:59 a lot of small devices or just sit on one big one that ends up being cheaper over time. Now, does Azure have any, and Andy, I want to get your take on that as well, but does Azure have, I don't know if any cloud providers do have this at all, I'm just asking in general, does Azure have any sort of API that you can interface to tell you how much your instance is costing at the moment or given a historical cost or is it all just looking at the pricing Well, I know that data is in there and we have an API for pretty much everything. So I've never used it, so I can't say definitively yes that we do. But the fact that that data exists
Starting point is 00:30:29 and almost everything that we do is backed by an API, my gut's saying, I bet you I can go find that data in real time. That would be really interesting. But I haven't done it myself, so I can't say for sure. But it's all available through APIs that we can get access to once you have the right credentials. So I would guess that we could do it, but I've never done it myself.
Starting point is 00:30:52 One thing that I wanted to add, so I think what we are talking about here is predictive capacity planning, right? I mean, you know the capacity that you need for a certain load of a certain component, and I believe what we can do is by keeping a close eye on the resource consumption of your individual services or features in your CICD, if you see, hey, that code change has for this particular endpoint, REST endpoint, means 5% more round trips to the database
Starting point is 00:31:22 or it's writing 5% more logs or it is consuming that much more CPU, then you can obviously, again, correlate that with how often does this feature get hit in production, and then you can kind of predictively or you can factor this into your future capacity planning.
Starting point is 00:31:41 But more importantly, I think the first thing you want to do, if any of these metrics change, raise a flag in the pipeline and then say, is this an intentional change? Do we add more functionality that justifies the additional resource consumption? Or was it unintentional? Was it a bug? Was it a wrong configuration change?
Starting point is 00:32:03 And then obviously it needs to be addressed before it hits production. So I think these are some of the things that I would add here. Yeah, I agree with that. But I think it's a little different than what was asked. I thought the question was, how do I know if I'm running on the right size or not? And doing it the most cost-effective way. Is that not the original question? Yeah, that's the original question.
Starting point is 00:32:23 Yeah, I think in this case, it has to to be also as you said just as we did it historically you need to figure out uh in a special environment you know what's what's the sweet spot uh right now what i've what i've seen though and i've been doing a lot of work these days around you know breaking the monolith into smaller pieces so the reason maybe why you need a big, big box for running a certain monolithic gap is because your monolith just has certain requirements. But yet we know that only certain parts, certain features of that monolith are used on a regular basis, but you still need to provision to all of the resources because maybe some libraries need all that. So what we are trying to do now with our work is to figure out which components within
Starting point is 00:33:08 a monolith are used frequently, what is the resource consumption, what are the dependencies, and then use this data to actually make suggestions on how to break the monolith apart and where to break it apart so that you end up with, let's say, one piece that includes the features that are very often used that you can then run separately from, let's say, other pieces of the previous monoliths that are less often used and maybe even consume more resources, but then you can separate it out. And so I know I'm going into a different direction with my discussion, but I believe the reason why we traditionally provided a lot of resources to handle this particular spike of load is because we had to provision for all the features that were part of the monolith, even though only small parts of the monolith was actually ever utilized on a regular basis. And now breaking it into smaller pieces and then being able to scale up and down these smaller pieces
Starting point is 00:34:07 obviously makes us more efficient, more cost efficient. And that's what we're also trying to help with analyzing the data that we have with our monitoring solution. And it also helps you develop and move faster too because it's much easier to deploy a microservice than a monolith
Starting point is 00:34:25 and we're in that exact same world at microsoft with the visual studio team services product there's portions of it now that are true microservices but it originally began as a monolith called team foundation server it was a one big everything was in there because you installed it on your own hardware and we basically lifted and shifted that into the cloud as it was. And we've slowly started to tease more and more parts away from the monolith and nothing new is added to the monolith. Everything like the release management was a service package management with its own service. And we're working on like teasing apart build and work item tracking, because as you pointed out, we might need lots of build resources and not a lot of work item tracking resources,
Starting point is 00:35:03 but because they're monolith to get one, you got to get them both. And now we're having to scale out bigger machines because they have to be able to sustain a whole new work item tracking, a whole new source control, and a whole new build when all we really needed was more build. But you can't get build without the rest of it. And so it's really interesting to hear you describe that because we're in that exact same cycle right now, figuring out how can we tease apart from this monolith these services and use them as true microservices so that we can scale them up and it's funny because it all comes back to monitoring right because i gotta know which one's the most popular and that all comes back to how do you monitor your
Starting point is 00:35:37 application how do you get the telemetry letting you know which of those services are used most often so you can strategically start to tease those really high volume services apart so that you can now manage them much more efficiently than you do as a monolith. And you can also take, there's two additional takes on it. First of all, you can say, hey, we now know which feature is actually very popular and where we make most of the money. And then maybe this is a good point where you say, all right, that's cool. It's part of the monolith. Let's build a new microservice that is kind of replacing, is going to replace that feature. Instead of extracting it out, maybe you build something new on top using some late technology,
Starting point is 00:36:18 whether it's serverless or microservices. And then just use this as also a way to not extract features from the monolith, but just replace features, you know, one by one until you are at a state where you say, well, now we have all the good features that we know we make money off extracted. We can deploy them independently. And now it's time to get rid of parts of the monolith no for sure yeah and the other point that i wanted to make with monitoring and this is kind of closing the the feedback loop or closing the loop to your initial thought is monitoring uh in production and knowing what people use but also knowing what people don't use right now is very good because if you keep
Starting point is 00:36:59 features along the way right and if you keep dragging them along because somebody- Technical debt. It's technical. Yeah, technical debt and business debt. I call it business debt too, because it's basically, why keep things alive? Because one person thought it was a great idea, but they have only anecdotal data to justify. But now we have real proof with the monitoring data, and let's kick it out, kick out the things we no longer need. And it's really good to be able to make an informed decision. I run a website and there's three different ways to view the core data of this website. So it's basically for people who race their cars.
Starting point is 00:37:34 I race cars for fun. And when I used to go to the track, you have to fill out all this paper. And being a technical guy, I'm like, why am I filling out my name every freaking week that I want to come race? This is stupid. This should be stored somewhere. So I wrote this website that allows you to go register for an event. And I remember everything about you. So registering is just a few clicks and you're done. And the track loves it because now they get these printed reports if they want to print them out or
Starting point is 00:37:55 the data goes directly into their timing system with no error. So it's this great way of monitoring and using your information. But one you can look at is like a traditional calendar. There's a year view and a month view. And I thought to myself, you know what? I'm sick of maintaining this stupid calendar view. I wrote it 15 years ago. I can't imagine anyone's using it. And I was going to just delete it. But before I did that, hold on. Let me go ahead and put some telemetry in here and say, every time someone clicks on calendar, let me know. Every time someone selects month view, let me know. And every time someone selects year view, let me know. Every time someone selects month view, let me know. And every time someone selects year view, let me know. And I let it run for a week.
Starting point is 00:38:29 And I realized that I was about to remove the most popular feature of my site. I was like, holy crap. I cannot imagine how many people I would have upset. Well, I knew exactly the number. It's like 95% of the people use the feature that I thought no one was using anymore. And again, it was just all anecdotal. I never use that feature anymore. I always use the month view. Figured everyone sees this as more valuable than the calendar view. Nope, I was completely wrong. And it saved me from making a huge mistake of removing a feature that I thought no one used, but I had the actual data that said that would have been an enormous mistake. If you really want to, you can get rid of that year view because no one uses that. But here you are maintaining that code
Starting point is 00:39:06 that you thought was valuable. Again, I love that because, again, I'm not guessing anymore. That's what we had been doing for decades was, here's our priorities, this is our product backlog. I know it's perfect, let's just go do the top thing. Really? Do we know if it's perfect? Because I added a feature that clearly no one was using and
Starting point is 00:39:23 wanted to delete a feature that everyone is using because I did not have my priorities correct. So again, monitoring is crucial to being successful when it comes to DevOps. Cool. All right. Hey, having that said, I think this was actually a great discussion about how important monitoring and CI, CD, continuous delivery and DevOps obviously is, right? Because with monitoring, you have the real facts to make informed decisions. And there's obviously different phases of the pipeline with different type of monitoring data or the same data should be used. Any final thoughts before we kind of summarize and wrap it up?
Starting point is 00:40:01 No, this was good. Actually, one other thing I might like to say is that if you are a Pluralsight customer – I'm not an author on Pluralsight, but I watched a show on there recently. And the only reason that made me think about this was you talking about teasing apart a monolith. It was an amazing show. So if you are on there, I think it's Eric Sutton's about modernizing a monolith, right? And I would think it's an amazing watch. So, and it kind of just made me think about that when I heard you talking about teasing apart a monolith. It's an amazing course on adding microservices to a monolith and doing exactly what you and I just described. So if you're a Pluralsight person, I'd highly recommend looking up that course. And it said modernizing. Your ASP.NET app. Yeah. As a matter of fact, I have a channel on there. So if you go to, I'm not an author again, but they call them expert channels. And there's a DevOps expert channel.
Starting point is 00:40:48 That course is in my expert channel. I'll tweet it. So I'm at DonovanBrown on Twitter. If you follow me on Twitter, I'll tweet it after this show airs. So let me know when the show airs, and I will tweet it so everyone can go and find it. Awesome. You can also add it to the podcast notes. Yep.
Starting point is 00:41:04 That sounds good cool so yeah andy i did have one thing i wanted to bring up um one because this came up yesterday at the conference so we spoke about in last episode we spoke about how at least in my opinion as i'm finding out dynatrace is becoming cool again right now uh i mean microsoft is becoming yeah i mean microsoft is always cool awesome well there was that there was that weird teen period when i had all those pimples now um yes how microsoft is like becoming cool again right they have all you know serverless.net core running on linux all these you know all these fun things and so i was i was out giving a demo um at a conference yesterday and it was almost like the class, you know, a high, I'm a Mac,
Starting point is 00:41:49 I'm a PC such that commercial. If you remember that, I'm sure you love those ones. Uh, um, so we had the first person come over and, uh, he, he's stated that he works a lot with, uh, Microsoft products. He's, he's doing a.net on windows. Um, and as we were starting, uh, a Java guy came over and, um, you know, I asked the Microsoft guy, so are you looking to do anything like eventually like moving to.net core, moving microservices on Azure? And he's like, yeah, we're, we're just starting that kind of process. And I said, isn't that so cool how like Microsoft is like really doing all these really cool, amazing things. It's becoming cool again. Right. I said, isn't that so cool how like Microsoft is like really doing all these really cool, amazing things. It's becoming cool again. Right. I said that cause I, I genuinely believe it. And the, the, the Microsoft guy started, you know, nodding his head a little bit as in like,
Starting point is 00:42:34 yeah, it is kind of cool. And the Java guy was like, well, you mean it's cool that they're just finally catching up with everybody. And I just had to bite my tongue i was like oh are you kidding me come on come on give him some credit the opens it's give him some so yeah there's still there's still an attitude but i i think i'm on your side i i think microsoft is definitely getting cool um i just wanted to bring that up because i it was the first time i encountered somebody being snarky about it i was like oh man wow you haven't said that. Then you haven't had that conversation enough yet because everyone I talked to about it gets snarky about it. Right. And there were still the evil empire to a lot of people. If you're, if you're not just coming out of college, I'm 45
Starting point is 00:43:19 years old. I remember the Microsoft that everyone is afraid of, right? Oh, yeah. So, and those scars and those wounds aren't going to heal overnight. Just being the number one contributor to open source in the world is not enough. Having a product that we're releasing that happens to be built on Linux clearly is not enough. Open sourcing.NET, running SQL Server on Linux, running.NET Core everywhere, having Xamarin so you can build any language, any platform, clearly is not enough. I mean, that should be more than enough, right? Because those people who are coming out of college now don't see Microsoft like the Java person that you just spoke to sees Microsoft. And I've noticed in the Java community, I have the hardest time breaking through. I did a talk in South Africa once, and they wanted me to come and keynote some conference and I said okay that's a long trip I'm only coming if you can get me in front of Java developers and they
Starting point is 00:44:10 were like why do you want to talk about Java developers like because we add value to every language and every platform you get me Java meetups and I'll come to your conference and they got me two of them and I had to go in like under like disguise right like you can't come in here and you can't pitch any Microsoft stuff. Like, no, I'm just going to tell you how we made our transformation. It doesn't matter. It works for any language in any platform.
Starting point is 00:44:30 And it's just generic theory stuff. Like, all right, fine. You can come say that. I'm like, all right, great. So I come in and I give basically the transformation talk and there's no pitching there. There's no promo there, but I left myself like 10 minutes at the end.
Starting point is 00:44:41 Like, okay, I just want to ask you one question, Java devs, right? So you're a Java dev. Let me just throw out a scenario for you real quick. And I just want to see how long this is going to take. So imagine that you have nothing on your desktop, no code, no pipeline, no nothing. And you can use all the open source tools that you want. I want you to write me a Spring MVC application. I want you to be running JUnit test. I want you to build an entire CICD pipeline that goes from dev, QA, and prod upon every commit. I want you to be running JUnit test. I want you to build an entire CICD pipeline that goes from dev, QA, and prod upon every commit. I want there to be UI test run during your build. I want SonarQube integrated, and I want there to be approvals between QA and the production.
Starting point is 00:45:16 How long, starting from absolutely nothing, would it take you to build that entire pipeline with a sample app? I've gotten anywhere from four hours to a week as the answer. I was like, that's interesting. Hold on. And then four minutes later, I was done, right? I had built a Java application, Spring MVC, full CICD pipeline and Visual Studio Team Services deploying out into Azure in four minutes, right? And that's what it took for me to finally win them over. I literally had one person's mouth dropped open and did not close. It was like, what just happened to me? I'm like, this is Microsoft. Please stop thinking of us as the people that you hate. We can do this for your languages too. And that, and I've been doing that at every Java meetup I can. Matter of fact, I believe I have an open
Starting point is 00:45:57 source meetup when I get to Austria. And the whole point was so I can do that demo and say, listen, you got to look at Microsoft differently. We're not who you think we are. That's awesome. I love that demo. People are always blown away. Pretty cool. Awesome, awesome. Thanks for having me again.
Starting point is 00:46:15 Yeah. All right. So, Andy, do you want to do the summarization? Let's do it. Yeah. Just a quick summary, as always, I believe, what we discussed and thanks Donovan for agreeing with us that the concept that we call the unbreakable pipeline makes a lot of sense, meaning we're using monitoring data from dev all the way through the different stages into production to make better automated informed decisions about whether a code change is a good code change or a bad code change. You started your discussion on let's first figure out what happens in production, right? And then making decisions on how our code changes actually impact on the bottom line, which is very important.
Starting point is 00:46:55 But I think we can then take it from there and then shift it left. We discussed a little bit about monitoring as code, that concept. I'm sure it can be extended. But great also to hear your commitment that you are dedicating resources from your team to build something like this into TFS. And yeah, I think we all agree that the most important thing is that you have to have
Starting point is 00:47:19 trustable monitoring data. And because only with that you can make really informed fact based decisions and not just anecdotal decisions and we can automate most of the stuff and as you just said we can we have the capabilities now the tools that allow us to build an end-to-end pipeline in four minutes as you
Starting point is 00:47:37 demonstrate with your meetups and so can we maybe it takes a little more than four minutes but it should not take longer but to bake monitoring into the pipeline. Right. And I, thanks for, thanks for being on the show again. Thanks for, by the time this airs, that you have been to Australia and spoke at DEF One.
Starting point is 00:47:57 There's also going to be a DEF One coming up later this year in Detroit that we are hosting. So a little shout out there in October. We are doing a Dev One in our Detroit office. Maybe Donovan, we can convince you to come there as well. There's a lot of developers in Detroit that are interested in hearing that as well. So I'll give you some updates on that. And I'm very much looking forward to continue working with you and promoting everything around automating the pipeline to push changes faster but more safer out to production. So I'm happy to work with you on that and educate the community.
Starting point is 00:48:32 As am I. I appreciate it. Awesome. Well, Donovan, thank you once again for being a guest. You've joined the Two Timers Club, so congratulations to that. Nice. I forget, Andy, we have a three timer already or no, I forget. Have we, have we,
Starting point is 00:48:47 yeah. Okay. So, all right. So I'll be on the show at least four times. Yeah. Yeah. It's a competition.
Starting point is 00:48:52 I'm very competitive. You can have me back as long as I'm number one. I'll keep coming back to stay number one. I'll be on the show at least two more times. Every, so every time somebody ties you, we'll make sure we let you know, Hey,
Starting point is 00:49:03 you know, number one. Yeah. That's let you know hey thanks again and andy hey i know again it's this is a little bit later but hey two years right that's awesome two years uh yeah two years of this podcast and donovan we're so glad you're a part of it you're you're a very enjoyable guest thank you for being thank you so much thank you for having me guys and congratulations congratulations on two years. Thank you. And thanks everyone else. Bye-bye.
