PurePerformance - 063 Discussing the Unbreakable Delivery Pipeline with Donovan Brown
Episode Date: June 4, 2018. Donovan Brown, Principal DevOps Manager at Microsoft, is back for a second episode on CI/CD and DevOps. We started our discussion around “The Role of Monitoring in Continuous Delivery & DevOps” but soon moved on to our current favorite topic, “The Unbreakable Delivery Pipeline”. Listen in and learn how monitoring, monitoring as code, and automated quality gates can give developers faster and more reliable feedback on the code changes they want to push into production. Also make sure to follow Donovan’s road show, where he shows Java developers how to build an end-to-end delivery pipeline in 4 minutes. And let’s all make sure to remind him about the promise he made during the podcast: building a Dynatrace integration into TFS and adopting the “Monitoring as Code” principle.
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's another episode of Pure Performance.
My name is Brian Wilson and as always I have with me Andy Grabner, my co-host.
Andy, how are you doing today?
Pretty good, I'm really good. Sitting here in lovely Boston again, still waiting for spring.
It hasn't shown up yet, so, uh, well, it's the same thing I keep repeating.
Yeah, if you put some boots on, you can at least have a Spring Boot.
Bad joke. Um, hey, you know what, you mentioned, you sent an email earlier to me, Andy, and I had no, I wasn't even paying attention. Uh, by the time this episode airs, dear listeners, we'll be beyond our two-year anniversary.
Yeah, exactly.
Yeah, I was looking back at the speaker content.
It looks like the first episode aired on May 5th of 2016.
So I had no idea.
And I'm really surprised that I haven't said something to get me fired yet.
So I'm really happy that we're still here.
Yeah.
And not only are we here, but we also have Donovan Brown back on the show.
Hey, Donovan, are you with us?
Yes, I am here.
I'm hearing about how it feels in Boston.
And it's miserable here, too, in Houston.
It's 75 degrees, and I've had enough of this temperature.
Yeah, yeah, yeah.
Rub it in.
That's all.
It's gorgeous here.
It is sunny.
It is 75, and it's just beautiful.
So, yeah, you know there's places in the United States where it doesn't snow, right?
Yeah, I heard about them.
Yeah, you can live there, too.
We're only hitting the upper 60s today in Denver, so it's, you know.
It's not too bad.
No, not too bad.
I can live with it.
Yeah, for sure.
60s are not too shabby.
75, though, I'll take it.
So, Donovan, the reason why we obviously wanted to record more episodes with you is because you are the DevOps master from Microsoft.
DevOps advocate, DevOps ninja, DevOps blackbird man, right? And one big piece, I mean, last time we heard about the Microsoft DevOps transformation, and that was pretty cool.
Today, we want to talk more about continuous delivery, building pipelines.
And then one thing that is very dear to our heart, especially Brian and mine, is using monitoring earlier in the pipeline and how we can leverage monitoring for early feedback.
And I think that's what people call shift left, or maybe there are other terms.
But I think that's what I want to discuss and also get your perspective and opinion on it,
how maybe you at Microsoft, you're doing it, but in general, what you're advocating for
when it comes to building delivery pipelines with monitoring baked in.
Well, I think it's crucial that you have the monitoring baked into your pipeline
because the whole goal is to deliver value.
And you can't just add a new feature and assume that you've delivered value.
We're always monitoring, even if you're not doing it very well.
A lot of us monitor just our bottom line.
Did we make money or did we not make money?
And they assume that if they made money, then they're doing things really well.
But I think it's more important to understand what were the actions that you took that allowed you to make more money.
You need to dig a little bit deeper than just the bottom line.
And that's where monitoring comes in.
It allows you to look at the path that the user took through your application.
Yeah, maybe you did make a lot more money this quarter, but was it because of the feature that you added?
Was it a promo that you were running?
Was it you're now higher in search results than you were a month ago? If you don't understand what caused that movement, you can't go do more of that, right?
And you're just guessing that, oh yeah, it must've been that cool new feature I told you to add. See,
I told you it was a good idea, not realizing that it had absolutely nothing to do with the movement of
your bottom line. So I think monitoring is crucial to make sure that
you understand and can quantify what was it that we did that made this improvement so that we can
go do more of that. And also answer the question, was the feature that we just delivered, which
hopefully if you're doing Agile and Scrum correctly, you're working on the most important
thing first. Was it truly the most important thing? Did it really have an impact on our development? And if you don't monitor your application, once it's deployed into production,
you're just guessing that it was. But now there's no reason to guess. I mean, you could actually
see that, yes, 95% of the people that visited our website used that new feature and that turned into
revenue because they added items to their basket that were on that page. And you can start to make sense of the movement and not just looking at the bottom line. So I think it's crucial to be
successful.
Yeah. So, but in this case, you're obviously talking about monitoring how users react to your new feature, as you said, right?
Instead of having an anecdotal fact, well, instead of having anecdotes, you're actually basing it on facts. I think that's the key thing.
But what about monitoring even earlier, before things hit production?
So one thing that we've been advocating for is actually using monitoring as part of your CI/CD, meaning
when you are pushing a new code change through the pipeline already using the same metrics, the same monitoring
that it would use later on in production to figure out how heavy is my feature on resource
consumption, whether it's memory, whether it is how does garbage collection change,
how many log files are we generating, how many database statements are we executing.
So looking at some of these metrics early on to give a developer already feedback minutes after he committed the
code to say, hey, your code change has potential impacts because you just increased the number of
round trips to the database by 10%. And then combining that with production data where we can see, hey, your feature is
actually used by 90% of the people.
And if you're now increasing the database round trips by 10%, that translates to that
many more round trips to the database.
Are you aware of that?
Is this something you're also advocating for?
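A minimal sketch of the kind of shift-left check Andy describes here: compare a metric such as database round trips per request for the current build against the previous build and flag the change if it grows past a budget. The interface, function, and numbers are hypothetical illustrations, not an actual Dynatrace or pipeline API.

```typescript
// Hypothetical shift-left check: flag a build whose database round trips
// per request grew noticeably compared to the previous build.
interface BuildMetrics {
  build: string;
  dbRoundTripsPerRequest: number;
}

function checkDbRoundTrips(current: BuildMetrics, previous: BuildMetrics, maxIncrease = 0.10): void {
  const increase =
    (current.dbRoundTripsPerRequest - previous.dbRoundTripsPerRequest) /
    previous.dbRoundTripsPerRequest;

  if (increase > maxIncrease) {
    // In a real pipeline this would fail the stage and notify the developer.
    throw new Error(
      `Build ${current.build} raised DB round trips by ${(increase * 100).toFixed(1)}% ` +
      `(allowed: ${maxIncrease * 100}%)`
    );
  }
  console.log(`Build ${current.build} is within the allowed DB round-trip budget.`);
}

// Example values, invented for illustration only.
checkDbRoundTrips(
  { build: "1.2.4", dbRoundTripsPerRequest: 11 },
  { build: "1.2.3", dbRoundTripsPerRequest: 10 }
);
```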
Absolutely.
And it's interesting because I don't want to do anything in production for the first
time.
It should have been tested in QA. It should have been tested in staging. It should be tested in dev.
And that includes the telemetry that we're going to collect.
I can't assume or ignore it in dev and QA and assume that I'm going to get good numbers out in production.
Because I also think of it from a developer's perspective of custom telemetry that you're going to put in as well.
That's telemetry that is literally coded into the product.
When this action happens,
please send me some type of information
that I can aggregate later.
I have to test in dev
that I'm actually getting the numbers out of the system.
And I'm going to distribute that application
to my beta testers and my QA testers
and have them use the app
and then go review those numbers.
It's like, yep, those are the buckets
I thought the numbers would be going into.
Based on your usage, these numbers look accurate. Perfect. We can now push that out
into production. So anything that's going to happen in our production, I don't care if it's
monitoring or performance tuning or bug fixes needs to be tested throughout every stage of
your pipeline. So I violently agree with what you're saying. You know what I would add to that
one too, in terms of testing your telemetry in that development phase,
testing the validity of your telemetry in that phase.
So go ahead and collect the data that you think you want to collect.
And besides figuring out, as you mentioned,
is it collecting the data and is the data telling me what I thought it was going to tell me?
Finding out if those metrics are actually useful in helping you make decisions.
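A rough sketch of the custom telemetry being discussed: an event is sent when a particular action happens so it can be aggregated later, and a quick dev-time check confirms the counts land in the buckets you expected. The sendTelemetry transport and the checkout event are stand-ins for whatever SDK and events your product actually uses.

```typescript
// Hypothetical custom telemetry: emit an event when an action happens,
// then verify in dev that the counts fall into the expected buckets.
type TelemetryEvent = { name: string; properties: Record<string, string> };

const collected: TelemetryEvent[] = [];

// Stand-in for a real telemetry SDK call.
function sendTelemetry(event: TelemetryEvent): void {
  collected.push(event);
}

function onCheckout(paymentMethod: string): void {
  // Custom telemetry literally coded into the product, as described above.
  sendTelemetry({ name: "checkout", properties: { paymentMethod } });
}

// Simulated beta/QA usage.
onCheckout("card");
onCheckout("card");
onCheckout("invoice");

// Dev-time validation: are the numbers going into the buckets we expected?
const buckets = new Map<string, number>();
for (const e of collected) {
  const key = e.properties.paymentMethod;
  buckets.set(key, (buckets.get(key) ?? 0) + 1);
}
console.log(Object.fromEntries(buckets)); // { card: 2, invoice: 1 }
```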
No, that's great because you have to have a question that you're trying to answer.
That's the reason for the telemetry.
And a lot of times I get asked a very generic question.
Okay, Donovan, we're all excited about telemetry monitoring.
Where should we put it?
And I just shrug at them like, I don't know where you should put it.
I know where we put it because we were looking to answer a particular question.
But don't ask me where the telemetry needs to go. Ask yourself, what is it that we're struggling with?
What is it that we need to know about our system? And then that will tell you what to monitor,
how often to monitor it, and where to put that custom telemetry. So I think that's a very
personal question. It's like, what should we monitor? Well, it depends on what it is that
you're looking for. And you should be looking for something. In my opinion, you shouldn't just put monitoring everywhere, because eventually you have so much data you don't even know what you're looking at or looking for at that point, right?
And then testing that that data is actually useful before you go to production, so that you're not sitting there wasting your time with a bunch of metrics and data that aren't adding to, you know, the success of the project. You can test your testing data, in a way.
Yeah, because you produce, well, we produce petabytes of data. And you don't want that to
be wasted space, because that's literally what it is. It's taking up space somewhere,
and it's going to make combing through that data even that much more difficult when the volume
continues to increase. So you want to collect what you need to be able to answer that question
quickly and efficiently. And you don't want to just be collecting data just for data's
sake. You should be looking for something. Yeah, but you could just put it in the cloud. It's free.
Yeah, sure. Okay. As long as it's Azure, knock yourself out.
So I like that. So basically what you're saying is, when it comes to monitoring as a development team, you want to actually define what is relevant for you to have, what piece of information you want to have in the downstream environment, and make sure that it is actually valid data and that it's actionable.
So that's one aspect of kind of making sure that everything is monitored correctly: defining what you want to see, maybe putting in some custom telemetry, and then maybe defining the dashboards and educating people downstream.
Hey, here are some new things that I want you to monitor.
Now, what about another concept that I call monitoring as code? I'm not
sure if this is a term that I just came up with or maybe somebody else did too. For me,
it is something where if I'm a developer and I build a new feature and then my business team
says, well, this feature has to respond in a certain time. It should only cost us so much in terms of how much does it cost to run it on a certain
infrastructure.
Then these are actually requirements, not functional requirements, but more performance
and resource consumption requirements that I could potentially put into a config file,
like a JSON file, a YAML file, or even my code, right?
And then what I've been advocating, and Donovan, this is where I want to get your feedback
on, every time I push a build through the pipeline, and I know that this particular
service that I'm pushing through should be able to respond within 100 milliseconds when
it's been hit by 50 TPS, transactions per second, and it should be able to run on, let's say,
a particular container with a particular size.
And if I have specified this in my config files,
then I can take this config file
and put in a quality gate in my pipeline and say,
okay, every time Andy is pushing a code change,
I'm deploying it, I'm running the test,
simulating the 50 TPS,
and then I'm looking at the monitoring data and say, okay, how many resources do we need? And
what's the response time? And what's the failure rate? And do we have, how many dependencies do
we have to other services? So this is what I've been advocating for monitoring as code. So as a
developer, I check in these specifications with my source code, and then it can be automatically
validated in the pipeline.
Is this something that you've also seen other people do?
Is this something that actually makes sense, or is this too early in the stage?
What's your take on that?
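As a hedged illustration of what such a checked-in "monitoring as code" file could look like, here is one possible shape expressed in TypeScript. The field names are invented for this sketch and are not a published schema; Andy describes his own JSON format later in the episode.

```typescript
// Illustrative shape of a per-service "monitoring as code" spec checked in
// next to the source code and evaluated by a pipeline quality gate.
interface MonitoringSpec {
  service: string;
  // How the service can be detected in each environment (tags/metadata).
  tags: Record<string, string>;
  // Either hard limits...
  limits?: {
    responseTimeMs?: number;   // e.g. respond within 100 ms at 50 TPS
    failureRatePercent?: number;
    maxDbRoundTripsPerRequest?: number;
    maxDependencies?: number;
  };
  // ...or "compare me against this environment" baselines.
  compareWith?: {
    environment: string;       // e.g. "production"
    allowedDegradationPercent: number;
  };
}

const monspec: MonitoringSpec = {
  service: "checkout-service",
  tags: { app: "shop", service: "checkout" },
  limits: { responseTimeMs: 100, failureRatePercent: 1 },
  compareWith: { environment: "production", allowedDegradationPercent: 5 },
};

console.log(JSON.stringify(monspec, null, 2));
```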
Now, it's interesting hearing you describe that because what it sounded like you were describing at first was just what we normally do with performance testing, right?
There's a bottleneck or there's an SLA that we have to adhere to. So we basically stand up a test rig that can generate that load and then verify that we can meet the SLA
that we've actually put in place. But you've taken it or it sounds like you're trying to take it a
step even further than that, because the building of the rig and the configuring of the thresholds
and the alerts would all happen external to the piece of software. I would have another piece of
software where I'm going to be doing my performance and load testing. I would set the thresholds. I would
configure the test to run, and then I would just go beat this poor little app to death. And then
I would go watch the metrics from that device to determine if I met my SLA. But it sounds like
you're trying to put that in code somehow. So then my question would be, what application
is reading that config file and then configuring your test rig
to then go generate the appropriate load
and watch the current metric.
So it sounds like an interesting idea,
but technically having built rigs before,
I'm thinking, okay, I don't know what app you're using
that can read that file and then configure itself
to not only generate the load,
but then read the right perf mons off of the machine
to be able to know, am I looking at database connections? Am I looking at database round trips? Am I looking at CPU utilization? Am I looking at
disk utilization? There's so many metrics you have to look at and configure as part of your
testing that has nothing to do with the app, right? To set those thresholds. So I'm just kind
of curious of what is the product that you're using that can then read from source control
the definition of the configuration and test
and then go generate that test for you?
Yeah, I mean, in my case, for monitoring,
I mean, obviously we are using Dynatrace
and I've been using my own JSON file format
that I call monspec, monitoring as code, and my pipeline itself.
So into the pipeline, I wrote these integrations now with Node.js.
So I wrote a little Node.js function that is basically reading that property file. And then it is reaching out to the monitoring tool and saying, hey, we just deployed this particular version of the app into the test environment. We're currently running tests against it.
So I'm not standing up the test exactly. I'm not yet generating the tests, but that would actually be the next thing.
But what I have in this config file, as a developer I can say, here is my service, you can detect this service in the different stages by looking at this metadata.
So every time we deploy a service into a different environment, whether it's dev, test, or production, we can pass metadata like the stage name or the service name, and all this gets picked up as a tag, as metadata.
So I can actually ask the monitoring tool,
give me the response time, the CPU utilization,
the number of database queries from this particular service
that is running in this particular environment and give it to me
from the last 30 minutes when I knew I ran some tests. And then I'm using this and then validate
it against what my developer wants me to validate it against. And I can actually either specify,
let's say, hard-coded SLA. So if I really have a hard limit, but I can also say,
compare it to a different environment. So for instance, compare it to
production. Because if I'm building a continuous delivery pipeline, my point of view is I never
want to push something through the pipeline that is resulting in a worse state than we currently
have in production. Production should always be my golden standard. So everything I do should be at
least the same or improving production. So when I push
something through, I can also say, hey, we're pushing it into a testing environment that is
on the load and look at the values from this current test and compare it with what's happening
currently in production or with a representative timeframe in production and then tell me,
are we getting better or worse? And if it's worse,
then stop the pipeline and throw it back to the developers.
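A minimal sketch of the comparison step Andy just described: pull the metrics for the service in the test stage over the test window, pull the same metrics from production as the baseline, and break the pipeline if the new build is worse. queryMetric is a placeholder with canned numbers, not a real Dynatrace endpoint.

```typescript
// Hypothetical quality-gate step: compare test-stage metrics against a
// production baseline and fail the pipeline on regression.
interface MetricSnapshot {
  responseTimeMs: number;
  failureRatePercent: number;
  dbRoundTripsPerRequest: number;
}

// Placeholder for a call to the monitoring tool's API, selecting the service
// by the tags/metadata attached at deployment time. Canned numbers only.
async function queryMetric(service: string, stage: string, windowMinutes: number): Promise<MetricSnapshot> {
  return stage === "production"
    ? { responseTimeMs: 80, failureRatePercent: 0.5, dbRoundTripsPerRequest: 3 }
    : { responseTimeMs: 82, failureRatePercent: 0.4, dbRoundTripsPerRequest: 3 };
}

async function qualityGate(service: string): Promise<void> {
  const test = await queryMetric(service, "staging", 30);        // last 30 minutes of test traffic
  const baseline = await queryMetric(service, "production", 60); // representative production window

  const worse =
    test.responseTimeMs > baseline.responseTimeMs * 1.05 ||      // small tolerance on response time
    test.failureRatePercent > baseline.failureRatePercent ||
    test.dbRoundTripsPerRequest > baseline.dbRoundTripsPerRequest;

  if (worse) {
    // Stop the pipeline and throw the change back to the developers.
    throw new Error(`${service}: new build is worse than production, breaking the pipeline`);
  }
  console.log(`${service}: gate passed, promoting to the next stage`);
}

qualityGate("checkout-service").catch((err) => console.error(err.message));
```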
Man, I love that idea. So I don't know anyone that's doing that right now,
but I love that idea. And obviously, I'm starting to see, like, where do I put that in my pipeline?
How do I actually configure that? I obviously want an extension in VSTS that can read that config file, you know, and just kind of wire that up for us. So again, to make sure I understand it, the tests are the tests. Those have been run
and defined outside of this entire environment. There are metrics that you know that you can
monitor already from the monitoring tool of choice. And in our case, it's Dynatrace so that you know
you have access to the CPU. You know you have access to the memory. You can count the round
trips to the database and the latency and things like that. So you know that that exists. And all you're doing in this config
file is saying, I've deployed a new version, I want you to watch these metrics, and these are
the thresholds on those metrics. Yeah, or the thresholds, or you can say compare it with a
baseline, and the baseline can come from a different environment. So perfect also for blue-green
deployments or canary releases.
You can say, I just deployed a canary, let it run for a while, keep an eye on my canary, and only let it in there, or tell me how the canary is comparing itself with my current production.
And right now, I understand it's a config file, just a JSON format
that you're using
to define the metrics,
but how are the results
being then
displayed to the end user?
Am I going to a dashboard?
Is it part of my CICD
summary page?
How am I seeing the results?
Yeah, in my case, and it would be great if you actually volunteer to build it for TFS.
So I have two implementations, one for one of your competitors, they start with an A and end with WS, I know.
And there I put the results in a DynamoDB table, and then I have a little dashboard on top, but also with links back to the Dynatrace dashboards if you want to have all the details behind the metrics.
And then I also just built the same thing for Jenkins, where the results will just be a build artifact in Jenkins.
Nice.
No, yeah, we need to talk about that when I'm there for DevOne
because I would like to actually see that.
And, yeah, that's really cool.
That's like to the point where we need to make sure that that works inside of VSTS
because that is just,
has my brain running right now
of all the cool stuff.
And we have,
we actually have
what we call delivery gates
built inside of our
release management product.
And they can be custom gates
that literally will run
for as long as you tell them to run,
validating whatever you tell them
to validate.
And if and only if this stays true,
will it then say,
okay, it's safe to go
to the next environment. Because we deploy, we use safe deployment for release management.
So I think we talked about this a little bit in the last show. It goes through several different
rings. And historically, we let it sit for 24 to 48 hours in each ring as we monitor the things that
we find important and if and only if they're good. And what we've done with release gates is we've
automated safe deployment because instead
of a human being having to go run a query to see if any new bugs have been logged in
the last 48 hours, we literally have our tool go run that query for us and see if there's
been any new bugs.
You can run arbitrary functions inside of Azure.
You can run REST API calls.
And we could also wire in something like what you just described.
I want you to go run for the next 24
hours this in production and make sure that we don't break any of these SLAs that we have guaranteed
for usage. And then if and only if they're green, give us a signal that it's safe to go to QA and
staging and all that good stuff.
Yeah. So the way I present this, and I did a meetup this week in Boston and I did some other presentations, I said, as you just said, normally we have somebody that knows there's a new build, I need to run my tests, and at the end of the test I look at the dashboards from my, let's say, Visual Studio Load Test, or from my JMeter, or from my Gatling, or from my Neoload, and then I look at the dashboards and I compare it. But we can automate all of that because, you know, we are in 2018.
I mean, why do I need to look at dashboards?
And instead of knowing, I need to look at response time and failure rate and CPU consumption.
I can put these metrics into a config file.
And that's what I'm doing.
That's what we automated.
And, yeah, check it out.
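A rough sketch of the automated release-gate idea Donovan and Andy are describing: re-run a check (a work-item query, a REST call, a monitoring query) over a window and only signal "safe to promote" if it stays green the whole time. The check and the timings are hypothetical; VSTS release gates provide this mechanism as a built-in feature rather than hand-rolled code.

```typescript
// Hypothetical release-gate loop: evaluate a check repeatedly over a window
// and only approve promotion if every evaluation comes back green.
type GateCheck = () => Promise<boolean>;

async function releaseGate(check: GateCheck, durationMs: number, intervalMs: number): Promise<boolean> {
  const deadline = Date.now() + durationMs;
  while (Date.now() < deadline) {
    const green = await check();
    if (!green) {
      console.log("Gate failed: holding the release in the current ring.");
      return false;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  console.log("Gate stayed green for the whole window: safe to promote.");
  return true;
}

// Example check: "no new bugs logged since the deployment" -- the count itself
// is a placeholder for a REST call or work-item query in a real gate.
const noNewBugs: GateCheck = async () => {
  const newBugCount = 0; // would come from an API call in a real gate
  return newBugCount === 0;
};

// Short window for demonstration; Donovan talks about 24 to 48 hours per ring.
releaseGate(noNewBugs, 5_000, 1_000);
```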
I will, you know, we talk anyway.
So I'll show you more, and then hopefully we have you on another episode where you show us or talk about how you integrated the whole thing with TFS.
No, no, I think it would be fantastic.
And once I know the plumbing, I'm already – in the back of my mind, I'm already teeing up people I'm going to have write the extension for us.
So this has really gotten me thinking I really like this idea a lot.
So that means what people, the listeners will now know, if they want to get anything done on the Microsoft product side, get Donovan on a podcast.
Get him excited.
Get him excited about the idea.
I will find the resources to go get it for you.
That is a true fact.
I'm about to do the same thing for two database deployment technologies that we currently don't support.
When I got wind of who they were and what they did, I'm like, holy crap, we need that in VSTS.
And we have a group of people called the ALM Rangers that are – they don't work for Microsoft, but they're big Microsoft fans.
They're very influential in the community, and they're all technical.
And they will come and fill these kinds of gaps for us. So if you can get me excited, and I can get the Rangers excited, we can write a lot of cool extensions to make VSTS do whatever we want it to do.
Awesome. It's really, and I want to tell you one additional thought. I think it's not only about the classical metrics that we look at. What I've been advocating for in the, um, I call it the unbreakable pipeline, so the idea is you cannot push something through the pipeline that actually breaks the user experience of your customers in production.
So we break the pipeline somewhere.
One thing that I've also been advocating is looking at the number of dependencies of your services.
So if I want to treat my microservice like I treat my LinkedIn profile, right?
If I look at my LinkedIn profile, I know how many connections I have.
And if I post a link on LinkedIn or share a link, I always get to see how many people viewed this in my first generation of connections and in my second generation, or first degree and second degree.
I think that's what they call it.
So I want to do the same thing for microservices. If I push a microservice through a pipeline and I know that this microservice in the previous builds had one dependency in the first degree, and this one dependency translated to two dependencies in the second degree, then this is my baseline.
Every time I push a change through, because I make a configuration change, I add a new third-party library, I update a third-party library, and all of a sudden the number of dependencies goes from one to two in my first degree, and these two translate into 10 in the second degree, then I should flag this configuration change
or code change because maybe this change came in through an unconscious decision, right? We know this. So that's also why I'm stopping the pipeline in case an unintentional change results in more dependencies.
No, more dependencies that you're taking on, right?
For example, you added a new NPM package or a new NuGet package, which then had additional dependencies.
Is that the number you're trying to track there?
I'm trying to track the dynamic dependencies.
So if, you know, I mean, obviously I refer it back
to the data that we have on the Dynatrace side,
and we see if a service calls another service,
if a service calls a database,
if a service puts something into a queue,
if a service makes a call to an external service.
So these are the dependencies that I'm talking about.
Okay.
And if I, let's say I'm adding a new third-party library
and that third-party library all of a sudden makes remote calls
to a new backend service or a database
or has an additional round trip to a database
that we didn't have before,
then this is an additional dependency that we include.
I see.
So I'm basically looking at the actual dependencies
between two services and how many interactions go on
between them for a particular use case.
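A small sketch of the dependency counting behind the LinkedIn analogy: first-degree dependencies are what a service calls directly, second-degree are what those dependencies call in turn, and a change against the previous build raises a flag. The call graph here is invented for illustration.

```typescript
// Hypothetical dependency check: count first- and second-degree dependencies
// of a service from an observed call graph and compare with the last build.
type CallGraph = Record<string, string[]>; // service -> services/databases/queues it calls

function dependencyCounts(graph: CallGraph, service: string): { first: number; second: number } {
  const first = graph[service] ?? [];
  const second = new Set<string>();
  for (const dep of first) {
    for (const transitive of graph[dep] ?? []) {
      if (transitive !== service && !first.includes(transitive)) {
        second.add(transitive);
      }
    }
  }
  return { first: first.length, second: second.size };
}

// Invented call graph, e.g. as derived from monitoring data.
const observed: CallGraph = {
  frontend: ["checkout-service"],
  "checkout-service": ["payment-service", "orders-db"],
  "payment-service": ["fraud-service"],
};

const current = dependencyCounts(observed, "frontend"); // { first: 1, second: 2 }
const previousBuild = { first: 1, second: 2 };          // baseline from the last build

if (current.first !== previousBuild.first || current.second !== previousBuild.second) {
  console.warn("Dependency count changed: was this an intentional change?");
} else {
  console.log("Dependencies unchanged compared to the previous build.");
}
```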
I got you. I got you. Okay.
Yeah, because I obviously see every hop adds latency, right?
I mean, everyone thinks microservices are this silver bullet that come with no cons, but that's not true.
And by adding – for you taking on another dependency and not realizing that that actually is five more dependencies
in the hops that you're taking inside of your microservices infrastructure, you might not be realizing the latency that you're actually adding to your application.
Yeah.
Now, Andy, in your situation there, let's say you are under the impression that your new third party will add one more dependency. Is that something you would be able to define in your JSON file
so that when you're checking, it would see, okay, one new dependency added.
That's what we were expecting, so I won't break the pipeline.
Or is this something you...
So the way I see monitoring as code right now, you have two options.
You can always say, compare my current metrics with something else, right, with, let's say, the previous build, and then it will automatically flag it.
But you can also say, uh, I have a hard-coded number, let's say two, as dependencies. So then the pipeline will be green if I have two dependencies, but if it's not two, if it's one or three, then it would raise a flag.
So yes, you can hard-code it. If you know how many dependencies you expect, then you can put it in, uh, or you can compare with a different environment or with a baseline.
Yeah. But I think that, that, that begs another question because I, let's say I want to compare
it to production, but I know that I'm adding a dependency. Production will still be at two.
I've added a third that I wanted to add. It'll fail, right? How do I get it into production?
Well, that's why in your file you can say, I'm actually expecting one additional. So when you compare the dependency numbers, we're actually going to accept three, or a deviation.
I see. So it's either or. But when I'm adding a new one, so what I would have to do is that like a two-phase deployment, or I wouldn't be able to go back to comparing to production until I've already deployed with the specific number in my config file.
I would say no longer compare to production because I know I'm about to break that rule.
But I do want to add one more.
So to say, okay, you're allowed to have three.
You only have two now.
Now you have a third.
I'm going to push you into production.
And then the next deployment you could switch that flag back to now compare me to production because I'd never expect to go above that.
Yeah.
Got it.
Okay.
Got it.
Let me ask another question then in terms of the monitoring.
I don't know if this fits into the pipeline build or not, but this came up in a discussion yesterday.
I was at a DevOps conference, and I think it's something that monitoring helps with.
It fits in somewhere, but maybe not quite sure where.
Let's say you have, you know, whatever service you're writing and you test it, it runs well.
You put it in production and at a certain point you need more instances of it.
So you start spinning up new instances of your service on a specified size VM or whatever it might be.
Now, that's all going to work very nicely.
However, the big question that comes up, too, is what size instance, because you're going to pay for whatever instance size you choose.
Your function or your service has a response time, a performance profile.
So when in the cycle should we be testing what is the optimal size instance
to run your service on for both best performance and best cost
so that you can determine to say, hey, we're always going to run it on a medium-sized instance
and we'll spin up those ones instead of running it on an extra
large every time or something.
You'd have to look historically at your... this isn't just a wild swag, right? This is something where you're going to obviously do some performance
just a wild swag, right? This is something where you're going to obviously do some performance
tuning in the previous environments. Because again, this is where the load testing and
performance testing comes in. At some point, you should have some SLA that you're trying to adhere to, right?
We want to be able to have a thousand simultaneous users. And you're going to pick a size of a
machine that you think is going to work, and then you're going to put it in QA, you're going to run
load tests on it, and you're either going to find out that it does or does not do a thousand
simultaneous users. And that's where you're going to be able to turn those dials on. Okay, let's try
a bigger VM, let's try more memory, let's try faster SSDs instead. And you can play with tweaking that image
or the profile of what you're going to be running, scaling out, not scaling up when you're going to
need more load. And then you get to determine, so what's that threshold? When we get to, if we want
to do a thousand simultaneous users, when do we spin up that second instance? When we get to 700
concurrent ones or when we go over that threshold? And those are the kinds of numbers you start to work on,
because again, scaling out is completely different than scaling up. And what you just asked on is,
when do we scale up the machine or scale down the machine versus scaling in or out our
infrastructure, right? Right. And it's twofold. Point number one from a performance point of view,
but point number two from a cost point of view. And I guess, from what you're describing, it really doesn't sound like it's part of the pipeline. It's more the specialized testing, in something like load or performance testing, where you'd be tackling that situation.
That's what I would historically be doing, and I'm kind of interested, with Andy kind of ahead of the curve there on the way that he's comparing some of his other metrics, I'm curious what you've done in this area as well, because that's something that I would test out in an
environment as close to production as I can get. I would know what numbers and targets I'm trying
to hit, and then I would go turn the dials until I felt comfortable that I could hit that and scale
up and scale out at the appropriate time to make sure I don't drop any users or have a bad user
experience. Then I would probably run forward with that until I learned that that was simply no longer sustainable
or our load is so much more drastic and we hit 1,000 so often
that we're constantly having this accordion scaling out and scaling in.
Maybe it'll be quicker for us to just go ahead and scale up an instance so that we don't do that as often.
There's all sorts of different questions I have to ask.
And again, there are some monetary concerns there,
obviously, because you don't want to be running a ginormous machine that only is doing 100 users
for the majority of its life. And then it only spikes every once in a while. So I would look
at our user patterns and determine what's the best use of our money and then either scale out
a lot of small devices or just sit on one big one that ends up being cheaper over time.
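A toy sketch of the scale-out versus scale-up arithmetic being discussed: given a per-instance capacity established by load testing and a threshold for spinning up the next instance, how many instances does a given load need, and how does that cost compare with one big machine? All capacities and prices below are made up purely to illustrate the calculation.

```typescript
// Toy capacity/cost comparison: scale out small instances vs. scale up one big one.
// All capacities and hourly prices are invented for illustration.
const smallInstance = { capacityUsers: 1000, pricePerHour: 0.20 };
const largeInstance = { capacityUsers: 4000, pricePerHour: 0.90 };
const scaleOutThreshold = 0.7; // spin up the next instance at 70% of capacity (e.g. 700 of 1000 users)

function instancesNeeded(concurrentUsers: number, capacity: number, threshold: number): number {
  // Scale out early enough that no instance runs past the threshold.
  return Math.max(1, Math.ceil(concurrentUsers / (capacity * threshold)));
}

const load = 2500; // concurrent users

const smallCount = instancesNeeded(load, smallInstance.capacityUsers, scaleOutThreshold);
const scaleOutCost = smallCount * smallInstance.pricePerHour;
const scaleUpCost = largeInstance.pricePerHour;

console.log(
  `${smallCount} small instances at $${scaleOutCost.toFixed(2)}/h vs. one large at $${scaleUpCost.toFixed(2)}/h`
);
```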
Now, does Azure have any, and Andy, I want to get your take on that as well, but does
Azure have, I don't know if any cloud providers do have this at all, I'm just asking in general,
does Azure have any sort of API that you can interface to tell you how much your instance
is costing at the moment, or give you a historical cost, or is it all just looking at the pricing?
Well, I know that data is in there and we have an API for pretty much everything.
So I've never used it, so I can't say definitively yes that we do.
But the fact that that data exists
and almost everything that we do is backed by an API,
my gut's saying, I bet you I can go find that data in real time.
That would be really interesting.
But I haven't done it myself, so I can't say for sure.
But it's all available through APIs that we can get access to
once you have the right credentials.
So I would guess that we could do it,
but I've never done it myself.
One thing that I wanted to add,
so I think what we are talking about here
is predictive capacity planning, right?
I mean, you know the capacity that you need
for a certain load of a certain component,
and I believe what we can do is by keeping a close eye on the resource consumption of
your individual services or features in your CICD, if you see, hey, that code change has
for this particular endpoint, REST endpoint, means 5% more round trips to the database
or it's writing 5% more logs or it is consuming that much more
CPU, then you
can obviously, again, correlate
that with how often does this feature get
hit in production, and then
you can kind of predictively
or you can factor this into your
future capacity planning.
But more importantly,
I think the first thing you want
to do, if any of these metrics change,
raise a flag in the pipeline and then say, is this an intentional change?
Do we add more functionality that justifies the additional resource consumption?
Or was it unintentional?
Was it a bug?
Was it a wrong configuration change?
And then obviously it needs to be addressed before it hits production.
So I think these are some of the things that I would add here.
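A tiny arithmetic sketch of the predictive capacity planning Andy mentions: take the per-request delta seen in CI for one endpoint and multiply it by how often that endpoint is actually hit in production to see the absolute impact. The traffic numbers are invented.

```typescript
// Toy projection: translate a per-request increase seen in CI into
// absolute extra work in production, using invented traffic numbers.
const dbRoundTripsPerRequestBefore = 20;
const increasePercent = 5;                 // +5% round trips measured for this code change
const productionRequestsPerHour = 360_000; // how often the endpoint is hit in production

const extraRoundTripsPerRequest = dbRoundTripsPerRequestBefore * (increasePercent / 100);
const extraRoundTripsPerHour = extraRoundTripsPerRequest * productionRequestsPerHour;

console.log(
  `+${extraRoundTripsPerRequest} DB round trips per request = ` +
  `${extraRoundTripsPerHour.toLocaleString()} extra round trips per hour in production`
);
// Raise a flag in the pipeline so the team can decide whether the change is intentional.
```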
Yeah, I agree with that.
But I think it's a little different than what was asked.
I thought the question was, how do I know if I'm running on the right size or not?
And doing it the most cost-effective way.
Is that not the original question?
Yeah, that's the original question.
Yeah, I think in this case it also has to be, as you said, just as we did it historically: you need to figure out, uh, in a special environment, you know, what's the sweet spot.
Right now, what I've seen though, and I've been doing a lot of work these days around, you know, breaking the monolith into smaller pieces, so the reason maybe why you need a big, big box for running a certain monolithic app is because your monolith just has certain requirements. But yet we know
that only certain parts, certain features of that monolith are used on a regular basis,
but you still need to provision to all of the resources because maybe some libraries need all
that. So what we are trying to do now with our work is to figure out which components within
a monolith are used frequently, what is the resource consumption, what are the dependencies,
and then use this data to actually make suggestions on how to break the monolith apart and where to
break it apart so that you end up with, let's say, one piece that includes the features that are
very often used that you can then run separately from, let's say, other pieces of the previous
monoliths that are less often used and maybe even consume more resources, but then you
can separate it out.
And so I know I'm going into a different direction with my discussion, but I believe the reason why we traditionally provided a lot of resources to handle this particular spike of load is because we had to provision for all the features that were part of the monolith, even though only small parts of the monolith was actually ever utilized on a regular basis. And now breaking it into smaller pieces and then being able to scale up and down
these smaller pieces
obviously makes us more efficient,
more cost efficient.
And that's what we're also trying to help
with analyzing the data that we have
with our monitoring solution.
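A small sketch of the kind of analysis Andy describes for breaking up a monolith: rank components by how often they are used and how much resource they consume, based on monitoring data, to suggest extraction candidates. The component figures are invented.

```typescript
// Toy ranking of monolith components as extraction candidates,
// based on invented usage and resource-consumption figures.
interface Component {
  name: string;
  callsPerMinute: number; // how frequently the component is used
  cpuShare: number;       // fraction of the monolith's CPU it consumes
}

const components: Component[] = [
  { name: "search",    callsPerMinute: 12_000, cpuShare: 0.40 },
  { name: "checkout",  callsPerMinute: 3_000,  cpuShare: 0.15 },
  { name: "reporting", callsPerMinute: 50,     cpuShare: 0.30 },
  { name: "admin",     callsPerMinute: 5,      cpuShare: 0.05 },
];

// Components that are both hot and expensive benefit most from scaling
// independently, so rank by usage multiplied by resource share.
const candidates = [...components]
  .sort((a, b) => b.callsPerMinute * b.cpuShare - a.callsPerMinute * a.cpuShare)
  .map((c) => c.name);

console.log("Suggested extraction order:", candidates);
```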
And it also helps you develop and move faster too
because it's much easier to deploy a microservice
than a monolith
And we're in that exact same world at Microsoft with the Visual Studio Team Services product.
There are portions of it now that are true microservices, but it originally began as a monolith called Team Foundation Server. It was one big thing, everything was in there, because you installed it on your own hardware, and we basically lifted and shifted that into the cloud as it was. And we've
slowly started to tease more and more parts away from the monolith and nothing new is added to the
monolith. Everything like the release management was its own service, package management with its own service. And we're working on, like, teasing apart build and work item tracking, because as you
pointed out, we might need lots of build resources and not a lot of work item tracking resources,
but because they're a monolith, to get one, you've got to get them both.
And now we're having to scale out bigger machines because they have to be able to sustain a whole new work item tracking,
a whole new source control, and a whole new build when all we really needed was more build.
But you can't get build without the rest of it.
And so it's really interesting to hear you describe that because we're in that exact same cycle right now,
figuring out how can we tease apart from this monolith these services and use them as true microservices
so that we can scale them up. And it's funny, because it all comes back to monitoring, right? Because I gotta know which one's the most popular. And that all comes back to how do you monitor your application, how do you get the telemetry letting you know which of those services are used most often, so you can strategically start to tease those really high-volume services apart, so that you can now manage them much more efficiently than you do as a monolith.
And you can also take, there's two additional takes on it. First of all,
you can say, hey, we now know which feature is actually very popular and where we make most of
the money. And then maybe this is a good point where you say, all right, that's cool.
It's part of the monolith.
Let's build a new microservice that is kind of replacing, is going to replace that feature.
Instead of extracting it out, maybe you build something new on top using some of the latest technology,
whether it's serverless or microservices.
And then just use this as also a way to not extract features from the monolith,
but just replace features, you know, one by one until you are at a state where you say, well,
now we have all the good features that we know we make money off extracted. We can deploy them
independently. And now it's time to get rid of parts of the monolith.
No, for sure.
Yeah, and the other point that I wanted to make with monitoring, and this is kind of closing the feedback loop, or closing the loop to your initial thought, is monitoring, uh, in production and knowing what people use, but also knowing what people don't use right now, is very good. Because if you keep features along the way, right, and if you keep dragging them along because somebody...
Technical debt.
It's technical, yeah, technical debt and business debt. I call it business debt too,
because it's basically, why keep things alive? Because one person thought it was a great idea,
but they have only anecdotal data to justify. But now we have real proof with the monitoring data,
and let's kick it out, kick out the things we no longer need.
And it's really good to be able to make an informed decision.
I run a website and there's three different ways to view the core data of this website.
So it's basically for people who race their cars.
I race cars for fun.
And when I used to go to the track, you have to fill out all this paper.
And being a technical guy, I'm like, why am I filling out my name every freaking week that I want to come race?
This is stupid.
This should be stored somewhere.
So I wrote this website that allows you to go register for an event.
And I remember everything about you. So registering is just a few clicks and you're done.
And the track loves it because now they get these printed reports if they want to print them out or
the data goes directly into their timing system with no error. So it's this great way of monitoring
and using your information. But one you can look at is like a traditional calendar.
There's a year view and a month view. And I thought to myself, you know what? I'm sick of
maintaining this stupid calendar view. I wrote it 15 years ago. I can't imagine anyone's using it.
And I was going to just delete it. But before I did that, hold on. Let me go ahead and put some
telemetry in here and say, every time someone clicks on calendar, let me know. Every time
someone selects month view, let me know. And every time someone selects year view, let me know. And I let it run for a week.
And I realized that I was about to remove the most popular feature of my site. I was like,
holy crap. I cannot imagine how many people I would have upset. Well, I knew exactly the number.
It's like 95% of the people use the feature that I thought no one was using anymore.
And again, it was just all anecdotal. I never use that feature anymore. I always use the month view.
Figured everyone sees this as more valuable than the calendar view. Nope, I was completely wrong.
And it saved me from making a huge mistake of removing a feature that I thought no one used,
but I had the actual data that said that would have been an enormous mistake. If you really want to, you can get rid of that year view because no one uses that.
But here you are maintaining that code
that you thought was valuable.
Again, I love that because,
again, I'm not guessing anymore.
That's what we had been doing for decades was,
here's our priorities, this is our product backlog.
I know it's perfect, let's just go do the top thing.
Really? Do we know if it's perfect?
Because I added a feature that clearly no one was using and
wanted to delete a feature that everyone is using because I did not have my priorities correct. So again, monitoring
is crucial to being successful when it comes to DevOps. Cool. All right. Hey, having that said,
I think this was actually a great discussion about how important monitoring and CI, CD,
continuous delivery and DevOps obviously is, right?
Because with monitoring, you have the real facts to make informed decisions.
And there's obviously different phases of the pipeline with different type of monitoring data
or the same data should be used.
Any final thoughts before we kind of summarize and wrap it up?
No, this was good.
Actually, one other thing I might like to say is that if you are a Pluralsight customer – I'm not an author on Pluralsight, but I watched a show on there recently.
And the only reason that made me think about this was you talking about teasing apart a monolith. It was an amazing show. So if you are on there, I think it's Eric Sutton's course about modernizing a monolith, right? I think it's an amazing watch. It's an amazing course
on adding microservices to a monolith and doing exactly what you and I just described. So if
you're a Pluralsight person, I'd highly recommend looking up that course. And it said modernizing.
Your ASP.NET app. Yeah. As a matter of fact, I have a channel on there. So if you go to,
I'm not an author again, but they call them expert channels.
And there's a DevOps expert channel.
That course is in my expert channel.
I'll tweet it.
So I'm at DonovanBrown on Twitter.
If you follow me on Twitter, I'll tweet it after this show airs.
So let me know when the show airs, and I will tweet it so everyone can go and find it.
Awesome.
You can also add it to the podcast notes.
Yep.
That sounds good
Cool. So, yeah, Andy, I did have one thing I wanted to bring up, um, because this came up yesterday at the conference.
So in the last episode we spoke about how, at least in my opinion, as I'm finding out, Dynatrace is becoming cool again right now, uh, I mean, Microsoft is becoming...
Yeah, I mean, Microsoft is always cool.
Awesome. Well, there was that weird teen period when I had all those pimples.
Now, um, yes, how Microsoft is, like, becoming cool again, right? They have all, you know, serverless, .NET Core running on Linux, all these, you know, all these fun things.
And so I was out giving a demo, um, at a conference yesterday, and it was almost like the classic, you know, "I'm a Mac, I'm a PC" commercial. If you remember that, I'm sure you love those ones. Uh, um,
so we had the first person come over and, uh, he, he's stated that he works a lot with, uh,
Microsoft products. He's, he's doing .NET on Windows. Um, and as we were starting, uh, a Java guy came over
and, um, you know, I asked the Microsoft guy, so are you looking to do anything like eventually
like moving to .NET Core, moving microservices on Azure? And he's like, yeah, we're, we're just
starting that kind of process. And I said, isn't that so cool how like Microsoft is like
really doing all these really cool, amazing things? It's becoming cool again, right? I said that 'cause I, I genuinely believe it.
And the, the, the Microsoft guy started, you know, nodding his head a little bit as in like,
yeah, it is kind of cool. And the Java guy was like, well, you mean it's cool that they're
just finally catching up with everybody. And I just had to bite my tongue. I was like, oh, are you kidding me? Come on, come on, give them some credit, the open source, give them some...
So yeah, there's still an attitude, but I think I'm on your side. I think Microsoft is definitely getting cool. Um, I just wanted to bring that up because it was the first time I encountered somebody being snarky about it. I was like, oh man, wow.
You haven't seen that? Then you haven't had that conversation enough yet, because everyone I talk to about it gets snarky about it, right? And we're still the evil empire to a lot of people. If you're not just coming out of college, I'm 45
years old. I remember the Microsoft that everyone is afraid of, right? Oh, yeah. So, and those scars and those wounds aren't going to heal overnight.
Just being the number one contributor to open source in the world is not enough.
Having a product that we're releasing that happens to be built on Linux clearly is not enough.
Open sourcing .NET, running SQL Server on Linux, running .NET Core everywhere, having Xamarin so you can build any language, any platform, clearly is not enough. I mean, that should be more than enough, right? Because those people
who are coming out of college now don't see Microsoft like the Java person that you just
spoke to sees Microsoft. And I've noticed in the Java community, I have the hardest time
breaking through. I did a talk in South Africa once, and they wanted me to come and keynote some conference, and I said, okay, that's a long trip, I'm only coming if you can get me in front of Java developers. And they were like, why do you want to talk to Java developers? Like, because we add value to every language and every platform. You get me Java meetups and I'll come to your conference. And they got me two of them, and I had to go in, like, under disguise, right? Like, you can't come in here and you can't pitch any Microsoft stuff.
Like, no, I'm just going to tell you
how we made our transformation.
It doesn't matter.
It works for any language in any platform.
And it's just generic theory stuff.
Like, all right, fine.
You can come say that.
I'm like, all right, great.
So I come in and I give basically the transformation talk
and there's no pitching there.
There's no promo there,
but I left myself like 10 minutes at the end.
Like, okay, I just want to ask you one question,
Java devs, right?
So you're a Java dev. Let me just throw out a scenario for you real quick. And I just want to see how long this is going to take. So imagine that you have nothing on your desktop,
no code, no pipeline, no nothing. And you can use all the open source tools that you want.
I want you to write me a Spring MVC application. I want you to be running JUnit test. I want you
to build an entire CICD pipeline that goes from dev, QA, and prod upon every commit. I want there to be UI tests run during your build.
I want SonarQube integrated, and I want there to be approvals between QA and the production.
How long, starting from absolutely nothing, would it take you to build that entire pipeline with a
sample app? I've gotten anywhere from four hours to a week as the answer.
I was like, that's interesting. Hold on. And then four minutes later, I was done, right? I had built
a Java application, Spring MVC, full CICD pipeline and Visual Studio Team Services deploying out into
Azure in four minutes, right? And that's what it took for me to finally win them over. I literally
had one person's mouth dropped open and did not close. It was like, what just happened to me? I'm like, this is Microsoft. Please
stop thinking of us as the people that you hate. We can do this for your languages too. And that,
and I've been doing that at every Java meetup I can. Matter of fact, I believe I have an open
source meetup when I get to Austria. And the whole point was so I can do that demo and say, listen,
you got to look at Microsoft differently. We're not who you think we are.
That's awesome.
I love that demo.
People are always blown away.
Pretty cool.
Awesome, awesome.
Thanks for having me again.
Yeah.
All right.
So, Andy, do you want to do the summarization?
Let's do it.
Yeah.
Just a quick summary, as always, of what we discussed. And thanks, Donovan, for agreeing with us that the concept that we call the unbreakable pipeline makes a lot of sense, meaning we're using monitoring data from dev all the way through the different stages into production to make better, automated, informed decisions about whether a code change is a good code change or a bad code change. You started your discussion on, let's first figure out what happens in production, right?
And then making decisions on how our code changes actually impact on the bottom line,
which is very important.
But I think we can then take it from there and then shift it left.
We discussed a little bit about monitoring as code, that concept.
I'm sure it can be extended.
But great also to hear your commitment
that you are dedicating resources from your team
to build something like this into TFS.
And yeah, I think we all agree that
the most important thing is that you have to have
trustable monitoring data.
Because only with that can you make really informed, fact-based decisions and not just anecdotal decisions.
And we can automate most of the stuff. And as you just said, we have the capabilities now, the tools, that allow us to build an end-to-end pipeline in four minutes, as you demonstrate with your meetups.
And so can we. Maybe it takes a little more than four minutes, but it should not take longer, to bake monitoring into the pipeline.
Right.
And I, thanks for, thanks for being on the show again.
Thanks, because by the time this airs, you will have been to Austria and spoken at DevOne.
There's also going to be a DevOne coming up later this year in Detroit that we are hosting.
So a little shout out there in October.
We are doing a DevOne
in our Detroit office. Maybe Donovan, we can convince you to come there as well. There's a
lot of developers in Detroit that are interested in hearing that as well. So I'll give you some
updates on that. And I'm very much looking forward to continue working with you and promoting
everything around automating the pipeline to push changes faster but more safer out to production.
So I'm happy to work with you on that and educate the community.
As am I. I appreciate it.
Awesome. Well, Donovan, thank you once again for being a guest.
You've joined the Two Timers Club, so congratulations to that.
Nice.
I forget, Andy, we have a three timer already or no,
I forget.
Have we,
have we,
yeah.
Okay.
So,
all right.
So I'll be on the show at least four times.
Yeah.
Yeah.
It's a competition.
I'm very competitive.
You can have me back as long as I'm number one.
I'll keep coming back to stay number one.
I'll be on the show at least two more times.
Every,
so every time somebody ties you,
we'll make sure we let you know,
Hey,
you know,
number one.
Yeah. That's, you know, hey, thanks again. And Andy, hey, I know this is a little bit later, but hey, two years, right? That's awesome. Two years.
Uh, yeah, two years of this podcast. And Donovan, we're so glad you're a part of it. You're a very enjoyable guest. Thank you for being...
Thank you so much. Thank you for having me, guys. And congratulations, congratulations on two years.
Thank you. And thanks everyone else. Bye-bye.