PurePerformance - 076 Shift-Left SRE: Building Self-Healing into your Cloud Delivery Pipeline
Episode Date: December 17, 2018. This episode is a recap of Andi's presentation at AWS re:Invent, where he talked about common use cases Operations teams have been auto-remediating over the years and how Site Reliability Engineering (SRE) teams now take it to the next level. The key point of Andi's message is to not only auto-remediate these and newer cloud-native use cases in production; it is about shifting left and preventing them upstream in the delivery pipeline. If you want to learn more, check out Andi's blog or watch the recorded session from re:Invent on YouTube. Also make sure to listen until the end to learn how you can mail your Christmas wishes to either Santa Claus or the Christkind!
Blog: https://www.dynatrace.com/news/blog/shift-left-sre-building-self-healing-into-your-cloud-delivery-pipeline/
Video: https://www.youtube.com/watch?v=PsI4pc0NtoI
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello, everybody, and welcome to another episode of Pure Performance.
My name is Brian Wilson, and as always, I have with me my co-host, Andy Grabner. And before I introduce Andy Grabner, I want to introduce our guest today, because we have a very, very special guest.
Ladies and gentlemen, our guest today, Andy Grabner.
Andy, welcome to the show. How are you doing today?
I'm very good. Does it mean I get paid twice? Like two times zero dollars for doing that show?
Because I'm the guest and the host. You know, I'll give you, I'll pay you 10 times for this.
Yeah, that's amazing. Right. So you're our guest today, right? Because you wrote a pretty cool
blog and you gave a really cool talk. I'm saying it's really cool, but I didn't see it yet. At AWS, right?
At re:Invent, was that?
Yeah.
About shift.
I'll just read the title
and then we can go into it, right?
Shift left SRE,
building self-healing
into your cloud delivery pipeline.
So Andy, right?
There's a couple of terms
we know in there.
Shift left.
We know about that.
Our listeners probably know about that.
We'll cover all these, of course, in there.
SRE, site reliability engineering.
That's a term that's been around.
Self-healing, which I think is a term that a lot of people might have some qualms about.
But it's a very good concept and a very important concept.
But there's something we'll talk about there.
And obviously, your cloud delivery pipeline.
There's a lot of buzzwords in today's – in this talk.
Yeah, it seems like buzzword bingo.
And I'm using the word buzzword bingo and not the other B bingo words because that might not be acceptable using that type of language.
I don't even know what the other one is.
You could tell me what the other one is after the show.
Okay.
Yeah, you're right. So basically, obviously, there was a session that I had at re:Invent. At the time of the recording, re:Invent was almost two weeks ago. An amazing conference, just too many people, at least from my perspective. I think it's more than 50,000 people.
Oh my gosh.
I know. But what I thought, I wanted to present something that was valuable for most people that are kind of moving towards the cloud, trying to figure out how they can operate systems in the cloud.
And I know that we all know that Shifting Left has been around for a while. We know that site reliability engineering has been, thanks to Google, around for a couple of years.
And self-healing is also a very sexy term.
So I thought, what can we present that kind of sums up what we see our customers do out there?
And also what we do internally at Dynatrace to make operations easier for these cloud-native applications, and also, let's say, for applications on the legacy side, or both, because in the end we do not just have the cool new stuff and not only the legacy.
I think we need to cover all.
So how can we make operations easier? And when I actually went on stage, and by the way, the blog also embeds the video of my session that was recorded thanks to AWS and put on YouTube. And I think when I started my session, I actually said, you know, I've never been in operations. I've never been a site reliability engineer, yet I'm standing here. And I'm just trying to tell you how you should do SRE and how your operations job should be better.
But I told them, and obviously, as you know as well,
we've spoken with a lot of companies.
And half the people left the room?
Half the people, yeah, exactly.
Who is this guy then?
Yeah, who is this guy then?
No, but I think the concept that I tried to explain, and this is also how I structured my presentation: we all understand that we have to figure out a way to automate manual tasks, especially in operations. If something breaks, if a disk is full, if a process crashes, if, I don't know, a network route is wrong, how can we deal with it, how can we fix this problem?
And I'm sure the problem has been solved many, many times.
And before I typically go on stage, I try to walk through the first couple of rows in my room and figure out, you know, who the people are that were actually brave enough to sit in the first couple of rows.
And then I don't remember his name, but he was sitting in the first row and one of them was a database expert
for many, many years. And he said, you know,
he has solved all of this already on the database side, like automating
faulty databases or how he can
fix problems in an automated and scripted way. And then he said
now he wants to move forward.
He wants to move to the cloud and also solve this in the cloud
and everything that happens there.
And that's when I thought, see, I mean,
a lot of people have solved problems in the past.
So now what we need to do is besides keep solving them in new environments,
we need to figure out how to prevent these problems.
And this was kind of the gist of my talk.
It's not only about fixing things in production, but preventing them in the first place.
And that's the whole shifting left.
So how can we shift left and fix the holes that we have in our pipes instead of letting
it kind of drip through into production?
And that's the little, we talked about this earlier before we started the podcast, a nice analogy that I learned from one of our customers.
And maybe you want to explain it, the analogy, or shall I?
You do.
Well, yeah, I'll take a shot.
This is Andy putting me on the spot here.
But yeah, so the idea is, when you're doing self-healing and all that, you're fixing a problem, right? Or you're not really fixing a problem. It's the same as if you have a leaky pipe and you either put your finger in the hole or wrap a towel around it, or whatever bandage you might use to fix that pipe to get you moving through. And what Andy's talking about is: don't just fix that pipe, fix the problem that caused it. Replace the pipe and whatever might have triggered that pipe to get the hole, go ahead and fix that further left in your pipeline.
So, you know, maybe you had rocks being sent through your pipe that busted a hole in there.
Well, get the rocks out of your pipeline, right?
Did I do well?
Did I pass?
Yeah, I had a second.
I think the way it was initially explained to me was also the pipe.
Then if there's a hole in there, the water kind of drips out and runs all the way to the end on the outside of the pipe.
And then it basically drops down to the floor and makes a big puddle.
One way to solve the problem is to put a bucket underneath.
And if there's more water coming through, you may want to place a bigger bucket on the floor.
But that's obviously not the way you should solve it. I mean this is the analogy for me of just creating more automation scripts in production to yet solve more problems that kind of run down the pipe.
But you're not solving the initial problem, which is, as you said, I think I like your analogy too with the rocks.
Maybe there were some rocks running through the pipes that kind of punched a hole in it.
But yeah, I think it's all about prevention.
And with the bucket idea too,
because I think this brings up a good bit of the analogy too.
If you think about it, if you're not washing those buckets,
what's going to happen?
Your water is going to overflow from those buckets.
And now not only do you have a leaky pipe,
now you have damage to your floor,
which, if you're on the second floor, is going to go into the floor below and start damaging things down there.
So it's a non-maintainable situation because eventually it's going to get away from you.
And that happens with the bucket scenario as well.
I like it. If you think about the services on top, you know, maybe they have memory leaks, and once they're overflowing, they're going to impact the underlying hosts because they're eating up all the memory, and also now the CPU because garbage collection kicks in, until it drips down even further into the data center and kind of wipes out all of the resources that you have, because all the resources are now spent on memory cleanup and all that stuff.
Maybe we should rename the show.
That's exactly what I just said.
That's amazing.
I know.
Maybe we should rename the show from Pure Performance to Pure Analogy.
Yeah.
Let's do them all as car analogies because people love car analogies.
Yeah.
But coming back to the topic itself, I know you said, and also if you read the blog, in the blog I actually started with this: I believe, as exciting as the name self-healing is, the term itself is misleading. You're not really healing the system, unless you're driving it that far that you can really detect bad code and automatically fix bad code in a fully automated way, then it's maybe self-healing. And with code, I mean not only code is code, but also configuration is code and all that. So maybe that is possible, but in most cases I believe we talk about smart or auto-remediation.
And just to make a point there, because I think it's an important point, the difference between remediation and healing means if you're doing healing, that problem's not going to happen again because you're fixing that root cause of it, whatever it might be.
Remediation means you're making the problem go away for now, right?
And it might be for now for a while, might be for now for an hour, but it's a remediation. It's a fix. It's not a cure. And I think that's a really good point to make with those terms because again, yeah, who really has self-healing, you know, maybe in a few
years when we have, you know, AI extended into everything, you might be in that kind of a
situation. But, uh, I think, um, the remediation is a much more realistic and viable kind of a, you know, you can talk about self-remediation or not.
So auto-remediation and people can be like, yeah, I can get with that.
Yo, I can get with that, yo.
Yo, yo.
Well, and yo, yo, I have to say one more thing on this: that for certain problems, auto-remediation without any follow-up action
might be good enough,
especially if you think about people or organizations
that are now looking at using canary releases,
feature flagging, and things like this
where you want to get features out as fast as possible,
even if they may not be perfect.
But if you remember back,
Goranka, what she told us with Facebook,
where she said most of the features that Facebook ever releases are never really successful. Therefore, they take them offline
if these features don't hit the bar, meaning that many active users over a certain period of time after the feature was first released.
in a way that it's not noticeable to the end user,
but there's no long-term healing action necessary
because eventually some of these features are not there for that long anyway.
So I just wanted to point this thought out there as well.
One other thought on that, though, I want to say is if people are thinking, well, I don't know if we're going to be able to get into the auto remediation or anything like that.
If you're running a cloud platform, you're probably already doing something like this.
And I'm kind of stretching it a tiny bit here.
But just think about auto-scaling. Auto-scaling on its own is auto-remediation. Now, it's not because you necessarily have a problem. Your problem is: we know what the capacity of our system is, we know where our traffic is going to go, and we know that when we reach that capacity level, we need another server to handle the traffic, and then we scale it back down. But that in itself is auto-remediation. You're remediating for traffic.
So you already have it,
extending it for unexpected problems
or predictable but unexpected problems
is just taking the next step.
But it's all part of the same concept in a lot of ways.
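As a rough illustration of that point (a sketch, not something from Andy's talk): a target-tracking scaling policy is really just a pre-packaged remediation rule, "when average CPU crosses a target, add capacity." The snippet below uses boto3 and assumes AWS credentials are configured and an Auto Scaling group named "my-asg" already exists; both names are hypothetical.

```python
# Sketch: auto scaling as a pre-built auto-remediation rule.
# Assumes boto3 credentials and an existing Auto Scaling group "my-asg"
# (both hypothetical).
import boto3

autoscaling = boto3.client("autoscaling")

# "Remediate" high CPU by adding instances, and scale back in when it drops.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,  # keep average CPU around 60%
    },
)
```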
Yeah, exactly.
Yeah, and so coming back to what I tried to explain,
folks in the session,
so there's a lot of known use cases already,
like what you just said, right?
You're running out of resources, you scale up.
We have auto scaling groups for that.
But there's a lot of things that we also know
that go a little bit beyond that.
So for instance, the classical disk full problem, right?
The classical disk full problem can be solved by jobs
that clean, let's say, log file directories.
If you have a logging strategy, then you alert on, let's say, a certain percentage of consumed disk space of logs.
And then you are either archiving them somewhere else or just removing them.
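As a minimal sketch of what such a cleanup job might look like (hypothetical paths and thresholds, not the exact scripts from the talk): check how full the log volume is, then archive the oldest log files until usage drops back below the threshold.

```python
# Sketch of a classic "disk full" remediation: when the log volume passes a
# threshold, archive the oldest log files. Paths and thresholds are hypothetical.
import shutil
from pathlib import Path

LOG_DIR = Path("/var/log/myapp")         # hypothetical log directory
ARCHIVE_DIR = Path("/mnt/archive/logs")  # hypothetical archive target
THRESHOLD = 0.80                         # act when the volume is 80% full

def disk_usage_ratio(path: Path) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def cleanup() -> None:
    if disk_usage_ratio(LOG_DIR) < THRESHOLD:
        return
    # Oldest files first, so the most recent logs are kept around.
    logs = sorted(LOG_DIR.glob("*.log*"), key=lambda p: p.stat().st_mtime)
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    for log_file in logs:
        shutil.move(str(log_file), ARCHIVE_DIR / log_file.name)
        if disk_usage_ratio(LOG_DIR) < THRESHOLD:
            break

if __name__ == "__main__":
    cleanup()
```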
And that's obviously one traditional approach. But what I brought up here is,
well, this works if you don't have any changes
in the way the application actually logs things.
So what we see constantly,
and again, I was kind of confirmed by the folks
that were sitting in the room
because a lot of people were nodding
and then also somebody raised their hand and said,
yes, this is exactly what I've seen.
The classical problem of
somebody makes a configuration or a code
change and now turns on
verbose logging or
brings in a new framework
that has a default logging strategy
and the default is not the
thing you want to have in production and
nobody cared in pre-prod. So now we're logging that many more log messages that nobody really cares about. And if something like this goes into production and you just apply, let's say, the default strategy that you used to have over the years of just cleaning up log files if they are filling up the disk, then it's just like adding a bigger bucket under the leaking pipe, because the root cause is obviously somewhere completely different. Especially if the cycle time of how often these cleanups happen is now speeding up, if you fill up your disk faster and faster because you have a more verbose logging strategy that maybe was not even intended, then this is something you need to really address.
And this is where auto-remediation alone doesn't really help you if all the remediation is just the default of cleaning up log files.
You really need to then follow up with the engineers and say, hey, look at this.
We used to clean up the log directories once a week.
Now we have to do it once a day because we are,
because you guys are just logging so much more
and then you want to probably talk with the developers
and say, you know, what of this information
that you actually log is actually useful?
This is stuff that I always bring up and say, hey,
you know, if you
are a performance engineering or if you're in operations, from time to time, take the log files,
walk them over, bring them over to your engineers, and then ask them what of this information here
is actually useful in case of a problem or what information is necessary for our logging
strategy and for our analytics
and everything that they cannot point out as being necessary, ask the question, so why
is it in there?
And this is my point.
So there's a lot of use cases.
And I brought logging, I brought exception handling, I brought database connection handling.
I brought a lot of these examples on where I believe we need to level up in terms of what our auto remediation strategy is.
So what strikes me as really funny is, I'm going back to seven or eight years now, when I was doing exactly that, going to the developers with the logs and saying, you know, whatever you don't need in here, why don't you remove it?
And getting an earful of why they're not going to remove it because of time and other projects.
I think things are probably a bit better today.
Obviously, there's some of those situations still in existence.
But with the smaller, speedier teams today, it definitely could be addressed, at least a bit.
But what I wanted to bring into here,
you brought up a really interesting point with this idea of going back to the development team with the logs and, and, you know, from the engineers,
from the operations team to go back and make that connection,
which is where that shift left part of your topic comes in.
I think a large part of it.
I would also then extend that to not just, you know,
not just having the SRE team contact the developers and say,
hey, make this fix, but then also extend that to the monitoring team throughout the pipeline to say,
we want you to also monitor the size of the log because it used to grow, you know, by 10x under
certain load in production. Let's take a look at what that grows under. Let's say you're going to
do a load test, right? What's the load test growth of it? Now that we know what it is, keep an eye
on that, make that one of your metrics. Maybe you can add that to there so that if you do see a
change in the size of how it grows during your testing, you know, there's a change to that.
And you can maybe stop that or figure out, you know, if this is necessary, if something else
can get cut, but you could stop that early in the pipeline, part of the whole shift left of it,
so that, again, you don't introduce that in there.
Not just checking the developers, because, right,
there's always going to be human error.
Adding multiple checks through it into your pipe,
and you could probably even automate that, I'm sure,
checking the size of it during a test.
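A rough sketch of what such an automated check could look like in a pipeline stage (hypothetical file names and thresholds): measure how many bytes of log the application writes per request during the load test and fail the build if that rate jumps compared to the previous build's baseline.

```python
# Sketch of a shift-left quality gate: compare log bytes written per request
# in this build's load test against the previous build's baseline.
# The baseline file and the 1.5x tolerance are hypothetical choices.
import json
import sys
from pathlib import Path

BASELINE_FILE = Path("log_growth_baseline.json")
TOLERANCE = 1.5  # fail if we log 50% more per request than the last build

def bytes_per_request(log_bytes_written: int, requests_served: int) -> float:
    return log_bytes_written / max(requests_served, 1)

def check(log_bytes_written: int, requests_served: int) -> None:
    current = bytes_per_request(log_bytes_written, requests_served)
    if BASELINE_FILE.exists():
        baseline = json.loads(BASELINE_FILE.read_text())["bytes_per_request"]
        if current > baseline * TOLERANCE:
            print(f"FAIL: {current:.1f} bytes/request vs baseline {baseline:.1f}")
            sys.exit(1)  # break the build before verbose logging reaches prod
    # Pass: store the new value as the baseline for the next build.
    BASELINE_FILE.write_text(json.dumps({"bytes_per_request": current}))
    print(f"OK: {current:.1f} bytes/request")

if __name__ == "__main__":
    # In a real pipeline these numbers would come from the load test run.
    check(log_bytes_written=250_000_000, requests_served=1_000_000)
```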
Yeah, exactly.
And I think, Brian, you just hit the, what's it called,
the nail on the head here.
This is exactly the, I believe, the change
that has to happen
to those folks
that used to work in
what I now call
traditional operations,
where it was about,
you know,
keeping the infrastructure running,
the cleanup tasks,
provisioning new hardware.
I think this is all stuff now
that, as you said earlier,
it comes with the auto scaling groups. There are built-in things in most of the PaaS environments where these things are completely automated now. But I think what's not been automated yet is what SRE tries to solve. Which means, I mean, again, I haven't been in the industry that long and I've never worked in operations, but probably if I had spent my last 20 years in operations, then I would want to become a site reliability engineer, which means I take all of my know-how that I've built up over the last 20 years, because that's amazing know-how that you have about problems that can go wrong on an infrastructure level, and most of these concepts are obviously still valid for the cloud, because in the end the cloud is just another man's or another woman's hardware that I rent, right? So I will take this knowledge and try to figure out how I can become a mentor to the folks that are actually pushing code changes through the pipeline.
Because if I don't do this, if I don't fix these problems earlier on, guess what?
Companies will eventually figure out that traditional operations is no longer needed
because AWS, Azure, or Google, they're taking care of provisioning the right hardware and
the right resources, making sure enough disk space is there. But if you now allow, let's say, this verbose logging to kind of creep into your builds, then the cloud providers and the PaaS providers of the world will solve this problem for you by just automatically throwing more hardware at it, and you will realize it at the end of the month when you get presented with the bill. And I think this is why it's so valuable now to have the traditional operations teams level up to become site reliability engineers, meaning thinking about automation not only on an infrastructure level but on the next level up, which is the services and the applications, and also shifting left in making sure that these problems are detected and then prevented earlier in the pipeline. And I think this is an amazing opportunity
for all these teams now.
Yeah, I think it goes back to the point you made about the bill at the end of the month, right? It comes back to discussions I used to hear in my former jobs, where people would come in trying to sell us on ideas or tools or whatever to help improve processes. And sometimes the operations person would say, well, why don't I just throw more metal at the problem? You know, it's like, well, because of the expense. And back in the old days, you had a very clear idea, because you would buy server per server, right? Or disk per disk, you were purchasing it. It was coming directly out of your budget.
And I don't say this to disparage the cloud providers. I don't think they're doing anything sneaky.
They're solving problems by just allocating this stuff for you.
But if you're not paying attention,
those things are going to just rack up.
It's fake money, right?
In a way, you don't realize it until it's too late.
So there is definitely, definitely a need to be looking at that
before you just run the cost through the roof. And that, again, I believe goes back to part of what Goranka was saying, right? A bit of, you know, is this going to be worth running, is this feature worth running in production, from the cost, from everything else, the maintenance. Anyhow, sorry.
No, that's perfect. Yeah.
And so, a couple of metrics that I think I mentioned in the blog post and also in the talk. And I actually like your explanation better, when you said kind of the ratio of the logs being written under a certain throughput, right? That's one thing. Then another metric that I brought up is the number of exceptions being thrown. If you remember, Brian, we talked a lot about the cost of an exception.
And with exceptions, we mean application exceptions that may never see the light of a log file, but they're just handled within frameworks.
But they consume memory.
If they consume memory, it means they also have to be garbage collected.
So it's memory and CPU.
So I encourage them to look at the number of exceptions being thrown.
I also encourage them to look at CPU cycles consumed.
This is a metric that obviously we in Dynatrace show you on an endpoint-by-endpoint basis of CPU cycles
and how that changes from build to build.
Because somebody may add a new library or tries a new algorithm
to do certain things, and maybe now it takes
more CPU cycles, and this is something
we need to capture early on.
So this is the whole shifting left,
so making sure we can detect these problems early on.
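To make that concrete, here is a rough sketch (not the Dynatrace API, just an illustration with hypothetical data) of a build-to-build comparison for those two kinds of metrics, exceptions thrown and CPU time per endpoint:

```python
# Sketch: compare per-endpoint exception counts and CPU time between two
# builds and flag regressions. The input dicts are hypothetical; in practice
# they would come from your monitoring tool.
ALLOWED_INCREASE = 1.2  # flag anything more than 20% worse than the last build

def find_regressions(previous: dict, current: dict) -> list[str]:
    problems = []
    for endpoint, metrics in current.items():
        base = previous.get(endpoint)
        if base is None:
            continue  # new endpoint, nothing to compare against yet
        for metric in ("exceptions", "cpu_ms"):
            if metrics[metric] > base[metric] * ALLOWED_INCREASE:
                problems.append(
                    f"{endpoint}: {metric} {base[metric]} -> {metrics[metric]}"
                )
    return problems

previous_build = {"/checkout": {"exceptions": 120, "cpu_ms": 35.0}}
current_build = {"/checkout": {"exceptions": 480, "cpu_ms": 36.0}}

for line in find_regressions(previous_build, current_build):
    print("REGRESSION:", line)
```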
And also kind of start, as I mentioned earlier,
I think the word mentor is a great term. Mentoring
the
engineers that don't have 20, 30
years of experience in operating
large-scale environments.
I remember when I came out of
high school, I mean, I
didn't really care much about
CPU cycles or memory
or disk because
I was just, seriously, man, it was like,
who cared about that when you got out of high school? Come on.
I know, nobody cares about that, right? I mean, you just want to write cool code. And therefore I think it's great, and I think everybody should feel passionate about educating and mentoring the next generation of software engineers by telling them what they've learned over the years.
And so this is one thing.
The other thing, though, and this is also what I mentioned, it's not just about saying I need to look at these metrics.
But the challenge is how can we actually simulate that particular behavior?
How can we simulate the, let's say, similar load in
pre-prod? And how can we simulate similar problems that could potentially happen in
production? And this is where I then talked about things like production twin testing,
where we can take the traffic from production and either through modern frameworks like Istio, for instance, to mirror
traffic into another environment, there will be one option.
Another option would be, and that's what we've been doing with some of the load testing
providers like Neotys.
I'm working with Henrik from Neotys right now to extract production workload information and then create a workload definition that is very close to production,
so to be able to always simulate production kind of equivalent load
in a pre-prod environment.
You mean like the load model?
The load model, exactly, the workload model, yeah.
Because the monitoring tool, they have the data. And so the idea is just, you know, extracting that workload model over, let's say, a 24-hour period from production and then taking this and applying it to your load testing tool. So the idea, and I think the idea initially came from, definitely, Mark Tomlinson, right? He said he wants to get to a stage where on Tuesday,
he can simulate the load in pre-prod from Monday, right?
He basically looks back, I think at midnight,
he was querying the data from the production monitoring tool
and then generating the workload configuration for his load test the next day.
So he can always validate it.
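A very small sketch of that idea (hypothetical data and endpoint names, nothing specific to any monitoring or load testing product): take yesterday's per-endpoint request counts from production and turn them into a relative workload model a load test can replay.

```python
# Sketch: turn yesterday's production traffic into a workload model for today's
# load test. The traffic data and endpoint names are hypothetical; they would
# really come from your production monitoring tool.
import json

# Per-endpoint request counts over a 24-hour production window.
production_traffic = {
    "/search": 1_800_000,
    "/product": 950_000,
    "/checkout": 110_000,
}

def build_workload_model(traffic: dict, target_requests_per_hour: int) -> dict:
    total = sum(traffic.values())
    return {
        endpoint: {
            "share": count / total,  # relative weight of this endpoint
            "requests_per_hour": round(target_requests_per_hour * count / total),
        }
        for endpoint, count in traffic.items()
    }

# Scale the production mix down to what the pre-prod environment can handle.
model = build_workload_model(production_traffic, target_requests_per_hour=50_000)
print(json.dumps(model, indent=2))
```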
I'm going to take that idea away from Mark.
That did not come from him.
OK.
And in fact, I'll take credit for it.
Only because I remember way back at the first Perform that I went to, which was in Waltham, it was maybe 2011 or so, and it might have been 2010, because it was before I started with Dynatrace. I saw Bernd, and I said, you used to work on load generating tools and all that. Can you make one that takes the traffic from production and recreates it in the tool? And his response was, well, I don't do those tools anymore, which is a good response. But yeah, no, I mean, not to take away from Mark, but that's always been, I think, the holy grail for load testing, to do that.
But I think bringing up Mark brings in another idea here.
Besides those models and you're testing the production traffic, I really love the idea of the continuous performance environment, right?
Where not only are you testing those models, everything else, but because it's continuously under load, as you roll out,
you're also testing your deployment strategy.
So, just sidetracking there: recreate as much of that
in your pre-production and your testing environment as you can,
because that is a perfect way to test if your rollout is going to be successful
because you never know when you're going to be rolling out under full traffic or not, right,
with an emergency and everything else. Exactly, yeah.
And so the way I always explain it,
I said, you know, if you really have
a production twin, I think this is the
term that Alois
coined, Alois Reitbauer, he said
if you can do production
twin testing, that means you have a production-like
testing environment, then
with every build
that you get to the pipeline, you can basically validate, will this build survive production
if you would decide to deploy it in production today?
And I think that's the most critical thing to answer.
And now where kind of my talk then continued was, so we know what we want to fix in production.
We know we need to simulate similar loads in pre-prod like it is in production.
Now, if we have remediation scripts in production, we need to validate them obviously in pre-prod
because I don't want to validate a remediation script the first time in production. That would be really brave or stupid.
But I think you want to validate it early on, right?
So you want to simulate these scenarios.
Yeah.
And so, for instance, you know, we can just take the Chaos Monkey test or Chaos Monkey frameworks that I'm sure people are familiar with, and if not, just Google for Chaos Monkey,
where the idea is that you have automation scripts
that are going to simulate bad things happening.
For instance, filling up your disk or reconfiguring your route.
So something that could potentially also go wrong
in production, having a denial of service attack that is coming in. Things like that,
simulating real world behavior. And then the reason why we want to do this, obviously,
in pre-prod, in an environment that has production-like load, is figuring out
if the auto-remediation scripts, first of all, get triggered correctly. Or first of all, if your monitoring is actually detecting that there is a problem fast enough, that your monitoring tool is then triggering the correct auto-remediation functions and that these auto-remediation functions actually auto-remediate the problem and also then trigger the right alert or notifications or follow-up actions.
So for instance, if the auto-remediation does not solve the problem within a certain amount of time, escalate it up to the next level.
So with this, you can also obviously test your incident response.
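A bare-bones sketch of that kind of check (everything here, the fill-disk command, the health probe URL, and the time budget, is hypothetical): inject the fault in the production-like environment, verify that remediation brings the system back within a deadline, and escalate if it does not.

```python
# Sketch of a chaos-style test for an auto-remediation script: inject a fault
# (here, filling up a disk), then verify the problem is detected and remediated
# within a time budget, otherwise escalate. Commands and URLs are hypothetical.
import subprocess
import time
import urllib.request

REMEDIATION_DEADLINE_SECONDS = 300

def inject_disk_pressure() -> None:
    # Write a large junk file onto the log volume to trigger the "disk full" case.
    subprocess.run(
        ["fallocate", "-l", "5G", "/var/log/myapp/chaos_filler.bin"], check=True
    )

def system_is_healthy() -> bool:
    try:
        with urllib.request.urlopen("http://pre-prod.example.com/health", timeout=5) as r:
            return r.status == 200
    except OSError:
        return False

def escalate(message: str) -> None:
    print("ESCALATE to on-call:", message)  # page or ticket in a real setup

def run_chaos_check() -> None:
    inject_disk_pressure()
    deadline = time.time() + REMEDIATION_DEADLINE_SECONDS
    while time.time() < deadline:
        if system_is_healthy():
            print("Auto-remediation worked within the deadline.")
            return
        time.sleep(15)
    escalate("disk-full remediation did not recover the system in time")

if __name__ == "__main__":
    run_chaos_check()
```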
If it does solve the problem or remediate the problem, maybe then, depending on the root cause, create a ticket for engineering to follow up afterwards, right? Because this is what we said earlier, it's auto-remediation versus self-healing. We know we have to look into this later on, because why did this all of a sudden come in? So these are all the things I think we need to take care of.
So you're no longer just testing your application, you're also testing your monitoring and you're testing your self-remediation.
Yeah, of course, because everything, I mean,
this all belongs together. And this brings me to kind of the conclusion of my talk. My conclusion
was when I came out of high school, I was a developer and I wrote code. Somebody else took
care of testing, somebody else provisioned the infrastructure and production, and somebody
else operated the whole thing.
Then 15 years ago, I would say, is when agile development came in, where we were encouraged
to also write our own tests.
And then DevOps came in, and with all the automation tools around provisioning hardware
and configuring our PaaS environments, like nowadays, you know, with Kubernetes, I have all of my configuration files
and how my apps and microservices should be deployed.
So now all of a sudden I came from just writing code
to also writing tests as code
and also infrastructure as code.
And I believe the next step is to sit down and figure out,
hey, what are important metrics that I want?
So monitoring is code.
And the last step would then be auto-remediation is code.
So I want to also develop scripts with the help, obviously,
from people that have a lot of knowledge on that.
That's, again, where the mentoring comes in.
I want to also write scripts as part of my engineering process
that live in my source code repo and that will be executed in case something happens unexpectedly.
I think this is it.
And then we have everything as code.
Code is code.
Test is code.
Infrastructure is code.
Monitoring is code.
And auto remediation is code.
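As a toy illustration of "auto-remediation is code" (a sketch under assumed names, not a real framework): a small mapping from problem types to remediation functions that lives in the application's repo next to the code, tests, and infrastructure definitions, so it is versioned and validated with every change.

```python
# Sketch of "auto-remediation is code": remediation actions versioned in the
# same repo as the application, keyed by the kind of problem the monitoring
# tool reports. Problem names and actions are hypothetical.
from typing import Callable

REMEDIATIONS: dict[str, Callable[[dict], None]] = {}

def remediation(problem_type: str):
    """Register a function as the remediation for a given problem type."""
    def register(func: Callable[[dict], None]) -> Callable[[dict], None]:
        REMEDIATIONS[problem_type] = func
        return func
    return register

@remediation("disk_full")
def clean_log_directory(event: dict) -> None:
    print(f"cleaning logs on host {event['host']}")

@remediation("process_crashed")
def restart_process(event: dict) -> None:
    print(f"restarting {event['process']} on {event['host']}")

def handle_problem(event: dict) -> None:
    action = REMEDIATIONS.get(event["type"])
    if action is None:
        print("no remediation defined, escalating:", event)
        return
    action(event)

# Example: this is what a monitoring webhook might hand us.
handle_problem({"type": "disk_full", "host": "web-42"})
```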
And if we track this with every single change through the pipeline,
then I believe we should be very close to what we now call
the unbreakable delivery pipeline,
which means we should not be able to deploy something
from development all the way into production
that can then break where it really matters, which
is the end-use experience or your business-critical systems, because we can either prevent them
early on, or we have remediation scripts in place that can do as much as possible in
a fully automated way to bring the system back to a healthy state, to a reliable state,
hence site reliability engineering.
Yeah, and I would add to that, well, not add, but my addendum to that would then be for everybody thinking, well, I'm going to, you know, automate myself out of a job. It's like, no, listen to everything that you just said. Someone has to create all the scripts. Someone has to maintain all the scripts. Someone has to test all that stuff. There is a whole new set of opportunity to do something that feels familiar but is quite new and different. And it's not like it's a one-time deal; you get that pipeline all set up, that's got to be maintained, right? So this is not the case, you know, until we can figure out a way to automate the automation, right? And in that case, a lot more of us will be out of a job and there'll be other things to worry about, like Skynet and all.
for the last year or so now, again, this going back to your old, that blog that you did, um,
I think it was, uh, based off of a Wilson Mars speech about, you know, the, the, the future of,
you know, performance engineers and load testers. one of them was you know take take a daily task and automate it
right so i started with that that's actually we were talking about ansible before i started with
with the ansible one um and i've just been jumping in every time i've got to do something um you know
with perform i got to do all the stuff with these servers i'm like oh let me figure out cloud
formation figuring out it's easy stuff especially if, especially if you don't have that developer background.
If you're not a full hardcore, you know, like you came out of high school or college or whatever going into coding, I was going to go make movies, right?
So I didn't have that developer background.
The extent of my coding in the old days was writing in C for the LoadRunner scripts, writing maybe a little loop and making sure I did the malloc and free and all those annoying bits. But very, very little.
The nice thing about automating these things is you're not writing hardcore code. These are small,
discrete things that are pretty easy to tackle. So it's not difficult. It's not a humongous entry
point. And it just opens a whole new world of what's going to be coming next. Um, because this is going to be coming, uh, you're going to
start seeing this stuff more and more. You know, I think a lot of, I think the slow, the slow part
of the adoption is going to be getting people to buy into trust in the pipeline that will fully
automate itself. I was recently doing a talk, one of your pipeline talks, but I was doing the CloudFormation one. And obviously the shift right of all the metadata to your tools, that part, you
know, everyone can very easily buy into. But then when you start saying, okay, now take the data
and automatically remediate based on, you know, thresholds and alerts that, you know, always makes
people a little bit uneasy because they're like, well, I want to make sure it's like, well,
over time, once it proves itself out, people get comfortable with it. So that's why I'm saying,
yes, it will come. You know, it's going to take people a while to get used to and feel comfortable
with, but we're headed there. And I think this is some great, great advice, um, that you're putting
out there, Andy, on how to get started, how to look at it, uh, things to consider, uh, cause there is a lot to consider with it. Um, but as, uh, Donovan Brown would say, start with one piece, you know,
maybe spend 15 minutes a day, uh, working on a piece of it. And we, we've heard it from all the
people who've done the successful CICD, uh, conversion, you know, um, all the people have
gone through the transformation. Most, you know, in almost all cases, people are not saying,
I'm going to redo everything all at once, right?
You start small and start building out
and adding and adding and adding
until you migrate over, right?
So don't be overwhelmed as well.
It's not an all at once kind of a situation.
I think that almost qualifies for a summary.
Wow, was I the summarator?
I think you were the summarator and didn't even know it.
Yeah, I was wondering if I was going to have to call on you to summarize your own thing.
No, I think you did a brilliant job.
I think the only thing we want to – and this is probably going into the description of the session anyway, the links to the blog and the video.
Yeah, if you go to the Dynatrace blog, I just had it up, where to go. Oh, come on. There we go.
If you go to the Dynatrace blog right now, it is up on the top, right? It's not going to be up there forever, but by the time this airs, there's not going to be too many new blogs on top of it. So just look for the Shift-Left SRE one.
And you'll see a picture of Andy wearing his Dynatrace shirt. Um, looks like he's holding onto something. He's got his hand resting on an invisible podium in this picture. Um,
But it's up there. It's got the video link in it, so check it all out. Also, we want to remind people, right, speaking about AWS and all that, we have Perform 2019 in Vegas coming up in January, right? Andy and I will be there, as well as a lot of other people. So, Andy, for people who don't know,
how would you summarize what Perform is? What is Perform?
I think Perform is a great way to network with the people that are in a similar situation as you are, meaning you need to change the way you've done things in the past in your company. That's why the company sends you to conferences like Perform: to figure out how we can actually leverage this new technology that we all build, right? In our case, obviously, it's monitoring, full-stack monitoring with our AI capabilities, and how can this, you know, in a positive way impact our lives and how can it obviously support our businesses. So what I like about Perform is the first day is the hands-on training day, where people can choose, obviously, a morning and an afternoon session, it's four hours each, on a variety of topics.
And one of those they should choose should be the Dynatrace for AppMon users one, which is what I'll be teaching.
Oh, cool.
Yeah, perfect. Yeah, there's obviously a lot of AppMon users out there still that want to know what the new world is going to look like with our third-generation management platform.
I'm doing the continuous performance with Jenkins using the cool Jenkins performance signature plugin from our friends from T-Systems.
That's going to cover a lot of what we're doing today.
Not necessarily everything, but a lot of what we talked about today is going to be in that, which is awesome.
And then the other thing that I like is the breakouts, right? I am track captain of the DevOps NoOps track, and to the topic of today, we have a couple of sessions around self-healing and SRE. I have one session, I mean many sessions, I'm looking forward to, but the SRE session is going to be a guy from McGraw-Hill Education on how to build an SRE team. Then we also have Experian talking about self-healing at Experian. So there's a lot of great sessions out there. Nestor is back from Citrix talking about how to level up operations, virtual operations, automation. So there's a lot of cool breakouts and you can just hear from your peers on what they've been doing, what they're thinking of, which problems they ran into, and how they solved them, right? I think that's the great thing about Perform.
Yeah, I went to one of them, really. I don't think it was called Perform at the time, but I was talking about it earlier in the show, when I went to that one in the early, early days of Dynatrace. And one of the things I got out of it, even way back then, was just
finding other people who are using some of these tool sets and saying, Hey, what did you guys do?
How did you guys get through this problem? And just hearing a bunch of different ideas of how
to tackle things, not just from the breakouts, not just from the keynote speeches, but just the,
your person to person with someone. And you could just share ideas and collaborate. And maybe the
two of you walk away with something new you can go back and try.
So it's really awesome.
Myself and Mark Tomlinson will be doing podcasting from there
for Pure Performance.
So if you want to tell us a little short story,
you can always come up to us.
It's going to be at the, which hotel is that at?
The Cosmopolitan.
Yes, the Cosmopolitan in Vegas, January 28th through the 30th.
So go ahead, sign up, and get a HOT day session in there.
Exactly. Cool. All right, well, Andy, thank you for being a guest on the show today. I don't know, I guess you count, I guess you and I both have the most repeated appearances on the show, but I've never been the quote-unquote guest, so I never have anything to say so much except for making dumb jokes.
No, you have stuff to say, right?
You've just done a great session in Denver at the user meetup.
Yeah.
And then you are obviously our expert when it comes to hybrid monitoring.
So maybe you want to do a session on that at some point.
Maybe.
Talking about.
Hey, look at that.
Hey. What? Maybe it'll be me. Yeah, awesome. No, I mean, obviously, yes, I do know quite a lot about our hybrid setup, but yeah, maybe we can do that in the future. And for anybody else who has
ideas, if you have ideas or want to be a guest on the show, or if there are certain topics you would like us to explore, make sure you let us know. Go ahead and tweet it to us at pure underscore DT, or you can be old fashioned and send an email to pure underscore, I mean, pureperformance at dynatrace.com.
If you want to send a handwritten letter,
you can address it to Santa Claus at the North Pole.
And then we'll find out from Santa.
What's with the Christkind?
We actually, so in Austria we have the Christkind, which is a different Kris Kringle kind of thing.
Christkindl, exactly. And we actually have an official mailing address for the Christkindl.
Oh wow, so it's not just the North Pole.
No, it's not just the North Pole. It's in a town in Austria that is actually called Christkindl, actually close to Linz.
So you can actually mail stuff in, and it seems that they're really looking at every letter.
So it would actually be really fun.
If you go to that address, is there an actual building?
Yeah, it's the Christkindl post office.
Oh, it goes to the post office.
Yeah, Christkindl post office in Steyr. And if you search for Christkindl post office, you will find it: Christkindlweg 6.
That's actually funny.
That's awesome.
I've never mailed anything there,
but it's a real thing.
Yeah, it's amazing the lengths we go to to deceive our children.
Yeah. And everybody gets a reply. It says here, whether you are domestic or international, you'll get a reply.
Oh, really? Well, that's awesome. Yeah.
Yeah. All right, well, learn something new. Yes, always, every day. All right, well, thanks, everybody, for listening.
Andy, thanks again for being my partner with this for quite some time now.
This is Episode 76, so we're not too far off from 100.
Looking forward to reaching that milestone.
And thanks again for everybody listening who helps make this possible.
See you all next time.
Thank you.
Bye.
Bye. Bye.