PurePerformance - 076 Shift-Left SRE: Building Self-Healing into your Cloud Delivery Pipeline

Episode Date: December 17, 2018

This episode is a recap of Andi's presentation at AWS re:Invent, where he talked about common use cases Operations teams have been auto-remediating over the years and how Site Reliability Engineering (SRE) teams now take them to the next level. The key point of Andi's message is to not only auto-remediate these and newer cloud-native use cases in production, but to shift left and prevent them upstream in the delivery pipeline. If you want to learn more, check out Andi's blog or watch the recorded session from re:Invent on YouTube. Also make sure to listen until the end to learn how you can mail your Christmas wishes to either Santa Claus or the Christkind!

Blog: https://www.dynatrace.com/news/blog/shift-left-sre-building-self-healing-into-your-cloud-delivery-pipeline/
Video: https://www.youtube.com/watch?v=PsI4pc0NtoI

Transcript
Starting point is 00:00:00 It's time for Pure Performance. Get your stopwatches ready. It's time for Pure Performance with Andy Grabner and Brian Wilson. Hello, everybody, and welcome to another episode of Pure Performance. My name is Brian Wilson, and as always, I have with me my co-host, Andy Grabner. And before I introduce Andy Grabner, I want to introduce our guest today, because we have a very, very special guest. Ladies and gentlemen, our guest today, Andy Grabner. Andy, welcome to the show. How are you doing today? I'm very good. Does it mean I get paid twice? Like two times zero dollars for doing that show?
Starting point is 00:00:49 Because I'm the guest and the host. You know, I'll give you, I'll pay you 10 times for this. Yeah, that's amazing. Right. So you're our guest today, right? Because you wrote a pretty cool blog and you gave a really cool talk. I'm saying it's really cool, but I didn't see it yet. At AWS, right? At re:Invent, was that? Yeah. About shift. I'll just read the title and then we can go into it, right?
Starting point is 00:01:13 Shift left SRE, building self-healing into your cloud delivery pipeline. So Andy, right? There's a couple of terms we know in there. Shift left. We know about that.
Starting point is 00:01:23 Our listeners probably know about that. We'll cover all these, of course, in there. SRE, site reliability engineering. That's a term that's been around. Self-healing, which I think is a term that a lot of people might have some qualms about. But it's a very good concept and a very important concept. But there's something we'll talk about there. And obviously, your cloud delivery pipeline.
Starting point is 00:01:43 There's a lot of buzzwords in this talk. Yeah, it seems like buzzword bingo. And I'm using the word buzzword bingo and not the other B bingo word, because that might not be acceptable language. I don't even know what the other one is. You can tell me what the other one is after the show. Okay. Yeah, you're right. So basically, obviously, there was a session that I had at re:Invent. At the time of this recording, re:Invent was almost two weeks ago. An amazing
Starting point is 00:02:14 conference. Just too many people, at least from my perspective. I think it's more than 50,000 people. Oh my gosh. I know. But what I thought, I wanted to present something that was valuable for most people that are kind of moving towards the cloud, trying to figure out how they can operate systems in the cloud. And we all know that shifting left has been around for a while. We know that site reliability engineering has been around, thanks to Google, for a couple of years. And self-healing is also a very sexy term. So I thought, what can we present that kind of sums up what we see our customers do out there, and also what we do internally at Dynatrace, to make operations easier for these cloud-native applications, and also, let's say, applications on the legacy side, or both, because in the end we do not just have the cool new stuff and not only the legacy. I think we need to cover all of it. So how can we make operations easier? And when I actually went on stage, and by the way, the blog also embeds the video
Starting point is 00:03:27 that was recorded thanks to AWS and put on YouTube to my session. And when I think when I started my session, I actually said, you know, I've never been in operations. I've never been a site reliability engineer, yet I'm standing here. And I'm just trying to tell you how you should do SRE and how your operations job should be better. But I told him, obviously, you as well, we've spoken with a lot of companies. And half the people left the room? Half the people, yeah, exactly. Who is this guy then?
Starting point is 00:04:01 Yeah, who is this guy then? No, but I think that the concept that i try to explain and this is also how i structured my presentation we all understand that we have to figure out a way how to automate manual tasks especially in operations if something breaks if a disk is full if a process crashes um if a um i don't know a network route is wrong. How can we deal, how can we fix this problem? And I'm sure the problem has been solved many, many times. And when I was, before I typically go on stage, I'll try to walk through the first couple of rows in my room and then try to figure out, you know, who are people that actually were brave enough to sit in the first couple of rows. And then I don't remember his name, but he was sitting in the first row and one of them was a database expert
Starting point is 00:04:48 for many, many years. And he said, you know, he has solved all of this already on the database side, like automating faulty databases or how he can fix problems in an automated and scripted way. And then he said now he wants to move forward. He wants to move to the cloud and also save the cloud and everything that happens there. And that's when I thought, see, I mean,
Starting point is 00:05:13 a lot of people have solved problems in the past. So now what we need to do is besides keep solving them in new environments, we need to figure out how to prevent these problems. And this was kind of the gist of my talk. It's not only about fixing things in production, but preventing them in the first place. And that's the whole shifting left. So how can we shift left and fix the holes that we have in our pipes instead of letting it kind of drip through into production?
Starting point is 00:05:39 And that's the little, we talked about this earlier before we started the podcast, a nice analogy that I learned from one of our customers. And maybe you want to explain it, the analogy, or shall I? You do. Well, yeah, I'll take a shot. This is Andy putting me on the spot here. But yeah, so the idea is you have, when you're doing self-healing and all that, you're fixing a problem, right? Or you're not self-healing and all that, you're fixing a problem, right? Or you're not really fixing a problem. It's the same as if you have a leaky pipe and you either put your finger
Starting point is 00:06:10 in the hole or wrap a towel around it or whatever, whatever bandage you might use to fix that pipe to get you moving through. And what Andy's talking about is don't just fix that pipe, fix the problem that caused that pipe, replace the pipe and whatever might have triggered that pipe to get the hole, go ahead and fix that left on in your pipeline. So, you know, maybe you had rocks being sent through your pipe that busted a hole in there. Well, get the rocks out of your pipeline, right? Did I do well? Did I pass? Yeah, I had a second.
Starting point is 00:06:39 I think the way it was initially explained to me was also the pipe. Then if there's a hole in there and kind of the water drips out and kind of runs all the way until the end on outside of the pipe. And then it basically drops down to the floor and makes a big puddle. One way to solve the problem is to put a bucket underneath. And if there's more water coming through, you may want to place a bigger bucket on the floor. But that's obviously not the way you should solve it. I mean this is the analogy for me of just creating more automation scripts in production to yet solve more problems that kind of run down the pipe. But you're not solving the initial problem, which is, as you said, I think I like your analogy too with the rocks. Maybe there were some rocks running through the pipes that kind of punched a hole in it.
Starting point is 00:07:24 But yeah, I think it's all about prevention. And with the bucket idea too, because I think this brings up a good bit of the analogy too. If you think about it, if you're not washing those buckets, what's going to happen? Your water is going to overflow from those buckets. And now not only do you have a leaky pipe, now you have a floor damage to your floor,
Starting point is 00:07:43 which if you're on the second floor is going to go into the floor below and start damaging down there. So it's a non-maintainable situation because eventually it's going to get away from you. And that happens with the bucket scenario as well. I like it. you think about the the services on top that are you know maybe they have memory leaks and now they are once they're flowing over they're going to impact the underlying hosts because they're eating up all the memory and also now the cpu because of garbage collection kicks in until it drips down even further into the data center and it's kind of wiping out all of the resources that you have because you're just all the resources are now spent in memory cleanup and all that stuff maybe we should rename the show.
Starting point is 00:08:26 That's exactly what I just said. That's amazing. I know. Maybe we should rename the show from Pure Performance to Pure Analogy. Yeah. Let's do them all as car analogies because people love car analogies. Yeah. But coming back to the topic itself, I know you said – also if you read the blog, in the blog I actually started with – that I believe as exciting as the name self-healing is, I believe the term itself is misleading. system unless they are you're driving it that far that in case you really have uh you know you can
Starting point is 00:09:07 detect bad code and you can automatically fix bad code in a fully automated way then it's maybe self-healing and with code i mean not only code is code but also configuration is code and all that so maybe that is possible but in most cases i believe we talk about smart or auto remediation um and just what i also just to make a point there because i think it's an important point But in most cases, I believe we talk about smart or auto-remediation. And just to make a point there, because I think it's an important point, the difference between remediation and healing means if you're doing healing, that problem's not going to happen again because you're fixing that root cause of it, whatever it might be. Remediation means you're making the problem go away for now, right? And it might be for now for a while, might be for now for an hour, but it's a remediation. It's a fix. It's not a cure. And I think that's a really good point to make with those terms because again, yeah, who really has self-healing, you know, maybe in a few years when we have, you know, AI extended into everything, you might be in that kind of a
Starting point is 00:09:59 situation. But, uh, I think, um, the remediation is a much more realistic and viable kind of a, you know, you can talk about self-remediation or not. So auto-remediation and people can be like, yeah, I can get with that. Yo, I can get with that, yo. Yo, yo. Well, and yo, yo, I have to say one more thing on this. that for certain problems, auto-remediation without any follow-up action might be good enough, especially if you think about people or organizations that are now looking at using canary releases,
Starting point is 00:10:34 feature flagging, and things like this where you want to get features out as fast as possible may not be perfect. But if you remember back, Goranka, what she told us with Facebook, where she said most of the features that Facebook ever releases never are really successful. Therefore, they take it offline. If these features don't hit the bar, meaning that many active users over a certain period of time after the feature was first released. So therefore, there might be remediation actions that temporarily fix problems in a way that it's not noticeable to the end user,
Starting point is 00:11:14 but there's no long-term healing action necessary because eventually some of these features are not there for that long anyway. So I just wanted to point this thought out there as well. One other thought on that, though, I want to say is if people are thinking, well, I don't know if we're going to be able to get into the auto remediation or anything like that. If you're running a cloud platform, you're probably already doing something like this. And I'm kind of stretching it a tiny bit here. But just think about auto scaling. Autoaling on its own is auto-remediation. Now, it's not because you necessarily have a problem. Your problem is we know what the capacity of our system is. We know
Starting point is 00:11:53 our traffic is going to go. When we know that when we reach that capacity level, we need another server to handle the traffic, and then we scale it back down. But that in itself is auto-remediation. You're remediating for traffic. So you already have it, extending it for unexpected problems or predictable but unexpected problems is just taking the next step. But it's all part of the same concept in a lot of ways. Yeah, exactly.
Starting point is 00:12:18 Yeah, and so coming back to what I tried to explain, folks in the session, so there's a lot of known use cases already, like what you just said, right? You're running out of resources, you scale up. We have auto scaling groups for that. But there's a lot of things that we also know that go a little bit beyond that.
Starting point is 00:12:35 So for instance, the classical disk full problem, right? The classical disk full problem can be solved by jobs that clean, let's say, log file directories. If you have a logging strategy, then you alert on, let's say, a certain percentage of consumed disk space of logs. And then you are either archiving them somewhere else or just removing them. And that's obviously one traditional approach. But what I brought up here is, well, this works if you don't have any changes in the way the application actually logs things.
Starting point is 00:13:13 So what we see constantly, and again, I was kind of confirmed by the folks that were sitting in the room because a lot of people were nodding and then also somebody raised their hand and said, yes, this is exactly what I've seen. The classical problem of somebody makes a configuration or a code
Starting point is 00:13:30 change and now turns on verbose logging or brings in a new framework that has a default logging strategy and the default is not the thing you want to have in production and nobody cared in pre-prol. So now we're logging that many more log messages that nobody really cares about so and if something like this goes into
Starting point is 00:13:52 production and you just apply your let's say default strategy that you used to have over the years of just cleaning up log files if they are filling up the disk then it's just like adding a bigger bucket under the leaking pipe because the root cause is obviously something somewhere completely different especially if the uh the cycle time of how often these cleanups happen is now kind of kind of you know fastening up if it's if it's getting if you fill up your disk faster and faster because you have a more verbose logging strategy, maybe that was not even intended, then this is something you need to really address. And this is where self, all the remediation alone doesn't really help you if all the remediation is just the default of cleaning up log files.
Starting point is 00:14:38 You really need to then follow up with the engineers and say, hey, look at this. We used to clean up the log directories once a week. Now we have to do it once a day because we are, because you guys are just logging so much more and then you want to probably talk with the developers and say, you know, what of this information that you actually log is actually useful? This is stuff that I always bring up and say, hey,
Starting point is 00:15:04 you know, if you are a performance engineering or if you're in operations, from time to time, take the log files, walk them over, bring them over to your engineers, and then ask them what of this information here is actually useful in case of a problem or what information is necessary for our logging strategy and for our analytics and everything that they cannot point out as being necessary, ask the question, so why is it in there? And this is my point.
Starting point is 00:15:34 So there's a lot of use cases. And I brought logging, I brought exception handling, I brought database connection handling. I brought a lot of these examples on where I believe we need to level up in terms of what our auto remediation strategy is. So what strikes me is really funny is I'm going back to seven or eight years now when I and saying, you know, whatever you don't need in here, why don't you remove? And getting an earful of why they're not going to remove it because of time and other projects. I think things are probably a bit better today. Obviously, there's some of those situations still in existence. But with the smaller, speedier things, definitely could be added at least to the bit.
Starting point is 00:16:23 But what I wanted to bring into here, you brought up a really interesting point with this idea of going back to the development team with the logs and, and, you know, from the engineers, from the operations team to go back and make that connection, which is where that shift left part of your topic comes in. I think a large part of it. I would also then extend that to not just, you know, not just having the SRE team contact the developers and say, hey, make this fix, but then also extend that to the monitoring team throughout the pipeline to say,
Starting point is 00:16:51 we want you to also monitor the size of the log because it used to grow, you know, by 10x under certain load in production. Let's take a look at what that grows under. Let's say you're going to do a load test, right? What's the load test growth of it. Now that we know what it is, keep an eye on that, make that one of your metrics. Maybe you can add that to there so that if you do see a change in the size of how it grows during your testing, you know, there's a change to that. And you can maybe stop that or figure out, you know, if this is necessary, if something else can get cut, but you could stop that early in the pipeline, part of the whole shift left of it, so that, again, you don't introduce that in there.
Starting point is 00:17:27 Not just checking the developers, because, right, there's always going to be human error. Adding multiple checks through it into your pipe, and you could probably even automate that, I'm sure, checking the size of it during a test. Yeah, exactly. And I think, Brian, you just hit the, what's it called, the nail on the head here.
Starting point is 00:17:42 This is exactly the, I believe, the change that has to happen to those folks that used to work in what I now call traditional operations, where it was about, you know,
Starting point is 00:17:54 keeping the infrastructure running, the cleanup tasks, provisioning new hardware. I think this is all stuff now that, as you said earlier, it comes with the auto scaling groups there's like built in things into most of the past environments where these these things are completely automated now but i think the what what's not been automated yet is what sre tries to solve which means we are
Starting point is 00:18:20 if i i mean again i'm not i haven't been in the industry that long and I've never worked in operations, but probably if I would have spent my last 20 years in operations, then I would like, then I would want to become a site reliability engineer, which means I take all of my know-how that I've built up over the last 20 years, because that's amazing know-how that you have about problems that can go wrong on an infrastructure level and this is still valid most of these concepts obviously for the cloud because in the end the cloud is just another man's or another woman's uh hardware that i just rent right so i will take this knowledge and try to figure out how can i become a mentor to the folks that are actually pushing code changes through the pipeline. Because if I don't do this, if I don't fix these problems earlier on, guess what? Companies will eventually figure out that traditional operations is no longer needed because AWS, Azure, or Google, they're taking care of provisioning the right hardware and
Starting point is 00:19:20 the right resources, making sure enough disk space is there. But if you now allow these, let's say, verbose logging kind of creep into your builds, then the cloud providers and the past providers of the world will solve this problem for you by just throwing automatically more hardware on it, and you will realize it at the end of the month when you get presented with the bill and i think this is where this is why it's so valuable now to have the traditional operations team that level up to become site reliability engineers meaning automating thinking about automation not only on an infrastructure level but thinking the next level up this is the services and the applications and also shifting left in making sure that these problems are detected
Starting point is 00:20:13 and then prevented earlier in the pipeline and i think this is the this is an amazing opportunity for for all these teams now yeah i think it goes back to some of the, the point you made about the, the end of the bill month, right. Comes back to discussions I used to hear, um, in my, you know, my, my former jobs where people would come in and trying to sell us on ideas or on tools or whatever to help improve processes. And sometimes the operation person would say, well, why don't I just throw more metal at the problem? You know, it's like, well, because of the expense, you know, and, and back in the old days, you know, you had a very clear idea because you would buy a server per server, right. Or disc per disc, you were purchasing it. They were coming out of direct your budget.
Starting point is 00:21:01 And I don't say this to disparage the cloud providers. I don't think they're doing anything sneaky. They're solving problems by just allocating this stuff for you. But if you're not paying attention, those things are going to just rack up. It's fake money, right? In a way, you don't realize it until it's too late. So there is definitely, definitely a need to be looking at that before you just run the cost through the roof and
Starting point is 00:21:26 that again i believe goes back to part of what garenka was saying right a bit of you know is this perform is this going to be worth running is this feature worth running in production from from the cost from everything else the maintenance anyhow sorry no that's perfect yeah and so a couple of metrics that that i think mentioned in the in the blog post and also in the talk it is the the the and i like actually your explanation better when you said kind of the ratio of the logs being written under a certain throughput right that's one thing then another metric that i brought up is the number of exceptions being thrown if you remember brian we talked a lot about the cost of an exception.
Starting point is 00:22:08 And with exceptions, we mean application exceptions that may never see the light of a log file, but they're just handled within frameworks. But they consume memory. If they consume memory, it means they also have to be garbage collected. So it's memory and CPU. So I encourage them to look at the number of exceptions being thrown. I also encourage them to look at CPU cycles consumed. This is a metric that obviously we in Dynatrace show you on an endpoint-by-endpoint basis of CPU cycles and how that changes from build to build.
Starting point is 00:22:41 Because somebody may add a new library or tries a new algorithm to do certain things, and maybe now it takes more CPU cycles, and this is something we need to capture early on. So this is the whole shifting left, so making sure we can detect these problems early on. And also kind of start, as I mentioned earlier, I think the word mentor is a great term. Mentoring
Starting point is 00:23:06 the engineers that don't have 20, 30 years of experience in operating large-scale environments. I remember when I came out of high school, I mean, I didn't really care much about CPU cycles or memory
Starting point is 00:23:22 or disk because I was just, seriously, man, it was like, who cared about that when you got out of high school? Come on. I know nobody cares about that. Right. I mean, you just want to write cool code. And, um, and therefore I think it's, it's a great, um, I think it's a, it's you, everybody should feel passionate about educating and mentoring the next generation of software engineers by telling them what they've learned over the years. And so this is one thing. The other thing, though, and this is also what I mentioned, it's not just about saying I need to look at these metrics.
Starting point is 00:23:57 But the challenge is how can we actually simulate that particular behavior? How can we simulate the, let's say, similar load in pre-prod? And how can we simulate similar problems that could potentially happen in production? And this is where I then talked about things like production twin testing, where we can take the traffic from production and either through modern frameworks like Istio, for instance, to mirror traffic into another environment, there will be one option. Another option would be, and that's what we've been doing with some of the load testing providers like Neotis.
Starting point is 00:24:39 I'm working with Hendrik from Neotis right now to extract production workload information and then create a workload definition that is very close to production so to be able to always simulate production kind of equivalent load in a pre-prod environment. You mean like the load model? The load model, exactly, the workload model, yeah. Because the monitoring tool tool they have the data and so the idea is just you know extracting that workload model over let's say a 24-hour period from production and then taking this and then applying it to your load testing tool so the
Starting point is 00:25:18 idea and i think the idea initially came from i think from, definitely Mark Tomlinson, right? He said he wants to get to a stage where on Tuesday, he can simulate the load in pre-prod from Monday, right? He basically looks back, I think at midnight, he was clearing the data from the production monitoring tool and then generating the workload configuration for his load test the next day. So he can always validate it. I'm going to take that idea away from Mark. That did not come from him.
Starting point is 00:25:49 OK. And in fact, I'll take credit for it. Only because I remember way back at the first perform that I went to, which was in Waltham at what was like, it was maybe 2011 or so. I saw Burned. And it might have been 2010, because it was before I started with maybe 2011 or so, uh, I saw burned, uh, and it might've been 2010 cause it was before I started with Dynatrace. And I said, you used to work on a, uh, you know, used to work
Starting point is 00:26:09 on load generating tools and all that. Can you make one that takes the, uh, traffic from production and recreate it in the tool? And his response was, well, I don't do those tools anymore, which is a good response, but yeah, no, I mean, not to take away from Mark. Um, but yeah, I mean, that's always been, I think the Holy grail for, for load testing has been to do response. But yeah, no, I mean, not to take away from Mark, but yeah, I mean, that's always been, I think, the holy grail for load testing is going to do that. But I think bringing up Mark brings in another idea here. Besides those models and you're testing the production traffic, I really love the idea of the continuous performance environment, right? Where not only are you testing those models, everything else, but because it's continuously under load, as you roll out, you're also testing your deployment strategy.
Starting point is 00:26:49 So just sidetracking there is recreate as much of that in your pre-production and your testing environment that you can because that is a perfect way to test if your rollout is going to be successful because you never know when you're going to be rolling out under full traffic or not, right, with an emergency and everything else. Exactly, yeah. And so the way I always explain it, I said, you know, if you really have a production twin, I think this is the
Starting point is 00:27:13 term that Alois coined, Alois Reitbauer, he said if you can do production twin testing, that means you have a production-like testing environment, then with every build that you get to the pipeline, you can basically validate, will this build survive production if you would decide to deploy it in production today?
Starting point is 00:27:36 And I think that's the most critical thing to answer. And now where kind of my talk then continued was, so we know what we want to fix in production. We know we need to simulate similar loads in pre-prod like it is in production. Now, if we have remediation scripts in production, we need to validate them obviously in pre-prod because I don't want to validate a remediation script the first time in production. That would be really brave or stupid. But I think you want to validate it early on, right? So you want to simulate these scenarios. Yeah.
Starting point is 00:28:15 And so, for instance, you know, we can just take the Chaos Monkey test or Chaos Monkey frameworks that I'm sure people are familiar with, and if not, just Google for Chaos Monkey, where the idea is that you have automation scripts that are going to simulate bad things happening. For instance, filling up your disk or reconfiguring your route. So something that could potentially also go wrong in production, having a denial of service attack that is coming in. Things like that, simulating real world behavior. And then the reason why we want to do this, obviously, in pre-prod, in an environment that has production-like load, is figuring out
Starting point is 00:28:59 if the auto-remediation scripts, first of all, get triggered correctly. Or first of all, if your monitoring is actually detecting that there is a problem fast enough, that your monitoring tool is then triggering the correct auto-remediation functions and that these auto-remediation functions actually auto-remediate the problem and also then trigger the right alert or notifications or follow-up actions. So for instance, if the auto-remediation does not solve the problem within a certain amount of time, escalate it up to the next level. So with this, you can also obviously test your incident response. If it does solve the problem or remediate the problem, maybe then then depending on the on the on the root cause create a ticket for engineering to follow up afterwards right because this is what we said earlier it's the auto remediation versus self-healing i know we have to look into this later on because why all of a sudden this come in so these are all the things we i think we need to take care of so you're no longer just testing your application you're also testing your monitoring
Starting point is 00:30:04 and you're testing yourself remediation yeah. Yeah, of course, because everything is, I mean, this all belongs together. And this brings me to the kind of my conclusion to my talk. My conclusion was when I came out of high school, I was a developer and I wrote code. Somebody else took care of testing, somebody else provisioned the infrastructure and production, and somebody else operated the whole thing. Then 15 years ago, I would say, is when agile development came in, where we were encouraged to also write our own tests. And then DevOps came in, and with all the automation tools around provisioning hardware
Starting point is 00:30:40 and configuring our PaaS environments, like nowadays, you know, with Kubernetes, I have all of my configuration files and how my apps and microservices should be deployed. So now all of a sudden I came from just writing code as code to also test this code and also infrastructure as code. And I believe the next step is to sit down and figure out, hey, what are important metrics that I want? So monitoring is code.
Starting point is 00:31:06 And the last step would then be auto-remitation is code. So I want to also develop scripts with the help, obviously, from people that have a lot of knowledge on that. That's, again, where the mentoring comes in. I want to also write script as part of my engineering process that live in my source code repo that will be executed in case something happens unexpectedly. I think this is it. And then we have everything as code.
Starting point is 00:31:34 Code is code. Test is code. Infrastructure is code. Monitoring is code. And auto remediation is code. And if we track this with every single change through the pipeline, then I believe we should be very close to what we now call the unbreakable delivery pipeline,
Starting point is 00:31:54 which means we should not be able to deploy something from development all the way into production that can then break where it really matters, which is the end-use experience or your business-critical systems, because we can either prevent them early on or we have to write remediation scripts in place that can do as much as possible in a fully automated way to bring the system back to a healthy state, to a reliable state, hence site reliability engineering. Yeah, and I would add to that, well, not add, but my addendum to that would then be for
Starting point is 00:32:29 everybody thinking, well, I'm going to, you know, automate myself as out of a job. It's like, no, listen to what, you know, everything that you just said. Someone has to create all the scripts. Someone has to maintain all the scripts. Someone has to test all that stuff right there is a whole new set of opportunity um to do something within with that feels familiar but is quite you know new and different and it's not like it's a one-time deal you get that that pipe you get that all set up that's got to be maintained right um so this is not a case you know until we can figure out a way to automate the automation right and in that case a lot more of us will be out of a way to automate the automation, right? And in that case, a lot more of us will be out of a job and there'll be other things to worry about like Skynet and all,
Starting point is 00:33:09 but, um, you know, there's a lot of opportunity there. Like I've been playing with automation for the last year or so now, again, this going back to your old, that blog that you did, um, I think it was, uh, based off of a Wilson Mars speech about, you know, the, the, the future of, you know, performance engineers and load testers. one of them was you know take take a daily task and automate it right so i started with that that's actually we were talking about ansible before i started with with the ansible one um and i've just been jumping in every time i've got to do something um you know with perform i got to do all the stuff with these servers i'm like oh let me figure out cloud formation figuring out it's easy stuff especially if, especially if you don't have that developer background.
Starting point is 00:33:47 If you're not a full hardcore, you know, like you came out of high school or college or whatever going into coding, I was going to go make movies, right? So I didn't have that developer background. The extent of my coding in the old days was writing in C for the Lode Runner scripts, writing maybe a little loop and making sure I had to do the memalloc and memfree and all those annoying bits. But very, very little bit. The nice thing about automating these things is you're not writing hardcore code. These are small, discrete things that are pretty easy to tackle. So it's not difficult. It's not a humongous entry point. And it just opens a whole new world of what's going to be coming next. Um, because this is going to be coming, uh, you're going to start seeing this stuff more and more. You know, I think a lot of, I think the slow, the slow part of the adoption is going to be getting people to buy into trust in the pipeline that will fully
Starting point is 00:34:39 automate itself. Um, I was, I was recently doing a talk on one of your pipeline talks, but I was doing the cloud formation one and obviously the talks, but I was doing the cloud formation one. And obviously the shift right of all the metadata to your tools, that part, you know, everyone can very easily buy into. But then when you start saying, okay, now take the data and automatically remediate based on, you know, thresholds and alerts that, you know, always makes people a little bit uneasy because they're like, well, I want to make sure it's like, well, over time, once it proves itself out, people get comfortable with it. So that's why I'm saying, yes, it will come. You know, it's going to take people a while to get used to and feel comfortable
Starting point is 00:35:14 with, but we're headed there. And I think this is some great, great advice, um, that you're putting out there, Andy, on how to get started, how to look at it, uh, things to consider, uh, cause there is a lot to consider with it. Um, but as, uh, Donovan Brown would say, start with one piece, you know, maybe spend 15 minutes a day, uh, working on a piece of it. And we, we've heard it from all the people who've done the successful CICD, uh, conversion, you know, um, all the people have gone through the transformation. Most, you know, in almost all cases, people are not saying, I'm going to redo everything all at once, right? You start small and start building out and adding and adding and adding
Starting point is 00:35:50 until you migrate over, right? So don't be overwhelmed as well. It's not an all at once kind of a situation. I think that almost qualifies for a summary. Wow, was I the summarator? I think you were the summarator and didn't even know it. Yeah, I was wondering if I was going to have to call on you to summarize your own thing. No, I think you did a brilliant job.
Starting point is 00:36:13 I think the only thing we want to – and this is probably going into the description of the session anyway, the links to the blog and the video. Yeah, if you go to Dynatrace blog, if you go to the Dynatrace blog, I just had it up where to go. Oh, come on. There we go. If you go to the Dynatrace blog right now, it is up on the top, right? But it'll very, it's going to be up there again. By the time this airs, there's not going to still be too many new blogs of anything. So just look for the shift shift left SRE.
Starting point is 00:36:51 And you'll see a picture of Andy wearing his Dynatrace shirt. Um, looks like he's holding onto something. He's got his hand resting on an invisible podium in this picture. Um, but it's up there. It's got the video link in it to check it all out um also we want to remind people right um speaking about aws and speaking all that we have perform 2019 in vegas coming up in january right uh andy and i will be there as well as a lot of other people uh andy gives so so for people who don't know how would you summarize what perform is what is perform i think perform is a great way to network with the people that are in a similar situation like you are meaning you need to change you you you need to change the way you've done things in the past in your company that's what the company sends you to conferences like perform to figure out how can we actually leverage this new technology that we that we all build right in our
Starting point is 00:37:47 case obviously it's monitoring uh full-stack monitoring with our ai capabilities and how can this you know in a in a positive way impact our lives and how can it obviously support our businesses so what i like about perform is the first day is the hands-on training day where uh where we people can choose obviously a morning and an afternoon session it's four hours each on variety and one of those they should choose should be the uh dynatrace for appmon users which is what i'll be teaching oh cool yeah perfect yeah that's uh obviously a lot of appmon users out there still that uh want to know what the new world is going to look like with our third generation management platform. I'm doing the continuous performance with Jenkins using the cool Jenkins performance signature plugin from our friends from T-Systems.
Starting point is 00:38:37 That's going to cover a lot of what we're doing today. Not necessarily everything, but a lot of what we talked about today is going to be in that, which is awesome. And then the other thing that i like is the breakouts right i am track captain of the devops no obstract and to the topic of today we have a couple of sessions around self-healing and sre i have one session that i'm i mean many sessions i'm looking forward to but the sre session is going to be a guy from McGraw Hill education, how to build an SRE team. Then we also have Experian talking about self healing at Experian.
Starting point is 00:39:13 So there's a lot of great sessions out there. Nestor is back from Citrix talking about how to level up operations, virtual operations, automation. So there's a lot of cool breakouts and you can just hear from your peers on what they've been doing and what they're thinking of and which problems they ran into and how they solved it right i think that's the great thing about perform yeah i went to one of the really i don't think it was called perform at the time but i was talking about earlier in the show when i went to that one uh in the early early days of dynatrace and one of the things i got out of it, even way back then was just
Starting point is 00:39:45 finding other people who are using some of these tool sets and saying, Hey, what did you guys do? How did you guys get through this problem? And just hearing a bunch of different ideas of how to tackle things, not just from the breakouts, not just from the keynote speeches, but just the, your person to person with someone. And you could just share ideas and collaborate. And maybe the two of you walk away with something new you can go back and try. So it's really awesome. Myself and Mark Tomlinson will be doing podcasting from there for Pure Performance.
Starting point is 00:40:12 So if you want to tell us a little short story, you can always come up to us. It's going to be at the, which hotel is that at? The Cosmopolitan. Yes, the Cosmopolitan in Vegas, January 28th through the 30th. So go ahead ahead sign up and do a get a hot day session in there exactly cool all right well andy thank you for being a guest on the show today um i don't know you know i guess you count i guess you have the most you and i both
Starting point is 00:40:40 have the most repeated appearances on the show but i've never been the quote-unquote guest, so I never have anything to say so much except for make dumb jokes. No, you have stuff to say, right? You've just done a great session in Denver at the user at the meetup. Yeah. And then you are obviously our expert when it comes to hybrid monitoring. So maybe you want to do a session on that at some point. Maybe. Talking about.
Starting point is 00:41:04 Hey, look at that. Hey. What? Maybe I'll be a hey maybe i'll be for me yeah awesome uh no i mean obviously yes i do know quite a lot about the our hybrid setup but um yeah maybe we can do that in the future uh and uh for anybody else who has ideas if you have ideas or want to be a guest on the show or if there are certain topics you would like us to explore make sure you let us know, go ahead and tweet it to us at pure underscore DT, or you can be old fashioned and send an email at pure underscore. I mean, pure performance at dynatrace.com.
Starting point is 00:41:36 If you want to send a handwritten letter, you can address it to Santa Claus at the North pole. And then we'll find out from Santa. What's with the Chris Kent? We we have actually so in austria we have the chris kind which is a different kringle kind of thing chris kindle exactly and we actually have an official mailing address for the chris kindle oh wow there's it's not just north pole no it's not just north pole it's in a city that is actually called chris kindle in low and actually close to Linz.
Starting point is 00:42:05 So you can actually mail stuff in, and it seems that they're really looking at every letter. So it would actually be really fun. If you go to that address, is there an actual building? Yeah, it's the Christkindl post office. Oh, it goes to the post office. Yeah. Christkindl post office in Steyr and there's also you find the if you search for Christkindl post office
Starting point is 00:42:30 you will find it Christkindl week 6 that's actually funny that's awesome I've never mailed anything there but it's a real thing yeah it's amazing the lengths we go through to deceive our children yeah and everybody gets a reply it just says here if you are
Starting point is 00:42:51 domestic or international oh really uh you'll get a reply oh i can say well that's awesome yeah yeah all right well learn something new yes always every always, every day. All right, well, thanks, everybody, for listening. Andy, thanks again for being my partner with this for quite some time now. This is Episode 76, so we're not too far off from 100. Looking forward to reaching that milestone. And thanks again for everybody listening who helps make this possible. See you all next time. Thank you.
Starting point is 00:43:24 Bye. Bye. Bye.
