PurePerformance - Self-Healing in the Real World – HATech Lessons learned from Enterprise Engagements

Episode Date: May 27, 2019

Self-Healing, Auto-Remediation: magic words for most IT leaders! When starting those kinds of projects, teams realize they lack the maturity, or even the understanding of their current IT landscape, to even think about self-healing. In other scenarios, self-healing is misunderstood as a band-aid for "keeping the lights on" in order to buy more time for outstanding product improvements vs. investing in the core architecture.

In this podcast we invited Jon Hathaway, CEO of HATech, and Jarvis Mishler, Solutions Architect Team Lead at HATech (@hatechllc), to learn how they help organizations assess and improve the maturity of their IT systems & processes, which auto-remediation actions they typically implement, and why real self-healing is not just about keeping the lights on!

https://www.linkedin.com/in/jonhathaway/
https://hatech.io/
https://www.linkedin.com/in/jarvis-mishler/
https://twitter.com/hatechllc

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always with me is my co-host Andy Grabner. Hi Andy, how are you doing today? Hey Brian, I'm a little sore today, to be honest with you, because, you know, at the time of recording it's the 15th of April, and yesterday was the Linz marathon.
Starting point is 00:00:46 Oh, it's tax day today. Yes, yes, but for you, you had a marathon yesterday. Yeah, but I guess a lot of people will have a sore feeling tomorrow when they find out that they may be late for tax day. Yeah, for me it was the half marathon, and I think today I'm kind of in need of a little bit of self-healing, uh, at least for my muscles and joints. See what you did there. I know. Very nice. It was an awesome segue, right? Well, since you brought up self-healing about your sore legs, it's funny because I think earlier when you mentioned it to me, you said you felt your legs today.
Starting point is 00:01:16 I'm like, yeah, I kind of feel mine every day. Shall we go ahead and let's jump into the episode since you brought it up. Unless you wanted to tell us what your time was. You did a half marathon. Do you want to share your time? No, it was not as good as I hoped. But that's why I keep it a secret. It was under five hours though, right?
Starting point is 00:01:33 Yeah, yeah. Under five hours? Yeah, yeah. Okay, good. So let's jump over to the topic. So I brought it up, self-healing. Self-healing in the real world. And today we have two gentlemen actually on the other end of the line, John and Jarvis from HATech.
Starting point is 00:01:49 And I don't want to introduce them because I think they do a much better job. So maybe John and Jarvis, do you want to go ahead and quickly tell folks who you are, what you do, and also why you think we brought you on the podcast to talk about self-healing in the real world? Absolutely. First of all, guys, thanks very much for having us. So, yes, my name is John Hathaway. I'm the CEO of an IT and cloud specialist company called HATech, based in Nevada as well as Arizona. And we spend most of our time building cloud architectures
Starting point is 00:02:22 for large enterprises all the way down to small startups. And we've been doing lots of cool and exciting things with regards to cloud migrations, as well as scaling organizations' DevOps capabilities through ChatOps, auto-remediation, some of those exciting aspects. And so we have a number of years' experience now, having worked for large enterprises and large vendors all the way through to working in this startup, that has helped organizations be successful in their IT strategies. Awesome. And Jarvis? I'm Jarvis Mishler. I'm a solutions architect with HATech. I've been working here for two years. And while John has the industry experience and the big brain when it comes to DevOps and all things solutions related, I have a
Starting point is 00:03:15 pretty unique perspective on the customer service side of things, coming from a restaurant background. So I try to advocate for our customers as best I can and sort of play liaison between them and HATech. Pretty cool. I think we need to pick up that topic. Your background sounds pretty interesting, coming from a restaurant. So let's keep this in my back pocket. But I want to ask you a question. So obviously one thing that I've noticed, and that I experience every time when I either talk with people or write a blog or have a podcast and we kind of use the terms self-healing and auto-remediation: everybody gets super excited. Then everybody says, this is what we need.
Starting point is 00:04:01 This is what we want. But then I feel like, as great as these terms sound, uh, I'm not really sure if everybody really has the same understanding of, a, what this is really all about, and, b, that it is not just like installing yet another tool, but it actually requires a lot of prerequisites. And so I want to really understand this. This is why I really love having you on the podcast today, because you are day in and day out working with different types of enterprises with, I'm pretty sure, different technology stacks, different processes in place. And therefore, you see a huge mixture of what self-healing actually would mean for them and also kind
Starting point is 00:04:46 of what the misconception is. But let's get started first. What do you see out there? What if you approach a customer, let's say a large enterprise that has, let's say, a more, let's say, not legacy, but an established tech stack. What do they think when they hear self-healing and what do they think it takes to get them towards self-healing or auto-remediation? And what can you actually really do for them as a first step? That's a great question.
Starting point is 00:05:18 So, I mean, a lot of our customers, a lot of the enterprises, they view auto-remediation as this magic wand that you can just wave, and suddenly your uptime is increased, your customer satisfaction is increased, your customers' experience goes through the roof and you maintain your competitive edge. In reality, what we find is where we have customers that have an established tech stack, they're probably doing monitoring, they're probably doing logging. They've got teams of individuals that are typically overwhelmed, trying to firefight and keep the application healthy and keep it up and running. But what we invariably find is that a large number of customers don't actually know enough about what their application is and how it behaves to make good decisions. And even for the first
Starting point is 00:06:21 tier of maturity around moving into auto-remediation, such as things like automatic diagnostics or ticket enrichment, for example, the precursor is closing a huge gap that typically exists around understanding what their application should be doing and what it actually is doing in real life, compared with how the application maybe was created years ago, where some of that tribal knowledge has been lost. And so typically, you know, what a number of enterprises are trying to do is bridge this gap between lack of knowledge and how to keep their application up and running. And that's where typically, you know, a number of the first stages
Starting point is 00:07:03 of adopting this sort of strategy will fail. Even though there's logging and monitoring, we find people are collecting every monitor metric, every log. They're generating petabytes of data per year. No one's reading it. No one understands it. And so without understanding how a particular incident or how a particular problem occurs in an application stack, you're never going to move into an auto remediation space. But what we also find is it's not always about the technology problem. A lot of the times we're finding that when we really boil it down, that the technology is kind of a distraction.
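(A quick aside on the ticket enrichment John mentions as an early maturity tier: the sketch below shows the general idea, attaching a small diagnostics snapshot to an incident ticket when an alert fires. The ticketing endpoint and the two fetch helpers are hypothetical placeholders, not anything described in the episode.)

```python
import json
from datetime import datetime, timezone

import requests  # assumes the 'requests' package is available

TICKET_API = "https://tickets.example.com/api/incidents"  # hypothetical endpoint


def fetch_key_metrics(service: str) -> dict:
    """Placeholder: grab a small snapshot of the metrics that matter
    (error rate, latency, saturation) instead of every metric collected."""
    return {"error_rate": 0.07, "p95_latency_ms": 1200, "disk_used_pct": 91}


def fetch_recent_logs(service: str) -> list:
    """Placeholder: pull the last few relevant log lines from whatever
    log store is in use."""
    return [f"{service}: example log line"]


def enrich_ticket(ticket_id: str, service: str) -> None:
    """Attach automatic diagnostics to a ticket so the on-call engineer
    starts with context instead of a bare alert."""
    payload = {
        "comment": "Automatic diagnostics attached by remediation bot",
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "metrics": fetch_key_metrics(service),
        "recent_logs": fetch_recent_logs(service),
    }
    requests.post(
        f"{TICKET_API}/{ticket_id}/comments",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
        timeout=10,
    )
```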
Starting point is 00:07:44 And that really what they're trying to do is they're trying to answer a business question. And a business question is, how do I keep my business and our platforms and our customers as happy as possible? And so a large amount of our consulting actually comes around proactive remediation as well. So being able to predict what a platform needs to look like, not just trying to keep a platform up and healthy as long as possible, but maybe someone has acquired more Twitter users and that's going to drive more load on a particular platform. That can also be classed as remediating a problem that actually hasn't occurred yet, but getting into that world of predictive analytics is kind of where we're moving into now between the sort of auto-remediation and scaling operations through to how do we predict
Starting point is 00:08:36 some of this ahead of time. And the two topics, even though they're treated very separately from enterprises, they're very intertwined from an analytical and a data gathering exercise. Well, I mean, it's obviously very interesting because in the end, we talk about figuring out what are the different scenarios when your system needs to react to, let's say, an unknown situation or an unknown behavior, whether it's increased load. And I like your example with what if your team, your organization is planning a new marketing campaign, what's the increase of load that you're expecting?
Starting point is 00:09:19 And then obviously plan ahead and assume you're doing proper testing upfront to make sure you can actually handle that load without the system falling apart. Now, let me ask you, when you're taking a step back, if you approach an enterprise for the first time and they ask you to help them with building better systems, building resiliency in there and self-healing, do you have a way to evaluate the maturity model as it is right now? So what are the, are there any, some kind of questions or things you look at to figure out where they are on a scale from zero to 10, which zero means, you know, obviously
Starting point is 00:09:59 everything is back to basics and they have no real clue about the app and they don't really know what auto remediation is versus 10 where obviously there will be the the perfect scenario where everything is fully automated so do you have a way to go in meet a customer and say i need to ask you these 5-10 questions and then i know more about you so what is this what you do yes absolutely so both on our devops, as well as our auto remediation transformation, we have an engagement where we'll go in, we will sit with the teams, we'll sit with the leadership. We have like a 250 point questionnaire that we sit there and
Starting point is 00:11:27 typically fill in in the background for them. We're not asking clients to fill it out for us. We sit in their scrum meetings, their sprints. We sit in their planning and product planning sessions. We can understand what they're trying to build and what they're trying to achieve with their platform. And then also, we're sitting within their engineering teams over a period of time and we participate in releases, getting kind of this fly-on-the-wall view to understand exactly where these problems are. Invariably, it ends up in a QA type of environment where QA tests have been missed, and that's where we see a lot of the remediation work that we do going back into driving QA to be better and helping QA be the quality gate for the organization
Starting point is 00:11:55 rather than just pure testing. But yes, we go in, we sit and consult with many of the different teams involved, both in the value chain but also in the product release cycle, and just understand where their pain points are. And then we start digging into things like ticketing systems and looking at how many outages have occurred and what type of problems they're looking at. Invariably, though, you kind of alluded to this as well,
Starting point is 00:11:55 that there's not always a lot of knowledge around the application. And so what we also find ourselves doing, ironically, is we end up finding we're documenting a lot of the architectures for them as they've grown over the years, or we're documenting particular API flows for them or particular architecture flows for how users are coming into the systems. And we use a number of monitoring tools out there that help us map out these types of interactions. And that gives them a lot of insight to begin with. What we also tend to find is once we've done that work, 10% to 20% of the issues they can identify immediately. They just didn't realize where they were, or there was a lot of misconception about
Starting point is 00:12:43 how parts of their application tiers may be communicating, or how maybe some parts of their application weren't scaling the way that they expected them to, and they were scaling using the wrong metrics. So invariably, through that initial type of work, we identify some very, very quick wins for our customers to go in and fix immediately, which then gives us a lot more time to then work on some of the more complex algorithms and complex problems that really are much more buried in the code as well as the infrastructure. So you are kind of guiding them in this initial phase, when you are kind of becoming part of their sprints. I guess what you are doing is you are kind of showing them what they don't know about the app and their infrastructure. So you are kind of the facilitator or the moderator, I'm not sure what the right word is, but you're kind of getting all the different people together to
Starting point is 00:13:50 really help them understand what the architecture looks like, what's really going on right now. And with that additional knowledge that you help them kind of, you know, surface, they actually are then able to have, first of all, some quick wins, um, because some of these things might be things that they, as you said earlier, maybe have, you know, forgotten, or that have not been shared correctly, so that the right people could not act upon them. So, um, I'm not sure if I did the summary well enough. Yeah, you're absolutely correct, Andy. I mean, I think the reality is everyone's busy doing their job.
Starting point is 00:14:31 No one really has time to look back in history and try and work out what everybody else's job was as well. And so while getting everybody on the same page and understanding what the goals of the business are is absolutely critical. You know, we're not, we're not, we don't go into an organization to sell them a technology. We're going in to embed as part of the team and bring expertise that we've seen from other customers that we've been working with that maybe have similar tech stacks and bringing that knowledge immediately into those organizations. We do a lot of work with things like Kubernetes and OpenShift and AWS. And we see all these different technologies and how they've been used well and how they've been challenging. And so we have a luxury of coming into an organization with this toolbox, being able
Starting point is 00:15:22 to land into a particular project or into a particular engagement and bring that sort of experience, which enables us to sometimes leap to some conclusions that maybe they haven't been able to do internally themselves purely because they're working only in their environment and not having the luxury of working across many different instances. And so that tends to bring some clarity to the overall adoption and starts helping build that foundation for a self-healing or auto-remediation strategy because we're not going to build a black box. We're going to come in and help you and help the customer
Starting point is 00:16:04 and help our clients build out a solution that works for them. And, you know, being very technology and tools agnostic, it has to fit and gel with the culture of the actual team itself as well. Hey Andy, I wanted to ask a question here. In terms of where you've put self-healing in, I imagine you go into a lot of places, and one of the things that looks like you all do is help enterprises also create their pipeline, right? Get that whole CICD front-end piece flowing. And one thing we've noticed when we talk to a lot of other people is the idea of moving to a DevOps culture,
Starting point is 00:16:42 moving to a CI/CD pipeline, it's always: don't try to do the whole thing at first. It's, you know, start with a smaller project and build and build and build until you get the whole thing, you know, until you
Starting point is 00:16:50 get the culture adopted, you get the pipeline worked out, you get things flowing. When and where do you see the auto-remediation piece being introduced into that process? Like, in other words, when you start,
Starting point is 00:17:01 like, let's say you're starting on getting someone, you know, flowing into a pipeline. Are you including self-healing at that point as well, like baking it in from the ground up? Or is that kind of like the next step: after they reach maturity with their pipeline, then it's time to bring in the healing? So I would say that it kind of happens hand in hand. And really the biggest driver for auto-remediation comes down to an unwillingness for a product team to fix an issue, and therefore features and new capabilities now take priority. And so there's always this internal struggle with regards to maintenance versus
Starting point is 00:17:46 new features, new capabilities. And if you're in an organization and you're just pushing new features all the time, then that escalates the requirement for auto-remediation. If you work in an organization where there's great feedback, and engineering are stakeholders in the product design, and, you know, resources are allowed to work on maintenance-related issues, then, you know, that becomes less of an environment where auto-remediation is required. So what we see a lot of the times is auto-remediation being offered as a band-aid to give the product team enough time to then find the right resource or the right opportunity to introduce the correct fix into the environment. Now, there are obviously patterns out there.
Starting point is 00:18:40 If you're in the Kubernetes world, then you've got a number of auto-remediation capabilities already built in with readiness and liveness probes. If you're working in an AWS environment, then you can use load balancer health checking and various other things to do some things in your application. But when you're talking about more of an IT infrastructure, we see auto-remediation being brought in very, very early on into a project. And the reason is they're not necessarily living in a cattle type of world where we can just kill things and restart them, which typically is where auto-remediation starts off. It's something that has to be maintained. We can't rebuild it. We've got to manage it whilst it's in place. And that raises the requirement for auto-remediation.
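(For readers less familiar with the probes John refers to: Kubernetes periodically calls health endpoints on a container and restarts it when the liveness check fails, or pulls it out of load balancing when the readiness check fails; an AWS load balancer health check works on the same principle. Below is a minimal sketch of the application side only, assuming a plain Python HTTP service; the endpoint paths are conventional, not mandated, and the restart itself is done by the platform, not by this code.)

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In a real service these flags would be derived from actual checks
# (database reachable, startup finished, worker not wedged, ...).
app_alive = True   # liveness: the process is not stuck
app_ready = True   # readiness: it is safe to send traffic here


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":      # typically wired to a liveness probe
            self._respond(200 if app_alive else 500)
        elif self.path == "/ready":      # typically wired to a readiness probe
            self._respond(200 if app_ready else 503)
        else:
            self._respond(404)

    def _respond(self, code: int) -> None:
        self.send_response(code)
        self.end_headers()
        self.wfile.write(b"ok" if code == 200 else b"not ok")


if __name__ == "__main__":
    # The orchestrator or load balancer polls these endpoints and performs
    # the remediation (restart container / remove from rotation) on failure.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```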
Starting point is 00:19:32 So it's really a combination of how much your infrastructure follows things like 12-factor and various other things versus how much of it is a traditional IT infrastructure. They tend to have auto-remediation brought in with slightly different priorities in slightly different locations. And then a lot of it comes down to the willingness for a product team to carve out time to provide maintenance-type fixes for some of these issues. Obviously, it's a maturity question, and I think you just phrased it nicely where you said, if you look at your full stack, you have different layers, right? You have infrastructure, your network, your storage, your compute,
Starting point is 00:20:13 your services, your applications. And obviously, there's remediation. Remediation can happen in any type of layer. Do you see, what's the percentage of companies you work with that actually reached, I would say, kind of the higher levels of the stack, meaning service or application remediation? Because I understand that on the infrastructure side, it's not necessarily easier, but I think certain things are baked into the underlying stacks that we're using,
Starting point is 00:20:50 whether it's restarting machines or cleaning up. As we said in preparation for this call, something like cleaning a full log directory is something, let's say, more simple and obviously still self-healing or auto-remediation, versus going a level up in the stack, when you say you want to switch traffic between two canaries because canary A and canary B differ from a conversion rate perspective.
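(Since the full log directory comes up a few times in this episode, here is a rough sketch of what that kind of low-level remediation script could look like, assuming a Linux-style host and permission to delete the oldest rotated log files; the path and thresholds are illustrative, not taken from the conversation.)

```python
import shutil
from pathlib import Path

LOG_DIR = Path("/var/log/myapp")   # illustrative path
USAGE_LIMIT = 0.90                 # start cleaning at 90% disk usage
TARGET_USAGE = 0.80                # stop once back under 80%


def disk_usage_fraction(path: Path) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def clean_old_logs() -> None:
    """Delete the oldest rotated log files until disk usage drops below
    the target, leaving the current log files untouched."""
    if disk_usage_fraction(LOG_DIR) < USAGE_LIMIT:
        return  # nothing to do

    # Oldest rotated logs first, e.g. app.log.5.gz before app.log.1.gz.
    rotated = sorted(LOG_DIR.glob("*.log.*"), key=lambda p: p.stat().st_mtime)
    for log_file in rotated:
        log_file.unlink()
        if disk_usage_fraction(LOG_DIR) < TARGET_USAGE:
            break


if __name__ == "__main__":
    # In practice this would be triggered by a monitoring alert or a scheduler,
    # and the action would be reported back to the ticketing system.
    clean_old_logs()
```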
Starting point is 00:21:24 So to kind of raise my question: of these companies that you are talking to and helping, how many of those are kind of trying to figure out auto-remediation in the lower levels of the stack, closer to the hardware infrastructure? And how many already talk about, let's say, the higher levels in terms of, you know, correcting or self-healing canaries and blue-green deployments and so on? That's a great question. I would say currently, out of the customers and the people that we meet regularly, probably 60%, if not a little bit higher,
Starting point is 00:22:01 are only looking at infrastructure remediation. They're looking at how to physically restart an entity, whether that's something as destructive as restarting a switch that's locked up, which we've all seen in our careers, whether it's responding to a denial of service attack and just hard blocking ports. There's a number of aspects. We see very few customers getting to the point where they're looking at doing auto failover for customer affecting traffic. So whether it's something like a drop in conversions, whether it's a drop in, you know, particularly
Starting point is 00:22:47 user activity, we see very few customers thinking about that type of problem. You know, the people that are looking at that are, you know, some of the huge entities out there that are pure e-commerce; that's where they spend their time, that's where they're looking at how to recover, you know, dropped shopping carts and that type of traffic. Whereas in reality, 60%, maybe a little bit more, are just focusing on some of the traditional problems that people are having, because we find a lot of customers that haven't configured their Java heap size correctly and they run out of memory, or they're not managing some of the disk infrastructure the way they should be. They haven't got alerting and
Starting point is 00:23:31 monitoring and they all of a sudden run out of disk and it surprises everybody at two o'clock on a Sunday morning. Then you've got this strange gap in the middle. And the strange gap in the middle are these strange combinations where it's a service-led problem, but it's application-related. And so it's not so much implementation that was wrong. Some of it can be to do with tuning. Maybe the tuning metrics weren't quite discovered the way they need to be. And, you know, maybe there's certain configurations that could be updated to make the application more reliable. But invariably, what we see is the gap in between
Starting point is 00:24:11 is people looking at putting auto remediation in place purely to accommodate the fact that their application isn't as reliable as they thought it would be. And that can be things like slow database queries. We've all seen those. And instead of building a database architecture that can scale and can accommodate, or even simple things like distributing database traffic between read and write endpoints, we see a huge number of customers that still do not use read and write endpoints for their database clusters. And, you know, that has a knock-on effect. And so once you're through the infrastructure layer, and we have to fix the infrastructure layer to kind of peel the onion
Starting point is 00:24:52 to get back to the sort of root cause, which is application-related issues, but invariably they fall into two brackets: it's tuning-related activities or it's bad code. And, you know, the ones we struggle with the most and that are the most costly to fix, as everybody knows, are related to the application code fixes that are causing havoc a lot of the time inside of a platform and people don't realize it. But it's a process we have to go through. We can't jump straight to the application-related issues. We have to fix all the layers in between and we get closer and closer to the center of the issue over time.
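(The read/write endpoint split John mentions a moment earlier is one of those simple patterns that is often skipped. A minimal sketch of the routing decision, assuming a managed database cluster that exposes separate writer and reader DNS endpoints; the hostnames are illustrative, and a real application would normally let the driver or ORM own this decision, with replication lag handled for read-after-write flows.)

```python
import random
from dataclasses import dataclass

# Illustrative endpoints: many managed clusters expose one writer endpoint
# and one or more read-replica endpoints.
WRITER_ENDPOINT = "db-cluster.cluster-example.internal"
READER_ENDPOINTS = [
    "db-cluster.cluster-ro-example.internal",
    "db-replica-2.example.internal",
]


@dataclass
class QueryRoute:
    host: str
    read_only: bool


def route_for(sql: str) -> QueryRoute:
    """Send writes (and anything ambiguous) to the writer endpoint,
    and spread pure reads across the reader endpoints."""
    first_word = sql.lstrip().split(None, 1)[0].upper() if sql.strip() else ""
    if first_word in ("SELECT", "SHOW", "EXPLAIN"):
        return QueryRoute(host=random.choice(READER_ENDPOINTS), read_only=True)
    return QueryRoute(host=WRITER_ENDPOINT, read_only=False)


if __name__ == "__main__":
    print(route_for("SELECT * FROM orders WHERE id = 42"))
    print(route_for("UPDATE orders SET status = 'shipped' WHERE id = 42"))
```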
Starting point is 00:25:36 I just wanted to get clarification on a point there. I know you got something, Andy. But it almost sounds like what you're saying, and it's totally believable, but what it sounds like you're saying right there is that one of the side effects of auto-remediation and self-healing is that people are using it to throw more resources at a problem, which then, as we know in the cloud world, means more costs. I mean, not everybody, but it's there, though, right? 100%. I mean, we still have discussions with customers now where, you know, the answer is, well, our auto-remediation strategy is to vertically scale the database cluster.
Starting point is 00:26:16 That's what it means to them, rather than getting to the root cause. And so, you know, you find that, you know, back in the day, when we only had like 4K of memory to write an application, we actually cared about how we wrote applications and how we used resources. Now we've got gigabytes of RAM. And there's a skill that's been lost, which is, you know, keeping software to a nice small footprint that is just what's required to run your application. There's a lot of bloat in a lot of enterprise stacks. There's a lot of libraries that aren't really being used. You know, anyone who comes from an embedded hardware world like I do, you know, you have to strip libraries back to get just the pieces you need so you can just about fit it in the memory available. And so, you know, we get to this realm where absolutely people are using auto-remediation
Starting point is 00:27:06 to purely scale operations, uh, and scale their break-fix, but they actually have no intention of actually going back and fixing the root cause of the issue. Well, I think, I mean, I see these scenarios too. Unfortunately, sometimes it comes down to: it is, quote unquote, cheaper and more effective to throw more hardware money at the problem rather than spending an unknown amount of time with no certainty of really fixing the root cause, which might be an architectural issue. Because when you talk about bad code, and that was my question that I had earlier, bad code, we're not talking about, let's say, one function that has a bad for loop in there. We typically talk about something that is more complex, maybe an architectural issue that is really hard to fix, especially if you may no longer have access to the people that initially came up with the architecture and did the implementation. If it is in a third-party library where you don't really
Starting point is 00:28:15 have a good replacement or you don't know the side effects that a replacement will bring in. So therefore, at least the way I hear it being justified by companies that say it is more calculated risk, even though it costs maybe more money, but they don't know, to throw more money and hardware on a problem rather than taking a step back and trying to fix the actual root cause. Yeah, I think just to expand on that, I think, you know, what we see a lot of the time as well with large enterprises especially is there's always the intention to rewrite the platform. So, the platform never sort of enters this, you know, sort of utopia where it actually
Starting point is 00:29:01 met the goals of the company and everything was worked out ahead of time. And so you end up with this situation that we hear all the time: well, we just need to resolve it for the next two years whilst we're rewriting this particular module or this particular capability, or we're moving away from this platform. And we get caught up in those discussions a number of times. And auto-remediation is often that band-aid to help people sort of limp along long enough while that platform is built. Out of all the customers we've worked with, I would say less than 2% have ever actually rebuilt their application within the timelines that they said they would
Starting point is 00:29:50 and resolve those underlying issues. It tends to be more market-driven factors, rather than platform-related engineering, being raised to a much higher level on a product manager's list. Coming to this, because I'm interested: I'm sure there are at least some examples, hopefully many examples actually, where you worked with your customers and said, here are some really, let's say, straightforward things to implement for auto-remediation, whether it is something as simple as cleaning a log directory that is about to be overfilled, or redirecting traffic because there might be a spike in load coming in. Are there any, let's say, classical auto-remediation steps that you apply more often with your enterprise customers?
Starting point is 00:30:54 And what would they be? Yeah, I think the typical ones are service restarts. We see that all the time. We mentioned, obviously, the infrastructure restarts earlier. You know, that's typically where, you know, the majority of our customers start off. And then you get to things a little bit more interesting. So in the enterprise world, very quickly, we end up in the world of auto-remediating security. And this is something that kind of grows a little bit quicker and suddenly
Starting point is 00:31:25 ends up on our priority list, where once they understand that we can detect and we have these metrics that we're gathering, then it's very easy for us to dynamically make changes to a web application firewall, or add rules to a security group, or make changes to a firewall, or add a route to some black hole for specific types of traffic. So whilst auto-remediation is kind of in that typical IT world of, this service has stopped, or this service is non-compliant anymore, let us restart that service, or we're running out of disk, let's clear out the log directories, or we find a user that's approaching
Starting point is 00:32:15 their quota or space in their home directory, or some of those typical sorts of IT scenarios, where we find it's starting to make the biggest impact in the enterprise is around the security side of things. So tying into high traffic from unknown sources or countries that you know you don't have any customers in, all of those things are very easy to detect and very easy for us to make automatic changes to infrastructure for, as well as making changes to applications and web farms and various other configurations to block some of that content. Pretty cool. Yeah, I mean, obviously security, we're all aware that security is and will be one of the top issues we as an industry have to face anyway. And if we can, you know, kind of solve two problems with one action, because in your case, would you say if you block certain traffic that is not supposed to be there or is potentially harmful, then you're solving two issues. First of all, you block access to these people that should not access your site,
Starting point is 00:33:27 and you're keeping traffic away that would potentially impact performance of your other users. So I think that's – Absolutely. Absolutely. And also it keeps your logs cleaner so you can actually find a real issue. I mean, one of the big problems we find is it's great to have logs. It's great to gather
Starting point is 00:33:45 all these metrics. The reality is, is no one knows why they're gathering all these metrics. Nobody knows why they're gathering all these logs. We have customers that literally generate terabytes of logs a day. And the problem is, is they believe that they've got a great operational function. They've got a great operational capability. And the reality is, it's this false sense of security. You're kind of like a police force, where a police force tends not to be funded to fight crime proactively. Their job is to discourage crime by investigating things that have occurred and, you know, punish those individuals. And so if you look at auto remediation, you're trying to be this law enforcement agency, you know, that is predicting as well as reacting as quickly as you possibly can,
Starting point is 00:34:39 rather than waiting 90 minutes for a police car to turn up. You know, so you end up in this situation where, you know, we see lots of people whose logs are just absolute garbage. They're just absolute noise. There's zero reason to even keep these logs. And a large challenge we have is trying to get customers to the point where they log the things that matter to them and start off with their customer experience. We find a lot of people will log and generate metrics because they want to know. But in reality, it's not actually helping your customers have a better experience. So, you know, in this generation especially, everything needs to be geared around that end user expectation, that end user experience.
Starting point is 00:35:30 And so, you know, when we work with customers, yes, we want to filter out the bad noise and we do that by, you know, you know, changing firewall rules and changing application configs and making things, making sure applications start and stop cleanly and don't generate all these excess logs. But starting off with what is going to affect your customers? Because a lot of the logging tends to be around the middle tiers of your application, your backends. And invariably, that doesn't necessarily have that much of an effect with
Starting point is 00:36:06 your end user experience from day one, but we tend to focus all of our logging and monitoring there, versus, you know, what is a customer seeing from a browser, what is a customer experiencing on our application. And when we get our customers to become much more mindful about what their customers are experiencing and how their infrastructure and architecture can affect the end user's experience, that also helps them generate enthusiasm and excitement around the product team, because they might end up with a list of technical things to fix, and the ones they really need to fix first are the ones that affect your customers. Um, and so, you know, as we try and filter out some of the noise, um, it helps us pinpoint, you know, those types of issues, and it also makes auto-remediation a lot easier, um, because we have fewer things to index, fewer things to try and match against.
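(Tying back to the security remediation John described a few minutes earlier, blocking traffic from sources you have no business with, here is a rough sketch of the blocking half using boto3 to add a deny rule to an AWS network ACL. The ACL ID, rule numbers, and the decision of what counts as hostile are all assumptions for illustration; a WAF IP set, security group change, or firewall appliance could play the same role.)

```python
import boto3  # assumes AWS credentials and region are configured in the environment

ec2 = boto3.client("ec2")

NETWORK_ACL_ID = "acl-0123456789abcdef0"  # hypothetical ACL protecting the subnet
BASE_RULE_NUMBER = 100                    # NACL rules are evaluated lowest-first


def block_cidr(cidr: str, rule_number: int) -> None:
    """Add an inbound deny rule for a CIDR range that the monitoring side
    has flagged as hostile (e.g. a flood from a country with no customers)."""
    ec2.create_network_acl_entry(
        NetworkAclId=NETWORK_ACL_ID,
        RuleNumber=rule_number,
        Protocol="-1",        # all protocols
        RuleAction="deny",
        Egress=False,         # inbound traffic
        CidrBlock=cidr,
    )


if __name__ == "__main__":
    # The interesting work is upstream of this call: deciding, from metrics
    # and logs, which sources are worth blocking automatically.
    suspicious = ["203.0.113.0/24"]  # documentation range, purely illustrative
    for offset, cidr in enumerate(suspicious):
        block_cidr(cidr, BASE_RULE_NUMBER + offset)
```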
Starting point is 00:37:06 Wow. I just wanted to give, hey, Jarvis, you're still there with us? Of course. Yes. We've been talking a lot of things here. I wanted to give you an opportunity to weigh in on any of this. Sure. Especially because you mentioned earlier your background in the restaurant. How does that all tie in and what are you seeing in your real world experiences? that's kind of key. So one of the things that I've noticed as somebody who's sort of new to the industry and comes from a very particular background, most of the monitoring and logging and tracking and solutions like that tend to come from the ops team. Those are the guys who are driving that action. But the platform itself is really dependent and owes its existence to the customer experience.
Starting point is 00:38:09 So that's where you should really be tracking everything. And what I would love to see, and what I try to encourage every time we talk to people, is that the customer success teams need to be a lot more involved with the monitoring, with remediation, with what's being tracked, with what's important to the application. If an application is running at 99% memory, that's not necessarily a problem if your customer experience is locked down and tight. It's really nice and really great to actually talk with folks like you guys that are out there working with enterprises of all sorts of sizes and really help them go through that necessary transformation, right? We on the vendor side are sometimes seen as, well, they are just talking about things, and they talk about self-healing and auto-remediation and full-stack monitoring,
Starting point is 00:39:07 but they just want to sell us a product. So that's why it's great to have companies like you. And I know HATech is also a partner of Dynatrace that actually then not only implements a solution, as you said, but really helps them transform. Now, to sum it up, because this is the way we typically do it in the end: what I took away from this, and there are a couple of things, I took a lot of notes. John, in the beginning you said what's really important is that we are not only thinking about self-healing and auto-remediation as a band-aid in production to keep the lights on. But really, we need to shift left. And I believe the way I wrote it down is you said you have to enable QA
Starting point is 00:39:49 to become real quality gates versus just testers. But I think this is something that's also dear to our heart. It is about the proactive enforcement of quality gates, of looking at things early on, so that you're not just forced to spend and throw more money and resources on a problem in production. We need to be more proactive. And I also liked your analogy with law enforcement, that we should become more proactive and more like Minority Report, like kind of stopping things before they actually happen. And also, thanks for explaining a little bit about some of the top issues that you see out there when it comes to wrong configuration of a Java heap, running out of disk space, things like that. So it seems there are some, quote unquote, low-hanging fruit that everybody can probably
Starting point is 00:40:42 do. But ultimately, I encourage everyone, every enterprise out there, to reach out to experts like yourselves and let themselves be helped by these experts, by people like you, to, first of all, show them what's wrong right now and how they can improve what's ultimately the most important thing, which is end user experience, right? So running your business so that your users are happy, so that you build reliable systems that can heal themselves. So it's a win-win situation for your business
Starting point is 00:41:20 and for your users. I just want to thank you. Any last words from you all based on what Andy said or any last thoughts or ideas that you wanted to get out there really quickly? Yeah, just real quickly. I think one of the key aspects here is people start off with metrics and monitoring and logging. And typically, they're looking at a single metric to drive a workflow or a behavior. Eventually what happens though is once you start knocking off those low hanging fruits, you're not going to be able to do that
Starting point is 00:41:50 through a single metric. And there are a number of ways to do that out there. But really, once you get to the point of understanding it's a combination of things that might be happening in your platform that are indicative of a particular problem, you can no longer do that manually. And this is where you need things such as artificial intelligence and machine learning to drive that mechanism. It's impossible to build your own complex event processing engine internally, because you end up spending all of your time now trying to manage the complex relationships across your platform. And so when you get to the point where it's no longer a single metric driving a single action, and it's a combination, the only way to now achieve that is to go down an AI or a machine learning route. Yeah, I wonder
Starting point is 00:42:42 if there are any tools out there that help with that. That would be amazing. If there was one, it'd be really handy, wouldn't it? Yeah. Well, I wanted to thank you all for coming on today and spending some time with us.
Starting point is 00:42:54 Really appreciate having you on here. Are either of you on Twitter or anything, any social media you want to promote for anybody to follow you on? That's where we're at. Okay. Are we doing anything specific for this, do you know? We'll put some links in the show description.
Starting point is 00:43:11 If you have any questions, comments, or anything regarding this episode or any other episode or topics, you can tweet to pure underscore DT or you can send an old-fashioned email to pureperformance at dynatrace.com. And I want to give a shout-out to Alois Reitbauer, who was on our Keptn episode a little while ago,
Starting point is 00:43:29 because when I said send an old-fashioned email, he sent me an email with a picture of an old-fashioned drink. So kudos to you, Alois. Thanks, everybody, for being on today, and we hope you had a good time. Thanks. Thanks. Thanks.
