PurePerformance - Perform 2020 xMatters Automating Remediation Actions with Travis Depuy

Episode Date: February 6, 2020

Travis Depuy of xMatters speaks to Leandro & Brian about how to leverage tools like xMatters to enable proactive alerting and remediation throughout your code life-cycle...

Transcript
Starting point is 00:00:00 Coming to you from Dynatrace Perform in Las Vegas, it's Pure Performance! Hi everybody and welcome to another episode of the jointly produced PerfBytes and Pure Performance live from, or not live, but from coming to you from Dynatrace Perform in Las Vegas. Leandro Melendez is my co-host today from PerfBytes Española. Hola, Leandro. Hello, hola everybody. It's such a pleasure to be here. How are you doing today? Pretty good. Enjoying and very excited about all that is happening, Perform 2020. Seriously excited. Great. Me too.
Starting point is 00:00:54 So our guest now is from xMatters, and this is their product evangelist, Travis Depuy. Depuy. Yeah. Excellent. Depuy. Depuy. It was very, very close. Travis, welcome to PerfBytes and Pure Performance. How are you doing today? Yeah, pretty good. Thanks for having me on. I'm stoked. Great. Can you, before we start getting into any of this stuff, do you
Starting point is 00:01:15 want to just give a little brief background of who you are and why you're here? Yeah, so I'm a product evangelist with xMatters, and I'm a man of many hats. So, as you can see in your LinkedIn profile. Yes, exactly. Yeah. So I do a lot of, you know, blog posts and talks here and there. And then I also kind of build, you know, integrations that we want to explore that customers kind of request. So I get to get my hands dirty in some code, but then I also do some of this evangelizing kind of stuff too. Great. So for people who are maybe not familiar with xMatters,
Starting point is 00:01:59 in a nutshell, what is it that xMatters does and how would organizations use xMatters? Sure. Yeah, we help keep digital services running. You know, when stuff blows up, we help you guys keep it up and running. We do that by tracking people down using on-call schedules and notification devices. But really, you know, the notification, it's not the first thing that happens. You know, that's really, sorry, it's not the last thing that happens.
Starting point is 00:02:24 That's really the first thing that happens. So we help you, you know, build out workflows that allow you to either enrich the notifications or provide workflow options so that you can, you know, drive stuff forward. So you can actually start the troubleshooting process before you even, you know, get off your phone. Great, great. And as people may or may not know, there is an integration between xMatters and Dynatrace, so you can feed the problems into xMatters. And if I'm not mistaken, some of the integrations you've been working with, can you explain? I guess that falls into the category of helping to automate the pipeline, helping to feed data from one tool to another. One thing we've seen with a lot of vendors of all different shapes and sizes is the idea of using these API integrations so that you can, without necessarily even having to go into some of the tools for some of the features, the APIs can
Starting point is 00:03:25 communicate with each other and take care of things and then bring you a result or something else. What is it that you all have done on the Dynatrace side with that, or with Dynatrace, how does that integration work? Yeah, so we have a couple of different, you know, places. So once Dynatrace, you know, figures out that something's wrong, it can reach out to a person; we're really good at tracking people down and letting them make a decision about something. But even before that point, we have integration tools; we call it the Flow Designer. And on this nice canvas, you can build out a nice flow so that, once Dynatrace reaches out to xMatters, we can then run other API calls to either enrich the data or maybe go run some kind of job, either in Dynatrace or some other tool, to actually perform a self-healing action before we even engage a person. But then, you know, maybe that self-healing action didn't actually fix the problem. So then it's like, okay, now we need to get, you know, one client called it hands on keyboard, you know,
Starting point is 00:04:33 so they would do all this kind of front work and then they'd say, okay, well that did not work automatically. So let's, you know, get hands-on keyboard to actually get somebody to pay attention to this. When you mentioned that it specializes in notification and reaching out to people, what are the capabilities of reaching people or what are the channels?
Starting point is 00:04:53 How do you manage to do this? Like, you call their wives to say, hey, you're being requested? Yes, we're working on the cat droppings. I'm sorry. The drone delivery by cat. No, but we do push notifications, voice calls, SMS, and what am I missing? There are still some pagers out there.
Starting point is 00:05:20 We also had a client. I don't think we actually have them anymore, but they requested a fax feature. Oh, wow. Oh, cool. So we actually had to build a fax system. Not so cool, more retro than cool. Yeah, retro cool. Wow.
Starting point is 00:05:35 And you could have the fax drop the fax sheet on a scanner and then automatically email it to the person. Yeah. That'd be great. Yeah, we're building out a whole IoT department for that. No, I'm just kidding, that's not true. So one thing you mentioned that was interesting, because we've seen this, you know, and anytime I say us, I'm referring to Andy and myself, because we've done, you know, a lot of podcasts with some of these pieces on the self-healing side. And one thing we see, and you mentioned a really good point, is when it comes to self-healing, a lot of people think that, well, you just automate all the self-healing and then you don't have to touch anything anymore, right?
Starting point is 00:06:11 But the reality is, conceptually, at least as far as I understand, and correct me if I'm wrong, that with self-healing, you're going to have, depending on the criteria that come in, right? So again, in this instance, Dynatrace sends over some data about the problem that's going on. xMatters is going to know, okay, based on this, we're going to tell maybe Ansible Tower or something else to run this playbook, you know, which might have been automated and set up. And then go back and check to make sure
Starting point is 00:06:38 the problem is actually resolved based on the self-healing factor. Now, often there might be several different steps and several different things you can try to do. Like maybe try a reboot. If the reboot doesn't work, maybe you try doing something else. If not, if there was a deployment, maybe at that point you roll back, whatever those steps might be, but there sometimes can come a case when nothing works. And that's when you have to reach out to an actual human being. So you can automate
Starting point is 00:07:03 so much of it until you hit a point where the self-healing didn't heal, right? And that's, I guess, what you mentioned as the last part of the chain: you'll facilitate all those handoffs between the different tools and help do these things, but then if none of those work, it is going to go ahead and reach out and notify, saying, hey, you really need to look at this because we can't fix it automatically. Did I get that right, or is that completely wrong? Yeah, let me just expand a little bit. I mean, first of all, I would really be interested to find a company that has completely solved this self-healing problem and they don't have to reach out to people.
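The escalation chain described here (try an automated step, check whether the problem is resolved, try the next step, and only then reach a person) can be sketched roughly as follows. This is a minimal illustration, not xMatters, Dynatrace, or Ansible code: `RemediationStep`, `remediate`, `is_healthy`, and `notify_human` are hypothetical names, and in practice the callbacks would presumably wrap a playbook run, a monitoring check, and a notification.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RemediationStep:
    """One automated attempt, e.g. a reboot or a rollback (hypothetical)."""
    name: str
    run: Callable[[], None]  # would trigger the playbook or job


def remediate(steps: List[RemediationStep],
              is_healthy: Callable[[], bool],
              notify_human: Callable[[str, List[str]], None]) -> Optional[str]:
    """Try each step in order.

    Returns the name of the step that fixed the problem, or None after
    escalating to a person.
    """
    attempted: List[str] = []
    for step in steps:
        step.run()
        attempted.append(step.name)
        if is_healthy():  # go back and check the problem is resolved
            return step.name
    # Nothing worked: hand off to a human, with the attempts as context.
    notify_human("Automatic remediation failed", attempted)
    return None


if __name__ == "__main__":
    state = {"healthy": False}
    pages = []
    steps = [
        RemediationStep("reboot", lambda: None),  # has no effect in this demo
        RemediationStep("rollback", lambda: state.update(healthy=True)),
    ]
    fixed_by = remediate(steps,
                         lambda: state["healthy"],
                         lambda msg, tried: pages.append((msg, tried)))
    print(fixed_by, pages)
```

Passing the list of attempted steps to the human mirrors the point made in the conversation: the person who gets paged should already know what was tried automatically.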
Starting point is 00:07:35 I think that's why people don't like calling it NoOps, because there's always ops, right? But I also might be concerned that they've actually built Skynet and we should find the package. But yeah, but then the other part is that, you know, once you reach a human and, you know, we've done kind of this pre-work to say, okay, well, this was attempted, reboot the server, you know, maybe roll back or something, but we're still having problems. So let's reach out to a person. But then that person usually is, again, rarely in a silo, you know, they're rarely by themselves. So they are going to, you know, maybe if it's bad enough, they might flip the Sev 1 switch. So then, you know, we provide them, in the notifications,
Starting point is 00:08:15 an option to start the Sev 1 process. So open up a Slack channel and invite people into the Slack channel, open up, you know, a Jira ticket or some other ticket that needs to be opened to track that and start that process. Like I said before, people will kind of get off their phone because there's a lot that happens right after that. The notification is not the end of it. Yeah, so that's cool. So what you're saying then is that instead of them having to log in, create a Slack channel, invite everybody, that can all be kind of automated as well through.
Starting point is 00:08:47 And that button will be embedded in the notification and then they push out. That's really cool. If it's got an API a chat or join it. Do they have a way to trigger these actions that you generate through Flow Designer? And what actions can they create? Yeah, so they pre-configure the actions. We're working on kind of making them a little more dynamic, but currently they're pre-configured at configuration time.
Starting point is 00:09:26 And then they can select that from any of the notification methods that we've used to reach out to them. So push notifications, voice, email, SMS, all have the same kind of response set so that however xMatters gets to you, you can initiate the same actions. So it is a little bit like, if you want to, I don't know, clean up the temp files on your disk, dial one or reply with zero. Send the fax
Starting point is 00:09:54 back with, uh, with a hand. Yeah, exactly. And from what other devices? I mean, you mentioned an SMS or a voice call or different interactions. Do you have something like an app or something where the users receive these notifications? Yeah, so we have a pretty extensive app that receives the push notifications. You can also initiate kind of custom or ad hoc notifications from there, as well as see your on-call settings. So you can be like, oh, yeah, no, I'm not on call for the next two days. Sweet. I'm going to Vegas. Cool.
Starting point is 00:10:33 And also from these apps, you can trigger these actions, right? If you are the person and you know what to initiate from the app, you can call it. Yeah, exactly. Do you have extensions to other smart devices? No, not that I can think of right now. Are you talking like Alexa, like, hey Alexa, initiate? Yeah, Alexa, smart watches, or, I don't know, I've seen that even the smart fridges can get some of that. You know, so you could have your smart fridge initiate. We did have, we have played around
Starting point is 00:11:06 with the smartwatch, Apple smartwatch specifically. I don't think we actually productized it, but it's floating around somewhere. Oh, that's really cool. So you can also see the notifications
Starting point is 00:11:16 and trigger some of the actions. Your eggs are rotting and your server's going down. Alexa, please reboot the server. Yeah, but we expose an endpoint so that we can then receive everything in the kitchen sink. So it wouldn't be too hard to build out. Yeah, and I think that's a common thing with a lot of modern tools these days, right? You have these API endpoints that
Starting point is 00:11:44 it might not necessarily be something the tool manufacturer, or I guess software, whatever you want to call it, has built specifically, but the endpoint allows those to be done. And that's where I think a lot of the creativity of what we're seeing in
Starting point is 00:11:59 the interconnection of these tools is going, where people are like, hey, I can do this, and then they build it and put it out there. So it's just because, and this is something we see all the time too, right? Just because it's not officially in the product doesn't mean it can't be done. Right. And then a lot of people are out there doing these things. So it's really cool to see what people are doing with the tools today. What's the state of all this that you see from, you know,
Starting point is 00:12:24 you know, obviously the customers that you're going to are probably starting on that journey, but in the big picture, right? We've seen even an automated pipeline, before getting to remediation, before getting to these other pieces, just even getting an automated pipeline, there's still a lot of catch-up people are playing there. Where do you see the state of taking that, you know, to the step where you all fit in, and what's on the horizon
Starting point is 00:12:53 for that and, you know, what's the outlook like for that whole setup? Yeah, so, you know, I think organizations, once they get to a certain, you know, size, at that size it's hard to keep kind of a standard tech stack or even processes, right? So having these tools that allow for maximum flexibility, a lot of the pipeline tools are just hooking into other places as well. So I think that kind of concept is kind of reaching out to other places.
Starting point is 00:13:25 And so having tools that can have the flexibility to talk to all the other tools that the organization is working in is going to be important. You know, I mean, for example, kind of going back to the Sev 1 incident kind of thing, you know, your dev team is working in Jira, but they need to have a ticket created for the CSMs or, you know, for the support people to be able to have some kind of visibility into that incident, because they're affected, you know, by that, but they wouldn't necessarily be responsible for it. Right. So, yeah. So I think tools that can have flexibility and talk to those other tools and then support a variety of different processes and technologies
Starting point is 00:14:06 is definitely where we're going to see a lot of work in the future and soon. In terms of the, you kind of described some prerequisites, certain tooling and all this, but when people are thinking, all right, now, I'm going to use the self-healing part as an example. I know that's not all that you all do, but it's just a piece that interests me. But if someone's going to say, I want to head towards self-healing, right, obviously they want to first get a pretty solid pipeline and everything. But let's say they are like, well, we don't know all of our playbooks, we don't know all the conditions to do everything. I would imagine some people might feel
Starting point is 00:14:44 overwhelmed and say, we have to figure all that out before we take the next step. But I would imagine you might counter smartly, just like the whole process of going to CI/CD in the first place, that you don't have to have everything figured out first. If you have one workflow figured out, start there. And then you can build as you learn from that, as you make sure you have the proper inputs for everything. It doesn't have to be all at once.
Starting point is 00:15:11 You don't have to get all your playbooks in line. You don't have to make sure all your metrics and everything is set up. Just get one solid case and then you can start on that journey. Am I correct in that? Is that what you recommend to people or is it just the exact opposite? No, I definitely would recommend that. Actually, I've been buried in Google's SRE handbook. Technically, it's called the Site Reliability
Starting point is 00:15:36 Engineering, but I call it the SRE handbook because it sounds better. But they really stress the importance of postmortems. So every incident of a certain, you know, priority or severity, you know, you will have a postmortem and then there should be artifacts taken away from there. So if you're not fixing a process or fixing, you know, something, you're at least taking actions that can be translated into a run book. And yeah, certainly in the beginning, no, you're not going to have anything right. But after, you know, a year or two years, now you're going to start to compile a list of runbooks, you know, that can be helpful in those situations. actions, how they are designed or what information gets stored so that in the future, when you
Starting point is 00:16:27 experience an incident or something that you have already experienced and you already applied some solution, you can quickly and easily, let's say, not only the system, but the organization starts to learn how to create these flows. Or is it, how am I imagining that this works? Kind of. We, you know, we are really good at tracking people down and giving them a decision point on what to do next. So we would rely on, you know, the other tools like Ansible that would actually do the runbook. So we would deliver the information so that a person could then say, run, you know, run book alpha or beta or gamma. And then we can go talk to Ansible and say,
Starting point is 00:17:08 okay, execute, you know, run book alpha. So we, you know, it's like a best in breed, right? We're not going to rebuild Ansible. We're not in the business of putting Ansible out of business or Red Hat, I guess. But being able to insert a human into a process is still important. As much as we talk about automation,
Starting point is 00:17:29 having people make decisions and move process forward is still critical. And for what you mentioned, you are leaving like breadcrumbs or some sort of log of what were the solutions applied and what was the situation so that this can be built up in the future. Yeah.
Starting point is 00:17:45 You know, we've even done some use cases where we pre-populate or pre-configure a postmortem, you know, so we can grab, you know, maybe some scribe updates from a Slack channel, and then dump those into, you know, maybe a Confluence document. Actually, I think in the particular use case I was thinking of, we did it in ServiceNow.
Starting point is 00:18:05 So the incident started in ServiceNow, created a Slack channel. We pulled just the Scribe updates into the incident so that then when somebody's going through trying to do a postmortem, they have all the information of when the incident started, who was notified, who started the conference bridge, and then the Scribe updates as well.
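Pulling only the scribe updates out of a chat channel and combining them with the incident start time and the people notified might look something like the sketch below. Everything here is an assumption for illustration: the `ChatMessage` shape, the `scribe:` prefix used to mark scribe updates, and the plain-text output format are invented, and no Slack or ServiceNow API is called.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class ChatMessage:
    """A message pulled from the incident chat channel (hypothetical shape)."""
    timestamp: datetime
    author: str
    text: str


SCRIBE_PREFIX = "scribe:"  # assumed convention for marking scribe updates


def scribe_updates(messages: List[ChatMessage]) -> List[ChatMessage]:
    """Keep only the messages marked as scribe updates, oldest first."""
    marked = [m for m in messages if m.text.lower().startswith(SCRIBE_PREFIX)]
    return sorted(marked, key=lambda m: m.timestamp)


def postmortem_timeline(started: datetime, notified: List[str],
                        messages: List[ChatMessage]) -> str:
    """Render a plain-text timeline to attach to the incident record."""
    lines = [f"{started:%H:%M} incident started",
             f"notified: {', '.join(notified)}"]
    for m in scribe_updates(messages):
        note = m.text[len(SCRIBE_PREFIX):].strip()
        lines.append(f"{m.timestamp:%H:%M} {m.author}: {note}")
    return "\n".join(lines)
```

Filtering on an explicit marker keeps general chatter out of the record, which matches the idea of pulling "just the scribe updates" into the incident.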
Starting point is 00:18:23 Cool. So everybody, or the key people, get involved quickly, get the information as soon as possible, and get ready to start acting on it. Yeah, exactly. And, if I might, from these features that the solution has and that you have applied with your customers, which ones would you say have been the ones that your customers praise the most or like the most? I mean, we all have our favorites. Yeah, the enrichment piece, I think, has been really, you know, it's always fun to go to these events and talk about, you know, this enrichment process where you have the monitoring arm that has found an issue,
Starting point is 00:19:05 but then pulling in information from other sources. You know, in the past, you'd talk to some large IT organizations where it was like the CMDB, being able to map a particular business service to a CMDB. You know, now it's kind of more like, okay, this is the microservice that we're dealing with. Okay, where does it sit in the service mesh? So enriching, because nothing's in a vacuum, so it's all part of an ecosystem, but which part of the ecosystem is it affecting? So enriching that kind of information
Starting point is 00:19:37 and then delivering that to a person is invaluable because they can quickly make a decision. Yeah, I'm sure that they are all very happy not to be thrown at a defect, a problem, or an emergency blindfolded and like, whoa, whoa, what's happening? Yeah, exactly. I guess what you're saying is, you get all the answers almost right away to start acting.
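The enrichment idea discussed here, taking the alert from the monitoring tool and attaching ownership and dependency context before a person sees it, can be illustrated with a small sketch. The `SERVICE_CATALOG` table is a made-up stand-in for a CMDB or service-mesh lookup, and the field names (`service`, `owner_team`, `depends_on`) are assumptions rather than any real product's schema.

```python
from typing import Any, Dict

# Hypothetical lookup table standing in for a CMDB or service-mesh query.
SERVICE_CATALOG: Dict[str, Dict[str, Any]] = {
    "checkout-api": {"owner_team": "payments", "depends_on": ["cart", "billing"]},
}


def enrich_alert(alert: Dict[str, Any],
                 catalog: Dict[str, Dict[str, Any]] = SERVICE_CATALOG) -> Dict[str, Any]:
    """Return a copy of the alert with ownership and dependency context added."""
    context = catalog.get(alert.get("service", ""), {})
    enriched = dict(alert)  # leave the original alert untouched
    enriched["owner_team"] = context.get("owner_team", "unknown")
    enriched["depends_on"] = list(context.get("depends_on", []))
    return enriched


if __name__ == "__main__":
    print(enrich_alert({"service": "checkout-api", "problem": "high error rate"}))
```

Falling back to `"unknown"` rather than failing means an alert for an uncatalogued service still reaches someone, just with less context.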
Starting point is 00:19:56 Yep. Sounds pretty awesome. Yeah. Yeah. We're pretty stoked about it. I guess, you know, one more question I wanted to ask is, you know, what do you see on the horizon for this next year? You know, not to put you like in a predictive spot, but based on trends from last year and everything, or maybe some projects or new ideas you're trying to work on,
Starting point is 00:20:20 what are you hoping to see accomplished or kind of make some waves in the upcoming year? I think, you know, kind of taking this idea that we've fleshed out pretty well in the, you know, incident management space and bringing it into, you know, the security space. You know, actually I was talking to our information assurance team this week, and we built out a nice little flow so that when they get notified from their incident application, you know, then they can start this process of, okay, let's start the collaboration process and, you know, get people chatting. And I think kind of just taking the same concept, you know, because it's the same kind of teams that are kind of dealing with: something happened, I need to get some more information, I need to get people together, and then we need to close it. Like, it's kind of the same process that a lot of different teams go through. So I think it'll be interesting to explore
Starting point is 00:21:11 you know, all these other ones. Actually, we just had a blog post that was for WIRES, the animal rescue group in Australia, and they've been using us extensively. You know, people in Australia can call them up to say, hey, I found an animal on the side of the road, you know, and then using xMatters they can dispatch people out to go find that animal and then bring it back and, you know, get it the medical help or help that it needs. Wow. Yeah. Yeah. Pretty cool. You know, it's funny that you mentioned the security piece. Because, you know, I started working in performance in around 2000, right?
Starting point is 00:21:52 And for the longest time you couldn't get people to pay attention to performance. And it seems like it has a pretty good spotlight on it now. You know, it still could always be better, right? Just like anything. But then whenever anyone brings up security, I almost feel bad for security because they're like in that spot
Starting point is 00:22:12 performance was years ago where everyone's like, yeah, yeah, security, security, security. And it's biting everybody. And everybody knows they have to pay attention to it. Everyone knows that more has to be done, but it's slow going. So hopefully, yeah, with what you're working on
Starting point is 00:22:29 and in general, hopefully we'll be seeing a lot more about security in the future because that's getting really scary these days. And expensive too. I mean, the other thing is that, you know, with either a breach or, you know, privacy issues, there are some serious fees associated with some of that stuff. Yeah. I think also on that line, with what you have as a mechanism for detecting those security holes, breaches, or whatever can be detected, you can, in the same way as you do with technical issues right now, start pinging, start alerting, as usually those security issues go unnoticed for a while until
Starting point is 00:23:10 it's too late and, as you say, that's where it gets super expensive and everybody gets sad. Well, and then also the auditing of all that, you know, who knew what, you know, when. Yeah. You know, we can, you know, provide some of that context of, okay, well, this person was alerted on these devices. And for tracing issues, right? Yep. Yeah. Okay, Travis, is there anything else you wanted to get in before we wrap up? I don't think, just last thing: February 19th, just having a webinar with Andy Grabner, so check that out. Right, so we should probably find a link to that either on the Dynatrace website or the xMatters one. All right. Yeah, no, I'll be excited to see that
Starting point is 00:23:52 webinar. So that's just a couple weeks after Perform, everybody, so make sure you check that out. There's also, I think, a three-part blog on the integration with xMatters on the Dynatrace blog site. And there's also an xMatters, what would you call it, e-book, right? A short e-book talking about some of this flow and the connection between your monitoring tools and the xMatters tools and how that all works. So be sure to check those out as well.
Starting point is 00:24:20 Travis, thanks so much for stopping by today. Great. Yep. Thanks, Brian. Thank you very much, Travis. All right. Talk to you soon. Thanks. All right. Bye-bye.
