PurePerformance - Perform 2020 xMatters Automating Remediation Actions with Travis Depuy
Episode Date: February 6, 2020
Travis Depuy of xMatters speaks to Leandro & Brian about how to leverage tools like xMatters to enable proactive alerting and remediation throughout your code life-cycle...
Transcript
Coming to you from Dynatrace Perform in Las Vegas, it's Pure Performance!
Hi everybody and welcome to another episode of the jointly produced PerfBytes and Pure Performance live from, or not live, but from coming to you from Dynatrace Perform in Las Vegas.
Leandro Melendez is my co-host today from PerfBytes Española. Hola, Leandro.
Hello, hola everybody. It's such a pleasure to be here.
How are you doing today?
Pretty good. Enjoying and very excited about all that is happening at Perform 2020.
Seriously excited.
Great.
Me too.
So our guest now is from xMatters, and this is their product evangelist, Travis Depuy.
Depuy.
Yeah.
Excellent.
Depuy.
Depuy.
It was very, very close. Travis, welcome to PerfBytes and Pure Performance. How are you doing today?
Yeah, pretty good. Thanks for having me on. I'm stoked.
Great. Before we start getting into any of this stuff, do you want to just give a brief background of who you are and why you're here?
Yeah, so I'm a product evangelist with xMatters, and I'm a man of many hats.
As you can see in your LinkedIn profile.
Yes, exactly. So I do a lot of, you know, blog posts and talks here and there. And then I also build integrations that we want to explore or that customers request. So I get to get my hands dirty in some code, but then I also do some of this evangelizing kind of stuff too.
Great.
So for people who are maybe not familiar with xMatters, in a nutshell, what is it that xMatters does, and how would organizations use it?
Sure.
Yeah, we help keep digital services running.
You know, when stuff blows up, we help you guys keep it up and running.
We do that by tracking people down using on-call schedules and notification devices.
But really, you know, the notification is not the first thing that happens.
Or, sorry, it's not the last thing that happens. That's really the first thing that happens.
So we help you, you know, build out workflows that allow you to either enrich the notifications or provide workflow options so that you can, you know, drive stuff forward.
So you can actually start the troubleshooting process before you even, you know, get off your phone.
Great, great. And as people may or may not know, there is an integration between xMatters and Dynatrace, so you can feed the problems into xMatters. And if I'm not mistaken, that's some of the integrations you've been working with. Can you explain?
I guess that falls into the category of helping to automate the pipeline, helping to feed data from one tool to another.
One thing we've seen with a lot of vendors of all different shapes and sizes is the idea of using these API integrations so that, without necessarily even having to go into some of the tools for some of the features, the APIs can communicate with each other, take care of things, and then bring you a result or something else.
What is it that you all have done with Dynatrace? How does that integration work?
Yeah, so we have a couple of different places. Once Dynatrace figures out that something's wrong, it can reach out to a person. You know, we're really good at tracking people down and letting them make a decision about something. But even before that point, we have integration tools; we call it Flow Designer. On this nice canvas, you can build out a flow so that, once Dynatrace reaches out to xMatters, we can run other API calls to either enrich the data or go run some kind of job, either in Dynatrace or some other tool, to actually perform a self-healing action before we even engage a person.
But then, you know, maybe that self-healing action didn't actually fix the problem. One client called it "hands on keyboard." They would do all this kind of front work and then they'd say, okay, well, that did not work automatically, so let's get hands on keyboard, to actually get somebody to pay attention to this.
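As a rough illustration of the flow described here (try automated actions first, engage a person only if they do not fix the problem), a minimal Python sketch might look like the following. Everything in it is an assumption made for illustration: `run_remediation`, `RemediationResult`, and the step names are hypothetical and are not part of xMatters, Dynatrace, or any other product's API. The actions are plain callables standing in for whatever API calls a real flow would make.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RemediationResult:
    resolved: bool
    attempted: List[str] = field(default_factory=list)
    escalated: bool = False


def run_remediation(
    steps: List[Tuple[str, Callable[[], None]]],
    is_healthy: Callable[[], bool],
    notify_on_call: Callable[[List[str]], None],
) -> RemediationResult:
    """Try each automated step in order; engage a human only if none works."""
    attempted: List[str] = []
    for name, action in steps:
        action()                # hypothetical call out to an automation tool
        attempted.append(name)
        if is_healthy():        # re-check the condition that raised the problem
            return RemediationResult(resolved=True, attempted=attempted)
    # Nothing worked: "hands on keyboard", with the attempt history attached.
    notify_on_call(attempted)
    return RemediationResult(resolved=False, attempted=attempted, escalated=True)
```

Passing the list of attempted steps to the notification is a design choice in this sketch: the person who gets engaged can see what was already tried before they start troubleshooting.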
When you mentioned that it specializes in notification
and reaching out to people,
what are the capabilities of reaching people
or what are the channels?
How do you manage to do this?
Like you call their wives to say,
hey, you're being requested?
Yes, we're working on the cat droppings.
I'm sorry.
The drone delivery by cat.
No, but we do push notifications, voice calls, SMS, and what am I missing?
There are still some pagers out there.
We also had a client.
I don't think we actually have them anymore, but they requested a fax feature.
Oh, wow.
Oh, cool.
So we actually had to build a fax system.
Not so cool, more retro than cool.
Yeah, retro cool.
Wow.
And you could have the fax drop the fax sheet on a scanner and then automatically email it to the person.
Yeah.
That'd be great.
Yeah, we're building out a whole IoT department for that.
No, I'm just kidding. That's not true.
So one thing you mentioned that was interesting, because we've seen this, and anytime I say "us," I'm referring to Andy and myself, because we've done a lot of podcasts on some of these pieces on the self-healing side. One thing we see, and you mentioned a really good point, is that when it comes to self-healing, a lot of people think that, well, you just automate all the self-healing and then you don't have to touch anything anymore, right?
But the reality is, conceptually, at least as far as I understand, and correct me if I'm wrong, that with self-healing, you're going to have, depending on the criteria that come in, right? So again, in this instance, Dynatrace sends over some data
about the problem that's going on.
xMatters is going to know,
okay, based on this,
we're going to tell maybe Ansible Tower
or something else to run this playbook,
you know, which might have been automated and set up.
And then go back and check to make sure
the problem is actually resolved
based on the self-healing factor.
Now, often there might be several different steps and several different things you can try. Maybe try a reboot. If the reboot doesn't work, maybe you try something else. If there was a deployment, maybe at that point you roll back, whatever those steps might be. But there sometimes can come a case when nothing works, and that's when you have to reach out to an actual human being. So you can automate so much of it until you hit a point where the self-healing didn't heal, right? And I guess that's what you mentioned as the last part of the chain: you'll facilitate all those handoffs between the different tools and help do these things, but then if none of those work, it is going to go ahead and reach out and notify, saying, hey, you really need to look at this because we can't fix it automatically. Did I get that right, or is that completely wrong?
Yeah, let me just expand a little bit.
I mean, first of all, I would really be interested to find a company that has completely solved
this self-healing problem and they don't have to reach out to people.
I think that's why people don't like calling it no ops because there's always ops, right?
But I also might be concerned that they've actually built Skynet and we should find
the package.
But yeah, the other part is that once you reach a human, we've done kind of this pre-work to say, okay, well, this was attempted, reboot the server, maybe roll back or something, but we're still having problems.
So let's reach out to a person.
But then that person is, again, rarely in a silo. They're rarely by themselves. So if it's bad enough, they might flip the Sev 1 switch. So we provide them, in the notification, an option to start the Sev 1 process: open up a Slack channel and invite people into the Slack channel, open up a Jira ticket or some other ticket that needs to be opened to track that, and start that process. Like I said before, people will kind of get off their phone, because there's a lot that happens right after that. The notification is not the end of it.
Yeah, so that's cool. So what you're saying then is that instead of them having to log in,
create a Slack channel, invite everybody, that can all be kind of automated as well through.
And that button will be embedded in the notification and then they push out.
That's really cool.
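To make the "one button starts the whole Sev 1 process" idea concrete, here is a hedged Python sketch. It assumes three injected helper functions (`create_channel`, `invite`, `create_ticket`) that would wrap a chat tool and a ticketing tool; those helpers, the `Sev1Kickoff` record, and the `sev1-` naming scheme are all hypothetical and do not represent the Slack, Jira, or xMatters APIs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sev1Kickoff:
    channel: str
    ticket: str
    invited: List[str]


def start_sev1(
    incident_id: str,
    summary: str,
    responders: Iterable[str],
    create_channel: Callable[[str], str],
    invite: Callable[[str, str], bool],
    create_ticket: Callable[[str, str], str],
) -> Sev1Kickoff:
    """Run the collaboration steps a responder would otherwise do by hand."""
    # 1. Open a dedicated chat channel for the incident.
    channel = create_channel(f"sev1-{incident_id}")
    # 2. Invite responders, keeping only those whose invite succeeded.
    invited = [user for user in responders if invite(channel, user)]
    # 3. Open a tracking ticket that points back at the channel.
    ticket = create_ticket(f"[SEV1] {summary}", channel)
    return Sev1Kickoff(channel=channel, ticket=ticket, invited=invited)
```

Injecting the helpers keeps the sketch independent of any particular chat or ticketing tool, which matches the point made in the conversation that the same flow can talk to whatever tools a team already uses.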
If it's got an API a chat or join it. Do they have a way to trigger these actions
that you generate through Flow Designer?
And what actions can they create?
Yeah, so they pre-configure the actions.
We're working on kind of making them a little more dynamic,
but currently they're pre-configured at configuration time.
And then they can select that from any of the notification methods
that we've reached out to them.
So push notifications, voice, email, SMS,
all have the same kind of response set
so that however xMatters gets to you,
you can initiate the same actions.
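The point about one pre-configured response set shared by every notification method can be sketched in Python as follows. This is only an illustration under stated assumptions: the `RESPONSES` table, the channel names, and `handle_response` are hypothetical, and the action names are placeholders rather than real xMatters response options.

```python
from typing import Callable, Dict

# Hypothetical response set, fixed at configuration time (not dynamic).
RESPONSES: Dict[str, str] = {
    "1": "acknowledge",
    "2": "run_runbook",
    "3": "start_sev1",
}

# Every delivery channel maps replies through the same table.
CHANNELS = ("push", "voice", "email", "sms")


def handle_response(
    channel: str, reply: str, actions: Dict[str, Callable[[], str]]
) -> str:
    """Resolve a reply from any channel to the same pre-configured action."""
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel}")
    key = reply.strip()
    if key not in RESPONSES:
        raise ValueError(f"unknown response option: {reply!r}")
    return actions[RESPONSES[key]]()
```

Because the channel is validated but never used to pick the action, a keypress on a voice call and a reply to an SMS lead to exactly the same behavior.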
So it is a little bit like, if you want to, I don't know, clean up the temp files on your disk, dial one or reply with zero.
Send the fax back with, uh, with a hand.
Yeah, exactly.
And from what other devices? I mean, you mentioned SMS or a voice call or different interactions.
Do you have something like an app where the users receive these notifications?
Yeah, so we have a pretty extensive app that receives the push notifications.
You can also initiate kind of custom or ad hoc notifications from there,
as well as see your on-call settings.
So you can be like, oh, yeah, no, I'm not on call for the next two days.
Sweet. I'm going to Vegas.
Cool.
And also from these apps, you can trigger these actions, right?
If you are the person and you know what to initiate from the app,
you can call it.
Yeah, exactly.
Do you have extension to other smart devices?
No, not that I can think of right now. Are you talking like Alexa? Like, "Hey Alexa, initiate..."
Yeah, Alexa, smartwatches, or, I don't know, I've seen that even the smart fridges can get some of that, you know, so you could have your smart fridge initiate.
We have played around with the smartwatch, Apple smartwatch specifically.
I don't think we actually productized it,
but it's floating around somewhere.
Oh, that's really cool.
So you can also see
the notifications
and trigger some of the actions.
Your eggs are rotting
and your server's going down.
Alexa, please reboot the server.
Yeah, but we expose an endpoint so that
we can then receive everything and the kitchen sink. So it wouldn't
be too hard to build out. Yeah, and I think that's a
common thing with a lot of modern tools these days, right? You have these API endpoints that
it might not necessarily be
something the
tool manufacturer, or I guess
software, whatever you want to call it, has built
specifically, but the endpoint
allows those to be done.
And that's where I think a lot of the creativity of what
we're seeing in
the interconnection of these tools
is going, where people are like, hey, I can
do this, and then they build it and put it out there. So it's just because,
and this is something we see all the time too, right?
Just because it's not officially in the product doesn't mean it can't be done.
Right. And then a lot of people are out there doing these things.
So it's really cool to see what people are doing with the tools today.
What's the state of all this that you see? Obviously the customers that you're going to are probably starting on that journey, but in the big picture, right? We've seen that even before getting to remediation, before getting to these other pieces, just getting an automated pipeline, there's still a lot of catch-up people are playing there. Where do you see the state of taking that to the step where you all fit in? What's on the horizon, and what's the outlook like for that whole setup?
Yeah, so, you know, I think once organizations get to a certain size, it's hard to keep kind of a standard tech stack
or even processes, right?
So having these tools that allow for maximum flexibility,
a lot of the pipeline tools are just hooking
into other places as well.
So I think that kind of concept is kind of reaching out
to other places.
And so having tools that can have the flexibility to talk to all the other tools that the organization is working in is going to be important.
You know, I mean, for example, going back to the Sev 1 incident kind of thing, your dev team is working in Jira, but they need to have a ticket created for the CSMs or for the support people, to be able to have some kind of visibility into that incident, because they're affected by it, but they may not necessarily be responsible for it, right? So I think tools that have that flexibility, can talk to those other tools, and then support a variety of different processes and technologies are definitely where we're going to see a lot of work in the future, and soon.
You kind of described some prerequisites, certain tooling and all this. But when people are thinking about it, and I'm going to use the self-healing part as an example. I know that's not all that you all do, but it's just a piece that interests me. If someone's going to say, I want to head towards self-healing, right? Obviously they want to first get a pretty solid pipeline and everything. But let's say they are like, well, we don't know all of our playbooks, we don't know all the conditions to do everything. I would imagine some people might feel overwhelmed and say, we have to figure all that out before we take the next step.
But I would imagine you might counter smartly that, just like the whole process of going to CI/CD in the first place, you don't have to have everything figured out first.
If you have one workflow figured out, start there.
And then you can build as you learn from that, as you make sure you have the proper inputs for everything.
It doesn't have to be an all at once.
You don't have to get all your playbooks in line.
You don't have to make sure all your metrics and everything is set up.
Just get one solid case and then you can start on that journey.
Am I correct in that?
Is that what you recommend to people or is it just the exact opposite?
No, I definitely would recommend that. Actually, I've been
buried in Google's SRE handbook.
Technically, it's called the Site Reliability
Engineering, but I call it the SRE handbook because it sounds better.
But they really stress the importance of postmortems.
So every incident of
a certain, you know, priority or severity, you know, you will have a postmortem and then there
should be artifacts taken away from there. So if you're not fixing a process or fixing, you know,
something, you're at least taking actions that can be translated into a runbook. And yeah, certainly in the beginning, no, you're not going to have anything, right? But after, you know, a year or two years, now you're going to start to compile a list of runbooks that can be helpful in those situations.
...actions, how they are designed or what information gets stored so that in the future, when you experience an incident or something that you have already experienced and you already applied some solution, you can quickly and easily... let's say, not only the system, but the organization starts to learn how to create these flows. Or is it... how am I imagining that this works?
Kind of. We, you know, we are really good at tracking people down and giving them a decision point on what to do next. So we would rely on the other tools, like Ansible, that would actually do the runbook. We would deliver the information so that a person could then say, run, you know, runbook alpha or beta or gamma.
And then we can go talk to Ansible and say,
okay, execute, you know, runbook alpha.
So we, you know, it's like a best in breed, right?
We're not going to rebuild Ansible.
We're not in the business of putting Ansible out of business
or Red Hat, I guess.
But being able to insert a human into a process
is still important.
As much as we talk about automation,
having people make decisions
and move process forward is still critical.
And for what you mentioned,
you are leaving like breadcrumbs
or some sort of log of what were the solutions applied
and what was the situation
so that this can be built up in the future.
Yeah.
You know, even we've done some use cases
where we pre-populate or pre-configure post-mortem,
you know, so we can grab, you know,
maybe some scribe updates from a channel, Slack channel,
and then dump those into, you know,
maybe a Confluence document.
Actually, I think in a particular use case,
I was thinking we did it in ServiceNow.
So the incident started in ServiceNow,
created a Slack channel.
We pulled just the Scribe updates into the incident
so that then when somebody's going through
trying to do a postmortem,
they have all the information of when the incident started,
who was notified, who started the conference bridge,
and then the Scribe updates as well.
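The postmortem pre-population described here (incident events plus only the scribe updates from the chat channel) could be sketched in Python roughly as below. The `Entry` record, the `scribe:` message prefix used to recognize scribe updates, and both function names are assumptions made for this illustration; the conversation does not say how scribe updates are actually identified.

```python
from dataclasses import dataclass
from typing import Iterable, List

SCRIBE_PREFIX = "scribe:"  # assumed convention for marking scribe updates


@dataclass(frozen=True)
class Entry:
    timestamp: float
    author: str
    text: str


def build_timeline(
    events: Iterable[Entry], channel_messages: Iterable[Entry]
) -> List[Entry]:
    """Merge incident events with scribe updates, oldest first."""
    scribe = [
        Entry(m.timestamp, m.author, m.text[len(SCRIBE_PREFIX):].strip())
        for m in channel_messages
        if m.text.lower().startswith(SCRIBE_PREFIX)  # drop ordinary chatter
    ]
    return sorted([*events, *scribe], key=lambda e: e.timestamp)


def render(timeline: Iterable[Entry]) -> str:
    """Format the timeline as plain text for pasting into a postmortem."""
    return "\n".join(f"{e.timestamp:.0f} {e.author}: {e.text}" for e in timeline)
```

The output of `render` is plain text, so it could be written into whichever record holds the postmortem, for example an incident ticket or a wiki page.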
Cool.
So everybody, or the key people, gets involved quickly, gets the information as soon as possible, and gets ready to start acting on it.
Yeah, exactly.
And if I might: from these features that the solution has and that you have applied with your customers, which ones would you say have been the ones that your customers praise or like the most?
I mean, we all have our favorites.
Yeah, the enrichment piece, I think, has been really... you know, it's always fun to go to these events and talk about this enrichment process, where you have the monitoring arm that has found an issue, but then you're pulling in information from other sources. In the past, you'd talk to some large IT organizations where it was the CMDB, being able to map a particular business service to a CMDB. Now it's kind of more like, okay, this is the microservice that we're dealing with. Okay, where does it sit in the service mesh?
So enriching, because nothing's in a vacuum,
so it's all part of an ecosystem,
but which part of the ecosystem is it affecting?
So enriching that kind of information
and then delivering that to a person is invaluable
because they can quickly make a decision.
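A minimal Python sketch of this enrichment step, assuming a simple in-memory service catalog, might look like this. The catalog layout (`owner_team`, `depends_on`), the alert fields, and `enrich_alert` itself are hypothetical; a real flow would pull this context from a CMDB, a service mesh, or another source of record.

```python
from typing import Any, Dict, Mapping


def enrich_alert(
    alert: Mapping[str, Any], catalog: Mapping[str, Mapping[str, Any]]
) -> Dict[str, Any]:
    """Attach ownership and dependency context to a raw monitoring alert."""
    service = alert.get("service")
    info = catalog.get(service, {})
    enriched = dict(alert)  # copy so the original alert is left untouched
    enriched["owner_team"] = info.get("owner_team", "unknown")
    # What the affected service itself relies on.
    enriched["depends_on"] = list(info.get("depends_on", []))
    # Which other services rely on it, i.e. what else may be affected.
    enriched["dependents"] = sorted(
        name
        for name, entry in catalog.items()
        if service in entry.get("depends_on", [])
    )
    return enriched
```

The `dependents` field is the part that answers "which part of the ecosystem is it affecting," which is the information the person receiving the notification needs to make a quick decision.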
Yeah, I'm sure that they are all very happy not to be thrown at a defect, a problem, or an emergency blindfolded, like, whoa, whoa, what's happening?
Yeah, exactly.
I guess what you're saying is like you get all the answers almost right away
to start acting.
Yep.
Sounds pretty awesome.
Yeah.
Yeah.
We're pretty stoked about it.
I guess, you know, one more question I wanted to ask is, you know, what do you see on the horizon for this next year?
You know, not to put you like in a predictive spot, but based on trends from last year and everything,
or maybe some projects or new ideas you're trying to work on,
what are you hoping to see accomplished, or kind of make some waves, in the upcoming year?
I think, you know, kind of taking this idea that we've fleshed out pretty well in the incident management space and bringing it into the security space. Actually, I was talking to our information assurance team this week, and we built out a nice little flow so that when they get notified from their incident application, they can start this process of, okay, let's start the collaboration process and get people chatting.
And I think it's kind of just taking the same concept, because it's the same kind of teams dealing with: something happened, I need to get some more information, I need to get people together, and then we need to close it. It's kind of the same process that a lot of different teams go through. So I think it'll be interesting to explore all these other ones. Actually, we just had a blog post about WIRES, the animal rescue group in Australia. They've been using us extensively. People in Australia can call them up to say, hey, I found an animal on the side of the road, and then using xMatters they can dispatch people out to go find that animal, bring it back, and get it the medical help, or the help, that it needs.
Wow.
Yeah, pretty cool.
You know, it's funny that you mentioned the security piece, because, you know, I started working in performance in around 2000, right? And for the longest time you couldn't get people to pay attention to performance. And it seems like it has a pretty good spotlight on it now. It still could always be better, right? Just like anything.
But then whenever anyone brings up
security,
I almost feel bad for security
because they're like in that spot
performance was years ago
where everyone's like, yeah, yeah, security, security, security.
And it's biting everybody.
And everybody knows they have to
pay attention to it. Everyone knows that
more has to be done, but
it's slow going.
So hopefully, yeah, with what you're working on
and in general, hopefully we'll be seeing
a lot more about security in the future
because that's getting really scary these days.
And expensive too.
I mean, the other thing is that, you know, with either a breach or privacy issues, there are some serious fees associated with some of that stuff.
Yeah. I think also on that line, with what you have as a mechanism for detecting those security holes, breaches, or whatever can be detected, you can, in the same way as you do with technical issues right now, start pinging, start alerting, as usually those security issues go unnoticed for a while until it's too late. And as you say, that's where it gets super expensive and everybody gets sad.
Well, and then also the auditing of all that, you know, who knew what, when. We can provide some of that context of, okay, well, this person was alerted on these devices.
And for tracing issues, right?
Yep.
Okay, Travis, is there anything else you wanted to get in before we wrap up?
I don't think so. Just one last thing: February 19th, we're having a webinar with Andy Grabner, so check that out.
Right, so we should probably find a link to that either on the Dynatrace website or the xMatters one.
All right, yeah, we'll be excited to see that webinar. That's just a couple of weeks after Perform, everybody, so make sure you check that out. There's also, I think, a three-part blog on the integration with xMatters on the Dynatrace blog site.
And there's also an xMatters, what would you call it, e-book, right?
A short e-book talking about some of this flow and the connection between your monitoring tools and the xMatters tools and how that all works.
So be sure to check those out as well.
Travis, thanks so much for stopping by today.
Great. Yep. Thanks, Brian.
Thank you very much, Travis.
All right.
Talk to you soon.
Thanks.
All right.
Bye-bye.