PurePerformance - DevOps to NoOps in action: ChatOps for Autonomous Operations with Nestor Zapata
Episode Date: January 30, 2019
Long-time Dynatracer Nestor Zapata chats with us about Citrix’s fundamental shift from reactive to proactive and predictive operations; moving from data sets and charts to AI-powered answers. His se...ssion detailed advantages of a “Gen 3” monitoring approach and how to get there.
Transcript
Coming to you from Dynatrace Perform in Las Vegas, it's Pure Performance!
Hello everybody and welcome to Pure Performance and PerfBytes.
Coming to you from Dynatrace Perform in 2019 in Las Vegas.
I'm Brian Wilson, and my co-host is, I've been saying the one and only a lot, so Mark Tomlinson from PerfBytes.
But Mark, what else can I call you besides the one and only?
You could call me the veritable, the incorrigible.
That actually is my new favorite.
Asynchronous.
Yeah.
So anyway, Mark, we have another guest with us today from Perform, Nestor Zapata.
He's a former guest of Pure Performance, right?
And did he ever do anything with PerfBytes?
Yeah, Nestor, you've been on PerfBytes live from Perform for a couple of years, yeah?
Yeah, I've been live from Perform, and I've also done a PerfBytes about 45 or even more with Andy about Agile in operations.
So that was a pretty lengthy one that we had, and that was a cool transformation that our ops team took a couple of years ago.
Yeah, yeah. A Pure Performance podcast, definitely. Yep. And Nestor, so why don't you introduce yourself and tell people a little bit about yourself and then we'll jump in.
Well, hello everyone. My name is Nestor Zapata and
currently I have a new role at Citrix Systems. I am the data center
and cloud operations manager. The last couple years I have been
at Perform and working with Dynatrace as one of the DevOps engineers
on the web and application team,
and since moved on to bigger and better things per se,
but always continue my work with Dynatrace.
And today I had the opportunity to share my experience
about chat ops for autonomous operations.
And that's something that at Citrix, it is our goal to make operations more relevant
in terms of the future of automation and involving things like predictive analytics,
AI, chat, and voice ops.
And I had a fun time sharing my story today
at Perform.
Thank you.
First of all, congratulations on the promotion.
It sounds like you said you had one.
So tell us, last time we chatted, right,
which was I think a year and a half, two years ago on the podcast,
you all were in the middle, early middle phase of your transformation, right?
You had gotten some landmarks and, you know,
we said we were going to check back in and now I guess we can check back in.
Where are you now as opposed to last time?
And what would you say you've learned from,
from this last big push in the last year or so?
Well, in the last year,
I think one of the biggest pushes we've done is obviously moving along from
the AppMon on-prem solution to the
Dynatrace solution, the SaaS platform. And also moving on, we did explore synthetics. We did
start off with classic synthetics and now obviously making that push to making all the synthetics go
through our Dynatrace platform. So that was huge for us because with that transformation,
we were able to dive in and peek at things like AI and analytics
that were not available to us before, more robust information, deeper drill-downs.
Not only were we, as an IT organization, able to be successful at getting that, but the
business was also able to benefit from it, because they saw how their applications, how their
business was being impacted as well.
And with Dynatrace, we were able to do quicker deployments.
And now we even got to the point of trusting the data from Dynatrace and including automated
deployments.
Our deployments in pre-production and in QA used to go from a ServiceNow ticket to a person
manually going into Jenkins.
Now all of that is done by a single ticket in SNOW, or ServiceNow.
And then we moved on even further where we can test the data from Dynatrace
and pull it back into our Slack channel.
We incorporated what we call a Slack bot called Ultron.
I don't know why we named it the evil nemesis of the Avengers,
but it's in our operational world, but that's what it's called.
That's cool.
And so Ultron, yeah, Ultron does it.
If you were able to see my session today, you saw the clips of Ultron working hand-in-hand with both Jenkins, Adam, another Slack bot, and our Dynatrace solution as well.
So basically, somebody goes into Slack, tells Ultron to deploy the latest code,
Ultron goes out to Jenkins, Jenkins deploys it,
checks out Dynatrace, checks out whatever thing needs to be done.
If it failed anything on the DT side,
it will come back to the Slack channel and say,
hey, engineers, it failed because it did not pass this
or this threshold was blown and you have to
redeploy or you have to check it out before I move it on to pre-production. So it's not just
eliminating the time that our engineers were spending doing deployments, but also preventing
bad code from moving into pre-production, which of course led to much cleaner code coming into our
official production environment. So reducing the amount of time we had to do root cause analysis and troubleshooting
of that nature. And that obviously gave us faster deployment times, fully automated pre-prod,
and increased automation because we were able to get the data from Dynatrace,
trusting it and bubbling it up to our business users.
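The flow described in this turn (a chat command triggers a deployment, monitoring data is checked against thresholds, and the result is reported back to the channel before promotion) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not Citrix's actual Ultron code: the function names, metric names, and thresholds are all hypothetical, and the Jenkins, Dynatrace, and Slack calls are replaced by plain callables passed in by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GateResult:
    passed: bool
    violations: List[str]


def evaluate_gate(metrics: Dict[str, float], thresholds: Dict[str, float]) -> GateResult:
    """Compare monitoring metrics against upper-bound thresholds (hypothetical names)."""
    violations = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: no data")
        elif value > limit:
            violations.append(f"{name}: {value} exceeds {limit}")
    return GateResult(passed=not violations, violations=violations)


def handle_deploy(
    build: str,
    deploy: Callable[[str], None],
    fetch_metrics: Callable[[str], Dict[str, float]],
    thresholds: Dict[str, float],
    post: Callable[[str], None],
) -> bool:
    """Deploy a build, check it against the gate, and report to the chat channel.

    Returns True if the build may move on to pre-production.
    """
    deploy(build)  # stand-in for triggering a CI job
    result = evaluate_gate(fetch_metrics(build), thresholds)
    if result.passed:
        post(f"{build}: all checks passed, promoting to pre-production.")
    else:
        post(f"{build}: blocked. " + "; ".join(result.violations))
    return result.passed


if __name__ == "__main__":
    messages: List[str] = []
    promoted = handle_deploy(
        build="build-42",
        deploy=lambda b: None,
        fetch_metrics=lambda b: {"response_time_ms": 950.0, "error_rate_pct": 0.2},
        thresholds={"response_time_ms": 800.0, "error_rate_pct": 1.0},
        post=messages.append,
    )
    print(promoted)
    for line in messages:
        print(line)
```

Passing the deploy, metrics, and posting steps in as callables keeps the gate logic testable without touching any real system; a real bot would plug its own integrations into those three slots.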
Wow. So I have two thoughts here I wanted to touch on. Number one,
I know two reasons why you named it Ultron.
Number one, because it sounds really, really cool to say Ultron deployed.
It does.
Right?
Yeah, Ultron working for you.
And also, I think you're a unifier, right?
So you have Ultron working for the good now, which is a good thing to have, and it's doing good things.
That's good.
Ultron is no longer evil.
So I wanted to try to pull like a little Andy Grabner here and do a real quick summary of what you did there
because I kind of think I heard this
and I want to know if I hear this right.
It sounds like since we last talked,
your pipeline has matured amazingly, right?
You finished kind of all the pieces of it.
And also that transition from moving from AppMon to Dynatrace
then allowed you to plug Dynatrace much more completely into all different aspects of the pipeline, not just for monitoring data, but for helping with the control flow and everything else that really put the finishing touches on that pipeline.
It's obviously not the pipeline itself, but it makes it, it polished it, it made it much more functional,
much more autonomous. And I think that's awesome because that's something that a lot of our
customers right now are going through. In fact, you know, people are going to be sick of hearing
me say this, but I was teaching a HOT day on Monday called Dynatrace for AppMon users and
trying to help people make that transition of, first of all, how do you use the new Dynatrace tool if you're a long-time AppMon user like we all were?
But also, what are the different things you can do with it and how do you start thinking differently?
And it's really awesome to hear that you all are leveraging all these side components of Dynatrace in ways that it's designed to be there for you.
Yep, exactly. And along with that, one point that I'd like to touch on,
because we had various sessions on that, moving from AppMon
into the Dynatrace platform.
Is it going to give me everything I had before?
Is it going to give me every single nook and cranny to the application?
And the answer to that is realistically, no.
It's not going to drill down to every single .NET transaction,
but at the end of the day, we try to understand the data and trust the data.
And what does that mean?
Well, understanding that we don't need to see every .NET transaction that kind of comes through there.
We want to know the problems and the highlights.
As an engineer, sometimes you feel like, well, this tool is dumbing me down because it's doing all the work for me.
And I turn to them and I tell them, don't you want that tool to do that for you?
And you concentrate on automating that solution,
concentrating on learning another skill set
because now you don't have to spend
three or four hours finding a root cause.
With Dynatrace, it's within minutes.
You can figure out how it correlates to the database,
how it correlates to the web front-end servers
and the application.
And when business users are impacted,
now you have that all in one frame, one single pane of glass, and you can trust that data
because you know it's picking up every single transaction on the back end and just presenting
you what you need to know.
So that was a cultural change, kind of a mind shift, you know, engineers kind of feeling
they were being left out.
But once they knew that their role was going to change, they were able to accept that.
I kept saying, for IT, you have to automate yourself out of your current job role.
Not your job, but your job role.
If you were a DC tech, now you move on to sys administration.
If you were a sys admin or a web engineer, now you move on to DevOps engineering or automation engineering.
And that's what the application team has done in the last year. They moved on from being just simple, you know, web and application administrators or engineers.
And since then they're moving on to being more automation engineers, DevOps engineers, incorporating Python, PowerShell, and having an automation-first mentality.
And they understand that a lot of the stuff they can hand down to a level two or even a level one. And letting go of work that you used to do, that you used to think was, oh my gosh, so important that the world can't revolve without me doing this.
Now they have to understand they have to evolve as we move into a world of AI and predictive analytics.
Yeah, I think that's really interesting too because a lot of engineers or a lot of even testers, anybody, they find they can put value on themselves by showing the problems they can fix and the problems they can solve in terms of problems that were created by other people.
Maybe even themselves that were put into production or put into something, but look how great I am at problem solving.
It's a very easy way to validate your value, but if you can move away from fixing those problems or triaging them and instead become more
of a creator of the new things, that's, you know, you need a little bit
more confidence to take that step, but that's why it's a cultural shift. And since
everybody loves car, uh, analogies, right, Mark? No, no, I liken it to being like,
if you're a mechanic and you're fixing problems, like, graduate to being the designer of the car.
Yeah.
Right?
And that's what these new tools, whether it be Dynatrace or other things that are automating or taking the human element out of it, it's going to let you move up to being more of a creator than a problem solver or a fixer.
But that's just always the way we've been, is we've always been fixing things that we're breaking instead of getting ahead of it anyway.
Um, so now you're going into chat ops, right?
And I'm hoping, uh, you're, you're either going to call, you're going to, and, and for
people who are not aware of this is touching into being able to plug Dynatrace into a chat
thing.
Um, but I was going to throw this out there before that. I suggest you try to see if you can rename the name command
from Davis to either Magneto, since you have Ultron.
Magneto would be good.
Or if you want to take it really, really, go a little bit more obscure,
you can call it the Master Processor from Tron.
I think it was the Master Processor.
Was that the name of the evil computer guy?
Or, you know, I wonder if it wouldn't be better to just name him Jarvis.
We could do that.
Because it's like the chatbot that Tony uses in the Iron, yeah.
Well, Jarvis becomes Ultron, but it starts as Jarvis.
Yes, that is true.
I didn't see that movie, Mark. You just ruined it for me.
So I actually have one question for you, Nestor,
just thinking about as you've streamlined the pipeline and people's roles have changed, they're
working in different ways,
some of the things I'm hearing from people is like,
well, I don't have time to do experiments
and sort of, you know, kind of exploratory programming,
you know, just playing around and experimenting with stuff.
Do you find that is a challenge for developer engineers?
Is there still room for that?
Do you have other ways of accommodating that experimentation?
Yeah, so it's the different factors that come into play with that.
That is the nature of the beast in our world.
And whether we think we work 40 hours a week, we'll always be working much more than that.
You have to make time for it.
It has to be a priority.
Automation first has to be a priority.
You have to bake it into your process.
And management has to understand that.
And I always tell my team, if I ask you to do a task and you tell me,
hey, Nestor, it's going to take me about two hours to complete this manually,
but if you give me a couple days, I can whip up a PowerShell script,
a Python script to make it more of an automated script that we can use continuously.
I'm always going to tell you, unless, you know, obviously nothing being done, of course, take the two business days that you need to make this happen.
And I think that's important for leadership to do that, and for one as an engineer to go ahead and do that and also break out some time, whether it is, you know, a part of your lunch break or setting out
maybe a Thursday or Friday, you know, on a lesser hour that you have, that you want to take into your calendar
and put in, I'm going to learn some automation skills and do courses that you have online or work with
another teammate. We've encouraged that in our team. And they do that on a weekly basis. They sit down.
It's in their calendar.
No meetings are booked.
They try to keep it as much as they can and try to bake that time into your routine because
it is going to be needed.
The future is going to be all about automation, more of a DevOps.
And if you're not catching up now, you're going to have a tough time showing your value
three to five
years from now when all these bots are kind of doing some of the stuff level three guys
were doing, girls were doing, and now shifting it to a level two or level one.
And now's the time to appreciate that, learn that now, because if not, it's going to be
a tough road in the next few years.
Yeah.
So I hear you right. I mean, it's, it's also as if I hear you correctly,
the, the maturity of the organization goes along with the maturity of the pipeline. And so if
you're not turning that extra time, because we've now automated and things are moving faster,
turning that into these programs as part of the cadence, like, you know, every, every other sprint,
you get a Friday to do,
you know, just hacker day or something on the product you're working on that we kind of tabled,
but we want to come back to it just to see what we could do with it. And you program that into
your schedule or into the cadence that you have. That's what I see, you know, people,
if you miss out on that, then you'll just go faster and faster and almost burn people out,
or you'll be missing out on those other investments, I think.
Yes, exactly.
Take that time out.
Know how much time you save, and then shave off some of that.
If you saved two hours a week because of a process that was automated and that took you a couple months,
well, out of those two hours, take 30, 45, maybe an hour into that week and incorporate it.
You're still going to be in the
office, you're still going to be doing great, so now make some productivity out of it.
Yeah, of course.
Yeah, another approach was, you know, one of our guests we've had on a couple times, and
it was fun to talk to, Donovan Brown over at Microsoft, um, you know, his command to everybody
was, I think he might have said 15 minutes, but I'll be more realistic and say 15 to 30 minutes, take 15 to 30 minutes of your day, your work day, your work time,
right, and work on a project that's not part of your main thing. So learning the project,
something else. Um, and you know, if you have a problem justifying it, you just got to realize
that 15 to 30 minutes a day, it's going to impact your other project. Your other project's already
impacted, you know? So, so take that time as you're saying, and, and, you know, and I've been applying
that to my life outside of even outside of work, but partially within work, but even on other
things, like I used to always think, well, I need a big chunk of time in order to accomplish
something. Cause I need to dive in for like two hours, you know, but I've been realizing like,
well, actually, no, if I take 15, 20 minutes, get a few things done or, you know, work on just a little bit, set a realistic ending to it, I'm
chipping away and chipping away and getting, so it definitely does work. So whether or not you're,
you know, you're giving the person the two business days like you said, Nestor,
or if it's like hey just take a little bit of time at the end of your day, watch another one of those 15-minute online courses
to keep brushing it up,
you could definitely be leveling yourself up
to be doing these cool,
so that you do remain relevant, right?
And that's the big thing.
So we brought up the chat ops a little bit,
but I did want to dive a little bit more into that.
So what are you incorporating that into? And just explain
maybe what this is for people who might not be completely familiar with this.
Yeah, so basically chat ops is an extension to our
support as an operational team. For example, if the
manager or the business owner or the developer wants to see how their deployment is
going, because Jenkins is kicking it off, they can type in, @Ultron, what was the last deployment? What
was the last deployment done in our environment? It's going to shoot back the last couple of
deployments that were done, which of them were successful, which of them failed. It's going to
keep you up to date when you speak to it, per se. You know, you type it in, because we're
doing chat ops here, right? And incorporating things like Davis also too, you know, you can do the @
from Slack, you know, in your Slack channel, right, @Davis, enter, and you reach out to it and say, hey, how
does my performance for this application look? And then Davis will come respond via the Slack channel.
We have a couple of bots there for not just Ultron,
but for Jenkins, AWS Jenkins.
We have Adam.
Also some self-healing items that you can go ahead and do and incorporate.
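The "type a question at a bot, get an answer back in the channel" pattern described here can be sketched as a small command router. This is only an illustration under assumptions: the deployment history, the handler, and the message format are hypothetical, and no real Slack, Jenkins, or Davis API is used.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical in-memory deployment history; a real bot would query its CI system.
DEPLOYMENTS: List[Dict[str, str]] = [
    {"build": "build-40", "status": "success"},
    {"build": "build-41", "status": "failed"},
    {"build": "build-42", "status": "success"},
]

# Matches messages that start with a bot mention, e.g. "@Ultron, what was ...".
MENTION = re.compile(r"^@(\w+)[,:]?\s+(.*)$")


def last_deployments(history: List[Dict[str, str]], count: int = 2) -> str:
    """Format the most recent deployments, newest first."""
    recent = history[-count:][::-1]
    return ", ".join(f"{d['build']} ({d['status']})" for d in recent)


def ultron(text: str) -> str:
    """Hypothetical handler for messages addressed to the deployment bot."""
    if "last deployment" in text.lower():
        return "Recent deployments: " + last_deployments(DEPLOYMENTS)
    return "Sorry, I don't know that command."


def route(message: str, handlers: Dict[str, Callable[[str], str]]) -> Optional[str]:
    """Send a chat message to the bot it mentions; return None if no bot is addressed."""
    match = MENTION.match(message.strip())
    if not match:
        return None
    bot, text = match.group(1).lower(), match.group(2)
    handler = handlers.get(bot)
    return handler(text) if handler else None


if __name__ == "__main__":
    print(route("@Ultron what was the last deployment?", {"ultron": ultron}))
```

Keeping one handler per bot name mirrors the setup of several bots sharing a channel: each message is answered only by the bot it addresses, and everything else is ignored.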
So if you know something is wrong with your application,
because Dynatrace has alerted you on that,
you can go ahead and say, hey, Ultron,
trigger self-healing for key service applications, for example, is one thing that I showed today.
And it's going to go into this Azure key service.
It's going to go to the API.
It's going to shut it down, restart it node by node on how it should be done, and then put it back on.
So that's something that was done on a script that you have to log in to go to Jenkins.
But now you can go straight from our Slack channel and tell Jenkins there and say, hey, hashtag kickoff this issue.
But we took it a step further.
Now, you could still trigger it manually, but now we have this bot looking out for these issues of the login failures, because Dynatrace and synthetics are baked in.
Now it can come back and say, hey, I, the bot, have already triggered the self-healing for key servers on node 13 or node 15.
Right.
And it lets you know what it's done and so on.
So you could do it manually or it could do it on an automated fashion because of the AI built in between Dynatrace and our bots.
And then it all shows up in the Slack channel.
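The self-healing described in this passage has two parts: a node-by-node restart, and a trigger that can fire either manually or automatically when synthetic login checks keep failing. A rough sketch follows, assuming hypothetical function names and a made-up rule of three consecutive failures; the stop, start, health-check, and chat-posting steps are stand-in callables rather than real service or Slack calls.

```python
from typing import Callable, Iterable, List


def should_self_heal(check_results: List[bool], max_failures: int = 3) -> bool:
    """Return True when the most recent `max_failures` synthetic checks all failed.

    The threshold of three consecutive failures is an assumption for illustration.
    """
    recent = check_results[-max_failures:]
    return len(recent) == max_failures and not any(recent)


def rolling_restart(
    nodes: Iterable[str],
    stop: Callable[[str], None],
    start: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    post: Callable[[str], None],
) -> bool:
    """Restart nodes one at a time, reporting each step to the chat channel.

    Stops at the first node that does not come back healthy, so the
    remaining nodes keep serving traffic.
    """
    for node in nodes:
        stop(node)
        start(node)
        if not is_healthy(node):
            post(f"Self-healing halted: {node} did not come back healthy.")
            return False
        post(f"Self-healing: restarted {node}.")
    return True


if __name__ == "__main__":
    channel: List[str] = []
    login_checks = [True, False, False, False]  # newest result last
    if should_self_heal(login_checks):
        rolling_restart(
            ["node-13", "node-15"],
            stop=lambda n: None,
            start=lambda n: None,
            is_healthy=lambda n: True,
            post=channel.append,
        )
    for line in channel:
        print(line)
```

Because every step posts to the channel, the team sees the same record whether a person or the bot started the restart.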
So there's nobody missing from the DL.
There's nobody not aware of what's going on.
You don't have to worry about this. And you'll see the screenshots in the replay, or if you were in
my presentation today, that it'll say one o'clock in the morning, 2:51 in the morning, things like
that, you know, are things that we're trying to incorporate. And our global team is able to go in
there, keep us up to date. So when we come in the morning, we can see what the bot said, or we can see what our team did and take it from there
and not have to take up time from our Kanban meeting every morning
that's only 15 minutes and not have to worry about that.
We already have somebody assigned that's being taken care of
or we'll get the root cause analysis later on in the day.
That's not bogging down our current work stream,
and that's made us more efficient, uh, to date.
Yeah, and efficient, but also easier to manage the workload across the team.
Of course, yeah.
Yep, of course. Yeah. So that's really, I never thought about that, right? We've always
had, you know, internally we promote the Davis integration into Slack, right? And I just think,
okay, people talk to Davis and that's the beginning and end of it. This is the first time, and it's my own fault. I didn't think
about it, but it's the first time I'm hearing, oh, there are other bots and you can talk to bot A,
then bot B and bot C and give them commands, which, that's fascinating to me. I just never
thought about that. I don't know why, but that's really, really cool that you have that set up.
And the only question I have is, do you ever come in in the morning and find Ultron
and Davis chatting with each other? Like, hey Davis, what's going on? Not much, Ultron.
Is Davis strong enough to hold an Infinity Stone?
I'm not sure. We have not tried that out. That would be good. Yeah.
Now we get to the goofy bits.
Obviously, you don't have to ask
if you've been in Vegas before because we've all seen you
at Perform before.
Are you spending any extra time
after the conference is over, or are you
going straight home?
No, I'm going straight home. Unfortunately, my wife couldn't make it out, uh, with this trip this time around. If she was here,
we would have, we talked about staying through the weekend, uh, but she couldn't make it
out. So, um, I'll be staying for the after party that we have going on Wednesday night, and Thursday
I head on back. So, but I'll be arriving there, you know.
I arrived here, hopefully, hopefully we'll get
to dance again. You, Andy, you know, we had a fun little, actually no, I think you were just videotaping
Andy and I dancing.
I was videotaping you guys, yeah.
Probably laughing at me mostly. I still haven't
seen that, so I have no idea what I look like when I'm dancing and I've had a few scotches.
It's somewhere out there.
I'm sure it is. Uh, and then the other question, you kind of talked a lot about these things, but if you had to pick one, what's your performance resolution for the new year?
Performance resolution is to automate our production environment and self-healing tool sets the same way our pre-production is.
So bake more of the Dynatrace logistics in there. Get that synthetic that we have now from
classic moving into that and triggering things like the logins are going to
trigger failures here on this side. That correlation can trigger
a self-healing or let us know what we need to trigger on our side and have the
business utilize our Slack environment to chat with
Davis and get more real-time metrics on their environment.
And there are specific applications, but they may not care how the whole environment is working.
They just want to concentrate on their substance.
So we can trigger that for them, and they won't have to come up to us, per se, and ask us how is everything looking.
They can just chat up and say, hey, @Ultron, @Davis, how is my XYZ environment?
And it'll return back the performance of last night or the last three hours and mature that process of self-healing in our production environment.
Wow.
Yeah.
Cool.
I'll make sure your bosses hear that so they can use that as your performance review at the end of the year.
That's awesome.
It's awesome what you all are doing.
I'm glad to hear it's progressed.
I mean, it's progressed a lot since we last talked.
That's really awesome.
So congratulations for living the dream.
Yeah.
It's an awesome experience.
I'm looking forward to more work in 2019.
Exciting opportunities that we have.
All right.
Cool.
Enjoy the rest of your time here in Vegas.
All right. Thank you, gentlemen. Appreciate it.