PurePerformance - Perform2020 Andi on the Street: NoOps, thresholds and hybrid observability
Episode Date: February 6, 2020. Andi Grabner, our man-on-the-street, gets the scoop on:
- NoOps: Reaching zero-incident prod through auto-remediation-as-code, with Juergen Etzlstorfer
- Beyond thresholds: Find anomalies and reduce false positives, with Thomas Natschlaeger
- Hybrid observability, from enterprise cloud to mainframe and everything in between, with Alex Huetter
Transcript
Coming to you from Dynatrace Perform in Las Vegas, it's Pure Performance!
Hi everybody and welcome to Pure Performance and PerfBytes, coming to you from Perform 2020 in Las Vegas.
Our man on the street, Andi Grabner, has sent us a few interviews, which we've compiled in this episode.
There's a short musical interlude between each, so be sure to stay tuned in. Take it away, Andy.
Welcome, everyone, to another episode of Pure Performance Café, still in Vegas, still at Perform 2020 in the Cosmopolitan.
I bumped into another speaker, Thomas. How are you?
Hi, fine, thanks very much. Hey, it's your first Perform, I believe? Yes, it's my first Perform and
I'm really looking forward to giving my session, yeah. Perfect. Maybe for those people that don't know
you, who are you and what kind of role do you have within Dynatrace? I joined Dynatrace
one year ago as a lead data scientist, and I'm now driving all AI-related and data-science-related topics in the company.
Oh perfect. So I assume you're working closely with Wolfgang Baer and some other folks?
Yes, I'm working closely with Wolfgang Baer and also with the whole data engineering team, which
is responsible for setting up the whole data science environment that we are leveraging to
drive forward the Dynatrace AI engine, Davis.
Perfect. So your breakout is coming up. It will be recorded, so I assume a lot
of people that cannot make it will watch it later. So what can they expect from the breakout?
What is it about? What are the key takeaways?
The key takeaways. So I will actually talk about some insights into our AI engine, Davis.
And the main focus will be how it is possible to get good root cause analysis
results, good answers, from a huge pile of data without defining fine-granular thresholds
and baselines for each and every metric.
And I will dive into the details of how the advanced statistical methods we have
in place for anomaly detection and change point detection allow a great amount of noise reduction, in order to deliver AIOps and well-defined root
cause analysis.
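For listeners who like to see the idea in code: here is a deliberately tiny Python sketch of baseline-driven anomaly detection. It is not the actual Davis algorithm - the rolling window, the deviation factor, and the sample data are arbitrary assumptions - but it illustrates flagging outliers relative to a learned baseline instead of a hand-picked fixed threshold.

```python
# Toy sketch of baseline-driven anomaly detection (NOT the Davis algorithm):
# flag a point if it deviates from the recent rolling baseline by more than
# k standard deviations, instead of relying on a hand-set fixed threshold.
from statistics import mean, stdev

def detect_anomalies(series: list[float], window: int = 20, k: float = 3.0) -> list[int]:
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]          # recent history acts as the baseline
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) > k * sigma:
            anomalies.append(i)                  # index of the suspicious data point
    return anomalies

# Example: a stable response-time signal with one sudden spike
signal = [100.0 + (i % 5) for i in range(60)]
signal[45] = 400.0
print(detect_anomalies(signal))  # -> [45]
```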
That's awesome.
So I think it kind of bridges the gap with what Wolfgang has been
talking about in his breakout, when he talked about how to leverage Davis.
He had Kroger on stage, and now you
give the actual background information on how all this stuff actually works.
Yes, that's the link between these two talks. My talk is more about the
technical details - how the whole Dynatrace data science team is working
with respect to product intelligence and how the Davis
engine internally works - whereas Wolfgang was talking about the benefits these new algorithms
bring to the customer.
Cool. I got a question for you now.
So in my field, I focus a lot on using Dynatrace
strategically across the delivery pipeline.
So from pre-production all the way into production,
using Dynatrace to automatically detect whether a new deployment had an issue.
And obviously for that, Davis is great, because we can rely on the anomaly detection.
Now, are there any ways we can also teach or customize the AI?
Because while it's great to have everything out of the box and working magically,
are there any options for people to say: I have my own domain knowledge and I want to get it in?
Yes, it's possible.
What we have brought so far is custom
metric ingest, where users can actually leverage the Davis AI algorithms to
automatically baseline new custom metrics, get Dynatrace to suggest
suitable thresholds, and have this then integrated with
the root cause analysis. That's awesome. So that means what's new with the
latest, let's say, evolution of Davis is that you can also ingest your own data
and that also gets baselined. Yes. Because that's something new, I believe, right?
Yeah, you can baseline it, you can get your threshold suggestions. You don't
have to worry about your own thresholds; Davis will suggest the
proper thresholds for you. That's perfect. And I think I saw this also in the UI under custom alerting -
that's what you talked about, right? Yes, definitely, that's under custom alerting,
and that's going to be improved in the future. That's awesome, perfect. What about
APIs? That's a big topic for me and a lot of our customers - how can we automate things
better? Are you touching on some of the APIs as well? I mean, obviously
ingesting data - is there anything else people should be aware of?
I think the most important part is actually the data ingest and also
being able to leverage this custom baseline setting via the APIs. So it's
perfectly possible to run all of this - the baselining and the threshold
suggestions - via the APIs.
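For a rough feel of what that custom metric ingest could look like from a script, here is a minimal Python sketch. The environment URL, token, metric key, and the exact ingest endpoint are placeholders and assumptions for illustration, not the authoritative Dynatrace API contract.

```python
# Hypothetical sketch: push a custom metric so Davis could baseline it.
# The endpoint, token handling, and metric key are illustrative assumptions.
import requests

DT_ENVIRONMENT = "https://your-environment-id.live.dynatrace.com"  # placeholder
API_TOKEN = "dt0c01.EXAMPLE"  # placeholder token with a metric-ingest scope

def push_custom_metric(key: str, value: float, dimensions: dict) -> None:
    """Send one data point in a simple line-protocol-like format."""
    dims = ",".join(f"{k}={v}" for k, v in dimensions.items())
    payload = f"{key},{dims} {value}"
    resp = requests.post(
        f"{DT_ENVIRONMENT}/api/v2/metrics/ingest",   # assumed ingest endpoint
        headers={
            "Authorization": f"Api-Token {API_TOKEN}",
            "Content-Type": "text/plain; charset=utf-8",
        },
        data=payload,
        timeout=10,
    )
    resp.raise_for_status()

# Example: report a business metric that the AI could then baseline
push_custom_metric("custom.checkout.duration", 356.0, {"region": "us-east"})
```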
That's cool. So - and this is now where I lack knowledge, because there's so much going on in the product -
are there any other APIs available where you can also extract baseline information?
You can extract the baseline information, I think, via the config API, and you can also extract the problems that Dynatrace has found via the API,
and it will give you the information on when and where and why it has found them.
Because, come to think of it, I always try in my world to shift left,
meaning I want to take production data and, let's say, leverage the baseline that was calculated in production,
extract it, and then leverage it in pre-prod for those quality gates.
So instead of folks having to define thresholds in pre-prod, just take the baseline from production.
Yeah, that's exactly how Dynatrace is built - API first - so that you don't have to use the UI, so to say.
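To make that shift-left pattern a bit more concrete, here is a hedged Python sketch: pull recent problems that production has detected over the REST API and use them as a simple pre-production quality gate. The endpoint, query parameters, response fields, and gate rule are illustrative assumptions rather than the exact Dynatrace API.

```python
# Hypothetical sketch of a shift-left quality gate: query problems detected
# in production and fail a pre-prod pipeline stage if critical ones are open.
# Endpoint, query parameters, and response fields are illustrative assumptions.
import requests

DT_ENVIRONMENT = "https://your-environment-id.live.dynatrace.com"  # placeholder
API_TOKEN = "dt0c01.EXAMPLE"  # placeholder token with a problems-read scope

def fetch_recent_problems(relative_time: str = "2h") -> list[dict]:
    resp = requests.get(
        f"{DT_ENVIRONMENT}/api/v2/problems",          # assumed problems endpoint
        headers={"Authorization": f"Api-Token {API_TOKEN}"},
        params={"from": f"now-{relative_time}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("problems", [])

def quality_gate(problems: list[dict]) -> bool:
    """Pass only if no open availability problem was detected in production."""
    blocking = [
        p for p in problems
        if p.get("status") == "OPEN" and p.get("severityLevel") == "AVAILABILITY"
    ]
    return not blocking

if __name__ == "__main__":
    if not quality_gate(fetch_recent_problems()):
        raise SystemExit("Quality gate failed: open availability problems in production")
```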
Perfect, cool.
Anything else that you want to tell people?
Why should they definitely watch your breakout?
Yeah, if they really want to know the inside story,
the deep dive and the secret sauce of the Dynatrace Davis AI engine,
and also how Dynatrace is driving its own digital transformation,
and how the Dynatrace data science team will be working in the future to further improve the Dynatrace product by applying AI and data science in the back
office, so to say, then you should come to my session.
Awesome.
Hey Thomas, thank you so much.
Enjoy the rest of the show.
Thank you very much.
Bye bye.
Bye, Andy. Bye. Welcome everyone to another episode of Pure Performance Cafe.
I'm still wandering around the corridors of the Cosmopolitan here at Perform 2020.
I bumped into another colleague of mine who's doing another session on observability.
Alex, hi, how are you?
Hi, yeah, I'm fine.
Thank you very much.
Yeah, I'm a new product manager at Dynatrace.
I'm responsible for the mainframe and everything around the mainframe.
So the mainframe was introduced more than 30 years ago,
but still more than 70% of the Fortune 500 companies
are using a mainframe for processing their critical business transactions.
Why are they using a mainframe? Because mainframes are incredibly fast
and process transactions securely and reliably. But monitoring
or tracing only the transactions on the mainframe is not enough anymore, because mainframe
environments are transforming. More and more companies are moving
their services to the cloud. They implement services
as microservices. And finding the root cause
of a problem on the mainframe is very complicated nowadays.
So with Dynatrace they have the possibility to do end-to-end tracing: when a customer
starts a request on a mobile application or starts a service in the cloud, Dynatrace gives
you the possibility to trace this request
from the distributed world down to the mainframe.
And in this breakout session, you will see how you can use Dynatrace for end-to-end tracing
and also for finding the root cause of a problem from the distributed world down to the mainframe.
That's pretty awesome.
I mean, first of all, the capability of root cause analysis through the different layers of the modern stack is amazing.
But then I think from a mainframe perspective,
there's also other aspects.
I'm not an expert, but I believe there are ways
where, if you can prove to, let's say, IBM,
who is selling you the mainframe,
how much traffic actually comes from mobile,
you also get discounts on the licensing.
And I'm pretty sure the same goes for cloud workloads. I think this is also a capability that Dynatrace
brings to the table: telling you how much traffic actually comes from,
let's say, cloud or mobile into the mainframe, and therefore helping you
to lower your costs. Of course, yes. In 2020 we are also working on mobile
and cloud workload pricing, so that it is possible directly in Dynatrace to identify the source of your requests.
When a request comes from a mobile application or cloud service, as you already explained, then you can save money, because IBM provides special discounts for these requests.
And we will provide support for this feature in 2020.
That's perfect.
In addition to that, of course, we will also support Java in 2020.
We are currently in the implementation phase for this new enhancement.
In AppMon we already had Java support, but not yet in Dynatrace.
So we are working together with the different development teams to support this in Dynatrace as well, because Java is one of the most important technologies in the world.
And with this support, we can also enhance the visibility on the mainframe,
because when a customer triggers a mobile request or a cloud request with Java technology,
it's currently not possible to trace the request down to the mainframe,
because there is no Java support in Dynatrace for the mainframe. But this is also on the roadmap for 2020. Perfect. And in addition to that, it is very important to identify how much CPU usage a specific
request needs, because based on the CPU usage, customers have to pay a lot of money to IBM.
And we also want to introduce, in 2020, CPU usage analysis at the LPAR level,
which is a host in Dynatrace, and at the region level,
which is a process in Dynatrace.
So we will also enhance this in 2020.
That's awesome.
So as you mentioned, there are a lot of people still running on the mainframe, a lot of customers that will be at
Perform or are here at Perform that have a mainframe. So it's great to know the
person they need to go to - Alex, that's you. I
think it's a shout-out to our customers: if they run a mainframe and
they want to know what they can do and how Dynatrace can help them optimize their mainframe deployment,
their costs, their root cause analysis,
then either find you after your session,
or - I'm pretty sure you're probably also around at the expo area,
at the Innovation Towers.
I think that's a great way for customers to get in touch with you
and learn more about what's going on.
Of course, yeah, and it's also a great opportunity for me
to learn more about the use cases.
I can present the first mockups to our customers.
We can discuss the mockups together, so that we can provide a good solution for the mainframe in 2020.
Perfect.
And every breakout is recorded.
So in case somebody is listening to this after Perform and they couldn't make it,
make sure to check out the recording in case
you're doing anything on the mainframe, because obviously there's a lot of cool stuff that
you guys are doing.
Yeah.
Great.
Thank you.
Yeah.
Thank you.
And - enjoy the show.
I know, I think you said it's your first Perform.
Yeah, it's my first show.
Yeah.
So, well, I know Vegas is sometimes a little overwhelming.
At least it was for me the first time I was there, but you know, enjoy it as well as you
can.
Thank you very much.
See you.
Bye-bye.
See you.
Bye-bye.
Welcome, everyone, to another episode of Pure Performance Cafe.
I'm still wandering the hallways of the Cosmopolitan in Vegas at Perform 2020.
And as you probably have guessed, I just bumped into another one of my speakers from the release track.
Jürgen, wie geht's?
Hi, Andy. I'm fine. Thank you.
What a nice German introduction or question. Thank you. Well, you know, there's a lot of things people learn here in Vegas and maybe at least some of the
Austrian words that they
can use because there's obviously not just the two
of us Austrians here wandering around,
but there's a lot of Austrians here from
the engineering lab. So,
Wie geht's is always a great conversation
starter. It basically means: how are you?
Es geht gut.
Yeah, that's good. Hey, Jürgen, you know, the reason why I do these short podcasts is because we want to give folks that are still deciding
whether they're going to join a session, or those that listen to this later on
and decide whether they want to watch the recording, a
little background and
kind of insights on what they're going to learn. Now, you are doing one of the last sessions in
the track, and it's called "NoOps: Reaching Zero-Incident Prod Through Auto-Remediation as Code".
Now, that's, like most of the other topics, a very, let's say, ambitious title, because it talks about auto-remediation,
it talks about zero incidents in prod, and it talks about NoOps.
Can you tell us a little bit about what people will learn, what you're going to teach, also
who is going to speak with you because I think you have a customer and what they're going
to talk about?
Exactly.
So first, a big shout-out to the company Citrix.
They are joining us on stage.
Actually, it will be Nestor.
And together with him, we will talk about auto-remediation as code.
What is it?
How can you leverage it?
How can you build your self-healing systems?
And which prerequisites do you really want to take care of first? You don't want to start by just jumping into the cold water and then struggling with all the waves. But basically,
you want to make sure that you have a good foundation. And we will give you hints and we
will give you recommendations from everything we saw with our customers. And we also learned
within Dynatrace how you can reach a zero incident production system.
And then we will have Nestor on stage,
and he will also talk about how Citrix
managed this for themselves.
And I'm really excited because they use bots.
So they are using a lot of automation.
They can communicate with bots.
They actually have bots talking to bots.
So it will be, I think, like the perfect setting for 2020,
having bots talking to bots,
and we will see
what a bot can do for them
and also how they are
using this within their organization
and what all the benefits are
for them. And I think we can
learn a lot from them. Well, if bots
talk to bots, I wonder if they also
eventually start talking
and sharing cat pictures
like we humans do or not.
So I guess we'll see.
Maybe next time
we will invite bots
also to the stage.
Yeah, that's actually good.
Hey, so obviously,
we've been talking
about your content
as part of the
preparing and getting ready for Perform.
Now, I'm really excited about the structure that you have in your talk.
And I think you mentioned it earlier.
I think the first step to auto-remediation is actually not auto-remediation.
It's reducing the noise.
And can you give us just a glimpse on what that really means and what people will learn?
Yeah, sure.
So reducing the noise really means that you don't want to be alerted and informed about everything that's really going on in your environment.
If you have huge environments, you might get a lot of noise, and you are maybe suffering from alert fatigue, where you don't take care of your alerts anymore
because there are just too many.
So we will also see in this track how customers with huge environments
are tackling their noise and how they are basically putting their problems into
different buckets, how they are using the Dynatrace API to identify which problems are really critical for them and which are not as critical, and how they are tweaking
the alert settings and the notifications.
So we will see in the very first part how you can make sure that you can really focus
on the things where auto-remediation makes sense, so you don't get overwhelmed by too many alerts and you don't lose the overview of everything that's going
on in your environment. That's one very important part, and I think that's
also kind of demystifying all the self-healing. Self-healing really
means putting a lot of automation in place, but first of all it needs humans to
identify where to put this automation in place. It's not that the automation
will take over our work; we just leverage it to do the work
where we get bored. And obviously, as you said, reducing the noise means
using automation to solve the problems that actually make sense
to be solved, and first using other techniques to
filter out those things that
don't need attention or need a different kind of attention.
I like that as well.
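As a rough illustration of the bucketing idea described above - deciding up front which detected problems are candidates for auto-remediation, which need a human, and which are just noise - here is a small Python sketch. The bucket names, field names, and rules are assumptions chosen for illustration, not something prescribed by Dynatrace or Keptn.

```python
# Hypothetical sketch: sort detected problems into buckets so that
# auto-remediation only targets the cases where it actually makes sense.
# Field names and bucket rules are illustrative assumptions.
from collections import defaultdict

def bucket_problems(problems: list[dict]) -> dict[str, list[dict]]:
    buckets = defaultdict(list)
    for p in problems:
        impact = p.get("impactLevel", "UNKNOWN")
        severity = p.get("severityLevel", "UNKNOWN")
        if severity == "AVAILABILITY" and impact in ("APPLICATION", "SERVICES"):
            buckets["auto_remediate"].append(p)   # user-facing and automatable
        elif severity in ("ERROR", "PERFORMANCE"):
            buckets["notify_on_call"].append(p)   # needs a human decision
        else:
            buckets["log_only"].append(p)         # informational, no alert
    return dict(buckets)

# Example: only the first bucket would ever trigger a remediation workflow
sample = [
    {"severityLevel": "AVAILABILITY", "impactLevel": "SERVICES", "title": "Service down"},
    {"severityLevel": "RESOURCE_CONTENTION", "impactLevel": "INFRASTRUCTURE", "title": "High CPU"},
]
print(bucket_problems(sample))
```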
The last thing I want to ask you, because self-healing and auto-remediation is a topic
that has influenced our work on Keptn - I would assume you are going to talk about Keptn
and the self-healing and auto-remediation capabilities?
Yes, thanks for bringing this up.
So this year, we are really excited to also have Keptn on the Innovation Towers.
You might have seen them already.
If not, please also visit us at the Innovation Towers.
We have all the Keptn experts there.
And I will be talking especially about the self-healing and auto-remediation
capabilities we built into Keptn. So you can basically use Keptn and its auto-remediation
capabilities by just providing instructions in a declarative way. It's not about
learning a new programming language to use Keptn - it's very easy. I will show an example. Actually, we will see a live demo
during the talk. So I will kind of crash an environment, or I will make it
unstable, let's say, and we will see what Keptn can then do
for us to auto-remediate this, with a great
combination of feature flags, the Dynatrace AI, and Keptn.
So I'm really excited to show this to you.
Hopefully the demo gods are with us.
You don't know during a live demo what will be happening,
but we have quite mature software in the background.
So I'm pretty excited to show this to you.
And if you're interested,
then also visit us on the Innovation Towers.
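To give a feel for what auto-remediation as code in a declarative style can look like, here is a small hedged Python sketch: the remediation plan is declared as plain data, and a tiny executor maps an incoming problem type to the declared action. This is only a conceptual illustration - it is not Keptn's actual remediation file format or API, and the problem types, action names, and parameters are assumptions.

```python
# Hypothetical sketch of declarative auto-remediation: the remediation plan is
# plain data, and a small executor looks up the action for an incoming problem.
# This is NOT Keptn's actual remediation spec; names and fields are assumptions.

REMEDIATION_PLAN = {
    # problem type -> declared remediation action and its parameters
    "response_time_degradation": {"action": "toggle_feature_flag", "flag": "promotion_banner", "enabled": False},
    "failure_rate_increase": {"action": "scale_out", "replicas": 1},
}

def remediate(problem_type: str) -> str:
    """Look up and 'execute' the declared action for a detected problem."""
    step = REMEDIATION_PLAN.get(problem_type)
    if step is None:
        return f"No remediation declared for '{problem_type}'; escalate to a human."
    if step["action"] == "toggle_feature_flag":
        # In a real setup this would call a feature-flag service's API.
        return f"Set feature flag '{step['flag']}' enabled={step['enabled']}."
    if step["action"] == "scale_out":
        # In a real setup this would call the platform's scaling API.
        return f"Scaled out by {step['replicas']} replica(s)."
    return f"Unknown action '{step['action']}'."

# Example: a detected response-time problem triggers the feature-flag rollback,
# roughly mirroring the feature-flag demo described in the talk.
print(remediate("response_time_degradation"))
```

The point of keeping the plan declarative is that operators describe the desired reaction, while the tooling decides how to execute it.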
Yeah.
Well, the thing is, you know,
if things go bad and wrong and we don't want to talk about,
you know, the slogan of Vegas,
what happens in Vegas stays in Vegas.
So if it's really bad,
we just don't talk about it anymore.
Hey, final shout out.
I know that initially Sohaib
was slotted for your breakout
in this track.
But then he had some things he had to take care of,
which was more important.
So still a big shout out to him
because he worked with you very closely
in preparing the content, and also contributing to Keptn.
And yeah, I mean, sad that he cannot be here,
but also thanks to Nestor who is jumping in.
Yes, thanks also, Sohaib, for your work.
And I'm excited and pretty sure that we will hear from Sohaib
also in one of our next Keptn community meetings, maybe,
or in one of your podcasts, Andy,
because he has done great work with the Slack bot.
And we also want to show to the public
what he has achieved, or what we have achieved together -
the great collaboration between Keptn and Citrix. Perfect. Hey, Jürgen, thank you so much. I think it's
time to move on; it's time to go to the breakouts. And hopefully, for everyone
that is listening in and contemplating whether they want to see
this session, yes or no, hopefully you got enough arguments now to actually watch it, either live or later on
in the recording. And now - Jürgen, tschüss, bis später. Servus, thanks Andy, see you in the session. Bye-bye.