PurePerformance - Perform2020 Andi on the Street: NoOps, thresholds and hybrid observability

Episode Date: February 6, 2020

Andi Grabner, our man-on-the-street, gets the scoop on:
- NoOps: Reaching zero-incident prod through auto-remediation-as-code with Juergen Etzlstorfer
- Beyond thresholds – Find anomalies and reduce false positives with Thomas Natschlaeger
- Hybrid observability, from enterprise cloud to mainframe and everything in between with Alex Huetter

Transcript
Starting point is 00:00:00 Coming to you from Dynatrace Perform in Las Vegas, it's Pure Performance! Hi everybody and welcome to Pure Performance and PerfBytes, coming to you from Perform 2020 in Las Vegas. Our man on the street, Andi Grabner, has sent us a few interviews, which we've compiled in this episode. There's a short musical interlude between each, so be sure to stay tuned in. Take it away, Andi. Welcome, everyone, to another episode of Pure Performance Café, still in Vegas, still at Perform 2020 in the Cosmopolitan. I bumped into another speaker, Thomas. How are you? Hi, fine, thanks very much. Hey, it's your first Perform, I believe? Yes, it's my first Perform and I'm really excited about giving my session, yeah. Perfect. Maybe for those people that don't know
Starting point is 00:00:54 you, who are you and what kind of role do you have within Dynatrace? I joined Dynatrace just one year ago as a lead data scientist, and I'm now driving all AI-related and data-science-related topics in the company. Oh, perfect. So I assume you're working closely with Wolfgang Baer and some other folks? Yes, I'm working closely with Wolfgang Baer and also with the whole data engineering team, which is responsible for setting up the whole data science environment we are leveraging to drive forward the Dynatrace AI engine, Davis. Perfect. So your breakout is coming up. It will be recorded, so I assume a lot of people that cannot make it will watch it. So what can they expect from the breakout?
Starting point is 00:01:34 What is it about? What are the key takeaways? The key takeaways. So I will actually talk about some insights into our AI engine, Davis. And here the main focus will be how it is possible to get good root cause analysis results, good answers, from a huge pile of data without defining fine-granular thresholds and baselines for each and every metric. And I will dive into the details of how the advanced statistical methods we have in place for anomaly detection and change point detection allow a great amount of noise reduction, in order to deliver AIOps and well-defined root cause analysis.
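The statistical methods Thomas refers to are internal to Davis and are not spelled out in the interview, but the general idea of learning a baseline from the data instead of hand-configuring a threshold can be sketched in a few lines. The toy example below is not the Davis algorithm, only an illustration: it flags points that deviate strongly from a rolling median, using the median absolute deviation as the learned notion of normal spread.

```python
# Toy illustration of automatic baselining: flag points that deviate strongly
# from a rolling median instead of comparing against a hand-picked threshold.
# This is NOT the Davis algorithm, just a sketch of the general idea.
from statistics import median

def detect_anomalies(series, window=20, tolerance=4.0):
    """Return indices whose value deviates strongly from the rolling baseline.

    The baseline (median) and the expected spread (median absolute deviation)
    are both learned from recent data, so no fixed per-metric threshold has
    to be configured up front.
    """
    anomalies = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        baseline = median(recent)
        mad = median(abs(x - baseline) for x in recent) or 1e-9
        if abs(series[i] - baseline) / mad > tolerance:
            anomalies.append(i)
    return anomalies

# Example: a response-time series with a sudden level shift at index 60.
series = [100 + (i % 5) for i in range(60)] + [180 + (i % 5) for i in range(20)]
print(detect_anomalies(series))  # flags the change point around index 60
```

Because both the baseline and the spread are derived from recent data, the same code works for a 100 ms service and a 10 s batch job without anyone picking a per-metric threshold, which is the property Thomas highlights.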
Starting point is 00:02:12 That's awesome. So I think it kind of bridges the gap with what Wolfgang has been talking about in his breakout when he talked about how to leverage Davis. He had Kroger on stage, and now you give the actual background information on how all this stuff actually works. Yes, that's the link between these two talks. My talk is more about the technical details, how the whole Dynatrace data science team is working with respect to product intelligence, and how the Davis
Starting point is 00:02:40 engine internally works, while Wolfgang was talking about the benefits brought to the customer by these new algorithms. Cool. I've got a question for you now. In my field, I focus a lot on using Dynatrace strategically across the delivery pipeline, so from pre-production all the way into production, using Dynatrace to automatically detect whether a new deployment has an issue. And obviously for that, Davis is great because we can rely on the anomaly detection.
Starting point is 00:03:08 Now, are there any ways we can also kind of teach or customize the AI? While it's great to have everything out of the box and working magically, are there any options for people to say, I have my domain knowledge and I want to get it in? Yes, it's possible. What we have brought so far is custom metric ingest, where users can actually leverage the Davis AI algorithms to automatically baseline new custom metrics, get Dynatrace to suggest the right thresholds, and this can then be leveraged and integrated with
Starting point is 00:03:44 the root cause analysis. That's awesome. So that means what's new with the latest, let's say, evolution of Davis is that you can also ingest your own data and that also gets baselined. Yes. Because that's something new, I believe, right? Yeah, you can baseline it, you can get your threshold suggestions, you don't have to worry about your own thresholds, but Davis will suggest the proper thresholds for you. That's perfect, and I think I saw this also in the UI under custom alerting, that's what you talked about, right? Yes, definitely, that's under custom alerting, and that's going to be improved in the future. That's awesome, perfect.
Starting point is 00:04:16 What about APIs? APIs are a big topic for me and a lot of our customers: how can we automate things better? Are you touching on some of the APIs as well? I mean, obviously ingesting data, but is there anything else people should be aware of? I think the most important part is actually the data ingest, and also being able to leverage this custom baseline setting via the APIs. So it's perfectly possible to run all of this via the APIs and to retrieve the threshold suggestions there as well. That's cool.
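To make the custom metric ingest Thomas describes a bit more concrete, here is a minimal sketch of what pushing a metric for Davis to baseline might look like. It assumes the Metrics API v2 ingest endpoint and its plain-text line protocol; the environment URL, token, and metric key are hypothetical placeholders, and the exact endpoint and token scopes should be checked against the API documentation for your Dynatrace version.

```python
# Hedged sketch: push a custom metric into Dynatrace so it can be auto-baselined.
# Environment URL, token, and metric key below are hypothetical placeholders.
import requests

DT_ENV = "https://abc12345.live.dynatrace.com"   # hypothetical environment URL
DT_TOKEN = "dt0c01.EXAMPLE_TOKEN"                # token with metric ingest permission

def push_custom_metric(key, value, dimensions=None):
    """Send one data point using the plain-text metric line protocol."""
    dims = "".join(f",{k}={v}" for k, v in (dimensions or {}).items())
    line = f"{key}{dims} {value}"
    resp = requests.post(
        f"{DT_ENV}/api/v2/metrics/ingest",
        headers={
            "Authorization": f"Api-Token {DT_TOKEN}",
            "Content-Type": "text/plain; charset=utf-8",
        },
        data=line,
    )
    resp.raise_for_status()

# Example: a domain-specific business metric that Davis can then baseline
# and suggest alerting thresholds for, as discussed above.
push_custom_metric("custom.checkout.duration", 0.42, {"region": "us-east"})
```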
Starting point is 00:04:54 So can I ask, and this is where I lack knowledge because there's so much going on in the product, are there any other APIs available where you can also extract baseline information? You can extract the baseline information, I think, via the config API, and you can also extract the problems Dynatrace has found via the API, and it will give you the information on when and where and why it has found them. The reason I ask is that in my world I always try to shift left, meaning I want to take production data and, let's say, leverage the baseline that was calculated in production, extract it, and then use it in pre-prod for those quality gates. So instead of folks having to define thresholds in pre-prod, just take the baseline from production. Yeah, that works, which is how Dynatrace is built: API first, so that you don't have to use the UI, so to say.
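As a hedged illustration of the shift-left idea Andi describes, the sketch below reads a recent production value of a metric through the environment API and uses it as a quality-gate reference in a pre-prod pipeline. The endpoint, metric selector, response parsing, and the 20% tolerance are assumptions for illustration; a real pipeline would use whatever baseline or SLO source fits its setup.

```python
# Hedged sketch of a shift-left quality gate: reuse what production considers
# "normal" instead of hand-maintaining thresholds in pre-prod.
# URL, token, metric selector, and response parsing are illustrative assumptions.
import requests

DT_ENV = "https://abc12345.live.dynatrace.com"   # hypothetical environment URL
DT_TOKEN = "dt0c01.EXAMPLE_TOKEN"                # token with metrics read permission

def production_reference(metric_selector, window="now-7d"):
    """Fetch recent production data points for a metric and return the worst value."""
    resp = requests.get(
        f"{DT_ENV}/api/v2/metrics/query",
        headers={"Authorization": f"Api-Token {DT_TOKEN}"},
        params={"metricSelector": metric_selector, "from": window},
    )
    resp.raise_for_status()
    values = resp.json()["result"][0]["data"][0]["values"]
    return max(v for v in values if v is not None)

# Gate a pre-prod test run against the production reference (same unit as the metric).
reference = production_reference("builtin:service.response.time:avg")
measured_in_test = 1.1 * reference   # stand-in for the value your pipeline measured
if measured_in_test > 1.2 * reference:
    raise SystemExit("Quality gate failed: slower than the production reference")
print("Quality gate passed")
```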
Starting point is 00:05:45 Perfect, cool. Anything else that you want to tell people? Why should they definitely watch your breakout? Yeah, if they really want to know the insides, the deep details, and the secret sauce of the Dynatrace Davis AI engine, and also how Dynatrace is driving its own digital transformation, how the Dynatrace data science team will be working in the future, and how we will further improve the Dynatrace product by applying AI and data science in the back office, so to say, then you should come to my session.
Starting point is 00:06:15 Awesome. Hey Thomas, thank you so much. Enjoy the rest of the show. Thank you very much. Bye-bye, Andy. Bye. Welcome everyone to another episode of Pure Performance Cafe. I'm still wandering around the corridors of the Cosmopolitan here at Perform 2020. I bumped into another colleague of mine who's doing another session on observability.
Starting point is 00:06:40 Alex, hi, how are you? Hi, yeah, I'm fine. Thank you very much. Yeah, I'm a new product manager at Dynatrace. I'm responsible for the mainframe and everything around the mainframe. The mainframe was introduced more than 30 years ago, but still more than 70% of the Fortune 500 companies are using a mainframe for processing their critical business transactions.
Starting point is 00:07:03 Why are they using a mainframe? Because mainframes are incredibly fast and process transactions securely and reliably. But monitoring or tracing the transactions on the mainframe alone is not enough anymore, because mainframe environments are transforming: more and more companies are moving their services to the cloud and implementing them as microservices, and finding the root cause of a problem on the mainframe is very complicated nowadays. So with Dynatrace they have the possibility to do end-to-end tracing: when a customer
Starting point is 00:07:37 starts a request on a mobile application or starts a service in the cloud, Dynatrace provides you the possibility to trace this request from the distributed world down to the mainframe. And in this breakout session, you will see how you can use Dynatrace for end-to-end tracing and also for finding the root cause of a problem from the distributed world down to the mainframe. That's pretty awesome. I mean, first of all, the capability of root cause analysis through the different layers of the modern stack is amazing. But then I think from a mainframe perspective,
Starting point is 00:08:09 there are also other aspects. I'm not an expert, but I believe there are ways where, if you can prove to, let's say, IBM, who is selling you the mainframe, how much traffic actually comes from mobile, you also get discounts on the licensing. And I'm pretty sure the same goes for cloud workloads. I think this is also a capability that Dynatrace brings to the table, telling you how much traffic actually comes from the
Starting point is 00:08:33 outside, let's say cloud or mobile, into the mainframe, and therefore helping you to lower your costs. Of course, yes. In 2020 we are also working on mobile and cloud workload pricing support, so that it is possible directly in Dynatrace to identify the source of your request. When a request comes from a mobile application or a cloud service, as you already explained, then you can save money, because IBM provides special discounts for these requests. And we will provide support for this feature in 2020. That's perfect. In addition to that, of course, we will also support Java in 2020. We are currently in the implementation phase for this new enhancement.
Starting point is 00:09:14 In AppMon we already had Java support, but not yet in Dynatrace. So we are working together with the different development teams to support this in Dynatrace as well, because Java is one of the most important technologies in the world. And with this support we can also enhance the visibility on the mainframe, because when a customer triggers a mobile request or a cloud request with Java technology, it's currently not possible to trace that request down to the mainframe, since there is no Java support in Dynatrace for the mainframe. But this is also on the roadmap for 2020. Perfect. And in addition to that, it is very important to identify how much CPU a specific request needs, because based on the CPU usage, customers have to pay a lot of money to IBM.
Starting point is 00:10:06 And in 2020 we also want to introduce CPU usage analysis, on the LPAR level, which is a host in Dynatrace, and on the region level, which is a process in Dynatrace. So we will also enhance this in 2020. That's awesome. So, as you mentioned, there are a lot of people still running on the mainframe, a lot of customers that will be at Perform, or are here at Perform, that have a mainframe. So it's great to know the person they need to go to. Alex, that's you. You know, I
Starting point is 00:10:37 think it's a shout-out to our customers: if they run a mainframe and want to know what they can do, and how Dynatrace can help them optimize their mainframe deployment, their costs, their root cause analysis, then either find you after your session, or, I'm pretty sure, you're probably also around at the expo area where the Innovation Towers are. I think that's a great way for customers to get in touch with you and learn more about what's going on.
Starting point is 00:11:02 Of course, yeah, and it's also a great opportunity for me to learn more about the use cases. I can present the first mockups to our customers, and we can discuss the mockups together so that we can provide a good solution for the mainframe in 2020. Perfect. And every breakout is recorded. So in case somebody is listening to this after Perform because they couldn't make it, make sure to check out the recording in case
Starting point is 00:11:26 you're doing anything on the mainframe, because obviously there's a lot of cool stuff that you guys are doing. Yeah. Great. Thank you. Yeah. Thank you. And enjoy the show.
Starting point is 00:11:36 I know, I think you said it's your first Perform. Yeah, it's my first show, yeah. So, well, I know Vegas is sometimes a little overwhelming. At least it was for me the first time I was there, but you know, enjoy it as best you can. Thank you very much. See you.
Starting point is 00:11:45 Bye-bye. See you. Bye-bye. Welcome, everyone, to another episode of Pure Performance Cafe. I'm still wandering the hallways of the Cosmopolitan in Vegas at Perform 2020. And as you probably have guessed, I just bumped into another one of my speakers from the "release better software faster" track. Jürgen, wie geht's? Hi, Andy. I'm fine. Thank you.
Starting point is 00:12:26 What a nice German introduction, or question. Thank you. Well, you know, there are a lot of things people learn here in Vegas, and maybe at least some Austrian words that they can use, because there are obviously not just the two of us Austrians wandering around here; there are a lot of Austrians here from the engineering labs. So, Wie geht's is always a great way to start a conversation. It basically means, how are you?
Starting point is 00:12:43 Es geht gut. Yeah, that's good. Hey, Jürgen, you know, the reason why I do these short podcasts is because we want to give folks that are still deciding whether they're going to join a session, or those that listen to this later on and decide whether they want to watch the recording, a little background and some insights on what they're going to learn. Now, you are doing one of the last sessions in the track, and it's called NoOps: Reaching Zero-Incident Prod Through Auto-Remediation-as-Code. Now that's, as most of the other topics, a very, let's say, ambitious title, because it talks about auto
Starting point is 00:13:25 remediation, it talks about zero-incident prod, and it talks about NoOps. Can you tell us a little bit about what people will learn, what you're going to teach, and also who is going to speak with you, because I think you have a customer, and what they're going to talk about? Exactly, exactly. So first, a big shout-out to the company Citrix. They are joining us on stage.
Starting point is 00:13:47 Actually, it will be Nestor. And together with him, we will talk about auto-remediation as code. What is it? How can you leverage it? How can you build your self-healing systems? And actually, which prerequisites do you really want to take care of first? You don't want to start by just jumping into the cold water and then struggling with all the waves; basically, you want to make sure that you have a good foundation. And we will give you hints and recommendations from everything we saw with our customers. And we also learned
Starting point is 00:14:21 within Dynatrace how you can reach a zero-incident production system. And then we will have Nestor on stage, and he will also talk about how Citrix managed this for them. And I'm really excited because they use bots. So they are using a lot of automation. They can communicate with bots. They actually have bots talking to bots.
Starting point is 00:14:44 So it will be, I think, like the perfect setting for 2020, having bots talking to bots, and we will see what a bot can do for them, and also how they are using this within their organization and what all the benefits are
Starting point is 00:14:59 for them. And I think we can learn a lot from them. Well, if bots talk to bots, I wonder if they also eventually start talking and sharing cat pictures like we humans do or not. So I guess we'll see. Maybe next time
Starting point is 00:15:14 we will invite bots to the stage as well. Yeah, that's actually good. Hey, so obviously we've been talking about your content as part of preparing and getting ready for Perform.
Starting point is 00:15:28 Now, I'm really excited about the structure that you have in your talk. And I think you mentioned it earlier: I think the first step to auto-remediation is actually not auto-remediation, it's reducing the noise. Can you give us just a glimpse of what that really means and what people will learn? Yeah, sure. So reducing the noise really means that you don't want to be alerted and informed about everything that's going on in your environment. If you have huge environments, you might get a lot of noise, and you are maybe suffering from this alert fatigue, where you don't pay attention to your alerts anymore
Starting point is 00:16:06 because there are just too many. So we will also see in this track how our customers that have huge environments are tackling their noise, how they are basically putting their problems into different buckets, how they are using the Dynatrace API to identify which problems are really critical for them and which problems are not as critical, how they are tweaking the alert settings, and how they are tweaking notifications. So we will see in the very first part how you can make sure that you can really focus on the things where auto-remediation makes sense, so you don't get overwhelmed by too many alerts and you don't lose the overview of everything that's going on in your environment. That's one very important part, and I think that's
Starting point is 00:16:54 also kind of demystifying all the self-healing. Self-healing really means putting a lot of automation in place, but first of all it needs humans to identify where to put this automation in place. It's not something that will take over our work; we just leverage it to do the work where we get bored. And obviously, as you said, reducing the noise means using automation to solve problems that actually make sense to be solved, and first using other techniques to filter out those things that don't need attention, or need a different kind of attention. I like that as well.
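The bucketing of problems via the Dynatrace API that Juergen mentions can be sketched roughly as follows. The endpoint and field names follow the Problems API v2 as commonly documented, but the bucketing rules, environment URL, and token are illustrative assumptions, not the approach Citrix or Dynatrace actually uses.

```python
# Hedged sketch: pull recent problems from the Dynatrace API and sort them into
# buckets, so auto-remediation is only wired up where it pays off.
# URL, token, and bucketing rules are illustrative assumptions.
import requests
from collections import defaultdict

DT_ENV = "https://abc12345.live.dynatrace.com"   # hypothetical environment URL
DT_TOKEN = "dt0c01.EXAMPLE_TOKEN"                # token with problems read permission

resp = requests.get(
    f"{DT_ENV}/api/v2/problems",
    headers={"Authorization": f"Api-Token {DT_TOKEN}"},
    params={"from": "now-24h"},
)
resp.raise_for_status()

buckets = defaultdict(list)
for problem in resp.json().get("problems", []):
    if problem.get("status") != "OPEN":
        continue
    # Example policy: availability problems hitting services or applications become
    # auto-remediation candidates; everything else is routed to a review channel.
    if problem.get("severityLevel") == "AVAILABILITY" and \
       problem.get("impactLevel") in ("SERVICES", "APPLICATION"):
        buckets["auto-remediate"].append(problem["title"])
    else:
        buckets["review"].append(problem["title"])

for bucket, titles in buckets.items():
    print(f"{bucket}: {len(titles)} problem(s)")
```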
Starting point is 00:17:26 The last thing I want to ask you: because self-healing and auto-remediation is a topic that has influenced our work on Keptn, I would assume you are going to talk about Keptn and its self-healing and auto-remediation capabilities? Yes, thanks for bringing this up. So this year we are really excited to also have Keptn at the Innovation Towers. You might have seen them already.
Starting point is 00:17:54 If not, please also visit us at the Innovation Towers. We have all the Keptn experts there. And I will be talking especially about the self-healing and auto-remediation capabilities we built into Keptn. So you can basically use Keptn and its auto-remediation capabilities by just providing instructions in a declarative way. It's not about learning a new programming language to use Keptn; it's very easy. I will show an example. Actually, we will see a live demo during the talk. So I will kind of crash an environment, or make it unstable, let's say, and we will see what Keptn can then do
Starting point is 00:18:36 for us to auto-remediate this, with a great combination of feature flags, the Dynatrace AI, and Keptn. So I'm really excited to show this to you. Hopefully the demo gods are with us. You never know what will happen during a live demo, but we have quite mature software in the background. So I'm pretty excited to show this to you. And if you're interested,
Starting point is 00:18:59 then also visit us at the Innovation Towers. Yeah. Well, the thing is, you know, if things go bad and wrong, well, you know the slogan of Vegas: what happens in Vegas stays in Vegas. So if it's really bad, we just don't talk about it anymore.
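Since the talk itself is where the real demo lives, here is only a rough, hypothetical illustration of what "instructions in a declarative way" can look like for auto-remediation: a config that maps a problem type to an ordered list of remediation actions, such as flipping a feature flag and then scaling up. It is written as a small Python script that emits YAML, and it deliberately does not claim to match the exact Keptn remediation file schema; see the Keptn documentation for that.

```python
# Illustrative only: a declarative "remediation as code" description, in the spirit
# of what Juergen describes, NOT the exact Keptn schema.
import yaml  # PyYAML

remediation_config = {
    "remediations": [
        {
            "problemType": "Response time degradation",   # matched against the detected problem
            "actions": [
                # First, try the cheap fix: switch off the risky feature flag.
                {"name": "disable-promotion-flag", "action": "featuretoggle",
                 "value": {"EnablePromotion": "off"}},
                # If that is not enough, escalate to scaling out the service.
                {"name": "scale-out", "action": "scaling", "value": "+1"},
            ],
        }
    ]
}

print(yaml.dump(remediation_config, sort_keys=False))
```

The point of the declarative form is that the remediation logic lives in data rather than scripts, so it can be reviewed and versioned like any other code, which is what makes "auto-remediation as code" auditable.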
Starting point is 00:19:13 Hey, final shout-out. I know that initially Sohaib was slotted for your breakout in the track, but then he had some more important things he had to take care of. So still a big shout-out to him, because he worked with you very closely
Starting point is 00:19:32 in preparing the content and also contributed to Keptn. And yeah, I mean, it's sad that he cannot be here, but also thanks to Nestor, who is jumping in. Yes, thanks also to Sohaib for your work. And I'm excited and pretty sure that we will hear from Sohaib in one of our next Keptn community meetings, or maybe in one of your podcasts, Andy, because he has done great work with the Slack bot.
Starting point is 00:19:56 And we also want to show to the public what he has achieved, or what we have achieved together, the great collaboration between Keptn and Citrix. Perfect. Hey, Jürgen, thank you so much. I think it's time to move on, time to go to the breakouts. And hopefully, for everyone listening in who is contemplating whether to see this session, yes or no, you got enough arguments now to actually watch it, either live or later on in the recording. And now, Jürgen, tschüss, bis später. Servus, thanks Andy, see you in the session. Bye-bye.
