PurePerformance - Perform2020 Andi on the Street: Web Scale, OpenTelemetry and Resiliency

Episode Date: February 6, 2020

Andi Grabner, our man on the street, gets the scoop on:
- Going web-scale with cross-environment features, globally distributed high availability and more, with Guido Deinhammer
- The role of OpenTelemetry in Dynatrace, with Daniel Khan and Sonja Chevre
- Building resiliency into your continuous delivery pipeline, with Michael Villiger

Transcript
Starting point is 00:00:00 Coming to you from Dynatrace Perform in Las Vegas, it's Pure Performance! Hi everybody and welcome to Pure Performance and PerfBytes, coming to you from Perform 2020 in Las Vegas. Our man on the street, Andi Grabner, has sent us a few interviews, which we've compiled in this episode. There's a short musical interlude between each, so be sure to stay tuned in. Take it away, Andi. Welcome to another Pure Performance Cafe episode today, live from Vegas. I know there's a little background noise here because people are running around, probably looking for their next session. Talking about the next session, I just bumped into Sonja and Daniel. Hey! Hi Andy! Hi Andy! Hey, how are you? Good, good. Excited about the sessions. Yeah, so you are actually talking about the sessions. I know we only have a couple of minutes because you need to rush over
Starting point is 00:01:02 but you have a session coming up, and it's called The Role of Dynatrace in OpenTelemetry, right? I think it was last year we already had a podcast about this, and with Sonja we did a Performance Clinic as well. Yeah, it was last year, last year in fall. So you have heard last year about lots of new words coming up, like observability, like OpenTelemetry, emerging from OpenTracing and OpenCensus. We also talked a lot about trace context. So a lot of these topics around distributed tracing and observability as a whole are coming to the industry, and some of the words I think we know, and some of the things are new, like OpenTelemetry, and we want to give an
Starting point is 00:01:51 opportunity to tell the people what it is all about and how Dynatrace plays a role in that and Daniel is really interesting because he has been working with us. He's interesting of course Daniel is the most interesting person, one of the most, you may be the second All right back to the topic what's happening here because of the contribution that we made to open telemetry maybe then you want to say a little bit so yeah what what people are like most probably surprised about is that we are amongst the top contributors to open telemetry as a vendor so. So we are heavily investing into that whole topic so we have a whole team sitting on it and collaborating with the community. There
Starting point is 00:02:34 are a few things that are good for Dynatrace by doing so. First of all, we learn a lot around that: also how we collaborate with other companies, how decisions are made, and so on. So the whole open source idea helps a lot. But also, as Dynatrace, we want at some point to utilize what OpenTelemetry produces, and as a vendor with over 10 years of experience in monitoring, our knowledge is pretty battle-tested, so there are years and years of support issues and
Starting point is 00:03:22 things we figured out along the way that are built into what Dynatrace does. So we want to have at least, or striving to have at least the basic stability that we would expect from Dynatrace there. And one important point of course I think also maybe Sonja wants to talk about it, is of course that OpenTelemetry is purely about data collection. We're not talking about how the data is then displayed. It's a really, really important one, but it's also important to notice that it's not OpenTelemetry versus one agent. You know, people start to hear OpenTelemetry, is it going to be something newer than one
Starting point is 00:04:03 Agent, is it going to replace it? And we see OpenTelemetry as an additional tool that we at Dynatrace have, as an opportunity to collect additional data for specific use cases. So it's not OneAgent or Dynatrace versus OpenTelemetry; everybody is working together to reach the best value for customers. Yeah, in the end it should not matter for a customer what underlying agent technology is used. We as Dynatrace also decide that, for instance, for serverless environments we are using OpenTelemetry, just because in serverless you don't need so much of what the OneAgent provides. OneAgent provides a lot more than OpenTelemetry does, but for a pure
Starting point is 00:04:50 serverless function where you don't need this deep code level visibility for instance or CPU stats of a function that runs just a few milliseconds, there you just can go for a simpler approach. this is where we're working on. I think you put it nicely said OpenTelemetry is about data collection that's clear right so that means it will be it can be used by people that where maybe the one agent is too powerful or maybe where the one agent is just not a fit you can then still use OpenTelemetry to ingest data into Dynatrace where then all the analytics obviously comes into play.
Starting point is 00:05:25 Now, you also mentioned there are other companies that are obviously part of that. Are you presenting on your own? Do you have anybody with you at this show? So, Morgan McLean from Google will join us for this session. He's one of the original product managers at Google for OpenCensus, and OpenCensus was the project that was merged with OpenTracing to form OpenTelemetry. Perfect. So you have Morgan with you.
Starting point is 00:05:50 Yeah, he explains to our users what OpenTelemetry is all about, what the roadmap is, what the long-term vision is. And I have to say that working with Google and Microsoft and others is really, really valuable there. It's also a lot of fun, we enjoy collaborating with those companies. And Google, for example, builds OpenTelemetry into their Stackdriver logic, and also trace context, which we talked about before. So these are also opportunities to open up Dynatrace more. When there is a common standard to do things, and we cannot always assume that there is a OneAgent in place, then it's easier for us to
Starting point is 00:06:34 create the best solution to ingest this data. We want to be open and flexible. Yeah, we want to be open and flexible. Of course, if there are 20 solutions, it's hard to be flexible; you then have to support a lot of things. But the more standardized things get in this open source space, like with trace context, the easier it is for vendors as a whole to support them. Very cool. So maybe, if you want to sum it up for people that are not able to see it live but maybe watch a recording later
Starting point is 00:07:12 on, can you give me, I mean, we already talked a lot about this, but two or three additional takeaways? What will people learn when they see the session? So the session is structured around three main topics. The first one: what is OpenTelemetry? Just in general, what is happening, what is the project, what is it, what is the next step. This is the part that Morgan from Google will do.
Starting point is 00:07:33 Then Daniel will talk about our contribution, to really understand what active part Dynatrace is taking in this project. And I'll round it up by talking about the use cases: where does OpenTelemetry provide value to our users? That's awesome. Really cool. Well, then I'll let you run.
Starting point is 00:07:48 I know we are already close to getting to the breakouts. So enjoy, and I'll see you later, OK? Thank you. Thank you, Andy. Welcome, everyone, back at Perform. It's another Pure Performance Cafe, that's what I call these sessions. And I bumped into another speaker, another colleague of mine. He's been with the company for how many years, Guido? It's almost 10 years now.
Starting point is 00:08:18 Almost 10 years. Wow. And we found, again, a beautiful spot here at the Cosmopolitan, next to a coffee shop. So please, folks, mind the background noise, or grab a coffee in case that is kind of motivating. So Guido, I just want to make sure people that don't have the chance to see your session live at Perform, because maybe they are in other rooms, but obviously we all know that you have the best session, so why would they be in other rooms? But maybe there are some people that are not here and want to watch the recording. What are people learning in your session? I just want to read out the title,
Starting point is 00:08:54 Going Web Scale with Cross-Environment Features, Globally Distributed High Availability and More. So what is this all about? Yeah, thanks for giving me a chance to talk about this. It was a great session, and it's really about questions that our customers ask themselves as they deploy Dynatrace at a very large scale. Some of the questions are: how do I enable my dozens of teams? How do I best split up a Dynatrace environment into manageable units so that each individual team can take full ownership of their configuration, can do whatever
Starting point is 00:09:33 they need to do to, you know, be productive and leverage Dynatrace in their everyday work and still maintain the end-to-end visibility. If I have a larger production problem, I need to be able to pinpoint it and I need to see it in context. I need to understand what's the impact of a problem as well as what's the root cause, even if that spans many different environments, many different systems. If it spans hundreds of hosts and dozens of teams,
Starting point is 00:10:03 I need to be able to connect this together. And we're showing you some of the best practices on how to go about this, to make sure you can actually leverage Dynatrace productively and still be in a position to pinpoint a problem that happens in production. So, from what I hear now, I assume this is then a lot about tagging,
Starting point is 00:10:24 it's a lot about management zones. Management zones, host groups, you name it. Also, the big question always is: should you have a single large environment, or should you have multiple smaller environments? And if multiple smaller environments, how do you best connect them with each other to make sure you maintain the end-to-end visibility? Another big topic that we've discussed is that with larger and larger installations, many of our customers say Dynatrace for them is like a tier one application. It needs to maintain the highest availability standards, and many of our Managed customers have asked us, hey, how can I make sure that Dynatrace is resilient to data center failures? And of course, Dynatrace Managed has high availability built in, right?
Starting point is 00:11:13 If you have a three-node cluster and one fails, you're still good, right? But if you have two data centers and you want to make sure that your Dynatrace installation survives the failure of either data center, then until very recently we didn't have a very convincing answer for that. Yes, you could take a backup and you could start up a failover cluster in the other data center. But from an RPO and RTO perspective, so recovery point objective, how much data do you lose, and recovery time objective, how quickly can you start again, those answers weren't very good. So we've really invested a lot over the past year, year and a half almost, to get this problem resolved and deliver what we call a
Starting point is 00:12:00 turnkey solution with Premium High Availability, so that you can install a single Dynatrace Managed cluster across two data centers in a fully active-active fashion, so that if one data center fails, everything will continue smoothly and seamlessly. So you have essentially a recovery time objective of zero; you don't lose any monitoring in an outage scenario. And you also don't lose any data, right? Because all data points are synchronized across both data centers. It also allows you to best leverage the hardware, because all the nodes are always active in both data centers, and if one of them fails, our automatic
Starting point is 00:12:49 load reduction would automatically kick in and reduce the transactional load that Dynatrace needs to process, and still give you full visibility into your production problems. Well, that's, I would say, pretty amazing. And now I understand why you think everybody's coming to your session anyway, right off the bat. But is this session then mainly targeted at Managed customers? Not at all. We have a lot of very large SaaS customers, and one of the common challenges that all of our customers face is that of enabling their teams. Mostly what we see with our customers is there is a central team that owns, in a sense, the Dynatrace installation.
Starting point is 00:13:36 And they are often challenged with enabling dozens of other teams, right? So we're also showing some best practices on how to help your teams get up to speed, what you can do in terms of making sure they can quickly start with Dynatrace, making it easy to get the word out, and actually helping them go from crawl to walk to run very quickly, and we have some great examples about that as well. I have one more question for you. I know we obviously have a growing customer base
Starting point is 00:14:11 and customers themselves also grow. But then on the other side, we as Dynatrace, as the platform, are always getting faster and scaling better. So I think we have situations where customers may have started with two or three environments and are now consolidating. Is this also a use case that you are going to address and discuss, like how you can go from, let's say, three Dynatrace environments that you had in the past for whatever reason,
Starting point is 00:14:34 maybe organizationally, but maybe also because previously they didn't know how to do it in a different way, now to go back to, let's say, one large environment. Is this something you cover? Generally, overall, the guidance that we try to give is: make sure your Dynatrace environments are of manageable size, right? And make sure you can split them. Because if you have smaller Dynatrace environments, it gives you a clearer sense of ownership, right? And it allows you to make sure that your teams are fully in control of this environment. They own all the configuration there. They can do whatever they need to do. And then we have all the cross-environment and cross-cluster functionality to tie those environments together, so you can still maintain the end-to-end visibility.
Starting point is 00:15:23 So rather than suggesting to our customers, you know, let's put everything in the same environment, if you have two or three let's consolidate, we say: yes, let's maintain this end-to-end visibility, and we have a lot of capabilities, and we're showing this also in the session, where customers can really make sure they understand problems end-to-end, across environments, across clusters, even across Managed and SaaS. So if you have a Managed cluster and a SaaS environment, you can connect those together as well.
Starting point is 00:15:55 That's awesome, because I think I hear this a lot from customers. They have certain groups that, for whatever reason, have to have Managed, but then there are others that just start with SaaS, because that's perfect for their type of application or their type of environment. And now they end up with a mixed environment, basically. And obviously with Dynatrace, we support visibility across all of these types of environments.
Starting point is 00:16:16 That's really, really critical for us. It enables us to scale better and really make sure that everyone, every team, is fully in charge of their Dynatrace environment. So yes, that's critical for us. Very cool. Well, Guido, I hope everybody's obviously watching your session live. In case not, hopefully they'll watch the recording. Folks, remember, the session is called Going Web Scale with Cross-Environment Features, Globally Distributed High Availability and More.
Starting point is 00:16:44 Yes, it is a long title, but there's also a lot of great content in there. Thank you, Guido. Thanks, Andy. Bye-bye. Welcome, everyone, to another episode of Pure Performance Cafe. We're still in Vegas, here at Perform 2020 in the Cosmopolitan, and I just bumped into another colleague of mine. Mike, how are you?
Starting point is 00:17:11 I am doing great. There you go. Perfect. I know it's been, I mean, I always love Perform because there's a lot of energy, but it's also very busy, right? There's a lot of stuff we inhale. I just enjoy the fact that we get to share our ideas, but also consume a lot of new ideas from customers. How about you? Oh, I concur. So, you know, a lot of what we're going to be talking about in my session today
Starting point is 00:17:35 is some of the great work that Christian at ERT has done to kind of help implement the ideas around building resiliency into their pipelines. That's awesome. So actually, yeah, I wanted to talk about your session. You just mentioned Christian, Christian Heckelmann from ERT, a German-based company. And I think the session, I'll just read it out here, is Build Resiliency into Your Continuous Delivery Pipeline with AI and Automation. A very, I think, lofty title, with a lot of cool terms in it, obviously. I wonder who
Starting point is 00:18:08 came up with that. But so, Mike, unfortunately not everybody's going to be at Perform and able to join your session. What are people going to learn in case they make it live, or in case they're going to watch the recording? What are you going to teach them? Yep. So there are a couple of really key takeaways, in my opinion, from my session. A lot of it is going to be a pretty deep and technical dive into some of the APIs that Dynatrace is offering. But the goals of the session are a couple of key topics that I've been talking about for quite some time. And the first thing is ensuring that your monitoring solution obviously performs. So we're talking about Dynatrace.
Starting point is 00:18:54 It's ensuring that Dynatrace is aware that the deployment events are taking place, right? So I'm going to be talking a little bit about, you know, what are the payloads for deployment events that will make it into Dynatrace, which is really important because then Davis, the root cause analysis engine, will include those events in that data in terms of possibly detecting a problem with that most recent deployment. So that's something that I find really valuable, and it has relatively low risk to include that in your pipeline.
Starting point is 00:19:27 So every time I talk to people about this, I say, look, you just need to do this, period. Just do it. It's low risk, high reward. The next thing then is quality gates. So you and I, Andy, have been talking about quality gates one way or another for, what, like six years now? At least. We've been telling this story for a really, really long time. We had some really catchy slides about that way back in the day.
Starting point is 00:19:57 If folks are interested, they can probably look those up. I looked it up because I prepared for my presentation, and it was 2011 when I used, remember, the table view with the three builds. So it was 2011 when we first talked about metrics-driven quality gates. Yep. And, well, we had another slide that was a little bit more cartoony, and it had folks kind of shoveling a nameless brown substance into a pipeline, and then that same nameless brown substance was coming out the end. That was a slide that I delivered main stage at CF Summit several years ago, and it actually got retweeted by some big names at that point in time. So it was a really catchy topic. But when we talk about quality gates, it's really about ensuring that your pipelines are able to take a set of data and validate that data. And what we're doing now is we see the rise of SRE, and the SRE principles as espoused by Google, as the way to operate your environments as an engineer would.
Starting point is 00:21:06 So we talk about defining our service level objectives, which is how we want the system to behave, and then the indicators, which is how we measure that it's behaving that way, right? And we've implemented those in a pipeline. And what I'm kind of walking folks through here is, you know, what are the metrics that are interesting, and how do you interact with the Dynatrace time series API to do that?
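To give a feel for what such a gate boils down to, here is a simplified sketch that pulls a response-time SLI for a tagged service from the Dynatrace metrics query API and fails the pipeline stage when the SLO is breached. The metric selector, entity tag, threshold, and time window are illustrative assumptions, not the exact setup from the session, and the newer metrics query endpoint is used here rather than whichever time series API version your environment offers.

```python
# Illustrative quality gate: exit non-zero if the SLI breaks the SLO.
import sys
import requests

DT_TENANT = "https://<your-environment>.live.dynatrace.com"  # placeholder
DT_API_TOKEN = "<api-token-with-metrics-read-permission>"    # placeholder

SLO_RESPONSE_TIME_MS = 200.0  # objective: average response time stays under 200 ms

resp = requests.get(
    f"{DT_TENANT}/api/v2/metrics/query",
    params={
        "metricSelector": "builtin:service.response.time:avg",
        "entitySelector": 'type("SERVICE"),tag("app:carts")',  # hypothetical tag
        "from": "now-30m",                                      # the test window
    },
    headers={"Authorization": f"Api-Token {DT_API_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()

# The built-in service response time is reported in microseconds; convert to ms.
values = [
    v
    for series in resp.json()["result"][0]["data"]
    for v in series["values"]
    if v is not None
]
avg_ms = (sum(values) / len(values)) / 1000.0 if values else float("inf")

print(f"SLI avg response time: {avg_ms:.1f} ms (SLO: {SLO_RESPONSE_TIME_MS} ms)")
sys.exit(0 if avg_ms <= SLO_RESPONSE_TIME_MS else 1)  # non-zero fails the gate
```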
Starting point is 00:21:35 And then, you know, it's really great and easy to use now with the Keptn quality gates that the Keptn team has come up with. And one of the things that I'm personally most excited about, and perhaps you can even hear it as my voice goes up a little bit in pitch, is the custom service metrics. So with our new time series API and with custom service metrics, we can basically, utilizing headers in our performance test, mark the requests as coming from a performance test, and then validate just those requests as part of our pipeline. Because with the custom service metrics, we can go back and look at data just for requests with that request attribute.
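As an illustration of that header trick, here is a minimal sketch of a load-test driver that tags its own traffic so a Dynatrace request attribute can capture it and the gate can evaluate only test requests. The x-dynatrace-test header with LTN/TSN fields follows the commonly used Dynatrace load-testing tagging convention, but the URL, test names, and the request-attribute configuration itself are assumptions.

```python
# Illustrative load-test driver: mark every request as test traffic via a header.
import requests

LOAD_TEST_NAME = "checkout-soak-2020-02-06"  # hypothetical test run name

def tagged_get(url, test_step):
    headers = {
        # A request attribute in Dynatrace can be configured to capture this
        # header, so later analysis looks only at requests from this test run.
        "x-dynatrace-test": f"LTN={LOAD_TEST_NAME};TSN={test_step};"
    }
    return requests.get(url, headers=headers, timeout=10)

for _ in range(100):
    tagged_get("https://carts.example.com/health", test_step="health-check")
```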
Starting point is 00:22:18 And that is, I am really excited that we're able to do that now. Like legitimately excited, not just being an employee and like, you know, being excited about something. This is something that I'm legitimately excited about. Yeah, and I think the great thing about this, we're not only talking about metrics like response time or failure rate. We're also talking about these metrics that we've been preaching about a long, long time. Architectural metrics like the number of database calls that are executed, the number of service round trips, and all that stuff. Yeah. Even simple things like just
Starting point is 00:22:47 validating that the number of instances that we've asked for were actually delivered. Exactly. Those architectural validations are something that really take the insight that a solution like ours can provide and really help you build that much more resiliency in, making sure you catch the things that might have passed
Starting point is 00:23:08 the unit tests in reality might not pass. Very cool. You talk about quality gates, you talk about SLIs and SLOs, talk about the API. Christian from ERT, I know I've wrote a couple of blogs with him, thanks to all of his work. Can you quickly kind of give a glance on what he's going to talk about and what he's going to show?
Starting point is 00:23:34 Yeah. So a lot of what I'm talking about is theory. Christian is going to talk about what he's done in practice, and that's always really amazing. And I have to give the biggest shout-out I can possibly give to Christian: my HOT day that I did on Tuesday, a lot of the content in that HOT day was actually originally developed by Christian. So it actually saved me quite a bit of work having to develop something like that from scratch. Christian is going to be giving the examples from his pipelines of where he's implemented things like quality gates. But he's also going to be talking about something really, really cool. He built a
Starting point is 00:24:11 Kubernetes operator that will automatically create Dynatrace synthetic checks for services that are deployed in his Kubernetes cluster. And that is just the world's craziest thing. And again, when we start talking about like, you know, how do we do things like autonomous cloud and how do we eliminate, you know, human error, right? We talk about automation, right? And here we've actually taken that automation and built it into Kubernetes then at that point, right? So anytime somebody deploys a new service to that cluster, we're automatically going to have a synthetic check created
Starting point is 00:24:49 to validate that. Yeah, that's perfect because that immediately then alerts you in case a deployment actually didn't work as well. And then you can roll back, which means in the end, you have a more resilient,
Starting point is 00:24:59 stable system, which is just awesome. Very cool, Mike. Hey, I know you got around because the session starts probably very soon. Thank you so much. Very cool, Mike. Hey, I know you got around because the session starts probably very soon. Thank you so much. Let's get together later on at the end of the day and grab another
Starting point is 00:25:12 beverage because I think we both deserve it. For sure. Or, you know, water because it's Vegas and we're all going to be parched. So we're not going to be drinking water, right, Andy? Of course. That's what we tell our parents at home. All right. See you, Mike. All? Of course. That's what we tell our parents at home. Alright, see you, Mike.
Starting point is 00:25:27 Alright, thanks. Bye.
