PurePerformance - Perform2020 Andi on the Street: Web Scale, OpenTelemetry and Resiliency
Episode Date: February 6, 2020
Andi Grabner, our man-on-the-street, gets the scoop on:
- Going web-scale with cross-environment features, globally distributed high availability and more, with Guido Deinhammer
- The role of OpenTelemetry in Dynatrace, with Daniel Khan and Sonja Chevre
- Building resiliency into your continuous delivery pipeline, with Michael Villiger
Transcript
Coming to you from Dynatrace Perform in Las Vegas, it's Pure Performance!
Hi everybody and welcome to Pure Performance and PerfBytes, coming to you from Perform 2020 in Las Vegas.
Our man on the street, Andi Grabner, has sent us a few interviews, which we've compiled in this episode.
There's a short musical interlude between each, so be sure to stay tuned in. Take it away, Andy.
Welcome to another Pure Performance Cafe episode today, live from Vegas.
I know there's a little background noise here because people are running around, probably finding their next session. Speaking of sessions, I just bumped into Sonja and Daniel. Hey!
Hi Andi! Hi Andi! Hey, how are you? Good, good. Excited about the sessions.
Yeah, so let's actually talk about the sessions. I know we only have a couple of minutes because you need to rush over,
but you have a session coming up, and it's called The Role of Dynatrace in OpenTelemetry, right? I think last year
we already did a podcast, and with Sonja we also did a Performance Clinic.
Yeah, that was last year, in the fall. So, you know, you have heard
last year about lots of new terms coming up, like observability
and OpenTelemetry, which emerged from OpenTracing and OpenCensus. We also talked a lot about Trace
Context. A lot of these terms around distributed tracing and observability as a whole are coming
to the industry. Some of them I think we know, and some are new, like
OpenTelemetry, and we want to take the
opportunity to tell people what it is all about and how Dynatrace plays a role in it. And Daniel is
really interesting because he has been working with us. He's interesting, of course.
Daniel is the most interesting person. One of the most. You may be the second.
All right, back to the topic. What's happening here because of the contribution that we made to OpenTelemetry? Maybe you want to say a little bit about that.
Yeah, what people are probably most surprised about is that we are among the top contributors
to OpenTelemetry as a vendor. So we are heavily investing in that whole
topic; we have a whole team sitting on it and collaborating with the community. There
are a few things that are good for Dynatrace in doing so. First of all, we learn a lot:
how we collaborate with other companies,
how decisions are made, et cetera.
So the whole open source idea helps a lot.
But also, as Dynatrace, we want at some point to
utilize what OpenTelemetry produces. And as a vendor with over 10
years of experience in monitoring, our knowledge is
pretty much battle-tested. There are years and years of support issues and
things we figured out along the way that are built
into what Dynatrace does. So we want to have, or are striving to have, at least the
basic stability that we would expect from Dynatrace there. And one important point, of
course, and maybe Sonja wants to talk about it, is that OpenTelemetry is purely about data collection.
We're not talking about how the data is then displayed.
That's a really, really important one, but it's also important to note that it's not OpenTelemetry
versus OneAgent.
You know, people start to hear OpenTelemetry and ask: is it going to be something newer than
OneAgent, is it going to replace it? We see OpenTelemetry as an additional tool that we
have at Dynatrace, a possibility, an opportunity to collect additional data
for specific use cases. So it's not OneAgent or Dynatrace versus
OpenTelemetry; everybody is working together to deliver the best value for
customers. Yeah, in the end it should not matter to a customer what underlying agent technology is used.
We as Dynatrace also decide that. For instance, for serverless environments we are
using OpenTelemetry, just because in serverless you don't need as much as what
OneAgent provides. OneAgent provides a lot more than OpenTelemetry does, but for a pure
serverless function, where you don't need deep code-level visibility, for
instance, or CPU stats of a function that runs just a few milliseconds, you
can just go for a simpler approach. This is what we're working on.
I think you put it nicely: OpenTelemetry is about data collection,
that's clear. So that means it can be used by people where
maybe OneAgent is too powerful, or where OneAgent is just not a
fit. You can then still use OpenTelemetry to ingest data into Dynatrace,
where all the analytics obviously come into play.
Now you also mentioned there's other companies, obviously part of that.
Are you presenting by yourself?
Do you have anybody with you at this show?
So Morgan McLean from Google will join us for this session.
And he was one of the original product managers at Google for OpenCensus.
And OpenCensus was that one project that
was merged with OpenTracing to form OpenTelemetry.
Perfect. So you have Morgan with you.
Yeah, he'll explain to our users what OpenTelemetry is all about, what the roadmap is,
what is the long-term vision.
And I have to say that working with Google and Microsoft and others is really, really valuable. It's also a lot of fun;
we really enjoy collaborating with those companies. And Google, for example,
builds OpenTelemetry into their Stackdriver logic,
and also Trace Context, which we talked about before. These are also
opportunities to open up Dynatrace more. When there is a common standard to do things, and we
cannot always assume that there is a OneAgent in place, then it's easy for us to
create the best solution to ingest this data. We want to be open and flexible.
Yeah, we want to be open and flexible. Of course, if there are 20 solutions, it's
hard to be flexible; you then have to support a lot of things.
But the more standardized things get in this open source space,
like with Trace Context, the easier it is for vendors as a whole to support
just these things. Very cool.
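The Trace Context standard mentioned here is the W3C `traceparent` HTTP header, which is what lets one vendor's agent continue a trace started by another. A stdlib-only sketch of building and parsing one, following the field layout of the W3C spec (version, 16-byte trace-id, 8-byte parent-id, flags):

```python
import os

def make_traceparent() -> str:
    """Build a W3C traceparent header: version-traceid-parentid-flags."""
    trace_id = os.urandom(16).hex()   # 32 hex chars, identifies the whole trace
    parent_id = os.urandom(8).hex()   # 16 hex chars, identifies this span
    return f"00-{trace_id}-{parent_id}-01"   # 01 = sampled flag

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "sampled": flags == "01"}

header = make_traceparent()
print(parse_traceparent(header))
```

Because every participating agent reads and forwards this one header, distributed traces survive hops between services instrumented by entirely different tools.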
So maybe if you want to sum it up for people that are not
able to see it live, but maybe that watch a recording later
on, can you give me, I mean, we already talked a lot about this,
but two or three additional takeaways,
what will people learn when they see the session?
So the session is structured around three main topics.
First one: what is OpenTelemetry?
Just the basics: what is happening, what is the project,
what it is, what is the next step.
This is the part that Morgan from Google will do.
Then Daniel will talk about our contribution,
so people really understand what active part Dynatrace
is taking in this project.
And I'll round up by talking about the use cases:
where does OpenTelemetry provide value to our users?
That's awesome.
Really cool.
Well, then I'll let you run.
I know we are already close to getting to the breakouts.
So enjoy, and I'll see you later, OK?
Thank you.
Thank you, Andy.
Welcome, everyone. Back at Perform. It's another Pure Performance Cafe; that's what I call these sessions.
And I bumped into another speaker, another colleague of mine.
He's been with the company for how many years Guido?
It's almost 10 years now.
Almost 10 years.
Wow.
And we found again a beautiful spot here at the Cosmopolitan next to a coffee shop.
So please, folks, mind the background noise or grab a coffee in case that is kind of motivating.
So Guido, I just want to make sure people know what to expect, people that did not have the chance to see your session live at Perform
because maybe they are in other rooms. But obviously we all know that you have the best session, so why would they be in other rooms? Still, maybe there are some people that are not here and they want to see the
recording. What are people learning in your session? I just want to read out the title:
Going Web Scale with Cross-Environment Features, Globally Distributed High Availability and More.
So what is this all about? Yeah, thanks for giving me a chance to talk about this.
It was a great session.
And it's really about questions that our customers ask themselves
as they deploy Dynatrace at a very large scale.
Some of the questions are: how do I enable my dozens of teams?
How do I best split up a Dynatrace environment into manageable units so
that each individual team can take full ownership of their configuration, can do whatever
they need to do to be productive and leverage Dynatrace in their everyday work,
and still maintain end-to-end visibility? If I have a larger production problem, I need to be able to pinpoint it
and I need to see it in context.
I need to understand what's the impact of a problem
as well as what's the root cause,
even if that spans many different environments,
many different systems.
If it spans hundreds of hosts and dozens of teams,
I need to be able to connect this together.
And we're showing you some of the best practices
on how to go about this
to make sure you can actually leverage Dynatrace productively
and still be in a position to pinpoint
a problem that happens in production.
So, from what I hear now,
I assume this is then a lot about tagging,
and a lot about management zones.
Management zones, host groups, you name it.
Also, the big question always is: should you have a single large environment, or multiple smaller environments?
And if multiple smaller environments, how do you best connect them with each other to make sure you maintain end-to-end visibility? Another big topic that we discuss is that, with larger and larger installations, many
of our customers say Dynatrace for them is like a tier one application.
It needs to maintain the highest availability standards and many of our managed customers
have asked us, hey, how can I make sure that Dynatrace is resilient to data center failures?
And of course, Dynatrace managed has high availability built in, right?
If you have a three node cluster, one fails, you're still good, right?
But if you have two data centers and you want to make sure that your Dynatrace installation survives the failure of either data center,
then until very recently, we didn't have a very convincing answer for that.
Yes, you could take a backup and start up a failover cluster in the other data center.
But from an RPO and RTO perspective (recovery point objective: how much data do
you lose; recovery time objective: how quickly can you start again), those answers
weren't very good. So we've really invested a lot over the past year,
almost a year and a half, to get this problem resolved and deliver what we call a
turnkey solution with premium high availability, so that you can install a single Dynatrace
Managed cluster across two data centers in a fully active-active fashion. If one
data center fails, everything will continue smoothly and seamlessly.
So you have essentially a recovery time objective of zero: you don't lose any
monitoring in an outage scenario. And you also don't lose any data,
because all data points are synchronized across both data centers.
It also allows you to best leverage the hardware, because all the nodes are always active in both data centers. And if one of them fails, our automatic
load reduction would kick in and reduce the transactional load that
Dynatrace needs to process, while still giving you full visibility into your
production problems. Well, it's pretty amazing, I would say. And now I
understand why you think everybody's coming to your
session anyway. But is this session then mainly targeted at managed
customers? Not at all. We have a lot of very large SaaS customers, and one of the
common challenges that all of our customers face is that of enabling their teams.
Mostly what we see with our customers is that there is a central team that owns, in a sense, the Dynatrace installation.
And they are often challenged with enabling dozens of other teams, right?
So we are also showing some best practices on how to
help your teams get up to speed: what you can do to make sure
they can quickly start with Dynatrace, make it easy to get the word out, and
actually help them go from crawl to walk to run very quickly. And we have
some great examples about that as well.
I have one more question for you.
I know we have obviously a growing customer base
and customers themselves also grow.
But then on the other side, we also, as Dynatrace,
as the platform, we're always getting faster
and we scale better.
So I think we have situations where customers
may start with two or three environments
and are now consolidating.
Is this also a use case that you are going to address and discuss: how can you go from, let's say, three Dynatrace environments that you had in the past for whatever reason,
maybe organizationally, or maybe because they previously didn't know how to do it in a different
way, back to, let's say, one large environment? Is this something you cover? Generally, the overall guidance that we try to give is: make sure your Dynatrace environments are of manageable size, right?
And make sure you can split them.
Because if you have smaller Dynatrace environments, it gives you a clearer sense of ownership, right?
And it allows you to make sure that your teams are fully in control of this environment.
They own all the configuration there.
They can do whatever they need to do.
And then we have all the cross-environment and cross-cluster functionality to tie those environments together so you can still maintain the end-to-end visibility.
So rather than suggesting to our customers, 'let's put everything in the same
environment; if you have two or three, let's consolidate,' we say: yes, let's
maintain this end-to-end visibility. We have a lot of capabilities, and we're
showing this in the session too, where customers can really make
sure they understand problems end-to-end,
across environments, across clusters, even across managed and SaaS.
So if you have a managed cluster and a SaaS environment,
you can connect those together as well.
That's awesome, because I think I hear this a lot from customers.
They have certain groups that, for whatever reason, have to have managed,
but then there's others that just start with SaaS,
because that's perfect for their type of application
or their type of environment.
And now they end up with a mixed environment, basically.
And obviously, with Dynatrace,
we support visibility across all of these types of environments.
That's really, really critical for us.
It enables us to scale better
and really make sure that everyone,
every team is fully in charge
of their Dynatrace environment. So yes, that's critical for us.
Very cool. Well, Guido, everybody's obviously watching your session live.
In case not, hopefully they'll watch the recording. Folks, remember, the session is called
Going Web Scale with Cross-Environment Features, Globally Distributed High Availability and More.
Yes, it is a long title, but there's also a lot of great content in there.
Thank you, Guido.
Thanks, Andy.
Bye-bye.
Welcome, everyone, to another episode of Pure Performance Cafe.
We're still in Vegas, here at Perform 2020 in the Cosmopolitan,
and I just bumped into another colleague of mine.
Mike, how are you?
I am doing great.
There you go. Perfect.
I know it's been, I mean, I always love Perform because there's a lot of energy,
but it's also very busy, right?
There's a lot of stuff we inhale.
I just enjoy the fact that we get to share
our ideas, but also consume a lot of new ideas from customers. How about you?
Oh, I concur. So, you know, a lot of what we're going to be talking about in my session today
is some of the great work that Christian at ERT has done to kind of help implement the ideas
around building resiliency into their pipelines.
That's awesome.
So actually, yeah, I wanted to talk about your session.
So you just mentioned Christian, Christian Heckelmann from ERT, a German-based company.
And I think the session, I'll just read it out here, build resiliency into your continuous
delivery pipeline with AI and automation.
A lot of cool terms in that title, obviously; I wonder who
came up with that. But, Mike, unfortunately not everybody at Perform is
going to be able to join your session. What are people going to learn, in case they make it live
or watch the recording? What are you going to teach them? Yep. So there are a couple of really key takeaways from my session, in my opinion. A
lot of it is going to be a pretty deep and technical dive into some of the APIs that
Dynatrace is offering. But the goals of the session are a couple of key topics that I've been talking about for quite some time.
And the first thing is ensuring
that your monitoring solution, so we're talking about Dynatrace,
is aware
that deployment events are taking place, right?
So I'm going to be talking a little bit about
what the payloads look like for deployment events
that make it into Dynatrace, which
is really important, because then Davis, the root cause analysis engine, will include those
events when possibly detecting a problem with that most recent deployment.
So that's something that I find really valuable, and it has relatively low risk to include that in your pipeline.
So every time I talk to people about this, I say, look, you just need to do this period.
Just do it. It's low risk, high reward.
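As a rough illustration of the payload idea, here is a sketch of building a deployment event for the Dynatrace events API (`POST /api/v1/events` with an API token). The `CUSTOM_DEPLOYMENT` event type and the field names below exist in that API, but the tag rule, service tag, and URLs are made-up examples, so the exact schema should be checked against the API reference:

```python
import json

def deployment_event(service_tag: str, version: str, ci_url: str) -> dict:
    """Build a CUSTOM_DEPLOYMENT payload for the Dynatrace events API.

    The attachRules tag and the URLs are hypothetical examples; the real
    payload would be sent as JSON via POST /api/v1/events.
    """
    return {
        "eventType": "CUSTOM_DEPLOYMENT",
        "attachRules": {
            "tagRule": [{
                "meTypes": ["SERVICE"],
                "tags": [{"context": "CONTEXTLESS", "key": service_tag}],
            }]
        },
        "deploymentName": f"Deploy {version}",
        "deploymentVersion": version,
        "source": "Jenkins",          # illustrative CI system
        "ciBackLink": ci_url,         # link back to the pipeline run
    }

payload = deployment_event("frontend", "1.4.2", "https://ci.example.com/job/42")
print(json.dumps(payload, indent=2))
```

Posting an event like this from the pipeline is exactly the "low risk, high reward" step described: Davis can then correlate any detected problem with the deployment that preceded it.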
The next thing then is quality gates.
So you and I, Andi, have been talking about quality gates one way or another for, what, like six
years now?
At least.
We've been telling this story for a really, really long time.
We had some really catchy slides about that way back in the day.
If folks are interested, they can probably look those up.
I looked it up because I prepared for my presentation: it was 2011 when I used, remember, the table view with the three builds.
So it was 2011 when we first talked about metrics-driven quality gates.
Yep. And, well, we had another slide that was a little bit more cartoony, and it had folks kind of shoveling a nameless brown substance into a pipeline, and then that same nameless brown substance was coming out the end.
And that was a slide that I had delivered main stage at CF Summit several years ago, and it actually got retweeted by some big names at that point in time.
So it was a really catchy topic. But when we talk about quality gates, it's really ensuring that your pipelines are able
to take a set of data and validate that data. And what we're doing now is we see the rise
of SRE and the SRE principles as espoused by Google as the way to operate your environments as an engineer would.
So we talk about defining our service level objectives,
which describe how we want the system to behave,
and then the indicators,
which are how we measure that it's behaving that way, right?
And we've implemented those in a pipeline.
And what I'm kind of walking folks through here is,
you know, what are the metrics that are interesting?
How do you interact with the Dynatrace Timeseries API to do that?
And then, you know, it's really great and easy to use now with the Keptn Quality Gates that the Keptn team has come up with.
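The quality-gate idea itself boils down to comparing measured SLI values against SLO thresholds and failing the pipeline run when they are breached. A Keptn-style pass/warn/fail evaluation could be sketched like this; the metric names and thresholds are made up for illustration:

```python
def evaluate_gate(slis: dict, slos: dict) -> str:
    """Return 'pass', 'warn', or 'fail' for one pipeline run.

    slos maps metric name -> (warn_limit, fail_limit); lower is better.
    """
    result = "pass"
    for metric, (warn, fail) in slos.items():
        value = slis[metric]
        if value > fail:
            return "fail"          # any hard breach fails the gate
        if value > warn:
            result = "warn"        # soft breach degrades the result
    return result

# Illustrative SLOs: response time in ms, failure rate in percent.
slos = {"response_time_p95": (200, 500), "failure_rate": (1.0, 5.0)}
print(evaluate_gate({"response_time_p95": 180, "failure_rate": 0.2}, slos))  # pass
print(evaluate_gate({"response_time_p95": 350, "failure_rate": 0.2}, slos))  # warn
print(evaluate_gate({"response_time_p95": 180, "failure_rate": 7.5}, slos))  # fail
```

In practice the SLI values would come from a metrics API query scoped to the test window, and the gate result decides whether the build is promoted.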
And one of the things that I'm personally most excited about, and perhaps you can even hear
it as my voice goes up a little bit in pitch, is custom service metrics. So with our new
Timeseries API and with custom service metrics, we can basically use
headers in our performance tests to mark the requests as coming from a performance test,
and then validate just those requests as part of our pipeline.
Because with the custom service metrics,
we can go back and look at data just for requests with that request attribute.
And that is, I am really excited that we're able to do that now.
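The mechanism described, marking load-test traffic so a request attribute can slice metrics down to just those requests, is essentially adding a known header in the test driver. A hedged sketch: the header name `x-load-test` and the value format are made up here, and a Dynatrace request attribute would then be configured to capture that header:

```python
def tag_request(headers: dict, test_name: str, step: str) -> dict:
    """Return a copy of request headers marked as load-test traffic.

    'x-load-test' is an illustrative header name; a request attribute
    configured on it lets quality gates evaluate only tagged requests.
    """
    tagged = dict(headers)
    tagged["x-load-test"] = f"{test_name};step={step}"
    return tagged

def is_test_traffic(headers: dict) -> bool:
    """Filter predicate: would this request be picked up by the attribute?"""
    return "x-load-test" in headers

h = tag_request({"accept": "application/json"}, "checkout-soak", "add-to-cart")
print(h["x-load-test"])   # checkout-soak;step=add-to-cart
```

Every request the load generator sends gets the header, so the quality gate can query metrics filtered to exactly this test run and test step, untouched by real user traffic.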
Like legitimately excited, not just being an employee
and like, you know, being excited about something. This is something that
I'm legitimately excited about. Yeah, and I think the great thing about this, we're not only
talking about metrics like response time or failure rate. We're also talking about these
metrics that we've been preaching about a long, long time. Architectural metrics
like the number of database calls that are executed, the number of service round trips,
and all that stuff. Yeah. Even simple things like just
validating that the number of instances that
we've asked for were actually delivered.
Exactly.
Those architectural validations are something that really take
the insight that a solution like ours can
provide and really help you build
that much more resiliency, by making sure
you catch the things that might have passed
the unit tests but in reality would not pass.
Very cool. You talk about quality gates,
you talk about SLIs and SLOs,
talk about the API.
Christian from ERT:
I know I've written a couple of blogs with him,
thanks to all of his work.
Can you quickly kind of give a glance on what he's going to talk about and what he's going to show?
Yeah. So a lot of what I'm talking about is theory; Christian is going to talk about what
he's done in practice. And that's always really amazing. And I have to give the biggest
shout-out I can possibly give to Christian: a lot of the content in
my HOT day (hands-on training) that I did on Tuesday was actually
originally developed by Christian. So it saved me quite a bit of work having to
develop something like that from scratch. Christian is going to
be giving examples from his pipelines of where he's implemented things like
quality gates. But he's also going to be talking about something really, really cool. He built a
Kubernetes operator that will automatically create Dynatrace synthetic checks for services that are
deployed in his Kubernetes cluster. And that is just the world's craziest thing. And again, when we talk
about things like autonomous cloud and how we eliminate
human error, we talk about automation. And here we've actually
taken that automation and built it into Kubernetes itself: anytime somebody
deploys a new service to that cluster,
we're automatically going to have
a synthetic check created
to validate it.
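The operator pattern Mike describes is, at its core, a reconcile loop: compare the services that exist in the cluster against the synthetic monitors that exist in Dynatrace, and create whatever is missing. Here is a pure sketch of just that diff step; a real operator would watch the Kubernetes API and POST each spec to the Dynatrace synthetic-monitors API, and the naming scheme and health-check path below are assumptions:

```python
def monitors_to_create(services: list, existing_monitors: list) -> list:
    """Return synthetic-monitor specs for services that lack one.

    A real operator would list Services via the Kubernetes API and
    create each missing monitor through the Dynatrace API.
    """
    covered = {m["name"] for m in existing_monitors}
    return [
        {"name": f"check-{svc['name']}",
         "type": "HTTP",
         # Assumed in-cluster health endpoint for illustration only.
         "url": f"http://{svc['name']}.{svc['namespace']}.svc/healthz"}
        for svc in services
        if f"check-{svc['name']}" not in covered
    ]

services = [{"name": "cart", "namespace": "shop"},
            {"name": "orders", "namespace": "shop"}]
existing = [{"name": "check-cart"}]
print(monitors_to_create(services, existing))  # only the 'orders' check is missing
```

Running this reconciliation on every service event is what makes the check creation fully automatic, with no human in the loop to forget a newly deployed service.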
Yeah, that's perfect
because that immediately then
alerts you in case a deployment
actually didn't work as well.
And then you can roll back,
which means in the end,
you have a more resilient,
stable system,
which is just awesome.
Very cool, Mike.
Hey, I know you've got to run
because the session starts probably very soon. Thank you so
much. Let's get together
later on at the end of the day and grab another
beverage, because I think we both deserve it.
For sure. Or, you know, water,
because it's Vegas and we're all going to be parched.
So we're not going to be drinking
water, right, Andi? Of course.
That's what we tell our parents
at home.
All right. See you, Mike.
All right, thanks. Bye.