PurePerformance - Understanding Distributed Tracing, Trace Context, OpenCensus, OpenTracing & OpenTelemetry
Episode Date: September 16, 2019
Did you know that Distributed Tracing has been around for much longer than the recent buzz? Do you know the history and future of OpenCensus, OpenTracing, OpenTelemetry and Trace Context? Listen to this podcast where we chat with Sonja Chevre, Technical Product Manager at Dynatrace, and Daniel Khan, Technical Evangelist at Dynatrace, about the past, current and future state of distributed tracing as a standard.
OpenTelemetry: https://opentelemetry.io/
OpenCensus: https://opencensus.io/
OpenTracing: https://opentracing.io/
Trace Context: https://www.dynatrace.com/news/blog/distributed-tracing-with-w3c-trace-context-for-improved-end-to-end-visibility-eap/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Welcome to another episode of a special Pure Performance.
I'm not sure if it's a regular Pure Performance or a Pure Performance Cafe.
We have the cafes which are kind of like more technical focused with the engineering team. And I have a mixture sitting here.
Well, I have engineers sitting here.
But I actually start on my left.
Sonja.
Hi, Andy.
Hi.
I think we had the opportunity to talk in the past.
Yes, yes that's true. On a different topic. Different topic, yeah. Maybe you want to quickly introduce yourself?
I'm a Technical Product Manager at Dynatrace. I have been working at Dynatrace for four or five years.
I don't remember exactly. And now my main topic is distributed tracing and observability.
Cool. So obviously these are the topics that will be relevant for today's talk, because we're going to talk about all the things that are happening in the community around OpenTracing, OpenTelemetry, OpenCensus, and all these terminologies. It's a really dynamic environment right now. Yeah, and you are on the product management side at Dynatrace,
meaning trying to figure out how Dynatrace can contribute
to that open tracing community or to that distributed tracing community.
Exactly, how we can help with the projects to make sure that they fit for us once we want to integrate and support them within Dynatrace for our customers' use cases.
Very cool.
And then on the other side, a little further to the right from my perspective, sits, who
is that?
Hi, I'm Daniel.
Hi, Daniel.
So, Daniel, maybe a quick introduction for you as well, for people that don't know you
yet.
Yeah, I'm lead technology strategist at Dynatrace, and I've been at Dynatrace about the same amount of time as Sonja has.
And I'm responsible for everything open source that we do at Dynatrace.
And also I'm on the W3C working group for Trace Context and also participating in a few projects, open source projects.
And more or less I'm responsible for our community efforts.
So how are we doing open source? This also involves, of course, the open observability space.
Perfect. So that means on the one side we have the product team, you know, implementing things in the product that will obviously help our customers. On the other side, Daniel, the connection to the larger open source community, and really driving the efforts around open tracing and traceability. So for me as a, let's say, developer, architect,
if I want to start with a new project in the cloud native space,
I am overwhelmed with a lot of terminology that is floating around.
And I think I want to get started with maybe a level setting, explaining to everyone all the terminology that is floating around that I have to be aware of.
So for me, Sonja, you said earlier open,
I mean, distributed tracing, right?
I think distributed tracing should be clear,
but for those where it's not clear,
can you give us a little overview
because you're also the PM for distributed tracing.
What does this really mean?
So distributed tracing is about connecting
all your services together to be able to have
one end-to-end view on what's happening with one request.
You know, one request typically starts in the browser or in a mobile app, and then you have the click that goes to the first server, or maybe to a web server, and then to a service and another service. In a microservices environment, for one request you have a high number of services that are involved. In the past, it was less interconnected: you had one monolith, and the request wasn't that big, so it stayed within one big component. Now, with microservices, one request can go to a high number of services, and people, especially when troubleshooting and debugging, looking at the environment, are really confused: where did my request go, where did the error come from? And distributed tracing is the tool that helps us architects and developers understand how services are connected, which ones were involved in my request, how long it took in each service, and to be able to do some meaningful analysis.
And is distributed tracing something completely new that just came up in the last year or
two?
No, actually it's been around for like the last 10 or 15 years.
Actually, at Dynatrace we have a patent on it, and I think it was started, Alois tweeted about that some time ago on Twitter, two years before the Dapper paper from Google.
So that's another reference in the distributed tracing world. For like the last decade we have been doing distributed tracing, but now it's really more in focus and more visible due to microservices. You know, before, developers would just develop their code in the monolith application, but now, with multiple microservices, debugging within one process is not enough; they need to have a view across services, and that's why recently it has become more popular.
Cool, but it's been around for a while, so it's nothing new. We've been doing it for probably 15 years now, since we've been around.
The new thing is the standardization effort that's going on, for example with the W3C working group Daniel is working on.
So maybe that's a good segue over.
So distributed tracing is the overarching term.
Now there's a lot of different standards around.
I mean, there's OpenCensus, OpenTracing, OpenTelemetry, Trace Context.
There's a lot of terminology that is floating around when I do my research.
Can you maybe give us a little background on what all these terms are and what's really relevant now?
Yeah, that's actually a huge question. If we start with Trace Context: as Sonja said before, if we want to trace transactions end-to-end, we need a way to pass some ID, some transaction ID, between all those tiers.
Until now, each vendor, each project had their own way of passing this ID.
Mostly it was a header, but it was not clear which header this should be.
And W3C Trace Context establishes a dedicated header format for passing on tracing information between different tiers. First of all, one good thing is that we can hope that infrastructure providers like cloud providers or proxies or firewalls will then not remove this header when it passes through them. So the hope is that when there is a standard, this header will be passed on between tiers. That's the first thing.
The second thing is that this header has a second field that will also contain some information about who participated in this trace, and that is also valuable information when it comes to interoperability between tools. So, you know, maybe here is Dynatrace, then there was something in Azure, and here is Dynatrace again, and you can then maybe use the same trace ID to look up the same trace in Azure. So you can more or less weave in, merge, different traces from different vendors. That's one thing.
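For reference, here is a minimal sketch of the two headers the W3C Trace Context spec defines; the example IDs and vendor entries below are the ones used in the spec itself.

```python
# traceparent: version - trace-id (16 bytes) - parent-id (8 bytes) - flags,
# all lowercase hex, joined by dashes. Example values from the W3C spec.
traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

version, trace_id, parent_id, flags = traceparent.split("-")
assert len(trace_id) == 32 and len(parent_id) == 16

# tracestate is the second field Daniel mentions: vendor-specific key=value
# entries recording who participated in the trace (examples from the spec).
tracestate = "congo=t61rcWkgMzE,rojo=00f067aa0ba902b7"
```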
The other things you mentioned are different projects. These are not standards; these are basically projects that deal with, or try to solve, the problem of tracing applications, of instrumenting applications so that they collect traces and pass on this header. Because this doesn't come for free, of course: you have to instrument your code somehow. OpenCensus is a project by Google; they had, or have, this solution. Then there is OpenTracing; that's a project started by LightStep, but it's a CNCF project. So CNCF, the Cloud Native Computing Foundation? Yeah, it's a Cloud Native Computing Foundation project.
And those two projects now merged to one,
or are merging to one project
that's called OpenTelemetry.
Okay.
And this should be then best of both worlds, basically.
So that's the whole idea of that.
Yeah, so if I kind of sum it up, Trace Context is kind of a definition of how to pass trace information from one tier to another, and of what information is passed. It solves some of the issues that we as a vendor have been facing: the trace context that we used was an HTTP header, and it has been constantly removed by tools like proxies,
and so we had to go in and configure it.
So now the belief is, or the hope is,
once this is a standard, then the firewalls, the proxies,
all these things in the middle will pass it along.
Plus, if the cloud vendors and maybe other vendors
are also sending this information along, then it will be easier to do end-to-end tracing even though you
may not have your, let's say, monitoring tool installed on every tier. For our users, the first thing is really making sure you get the end-to-end transaction: no broken transactions, no broken traces because of a middleware component that just dropped the header. And that's why we started working on that project, to push for a standard to enable us to really ensure end-to-end transactions, no more broken traces.
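To make the broken-traces point concrete, here is a hand-rolled sketch of what a middleware tier that does not itself participate in tracing should do: pass the two W3C headers through untouched on outgoing calls. The URL is a placeholder, and in practice an agent or instrumentation library handles this for you.

```python
import urllib.request

def call_downstream(incoming_headers: dict) -> bytes:
    # Forward the W3C Trace Context headers unchanged so the trace
    # is not broken at this tier.
    req = urllib.request.Request("https://downstream.example/api")  # placeholder
    for name in ("traceparent", "tracestate"):
        if name in incoming_headers:
            req.add_header(name, incoming_headers[name])
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```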
And we're not only working on this project,
actually Alois from Dynatrace is chairing the project.
So Alois is chair of this W3C project.
So it's in our full interest to have that.
And as far as I know, Sonia already has something in the product already as well.
Yes, so we have it already running as a preview in the product.
So you can register on the website and we enable a feature flag for you.
You can add those headers additionally. Meaning, if you have services running in the cloud, and Azure is implementing it, you already have some cases where you get an end-to-end view where the traces used to break, because this header is now being passed on. That's pretty cool.
And so now this is trace context. Makes sense. Maybe one more thing.
It's really an early standard.
So we know that Microsoft, Google, and others will implement it.
So right now the number of use cases that we can start with is not that huge.
But in the future as cloud providers and additional components will implement this new standard,
then we will have solved this problem for our customers.
Cool.
So we are ready and waiting for the cloud.
Yeah.
And we can also post, so you said people can register, at least the Dynatrace users can register for an early access program.
So that's, I guess, on our regular early access program page, they can find it.
That's awesome.
And I think you also wrote a blog post about it, didn't you?
Yeah, exactly.
That comes with a nice blog post and some examples.
Perfect. So now this is trace context. But the real problem that we obviously have to solve is
how do we instrument applications so we can actually get details about the transaction
that is flowing through a microservice or even through a monolith. And this is now, Daniel,
if I'm hearing this correctly, this is what OpenTelemetry, kind of the combination of OpenCensus and OpenTracing, tries to solve.
How do we instrument applications, and I would assume in the most convenient way, for developers or framework vendors?
Is this kind of the idea of open telemetry?
Of course. I mean, to be honest, I have to say that of course, in the most convenient way, you just install maybe Dynatrace and it will instrument, do all of that. So that's basically a solved problem. We put a lot of effort into instrumentation, creating those agents.
Yet there are use cases where you maybe are not able to get into some platform
with an agent or also from a vendor perspective you cannot always support
each and every technology and maybe there is no dedicated instrumentation for a given platform, technology, whatever.
And in such cases, of course, community-sponsored projects can help a lot.
And also a standard around that.
So for us, and that's also the reason why we are really actively participating in OpenTelemetry: for us it's fine, we don't care so much about where the data is coming from. If open source tracing solutions create high-quality data, even better for us, because our value prop is around what to do with this data: how to analyze it, display it, artificial intelligence, the whole backend that you need. You have to store it somewhere. So that's our value prop.
And as such, we hope even that at some point
the open source tracing solutions will be so good that
we can just plug and play them or plug them in instead of a Dynatrace agent and it will
produce data that is as good as our agents produce.
And this is what we are actively working on as Dynatrace.
So now what type of data are we talking about?
On the one side, one of the key aspects is obviously the trace context passing data along,
but that's not really the main data, right?
Is this like metrics?
Is it additional trace data, like what we call the PurePath?
Is it method execution times?
What are you guys working on right now as part of open telemetry what type of data can be exposed first of all open telemetry will be
traces metrics and logs okay so it will be all of that uh our focus right now is on traces. What is a trace? A transaction consists of a trace.
So that's the end-to-end thing that has one ID. And every sub operation, be it between tiers, be it within an application, is called span so a trace is more or less the let's put it like you can say
it's the root and on such a trace you have a lot of spans like a tree of spans that is below that
and and this is what we are that's very much resembles the pure pass that we have as Dynatrace. So it's not of, and that's also a good thing
because you can, it's always the same structure
you end up with.
It's always some kind of tree structure.
We have pretty much the same tree structure.
And yeah, our focus is now getting those traces
out of those different platforms.
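As an illustration of the trace-and-span tree Daniel describes, here is a minimal sketch using the OpenTelemetry Python API as it looks today; at the time of this recording the SDK was still in flux, so names may differ by version.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire the SDK so finished spans are printed to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")

# One trace (one shared trace ID), built as a tree of spans.
with tracer.start_as_current_span("handle-request"):       # root span
    with tracer.start_as_current_span("load-user"):        # child span
        pass
    with tracer.start_as_current_span("render-response"):  # sibling child span
        pass
```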
Yeah.
Pretty cool.
And that means how do we, so as a developer, like if I would use any of the libraries that
will come up, right?
I mean, obviously with the APM vendors like Dynatrace, the idea is you just install the
agent and that's taken care of.
But there will be other projects, I would assume some libraries, like I think we have
an SDK, right?
Where you can also instrument your applications.
Exactly, it would be very similar to our SDK.
In fact, we are trying to push some of the aspects of our own SDK, some of the learnings we had about data quality, about how the interface should look. We are trying to inject some of those thoughts coming from the enterprise world: scaling, what needs to be done to run in production in large environments, all the supportability, and all these small, tiny things. We are trying to inject that into the OpenTelemetry project, working on it with the community, to make sure we will also be able to get some value from that data. And as for the use cases, I see two main ones. The first is frameworks or components. You know, who has the best knowledge to instrument a
framework or a third-party component? Is it we at Dynatrace or is it the framework's developers?
I would argue that the framework's developers should be the best. They know the codes, they
know where the boundaries to the service are. They should be, in the end, the one that will
be able to do the best instrumentation. Right now we are taking care of that and it's always taking us a lot of resources and testing. Then a version changes and then something
in the library changes, and we have to update our instrumentation. So the vision and the hope for us at Dynatrace with OpenTelemetry is that frameworks will have instrumentation, will come instrumented with an additional library, and we'll be able to build on that. Not having all the resources in R&D to do the instrumentation by ourselves,
but really being able to reuse the framework instrumentation. And that will
allow us a quicker time to market, because doing it ourselves in R&D, we have to decide which ones we want to do, where we want to put our resources; and if it comes already built in, then we will be able to quickly display it and have it in Dynatrace.
I mean, for me, the great benefit is obviously for the framework developers because
they can put in instrumentation once, and then they are ensured that whatever monitoring
tool the customers will use, assuming they adhere to the OpenTelemetry standard, they
can consume the same data that
they have put into the framework.
I think that's perfect.
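To picture what framework-shipped instrumentation could look like, here is a hypothetical sketch: the framework name, its request/response types, and the handler are all made up, but any OpenTelemetry-compatible backend could consume the spans it emits.

```python
from opentelemetry import trace

# Hypothetical: instrumentation shipped inside "acme-web-framework" itself.
tracer = trace.get_tracer("acme-web-framework")

def dispatch(request, route_handler):
    # The framework knows its own service boundary, so it opens the span.
    with tracer.start_as_current_span(f"HTTP {request.method} {request.path}") as span:
        span.set_attribute("http.method", request.method)
        response = route_handler(request)
        span.set_attribute("http.status_code", response.status)
        return response
```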
And that's about traces, but as I said, it's also about metrics.
And logs, yeah.
And so that's one use case.
And the second use case we see for our users, sometimes some custom instrumentation code
is needed.
Like for business value, you want to put a business tag
on one transaction to do further analytics.
That stuff we cannot do automatically with Dynatrace.
So customers have to use the SDK, right now the OneAgent SDK, but it will always come up, and some customers might say, hmm, but that's vendor code. So what if in one or two years I want to switch vendors? Then it's not vendor neutral.
And that's why for us it also makes sense
to add this possibility to have the vendor
neutral SDK.
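A sketch of the kind of vendor-neutral custom instrumentation Sonja describes, using the OpenTelemetry Python API; the span name, attribute names, and the order/ship helpers are invented for illustration.

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def process_order(order):
    with tracer.start_as_current_span("process-order") as span:
        # Business tags on the transaction, usable for later analytics,
        # and not tied to any single monitoring vendor.
        span.set_attribute("shop.order_value", order.total)
        span.set_attribute("shop.customer_tier", order.customer_tier)
        ship(order)  # placeholder for the real business logic
```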
So I think that's another big thing, right? Because we see it always with our enterprise customers: they typically have contracts with vendors, and after a couple of years, they may change. And then if you are ensured that the instrumentation that you have will also work with the new vendor, then it's much easier to switch, which means vendors need to differentiate much better on the value prop.
I think that's what you said, Daniel.
We need to differentiate not in the way we instrument,
because instrumentation is a solved problem,
and it's basically a commodity, and it's a standard,
but we need to differentiate as monitoring vendors on what we do with the data. I think that's a really cool thing. Okay, so if I kind of recap again, just to make sure that everybody understands: we started with distributed tracing in the beginning. We understand that what we really want to do is end-to-end tracing, with information on what happens on each individual tier.
We want to make sure that things are passed along each individual tier.
That's why we have trace context, which defines a standard.
And the hope here is that every tool is producing the same type of data
and that cloud vendors, network component vendors are simply passing this along
and not throwing it
away, making it hard to actually get the end-to-end trace. Then we had two projects, one from Google and one from, what was the other one? LightStep.
Right, OpenCensus and OpenTracing, and they now merged into OpenTelemetry.
OpenTelemetry basically tries to figure out or specify standards for traces,
for metrics, and for logs. So that if I'm a user or if I'm building applications or
frameworks, I can basically pick whatever library I want to instrument my app, and I'm
ensured that, wherever my app or my framework gets deployed, if they have a monitoring tool that can ingest the data, I'm safe, basically. That's kind of the idea. I like your two use cases that you explained; that's pretty cool. So what does this mean then? You know, at the time of the recording it's July; I'm not sure exactly when this airs, but probably in the next couple of weeks. So let's assume it's fall 2019 and I start a new project.
What do I as an architect, as a developer, as an operations team need to take care of? Should I insist that the platform that I'm choosing is part of OpenTelemetry, that
my frameworks adhere to that,
or what are the things I need to take care of
so that I don't, let's say, go off in the wrong direction?
I know that's a tough question.
That's a pretty hard question, but it depends very much on what you want to do.
Using open source tracing might be an option to get started, but you always have to know, and take care of the fact, that you have to do something with the data.
Besides that, in the fall OpenTelemetry will not be ready. It will take, I'm pretty sure, until the end of the year until this is out. It really depends: if you want to use open source, okay, you can start with that and you can instrument with that, but it also means that you, of course, most probably put actual instrumentation code into your application.
So it means you have produced some kind of lock-in.
And you will build some kind of infrastructure where you store all this stuff.
And so, if you want to go this route, it's fine.
And yeah.
Also for experimenting, you know.
For experimenting, yeah. People understand the concept behind it better when they're doing it hands-on. Dynatrace does it automatically for you, but sometimes people need to understand, and that's a great way to get started and understand what's going on.
And as you said, you're starting a project, right? So it's fine to maybe start out with whatever. Actually, then you will still have to choose between OpenCensus and OpenTracing, and then I would go with the project that has the better support for the technologies you're using.
Because they differ depending on whether you're on Java or Node or .NET.
So they have a different breadth of what they do and what they auto instrument.
Because that's also a thing. You don't want to instrument each and everything on your own.
You want to have some auto instrumentation.
This really depends. There are different breadths of coverage here.
That's the first thing I would say you have to take care of.
And then you can start experimenting with that. If you are, like, today working on some enterprise product, and I'm not saying Dynatrace now, if you're working on an enterprise product as of today, I would go for a vendor, to be honest, because you need the support, you need some backend, you need the analytics. So if your time to market is like in the next few months, I would not go open source only.
There is just not enough tooling available, like backends, like analytics platforms, etc.
And also you have to think about the total cost of ownership.
Right now, when you start and play around, it's really nice, but then once you move to production you have to think about all the other stuff: which value do you get out of it? Right now the open source tools are really limited to a list of traces, and good luck trying to find the one trace you need for debugging among a million traces. And also you might lack the data quality to really figure out what the issue was, and then there's maintenance, and then you have to scale your system. And there are a lot of considerations to think of, which in the end might mean more effort and more resources.
And as an application developer, you want to focus on the core business.
You want to be writing code that brings value to the business.
You don't want to build your own monitoring platform.
And that's a little bit the risk with the open source tools.
Now back to your question: I think application users should stick to something that does it automatically. No complexity, enabling us to deliver business value.
On the other side, the question that's really relevant is for framework and component providers. Because if you have a framework or a library or a component, let's think about stuff like Envoy. Envoy is a proxy, native code, so we cannot do automatic instrumentation there, and it has OpenTracing built in.
And I think that's a really smart move from them.
And we are looking at implementing it
and adding those traces to our own PurePath.
And that really makes sense: because it's native, we couldn't do automatic instrumentation, but because they have built in OpenTracing support in such a nice way, we will be able to do automatic injection and retrieve the data out of it.
And I've just seen today, this morning, a tweet
from Linkerd asking users for feedback
on what they should build in the direction of distributed tracing. So if you are a framework, component, or library provider, that's the moment where you should think: okay, which value do I want to bring to my users? How do I want to enable them to not only debug and monitor my component, but make sure I provide value along the distributed trace, along the end-to-end tracing? Because most of the time it's not only about that component, but about making sure it plays nice with the others and we can integrate it. And for these frameworks and components, I think it's really important to start looking at OpenTelemetry now, and Trace Context. Those are the two things that are really important.
So who is part of OpenTelemetry?
Like what companies, what bodies are part of it?
I mean, I assume the list is long, but who are they?
I mean, I would assume all the big cloud vendors.
Microsoft and Google are in there.
Uber is participating because they have Jaeger tracing.
Also a lot of different people on the vendor side; Netflix is participating. So a lot of different companies that are either vendors or cloud providers, or have really large deployments of some technology they want to monitor properly.
So I mean, the reason I'm asking this is because this actually shows the commitment, and I think also that it's future-proof.
Absolutely. It's the way forward, right? I mean, that's OpenTelemetry.
Yeah, well, we had kind of a separation when there was still OpenCensus and OpenTracing, also a separation of different philosophies of how to do it. And now this is a really joint effort of a lot of different groups, vendors, and cloud providers, and we can expect that this is the way going forward.
Yeah, definitely. Pretty cool.
We are also contributing to OpenTelemetry, so we are part of the project.
And we are in the, Daniel correct me if I'm wrong,
in the Java, in the Python, in the Node.js, in the .NET groups.
And in the specification.
And in the specification.
That's pretty cool.
So to kind of sum it up, I mean, I know I summed up
a lot of things already, but I want
to just highlight some of the other things I just learned at the end.
If I'm a framework developer, if I'm a cloud vendor, if I'm a component developer, right now is a good time to look into open telemetry.
Because if I put this in the trace context and everything else, I'm future proof.
And I know that the next generation of my frameworks will have a better chance to be correctly monitored and contribute to a real good end-to-end trace.
I think that's a great way.
If people just want to play around with it right now, OpenTelemetry is not yet there.
So that means right now you can still look into OpenCensus and OpenTracing.
And Daniel, I think you said end of the year is probably a good timeframe for OpenTelemetry,
maybe early next year.
Now it's a good time to look at OpenTelemetry from a spec perspective, see how things are
implemented, contribute maybe.
Yeah, that's right.
I mean, it makes sense to contribute if you see that something is missing.
There are a lot of issues up for grabs at those different projects.
So if you want to participate in open source, go ahead.
But yeah, until the end of the year, I just expect nothing to be there, and also no backend to be available to actually ingest this data. I think I have to emphasize that again: OpenTelemetry will just collect the data
and will send it to somewhere.
This somewhere has to exist, of course.
And this somewhere will be also Dynatrace at some point.
But with open telemetry alone,
you just get protobufs or JSONs or whatever.
Understood, yeah.
That's really important.
Yeah, and I think that's also to kind of echo your point again from earlier.
It's great that we have it.
That means we will make it very easy to instrument frameworks and applications.
But the key and the value prop, the differentiator of vendors like us is going to be what we
can do with the data.
Exactly.
And that's the clear message.
Yeah.
And I hope I don't step on anybody's toes here now, but that actually means the
agent technology itself is becoming less, obviously, of a differentiator.
I mean, there's still obviously value because it's the auto instrumentation, but the real value differentiator and the value prop is going to be what to do with the data.
Yeah, I mean, the depth of the data is still a topic; it will take a long time until open source can match that. Because, just as an example, we very often run natively on platforms and collect native metrics that are not available to something like Node.js. If you think of Node.js, if you don't run in a native module, there is some data you don't get out of it, like certain stack traces and all of that. So it will take its time until those projects are that far; we are working on getting them there. So I think, for the agent development, we are talking about three or four years until those technologies will have the same low overheads, the same breadth of data. That will take a few years.
And actually I'm starting a blog post series to highlight those technical challenges. You know, to really make visible what the things are that we want to work on, which features we are pushing for in OpenTelemetry, to make sure that all the use cases we have for large enterprise, production-ready customers fit into OpenTelemetry. Quality of data, supportability, and all those topics that might be overlooked when people work in development on small projects, but that are real issues for us.
Very cool.
All right.
Any last words?
No?
All good?
We will definitely make sure that in the description of the podcast we link to your blog posts, and also, for Dynatrace users, to your early access program.
I think that will be good.
And I think we should probably do another podcast at the end of the year
to see what the status is and where we are.
Yeah, maybe around or after KubeCon.
Yeah.
KubeCon is in November.
There will be a few news around open telemetry, definitely.
And after that, we should talk again, I guess.
But I think it was really great as an overview
and also kind of explaining the different terminology,
the history.
OpenTelemetry is the future.
I think that's important to understand.
Great to look into it right now
if you want to contribute to the spec and want to use it.
And if you are a user,
that means if you're writing apps that you want to monitor,
it will take a little while longer until it's ready.
But you can already look into what's available right now: OpenCensus, OpenTracing, and obviously the vendors.
We all have good solutions that make it easy to instrument, trace, and analyze data.
Awesome. Thank you.
Thank you.