PurePerformance - Understanding Distributed Tracing, Trace Context, OpenCensus, OpenTracing & OpenTelemetry

Episode Date: September 16, 2019

Did you know that Distributed Tracing has been around for much longer than the recent buzz? Do you know the history and future of OpenCensus, OpenTracing, OpenTelemetry and Trace Context? Listen to this podcast where we chat with Sonja Chevre, Technical Product Manager at Dynatrace, and Daniel Khan, Technical Evangelist at Dynatrace, about the past, current and future state of distributed tracing as a standard.

OpenTelemetry: https://opentelemetry.io/
OpenCensus: https://opencensus.io/
OpenTracing: https://opentracing.io/
Trace Context: https://www.dynatrace.com/news/blog/distributed-tracing-with-w3c-trace-context-for-improved-end-to-end-visibility-eap/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Welcome to another episode of a special Pure Performance. I'm not sure if it's a regular Pure Performance or a Pure Performance Cafe. We have the cafes which are kind of like more technical focused with the engineering team. And I have a mixture sitting here. Well, I have engineers sitting here. But I actually start on my left. Sonja.
Starting point is 00:00:43 Hi, Andy. Hi. I think we had the opportunity in the past. Yes, yes, that's true. On a different topic. Different topic, yeah. Maybe you want to quickly introduce yourself? I'm a Technical Product Manager at Dynatrace. I have been working at Dynatrace for four or five years, I don't remember exactly. And now my main topic is distributed tracing and observability. Cool, so obviously these are the topics that will be relevant for today's talk, because we're going to talk about all the things that are happening in the
Starting point is 00:01:15 community around OpenTracing, OpenTelemetry, OpenCensus and all these terminologies. It's a really dynamic environment right now. Yeah, and you are on the product management side in Dynatrace, meaning trying to figure out how Dynatrace can contribute to that open tracing community, or to that distributed tracing community. Exactly, how we can help with the projects to make sure that they fit for us once we want to integrate and support them within Dynatrace for our customers' use cases.
Starting point is 00:01:41 Very cool. And then on the other side, a little further to the right from my perspective, sits, who is that? Hi, I'm Daniel. Hi, Daniel. So, Daniel, maybe a quick introduction for you as well, for people that don't know you yet. Yeah, I'm lead technology strategist at Dynatrace, also quite the same time at Dynatrace as Sonja
Starting point is 00:01:58 is. And I'm responsible for everything open source that we do at Dynatrace. And also I'm on the W3C working group for Trace Context, and also participating in a few open source projects. And more or less I'm responsible for our community efforts, so how we are doing open source. This also involves, of course, the open observability space. Perfect. So that means on the one side we have the product team, you know, implementing things in the product that will obviously help our customers. On the other side, Daniel, the connection to the larger open source community and really driving also the
Starting point is 00:02:41 efforts around open tracing or traceability. So for me as a, let's say, developer, architect, if I want to start with a new project in the cloud native space, I am overwhelmed with a lot of terminology that is floating around. And I think I want to get started with maybe a level setting, explaining to everyone all the terminology that is floating around that I have to be aware of. So for me, Sonja, you said earlier open, I mean, distributed tracing, right?
Starting point is 00:03:12 I think distributed tracing should be clear, but for those where it's not clear, can you give us a little overview because you're also the PM for distributed tracing. What does this really mean? So distributed tracing is about connecting all your services together to be able to have one end-to-end view on what's happening with one request.
Starting point is 00:03:31 You know, one request typically starts in the browser or in a mobile app, and then you have the click that goes to the first server, or maybe to a web server, and then to a service and another service. And in a microservices environment you have, for one request, a high number of services that are included. In the past it was less interconnected: you had one monolith, and the request wasn't that big, it just stayed within one big component. And now with microservices one request can go to a high number of services, and people,
Starting point is 00:04:11 especially when troubleshooting, debugging, looking at the environment, are really confused: where did my request go, where did the error come from? And distributed tracing is the tool that helps us architects and developers to understand how services are connected, which one was included in my request, how long did it take in each service, and to be able to do some meaningful analysis. And is distributed tracing something completely new that just came up in the last year or two? No, actually it's been around for like the last 10 or 15 years. Actually at Dynatrace we have a patent on it, and I think it was started, it's already,
Starting point is 00:04:39 Alois tweeted that some time ago on Twitter. So it was two years before the Dapper paper from Google, which is another reference in the distributed tracing world. So for like the last decade many of us have been doing distributed tracing, but now it's really more focused and visible due to the microservices trend. You know, before, developers would just develop their code in the monolith application, but now with multiple microservices, debugging within the process is not enough. They need to have a view across services, and that's why recently it has become more popular. Cool, but it's been around for a
Starting point is 00:05:13 while, so it's nothing new. We've been doing it for probably 15 years now, since we've been around. The new thing is the standardization effort that's going on, for example with the W3C working group Daniel is working on. So maybe that's a good segue over. So distributed tracing is the overarching term. Now there's a lot of different standards around. I mean, there's OpenCensus, OpenTracing, OpenTelemetry, Trace Context. There's a lot of terminology that is floating around when I do my research.
Starting point is 00:05:44 Can you maybe give us a little background on what all these terms are and what's really relevant now? Yeah, that's actually a huge question. If we start with Trace Context: so if we, as Sonja said before, want to trace transactions end-to-end, we need a way to pass some ID, some transaction ID, between all those tiers. Until now, each vendor, each project had their own way of passing this ID. Mostly it was a header, but it was not clear which header this should be. And W3C Trace Context establishes a dedicated header format for passing on tracing information between different tiers, which means that, first of all, one good thing is that we can hope that infrastructure providers like cloud providers or proxies or firewalls will then not remove this
Starting point is 00:06:55 header when it passes through them. So the hope is that when there is a standard, this header will be passed on between tiers. That's the first thing. The second thing is that this header has a second field that will also contain some information about who participated in this trace, and that is also valuable information when it's about interoperability between tools. So, you know, maybe here is Dynatrace, then there was something in Azure, and here is Dynatrace again, and you can then maybe use the same trace ID to look up
Starting point is 00:07:31 the same trace in Azure. So you get, more or less, yeah, you can weave in, merge different traces from different vendors. That's one thing. The other things you mentioned are different projects. These are not standards; these are basically projects that deal with, or try to solve, the problem of tracing applications, or like instrumenting applications, so that they collect traces, so that they pass on this header. Because this doesn't come for free, of course; you have to instrument your code somehow. OpenCensus is a project by Google; they had their own solution there. Then there is OpenTracing, that's a project started by LightStep, but
Starting point is 00:08:20 it's a CNCF project. So CNCF, the Cloud Native Computing Foundation? Yeah, it's a Cloud Native Computing Foundation project. And those two projects now merged into one, or are merging into one project that's called OpenTelemetry. Okay. And this should then be the best of both worlds, basically. So that's the whole idea of that. Yeah, so if I kind of sum it up, Trace Context is kind of a definition of how to pass trace information from one tier to another, and kind of what information is passed.
Starting point is 00:08:56 Solving some of the issues that we as a vendor have been facing: the trace context that we used was an HTTP header, and it has been constantly removed by tools like proxies, so we had to go in and configure it. So now the belief, or the hope, is that once this is a standard, the firewalls, the proxies, all these things in the middle will pass it along. Plus, if the cloud vendors and maybe other vendors are also sending this information along, then it will be easier to do end-to-end tracing even though you
Starting point is 00:09:29 may not have your, let's say, monitoring tool installed on every tier. For our users the first thing is really making sure you get the end-to-end transaction, no broken transactions, no broken traces because of a middleware component that just dropped the header. And that's why we started working on that project, to push for a standard to enable us to really assure end-to-end transactions, no more broken traces. And we're not only working on this project; actually, Alois from Dynatrace is chairing the project. So Alois is chair of this W3C project.
Starting point is 00:10:03 So it's in our full interest to have that. And as far as I know, Sonja already has something in the product as well. Yes, so we have it already running as a preview in the product. So you can register on the website and we enable a feature flag for you. You can add those headers additionally, meaning if you have services running in the cloud, with Azure implementing it, you already have some cases where you get an end-to-end trace where the trace used to break, because this header is now being passed on. That's pretty cool.
Starting point is 00:10:41 And so now this is Trace Context. Makes sense. Maybe one more thing: it's really an early standard. So we know that Microsoft, Google and others will implement it, but right now the number of use cases that we start with is not that huge. But in the future, as cloud providers and additional components implement this new standard, we will have solved this problem for our customers. Cool. So we are ready and waiting for the cloud.
Starting point is 00:11:08 Yeah. And we can also post it, so you said people can register, at least the Dynatrace users can register for an early access program. So that's, I guess, on our regular early access program page, they can find it. That's awesome. And I think you also wrote a blog post about it, didn't you? Yeah, exactly. That comes with a nice blog post and some examples.
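To make the header format they are discussing concrete: the W3C Trace Context `traceparent` header carries four dash-separated fields (version, trace ID, parent/span ID, flags). Here is a minimal Python sketch of parsing it; the helper name is ours, not from any SDK mentioned in the episode:

```python
import re

# W3C Trace Context "traceparent": version-traceid-parentid-flags,
# e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Split a traceparent header into its four fields; raise on malformed input."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        raise ValueError("not a valid traceparent header: %r" % header)
    fields = match.groupdict()
    # The spec forbids all-zero trace and parent IDs
    if fields["trace_id"] == "0" * 32 or fields["parent_id"] == "0" * 16:
        raise ValueError("trace-id and parent-id must not be all zeros")
    return fields

fields = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(fields["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

Every tier that forwards a request keeps the same trace ID but writes its own span ID into the parent-id field; the second header, `tracestate`, is where vendors append their participation info.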
Starting point is 00:11:26 Perfect. So now this is Trace Context. But the real problem that we obviously have to solve is: how do we instrument applications so we can actually get details about the transaction that is flowing through a microservice, or even through a monolith? And this is now, Daniel, if I'm hearing this correctly, what OpenTelemetry, kind of the combination of OpenCensus and OpenTracing, tries to solve. How do we instrument applications, and I would assume in the most convenient way, for developers or framework vendors? Is this kind of the idea of OpenTelemetry? Of course, I mean, to be honest, I have to say that of course in the most convenient way you just install maybe Dynatrace and it will instrument, do all
Starting point is 00:12:16 that. So that's basically a solved problem; we put a lot of effort into instrumentation, creating those agents. Yet there are use cases where you maybe are not able to get into some platform with an agent, or also, from a vendor perspective, you cannot always support each and every technology, and maybe there is no dedicated instrumentation for a given platform, technology, whatever. And in such cases, of course, a community-sponsored project can help a lot. And also a standard around that. So for us, and that's also the reason why we are really actively participating in Open
Starting point is 00:13:04 Telemetry, is that for us it's fine; we don't care so much about where the data is coming from. If open source tracing solutions create high quality data, even better for us, because our value prop is around what to do with this data: how to analyze it, display it, artificial intelligence, the whole backend that you need. You have to store it somewhere.
Starting point is 00:13:34 So that's our value prop. And as such, we even hope that at some point the open source tracing solutions will be so good that we can just plug and play them, or plug them in instead of a Dynatrace agent, and they will produce data that is as good as what our agents produce. And this is what we are actively working on as Dynatrace. So now what type of data are we talking about? On the one side, one of the key aspects is obviously the trace context, passing data along,
Starting point is 00:14:14 but that's not really the main data, right? Is this like metrics? Is it additional trace data, like what we call the PurePath? Is it method execution times? What are you guys working on right now as part of OpenTelemetry, what type of data can be exposed? First of all, OpenTelemetry will be traces, metrics and logs. Okay, so it will be all of that. Our focus right now is on traces. What is a trace? A transaction is represented by a trace. That's the end-to-end thing that has one ID. And every sub-operation, be it between tiers, be it within an application, is called a span. So a trace is more or less, let's put it like this, you can say it's the root, and on such a trace you have a lot of spans, like a tree of spans below that.
Starting point is 00:15:15 And this very much resembles the PurePath that we have at Dynatrace. And that's also a good thing, because it's always the same structure you end up with. It's always some kind of tree structure; we have pretty much the same tree structure. And yeah, our focus is now getting those traces out of those different platforms. Yeah.
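The trace-and-span structure Daniel describes can be sketched as a tiny in-memory model: one trace ID shared by a tree of spans, each sub-operation hanging off its parent. This is purely illustrative; it is not the actual OpenTelemetry or PurePath API:

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional

# Toy model of a distributed trace: a tree of spans sharing one trace_id.
@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    children: List["Span"] = field(default_factory=list)

    def start_child(self, name):
        # A sub-operation inherits the trace ID and points back at its parent
        child = Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)
        self.children.append(child)
        return child

# Root span: the entry point of the end-to-end transaction
root = Span(name="GET /checkout", trace_id=uuid.uuid4().hex)
db = root.start_child("SELECT orders")       # sub-operation within the same tier
payment = root.start_child("POST /payment")  # outbound call to the next tier
retry = payment.start_child("POST /payment retry")

# Every span in the tree carries the same trace ID
assert {s.trace_id for s in (root, db, payment, retry)} == {root.trace_id}
print(len(root.children))  # 2
```

Whatever tool receives these spans only needs the shared trace ID and the parent links to reassemble the end-to-end tree, which is why the structure ports so naturally between open source backends and a PurePath-style model.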
Starting point is 00:15:43 Pretty cool. And that means how do we, so as a developer, like if I would use any of the libraries that will come up, right? I mean, obviously with the APM vendors like Dynatrace, the idea is you just install the agent and that's taken care of. But there will be other projects, I would assume some libraries, like I think we have an SDK, right? Where you can also instrument your applications.
Starting point is 00:16:05 Exactly, it would be very similar to our SDK. In fact, we are trying to push some of the aspects of our own SDK, some of the learnings we had about data quality, about how the interface looks. We are trying to inject some of those thoughts coming from the enterprise world, about scaling and what needs to be done to run in production in large environments, all the supportability and all these small, tiny bits, into the OpenTelemetry project, working on it with the community, to make sure we will also be able to get some value from that data. And as for the use cases there, I see two main ones. The first is frameworks or components:
Starting point is 00:16:43 you know, who has the best knowledge to instrument a framework or a third-party component? Is it we at Dynatrace, or is it the framework's developers? I would argue that the framework's developers should be the best. They know the code, they know where the boundaries to the service are. They should be, in the end, the ones able to do the best instrumentation. Right now we are taking care of that, and it's always taking us a lot of resources and testing. Then a version changes, and then something in the library changes, and we have to update our instrumentation. So the vision and the hope for us at Dynatrace with OpenTelemetry is that frameworks will have instrumentation, will come instrumented with an additional library, and we'll be able to build on that. Not having all the resources in R&D to do the instrumentation by ourselves,
Starting point is 00:17:28 but really being able to reuse the framework instrumentation. And that will allow us a quicker time to market, because doing it ourselves at R&D, we have to decide which one we want to do, where we want to put our resources; and if it comes already built in, then we will be able to quickly display it and have it in Dynatrace. I mean, for me, the great benefit is obviously for the framework developers because they can put in instrumentation once, and then they are ensured that whatever monitoring tool the customers will use, assuming they adhere to the OpenTelemetry standard, they can consume the same data that
Starting point is 00:18:06 they have put into the framework. I think that's perfect. And that's about traces, but you said that's also about metrics. And logs, yeah. And so that's one use case. And the second use case we see for our users: sometimes some custom instrumentation code is needed. Like for business value, you want to put a business tag
Starting point is 00:18:25 on one transaction to do further analytics. That's stuff we cannot do automatically with Dynatrace, so customers have to use the SDK, right now the OneAgent SDK. But some customers might say: hmm, but that's vendor code. So what if in one or two years I want to switch vendors? Then it's not vendor-neutral. And that's why for us it also makes sense
Starting point is 00:18:44 to add this possibility to have a vendor-neutral SDK. So, I think that's another big thing, right? Because we see it always with our enterprise customers: after a couple of years, they typically have contracts with vendors, and after a couple of years, they may change. And then, if you are ensured that the instrumentation that you have will also work with the new vendor, it's much easier to switch, which means vendors need to differentiate much better on the value prop.
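The vendor-neutral custom-instrumentation idea can be sketched like this: application code writes business tags through one small interface, and the exporter behind it can be swapped (a vendor backend, an open source collector) without touching the instrumentation. All names here are made up for illustration; this is not the real OpenTelemetry or OneAgent SDK API:

```python
from contextlib import contextmanager

# Illustrative vendor-neutral facade: the app only ever talks to Tracer,
# and the exporter deciding where finished spans go is pluggable.
class Tracer:
    def __init__(self, exporter):
        self.exporter = exporter  # swap this without changing app code

    @contextmanager
    def span(self, name, **attributes):
        record = {"name": name, "attributes": dict(attributes)}
        try:
            yield record
        finally:
            # Hand the finished span to whichever backend is configured
            self.exporter(record)

collected = []
tracer = Tracer(exporter=collected.append)

# A business tag on one transaction, e.g. for later revenue analytics
with tracer.span("checkout", customer_tier="gold") as span:
    span["attributes"]["order_id"] = "A-1001"

print(collected[0]["attributes"]["customer_tier"])  # gold
```

Because the tags live behind the neutral interface rather than in vendor code, switching backends later means replacing the exporter, not rewriting every instrumented transaction.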
Starting point is 00:19:15 I think that's what you said, Daniel. We need to differentiate not in the way we instrument, because instrumentation is a solved problem, it's basically a commodity and a standard, but we need to differentiate as monitoring vendors on what we do with the data. I think that's a really cool thing. Okay, so if I kind of recap again, just to make sure that everybody understands: we started with distributed tracing. In the beginning we understood that what we
Starting point is 00:19:41 really want to do is end-to-end tracing, with information on what happens on each individual tier. We want to make sure that things are passed along each individual tier. That's why we have Trace Context, which defines a standard. And the hope here is that every tool produces the same type of data, and that cloud vendors and network component vendors simply pass this along and don't throw it away, making it hard to actually get the end-to-end trace. Then we had two projects, one from Google and one from, what was the other one? LightStep.
Starting point is 00:20:13 Right, OpenCensus and OpenTracing, and they now merged into OpenTelemetry. OpenTelemetry basically tries to figure out or specify standards for traces, for metrics, and for logs. So that if I'm a user, or if I'm building applications or frameworks, I can basically pick whatever library I want to instrument my app, and I'm ensured that wherever my app or my framework gets deployed, if they have a monitoring tool that can ingest the data, then I'm safe, basically. That's the idea. I like your two use cases that you explained. That's pretty cool. So what does this mean then? At the time of the recording it's July; I'm not sure exactly
Starting point is 00:21:03 when this airs, but probably in the next couple of weeks. So let's assume it's fall 2019 and I start a new project. What do I, as an architect, as a developer, as an operations team, need to take care of? Should I insist that the platform I'm choosing is part of OpenTelemetry, that my frameworks adhere to that? Or what are the things I need to take care of so that I don't, let's say, go off in the wrong direction? I know that's a tough question. That's a pretty hard question,
Starting point is 00:21:36 but it depends very much on what you want to do. Using open source tracing, yeah, it might be an option to get started, but you always have to know, and take care of the fact, that you have to do something with the data. Besides that, in the fall OpenTelemetry will not be ready; it will take, I'm pretty sure, until the end of the year until this is out. Yeah, it really depends. If you want to use open source, okay, you can start with that and you can instrument with it, but it also means that you, of course, put actual instrumentation code most probably
Starting point is 00:22:26 into your application. So it means you have some kind of lock-in produced, and you will build some kind of infrastructure where you store all this stuff. So if you want to go this route, it's fine. Also for experimenting, you know. For experimenting, yeah. People understand the concepts behind that better when they're doing it hands-on.
Starting point is 00:22:49 The agents do it automatically for you, but sometimes people need to understand, and that's a great way to get started and understand what's going on. And as you said, you're starting a project, right? So it's fine to maybe start out with whatever. Actually, then you will still have to choose between OpenCensus and OpenTracing, and then I would go with the project that has the better support for the technologies you're using. Because they differ depending on if you're on Java or Node or .NET. So they have a different breadth of what they do and what they auto-instrument. Because that's also a thing: you don't want to instrument each and everything on your own.
Starting point is 00:23:35 You want to have some auto-instrumentation. This really depends; there are different breadths of coverage here. That's the first thing I would say you have to take care of. And then you can start experimenting with that. If you are, as of today, working on some enterprise product, and I'm not saying Dynatrace now, if you're working on an enterprise product as of today, I would go for a vendor, to be honest. Because you need the support, you need some backend, you need the analytics. So if your time to market is like in the next few months, I would not go open source only.
Starting point is 00:24:28 There is just not enough tooling available, like backends, like analytics platforms, etc. And also you have to think about the total cost of ownership. Right now, when you start and play with it, it's really nice, but then once you move to production you have to think about all the other stuff: which value do you get out of it? Right now the open source tools are really limited to a list of traces, and good luck trying to find the one trace you need for debugging among a million traces. And also you might lack some data quality to really figure out what the issue was, and then it's about maintaining it, and then you have to scale your system.
Starting point is 00:25:07 And there are a lot of considerations you have to think of, which in the end might mean more effort and more resources. And as an application developer, you want to focus on the core business. You want to be writing code that brings value to the business. You don't want to build your own monitoring platform. And that's a little bit the risk with the open source tools. Now, back to your question: I think application users should stick to something that does it automatically.
Starting point is 00:25:35 No complexity; enable us to deliver business value. On the other side, the question that's really relevant is for framework and component providers. Because if you have a framework or a library or a component, let's think about stuff like Envoy. Envoy is a native proxy; we cannot do automatic instrumentation, and it has OpenTracing built in. And I think that's a really smart move from them. And we are looking at implementing it
Starting point is 00:26:03 and adding those traces to our own PurePath. And that really makes sense: because it's native, we couldn't do automatic instrumentation, but because they have built in OpenTracing support in such a nice way, we will be able to do automatic injection and to retrieve the data out of it. And I've just seen today, this morning, a tweet from Linkerd asking users for feedback
Starting point is 00:26:24 on what they should build in the direction of distributed tracing. So if you are a framework provider or a component or library provider, that's the moment where you should think: okay, which value do I want to bring to my users? How do I want to enable them to not only debug and monitor my component, but make sure I provide value along the distributed trace, along the end-to-end tracing? Because most of the time it's not only about that component, but about making sure it plays nicely with the others and we can integrate it. And for these frameworks and components, I think it's really important to start looking at OpenTelemetry now, and Trace Context. Those are the
Starting point is 00:27:01 two things that are really important. So who is part of OpenTelemetry? Like, what companies, what bodies are part of it? I mean, I assume the list is long, but who are they? I would assume all the big cloud vendors. Microsoft, Google are in there. Uber is participating because they have Jaeger tracing. Also a lot of different people on the vendor side; Netflix is participating. So a lot of
Starting point is 00:27:33 different companies that are either vendors or cloud providers, or that have really large deployments of some technology they want to monitor properly. So, I mean, the reason I'm asking this is because this actually shows the commitment, and I think also the future-proofing. Absolutely. It's the way forward, right? I mean, that's OpenTelemetry. Yeah, well, we had kind of a separation when there were still OpenCensus and OpenTracing, also a separation of different philosophies of how to do it, and now this is a really joint effort of a lot of different groups, vendors and cloud
Starting point is 00:28:15 providers, and we can expect that this is the way going forward. Yeah, definitely. Pretty cool. We are also contributing to OpenTelemetry, so we are part of the project. And we are in the, Daniel, correct me if I'm wrong, in the Java, in the Python, in the Node.js, in the .NET groups. And in the specification. And in the specification. That's pretty cool.
Starting point is 00:28:39 So to kind of sum it up, I mean, I know I summed up a lot of things already, but I want to just highlight some of the other things I just learned at the end. If I'm a framework developer, if I'm a cloud vendor, if I'm a component developer, right now is a good time to look into OpenTelemetry. Because if I put this in, the Trace Context and everything else, I'm future-proof. And I know that the next generation of my frameworks will have a better chance to be correctly monitored and contribute to a really good end-to-end trace. I think that's a great way. If people just want to play around with it right now, OpenTelemetry is not yet there.
Starting point is 00:29:16 So that means right now you can still look into OpenCensus and OpenTracing. And Daniel, I think you said the end of the year is probably a good timeframe for OpenTelemetry, maybe early next year. Now it's a good time to look at OpenTelemetry from a spec perspective, see how things are implemented, maybe contribute. Yeah, that's good. I mean, it makes sense if you see that something is missing.
Starting point is 00:29:39 There are a lot of issues up for grabs at those different projects. So if you want to participate in open source, go ahead. But yeah, until the end of the year, I just expect nothing to be there, and also no backend to be available to actually ingest this data. I think I have to emphasize that again: OpenTelemetry will just collect the data and send it somewhere. This somewhere has to exist, of course. And this somewhere will also be Dynatrace at some point. But with OpenTelemetry alone,
Starting point is 00:30:24 you just get protobufs or JSONs or whatever. Understood, yeah. That's really important. Yeah, and I think that's also, to kind of echo your point again from earlier: it's great that we have it. That means we will make it very easy to instrument frameworks and applications. But the key, the value prop, the differentiator of vendors like us, is going to be what we can do with the data.
Starting point is 00:30:47 Exactly. And that's the clear message. Yeah. And I hope I don't step on anybody's toes here now, but that actually means the agent technology itself is becoming less, obviously, of a differentiator. I mean, there's still obviously value because of the auto-instrumentation, but the real value differentiator and the value prop is going to be what you do with the data. Yeah, I mean, the depth of the data is still a topic; it will take a long time until open source can match that. Just as an example, we very often run natively on platforms and collect native metrics that are not available to something like Node.js. If you think of Node.js,
Starting point is 00:31:40 if you don't run in a native module, there is some data you just can't get out of it, and also some kinds of stack traces and all of that. So it will take its time until those projects are that far; we are working on getting them there. So I think for the agent development, we are talking about three, four years until those technologies have the same low overhead, the same breadth of data. That will take a few years. And actually I'm starting a blog post series to highlight what those technical challenges are.
Starting point is 00:32:22 You know, to really make visible what the things are we want to work on, which features we are pushing for in OpenTelemetry, to make sure that all the use cases we have for large, production-ready enterprise customers fit into OpenTelemetry. Quality of data, supportability and all those topics that might be overlooked when people work in development on small projects, but that are real issues for us. Very cool. All right.
Starting point is 00:32:50 Any last words? No? All good? We will definitely make sure that in the show notes of the podcast we link to your blog posts, and also, for Dynatrace users, to where they can find your early access program. I think that will be good. And I think we should probably do another podcast at the end of the year to see what the status is and where we are.
Starting point is 00:33:10 Yeah, maybe around or after KubeCon. Yeah. KubeCon is in November. There will be some news around OpenTelemetry, definitely. And after that, we should talk again, I guess. But I think it was really great as an overview, and also kind of explaining the different terminology, the history.
Starting point is 00:33:30 OpenTelemetry is the future. I think that's important to understand. Great to look into it right now if you want to contribute to the spec and want to use it. And if you are a user, that means if you're writing apps that you want to monitor, it will take a little while longer until it's ready. But you can already look into what's available right now: OpenCensus, OpenTracing, and obviously the vendors.
Starting point is 00:33:52 We all have good solutions that make it easy to instrument and trace and analyze data. Awesome. Thank you. Thank you.
