PurePerformance - The History & Power of Distributed Tracing with Christoph Neumueller & Thomas Rothschaedl

Episode Date: March 31, 2025

So you think Distributed Tracing is the new thing? Well - it's not! But it's never been as exciting as today! In this episode we combine 50 years of Distributed Tracing experience across our guests and hosts. We invited Christoph Neumueller and Thomas Rothschaedl, who have seen the early days of agent-based instrumentation, how global standards like the W3C Trace Context allowed tracing to connect large enterprise systems, and how OpenTelemetry is commoditizing data collection across all tech stacks. Tune in and learn about the difference between spans and traces, why collecting the data is only part of the story, how to combat the challenge when dealing with too much data, and how traces relate and connect to logs, metrics and events.

Links we discussed:
YouTube with Christoph: LINK WILL FOLLOW ONCE VIDEO IS POSTED
Christoph's LinkedIn: https://www.linkedin.com/in/christophneumueller/
Thomas's LinkedIn: https://www.linkedin.com/in/rothschaedl/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready! It's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me my co-host Andy Grabner. Andy Grabner, step on down. How are you doing today, Andy? Very good, Brian. Thank you so much, and I'm pretty sure you were a little bit surprised that I didn't mock you during your introduction.
Starting point is 00:00:42 I was. A running joke. It kind of almost traces back for many, many episodes now. God, you're so good at that. Oh, keep going, keep going. It traces back. We're interested to trace it back all the way to the beginning where this all started.
Starting point is 00:00:59 Kind of the root cause, the moment that ignited all this. And yeah, I don't know what else to say. Yeah, well, I think you said enough. And as our loyal listeners know, and for our guests, Andy always has a great way to tie in something from when he first speaks to the episode. So that's why I reacted the way I did. And speaking of our guests, Andy, would you like to do the mutual?
Starting point is 00:01:25 Speaking of our guests, speaking about tracing, distributed traces, and distributed tracing is the topic of today. And I want to be upfront and honest. We have four diner tracers today on the call. So there might be some bias, but we try to keep this as educational and neutral as possible.
Starting point is 00:01:46 With us today, Christoph Neumüller and Thomas Otschedl from our product engineering team. I want to start with Christoph. Christoph, thank you for being here. Could you quickly just say who you are, how long with the company, what gets you excited? Hi, Andy. Hi, Brian. Thanks for introducing. A pleasure to be here.
Starting point is 00:02:05 So I'm Christoph, I've been with Dynatrace for 13 years now. In 2011 I started as an engineer in the agent team, so data collection, and I'm now a product architect in the observability space in Dynatrace. Perfect, thank you so much. And then Thomas, how about you? Hello Andy. Hello Brian. I've been with the company for 10 years now.
Starting point is 00:02:32 I started originally in pre-sales and moved over to product management. So now I'm a principal product manager caring about tracing and topics around distributed tracing in Dynatrace. Very cool. And this just shows, I think, some of the big qualities that we have in our organization. We've all been here for a very long, long time. Brian, you started back when? 2011, same as Christoph. 2011, yeah. I started in 2008. And I think this also shows we have collectively more than 40 years of experience in distributed traces and in observability. And I know that a lot of people are kind of entering
this field new, thanks to all the work that has been done recently in the CNCF, in cloud native, especially around projects like Prometheus and OpenTelemetry, bringing traces, logs, and metrics to the people that really need them in an easier way. I want to go a little bit back in history first to talk a little bit about the problems that we as vendors, and that's not just Dynatrace, right? I also want to mention the AppDynamics, the New Relics of the world, the Datadogs of the world. There are many out there that have built observability platforms over the years, trying to instrument on the fly without any code changes,
Starting point is 00:03:58 and then facing and tackling the challenges of capturing a lot of this data in large amount at scale, storing it and analyzing it. And I know I think, Christoph, you said 2011. If you could maybe take us back a little bit in history, if you do a little bit of a history recap on your end on how everything started, that would be good for me to know. For me, what I want to tell is an aha moment for me when I started working at Dynatrace. I got my first introduction to the product and as an engineer back then, the first time I saw a trace in Dynatrace, a pure path, I was pretty blown away because it back then, 2011, it showed me
Starting point is 00:04:48 the flow of the transaction through my system. It showed me when exceptions happened in my flow. It even back then showed me the logs associated with the trace. And a new feature back then, I think it was AppMon release 4.0, if you remember Andy, was the first time we even put profiling into tracing. So we had some form of continuous profiling that would go into the trace, the auto-sensors, if you remember Andy. So the first time I saw this, I was blown away by how cool this is and how I could now debug my stuff much better, because I have all this inside. Thanks for bringing all this up.
Starting point is 00:05:36 Now the industry has moved to provide these things, or is about to provide all of these things you just mentioned as a standard logs in the context of a trace, sampling and profiling these things, or is about to provide all of these things The first experience I had with one of our customers that turned this on and then they had debug logs and they had millions of debug logs in their transactions. And then we blew up all these distributed traces. These are all things that is a topic for also later in the discussion around what is the real data that we should capture, which data we should probably not capture, which data should be rather translated into a metric other than capturing everything in the raw format. But as you said, these concepts have been around for a while
Starting point is 00:06:35 and we have solved a couple of challenges over the years. And we're also obviously very happy to use opportunities like today to talk about this so that the broader community can also benefit from our knowledge we've built up over the years. You talked about the CPU sampling. The fun part is that I was a customer for about a year and a half or two years before. I think I was on Atmon 3.5 was my first one. And the fantastic piece, and it's great to hear that open telemetry is getting into the side now with the CPU sampling, is, I'm sure you remember, Andy, who had the sensor profiler where you can go and it would show you all the different methods you might want to pick, and then you could make the mistake of saying,
Starting point is 00:07:46 groundbreaking because it prevented you from over instrumenting from doing this so I just needed to give a shout out to that because you know Knowing that open telemetry hadn't had that and and that it's either you know I don't know what the status of it now, but the fact that it's coming It's it just it shows we know what a long way it's come anyway that we have a second guest So why don't we go over to our next guest Thomas? Then how do you say do you say Thomas or how do you say your name? Sorry? So why don't we go over to our next guest, Thomas. How do you say, do you say Thomas, or how do you say your name, sorry?
Starting point is 00:08:08 Thomas, yeah. Thomas, okay. Thomas is fine, that's perfect, yeah. Yeah, it was quite interesting. When I entered the company, this was 2014, it was the time when OneAgent was born, so to do everything automatically, to just deploy one binary on your machine,
Starting point is 00:08:23 and that takes care about everything. And this was the point when I was starting looking into Dyna Trace, the name, it was already there. But when we looked, we didn't show a trace outside of it. So we showed the things behind the trace. So we showed, okay, information about the JVM, the services that have been there. And the first time that I then got to know
Starting point is 00:08:45 about what the trace is in the backend was when I was looking at some developers there in the debug backend that visualizes the end-to-end trace, where all this information was coming from. And getting this automatically from users, from just, yeah, it's doing a binary restarting, your Tomcat, your JVM, distributed across systems. And also then to combine this with real user monitoring.
Starting point is 00:09:12 So to see all the browser stats, to see where the end users are located. This blow me away. And this was the time when I said, okay, I want to get deeper into that, how this works, because at this point of time, I had no idea how this technically can work to do all of this automatically. It was an interesting transition, if I may under just briefly from Appmon to Raxit.
Starting point is 00:09:37 It was Appmon gave you all the details from a single transaction and the trace and it was awesome for debugging a single issue, great. Raxit meant that it has stepped in a different direction which was give you an analytics answer from many traces. So from millions of pure paths we did analytics behind the scenes, baseline and root cause analysis already to give you an answer, but it did not emphasize the detail, the single trace a lot. So we went in this direction. This was quite interesting to be there to see then what was from where this data was coming from, from where was it originating and how the backend was then putting this together. And I think it took then one and a half year till this trace perspective was even added
Starting point is 00:10:27 to this product at this point of time. And as I was in pre-sellers, I was fighting a lot, hey, we need to expose this also in the UI here because it was a hard time to show where it comes from and how all these data points relate to each other. And I think I just wanted to, for those people that are not familiar with the history, we started with Dynatrace, tracing in the name
Starting point is 00:10:52 and then we called it the second generation. We called it Ruxit, so we basically incubated a new observability platform, called it Ruxit, brought it to the market and then eventually Ruxit, brought it to the market. Eventually Ruxit became what Dynatrace has been for many, many years afterwards. Now we are on our third gen, next gen, current generation platform that is powered by Grail. I just wanted to highlight because maybe not everybody is familiar with the word Ruxit. The interesting thing is we went from small scale, or let's say more the cloud scale architectures
Starting point is 00:11:45 where we needed a different approach because it was no longer practical to look into millions of pure paths and find all the needle and the haystack. And then kind of hiding the complexity behind the scenes and all this data, and first trying to give you really answers and root cause information before we brought the pub half back. Really an interesting story, I would say. Which now brings me to a lot of time has passed and we mentioned OpenTelemetry quite a bit. OpenTelemetry has also now been around for a couple of years.
Starting point is 00:12:30 It was the merge of OpenTracing and OpenCensus. And what I would be interested in, I'm not sure who wants to tackle this first, maybe Christoph, how has OpenTelemetry also changed the way we think about distributed traces from a Dynatrace perspective. Because I know we are obviously we have our agent, we've invested a lot of time with many engineers that work on the agent that provides automated instrumentation, all these features. How has OpenTelemetry changed the way we think about tracing? Yes. So where do I start? It changed a lot in summary. And for me, the personal story began in 2018 when a colleague, Chris Spatzbauer, asked me to join some effort into standardizing distributed tracing. It was before telemetry and it was rather about the context passing, which became W3C
Starting point is 00:13:23 trace context. So he said there's a bunch of, you know, other companies who are interested in standardizing how we pass context between calls so that the distributed trace can even be built up. We had such a technique for a long time in built in in Purepath, and now industry wanted to standardize. And, you know, I joined them joined then these working groups that we did there and said, how can we cross vendors even back then, make sure we pass context on so we can build up a distributed trace.
Starting point is 00:13:58 It was about passing on trace ID and parents' band ID and that kind of stuff. And within those groups, there was people from Google, from Microsoft, Trace ID and Parents Ban ID and that kind of stuff. And within those groups, there was people from Google, from Microsoft, from Neuralic, from Splunk. So, AppDynamics, I think, I hope I'm not forgetting anyone. Apologies if Instana was also present. So we held these meetings and this became fruitful, right? We, you know, some years later, we finalized the standard.
Starting point is 00:14:30 It's now very commonly used also in OpenTelemetry. And while I'm mentioning this is also from that group, some people emerged into founding OpenTelemetry. Back then, 2018 was the situation also that there were competing standards. There was OpenTracing, who provided APIs, not so much more. OpenCensus, which provided SDKs, which was capable of producing spans, also defined a protocol. And those were competing because instrumentations didn't know how to program against OpenTracing, OpenSensors. So the merger was being, I think, announced 2019.
Starting point is 00:15:14 And I was, at the time, involved with those people through trace context. That's why I know a little behind the scenes there as well. And the goal really behind OpenTelemetry was to standardize the data collection for observability to a degree where libraries know how to produce observability data, like which spans and span data to produce, also metrics events, it's all under this common umbrella. And then a standard protocol that can send the data to some backend, which the idea is
Starting point is 00:15:49 to be exchangeable and then diagnostic basically. That's basically the OTLP protocol, right? That's the OTLP protocol. And I can't believe that was 2019 because it seems like it was just a couple of years right? The That's the old theory protocol. The I can't believe that was 2019 because it seems like it was just a couple of years ago, but we're talking about what, six years now? It took time to stabilize. It's still stabilizing, but it has very good adoption already. But it still seems like it was yesterday.
Starting point is 00:16:19 Yeah. But I guess that's not each. And I think the trace context is also quite important point because it was a first step to do start with standardization to make sure that it's even possible to propagate the trace state across load balances across the systems. So that people started talking about this. This was at this point of time, very often an issue to, yeah, what is this vendor
Starting point is 00:16:44 specific header? Why do you need it there? Why do I need to make my load balancer forwarding all this information there? This took a lot of time to convince people and that they are all the vendors and they are agreed on a standard there to do that. This made things really easier.
Starting point is 00:17:03 And this is, I think, key to even be able to have an end-to-end trace across subsystems and also across cloud vendors and about across all of this information. I think that was also an important point. Now it's convenient. Now everyone knows what is the trace context and how is it working and it's built into the standards. But it was a good first step here. Yeah. Because it solved one, as you mentioned, the big problem. We always had these situations where they said, hey, now show me the end-to-end trace. And then we said, well, it looks like we have broken traces, right? We have a little bit here and a little bit there. And then we tried to figure about what's in the middle. Oh, there's this load balancer, there's this proxy, and they're kicking
Starting point is 00:17:40 out our header. And as we agreed on the standard standard then this proxy load balancer could forward it without any questions asked. So we talked about OpenTelemetry. I think Christoph, the two of us, we just recorded a session on distributed trace analytics on GRAIL and you did a great job in the beginning of the session to explain kind of the autonomy, I think you called it the autonomy of a trace with spans. What is a span? What is a request? What is a trace? Could you quickly just do me a favor for the audience here to quickly explain the core terminology because I think that's very important.
Starting point is 00:18:27 terminology because I think that's very important. All right, sure. So, well, the trace itself, it represents the whole transaction from beginning to end, right, throughout your systems, multiple tiers, multiple hops. But that is built up from individual, you know, pieces that are called spans. And those represent units of work within your distributor trace. That might be incoming HTTP, that might be outgoing HTTP, that might go an outgoing database call, it might also be an internal method execution somewhere and those all have hierarchical relationships with each other. So each span has another span as a parent. So it builds up the tree and there's a root and it goes across tiers, right? The parent span can be a remote one from somewhere else.
Starting point is 00:19:18 And each of those spans represents something different. That's why it also carries different information. A database call carries the database statement, the number of database rows that were affected by the database, if it failed or not. HTTP call obviously contains HTTP request method, HTTP headers. There is a range of what you can configure what the instrumentation does but that's kind of the data model of spans is depending on you know what operation it represents carries different information. I think you also, in our recording, you referred to it for you as BAN is nothing else than a structured log, if you look at it from a technical perspective. So this is an interesting point, right?
Starting point is 00:20:14 You look at logging and what it logs, right? There's logging from standard frameworks that log every HTTP call. Then of course, there's also custom logs. Nowadays, it's very modern to see logs, to make them structured, so it's easier to parse and do analytics on them. So you don't just write the text line, you write all your elements of what you want to log in a JSON form or so. Now if you think of spans, what are they? They are like also just logs that log the HTTP call or the database call or some important method execution at some point.
Starting point is 00:20:54 The attributes on a span is nothing more than metadata of that you like, like in a structured log the context and the metadata of that operation. So what's the difference? The span is a structured log with more context because it knows also the parent and what transaction it belongs to through the trace study. So you can do even more powerful analytics than on this. So does this then mean if I continue thought, a span is nothing else than a structured log with a certain semantic convention where you have certain minimum data like the span ID, the trace ID, the parent ID, maybe a type and then depending on what type it is, some additional metadata? Exactly.
Starting point is 00:21:47 Maybe let me go back a little bit to OpenTelemetry. What you just said is, okay, there is a model behind spans, data model, yes. Go a little back to OpenTelemetry, what they define is semantic conventions, which is basically a description of such data model. What is it that I can find on a database call? What is it that I can find on HTTP call is defined in in open telemetry semantic conventions, which is very helpful because it allows one that someone who writes an
Starting point is 00:22:21 instrumentation to know what put on what to put on someone who writes more importantly, an analytics query on the data, what to expect, how to work with data. In Dynatrace, we have introduced something similar to that, which is called semantic dictionary, but it's basically that. It's the description of that model that you can expect to find on traces. It's very much very closely aligned to open telemetry conventions with some differences,
Starting point is 00:22:49 but it basically describes how you can work with the data. Those structured logs, quote unquote, that are spans. Thomas. Oh yeah, sorry, Brian. No, I was just gonna comment on, you know, thank you for explaining the difference between span and trace. I remember when open telemetry, even open trace and stuff were starting and some brave people were venturing into the DIY component when we would talk to them, all they would talk about is spans, right?
Starting point is 00:23:26 And it took us on the pre-sale side, some time to figure out like, what are they talking about? We're talking about traces, like it's the whole thing. They're like, but do you have all the spans from this layer? We're like, right? But because in those early phases,
Starting point is 00:23:39 they didn't have necessarily the full trace, it was all about the spans. And I don't see that come up too much anymore, but I think it's definitely a very important component. The spans make up the trace, the trace is the full picture, which is very, very important. Yes, you might be a developer on a specific service or a specific area, you only care about that one piece, but you still have to have awareness of what's going on above and be a developer on a specific service or a specific area, really some nice visuals in explaining all this. This episode airs the video with Christoph.
Starting point is 00:24:25 Really some nice visuals in explaining all of this. Now to Thomas. Thomas, I've got a question for you. For me, the question is you are the PM, the product manager for the Distributed Traces app. I also know in the Distributed Traces app, you made a strategic decision to differentiate between requests and spans to support different types of analytics use case. Could you fill us in a little bit on what you typically see out there when people are now analyzing the traces and maybe some additional thoughts that were put in so that we can actually analyze the vast
Starting point is 00:25:05 amount of traces and spans? Yeah, so I think that's also a very important point here what you figured out. So one thing is that you typically want to find out if what failed in the systems, did everything went well, did everything succeed or did something fail and on which endpoint. So on which, what was the interface that was called here? Was it a specific HTTP request that reached microservice? Was it a specific queue endpoint that was called here in that case? All of that is typically happening on the request level,
Starting point is 00:25:41 so where the information is coming in, and that is what is interesting here. So to understand, because not every technical error that happens further down on a small unit of work, what Christoph described, is leading to a logical business error in the end, because, yeah, developers made quite a lot of effort to build the code error-prone
Starting point is 00:26:04 and to deal with small technical errors to do a retry out there. But I want typically to understand, okay, did it fail? I want to find it very quick, what failed? And then I want, and just then I want to go into the details to all of the small units of work, to all the spans to understand what was really happening. Do we have specific error codes on it? Did any exceptions happened there? What have been the exception messages? And from there, I can then jump to related log files
Starting point is 00:26:36 that hopefully also have some information about the trace or the span where they happened that can give more information what really happened down there in the code and what can I do to prevent this in the future. I think that's one of the important points to think about what is failing. The other point is to understand what was the response time. How long does it take? Was it fast enough?
Starting point is 00:27:04 Like end user expected it? This is also a little bit what is in the DNA of most vendors out there. To understand, to look at it from a performance perspective. And here also the request is per design telling the response time. So when I do a request, I call an endpoint, how long does it take that I got the response back? And this helps them to understand, yeah, why are my users complaining
Starting point is 00:27:33 that something took too long or longer than they expected? And when I can filter for the slow request, I can then look into all the details, technical details that happened there. So I can then understand, oh yeah, this was in a specific data center where some issues have been there, or oh, this was executed on a specific infrastructure component where the CPU went high or things like that. So to really have the full picture and that the trace can give you that is then to find out what happened in the backend and find the jump points over to other related signals like metrics.
Starting point is 00:28:10 Oh yeah, this was happening on a specific Kubernetes pod that was trottled down because it runs out of CPU cycles there and was limited down there. This brings me to another topic which we also discussed I think in the video. As we're ingesting traces, there's also the importance of enriching the data that we have with additional information. So for instance, you just brought up would the response time, the increase of response time be correlated with a CPU spike or with something in the Kubernetes cluster because would the response time, the increase of response time, be correlated with a CPU spike,
Starting point is 00:28:50 or with something in the Kubernetes cluster, because the Kubernetes cluster runs into any type of CPU scheduling issues. In order to make that analysis, that runs in the port, that runs on this particular Kubernetes namespace in that node. Can you tell me a little bit, and I'm not sure who wants to take this, can you tell me a little bit where the data enrichment and the context enrichment typically happens? And I would like to, if you can do me the favor, explain how the OneAgent does it. And for those people that don't have the OneAgent agent that are doing it on OpenTelemetry, there's also ways to enrich this on the OpenTelemetry side. I don't know who wants to, Christoph maybe? Yeah, let me take that. So it is critically important for spans but also all other
Starting point is 00:29:39 signals like locks and metrics to have the proper context on them, especially information about the vertical stack. So where did the signal come from? The port, the container, the process, the host, the cloud information. This important vertical stack information must be enriched on every signal, then you can do the proper slicing, dicing, filtering, or splitting your data through those dimensions. In DynAdress even, I have to say one cool advantage is even if you don't have something fully enriched, you can join in the query language we have to the topology over like a host and enrich the information about the process or the hardware that is ran on the processor CPU model into your data and also use it for filtering and slicing and testing.
Starting point is 00:30:41 But even apart from that, the direct enrichment is very important. But even apart from that, the direct enrichment is very important. One agent pretty much takes care of this automatically. So to incriminate this, we have our operator that helps it, puts context into the pod. One agent then uses also that information plus the process level information it has, plus the cloud information it gets, does that automatically. In OpenTelemetry, it is actually similar. There is in OpenTelemetry,
Starting point is 00:31:13 the concept of resource attributes. It's exactly that, it is attributes that go on all signals. And there is so-called resource detectors. And they do pretty much this. There is a process level resource detector, host level resource detector, a Kubernetes resource detector, a cloud resource detector, they all they go to
Starting point is 00:31:32 the cloud API, if it's available locally, fetches the cloud region, the cloud provider, stuff like that, puts it as a resource attribute into the, the agent or the SDK of open telemetry. And that again again makes sure that on the exported protocol, OTLP, those resource attributes are present. And then in the backend, it would materialize as resource attributes or directly in which attributes
Starting point is 00:31:55 and available for filtering. And that means I would assume all these attributes are based on standard convention again. So that means if somebody is following the open telemetry, best practices, these enrichments will then on the Dynatrace side or also on any other observability back end that support these conventions show up in the right way. Absolutely. Yeah, exactly right.
Starting point is 00:32:20 So obviously this enrichment is very, very important, but there's other contexts of enrichment and I'll explain where I'm going first and then ask if this is something that's supported So, obviously, this enrichment is very, very important, Those are, at container, which pod, which all those other components, those are things that the trace can pick up. Is OpenTelemetry, do they have slots for additional data or are there plans for additional data to be able to be put into OpenTelemetry so that if people in their release naming, in their environment variables or anything else, start enriching the context at the process level, is this additional stuff that OpenTelemetry can pick and then pass on to the backend?
Starting point is 00:33:30 Or do you need a third party backend like Dynatrace to take that data from, okay, we know it's running on this pod, this pod also has additional data that we're grabbing from here, we could put it all together. So our understanding of this is that Kubernetes, for example, we have Kubernetes labels. They are very useful context, but they are also a lot and not all of them are always
Starting point is 00:33:53 100% useful, right? So you can enrich them, but selectively. Or you should selectively enrich them. In open telemetry, to my knowledge, this would be possible through an environment variable that OpenTelemetry offers, the DT, not DT, OTEL resource attributes, environment variable.
Starting point is 00:34:16 And if you, in your deployment, make sure that your Kubernetes labels are translated into a list in that environment variable, you effectively end up enriching your total data with those labels that you want to. Great. In one agent, it's working similarly. You have a configuration that you can specify which labels should go on all your data. So you can do it remotely, centrally, and basically enrich your data as well.
Starting point is 00:34:45 centrally and basically enrich your data as well. But then the benefit of, I'm not trying to sell Dimetries here, right? Cause I'm sure all of the vendors like us have similar capabilities, right? Where if you have data from outside of that area, right? Again, if you're then pulling in your AWS tags or something like that, because you already have the association with, with which pod or node or whatever it's running on, or AWS tags or something like that.
Starting point is 00:35:05 Because you already have the association with which pod or node or whatever it's running on, that backend can then take that information and say, this belongs with this, so now when we look at the trace, can bring in all that other context. Switching from how we instrument and how we enrich data, one topic that I briefly want to cover because it just comes up in more and more conversations is, and I bring you the statement that I hear at conferences, Dynatrace, Datadog, New Relic,
Starting point is 00:36:01 we're also expensive with open telemetry. On the one side we say we love it, but on the other side we're also expensive with open telemetry. On the one side we say we love it, but on the other side we're super expensive. These organizations, the ones that I just mentioned, we rather want people to use our agent. So why is that? What I've learned in my conversations is that at least in many cases, it seems that there is an educational gap on what an agent does from one of us vendors in terms of sampling versus what the default configuration with OpenTelemetry does in terms of sampling.
Starting point is 00:36:37 Meaning what I've seen in the recent conversation, the same instrumentation from OTEL versus our agent produces 100 times more data. And a hundred times more data storing it in the back end needs to be more expensive now. But nobody looks into this and nobody and the first thing to blame is you're so expensive. So my question now to you, A, is this what I'm seeing here, a real thing or am I just in my little bubble and I was just unfortunate to have come some conversations or is this also something that you see? Yeah, so I think one point that we definitely see is the amount of data volume and that's
Starting point is 00:37:15 what you mentioned already, what is a little bit in the DNA already when you do distributed tracing on large scale customers, you also need to think about how to handle this load. And also which of the data is relevant and is needed here. Because not every data point, when I have a specific endpoint on my blog post and it's opened a thousand times a minute or something like this, and everything went swell and fast,
Starting point is 00:37:41 I typically don't need all of this information inside of it. I want to have a good portion of it, okay, the response time was fine, I didn't saw any failures here, or I have a good probability of how to collect this data and how to find out which data is relevant. And yeah, this big chunk of knowledge and architectural work like from Christoph and all his colleagues went into this to do this in a smart way automatically also to assure to have the end-to-end traces. So there needs to be some information how can we be assured when something is coming in that also we have then the database calls in the end when several microservices got hopped on here in a good probabilistic way to find out,
Starting point is 00:38:27 yeah, to have all the relevant data there. And I know there are mechanisms in open telemetry also to do sampling, to reduce this data because they face the same problems. I know even that colleagues from us contribute there to bring this forward. So to have similar algorithms to reduce the data or to just have the relevant data or more relevant data there that is processed
Starting point is 00:38:51 here. But OpenTelemetry is, as far as I know, not as far as most of the vendors here where put a lot of knowledge there. I think Christoph can jump in here with more details about how this works. I mean, OpenTelemetry does a good job. It's a very modular system to put samplers in place that have different sampling strategies from simple probabilistic to more advanced ones. In one agent, for DynaTrace now data collection, over the years we find our strategy on how to sample data for tracing and we apply basically two, I can describe two major strategies. The one is we do different sampling rates based on the endpoint that's being hit.
Starting point is 00:39:36 Those that are being hit a lot get a lower sampling rate than those that are being hit rarely. Like the resource requests against your images on your website would have a very low sampling rate. That means not every transaction is being captured. But the login request that only happens once, or far fewer times, would get maybe a 100% sampling rate. It is dynamic, though, but rarer endpoints get a better sampling rate than frequently used endpoints. That's one way to overall manage your observability cost well.
Starting point is 00:40:21 And the other one is, overall, we look at the whole system. And if you get into peak usage, we then you know, we start rate limiting at some point in the system to control the cost. So if you if you blow your your monitoring system on like Friday, because you have 10 times traffic, we will start reducing the overall sampling rate to manage basically the cost in your system. To recap, because I think this is hugely important. You said the first approach, right, we want to make sure that those in correlation less
Starting point is 00:41:01 frequent transactions, we capture them because if a login that may happen a hundred times an hour versus your image request for your website happens a hundred thousand times, you want to have those 100 login requests because they are business critical, they're important. Looking into some of the observed systems that I've seen, what I've done, and based on stuff that I learned from you, Christoph and Thomas, I opened up the Distributed Traces app and I used the grouping feature to figure out what is the number of traces that are coming in per endpoint.
Starting point is 00:41:39 And you can immediately see on the very top where there's no sampling involved, it's all these JPEG requests, those CSS requests, all these static resource requests that are coming in. you see on the very top where there's no sampling involved, This is why people then say, but you need to put the effort in to configure the system so that it works for you. Versus you can also go with a vendor agent and these things are being taken care of for you. And I think the other thing that's important to point out too is we're talking about sampling the traces, right? So you're not going to have that span level data for every single trace coming through. However, that doesn't mean we're not, it doesn't mean we're sampling the metrics, right? So it's not going to be like if you only
Starting point is 00:42:53 capture one out of 50 image requests that you're going to see, you know, like how we only had 20 image requests in the last minute. No, all that information, all the metric level information is still there. And I'm assuming that with Othel on its metric ingestion is doing the same thing. You'll still have the full picture from that metric point of view. We're just talking about that detailed level trace. And Thomas, you said something very important
Starting point is 00:43:23 about statistical probabilities, right, earlier, where if something is happening that frequently, and this goes back to even with the CPU sample when we first started introducing it, it's like if a method is happening so fast or a trace is happening so often, and it's gonna show up in there, right? Just not every single one. But if it's a problem and it's it's gonna show up in there, right? Just not every single one
Starting point is 00:43:45 But if it's a problem and it's running slowly it's gonna be there because it's gonna be running longer it's gonna be a bigger part of the picture, so No, you know and everyone's doing sampling these days because again this data load and I think the big the big challenge for people especially people coming from the earlier days when it was easier to capture everything because systems were much simpler, of letting go of having at or near 100% of everything and trusting in, okay, we got to look at statistics, we got to look at probability of what we capture. We still have that full context though from the metrics, so we'll still know if a failure
Starting point is 00:44:22 rate is high or anything else. And just that key idea of if a problem happens on one trace and one trace only, is it a problem or is there a blip? Are you going to get everybody up and say we have this problem because one trace took 10 seconds? Maybe there was a power surge. So it's. That context. I know it makes people uneasy, but I think what I loved, and again, not to do a commercial for Donna Chase here, but when we came up with that smart sampling, I forget the name
Starting point is 00:44:56 we had for it, but the smart sampling idea where we're going to do it based on volume and all that kind of stuff. But yeah, when you have it on your own premise to have the hardware available. But yeah, when you have it stored, it's not just the storing of this data. You also need to have a lot of CPU power to analyze this data. So if you have all your traces, then you have the next problem. You have perhaps some rarely occurring issues that you need then to analyze. And then you need to run needle in the high stack search to really find the errorness or the slow request here in this huge high stack there. And this then and then is typically also using a lot of CPU power here. So that's what you then have when you have this by yourself. But you bring up a very interesting topic. Maybe let's dive there a little bit,
Starting point is 00:46:02 you bring up a very interesting topic. Maybe let's dive there a little bit, which is metrics or spans. Now, as you spend both, now let's get into this a little bit. What's now better? Now, if you produce a metric on non-sampled data, like in the source, of course, that's more accurate if you look at counts and such.
Starting point is 00:46:23 Spans are sampled usually so you don't have exact counts then but you have all the context. This is a huge difference in metrics. Of course, you can also put the horizontal, the vertical stack I mean, that doesn't increase your cardinality but you cannot put everything on a metric. You can't put the full exception message or the stack trace or the all HTTP headers that appeared on your metrics because you're running to cardinality explosion. Your metric, if you put the random ideas that I mentioned on a metric, then it becomes an event, right?
Starting point is 00:47:04 Username or something like this, if you can't get it here. Right. The efficiency of a metric comes from the fact that if you have a series, it's just a number of integers or doubles, you know, in a row. That's very efficient to query. If that array is not there anymore, because it's just one number per cardinality combination, then it's like an event. It's not efficient. It's not as efficient anymore. Now, that is why in most metric systems in the world, or all of them, you cannot do high cardinality. You have to choose your dimensions that you want to split by. But what if you have a problem and you want to split by something you didn't know before you want to split by?
Starting point is 00:47:48 You want to filter, I mean, the username now it's obvious, but what about the order ID or the product ID? Something that's high cardinality and in combination it would explode. So you can't put it on a metric. Now on a span where a structured log for that matter is similar or same, you can put all the dimensions stored as it is and then you can do those kinds of queries. And also needle haystack and Thomas I have to say needle hayag, we became extremely efficient nowadays in Needle Hashtag queries also in Dynatrace, now with Grail, which was built for indexless raw data search
Starting point is 00:48:33 for logs, but we use the same thing for spans. So searching for individual Hashtag IDs or whatever is very efficient. And it can also very efficiently extract time series on the fly from spans. So you can do a make time series out of spans. Looks like an actual metric but it is on the fly calculated and I can split it by order ID. I can split it by request ID if I need to. request ID if I need to. The system can handle it quite efficiently, not exactly as efficiently as it was
Starting point is 00:49:10 a metric in the first place, but still, Grail became very powerful. So Dynatrace, the query system storage and query system became very powerful in doing so. And this is also where some industry vendors go to, which is the unknown, unknown, you know, scenario where you didn't know upfront you need something. Now you need it. You weren't able to put it on a metric, but now you want to analyze against this in your raw data. This is a, you know, very, you know, exciting new direction that the industry is going to.
Starting point is 00:49:49 I think, Christoph, you also did a great job in the video, folks. You can check it out as well in the links in the description here, where you walk through some of these scenarios and show exactly that use case. But I want to just to clarify, this will obviously work and be accurate, depending on how much data you really ingest and where you sample. And this is why sampling is so important, like sampling the right things. And this is why, you know, ignore or sample things away that are not business critical and focus on those things that are critical.
Starting point is 00:50:19 And then you can also make these things a reality, what you just said. Thank you for tying back that note to sampling. We came from there And then you can also make these things a reality, what you just said. Thank you for tying back that knot to sampling. We came from there because when you do the on-demand time series creation, you need to be aware of the sampling that has happened originally on your spans and work with that extrapolate, etc. Cool. I think we're getting it to the end. It's amazing how time flies. Yeah. are new to traces? I don't know anything and Thomas I want to start with you. Anything else that we missed that you want to highlight? Oh, I'm very sure we missed something. But yeah, what I think is what Christoph brought up here is using this data to analyze on and on and on. So yeah, when you have metrics, you typically need to decide beforehand what do you want to look at. Like, yeah, typically something failed, okay,
Starting point is 00:51:25 or you have perhaps performance information. So that is typically important, you want to have this. But all this other information on it, and to make this really searchable, just to access this data, this is really helpful, especially in today's days where the data volumes become huge. To find that and to be able to analyze this data
Starting point is 00:51:49 on the fly in real time, this is really key here to do the right decisions and to find things that you have not been aware of beforehand. So I think that's an important point that we should highlight there. Christoph? Yeah. I was thinking about the summary or, you know, overall statement about distributed tracing.
Starting point is 00:52:12 I think we need to differentiate two things, the data collection and the analytics capabilities. So data collection nowadays became much more of a commodity through open telemetry. So whether you choose a vendor agent or open telemetry, well, it's a matter of, you know, each gives you some advantages, disadvantages, or convenience functions, or additional features. That's fine. So either is okay. What's important, much more important, is the analytics backend you go to. What's important, much more important, is the analytics backend you go to. And this needs to give you two things. It needs to give you out-of-the-box answers and anomaly detection, I would say. So out-of-the-box with your trace data should do something useful that doesn't involve your
Starting point is 00:53:00 daily interaction. It should alert you out-of-the-box about issues. The second second thing is if you need to dig into the data, it should give you a powerful way to do so. Analyze, slice and dice your raw data as you need, as you see fit for certain troubleshooting cases. I think that's the two big things you should look for in an observability backend. Thank you. Brian, how are you? big things you should look for in an observability backend. Thank you. Brian, how about you? Well, first I want to pour a little out for the old term peer path.
Starting point is 00:53:35 That was our old word for what traces are now and I got water all over me. But yes, rest in peace, Peter Veth. But I think, first of all, Thomas and Christoph, all the work that you all and everyone you're working with on maintaining the ability and refining the capabilities around collecting traces and all this has just been fantastic over the years. And I also just want to shout out all the companies that you mentioned earlier, Christophe, who started getting into this whole idea of how do we standardize this, I think is a great testament, once again again to the IT community of sharing and letting everybody get better.
Starting point is 00:54:27 Because not that it's full-on proprietary, but you can take a look from a vendor point of view if we're going to create a way for people to not necessarily have to use the agent that we're selling by agreeing on these terms and just the fact that it came together and everyone recognized the need for this obviously not only creates something like open telemetry but creates better practices for the vendors to leverage themselves. So I just always am in awe of when things like this happen because it's you know when big money is involved people don't really do too much to help each other out. So thanks for everyone who was doing that too. And I hope people, yeah,
Starting point is 00:55:12 this has just been a fascinating conversation and I'm sure we can go on for another three hours, but we'll have to have an internal one about the history of diatres and all the fun things we used to see. Memory lane. Anyhow, Andy, back to you then. have to have an internal one about the history of Dynatrace haven't talked about today's open pipeline because there's another component as data gets ingested in our case into Dynatrace we can extract data we can decide what to do how we enrich it and also what we store and we'll just make sure that we link all of these assets in the description
Starting point is 00:55:57 but yeah looking forward to where this year the Traces OpenTelemetry will bring us in the years to come. Thank you. Thank you all. Thank you. Bye. Thank you. Bye.
Starting point is 00:56:09 Bye.
