PurePerformance - The State of OpenTelemetry with Jaana Dogan

Episode Date: April 26, 2021

Google's Census, OpenCensus, OpenTelemetry, and the AWS Distro for OpenTelemetry. Our guest Jaana Dogan, Principal Engineer at AWS, has been working in observability for many years and has definitely had a positive impact on where OpenTelemetry is today. In this episode Jaana (@rakyll) explains which problems the industry, and especially cloud vendors, are trying to solve with their investment in open source standards such as OpenTelemetry. She gives an update on where OpenTelemetry is, the next upcoming milestones such as metrics and logs, and what a bright future with OpenTelemetry being widely adopted could bring.

https://twitter.com/rakyll

If you are interested in learning more – here are the links we discussed during the podcast:
https://github.com/open-telemetry
https://github.com/open-telemetry/opentelemetry-specification
https://github.com/open-telemetry/opentelemetry-proto
https://github.com/open-telemetry/opentelemetry-collector
https://github.com/open-telemetry/community
https://o11yfest.org/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always my co-host Andy Grabner. Hey Andy! Hey! I guess I... What? I was going to say it's funny getting to be able to see you while we do this and have you see what you make fun of me.
Probably you do all the time, and now I'm finally getting to see it for the first time. Brian, I would never make fun of you. It was really the first time when I tried to make some facial expressions, seeing if I can get you out of your kind of your talking flow. But I guess, no, it was really the first time. You did though. You see, if you go back and listen,
you'll notice I've made a couple of mistakes, especially coming out of it. I'm also a little flustered today, Andy. It's, you know, performance, you know, everybody always thinks about work. It's like, oh, you know, sometimes I like work, sometimes I don't. It's a challenge and all this. But sometimes, I've got to tell you, when you have to deal with not computer problems, not deployment problems, but arguing with my 11-year-old on why she needs to brush her teeth in the morning before going to school and her fighting back with me because kids just want to fight
back over the dumbest crap they can, and trying to hold it together and not just, like, throw her through the window, you know, and you're just like, just brush your teeth. You know, I could have got into the argument that, you know, it's the conspiracy of fluoride to take, you know, but I didn't want to. Anyhow, I'm glad to be here because this is an oasis, as I was saying, and it's always a great time. And I hope our listeners feel that way. And most importantly, Andy,
I'm glad to be here with you because your smile always brings a smile to my face. I'm happy. So, hey, Brian, before going on to our guests, I also have to say thank you to somebody else that is not on the show today. But I wanted to say thank you to the guys at Neotys and especially Henrik Rexed, who, yesterday and today, at the time of this recording, the 24th of March, did a 24-hour performance marathon with, I think, 20 different presenters around the globe. He got up at five o'clock in the morning and he did it 24-7, well, 24 hours. So it was really great to have all these folks. And he allowed me to talk about Keptn. And very good, he keeps doing it. I had to bring it in.
But now to something different. Yes. Our guest. I'm very happy. It took about a year from the first, I think, tweet that I sent to Jaana about, hey, I read this blog post, and it was called things I wish developers would know about databases. And she was going on about database performance and things like that. And I was like, oh, this is awesome.
Starting point is 00:03:09 Because Brian and I constantly talk about application performance problems and we always blame it on the database. But we always try to bring up the points that if you are writing good code that is making efficient usage of the database, then the problems might not be in the database. But anyway, I reached out to Jana. A year later, Jana, welcome to the show. How are you?
Starting point is 00:03:31 Hi. Yeah, sorry for making you wait. But to be honest, I also table flipped and left databases since we had our initial conversation. So I'm kind of coming back to performance observability to that field. So I'm great. As far as I can understand from this conversation, I'm super happy not to have kids.
Starting point is 00:03:55 Not yet, not before this pandemic ends. So yeah, I'm happy. Thanks a lot for the wait. And I'm a bit ashamed. The way I look at it is it's like, do you ever hear of The Muppet Show? Kermit the Frog and all that, right? When that show was on, so I was a young'un when that show initially aired. And their first season, they were having trouble getting guests. And then once people started seeing how good it was, that's when everyone was like, oh yeah, I'll be on the show.
Starting point is 00:04:23 And I think you were just looking to wait to see, is this show any good? Do they have good people on? You're just being judicious and smart about it. I had no questions about that. I had no questions about it. It was just really the matter of me being a bit busy last year because of the pandemic and everything.
Starting point is 00:04:39 And then I switched jobs and onboarding at a large company takes a long time. So sorry for that. I was trying to give you an excuse. But Jana, you bring up a good point. Can you remind folks your background? Because you said databases was just a little intermezzo that you had
Starting point is 00:04:55 and you actually did observability performance before and now you're back to observability? Yeah, I can give a bit of background. I'm working at AWS right now. I'm a principal engineer working on, like, mainly observability and our overall, like, instrumentation stack and, like, you know, roadmap and so on. But before this, you know, I was at Google for a long time.
Starting point is 00:05:19 My last, you know, endeavor at Google was databases, and that's how we kind of initiated our initial conversations. But before databases, I was working on the instrumentation team. Maybe some of you have known about this project called OpenCensus, which was kind of
Starting point is 00:05:38 inspired from the instrumentation library Google uses that's linked to every production binary called Census. So we decided to just kind of build some vendor agnostic instrumentation library that may work for multiple backends. Some of the concepts were inspired from Census, but it was really like a project from scratch. That project became OpenTelemetry after its merge with OpenTracing.
Starting point is 00:06:08 And before, you know, all of that, I actually ended up coming back to this field after many years because I was working on the Go programming language. And the last two years I was on the team, I was mainly focused on like performance, diagnostic tools. You know, there's a bunch of different things in Go. And the community is very, very, I think, motivated to, you know, use these tools and understand like there's a, I think,
Starting point is 00:06:35 a good culture in the language community. So that was my kind of like, I realized that at some point, oh, some of the areas that I want to work on, like, for example, distributed tracing, it's just not, you know, you cannot scope it to a programming language. It's just like a larger effort, and you have to have, like, consensus, you know, from all the parties in order to make anything. That's how I kind of, like, pivoted myself more into this field.
Starting point is 00:06:58 But, yeah, I had, like, you know, going backwards, I just kind of, like, pivoted back and forth, I think into observability and kind of ended up being here. So it's interesting. Great. I mean, talking about Go, I just spent four hours today trying to, well, let's say that I have a love-hate relationship with Go. In the end, love always prevails.
And so literally five minutes before we started the recording, I finally got my code to work and I just deployed it on my cluster. It's a little Go service, but it is hilarious. I feel like that love and hate, right? Like I always compare it to, you know, any other, especially I think in the beginning of my, like, you know, when I became a user, like early in the earlier days, it was much, much more difficult because I was always comparing it to my experience with other languages.
I came from a JVM background. So in terms of diagnostic tools over there, it was not in a good shape. And still not in a good shape. There's a lot of work going on. But I kind of, like, you know, stayed for the simplicity of the language, and, you know, I just can get things done with it, so that kind of works for me. But, you know, I'm trying not to be religious. I think I'm too biased because, you know, I've been very involved in Go. But, you know, I took a break, I think four
years, it's been four years, I haven't been working on Go as a full-time thing. So it's just been very healthy. Now, to your current role. So you said you switched from Google to AWS, and there you are responsible for observability, OpenTelemetry. I would like to get maybe an overview. What's the current state of OpenTelemetry at AWS? Yeah, so actually, like I, you know, said, like, you know, I generally work on instrumentation. OpenTelemetry is a piece of it, a big piece of it actually, but there's a lot more going on. And I can give you kind of, like, a, you know, brief maybe summary of what we're trying to do with OpenTelemetry. So, you know, in the last five to ten years, you know, we have all these different vendors building their different, like, you know, solutions, and they all had, like, different instrumentation, telemetry generation client libraries. One of the difficulties with this was, you know, you go to a customer who doesn't want to know too much about instrumentation or doesn't want to understand, like, you know, the telemetry points, and they don't want to re-instrument everything. So this was an initial problem. And then there are all these open source projects where you want to put some, you know, out-of-the-box instrumentation, but there's not a vendor-agnostic way of doing this, so they end up doing nothing or inventing their proprietary, you know, formats. This was a huge, big issue because most of the time, people just want to get things out of the box. So that's, you know, how OpenTelemetry came around. And AWS, I think, like, given the scale and the diversity of our customers, like, there's so much, right? Like, in terms of our customers usually want to use, like, three to four or, like, sometimes four to five different solutions. They want to see their, you know, data on multiple products and not just on AWS products, also, like, you know, other products, right? And us being able to, you know, collect, produce, maybe analyze and also stream, you know, the telemetry data in a vendor-agnostic format is very important for us. And important for, like, all the other partners and companies
that we're working with in the APM and observability space. So AWS has been, I think, very sort of, like, smart in terms of saying that, hey, maybe we should try to, like, align with what OpenTelemetry is doing, because that's going to be the format we can speak, everybody can understand, right? So the goal right now is trying to make, you know, OpenTelemetry collection much easier on all our compute platforms, EC2, EKS, ECS, Lambda. If you want to have your own instrumentation, you should be able to export it to a sink, like a collector available in the platform, so you don't have to deal with running the collector yourself.
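As a rough illustration of the flow Jaana describes here, this is a minimal Go sketch that points the OpenTelemetry SDK at an OTLP/gRPC collector endpoint; the endpoint address and service name are placeholders, not anything AWS-specific.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export OTLP over gRPC to a collector that the platform (or you) runs.
	// "localhost:4317" is a placeholder; in practice this would be a sidecar
	// or daemon address provided by the environment.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("creating OTLP exporter: %v", err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(resource.NewSchemaless(
			attribute.String("service.name", "checkout"), // placeholder service name
		)),
	)
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// Any instrumentation using the global tracer provider now ends up at the
	// collector, regardless of which backend the collector forwards to.
	tr := otel.Tracer("example")
	_, span := tr.Start(ctx, "do-work")
	span.End()
}
```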
So that's one of the goals that we're trying to achieve. The next thing is we also have, like, a lot of managed services that produce those telemetry data. Now, historically, this type of data has been very vendor specific. We would use CloudWatch for metrics and, you know, X-Ray for traces. Now we're, like, looking at cases where it can be produced vendor agnostically. So we can, you know, push that data as well to, you know, everyone else or to, you know,
whatever tool the customer, you know, wants to use. So OpenTelemetry may end up being this, like, you know, common telemetry format for us. And I think, like, I came to a realization a couple of years ago that being a cloud provider is really like being a telemetry provider in some sense. So, you know, if we're a telemetry provider, you know, this is the format we want to use, and we want to communicate, not just internally to ourselves but, you know, externally as well. There's also, I can give a bit more of what else is going on.
So I think OpenTelemetry is very limited right now to metrics and traces, and logs are coming up maybe next year. It's in the early stages. But we care about other things. Database performance, for example, is one of them. Can we propagate OpenTelemetry labels all the way to the databases, because they can do accounting based on labels, and so users can go and break things down? There's a lot of effort going on for eBPF. How are we going to be enabling eBPF? Can we generate aggregated data and output it to OpenTelemetry? So that's the other piece of thing that I'm working on. So there's a lot going on, to be honest.
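The "propagate labels all the way down" idea maps roughly onto OpenTelemetry baggage, which carries key-value pairs across process boundaries alongside the trace context. Here is a small Go sketch with made-up label keys; whether a database actually does accounting on such labels is exactly the open question raised above.

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel/baggage"
)

// attachLabels puts caller-defined labels into OpenTelemetry baggage so that a
// configured propagator can forward them on outgoing calls, for example toward
// a database proxy that understands them. The keys are illustrative only.
func attachLabels(ctx context.Context) (context.Context, error) {
	tenant, err := baggage.NewMember("tenant", "acme")
	if err != nil {
		return ctx, err
	}
	feature, err := baggage.NewMember("feature", "checkout")
	if err != nil {
		return ctx, err
	}
	bag, err := baggage.New(tenant, feature)
	if err != nil {
		return ctx, err
	}
	return baggage.ContextWithBaggage(ctx, bag), nil
}

func main() {
	ctx, err := attachLabels(context.Background())
	if err != nil {
		panic(err)
	}
	// Downstream code (or an instrumented driver) can read the labels back.
	fmt.Println(baggage.FromContext(ctx).Member("tenant").Value()) // prints "acme"
}
```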
Starting point is 00:12:55 Can we generate aggregated data and output it to OpenTelemetry? So that's the other piece of thing that I'm working on. So there's a lot going on, to be honest. Yeah, I got a couple of questions here. So it feels for me, at least I assume the first kind of focus of all of this is to enable your users that deploy their applications or services on the AWS infrastructure to make it very easy for them to expose their own telemetry data or maybe get some telemetry data off the runtimes
Starting point is 00:13:30 already run on. And actually, that's a question now for me. So if I would, let's say, use a Lambda function and I want my telemetry, you make it easy for me so that my code, my Lambda function, will expose telemetry data, make it easy to figure out if something is wrong in my code. Now, are you also planning on exposing your own telemetry data to the end user within
Yes, I briefly mentioned that we have a lot of managed services and we want to expose telemetry data, but I didn't go into detail. So thanks for asking.
So, you know, most of the time, actually, like most of the customers, they can do their own instrumentation. But, you know, what is valuable to them is, like, being able to see all these black boxes, right? Like the Lambda runtime, for example, it does a lot of interesting things. And that's not visible, or the user cannot do anything by just instrumenting their Lambda function, because there's a lot else going on outside of that, you know, function block. By using OpenTelemetry, we want to, I mean, we want to be able to, you know, produce our telemetry data in the same format as well. So there's been already some,
you know, work going on. Like, for example, some of the databases, some of the, like, you know, other managed services, like S3, are trying to do some more, like trying to expose more internal traces and stuff. But it's been in our proprietary formats. So you still have to go and find that data in X-Ray. We're kind of, like, thinking about what would happen if we can produce all that data in one format and give it to you as a streaming service, for example. So you can take everything, see
everything end-to-end, and you don't have to necessarily start with instrumenting yourself, right? Like, you should be able to see the big picture, maybe end-to-end, by traces coming from us automatically. And then, you know, if you want to participate in that, like, you might be able to, you know, using either OpenTelemetry or something compatible with that propagation and, you know, the data format. But that's sort of, like, the goal. We actually don't want people to, you know, start themselves. We just want them to be relying on the data coming from us, also some of the, maybe, like, you know, framework integrations and other stuff that we can still provide. So Lambda is an interesting example
Starting point is 00:16:06 because the Lambda runtime framework is also us, right? Like we can automatically create you a trace span, for example, for every Lambda function. And if you still want to participate into that, you are free to bring your instrumentation library or OpenTelemetry or OpenTelemetry-compatible library and add your custom things. But I think the overall vision is we should provide you out of the box as much as possible.
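A sketch of what "bring your own instrumentation and add your custom things" can look like in Go, assuming the platform or a tracing layer has already created a parent span and handed you a context; the tracer name, span name, and attribute key are illustrative only.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// handle is what a function body could look like when a parent span was already
// created for the invocation and put into ctx: you only add the custom child
// spans and attributes you care about.
func handle(ctx context.Context, orderID string) error {
	tr := otel.Tracer("my-function") // instrumentation scope name is up to you

	// Child of whatever span is already in ctx, if any.
	ctx, span := tr.Start(ctx, "charge-card")
	defer span.End()
	span.SetAttributes(attribute.String("order.id", orderID)) // custom attribute

	return chargeCard(ctx, orderID)
}

func chargeCard(ctx context.Context, orderID string) error {
	// Business logic would go here; any instrumented client called with ctx
	// continues the same trace.
	return nil
}

func main() {
	_ = handle(context.Background(), "order-123")
}
```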
And this brings another question, because one of the things that I wanted to ask you today is how can we get developers to actually, you know, really leverage OpenTelemetry. And I was wondering, are we moving towards a world where we ask developers to manually put in their traces, their spans? Like, do they have to think about where do I call the OpenTelemetry library? Or is it more the other way, which I think it is? And I just watched your video on AWS Distro, it's a seven
minute intro video on how to get started instrumenting a Java app that is running in a container on EC2, where I think the video at least shows auto-instrumentation. So my question really is, is the idea that it's us, AWS, or the OpenTelemetry foundation, that is providing not only the common protocol, but also automated instrumentation for hopefully a large number of runtimes, so that we make that easier, right? And then you can just, on top of that auto-instrumentation, you can still add two or three additional metric points that you may need.
Yeah, auto-instrumentation has always been the goal since the OpenCensus days. That's why we came up with this vendor-agnostic thing. So we can make sure, I mean, everybody was tired of providing all the instrumentation integrations, so we thought that, like, hey, if there's a vendor-agnostic thing, we can actually have everyone, you know, doing the integrations, also, you know, out-of-the-box, like, instrumentation for common frameworks. Like, for example, one example, I can give concrete examples, I think it's much easier: gRPC
Starting point is 00:18:19 examples i think it's much easier uh grRPC can provide, you know, traces automatically without you doing anything, you know, if they use OpenTelemetry today to instrument, right? Like the framework. But they've never been able to do it because there hasn't been an industrial standard where, you know, you would use it and there's a way to get data out of it in a well-known format.
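Today that wiring still lives in a contrib package rather than in gRPC itself, which is the gap being described. A hedged Go sketch of the server side using the otelgrpc contrib instrumentation; the service registration line is a placeholder.

```go
package main

import (
	"log"
	"net"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}

	// One option at construction time is all it takes for every RPC handled by
	// this server to get a span, exported wherever the globally configured
	// tracer provider points.
	srv := grpc.NewServer(
		grpc.StatsHandler(otelgrpc.NewServerHandler()),
	)

	// RegisterYourServiceServer(srv, &impl{}) // generated registration would go here

	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```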
Starting point is 00:18:44 So they've never been able to do it. The idea was OpenTelemetry, when it becomes successful, we're first going to start with providing some of these integrations ourselves. But eventually, all these projects will take OpenTelemetry, import it, and provide instrumentation in the framework without us doing any work. And the same thing applies to like, maybe this is easier for libraries, but the bigger problem was the binaries,
Starting point is 00:19:11 for example, databases. How are they going to be utilizing some of those concepts? Similarly, we have this very well understood now like exposed data format. So they can, again, import the OpenTelemetry libraries, produce that data,
Starting point is 00:19:31 and then write it to, you know, either OpenTelemetry collector or somewhere. And then the entire community can, you know, the user can understand and push it to whatever backend. The idea was always like, we should provide out of the box automatic instrumentation as much as possible. Nobody wants to do any instrumentation work. Yeah.
And I think this will also make life easier for vendors like us, right? At Dynatrace, we've been building instrumentation agents for the last 15 years. And the challenge is always, I think we've become good with it, but still, there's so many new technologies, new versions, new frameworks, and you always have to adapt your instrumentation. And if we can all agree on a standard, then obviously the value of such an agent that a commercial vendor like we provide goes down because eventually over the next couple of years, they're kind of obsolete because hopefully OpenTelemetry takes over.
Starting point is 00:20:29 But it makes life for the end users much easier because you don't have to think about, hey, I don't get the right telemetry data or traces. But now you do because you hopefully can be sure that this library, this software that you're using has been properly already instrumented either by the vendor itself or because these open source agents have become so good in auto instrumentation, you really get exactly what you need in order to do some troubleshooting, some performance profiling and so on. Yeah, exactly. And I was just going to say, Andy, that brings a smile on my face in a way, kind of a snarky
smile because I just think of, you know, the idea of auto-instrumentation, I think that's a wonderful goal. But what you all have to then deal with is what we have to deal with on a regular basis: new versions, users using the frameworks or the code in ways that you hadn't planned for, these things breaking and then turning into support cases. And it's, you know, the auto-instrumentation side of it is where it gets really sticky. And the smile, I'd say, it's a little bit more of an evil smile, because it's kind of like, great, now other people will see what goes into the auto-instrumentation side. Because, yeah, I mean, manual instrumenting would be a real pain,
so it would be awesome if you all can take care of that. It's just always that staying on top of it and finding, you know, the funny thing that you always see is that you have a framework, you have best practices, and then you put it in the hands of developers and everything goes out the window. And you start looking at what they're doing. You're like, what are you doing? Well, I can do that. And I want to be like, okay. So it gets fun. It gets fun when you get on that side of it. Yeah, I'm very pessimistic, actually, as well. I'm trying to be more optimistic because there's also a lot of legacy. We have this super nice vision: maybe eventually we'll have a stable data
format that everybody may be speaking. But, you know, the data format itself is complex, there are already 25 different data formats. The instrumentation libraries and their stability is the other thing, there's a lot of, like, complexities there. Like, you know, I gave an example, hey, why is gRPC not importing OpenTelemetry, but, you know, it's difficult to rely on a dependency like that and make sure that your versions are compatible with the upcoming versions of OpenTelemetry and so on. So all these problems are very hard problems at the end of the day. I've seen, though, a couple of projects become very successful.
One example is pprof. And I think I was very opinionated that we should have a very stable export data format, regardless of what the instrumentation libraries will be like. As soon as we document, we have a good spec and people know what type of data they should be exposing in the OpenTelemetry protocol, this project can succeed. Because of what I learned from pprof. pprof, the profiling project from Google, it actually is a data format and a couple of tools. And everybody is going and writing the instrumentation piece themselves pretty much.
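For reference, this is roughly what that looks like in Go today: the runtime produces profiles in the pprof format, and exposing them over HTTP is a blank import away.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Any tool that understands the pprof format can fetch profiles from here,
	// e.g. go tool pprof http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```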
For example, the Go runtime exposes, you know, pprof. It does all of its own thing and then produces the pprof export format. pprof is, like, everywhere. Like, I mean, it's maybe not the very, very globally, you know, accepted, like, one true profiling format or anything, but it's a widely adopted thing
Starting point is 00:24:03 and used very close to like even language runtimes because it has this like very stable data format and they don't necessarily care which instrumentation library you use. You're free to do whatever as long as you produce that. The other, I think, example is Prometheus. Prometheus export format is just everywhere right now. The entire world
Starting point is 00:24:25 relies on that. Even though, you know, there's this new upcoming thing, open metrics. But, you know, they've been able to achieve it by having the exposition format and keeping it very stable and like making no compromises about, you know, the stability. Yeah. And I think you're right. We have to keep optimistic about this, right? Because if we do take that pessimistic approach, we'll never get started. So it's got to start somewhere and it's got to get built out. And I think the cloud vendors...
Starting point is 00:24:54 It's good to be, I think, skeptic about this because there's also a lot of projects that fail. Right, but if you go too skeptical, it'll never take off, right? You got to start building it somewhere. And I think the cloud vendors taking some of this approach have a little bit of an advantage because it's a lot easier for the cloud vendor to say,
Starting point is 00:25:14 let's say if you're going to use Lambda, for example, you're going to use Lambda. Here's how you use Lambda. Here's how you'll get the metrics from it if you use it this way. If you don't use it this way, you're not going to get it. Whereas when people come to vendors like us, they're expecting they're going to pay for us
Starting point is 00:25:30 to get this data from whatever they have because we're a vendor doing it as opposed to being that cloud provider which says this is the scope in which it works. We're expected to work within any scope. So you have a little bit of an advantage in that case because you can more easily set the guardrails of the conditions for leveraging it.
Starting point is 00:25:49 So you definitely have an advantage there. Yeah, yeah, yeah. One of the, I think, interesting aspects of OpenTelemetry is it's a forum between also vendors like you and cloud providers at this point. So I just felt like it was very difficult to kind of have some of these conversations because I think as a cloud provider,
Starting point is 00:26:10 I was always having these conversations one by one, but there hasn't been a single place where people would come and agree and sort of like, hey, you are a vendor. What do you expect from Lambda? There was not an open forum for that. I think it's just kind of like maybe working against if the vendor wants to, you know, do differentiator products.
But at the same time, it kind of helps in terms of at least, you know, having some consensus. So at least we know what basically we should cover, and OpenTelemetry is providing that forum, which is, I think, unique in the space at least. Now, Jaana, I mean, collecting data is one thing, or exposing traces. The question is, where do we send the data to?
Starting point is 00:26:58 And then also what happens with the data there? I mean, we have been and other vendors in our space as well, we've been not only talking about auto-instrumentation, but then automatically detecting problems, automatically detecting anomalies. And it goes beyond what you typically see. And again, I'm referring back to the video that you showed,
Starting point is 00:27:19 which is great, that you have on the distro page, which is great, but it shows I'm a single user and I'm a developer. I'm clicking on a link and then I want to see that trace. And then I turn the database off and then I see the database has a problem. But obviously, in large, if you want to use this for production use cases, then collecting data is one thing.
Now, here's my question to you: is this something that AWS is also moving towards? Are you also moving towards the data backend and also the data analytics on top of it? Yeah, so we have a bunch of different initiatives in this area. None of them are, like, very, as far as I can understand, not everybody quite understands, but I can try to give a brief, you know, maybe summary. So, you know, the distributed tracing, maybe we can talk about distributed tracing.
Starting point is 00:28:19 The metrics, we have similar initiatives for metrics as well, though. But in distributed tracing, you know, our backend has been X-Ray. And X-Ray has been just a tool to kind of like, you can query, visualize, and all that. On top of that, there's a lot of different initiatives around anomaly detection at AWS. This is a hard topic, so it's just not like there's an easy solution. But there are a lot of teams trying to figure out
seasonal differences, or if there's a 90... One of the difficulties, I specifically want to talk about distributed tracing, because distributed traces are usually downsampled, so you don't always have full data. You may try to have it, but there's a performance cost for that.
And we realized that our customers, most of the time, 90% of the time, end up collecting very boring traces, but missing out on all these edge cases like the 99th and 95th percentiles and above. So one thing that we want to do is
also making collection a bit smarter. Let's try to collect as much data as possible and analyze it in the cluster before it's been exported to any backend: is it an interesting case, is it a 95th percentile and above type of case? So this is one of the initiatives we're trying to do. The other one is, like, the anomaly detection that I described. The other, like, there's, like, some other, like, initiatives, such as we simultaneously want to be able to take a look at, like, multiple signals, like, not just traces, but also
metrics. And this is one of the other reasons that we are interested in OpenTelemetry, because we want to be able to correlate things. So we want to have, like, more consistent labeling all across the, you know, telemetry data we collect. And if there are any interesting cases that we capture in metrics, for example, we want to be able to enable maybe more tracing collection or tweak our sampling strategy and stuff like that. These are all initiatives in flight. But as you mentioned, the difficulty is actually making sense of this data. It's so much data, and most of the time it's not very useful because
it's just repetitive, you're collecting similar data again and again. So what's valuable is being able to, you know, kind of give people: these are the interesting cases, and, you know, I have, like, five categories of interesting things for this particular critical path, maybe. So, you know, you can allow them to see what else is going on in their critical path. If you can alert on them, like by, you know, anomaly detection, that would be the next step, but anomaly detection in my experience has been a very, very difficult problem. Well, that gives me hope, Brian, that what we are doing at Dynatrace still has a long way to go.
We have a long way to go, too. I'm just making it sound like we actually have been... I mean, this is a hard topic for the entire industry. Andy, I was going to say... Sorry, go on, Jaana. You go, go ahead.
Starting point is 00:31:25 I was just going to say this. You said it gives you hope, Andy. It gives me fear that once we have the big arm of the cloud vendors going beyond collecting the traces and processing them, what will become of vendors like us? Oh, yeah, but the reason why I said I have hope is because we have been focusing on this for a long, long time.
And yes, obviously, you know, cloud vendors and others will catch up or will provide services. But I think in the end, it's collaborative what we're doing here. I feel like, you know, cloud vendors don't necessarily want to, you know, do all the, like, work
that you want to do, because it's such a, you know, there's so many things that we can do, but at the end of the day we care about our core platform, right? Maybe that's why we're more focused on collecting at this point, because we want to be able to collect and, you know, provide that data and enable, like, other, you know, companies to do the work with the data. Yeah, exactly, I see it the same way, right? I mean, you can focus on what you're strong in and why people come to you. And people may not come to you because you have the best end user monitoring and anomaly detection, but you can give the data.
Starting point is 00:32:33 But essentially, you help your customers to run the services and platforms reliably on your platform. Then you also have the data that they can then use with other tools. That's why I think we are super focused on core reliability. We want to have the core metrics, traces, being able to alert on those things. These are very fundamental for us because you as a customer should be able to come to our platform, be able to do things reliably at a minimal base. If you're looking for very advanced cases, you, you know, you might be also in the charge of, you know, also analyzing your data.
That's the other case that we never discussed. One of the interesting things about, like, I think, us wanting to speak OpenTelemetry is, hey, maybe we will also be able to export to S3, and, you know, you will be able to read your telemetry data in raw format and do whatever you want to do with that, right? Like, so that's sort of, like, the other goal. Yeah. Hey, I brought up the term AWS Distro a couple of times. Can
you quickly maybe give an overview of what that is all about? Because I think I never mentioned what it really is. A lot of people have been asking me questions about what the AWS Distro is. I mentioned a couple of times that we are thinking about deploying the collector to our platforms. The AWS Distro came as a necessity because of that. There are a couple of reasons that the Distro came around. The Distro is an open source project. I've seen other providers doing similar things, but in a closed-source fashion.
So let me briefly explain, I think, some of the technical challenges first, and then I can also explain why we are doing the Distro. So first of all, the collector in OpenTelemetry is written in Go. And the collector already supports, the proper upstream collector repo has the open source projects represented. You have Jaeger support, you have Zipkin, Prometheus support there. But everything that is related to vendors
is living in a different contrib repo. And in Go, in order to, you know, you have to build a static binary, so you can't really, like, say, hey, I just want this to be dynamically linked, like, I just want this particular vendor's, you know, component to be dynamically linked. So it became a necessity for us to actually have a binary, like a main Go function, so we can pull in all the important bits that we care about and we want to support. And that's how the Distro was born. And then there were a couple of other things. So AWS really cares about reliability, performance regressions, especially if we're going to
deploy the collector on behalf of the user, and also the security reviews. So everything that we do in the Distro actually goes through this release process. So we take the upstream, we put everything in the upstream. When we write and contribute code, we put it in the upstream. And then once in a while, when we're making a Distro release, everything goes through these reviews: performance regressions, security reviews, reliability issues, whether anything is backwards compatible or not. And then we cut the release based on that. We're trying to follow the upstream very closely. It's just really, like, our cadence is very similar to what upstream is doing.
But just because there's that process, there's the Distro. The other thing is, you know, this enables us to also link the partners', you know, exporters or, like, other components into the Distro. So, you know, you don't have to go and, like, rely on this contrib repo, which is not always well tested or, like, you know, not going through the same process. So it kind of helps the customers to be able to at least rely on the collectors. They know that it's going to work. That's where the Distro came from. It came from these technical challenges.
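A deliberately simplified sketch of the static-linking constraint described above. The types below are stand-ins, not the real collector API, but they show why a distro needs its own main module that enumerates, at compile time, the components it ships.

```go
package main

import "fmt"

// exporterFactory is a simplified stand-in for the real collector's component
// factories; the actual types are richer, but the compile-time idea is the same.
type exporterFactory struct {
	name  string
	build func() error
}

// components is the part a distro owns: because Go produces a statically linked
// binary, the only way to ship "upstream collector plus exactly these vendor
// components" is to enumerate them in your own main module and rebuild.
func components() []exporterFactory {
	return []exporterFactory{
		{name: "otlp", build: func() error { return nil }},
		{name: "awsxray", build: func() error { return nil }},     // hypothetical vendor component
		{name: "partner-apm", build: func() error { return nil }}, // hypothetical partner component
	}
}

func main() {
	for _, f := range components() {
		if err := f.build(); err != nil {
			panic(err)
		}
		fmt.Println("registered exporter:", f.name)
	}
}
```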
And I'm glad that it's an open source project. I mean, it doesn't do much. It's just upstream distributed, you know, in a different way. But it's just basically coming from the fact that, you know, it's difficult to vet, you know, all these components, because this is a very huge project. And the other thing is, I think people at least know that we provide support. People at least know that if there's an issue with any of the components there, we will be able to, you know, go in and fix it upstream if it's necessary.
We can't make this promise for the entire project. It's huge, and there are hundreds of different components to be reviewed by tens of companies. So that's how the Distro came around. It's trying to be just OpenTelemetry, but distributed by going through some of the reviews that I mentioned.
Starting point is 00:37:36 and I want to get started today with writing my app, deploying it on AWS, any of my, let's pick a combination of Lambda, Fargate, and maybe some database service in the back. Can all of this be handled today with OpenTelemetry? Do I get my end-to-end traces and my metrics from Lambda making calls to my microservices in Fargate and then the database? So, no, it's not, because we're still working on the end-to-end use cases. If you want to just do your user traces at this point, everything works. If you want to make calls to AWS services,
you still have to use AWS's trace context, for which we also provide OpenTelemetry support. So in OpenTelemetry, we provide propagators. If you do that connection, you will be able to see end-to-end traces on X-Ray. But there are still things that we want to improve in terms of providing more detailed traces from databases. So if anyone is using it and wants to give feedback to us, we are at the stage of prioritizing what else we want to expose from the managed services. And the next thing that we want to do is, right now, we only understand our own context propagation header,
but we want to switch to W3C Trace Context. So you don't have to do anything in terms of caring about the propagation, like, headers or anything. Just give us whatever OpenTelemetry produces, and at some point we will be able to understand that. And that's a very complicated project, and I'm working on it, like, this year and probably next year, and it may take longer than you think. Yeah.
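For an application developer, the switch described here is mostly a propagator choice. A small Go sketch that installs the W3C Trace Context and baggage propagators, plus the X-Ray propagator from contrib for services that still expect the X-Amzn-Trace-Id header; the contrib import path is an assumption about where that propagator currently lives.

```go
package main

import (
	"go.opentelemetry.io/contrib/propagators/aws/xray"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func main() {
	// W3C traceparent/tracestate plus baggage is the vendor-neutral default;
	// adding the X-Ray propagator lets the same app interoperate with services
	// that still expect the X-Amzn-Trace-Id header.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
		xray.Propagator{},
	))
	// Instrumented HTTP and gRPC clients and servers pick this up automatically;
	// no per-request work is needed once the global propagator is set.
}
```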
Starting point is 00:39:27 So I would assume the API gateway is a perfect service, right? It's going to, yeah. It's going to be parsed. It's going to be the starting point. In the ingestion, like at the entry boundaries, we're thinking about being able to accept trace context and convert it into the internal format in the moment because all these teams, it requires them very long time
to switch to different trace context headers. So maybe initially, at the API gateway or at any other entry point, we will be able to translate it to our internal format when it's internal. And if you are exposing it to the user, we want to translate it back to the trace context header. So that's sort of the goal initially.
Starting point is 00:40:12 But eventually we want to use it internally everywhere as well. So it's not visible to the user, but it's a technical detail for us that we also want to be aligning with the trace context. Is there anything else we've missed in this conversation? Remember, if we have people that are new to OpenTelemetry, if people that want to know the status of where it is, where it's going, any other things we should mention
Starting point is 00:40:42 or maybe some links where people can get started, maybe even contributing, seeing the status? What's a good way? Yeah, maybe we can talk a bit about also the status of the stability of the project, for example. That comes up as a question as well. So OpenTelemetry, given the number of companies involved, it's been trying to establish stability for a while. And it's now coming in phases. The trace, for example, has been the spec.
Yeah, maybe we can talk a bit about also the status of the stability of the project, for example. That comes up as a question as well. So OpenTelemetry, given the number of companies involved, has been trying to establish stability for a while. And it's now coming in phases. The trace spec, for example, has been stabilized. The instrumentation libraries will be stabilized. The next thing is the metric spec. The metric spec is soon to be stabilized. And we're trying to make the collector stable at the end of May. So starting with May, traces and metrics will be stable
Starting point is 00:41:40 or they didn't want to invest too much build and tools around this type of data because they didn't know if the data was going going to be breaking or not so it's a huge milestone for the project and the next thing that we you know we're going to do it with with logs it's in the early stage um maybe the data format is not going to change that much but we're going to formalize a few other things make sure that the work is you know going well in the you know client libraries.
Starting point is 00:42:09 And we want to build some sort of parsers for well-known formats and so on. So there's a lot of conversation going on for logs at this point in terms of what are the next steps are. But yeah, we can give a couple of links. Please take a look. Documentation might be lacking a bit because the project was trying to figure out the stable version. So I think a lot of work will be done from this point on in terms of contributing to docs. There will be better examples, more end-to-end examples, and so on. So if you see anything, you don't have to do code contributions.
Starting point is 00:42:44 Even just reporting those type of cases would be a good contribution. And Jana, yeah, we will be making, we make sure that we collect some links and put them on the podcast summary as well. We'll do this. Yeah. Let's, let me, let me, let me give you a couple of links. We can do it now or later. Perfect. Yeah, GitHub OpenTelemetry obviously is a great start. Obviously. Yeah, these are too obvious. But you know, spec is the important repo. The data model is in the proto repo, which is also an important one. The client libraries, you know, each of them are there.
There's so much going on, so it's kind of hard to... I mean, I can give you all the links, but... That's fine. I think an entry point is good for people. I think these three, like spec, proto, collector, and maybe you can give a couple of links to libraries for some of the languages at least, these are really good entry points. And the SIG meetings, like, actually are super useful if they want to contribute, and they are all here. You probably know about this repo as well. There is a calendar section where, you know, you can see all the meetings. And, yeah, there's, like, you know, KubeCon coming up, and there's another conference, o11yfest, where there will be a day focused on OpenTelemetry. Maybe we can also mention that, like, the second day will be OpenTelemetry specific. So thanks for that. So folks that are listening, we'll add all these links to the podcast summary so you can click on it. Hey, Brian, do you have anything else? Yeah, I just wanted to ask about, from what I understand,
Starting point is 00:44:55 X-Ray is using open telemetry, right? It's one of the pieces. Now, that's obviously not going to have any auto-instrumentation in it yet, but Andy, I believe we have, what is it, a propagator or a collector for X-Ray as well? Yeah, Exporter. Yeah, Exporter, yeah. So if anyone's running anything on that and they wanted to play around with open telemetry, if they own Dynatrace and want to have us analyze all the data, they can get started already with those two. I think so, right?
I think I've seen a blog or two of ours on that already. I just haven't had a chance to play with it because, well, I'm as busy as everything. But it looks really interesting. I'm really excited to see what comes out of this. Although I know earlier I mentioned my fears of what happens when the cloud providers try to take over,
Starting point is 00:45:43 but I think on this tracing level, the automation instrumentation side of things, I think it's going to be really awesome as things go along. Because as Andy said, it will make our side of the house a little bit easier. Let's say we even got to a world where all the cloud providers did all the instrumentation. And we didn't have anybody on-prem anymore.
Starting point is 00:46:10 And I'm speaking fantasy world, right? But that would give the vendors and all of us the ability to solely concentrate on analyzing and alerting and AI for the data and all that. So I don't really see a net loss in anything for these things. As we see when automation and pipelines come in, it's all about moving forward and moving with the thing. So just another example of the ever evolving technology space that we live in, which is just crazy. As time goes on, it just gets crazier and crazier.
Starting point is 00:46:41 I tell you, I sound like an old man, but it's just insane. I feel like, you know, I'm tired of these things, right? Like in the last five years, I was like, I just want to retire. It's like the same thing again and again, right? You feel like you're not sure if it's any progress or it's just the same thing repeating again with different people. Well, that's the funny thing we see with Kubernetes, right? There was the idea of like, oh, Kubernetes comes out,
Starting point is 00:47:09 you just throw everything up there and it runs. Well, no, it doesn't. And then you start having your Kubernetes code as the new infrastructure to maintain. Everything just moves, you get, but as we discussed on previous episodes, there are definite benefits that come along with these changes. The workload shift, the model shift, there's a lot of similarities because it's just a
morphing model. But the benefits, I mean, if you think about Kubernetes and containers and all that, the world of automation that it opened up has just been absolutely incredible. But now you've got to write all your automation scripts. Yeah. And, you know, like, I was super skeptical about Kubernetes because, you know, hey, we're trying
Starting point is 00:47:48 to do this again. You know, it's the complex API surface because it tries to do everything at once. And, you know,
like, most of the people just don't care about all these use cases. This is more like an API for cloud providers. But, you know, it kind of enabled this big ecosystem and this entire, like, you know, new area of, like, automation.
And, like, it's a cluster-wide automation that just didn't exist before. So, you know, it just kind of, like, changed a lot of things. There's a lot of complexity in Kubernetes, but it also, that complexity enables some of these things that were not possible before, right? It enabled Keptn. There you go. That's our open source project that brings automation.
And Andy always has to mention it. It's like, it wouldn't be an episode if Andy doesn't mention it. I used to have, when we first started with Keptn, I used to, I didn't get to do it because I was recording in the morning, but it was the idea of a drinking game every time Andy mentions Keptn. So I'll just drink some of my water here. All right. I think it's time to wrap it up. Jaana, thank you so much for being on the show.
Starting point is 00:48:56 I hope to have you back. Thank you for waiting for a long time. It was worth it. Absolutely worth it. I'm happy to wait for one more year, and then we have you back for what happened in the last year where are we with OpenTelemetry I think that would be a good checkpoint
Or we can talk about databases. We don't have many shows on databases, so that would be great. If you're up for another show, we can record it in the next couple weeks and say, let's refresh your memory
Starting point is 00:49:25 on database performance. No, let's do it if I pivot back to databases. Okay. So, you know,
Starting point is 00:49:31 I sound legit enough. Like, I need to work on the, you know, area. I have too many opinions on, you know,
Starting point is 00:49:37 on that field, but I think let's do it when I'm back on databases. It's good to be an opinionated platform. But I'll challenge any of our listeners.
Starting point is 00:49:48 If our listeners are very knowledgeable in, or I shouldn't even say very knowledgeable because no one ever thinks they're knowledgeable enough. If you have knowledge on database performance, let us know. Maybe we can have you on and discuss it because I think it's a topic that doesn't get enough attention. Any final thoughts, Jana?
Starting point is 00:50:07 I mean, it was great to be here, but we just really pointed out a lot of the pain points. As much as you, I'm also a pessimist, but trying to be an optimist at the same time because I've seen some of these projects actually shift things in the long term. So I hope that it turns out to be a good
Starting point is 00:50:30 outcome. Critical optimist is a better word than pessimist. I was just going to say I was a pessimistic optimist. I hope for the best but expect the worst. This way I'm never disappointed. Yeah. Yeah. Alright, well thanks for being on the show. I'd like to thank our listeners for listening. this way I'm never disappointed yeah yeah
If anybody has any questions or comments, please reach out to us at pure underscore DT. Jaana, do you do any social media that you wanted to share? Or... Yeah, I have a Twitter account, or, you know, I don't know.
Starting point is 00:51:01 I'm trying to if people want to reach out, they can use this. Alright, we'll have that in the show notes for anyone who wants to follow that. And thanks everyone for listening. We'll be back soon. Bye-bye.
Starting point is 00:51:19 Thank you. Bye.
