Software at Scale - Software at Scale 47 - OpenTelemetry with Ted Young
Episode Date: May 26, 2022

Ted Young is the Director of Developer Education at Lightstep and a co-founder of the OpenTelemetry project.

This episode dives deep into the history of OpenTelemetry, why we need a new telemetry standard, all the work that goes into building generic telemetry processing infrastructure, and the vision for unified logging, metrics and traces.

Episode Reading List

Instead of highlights, I’ve attached links to some of our discussion points.

* HTTP Trace Context - new headers to support a standard way to preserve state across HTTP requests.
* OpenTelemetry Data Collection
* Zipkin
* OpenCensus and OpenTracing - the precursor projects to OpenTelemetry

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
 Transcript
    
                                         Hey, welcome to another episode of the Software at Scale podcast.
                                         
                                         Joining me today is Ted Young, the Director of Developer Education at Lightstep and a
                                         co-founder of OpenTelemetry.
                                         
                                         Welcome.
                                         
                                         Thank you.
                                         
                                         Glad to be here.
                                         
                                         Yeah.
                                         
                                         So I want to start with getting to know your background, right?
                                         
    
                                         You've done so many things.
                                         
                                         You used to do something with animation.
                                         
                                         Now you're the director of developer education,
                                         
                                         which is a role I haven't heard of before.
                                         
                                         So how did you get here?
                                         
                                         Yeah, I do have a funny background
                                         
                                         of kind of switching what I'm up to every seven years
                                         
                                         or something like that.
                                         
    
                                         I used to be a computer animator, actually.
                                         
                                         I got a computer science degree at Tufts but was mostly interested in film and animation,
                                         and helped run a small post-production studio for a number of years.
                                         That was a lot of fun, but I got really into environmental stuff and some civil rights
                                         stuff, and began working on internet software to help some of the social movements that I
                                         was part of. That turned into a full-time job at a small consulting group called Radical Designs.
                                         And that was a lot of fun.
                                         
                                         Eventually, I wanted to get deeper into how computers actually work and start
                                         
                                         solving some of the problems that I was having as an application developer working on the internet. And so I started working
                                         
    
                                         on like container scheduling systems, like platform, what we call platforms, but I think
                                         
                                         of as distributed operating systems. And I did that for a while. After that, moved over here
                                         
                                         to observability to scratch the other itch, which, you know, through my entire
                                         
                                         career, whether I'm like rendering visual effects or, you know, trying to make scalable web apps or
                                         
                                         trying to make a container scheduler work, you know, the question, why is it slow? Why is it slow?
                                         
                                         Just keeps coming up over and over again.
                                         
                                         And it's always such a pernicious question to answer that getting involved with observability [...] and in particular, [I met] the founder of Lightstep a while back.
                                         
                                         And we really hit it off and kind of saw eye to eye about how, not just how you do that,
                                         
    
                                         but what some industry standards needed to exist to really help make doing that easier.
                                         
                                         And so I joined up at Lightstep and have ended up with various titles that basically amount to: go run the OpenTelemetry project and talk to people about it.
                                         So that's basically what I do for Lightstep and what I've been doing for the past five years.
                                         
                                         You know, the whole idea of changing up your career every seven years. It sounds like a great idea to me.
                                         
                                         It definitely keeps things fresh.
                                         
                                         That's for sure.
                                         
                                         And going from computer animation to something as, I guess, platformy, like distributed operating
                                         
                                         systems, as you basically said, that sounds tough.
                                         
    
                                         Like, how did you manage that transition?
                                         
                                         You know, it's funny.
                                         
                                         On the surface, it seems very different.
                                         
                                         But actually, in the world of computer animation and visual effects,
                                         
                                         it's kind of like a small niche world.
                                         
                                         But most of the problems of distributed systems and scalability
                                         
                                         that people are hitting on the internet now are
                                         
                                         problems that 3D shops and visual effects shops like Industrial Light & Magic and places
                                         like that have been dealing with for a long time, which is, you are trying to do a
                                         massive amount of computation. When you think about rendering out,
                                         say, a movie or commercial or visual effects shot, you have all these different compositing
                                         layers, all these different 3D elements, and you have to render out all those layers for every
                                         frame of the thing you're trying to make, which is a massive amount of computational power.
                                         And so you end up with what's called a render farm, which is basically a server farm, to farm all of that work
                                         
                                         out to. And when you do that, you run into all of these like pipelining and throughput and latency
                                         
                                         problems of, I want to see these particular things really fast, or I want to get it all done in time to meet this deadline.
                                         
                                         So I want the most throughput that I can have.
                                         
    
                                         And you kind of end up having to solve these scheduling problems
                                         
                                         that are sort of like what we call MapReduce these days.
                                         
                                         But the problem of I have to get all of the relevant bits to do the job to all the different machines that I want to work on it.
                                         
                                         Getting the bits to the machines takes time and takes up space.
                                         
                                         And then I need to figure out what the most efficient way is to share these common resources, CPU, RAM, memory, network, all of that, between all the different jobs that
                                         
                                         I want to do. And so the kind of algorithms you come up with, the kind of approaches you take,
                                         
                                         end up being pretty similar in practice to turning around and saying like, okay, now I want
                                         
                                         to schedule a bunch of apps to run on a bunch of servers.
                                         
    
                                         And I want to think about the efficient way to download the images for all of those apps.
                                         
                                         And I also want to think about reliability and scalability and redundancy and stuff like that.
                                         
                                         And also managing resources like CPU and RAM
                                         
                                         across all the different apps and services I want to be running.
                                         
                                         It ends up being a more similar set of problems
                                         
                                         than it might look like on the surface.
                                         
                                         Does that experience or does that interest,
                                         
                                         is that what led to thinking about OpenTelemetry
                                         
    
                                         or was there something else in that story? How did you find yourself in the position
                                         to start an initiative like that?

                                         Yeah, I mean, I think the same way I got into platform and container scheduling stuff,
                                         getting into observability was because I hated my tools. I hated using them,
                                         and I wanted them to be better. And when I did some research, it seemed like there absolutely
                                         were known better ways to do some of these things. And in both cases, it also seemed like
                                         an industry technology wave was coming along, where there was actually an opportunity
                                         to get out there and implement some of these better ideas and have them see some adoption in the field.

                                         Okay, so I think that's the story with a lot of people who work on developer tools. It's just,
                                         I had my first job as XYZ engineer, but I felt like I could be so much more productive if
                                         the systems I worked on were better, and then you go down that rabbit hole more and more and more.
                                         That's certainly what happened to me. So maybe you could tell listeners a little bit about the
                                         project, what it's aiming to do, the kind of consensus it's trying to drive in the industry.
                                         
                                         Just the background of the project to start would be interesting.
                                         
                                         Yeah, absolutely. [The problem OpenTelemetry is trying to] solve is coming up with a standard
                                         language for all computer programs to use to describe what it is that they're doing, with a
                                         focus on computer programs that are actually distributed systems. So, networked applications
                                         where you need to talk to a large number of computers in the process of getting anything done.
                                         The fact that you need to talk to a bunch of different computers, but actually be able to
                                         trace back all of those interactions across all those computers, adds an extra layer of
                                         difficulty to capturing that information correctly. And that's actually the part that I found
                                         was missing the most in our traditional tooling, mainly because it's more painful to set up
                                         and install and otherwise get these what are called context propagation mechanisms working.
                                         So the OpenTelemetry project is a project to design and build all the tools you would need
                                         to emit that kind of modern telemetry.
                                         
                                         And it's designed in such a way that it can be very stable and very amenable to being embedded
                                         
                                         in shared libraries and shared services.
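As an illustrative aside on context propagation: the sketch below builds and parses a `traceparent` header in the W3C Trace Context format listed in the episode reading list (version 00 only). The function names `new_traceparent`, `parse_traceparent` and `child_traceparent` are made up for this example and are not OpenTelemetry APIs; a real service would normally rely on an OpenTelemetry propagator rather than hand-rolled code like this.

```python
import re
import secrets

# Version-00 header layout: 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled=True):
    """Start a new trace with a random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, flags), or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the Trace Context format.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, flags


def child_traceparent(incoming):
    """Header for an outgoing request: keep the trace ID, mint a fresh span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        # No usable context arrived, so this service starts a new trace.
        return new_traceparent()
    trace_id, _parent_span_id, flags = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Each service that receives a request would call something like `child_traceparent` before calling the next service, which is what lets a backend stitch the interactions across machines back into one trace.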
                                         
    
                                         So this was kind of actually another problem I had,
                                         
                                         which is I wrote a lot of open source software,
                                         
                                         like open source libraries and services and things.
                                         
                                         And you kind of hit this weird wall with observability, which is I would like to give you logs and traces and metrics,
                                         
                                         but I don't know how because my library has to integrate with
                                         
                                         all the other libraries you the application owner are running when you're compiling your
                                         
                                         application together. So even though I might have a good idea about what a good metrics library is,
                                         
                                         or good logging library, it's not very helpful for me to pick those things because it's really the
                                         
    
                                         application owner who has to pick where all that data is going and what format it should be in and
                                         
                                         all of that stuff. So one thing the OpenTelemetry project focuses on is being very friendly with all the existing observability and analysis tools.
                                         
                                         So it's, in fact, there is no backend for OpenTelemetry.
                                         
                                         The idea with the project is we focus on a set of observability pipeline tools.
                                         
                                         So APIs, like client implementations, a thing called the collector,
                                         
                                         which is sort of this awesome Swiss army knife that you can run for processing all of this
                                         
                                         telemetry. But the end goal is to be able to, in addition to, you know, producing what we think is
                                         
                                         next generation telemetry, be able to like receive and export telemetry in any format.
                                         
    
                                         And to make all the different pieces of OpenTelemetry work together as a system,
                                         but also be useful as standalone pieces.
                                         So we really are trying to make something that is helpful for everybody and
                                         doesn't go around imposing kind of like a straitjacket on people
                                         
                                         where if you want to use one part of it, then suddenly you're stuck running OpenTelemetryDB
                                         
    
                                         somewhere or something like that.
                                         
                                         Okay, so let me replay some of that to see if I've understood it correctly.
                                         
                                         Let's say that I'm a vendor like Datadog, right? That's the one observability tool, I guess, that I'm familiar with.
                                         
                                         It makes sense for me to integrate with OpenTelemetry because everyone else will be
                                         
                                         speaking in this language, essentially, and then all of my customers can use
                                         
                                         any system that integrates with OpenTelemetry automatically,
                                         
                                         and I don't have to build a collector for each one. Is that roughly correct to start with?
                                         
                                         Yeah, that's correct. So what this is solving for application developers and
                                         
    
                                         open source library authors, end users, if you will, is they don't really like
                                         
                                         vendor lock-in, right? Like people don't want, like instrumentation in particular is this
                                         
                                         cross-cutting concern that just gets embedded everywhere, right? You just have all these API
                                         
                                         calls all over the place when you add logging
                                         
                                         and metrics and all this stuff to your system. And if it's going to be a good system, all that
                                         
                                         stuff needs to be coherent. And it's a bummer to have to rip all of that out just because you want to use a different analysis tool to look at your data.
                                         
                                         So end users are interested in switching to OpenTelemetry because it's sort of a
                                         
                                         write once, read anywhere kind of solution. There's enough industry adoption from all the major vendors and cloud providers.
                                         
    
                                         Most of the big shops are involved directly in OpenTelemetry because we all kind of agreed that this would be better for everybody.
                                         
                                         And so even if you're not super interested in the project, what you're going to see over time is more and more of your end users coming to you saying, you know, we've already instrumented everything with OpenTelemetry and we've got a nice OpenTelemetry pipeline that we like, and we just want to point the fire hose at you and have
                                         you give us useful analysis. Can we do that, please? And I think over time, it's not even over time, it's already
                                         the case that almost everyone is like, well, yeah, we would love to ingest OpenTelemetry data in
                                         
                                         that case, because it makes it really easy for us to onboard these new customers. And if instead
                                         
                                         we tell them, no, no, no, you can't just send us your data. You
                                         
                                         have to go back in and rip everything out and replace it with all our vendor specific stuff.
                                         
                                         That is like a pattern that as time goes on, developers are less and less interested in.
                                         
    
                                         I think just in general, proprietary code being, even if it's quote unquote open source, but it's like de facto
                                         
                                         proprietary because it's really designed to work just for one company or one backend.
                                         
                                         That's the kind of stuff people really don't want to put in their code base because it's kind of
                                         
                                         tying them to a particular service. So I think that's one of the reasons.
                                         
                                         There's a couple of reasons why OpenTelemetry is really interesting,
                                         
                                         but that's like one very practical reason is I want to instrument once,
                                         
                                         have the data coming out of my system be very standard and regularized,
                                         
                                         and then be able to send that data off to a variety of different tools that all know how to properly ingest that regularized standardized data.
                                         
    
                                         Yeah, that makes a lot of sense.
                                         
                                         I think any developer who's gone through a painful migration,
                                         
                                         especially for something like telemetry,
                                         
                                         where you have no idea how useful a particular log line is or a metric is,
                                         
                                         but you do need to migrate all of them.
                                         
                                         I've been through a migration like that once.
                                         
                                         It's like you never want to do that again.
                                         
                                         Exactly.
                                         
    
                                         And so with OTel, part of my pitch for people is,
                                         it's like the final migration.
                                         If you do the work to move over here,
                                         and you can do that work progressively too, because with the collector, for example, you could just take the data in there, as a middleman for all of your existing data.
                                         And now you have something more like a router or switcher that can convert and regularize data that's coming in from different kinds of sources.
                                         And that's like an initial baby step.
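As an illustrative aside on the "router or switcher" idea: the toy sketch below accepts records in two different shapes, regularizes them into one common shape, and fans them out to any number of exporters. Everything in it is hypothetical (the simplified StatsD-style line, the made-up in-house JSON shape, the `Pipeline` class and its method names); it only shows the pattern and says nothing about how the OpenTelemetry Collector is actually implemented or configured.

```python
import json


def from_statsd_line(line):
    """Convert a simplified StatsD-style line such as 'requests:3|c'."""
    name, rest = line.split(":", 1)
    value, kind = rest.split("|", 1)
    return {"name": name, "value": float(value), "kind": kind, "source": "statsd"}


def from_legacy_json(payload):
    """Convert a hypothetical in-house JSON metric to the same common shape."""
    data = json.loads(payload)
    return {
        "name": data["metric"],
        "value": float(data["val"]),
        "kind": data.get("type", "g"),
        "source": "legacy_json",
    }


class Pipeline:
    """Receivers regularize incoming data; every exporter sees the same shape."""

    def __init__(self):
        self.receivers = {}
        self.exporters = []

    def add_receiver(self, fmt, convert):
        self.receivers[fmt] = convert

    def add_exporter(self, export):
        self.exporters.append(export)

    def ingest(self, fmt, raw):
        record = self.receivers[fmt](raw)
        for export in self.exporters:
            export(record)
        return record
```

In this sketch, services that have not been migrated keep sending their old formats through a receiver, while pointing the data at a different analysis tool only means adding another exporter callable.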
                                         
                                         So then as you are progressively swapping out your instrumentation
                                         
                                         for OpenTelemetry clients and OpenTelemetry instrumentation,
                                         
    
                                         that new stuff is still going to the same collectors
                                         that are getting data from the services you haven't migrated yet.
                                         And it makes for a little bit more of a smooth rollout.
                                         So we put a lot of thought into the kind of practical aspects
                                         of rolling out and managing telemetry.
                                         And I think that's part of the reason why some of the OpenTelemetry tools
                                         are getting popular with people,
                                         even if they haven't been able to fully migrate over
                                         to OpenTelemetry instrumentation.
                                         
                                         Yeah, the final migration just, it sounds too good to be true
                                         
                                         for like an infrastructure engineer, but it's a great pitch.
                                         
                                         Never say final, right?
                                         
                                         Like, I think that's the thing everyone learns,
                                         
                                         like at some point, never label a document as final.
                                         
                                         Final, final.
                                         
                                         Yeah.
                                         
    
                                         Final, final, final V2.
                                         
                                         Really final.
                                         
                                         Yeah.
                                         
                                         Yeah, yeah.
                                         
                                         But I do see it that way because we are really trying to make a standard.
                                         
                                         And we have a lot of buy-in already from a lot of like the appropriate groups you
                                         
                                         would want buy-in from.
                                         
                                         We've got a lot of organizational structure that makes that effective.
                                         
    
                                         And we've also put a surprising amount of thought into how the
                                         code is actually structured and packaged in a way that makes maintaining very strict backwards compatibility
                                         and stability much more feasible. So basically assuming that instrumentation is
                                         never going to get updated again, but you want to be able to move to the latest version of the clients and the rest of the
                                         pipeline so that you can get new features, but also security updates and things like that.
                                         But you don't ever want to have to touch old instrumentation when you do that.
                                         That's something that we care a lot about. Dependency conflicts, too.
                                         
                                         So the instrumentation API packages
                                         
    
                                         don't come with any dependencies.
                                         
                                         It's sort of just like an interface layer.
                                         
                                         So if you are like an open source library
                                         
                                         and you add open telemetry instrumentation to your library,
                                         
                                         you're not secretly taking on like a gRPC dependency under the hood
                                         
                                         that's then going to cause your library to conflict with some other library
                                         
                                         that also has a gRPC dependency that's like incompatible.
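As a rough illustration of the "interface layer" idea described here, the following Python sketch shows a library instrumenting against a dependency-free API while the application decides which implementation to install. All names (`Tracer`, `NoopTracer`, `RecordingTracer`, `set_tracer`, `library_fetch`) are hypothetical and are not the actual OpenTelemetry API.

```python
from typing import List, Protocol


class Tracer(Protocol):
    """Interface that library code instruments against (hypothetical)."""

    def start_span(self, name: str) -> None: ...


class NoopTracer:
    """Default implementation: does nothing and pulls in no dependencies."""

    def start_span(self, name: str) -> None:
        pass


class RecordingTracer:
    """Stand-in for a real SDK that only the application would install."""

    def __init__(self) -> None:
        self.spans: List[str] = []

    def start_span(self, name: str) -> None:
        self.spans.append(name)


_tracer: Tracer = NoopTracer()


def set_tracer(tracer: Tracer) -> None:
    """Called once by the application owner, never by a library."""
    global _tracer
    _tracer = tracer


def get_tracer() -> Tracer:
    return _tracer


def library_fetch(url: str) -> str:
    """Library code: depends only on the interface above."""
    get_tracer().start_span("library_fetch")
    return f"fetched {url}"
```

Under this assumption, a library that calls `library_fetch` works identically whether or not an implementation is installed, and any heavier dependency (an exporter, a network client) lives only behind whatever the application passes to `set_tracer`.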
                                         
                                         That kind of thought is, I think, really important. If you are going to convince people that native instrumentation is a good idea, you really want to make sure you're not inflicting dependency conflicts and compatibility issues on them when you do.

                                         If you've seen that xkcd where there's this idea of, like, oh, there are already 37 standards, I'm going to introduce one more, and now there are 38 competing standards: how do you kickstart a project like this without making things worse?

                                         No, I've actually never seen that xkcd comic. No one has ever sent me a link to that xkcd comic, ever. Yeah, no, I definitely know the comic of which you speak, and it's a very important point. A question people ask is, how are you not just making it worse?

                                         One truism is that real standards take time. And one of the reasons they take time is that they require consensus. And the only way I believe you avoid being just the 38th standard, well, I guess there are two ways. One is just overwhelming force. You are some force within the industry, ecosystem, whatever, that carries so much weight that you can just inflict your opinion on everyone, and they are just going to have to take it because it's just easier to go along with what you say. I don't like that, but that does happen from time to time. You know, looking at you, systemd.

                                         But there's also a situation that's more like the IETF approach, where you just get all of the interested parties together and say, look, we really are making a forum to hear
                                         
                                         everybody out and try our best to come up with a solution that works for everybody, without being compromised in a way where it's just kind of shoddy. And that takes work.

                                         It takes a willingness from the people at the core of the project to see their role differently. The way I see my design role is not that I come up with awesome ideas and then go push them on people. It's more that I absorb everyone's requirements and then try to think hard about a design that would actually be clean, but also meet all of those requirements, rather than saying, oh, your requirement's annoying, what if we just didn't do this thing and you had to change?

                                         So one part of the success of getting people involved
                                         
                                         was a willingness of the core people to organize the project more in that manner. And that ended up attracting a large number of people, first to the OpenTracing project and then also the OpenCensus project, which was mostly a Google-led effort. Microsoft also got involved there.

                                         And then we started lobbying each other, leaders in both of those projects, to merge those two projects, because that seemed like the final bit: if we can settle our differences and find a way to merge these two projects, then we can really have a standard. Everyone at that point had sort of rolled up into one of two balls, and so now it was like, well, we've got to roll these two balls up into one ball, and then we'll have something that really looks like broad industry consensus on how this stuff should work.

                                         And that honestly ended up being the best of both worlds, because those two projects did have a very similar pedigree, but also a different enough way of looking at the world. In other words, both were looking at maybe a slightly different list of requirements, and when you put those two lists together, you've got the complete list of requirements. That became the OpenTelemetry project. And the merging of those two groups was also the final starting gun that caused everyone who had not gotten involved yet, but cared, to come over and get involved.
                                         
                                         I want to take the perspective of an engineer who's not an infrastructure engineer
                                         
                                         at all, right? Like I'm a product engineer. My job is to deliver value to customers as quickly
                                         
                                         as possible. Why should the way telemetry is being shipped from my app to whatever vendor or tool I'm

                                         using, why should a change in that make me excited? Is there something that OpenTelemetry does that is just unique and different?
                                         
                                         How is it pushing the boundary and helping us understand our software better?
                                         
    
                                         There's a large transition happening in observability, from what I would call the traditional three pillars model of observability to a more modern observability model. I've been describing it as a single braid as opposed to three pillars.

                                         And to be clear, very smart people will go out there and talk about this three pillars model and say things that are very useful as to how you should set up and operate a traditional observability system. So I don't want to imply that those people are wrong or that what they're saying is bad in some way.
                                         
                                         But what I don't like about saying the three pillars model, with tracing, logging, and metrics each being a pillar in the Parthenon of observability, is that it makes it sound like all of this is intentional somehow. Like there was a design plan in play, and having these three totally siloed streams of data is anything close to a good idea or how you should build a telemetry system.

                                         And the answer actually is that that is not a good way to structure your data or build a telemetry system. It's a very bad way to do it, having all of these data streams be totally isolated from each other. Having some of them, like logging, for example, be very unstructured. Having other ones, like tracing, be very crudely and heavily sampled. All of this stuff ends up creating a lot of extra work for operators who are trying to look at this data and figure out what their system is doing.

                                         Because you don't go solve your logging problem by looking at your logs and your metrics problem by looking at your metrics. You're trying to figure out what your system is doing, and then come up with a root cause hypothesis, and then go remediate it.
                                         
    
                                         And in order to do that, you are trying to synthesize
                                         
                                         information out of all of this data. You're using all of these tools together and you're kind of
                                         
                                         moving around between all these different ways of looking at your system. And if you do that in a
                                         
                                         world where this data is poorly structured, it's not organized into like a unified graph of data that represents
                                         
                                         the topology of your system that represents the causality of operations in your system
                                         
                                         that has a way of correlating between aggregate data like metrics and transactional data like all the logs in this particular
                                         
                                         transaction, you have to do that work anyways. And if your tools can't do that work because the data
                                         
                                         just isn't structured in such a way that your machines can put it all together for you,
                                         
    
                                         you end up doing that all in your head, right?
                                         
                                         So you end up trying to find correlations between different graphs by using your eyeballs
                                         
                                         to look at squiggly lines on a screen.
                                         
                                         We have all this computing power, and we're trying to find correlations between metrics by looking at squiggly lines with our eyeballs. That's crazy.

                                         Or trying to find all the logs in one particular transaction. And by transaction I don't mean database transaction. I mean someone clicks the checkout button on their mobile app, which triggers an HTTP request to some front-end service, which triggers a cascade
                                         
    
                                         of requests to various back-end services, and even some services after that, and kicks
                                         
                                         off some background jobs, and does all this other work.
                                         
                                         And then you have some problem or error way down deep in that stack, And you want to just look at all the logs that
                                         
                                         were part of that one transaction. But that means looking at logs that came from 12 out of 500
                                         
                                         computers that you're running. I don't know how much operational background you have, but I think if you have much, you instantly know that's really annoying, grepping through your logs and trying to just filter down to find a particular transaction.

                                         It just turns the human operator into the glue that's trying to glue all this data together, instead of emitting properly structured data where those graph relationships are already modeled properly in the data, and your analysis tool can just automatically glue all that stuff together for you without you doing anything. That then frees you up to just look at all the data and do the human-centric work of trying to perform a subjective analysis about what the real problem might actually be.

                                         Okay, so the idea of OpenTelemetry is that since it can be pervasive across
                                         
                                         every single way you collect data, and as long as you thread through the right things, like a request ID or something like that, it can help you stitch all of the relevant data together for a particular transaction, as you mentioned, like a particular web request or a particular job.

                                         And then the next part, which is visualizing that or showing it in a way that's debuggable,
                                         
                                         can be handed off to a visualization tool. But because of the way all of the data is structured
                                         
                                         from OpenTelemetry, it makes putting all this data together easier. Does that sound correct?
                                         
                                         Yeah, it does. So to get into some of the details about how that works,
                                         
                                         when you have a distributed transaction,
                                         
                                         so you have a request from the end user
                                         
                                         that ends up touching a bunch of computers
                                         
                                         in many tiers of backends,
                                         
    
                                         like, 20 computers are involved, whatever. If you want to be able to index all of the logs that were part of that particular transaction,

                                         you need a transaction ID, right? You need every log that gets emitted as part
                                         
                                         of that transaction to be connected to that transaction ID. And in order to do that, you
                                         
                                         have to have some way of passing that transaction ID along the execution path. So as the code executes through your program,
                                         
                                         the transaction ID needs to follow along with it. And so we call that context. So context is within
                                         
                                         a program runtime, some way of associating the execution context with a bag of environment variables,
                                         
                                         you might think of them as,
                                         
                                         that are specific to that particular execution context.
                                         
    
                                         And that means when that context jumps to another thread,
                                         
                                         that context bag has to jump along with it.
                                         
                                         If there is some kind of userland stuff

                                         like Tornado or gevent or something happening
                                         
                                         on top of the threading that's going on,
                                         
                                         you know, that system needs to manage
                                         
                                         keeping track of these contexts
                                         
                                         when it's switching coroutines and stuff like that.
                                         
    
                                         And that's tricky.
                                         
                                         Not very many programming languages
                                         
                                         have that concept fully baked into them.
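Python happens to be one language with a built-in mechanism of this kind. The sketch below uses the standard `contextvars` module to show a "context bag" following two interleaved requests through coroutine switches; the variable name `transaction_id` and the request handlers are made up for illustration.

```python
import asyncio
import contextvars

# Hypothetical context slot holding the current transaction ID.
transaction_id = contextvars.ContextVar("transaction_id", default=None)


async def write_log(message: str) -> str:
    # Deep in the call stack: the ID was never passed as an argument.
    return f"[txn={transaction_id.get()}] {message}"


async def handle_request(txn: str) -> str:
    transaction_id.set(txn)
    await asyncio.sleep(0)  # yield so the two requests interleave
    return await write_log("checkout started")


async def main() -> list:
    # asyncio runs each coroutine in its own task with its own context copy,
    # so a value set while handling request A never leaks into request B.
    return await asyncio.gather(handle_request("A"), handle_request("B"))


results = asyncio.run(main())
```

Here `results` holds one log line per request, each stamped with its own transaction ID even though both coroutines ran on the same thread.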
                                         
                                         And the other thing you have to do
                                         
                                         is whenever you make a network request,
                                         
                                         so the work now is passed to
                                         
                                         another computer, and this computer is now sitting there idling, consuming resources, but idling,
                                         
                                         waiting for this other computer to do some work and give it some information back. You want all
                                         
    
                                         the logs that are on that other computer when it's performing this transaction, which means you have to now take that transaction ID in that context
                                         
                                         and staple it to that network request.
                                         
                                         So if it's HTTP, you would put it in an HTTP header
                                         
                                         so that on the server side on the other end,
                                         
                                         the controller action or whatever it is that's handling that request
                                         
                                         can pull that
                                         
                                         transaction ID out of the header, attach it to the context, and then continue on its merry way.
                                         
                                         And so that fundamental system of context and then propagating that context to the other servers that are part of the transaction, that is fundamental to what is traditionally called distributed tracing.
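The "staple it to the network request" step can be sketched as a pair of functions: one that injects the ID into outgoing headers and one that extracts it on the receiving side. This is a minimal illustration, assuming a made-up header name `x-transaction-id` and simulating the remote server with a fresh, empty `contextvars.Context`; it is not how any particular library implements propagation.

```python
import contextvars

transaction_id = contextvars.ContextVar("transaction_id", default=None)

HEADER = "x-transaction-id"  # hypothetical header name, for illustration only


def inject(headers: dict) -> dict:
    """Client side: copy the ID from the local context into the headers."""
    txn = transaction_id.get()
    if txn is not None:
        headers[HEADER] = txn
    return headers


def extract(headers: dict) -> None:
    """Server side: pull the ID out of the headers into the local context."""
    txn = headers.get(HEADER)
    if txn is not None:
        transaction_id.set(txn)


def handle_on_server(headers: dict) -> str:
    extract(headers)
    return f"server sees txn={transaction_id.get()}"


# Client: we are inside transaction "txn-42" and make an outgoing request.
transaction_id.set("txn-42")
outgoing = inject({"accept": "application/json"})

# Server: starts with an empty context, as a separate process would.
server_context = contextvars.Context()
reply = server_context.run(handle_on_server, outgoing)
```

The only thing shared between the two sides is the header, which is why agreeing on its name and format matters.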
                                         
    
                                         And in OpenTelemetry, we've taken that distributed tracing concept
                                         
                                         and we've extracted it to a lower level concept
                                         
                                         that's just called context propagation. So there's this lower level
                                         
                                         system that all it does is focus on being able to keep that bag of context attached to your
                                         
                                         execution and then serialize it and propagate it along your network requests and deserialize it on
                                         
                                         the other end and so on and so forth. And that's involved making changes to the HTTP spec.
                                         
                                         So we went to the W3C and helped design a standard header
                                         
    
                                         for putting some of this context in.
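The header referred to here is the W3C Trace Context `traceparent` header, whose version `00` layout is `version-traceid-parentid-flags` in lowercase hex. The sketch below formats and parses that layout in Python; the function names are hypothetical, and the parser deliberately handles only the version `00` shape rather than the spec's full forward-compatibility rules.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def format_traceparent(trace_id: str, parent_id: str, sampled: bool = True) -> str:
    """Build a version-00 traceparent value from hex IDs."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"


def parse_traceparent(value: str):
    """Return the header's fields as a dict, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),  # lowest bit is the sampled flag
    }


# Fresh random IDs of the right lengths: 16 bytes and 8 bytes.
header = format_traceparent(secrets.token_hex(16), secrets.token_hex(8))
```

For example, `parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")` yields the trace ID, the parent span ID, and `sampled=True`.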
                                         
                                         And it involved, in every programming language,
                                         
                                         building one of these context propagation
                                         
                                         systems trying to leverage as much of what already existed in that language
                                         
                                         and all of OpenTelemetry is built on top of that. So the most fundamental system in OpenTelemetry is actually the tracing system, because what that does is it takes this context propagation mechanism and, on top of that, allows you to record and keep track of operations.

                                         So you say, I'm in a controller action operation, and then I call out to a database client. And so
                                         
                                         that database client starts like a database client operation. In OpenTelemetry terms,
                                         
                                         we call those spans in a trace. And then all of the logs that might be occurring,
                                         
                                         those are all occurring in the context of those spans, those operations.
                                         
                                         And those spans are all linked together in a graph.
                                         
                                         So every operation knows the ID of the parent operation that called it and can be connected
                                         
                                         to the child operations that it spawned as part of doing its job. And those trace
                                         
                                         IDs, those span IDs, all of that stuff, get propagated to the next service. So they can
                                         
    
                                         continue the graph of saying I started an operation on this other computer. But my parent operation
                                         
                                         was this client HTTP call on this other system.
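The parent/child linkage described here can be modeled with very little code. The following Python sketch uses a hypothetical `Span` record and `start_span` helper (not the OpenTelemetry API) to show how every span shares one trace ID and points at the span that caused it, including across a service boundary.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    name: str
    trace_id: str             # shared by every span in the transaction
    span_id: str              # unique to this operation
    parent_id: Optional[str]  # None only for the root span


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Start a root span, or a child that inherits the parent's trace ID."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(name, trace_id, secrets.token_hex(8), parent_id)


def path_to_root(span: Span, spans_by_id: Dict[str, Span]) -> List[str]:
    """Walk parent links from a span back up to the root operation."""
    names = [span.name]
    while span.parent_id is not None:
        span = spans_by_id[span.parent_id]
        names.append(span.name)
    return names


root = start_span("POST /checkout")
client_call = start_span("HTTP client call", parent=root)
# On the other service, the parent is the propagated client span.
remote = start_span("payment-service handler", parent=client_call)
db = start_span("db query", parent=remote)

spans_by_id = {s.span_id: s for s in (root, client_call, remote, db)}
```

Calling `path_to_root(db, spans_by_id)` walks from the database query on the remote service back to the original checkout request, which is the causal chain the trace records.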
                                         
                                         So if people have worked with a tracing system before, that's just kind of the fundamentals

                                         of how distributed tracing works. But instead of saying distributed tracing was this
                                         
                                         third system, like running off in a corner on its own, we're saying, no, no, no, that's like the fundamental context
                                         
                                         that everything executing needs to happen in.
                                         
                                         And then what you'd call your logging system
                                         
                                         is able to access that context.
                                         
    
                                         So all of your logs can get that trace ID
                                         
                                         and that span ID.
                                         
                                         In other words, the transaction they're part of,
                                         
                                         the operation they're part of.
                                         
                                         So then when you store them, you have these indexes. Once you've got that trace ID, if you have one log, like an exception or an error from a backend system, and you're like, well, show me all the logs in the entire transaction, from the client all the way to this backend, to any other service that had anything to do with this transaction, boom, they're all indexed by that trace ID. And so instantly you can see all of them. You don't have to do any filtering or grepping about to make that happen.
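To make the "indexed by trace ID" point concrete, here is a small Python sketch. The log records, host names, and IDs are invented sample data, and a real backend would use a proper index rather than an in-memory dictionary.

```python
from collections import defaultdict

# Invented log records from several hosts; each carries the trace and
# span it was emitted in.
logs = [
    {"host": "frontend-1", "trace_id": "t1", "span_id": "s1", "msg": "checkout clicked"},
    {"host": "cart-7", "trace_id": "t2", "span_id": "s9", "msg": "cart loaded"},
    {"host": "payments-3", "trace_id": "t1", "span_id": "s2", "msg": "charging card"},
    {"host": "payments-3", "trace_id": "t1", "span_id": "s2", "msg": "card declined"},
]

# Build the index once, at storage time.
index = defaultdict(list)
for record in logs:
    index[record["trace_id"]].append(record)


def logs_for_same_transaction(record: dict) -> list:
    """Given any one log line, return every message in its transaction."""
    return [r["msg"] for r in index[record["trace_id"]]]


error = logs[3]  # the "card declined" line from the backend
related = logs_for_same_transaction(error)
```

Starting from the single backend error, `related` contains the front-end and payment messages from the same transaction and nothing from the unrelated cart request, with no grepping across hosts.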
                                         
                                         So I think that right there maybe shows some of the fundamental difference between having these as totally separate systems versus having one coherent graph of data.

                                         And the metrics get involved in that graph too. But I think just talking about how tracing and logs are actually kind of one and the same system is a good starting point to see how something like OpenTelemetry is a bit different from traditional observability.
                                         
                                         Well, yeah, absolutely. Like I had no idea you went all the way to like HTTP to come up with
                                         
                                         the standard header. That means serious business, because I'm sure that would have taken a lot of
                                         
                                         time. I just looked it up. And it looks like a fairly recent draft, like November 23rd of last year. So you go all the way to HTTP and
                                         
                                         that's how you can ensure interoperability because now it's a standard.

                                         Exactly. Exactly. This stuff takes time. There were prior de facto standards. I should call out Zipkin, which

                                         was a very popular open source

                                         distributed tracing tool. And they had a set of headers called B3, like the B3 Zipkin headers.
                                         
    
                                         So those were pretty common, and they work pretty well. But, you know,

                                         it's a step up in standardization to actually put it into the HTTP spec. So that's a good... And that's sort of what we're saying when it comes to...
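To make the standardized header concrete, here is a small Python sketch that parses a `traceparent` value in the version-00 shape defined by W3C Trace Context (version, trace ID, parent span ID, flags) and builds the header for an outgoing call. The function names are illustrative, not part of any OpenTelemetry SDK, and the sketch ignores the companion `tracestate` header.

```python
import re
import secrets

# Version-00 traceparent: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def parse_traceparent(value):
    """Return the header fields as a dict, or None if the value is invalid."""
    match = TRACEPARENT.match(value.strip())
    if match is None:
        return None
    fields = match.groupdict()
    if fields["version"] == "ff":
        return None  # version ff is reserved as invalid
    if set(fields["trace_id"]) == {"0"} or set(fields["parent_id"]) == {"0"}:
        return None  # all-zero IDs are invalid
    fields["sampled"] = bool(int(fields["flags"], 16) & 0x01)
    return fields


def child_traceparent(incoming):
    """Build the header for an outgoing request: same trace ID, new span ID."""
    parent = parse_traceparent(incoming)
    if parent is None:
        raise ValueError("invalid traceparent header")
    span_id = secrets.token_hex(8)  # 16 hex characters
    return f"00-{parent['trace_id']}-{span_id}-{parent['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(incoming)
outgoing = child_traceparent(incoming)
print(parsed["trace_id"], parsed["sampled"])
print(outgoing)
```

Because every hop forwards the same trace ID, any system that understands the header can attach its own spans and logs to the same transaction.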
                                         
                                         When you're talking about these distributed systems
                                         
                                         and wanting to connect all this information up into a graph,
                                         
                                         modern distributed systems are not all owned and run
                                         
                                         by the same operator, right?
                                         
                                         You have different teams
                                         
                                         and different operational teams
                                         
    
                                         and service owners within an organization,
                                         
                                         but then those organizations are potentially
                                         
                                         contracting a lot of software as a service.
                                         
                                         In other words, software services
                                         
                                         that other organizations are running, like
                                         
                                         databases and things that the cloud providers are providing, or that some other third-party

                                         message queue provider is providing for you. And if you have a standard data format like OpenTelemetry for describing logs and traces and propagating

                                         these indexes and identifiers, now it becomes possible for those third-party providers

                                         to send you the rest of your trace that you could never access before, right? Because that was being stored

                                         in some third party's systems, where they have their logs and their traces. But

                                         if you want to know some nitty-gritty details about how your database query or your usage of a message queue was causing latency or problems, you

                                         might be able to discern some of that data just from the clients that you're using attached

                                         to it.
                                         
                                         But there's even more data that you could get if you could actually get operations and events and metrics out of their system,

                                         limited to just the portion of their resources that you as an organization are using and

                                         not anybody else.
                                         
    
                                         But if you have a standard, now there's a way for them to be like, well, we've done
                                         
                                         the work to add that instrumentation and we will emit it as an OpenTelemetry firehose.

                                         And so you can ingest that as well as the stuff coming out of your own applications and services.

                                         And now you have an even deeper trace of your overall system because it's including these third-party systems as part of it.
                                         
                                         It's kind of like if every system spoke their own language and didn't speak in HTTP,
                                         
                                         you'd be reinventing that for each system. And I'm sure there definitely are RPC systems and all of that. But basically, for most systems in the world, you can probably just communicate with

                                         them with HTTP and you hear back just fine. And this is

                                         taking it a step further to understand your systems: no matter where they are or who runs them,

                                         you can probably understand them and say, okay, this operation in this third-party vendor is taking

                                         time, and that's why my requests are slow. I think I finally now get the vision, and it makes
                                         
                                         a ton of sense to me.
                                         
                                         It's also much more ambitious than what I originally thought it was.
                                         
                                         There's this idea you brought up around the braid of observability.
                                         
                                         Initially, there's these three pillars.
                                         
                                         There's the part known as metrics, logging and tracing.
                                         
                                         They're often thought of as separate things and they really shouldn't be. And we spoke about how traces and logs can easily be correlated
                                         
    
                                         and should really just be the same thing.
                                         
                                         How do metrics play into that?
                                         
                                         Yeah, that's a great question.
                                         
                                         So there are two really practical ways

                                         that metrics play into that.
                                         
                                         One is just maybe a fundamental concept,
                                         
                                         which is that metrics are just aggregates of events.
                                         
                                         So you have events that happen, like an operation occurs, like an HTTP request, and you want to know things about that particular HTTP request and how it fits in to an overall transaction.
                                         
    
                                         But you might want to know things in aggregate about that HTTP request.
                                         
                                         How long did it take?
                                         
                                         Not just an individual request, but all the requests like that.
                                         
                                         What is the spread of latency?
                                         
                                         You might want to count the number of 500 status codes per minute or something like that.
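A minimal Python sketch of "metrics are just aggregates of events": the event list, its field names, and the numbers are invented for illustration. The same raw request events answer both aggregate questions from the discussion, the count of 500s per minute and the spread of latency.

```python
from collections import Counter
import statistics

# Hypothetical HTTP request events; one dict per request.
EVENTS = [
    {"minute": 0, "status": 200, "duration_ms": 12.0},
    {"minute": 0, "status": 500, "duration_ms": 250.0},
    {"minute": 0, "status": 200, "duration_ms": 15.0},
    {"minute": 1, "status": 500, "duration_ms": 300.0},
    {"minute": 1, "status": 500, "duration_ms": 310.0},
    {"minute": 1, "status": 200, "duration_ms": 11.0},
]


def count_status_per_minute(events, status):
    """Count events with the given status code, grouped by minute."""
    return dict(Counter(e["minute"] for e in events if e["status"] == status))


def latency_spread(events):
    """Summarize request duration across all events."""
    durations = [e["duration_ms"] for e in events]
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


errors_per_minute = count_status_per_minute(EVENTS, 500)
spread = latency_spread(EVENTS)
print(errors_per_minute)
print(spread)
```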
                                         
                                         And one way to do that is to have a metrics instrumentation API where you create counts
                                         
                                         and gauges and histograms and things like that.
                                         
                                         And you embed that directly into your code.
                                         
    
                                         And then you get counts and histograms and gauges,
                                         
                                         very old school and traditional. But another way to do that is if the data, the event data
                                         
                                         that's coming out of your system is very regularized and well structured. In other words,
                                         
                                         it's not like an unstructured string blob that you have to parse and hunt around for content in.

                                         If it's all very regularized key-value pairs that have standard keys and standard value types,
                                         
                                         then it becomes much more feasible to create a lot of your metrics on the fly. So you could embed metrics API calls in your HTTP client
                                         
                                         to do things like count status codes and stuff like that.
                                         
                                         But if you're also emitting a span for that HTTP client request,
                                         
    
                                         you could farther down your pipeline,
                                         
                                         like let's say in the collector component of
                                         
                                         OpenTelemetry or in your backend, just anywhere farther down the line, you could just dynamically
                                         
                                         generate those metrics based off of that span being emitted. So that's, I would say, like one fundamental thing for people to think about. And
                                         
                                         why that's actually important is, if every time you want to change what metrics you're collecting
                                         
                                         about your system, you have to go into your code and make a code change and then redeploy your application, that's a bummer, right?
                                         
                                         That means bothering a developer who has the capacity
                                         
                                         to make that particular code change.
                                         
    
                                         That means recompiling and redeploying an application
                                         
                                         and doing stuff like that.
                                         
                                         That's kind of like a long path that has quite a large number of side
                                         
                                         effects compared to an operator, a system operator, wanting to get additional metrics
                                         
                                         and just changing the configuration of something in their telemetry pipeline to start emitting
                                         
                                         those metrics there
                                         
                                         and not touching the application services at all, like never restarting them. They don't even know
                                         
                                         that you're generating new metrics. You're doing this all farther down the pipeline.
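The idea of deriving metrics downstream can be sketched in Python as a small processor that sits in the pipeline, watches spans go by, and keeps a status-code counter and a latency histogram. This is an illustrative stand-in, not the actual Collector implementation; the class name, span shape, and bucket bounds are assumptions made for the example.

```python
import bisect
from collections import Counter


class SpanMetricsProcessor:
    """Derives metrics from finished spans as they flow past (hypothetical)."""

    def __init__(self, bucket_bounds_ms):
        self.bounds = sorted(bucket_bounds_ms)
        self.status_counts = Counter()
        # One bucket per upper bound, plus a final overflow bucket.
        self.histogram = [0] * (len(self.bounds) + 1)

    def on_span(self, span):
        status = span["attributes"].get("http.status_code")
        if status is not None:
            self.status_counts[status] += 1
            # bisect_left places a duration equal to a bound into that bound's bucket.
            bucket = bisect.bisect_left(self.bounds, span["duration_ms"])
            self.histogram[bucket] += 1
        return span  # the span itself is forwarded unchanged


SPANS = [
    {"name": "HTTP GET", "duration_ms": 8, "attributes": {"http.status_code": 200}},
    {"name": "HTTP GET", "duration_ms": 120, "attributes": {"http.status_code": 200}},
    {"name": "HTTP GET", "duration_ms": 45, "attributes": {"http.status_code": 500}},
    {"name": "HTTP GET", "duration_ms": 2500, "attributes": {"http.status_code": 500}},
]

processor = SpanMetricsProcessor([10, 100, 1000])
forwarded = [processor.on_span(s) for s in SPANS]
print(dict(processor.status_counts))
print(processor.histogram)
```

The application that emitted these spans contains no metrics calls at all; the counters exist only in the pipeline component.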
                                         
    
                                         So that's one fundamental way that metrics are tied in as part of the braid of data with traces and logs,
                                         
                                         which is perhaps you should start switching to generating more of your metrics
                                         
                                         dynamically from your traces and logs.
                                         
                                         And with a regularized, highly structured system like OpenTelemetry, that's a lot more feasible than
                                         
                                         if you weren't really running tracing or you're running tracing, but it was very heavily sampled
                                         
                                         up front, or your logs had this information in it, but it was not consistently structured.
                                         
                                         All the different things emitting logs emit an HTTP request a little bit
                                         
                                         differently. It's hard to, it's expensive to parse that stuff, et cetera, et cetera. So OpenTelemetry
                                         
    
                                         makes dynamic metrics creation a lot more feasible. It does have a metrics API, to be clear. So that's
                                         
                                         also there.

                                         Okay. Yeah. I think I need to understand this a little

                                         better. So let's say you have, again, like the web request example, right? You have a request

                                         that starts at a front end. It maybe makes a request to one underlying service. It comes

                                         back, and then the front end does some computation and returns that data to a user.

                                         So what you're saying is the fact that there is a trace that captures that

                                         also enables me to generate a metric, and maybe generate more metrics. For example, I

                                         can automatically track things like HTTP status codes from the underlying service if I want

                                         to, or for the front end if I want. I guess I didn't fully grasp

                                         how I can dynamically generate more metrics given this is how my

                                         trace looks, or this is what my request goes through.

                                         Well, just that when you're talking about a metric, fundamentally what you're talking about is an event that happens in your system that you want to look at in aggregate, right?
                                         
                                         You want to look at it in aggregate.
                                         
                                         You want it scoped along a certain number of dimensions. The way you want to look at it might be counting something,

                                         or summing it, or putting it into histogram buckets, or making a gauge. But a lot of that

                                         information that you're looking at isn't a sampling of continuous information. You have some

                                         stuff, like RAM or CPU, where you do kind of need some kind of probe in there that's taking a sample.
                                         
    
                                         But a lot of what we're making metrics about, especially in the context of our transactions, are things like HTTP requests or database requests or exceptions occurring, things of that nature.
                                         
                                         So in all of these cases, there is something in your tracing system
                                         
                                         and your logging system describing that specific event occurring.
                                         
                                         And you could, right next to the place that you're recording that specific event in OpenTelemetry,
                                         
                                         also add right there, using the metrics API, something that counts essentially the same information
                                         
                                         or otherwise emits a metric about that event.
                                         
                                         But you could also do that farther down your pipeline. Like, if you're trying to count status codes, right, how many

                                         500s, how many 403s, how many 200s, et cetera, in your system,

                                         based on some set of dimensions, which API endpoint you're talking about, or what
                                         
                                         route you're talking about, et cetera, et cetera.
                                         
                                         If you're already emitting all of that information about those HTTP events happening, there's
                                         
                                         no need necessarily to bake all of that metrics gathering into your code. You could instead
                                         
                                         create a trace processor or an event processor, essentially, later on down the pipeline. This is
                                         
                                         one of the things that the Collector is very good at. It takes in all of your data and you can write

                                         these processing pipelines to do things like transform the data or scrub

                                         sensitive information out of it. But you can also use it as a place to generate more data.

                                         And one particularly useful thing you can do there is generate metrics out of your events. And

                                         there isn't one canonical good set of dimensions to capture a particular metric.

                                         There is what you might think of as a default dashboard you might want to set up

                                         for particular services and particular libraries that you're using.

                                         There may be a default dashboard that captures some reasonable information about that.
                                         
                                         As time goes on and your systems get bigger and you understand them more
                                         
                                         and the problems you're trying to solve with them become more specific,
                                         
                                         it's hard to predict what metrics you really want in the future and what dimensions you want those metrics

                                         recorded by. So there's the ability to dynamically create more metrics on the fly as an operator,

                                         or as the analyst looking at that data and being like, dang, I really want to have this additional metric, or I want to change the dimensions that I'm recording this particular event across. Just going to your telemetry pipeline, making configuration changes to your collectors,

                                         and then restarting your collectors, rather than having to make code changes to your applications

                                         and restarting your applications, gives operators and people who are farther down the

                                         line, as far as caring about the telemetry being emitted and the dashboards

                                         being set up and all of that, the freedom to start generating the metrics they want

                                         without having to do these application restarts or bother the

                                         specific developer who would need to do that because that's their particular part of the code base or something like that.
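One way to picture "changing the configuration instead of the code" is a rule list that a pipeline component reads at startup. The Python sketch below is a simplified stand-in: the rule format, the `generate_metrics` function, and the sample spans are all hypothetical and do not reflect the real Collector configuration syntax.

```python
from collections import Counter


def generate_metrics(spans, rules):
    """Apply each rule to each span; a rule counts matching spans per dimension tuple."""
    metrics = {rule["name"]: Counter() for rule in rules}
    for span in spans:
        attrs = span["attributes"]
        for rule in rules:
            wanted = rule.get("match", {})
            if all(attrs.get(key) == value for key, value in wanted.items()):
                dimensions = tuple(attrs.get(d) for d in rule["dimensions"])
                metrics[rule["name"]][dimensions] += 1
    return metrics


SPANS = [
    {"attributes": {"http.route": "/cart", "http.status_code": 200, "region": "eu"}},
    {"attributes": {"http.route": "/cart", "http.status_code": 500, "region": "us"}},
    {"attributes": {"http.route": "/pay", "http.status_code": 500, "region": "us"}},
    {"attributes": {"http.route": "/pay", "http.status_code": 500, "region": "eu"}},
]

# Original pipeline configuration: one metric, one dimension.
RULES_V1 = [{"name": "requests_by_status", "dimensions": ["http.status_code"]}]

# Later the operator wants errors broken down by route and region.
# Only this rule list changes; the applications emitting spans do not.
RULES_V2 = RULES_V1 + [
    {
        "name": "errors_by_route_region",
        "match": {"http.status_code": 500},
        "dimensions": ["http.route", "region"],
    }
]

before = generate_metrics(SPANS, RULES_V1)
after = generate_metrics(SPANS, RULES_V2)
print(dict(before["requests_by_status"]))
print(dict(after["errors_by_route_region"]))
```

The new breakdown is possible only because the spans already carried the route and region attributes; the rule just chooses which ones to aggregate by.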
                                         
    
                                         Okay. And the collector is a daemon, right? It's not like a server-side component. It's
                                         
                                         actually a client-side component. So it can do those transformations

                                         pretty efficiently.

                                         Yeah. Yeah. You can run the collector in a variety of pipeline roles. So one common place to run it is something like what's often called an agent. Basically, you can run it on the same machine, same virtual machine, or as a sidecar if you're running Kubernetes. So it's local, on a local network connection to your application.
                                         
                                         The advantage of running it there is it can collect a lot of additional data
                                         
                                         without the application having to do that.
                                         
                                         So that's a good place to configure the collector to collect things like CPU and memory and stuff like that.
                                         
                                         It can also collect additional information about the environment that the application might not be
                                         
                                         collecting about, you know, the Kubernetes environment or the cloud environment or,
                                         
    
                                         you know, just something about the resources being consumed by that particular
                                         
                                         application. And it can decorate all the data coming in with those additional resource attributes.
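Decorating telemetry with resource attributes can be sketched in a few lines of Python. The attribute keys below follow OpenTelemetry's resource naming style (`host.name`, `k8s.pod.name`, `cloud.region`, `service.name`), but the values, the record shape, and the `decorate` function are assumptions for illustration. The sketch also assumes that attributes the application already set take precedence over what the agent adds.

```python
def decorate(records, resource):
    """Return copies of the records with the agent's resource attributes merged in."""
    decorated = []
    for record in records:
        # Attributes already on the record win over the agent's values.
        merged = {**resource, **record.get("resource", {})}
        decorated.append({**record, "resource": merged})
    return decorated


# What a local agent might know about its environment (made-up values).
AGENT_RESOURCE = {
    "host.name": "node-7",
    "k8s.pod.name": "checkout-5d9f",
    "cloud.region": "eu-west-1",
}

RECORDS = [
    {"name": "GET /cart", "resource": {"service.name": "checkout"}},
    {"name": "cpu.utilization", "value": 0.42},
]

out = decorate(RECORDS, AGENT_RESOURCE)
for record in out:
    print(record["name"], record["resource"])
```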
                                         
                                         So there are some good reasons for running it locally. The disadvantage of running it locally,
                                         
                                         of course, is that it's consuming the same resources as your application.
                                         
                                         So it's also feasible to run collectors on their own boxes.
                                         
                                         It's feasible to run collectors in a pool behind a load balancer. So what people often end up doing is having this sort of tiered pipeline
                                         
                                         where they have an application. That application is talking to a local collector.
                                         
                                         That local collector is doing a very minimal amount of work. It's maybe sampling machine metrics like CPU and RAM, and it's storing all of the telemetry data,
                                         
    
                                         basically acting as a buffer between the application and the rest of the telemetry
                                         
                                         pipeline. And because it's on a local network connection with the application, that means you can configure your application to not really buffer that telemetry data.
                                         
                                         And that's really helpful because that means if the application suddenly terminates, you're not losing a large batch of the telemetry data that you probably care most about. If the network back pressure on your telemetry system

                                         is reaching all the way into your application, then yeah, you start to run that risk.
                                         
                                         And so by then moving that to a sidecar or a local collector, then the collector can act
                                         
                                         like a better buffer to handle any back pressure that
                                         
                                         might be happening in your telemetry system. The reason then to run these collector pools
                                         
                                         farther down the line is if you're wanting to do more and more processing on your telemetry data
                                         
    
                                         that doesn't need to be done locally, that means you could be doing it later
                                         
                                         in kind of a pool that's collecting data
                                         
                                         from many separate application sources.
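The buffering role of the local collector in this tiered setup can be sketched as follows. This is a toy Python model, not Collector code: the `LocalBuffer` class, its drop-oldest policy when full, and the `export` callback are all assumptions chosen to keep the example short. The application hands data off immediately, and the buffer absorbs a downstream outage until the next tier accepts batches again.

```python
from collections import deque
from itertools import islice


class LocalBuffer:
    """Toy local collector: accepts telemetry immediately, forwards it in batches."""

    def __init__(self, capacity, batch_size):
        # Assumed policy: when full, the oldest item is dropped to make room.
        self.queue = deque(maxlen=capacity)
        self.batch_size = batch_size

    def receive(self, item):
        """Called by the application; never blocks on the downstream pipeline."""
        self.queue.append(item)

    def flush(self, export):
        """Send batches until the queue is empty or export reports failure."""
        sent = 0
        while self.queue:
            batch = list(islice(self.queue, self.batch_size))
            if not export(batch):
                break  # downstream is pushing back; keep the data for later
            for _ in batch:
                self.queue.popleft()
            sent += len(batch)
        return sent


received = []


def downstream_down(batch):
    return False  # simulate the collector pool being unreachable


def downstream_up(batch):
    received.append(batch)
    return True


buffer = LocalBuffer(capacity=100, batch_size=2)
for i in range(5):
    buffer.receive(i)

stalled = buffer.flush(downstream_down)
drained = buffer.flush(downstream_up)
print(stalled, drained, received)
```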
                                         
                                         And playing back to one of the big advantages
                                         
                                         of this decoupling is that I can have something
                                         
                                         like a simple structured log of something that I
                                         
                                         thought was important a year ago, but I just decided to log it because I thought it'd be interesting to

                                         see. But today I think it's extremely important that I have a metric that comes out every time

                                         that log line is hit, essentially, especially when a certain attribute of that structured log

                                         is true or false or something else. And OpenTelemetry just

                                         lets me do that without configuring any client code or changing any client code. Like, letting me

                                         add a new metric or whatever: I can just do that by configuring or tweaking the collector

                                         config to say, when you see this structured log event, generate a metric.

                                         Exactly. Yes, exactly. That's exactly what

                                         you can do. And OpenTelemetry has what we call semantic conventions,

                                         which is kind of a funny term, and it might be better to call it a schema, like a semantic schema.

                                         But Elastic Common Schema is another example of one of these.
                                         
    
                                         But there's a schema to describe all the common operations that machines do.
                                         
                                         So if you're recording an HTTP client request,
                                         
                                         if you're recording a SQL database call,
                                         
                                         all of the common things that a computer program might do, we have a strictly
                                         
                                         defined set of key values that should be emitted to describe that event. So it's not just that you
                                         
                                         can use the collector to, like, say, parse a log line and figure out how to emit a metric.
                                         
                                         You can do that.
                                         
                                         But it's also the fact that that data coming into the collector for many of the things you would want to collect metrics on is already in a very nice, regularized, well-structured data format.
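The contrast between parsing log lines and reading regularized data can be shown with a short Python sketch. Both log formats and both regular expressions are invented for the example; the structured events use `http.status_code` as an illustrative standard key. The two paths produce the same counts, but the text path needs one pattern per log format, while the structured path is a single key lookup.

```python
import re
from collections import Counter

# Two hypothetical services logging the same kind of event in different text shapes.
TEXT_LOGS = [
    "GET /cart -> 500 (31ms)",
    "status=500 method=GET path=/pay",
    "GET /cart -> 200 (12ms)",
]

# One pattern per known format; every new format needs another entry.
PATTERNS = [
    re.compile(r"-> (?P<status>\d{3}) "),
    re.compile(r"status=(?P<status>\d{3})"),
]


def count_from_text(lines):
    """Count status codes by trying each pattern against each line."""
    counts = Counter()
    for line in lines:
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                counts[int(match.group("status"))] += 1
                break
    return counts


# The same events as key-value data with a shared, agreed-upon key.
STRUCTURED = [
    {"http.status_code": 500, "http.route": "/cart"},
    {"http.status_code": 500, "http.route": "/pay"},
    {"http.status_code": 200, "http.route": "/cart"},
]


def count_from_structured(events):
    """Count status codes by reading the standard key directly."""
    return Counter(event["http.status_code"] for event in events)


text_counts = count_from_text(TEXT_LOGS)
structured_counts = count_from_structured(STRUCTURED)
print(dict(text_counts))
print(dict(structured_counts))
```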
                                         
    
                                         So it's much more efficient to be generating metrics off of that kind of data.
                                         
                                         And it's also much more reliable, right?
                                         
                                         Because you can depend on what that data is going to look like.
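
                                         As a rough illustration of why regular, well-known keys make metric generation easier, here is a minimal Python sketch. It is not the collector's actual implementation: the event dictionaries, the attribute keys (`http.method`, `http.status_code`), and the function name are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical structured events. The keys are illustrative stand-ins for
# semantic-convention attributes; real instrumentation may use other names.
events = [
    {"http.method": "GET", "http.status_code": 200},
    {"http.method": "GET", "http.status_code": 200},
    {"http.method": "POST", "http.status_code": 500},
]


def count_requests_by_status(structured_events):
    """Derive a request-count metric keyed by (method, status code)."""
    counts = Counter()
    for event in structured_events:
        # No log-line parsing is needed: every event is expected to carry
        # the same keys with the same meaning.
        key = (event["http.method"], event["http.status_code"])
        counts[key] += 1
    return counts


request_counts = count_requests_by_status(events)
```

                                         Because every event carries the same keys, the aggregation needs no per-source parsing rules, which is the reliability point made above.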
                                         
                                         In fact, we even have schema versioning. So every instrumentation source indicates which version of the schema it's adhering to.
                                         
                                         So you can even do schema translations.
                                         
                                         That's one of the ways we handle backwards compatibility.
                                         
                                         If we figure out additional attributes we want to emit, or change

                                         the way data is split up in something we're reporting, all of those changes, if they come to

                                         stable instrumentation, would have to be released along with a schema processor in the collector.
                                         
                                         So you can build your pipeline tools to expect data to be in a particular format.
                                         
                                         And if it's not in that precise format, if it's in a different format,
                                         
                                         then the schema processor just gets run to convert it to the format you want
                                         
                                         so you're not breaking your dashboards just because you updated your instrumentation

                                         to a new version.

                                         That's interesting. I'm sure versioning must have been a pain to design and

                                         roll out. It seems like a tricky problem, with schema transformations and stuff. I've had to do

                                         a similar data modeling problem at work, and we were just like, for now, let's skip versioning,

                                         because there are so many implications that you have to think about, like what if there are two

                                         conflicting versions? But when you're building a standard and there are so many different systems

                                         that need to interact together, I can see why you'd have to go all the way to build this.
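
                                         The schema translation step described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the collector's schema processor: the version numbers, the rename table, and the function names are all hypothetical.

```python
# Hypothetical rename tables, keyed by the schema version that introduced
# each change. Real schema files are richer than a flat rename map.
RENAMES_BY_VERSION = {
    (1, 1, 0): {"http.method": "http.request.method"},
    (1, 2, 0): {"http.status_code": "http.response.status_code"},
}


def parse_version(text):
    """Turn '1.2.0' into (1, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def translate(attributes, from_version, to_version):
    """Upgrade attributes by applying every rename in (from, to], in order."""
    start, target = parse_version(from_version), parse_version(to_version)
    result = dict(attributes)  # leave the caller's data untouched
    for version in sorted(RENAMES_BY_VERSION):
        if start < version <= target:
            for old_key, new_key in RENAMES_BY_VERSION[version].items():
                if old_key in result:
                    result[new_key] = result.pop(old_key)
    return result


old_event = {"http.method": "GET", "http.status_code": 200}
upgraded = translate(old_event, "1.0.0", "1.2.0")
```

                                         A pipeline that expects the newer keys can run a step like this on incoming data that declares an older schema version, so downstream dashboards keep seeing one shape.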
                                         
                                         Yeah.
                                         
                                         I would say a lot of what we're doing

                                         really are best practices. Maybe not so much

                                         for application code, but if someone is creating a shared library,

                                         something that is going to run in many different applications in many different

                                         environments, especially if it's a cross-cutting concern like telemetry, it's worth it to care about things like backwards compatibility and upgrade

                                         paths and transitive dependency conflicts, where the dependencies that my thing depends
                                         
                                         on may conflict with the dependencies that other libraries depend on. If you think about those things up front,
                                         
                                         right at the beginning when you're designing your stuff,

                                         you have a much better chance of coming up with a system

                                         that you can adhere to as time goes on to maintain those qualities,

                                         and tests you can do to ensure you're maintaining those qualities.
                                         
                                         It's much harder, in my opinion and experience,

                                         to add those qualities

                                         to a system later,

                                         where you didn't think about it at the beginning and bake it into the

                                         design and architecture of the system.
                                         
                                         Not impossible,

                                         but

                                         it is worthwhile to think through the different ways you're going to want to

                                         mutate and update and improve the library that you're offering, and just kind of figure out

                                         what is a good way to lay out those packages, where the right places are to introduce

                                         loose coupling, things of that nature, to ensure that you're going to be able to say, once this particular piece is stable,
                                         
                                         that it will remain stable forever.
                                         
                                         And that might limit the kinds of backwards compatible changes you can make there.
                                         
                                         But you also want a way to introduce new experimental components

                                         in such a way that they aren't destabilizing the stable components, which in many languages

                                         comes down to how you lay out your packages, for example. Coming up with a plan

                                         like that does require some work up front to figure

                                         out, but if you do that work up front, then implementing it becomes smooth. So

                                         that would be a best practice. If I were to talk more in the future about OpenTelemetry as an

                                         open source project, and maybe some of the practices

                                         we do that other open source projects that are kicking off would benefit from, I would say

                                         looking at how we handle versioning and backwards compatibility is a place where I'm really

                                         proud of the project.

                                         And as you mentioned, these things take time, right?
                                         
    
                                         You go through these problems once and then new requirements come in. It takes time for the industry to be like, yes, this is important. We need to make sure we have this. We've had this
                                         
                                         problem before. And one design iteration goes on. Maybe one project doesn't get it quite right,
                                         
                                         but the next project has all those

                                         learnings, and then the industry is like, yes, we can converge on this new standard or this new project

                                         because it ticks most of the boxes we need.

                                         Yeah. And so we have a development process that

                                         involves RFCs. We call them OTEPs, OpenTelemetry Enhancement Proposals, but they're

                                         very similar to, say, the RFC process from the IETF. We tend to require that

                                         OTEPs come with prototypes: here is a change that I'm proposing to make.
                                         
    
                                         Here is an implementation of that change
                                         
                                         in two or three different languages
                                         
                                         if it's a client-level change.
                                         
                                         We're really trying to get a lot of that design work done up front.

                                         Basically, we don't want to have surprises show up after something

                                         is added to the spec. If you care about backwards compatibility in a strict sense,

                                         then things are very sticky once they've gone into the spec. It's hard to pull them out.
                                         
                                         If they're going into an experimental part of the spec, then obviously later we can say, like, whoops, we're making a breaking change to this part.
                                         
    
                                         But even there, we actually do our best to try to avoid thrash, if for no other reason than

                                         that it just kind of dumps extra work on the different language maintainers. We're conscious
                                         
                                         about the fact that if we make a client change to OpenTelemetry, if we change an API or add an API or change how the client implementations work, then that's work that's going to then get repeated across like 11 different languages.
                                         
                                         So it's expensive to be like, build it this way.
                                         
                                         No, no, second thought, build it this other way.
                                         
                                         So for all those reasons, we've kind of developed a longer specification process
                                         
                                         that kind of involves doing more design and review work
                                         
                                         upfront than a lot of people are used to.
                                         
    
                                         I think many people, including myself, are more used to

                                         an approach where you come up with a good idea, or what maybe sounds like a good idea,

                                         write some code that seems like it probably implements that idea, throw it over the wall,

                                         run it in production, and see what happens. And there are definitely some advantages to doing that.

                                         Not every piece of code that gets written has the requirements something like

                                         OpenTelemetry has.
                                         
                                         But I do think for projects that are some other equivalent of OpenTelemetry, like this is a big shared library that's going to get embedded in lots of different important applications, or this is a platform that everything is running on, or yada yada.
                                         
                                         Something that's code that's really going to get exercised in a lot of different environments, that lots of people are going to care about. I think it's worthwhile for projects like that to come up with a more structured approach

                                         to how they think about change.

                                         Yeah, as you mentioned, it depends on

                                         the life cycle of the project, who's using it, how many people there are. I can't even imagine if there's a security vulnerability in something

                                         like OpenTelemetry, like a remote code execution thing. I don't know

                                         if you're familiar with the Log4j stuff that happened a few months ago. You have to

                                         be careful because there is a lot of impact from

                                         this, especially when you're a library that other applications depend on, right?
                                         
                                         You don't want to be of supply chain issues
                                         
                                         that are inherent in open source development.
                                         
    
                                         Not just open source development,
                                         
                                         but any form of development that involves leveraging code
                                         
                                         that you did not write, that is not of your provenance.
                                         
                                         It's really a conundrum. I honestly don't have a great answer for it, because through the whole history of software development, this has

                                         been one of the big lauded examples, right? That we don't all need to recreate everything from scratch.
                                         
                                         We can build libraries that do something useful, and then we can all depend on those libraries.
                                         
                                         And the fact that we can reuse all of this code is this huge advantage that most people think of when they think of software development, right?

                                         Code reuse and leveraging other people's code is a feature, not a bug. But it's definitely at odds

                                         with a concept of strict security, right? So there is a fundamental mismatch there that is really unfortunate.
                                         
                                         And it's interesting to see how long we were able to fly without it becoming truly a widespread problem. It's a thing that's always been a bit of a problem,
                                         
                                         but maybe was more restricted in the past
                                         
                                         to things like state-sponsored actors
                                         
                                         targeting other states,
                                         
                                         very sensitive stuff,
                                         
                                         and then those sensitive things adhering to different stricter software patterns,
                                         
                                         presumably, in order to counter that.
                                         
    
                                         But now it's kind of all mushed together,
                                         
                                         where everything is intermediated by networked computers.
                                         
                                         Everything's a computer program.
                                         
                                         Code goes everywhere.
                                         
                                         And all of that code everywhere leans very heavily
                                         
                                         on a lot of publicly available open source code.
                                         
                                         And we try to solve some of that for OpenTelemetry,

                                         like scanning our dependencies and trying to make sure that we aren't sticking around as a vector for a supply chain attack. And we try to think hard about what dependencies we are taking on and where.
                                         
    
                                         But the other thing is ensuring that it's possible
                                         
                                         for our end users to stay up to date.
                                         
                                         That's actually another form of,

                                         I don't know if I would call it forwards compatibility,

                                         but yes. A way I often get stuck

                                         with things like frameworks, web frameworks I've used in the past, is that

                                         in order to get some security patch or something, I need to upgrade to a new version.
                                         
                                         But that new version changed things like the plugin interfaces for some plugins I use.
                                         
    
                                         And the plugins I use haven't been updated to use those new interfaces.
                                         
                                         So now I'm in a jam, right?
                                         
                                         Where there's something that I really want, maybe related to security, that I can get by rolling forwards to the latest version.
                                         
                                         But in order to get there, now suddenly I'm faced with potentially doing a lot of work,
                                         
                                         right? I either have to abandon these plugins, or these plugins are now my code: I have

                                         to go in there and somehow make the upgrade myself to make them work.
                                         
                                         And it might not necessarily be plugin interfaces, it might be any interface that something presents.
                                         
                                         If it creates a breaking change, that means I'm now going to have to do all of that work
                                         
    
                                         before I can get the security patch. And as a side effect of that, people then start to camp on old versions
                                         
                                         of software. And then they start to rightly demand that the people who develop that software
                                         
                                         maintain those older versions and backport security patches and other things. And the maintenance cost of all of that goes up over time. But if you work really hard to ensure that that's not happening,
                                         
                                         that you're avoiding those kinds of situations that would make it more difficult

                                         for your end user to update to the latest version of your client, so that they need to do nothing other than bump the dependency version in their manifest,

                                         or they feel like they can reliably pin their dependency version in their manifest to something that helps them stay up to date with everything short of a major version bump,

                                         and then you never make a major version bump, then you're creating a world where

                                         your end users aren't lagging behind or being scared of updating.

                                         One, hopefully they are staying more up to date and avoiding

                                         a situation where they're hanging around without applying security patches. But

                                         also, it means that when they do need to get up to date, they aren't hitting some wall and thus

                                         being stuck on an old version and then raising the

                                         maintenance burden on the OpenTelemetry project, because now we're like, oh well, we have a

                                         responsibility to maintain all these different ancient versions of this thing because we did

                                         something that made it genuinely difficult for those users to receive security patches and performance boosts.
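
                                         The "everything short of a major version bump" idea can be sketched as a small compatibility check. This is a hedged illustration only: it assumes plain MAJOR.MINOR.PATCH version strings with no pre-release tags, and the function names are hypothetical rather than taken from any package manager.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_safe_upgrade(pinned, candidate):
    """Return True when candidate is not older than pinned and keeps the same major version."""
    pinned_parts = parse_version(pinned)
    candidate_parts = parse_version(candidate)
    same_major = candidate_parts[0] == pinned_parts[0]
    not_older = candidate_parts >= pinned_parts
    return same_major and not_older


minor_bump_ok = is_safe_upgrade("1.4.2", "1.9.0")
major_bump_ok = is_safe_upgrade("1.4.2", "2.0.0")
```

                                         If a library never ships a major version bump, every newer release passes a check like this, which is what lets users take security patches by only changing the version in their manifest.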
                                         
                                         Yeah, this is just such a tricky problem that has been going around

                                         more and more, especially with the NPM ecosystem. But I thought about it a little

                                         bit, and it's not just related to NPM. That might be one of the more egregiously bad ones,

                                         where nobody pins dependencies, but it is a problem across

                                         every programming language, in my opinion. I think maybe in languages like Go, people tend to not use

                                         dependencies as much, and the standard library is really strong, similar to Python, so you're a

                                         little safer there. But I really think the future is some kind of capability-based dependencies, where certain dependencies should not be allowed to do certain things.

                                         Like, left-pad should not be allowed to contact the Internet.

                                         And specifying those kinds of capabilities for your modules may be an answer.

                                         I really believe that is one part of it. And the other part is more and more security tools that we just use on a

                                         day-to-day basis that scan for stuff, scan for anomalies.
                                         
                                         I think that'll become more and more commonplace.
                                         
                                         But instead of talking about this, I have a question for you,
                                         
                                         which I think is like a good wrap up,
                                         
                                         which is what are you excited about with OpenTelemetry like next?
                                         
                                         Like what is the one big thing that you're working on
                                         
                                         or that you see coming up that makes you think,
                                         
    
                                         wow, this is great,
                                         
                                         and this is going to be a really good addition to the project?
                                         
                                         Well, it sounds a little boring,
                                         
                                         but stability is the main thing I'm really interested in.
                                         
                                         We're still laying in the final touches on

                                         logs and metrics, and there's, I think, this delayed process of those things becoming stable in

                                         OpenTelemetry, the instrumentation we provide around those things becoming robust, and then all these different

                                         backends and analysis tools starting to provide features that actually leverage the structured

                                         data that OpenTelemetry provides. One quick example: we didn't talk about metrics exemplars, but that's one way metrics are tied to traces.

                                         Besides dynamically generating metrics later,

                                         OpenTelemetry's data structure allows you to record a sampling of,

                                         say, trace and span IDs

                                         that are associated with a particular metric.
                                         
                                         So in other words,
                                         
                                         you have like a range of metric values that you might be emitting and you have
                                         
                                         like these high values that represent something problematic.
                                         
    
                                         Let's say

                                         you have an alert threshold on some metric,

                                         and then the alert goes off. Your next question is going to be, well, what are these transactions

                                         that are generating these problematic values in this metric, right? That would be the next step

                                         in your investigation. And in OpenTelemetry, there's actually been a sampling of those different

                                         transactions that were emitting those values associated with that metric, which means you can

                                         build, one, workflows that allow you to just click directly through from that metrics dashboard

                                         into your logs and traces that were associated with

                                         those metric events. But it also means that machine analysis,

                                         machine learning and other kinds of statistical tools and automated tools,

                                         have that rich graph of data to perform their analysis on.
                                         
                                         So they're not using really crude heuristics to try to figure out what was related to what in your system.
                                         
                                         Like, well, these things over here happened around the same time as these things over there.
                                         
                                         You actually have a real graph of data
                                         
    
                                         that's connecting all of this together.
                                         
                                         And that means your machine analysis
                                         
                                         can become much more efficient and more accurate
                                         
                                         when it comes to finding correlations, for example,
                                         
                                         between different trends going on in your data.
                                         
                                         Being able to say, wow, when you're seeing this kind of exception over here, that's highly correlated with a small subset of IPs
                                         
                                         somewhere along the line or something along those lines.
                                         
                                         All of this latency is actually correlated highly with this particular Kafka
                                         
    
                                         node.
                                         
                                         Those are the,
                                         
                                         that's,
                                         
                                         I should stress that it's not root cause analysis.
                                         
                                         It's not things telling you what the root cause is, but
                                         
                                         just identifying those correlations is a really time-consuming process when fighting fires and
                                         
                                         triaging problems in your system. Realizing this bad behavior over here correlates with a particular
                                         
                                         configuration option having a particular value.
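
As a rough illustration of this kind of correlation analysis (and, as stressed above, not root cause analysis), here is a minimal sketch that ranks attribute values by how much more often they appear on failing spans than the overall error rate would predict. The span layout, the attribute names such as `kafka.node`, and the `rank_correlations` helper are assumptions made for the example. They are not part of any OpenTelemetry API.

```python
from collections import defaultdict


def rank_correlations(spans):
    """Rank (attribute, value) pairs by lift: error rate within the group
    divided by the overall error rate. A lift well above 1.0 flags a
    correlation worth a look; it does not establish a cause."""
    total = len(spans)
    total_errors = sum(1 for s in spans if s["error"])
    if total == 0 or total_errors == 0:
        return []
    base_rate = total_errors / total

    # (key, value) -> [span count, error count]
    counts = defaultdict(lambda: [0, 0])
    for s in spans:
        for kv in s["attributes"].items():
            counts[kv][0] += 1
            if s["error"]:
                counts[kv][1] += 1

    ranked = [(kv, (errs / n) / base_rate) for kv, (n, errs) in counts.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical spans: every error happens to touch the same Kafka node.
spans = [
    {"attributes": {"kafka.node": "node-1", "region": "us"}, "error": False},
    {"attributes": {"kafka.node": "node-1", "region": "us"}, "error": False},
    {"attributes": {"kafka.node": "node-2", "region": "us"}, "error": False},
    {"attributes": {"kafka.node": "node-2", "region": "us"}, "error": False},
    {"attributes": {"kafka.node": "node-3", "region": "us"}, "error": True},
    {"attributes": {"kafka.node": "node-3", "region": "us"}, "error": True},
]

ranked = rank_correlations(spans)
top_pair, top_lift = ranked[0]
print(top_pair, top_lift)
```

Because every span carries its attributes and an error flag in one record, the analysis is a simple count. It does not have to guess what was related from timestamps alone.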
                                         
    
                                         Those are the kind of things where you spend hours hunting around before you figure out that that might be a thing. If you have data that graph analysis and statistical analysis can just walk and do their thing
                                         
                                         the way they're designed to do,
                                         
                                         then you can really start leveraging that stuff.
                                         
                                         So my hope in the future,
                                         
                                         my big hope is to start seeing
                                         
                                         more and more of these analysis tools
                                         
                                         pick up on these advanced features of OpenTelemetry
                                         
                                         and start offering value to people based on them.
                                         
    
                                         Yeah, this is kind of similar to LightStep's change intelligence,
                                         
                                         but really like platformized for everyone.
                                         
                                         Exactly.
                                         
                                         Change intelligence is like a perfect example
                                         
                                         of this kind of correlation analysis,
                                         
                                         which we do a good job of today,
                                         
                                         but it's still, if the data is not coming out of OpenTelemetry, then we have to rely, we have to fall back onto rougher heuristics.
                                         
                                         So you're still getting these correlations, but the chances that they're false connections start to go up, right? Like, it's harder for those systems to accurately
                                         
     
                                         give you correlations, because past a certain point it just has to guess, because
                                         
                                         it doesn't have the data being handed to it in any other way.
                                         
                                         Yeah, I think that is exciting, right?
                                         
                                         Like, and the second thing you were mentioning, just around these interfaces that maybe let you even analyze the graph of data that you're getting, right? Maybe the system can be smart enough to find a correlation, but just these different human interfaces I'm imagining now, which let you write some kind of queries, or even have a little bit of code which actually lets you analyze all of these different changes. It's actually pretty exciting
                                         
                                         to think about. And now I'm thinking, we do all of this
                                         
                                         production monitoring, but I'm super interested in developer
                                         
                                         environments. Why don't we get this kind of analysis from our
                                         
                                         local machines when Git is acting up or NPM
                                         
                                         is acting too slow.
                                         
    
                                         We really need to have
                                         
                                         telemetry for all the things. Like, I can imagine
                                         
                                         tomorrow, you know, all of our build systems
                                         
                                         are integrated with, like, OpenTelemetry
                                         
                                         and we know
                                         
                                         exactly why something is slowing down.
                                         
                                         Yes,
                                         
                                         that's actually a place people
                                         
    
                                         are definitely using it: their CI/CD
                                         
                                         systems, trying to find latency problems and bottlenecks there.
                                         
                                         Another place I'm really, if we're talking about total geek out stuff,
                                         
                                         something I've been excited about for a long time and kind of been predicting.
                                         
                                         I've done some talks about it in the past,
                                         
                                         but haven't had a chance to implement much.
                                         
                                         But sure enough, some implementations are starting to show up,
                                         
                                         which is if you have this kind of structured data
                                         
    
                                         that OpenTelemetry is producing,
                                         
                                         you can start using that data as input into your tests.
                                         
                                         In other words, starting when you are developing software,
                                         
                                         you use tests, you use unit tests, you use integration tests,
                                         
                                         you use these different kinds of tests to test your software.
                                         
                                         But those tests have nothing to do with the kind of information
                                         
                                         you're getting about your system in production.
                                         
                                         So you could have high-quality tests,
                                         
    
                                         but that doesn't tell you anything about the quality of, say,
                                         
                                         the logs coming out of your system.
                                         
                                         And so there's this real disconnect between the tools we're using
                                         
                                         to query what our system is doing in production
                                         
                                         compared to the tools we use to query and
                                         
                                         verify what our system is doing when we're developing it. And I felt for a while there's
                                         
                                         a lot of room to build novel forms of integration testing that are being done on top of the production telemetry coming out of
                                         
                                         your system. And there's a lot of different advantages that can come with that. And I've
                                         
    
                                         been excited to actually see in this past year, a couple projects get started built on top of
                                         
                                         OpenTelemetry to do exactly that. So I think if you Google around for trace-driven development
                                         
                                         or trace-based testing or OpenTelemetry testing tools,
                                         
                                         there's a couple different projects that are getting started up
                                         
                                         around doing that kind of stuff.
                                         
                                         And I think that's a real rich source of potential future
                                         
                                         because that hard split between how we test and verify our software
                                         
                                         when we're making it versus how we test and verify our software
                                         
    
                                         when we're running it, I've seen that as one of those arbitrary walls
                                         
                                         that should get knocked down at some point.
                                         
                                         Yeah, there's so much to expand on that.
                                         
                                         You can have unit tests for "I expect at least one HTTP call
                                         
                                         to have status 200" that you can check against
                                         
                                         a fake OpenTelemetry collector locally.
                                         
                                         And then in production, you have the exact same test,
                                         
                                         but it's running continuously against
                                         
    
                                         that collector, and it's basically an alarm. So you can have parity between your unit tests and
                                         
                                         even your production environment, in theory. Oh my god, like, it sounds a little out there, but
                                         
                                         could be.
                                         
                                         It sounds wild at first, but it's totally a thing. And I think if you start to think about
                                         
                                         writing your integration tests against this kind of production data, it's possible to develop a
                                         
                                         testing querying language, like an assertion framework, that feels more like the kind of assertions you're making when you're testing,
                                         
                                         but is also written in such a way that it's scalable. In other words, it could be run against a stream of data.
                                         
                                         And once you start having a language like that,
                                         
                                         I think you'll turn around and start to realize that the kind of alerting we do, quote unquote alerting, is actually testing.
                                         
    
                                         It's just very, very crude testing.
                                         
                                         It's like a testing framework where you just have like one assertion you can make.
                                         
                                         Thing passes a threshold with like some tolerance percentage.
                                         
                                         And that's, that's super crude compared to the kind of assertions we make in our
                                         
                                         integration and unit test environment.
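
A minimal sketch of the idea under discussion: one assertion, written once, that can be checked against a finite batch of spans in a local test and also evaluated continuously over a stream, where it behaves like an alert. Everything here is hypothetical. Spans are plain dicts, `http.status_code` is just a dictionary key chosen for the example, and `check_batch` and `watch_stream` are illustrative helpers rather than part of any existing testing tool.

```python
from typing import Callable, Iterable, Iterator

Span = dict
Assertion = Callable[[Span], bool]


def http_ok(span: Span) -> bool:
    """The assertion itself: this span recorded an HTTP 200."""
    return span.get("http.status_code") == 200


def check_batch(spans: Iterable[Span], assertion: Assertion, at_least: int = 1) -> bool:
    """Test-style use: at least `at_least` spans satisfy the assertion."""
    return sum(1 for s in spans if assertion(s)) >= at_least


def watch_stream(
    spans: Iterable[Span], assertion: Assertion, window: int, at_least: int = 1
) -> Iterator[bool]:
    """Alert-style use: evaluate the same check over consecutive fixed-size
    windows of a stream, yielding one pass/fail result per full window."""
    buffer = []
    for span in spans:
        buffer.append(span)
        if len(buffer) == window:
            yield check_batch(buffer, assertion, at_least)
            buffer = []


# In a unit test: spans captured from a local, fake collector.
captured = [{"http.status_code": 200}, {"http.status_code": 500}]
test_passed = check_batch(captured, http_ok)

# In production: the same assertion over a stream; a False would page someone.
stream = [{"http.status_code": code} for code in (200, 500, 500, 500, 503, 200)]
results = list(watch_stream(stream, http_ok, window=2))
print(test_passed, results)
```

The windowed version only ever holds `window` spans in memory, which is what makes it plausible to run against a stream rather than a stored dataset. A threshold alert is then the special case with a single, fixed assertion.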
                                         
                                         So I think there's,
                                         
                                         there's a lot of room to,
                                         
                                         you know,
                                         
    
                                         if you come up with something like that,
                                         
                                         anyone who does that and can figure out a way to,
                                         
                                         to run that thing against production data, you're now handing
                                         
                                         operators a way to start, you know, there's like the classic saying that it's like, when you
                                         
                                         have a problem in production, can you solve it by just writing more tests or running your tests
                                         
                                         again? No. But actually, maybe you can, yeah, if you come up with an assertion language
                                         
                                         you can run against your production data. That's a powerful introspection tool. So I'm
                                         
                                         excited for somebody to build that.
                                         
                                         Yeah, yeah. When you first look at developer
                                         
     
                                         tools, you think, okay, we've made a lot of progress and things are so much better than
                                         
                                         before. But there's so much more to go, there's so many things we haven't thought about, the more you dig into the space. And engineers, as
                                         
                                         you mentioned in your original story, like, you're always hungry for better tooling, and things
                                         
                                         are never efficient enough. And engineers, at least right now, are still super expensive, so, like, every
                                         
                                         tool that can make them more efficient. Yes.
                                         
                                         Probably.
                                         
                                         Absolutely.
                                         
                                         Become a bit.
                                         
    
                                         And it just makes our life easier, too. A lot of the improvements here
                                         
                                         come from, like, time savings,
                                         
                                         but it's also just saving a lot of, like, scut work.
                                         
                                         Yeah.
                                         
                                         Yeah.
                                         
                                         Well, Ted, thank you so much for being a guest.
                                         
                                         I think this was a lot of fun.
                                         
                                         And I hope I can ask you again for like a round two at some point.
                                         
    
                                         Absolutely.
                                         
                                         Happy to come back and talk more about observability anytime.
                                         
                                         Thank you for having me.
                                         
