Software at Scale 47 - OpenTelemetry with Ted Young

Episode Date: May 26, 2022

Ted Young is the Director of Developer Education at Lightstep and a co-founder of the OpenTelemetry project. This episode dives deep into the history of OpenTelemetry, why we need a new telemetry standard, all the work that goes into building generic telemetry processing infrastructure, and the vision for unified logging, metrics, and traces.

Episode Reading List

Instead of highlights, I've attached links to some of our discussion points.

* HTTP Trace Context - new headers to support a standard way to preserve state across HTTP requests.
* OpenTelemetry Data Collection
* Zipkin
* OpenCensus and OpenTracing - the precursor projects to OpenTelemetry

Transcript
Starting point is 00:00:00 Hey, welcome to another episode of the Software at Scale podcast. Joining me today is Ted Young, the Director of Developer Education at LightStep and the co-founder of OpenTelemetry. Welcome. Thank you. Glad to be here. Yeah. So I want to start with getting to know your background, right?
Starting point is 00:00:18 You've done so many things. You used to do something with animation. Now you're the director of developer education, which is a role I haven't heard of before. So how did you get here? Yeah, I do have a funny background of kind of switching what I'm up to every seven years or something like that.
Starting point is 00:00:41 I used to be a computer animator, actually. I got a computer science degree at Tufts but was mostly interested in film and animation, and I helped run a small post-production studio for a number of years. That was a lot of fun, but I got really into environmental and civil rights causes and began working on internet software to help some of the social movements that I was part of. That turned into a full-time job at a small consulting group called Radical Designs. And that was a lot of fun. Eventually, I wanted to get deeper into how computers actually work and start solving some of the problems that I was having as an application developer working on the internet. And so I started working
Starting point is 00:01:47 on container scheduling systems, what we call platforms, but I think of as distributed operating systems. And I did that for a while. After that, I moved over to observability to scratch the other itch. Through my entire career, whether I'm rendering visual effects, trying to make scalable web apps, or trying to make a container scheduler work, the question, why is it slow, just keeps coming up over and over again. And it's always such a pernicious question to answer that getting involved with observability really appealed to me. In particular, I met the founder of Lightstep a while back, and we really hit it off and saw eye to eye about not just how you do that,
Starting point is 00:02:52 but what some industry standards needed to exist to really help make doing that easier. And so I joined up at LightStep and have ended up with various titles that basically amount to go run the OpenTelemetry project and talk to people about it. So that's basically what I do for LightStep and what I've been doing for like the past five years. You know, the whole idea of changing up your career every seven years. It sounds like a great idea to me. It definitely keeps things fresh. That's for sure. And going from computer animation to something as, I guess, platformy, like distributed operating systems, as you basically said, that sounds tough.
Starting point is 00:03:42 Like, how did you manage that transition? You know, it's funny. On the surface, it seems very different. But computer animation and visual effects is kind of a small niche world, and most of the problems of distributed systems and scalability that people are hitting on the internet now are problems that 3D shops and visual effects shops, like Industrial Light & Magic and places
Starting point is 00:04:15 like that, have been dealing with for a long time. You are trying to do a massive amount of computation: when you think about rendering out, say, a movie or a commercial or a visual effects shot, you have all these different compositing layers and 3D elements, and you have to render out all those layers for every frame of the thing you're trying to make, which takes a massive amount of computational power. And so you end up with what's called a render farm, which is basically a server farm to farm all of that work out to. And when you do that, you run into all of these pipelining and throughput and latency problems: I want to see these particular things really fast, or I want to get it all done in time to meet this deadline, so I want the most throughput that I can have.
Starting point is 00:05:10 And you kind of end up having to solve these scheduling problems that are sort of like what we call map reduce these days. But the problem of I have to get all of the relevant bits to do the job to all the different machines that I want to work on it. Getting the bits to the machines takes time and takes up space. And then I need to figure out what the most efficient way is to share these common resources, CPU, RAM, memory, network, all of that, between all the different jobs that I want to do. And so the kind of algorithms you come up with, the kind of approaches you take, end up being pretty similar in practice to turning around and saying like, okay, now I want to schedule a bunch of apps to run on a bunch of servers.
Starting point is 00:06:05 And I want to think about the efficient way to download the images for all of those apps. And I also want to think about reliability and scalability and redundancy and stuff like that. And also managing resources like CPU and RAM across all the different apps and services I want to be running. It ends up being a more similar set of problems than it might look like on the surface. Does that experience or does that interest, is that what led to thinking about OpenTelemetry
Starting point is 00:06:44 or was there something else in that story? How did you find yourself in the position to start an initiative like that? Yeah, I mean, the same way I got into the platform and container scheduling stuff, getting into observability was because I hated my tools. I hated using them, and I wanted them to be better. And when I did some research, it seemed like there absolutely were known better ways to do some of these things. And in both cases, it also seemed like an industry technology wave was coming along where there was actually an opportunity to get out there, implement some of these better ideas, and have them see some adoption in the field. Okay, so I think that's the story with a lot of people who work on developer tools. It's just,
Starting point is 00:07:47 I had my first job as XYZ engineer, but I felt like I could be so much more productive if the systems I worked on were better, and then you go down that rabbit hole more and more. That's certainly what happened to me. So maybe you could tell listeners a little bit about the project, what it's aiming to do, the kind of consensus it's trying to drive in the industry. Just the background of the project to start would be interesting. Yeah, absolutely. The core problem we're trying to solve is coming up with a standard language for all computer programs to use to describe what it is that they're doing, with a focus on computer programs that are actually distributed systems: networked applications where you need to talk to a large number of computers in the process
Starting point is 00:08:48 of getting anything done. The fact that you need to talk to a bunch of different computers, but also be able to trace back all of those interactions across all those computers, adds an extra layer of difficulty to capturing that information correctly. And that's actually the part that I found was missing the most in our traditional tooling, mainly because it's more painful to set up and install and otherwise get these what are called context propagation mechanisms working. So the OpenTelemetry project is a project to design and build all the tools you would need to emit that kind of modern telemetry. And it's designed in such a way that it can be very stable and very amenable to being embedded in shared libraries and shared services.
Starting point is 00:10:08 So this was kind of actually another problem I had, which is I wrote a lot of open source software, like open source libraries and services and things. And you kind of hit this weird wall with observability, which is I would like to give you logs and traces and metrics, but I don't know how because my library has to integrate with all the other libraries you the application owner are running when you're compiling your application together. So even though I might have a good idea about what a good metrics library is, or good logging library, it's not very helpful for me to pick those things because it's really the
Starting point is 00:10:45 application owner who has to pick where all that data is going and what format it should be in and all of that stuff. So one thing the OpenTelemetry project focuses on is being very friendly with all the existing observability and analysis tools. So it's, in fact, there is no backend for OpenTelemetry. The idea with the project is we focus on a set of observability pipeline tools. So APIs, like client implementations, a thing called the collector, which is sort of this awesome Swiss army knife that you can run for processing all of this telemetry. But the end goal is to be able to, in addition to, you know, producing what we think is next generation telemetry, be able to like receive and export telemetry in any format.
Starting point is 00:11:47 And to, to make all the different pieces of open telemetry work together as a system, but also be useful as standalone pieces. So we really are trying to make something that is, is helpful for everybody and doesn't, doesn't go around imposing kind of like a straitjacket on people where if you want to use one part of it, then suddenly you're stuck running OpenTelemetryDB
Starting point is 00:12:14 somewhere or something like that. Okay, so let me replay some of that to see if I've understood it correctly. Let's say that I'm a vendor like Datadog, right? That's the one observability tool, I guess, that I'm familiar with. It makes sense for me to integrate with OpenTelemetry because everyone else will be speaking in this language, essentially, and then all of my customers can use any system that integrates with OpenTelemetry automatically, and I don't have to build a collector for each one. Is that roughly correct to start with? Yeah, that's correct. So what this is solving for application developers and
Starting point is 00:13:01 open source library authors, end users, if you will, is they don't really like vendor lock-in, right? Like people don't want, like instrumentation in particular is this cross-cutting concern that just gets embedded everywhere, right? You just have all these API calls all over the place when you add logging and metrics and all this stuff to your system. And if it's going to be a good system, all that stuff needs to be coherent. And it's a bummer to have to rip all of that out just because you want to use a different analysis tool to look at your data. So end users are interested in switching to OpenTelemetry because it's sort of a write once, read anywhere kind of solution. There's enough industry adoption from all the major vendors and cloud providers.
Starting point is 00:14:15 Most of the big shops are involved directly in OpenTelemetry because we all kind of agreed that this would be better for everybody. And so even if like you're not super interested in the project, what you're going to see over time is more and more of your end users coming to you saying, you know, we've already like instrumented everything with open telemetry and we've got a nice open telemetry pipeline that we like, and we just want to point the fire hose at you and have you give us useful analysis. Can we do that, please? And I think over time, it's not even over time, it's already the case that almost everyone is like, well, yeah, we would love to ingest open telemetry data in that case, because it makes it really easy for us to onboard these new customers. And if instead we tell them, no, no, no, you can't just send us your data. You have to go back in and rip everything out and replace it with all our vendor specific stuff. That is like a pattern that as time goes on, developers are less and less interested in.
Starting point is 00:15:18 I think just in general, proprietary code being, even if it's quote unquote open source, but it's like de facto proprietary because it's really designed to work just for one company or one backend. That's the kind of stuff people really don't want to put in their code base because it's kind of tying them to a particular service. So I think that's one of the reasons. There's a couple of reasons why open telemetry is really interesting, but that's like one very practical reason is I want to instrument once, have the data coming out of my system be very standard and regularized, and then be able to send that data off to a variety of different tools that all know how to properly ingest that regularized standardized data.
Starting point is 00:16:11 Yeah, that makes a lot of sense. I think any developer who's gone through a painful migration, especially for something like telemetry, where you have no idea how useful a particular log line is or a metric is, but you do need to migrate all of them. I've been through migration like that once. It's like you never want to do that again. Exactly.
Starting point is 00:16:33 And so with OTel, part of my pitch for people is that it's like the final migration, if you do the work to move over here. And you can do that work progressively, too, because with the collector, for example, you can just put it in there as a middleman for all of your existing data. Now you have something more like a router or switcher that can convert and regularize data that's coming in from different kinds of sources. That's an initial baby step. So then, as you are progressively swapping out your instrumentation for OpenTelemetry clients and OpenTelemetry instrumentation,
Starting point is 00:17:34 it's still the case that the new stuff is going to the same collectors that are getting data from the services you haven't migrated yet. And it makes for a smoother rollout. So we put a lot of thought into the practical aspects of rolling out and managing telemetry. And I think that's part of the reason why some of the OpenTelemetry tools are getting popular with people, even if they haven't been able to fully migrate over
Starting point is 00:18:07 to open telemetry instrumentation. Yeah, the final migration just, it sounds too good to be true for like an infrastructure engineer, but it's a great pitch. Never say final, right? Like, I think that's the thing everyone learns, like at some point, never label a document as final. Final, final. Yeah.
Starting point is 00:18:27 Final, final, final V2. Really final. Yeah. Yeah, yeah. But I do see it that way because we are really trying to make a standard. And we have a lot of buy-in already from a lot of like the appropriate groups you would want buy-in from. We've got a lot of organizational structure that makes that effective.
Starting point is 00:18:53 And we've also put a surprising amount of thought into how the code is actually structured and packaged, in a way that makes maintaining very strict backwards compatibility and stability much more feasible. Basically, we assume that instrumentation is never going to get updated again, but that you still want to be able to move to the latest version of the clients and the rest of the pipeline, so that you can get new features but also security updates and things like that, without ever having to touch old instrumentation when you do. That's something that we care a lot about. Another one is dependency conflicts. So the instrumentation API packages
Starting point is 00:19:49 don't come with any dependencies. It's sort of just an interface layer. So if you are an open source library and you add OpenTelemetry instrumentation to your library, you're not secretly taking on a gRPC dependency under the hood that's then going to cause your library to conflict with some other library that has an incompatible gRPC dependency. That kind of thought is, I think, really important. If you are going to convince people that native instrumentation is a good idea, you really want to make sure you're not inflicting dependency conflicts and compatibility issues on them when you do. So, if you've seen that xkcd, there's this idea of, oh, there's already 37 standards, I'm going
Starting point is 00:20:46 to introduce one more, and now there's 38 competing standards. How do you kickstart a project like this without making things worse? No, I've actually never seen that xkcd comic. No one has ever sent me a link to that xkcd comic, ever. Yeah, no, I definitely know the comic of which you speak, and it's a very important point. A question people ask is, how are you not just making it worse? One truism is that standards, real standards, take time. And one of the reasons they take time is that they require consensus. And the only way I believe you avoid being just the 38th standard, well, I guess there are two ways. One is just overwhelming force: you are some force within the industry ecosystem that carries so much weight that you can just inflict your opinion on everyone, and they are just going to have to take it
Starting point is 00:21:57 because it's just easier to go along with what you say. I don't like that approach, but it does happen from time to time. You know, looking at you, systemd. But there's also a situation that's more like the IETF approach, where you get all of the interested parties together and say, look, we really are making a forum to hear everybody out and try our best to come up with a solution that works for everybody, without being compromised in a way where it's just kind of shoddy. And that takes work. It takes a willingness from the people at the core of the project to see their role differently. The way I see my design role is not that I come up with awesome ideas and then go push them on people. It's more that I absorb everyone's requirements and then try to think hard about a design that would actually be clean but also meet all of those requirements, rather than saying, oh, your requirement is annoying, what if we just didn't do this thing and you had to change? So one part of the success of getting people involved was a willingness of the core people to organize the project more in that manner, and that
Starting point is 00:23:36 ended up attracting a large number of people, first to the OpenTracing project and then also to the OpenCensus project, which was mostly a Google-led effort; Microsoft also got involved there. And then leaders in both of those projects started lobbying each other to merge the two, because that seemed like the final bit: if we can settle our differences and find a way to merge these two projects, then we can really have a standard. Everyone at that point had sort of rolled up into one of two balls, and so now it was, well, we've got to roll these two balls up into one ball, and then we'll have something that really looks like broad industry consensus on how this stuff should work. And that honestly ended up being the best of both worlds, because those two projects did have a very similar pedigree, but also a different enough way of looking at the world. In other words,
Starting point is 00:24:47 both were looking at maybe a slightly different list of requirements, and when you put those two lists together, you've got the complete list of requirements. That became the OpenTelemetry project. And the merging of those two groups was also the final starting gun that caused everyone who had not gotten involved yet, but cared, to come over and get involved. I want to take the perspective of an engineer who's not an infrastructure engineer at all, right? Like, I'm a product engineer. My job is to deliver value to customers as quickly as possible. Why should the way telemetry is being shipped from my app to whatever vendor or tool I'm using, why should a change in that make me excited? Is there something that OpenTelemetry does that is just unique and different? How is it pushing the boundary and helping us understand our software better?
Starting point is 00:25:56 There's a large transition happening in observability, from what I would call the traditional three pillars model of observability to a more modern observability model. I've been describing it as a single braid as opposed to three pillars. And to be clear, very smart people will go out there and talk about this three pillars model and say things that are very useful about how you should set up and operate a traditional observability system. So I don't want to imply
Starting point is 00:26:38 that those people are wrong or that what they're saying is bad in some way. But what I don't like about saying the three pillars model of tracing, logging, and metrics, each being a pillar in the Parthenon of observability, is that it makes it sound like all of this is intentional somehow. Like there was a design plan in play, and having these three totally siloed streams of data is anything close to a good idea or how you should build a telemetry system. And the answer actually is that this is not a good way to structure your data or build a telemetry system. It's a very bad way to do it, by having all of
Starting point is 00:27:32 these data streams be totally isolated from each other. Having some of them, like logging, for example, be very unstructured. Having other ones, like tracing, be very crudely and heavily sampled. All of this stuff ends up creating a lot of extra work for operators who are trying to look at this data and figure out what their system is doing. Because we don't look at this data as separate problems: you don't go solve your logging problem by looking at your logs and your metrics problem by looking at your metrics. You're trying to figure out what your system is doing, then come up with a root cause hypothesis, and then go remediate it.
Starting point is 00:28:22 And in order to do that, you are trying to synthesize information out of all of this data. You're using all of these tools together and you're kind of moving around between all these different ways of looking at your system. And if you do that in a world where this data is poorly structured, it's not organized into like a unified graph of data that represents the topology of your system that represents the causality of operations in your system that has a way of correlating between aggregate data like metrics and transactional data like all the logs in this particular transaction, you have to do that work anyways. And if your tools can't do that work because the data just isn't structured in such a way that your machines can put it all together for you,
Starting point is 00:29:22 you end up doing that all in your head, right? So you end up trying to find correlations between different graphs by using your eyeballs to look at squiggly lines on a screen. We have all this computing power, and we're trying to find correlations between metrics by looking at squiggly lines with our eyeballs. That's crazy. Or trying to find all the logs in one particular transaction. And by transaction, I don't mean a database transaction. I mean someone clicks the checkout button on their mobile app, which triggers an HTTP request to some front-end service, which triggers a cascade
Starting point is 00:30:07 of requests to various back-end services, and even some services after that, and kicks off some background jobs, and does all this other work. And then you have some problem or error way down deep in that stack, and you want to just look at all the logs that were part of that one transaction. But that means looking at logs that came from 12 out of the 500 computers that you're running. I don't know how much operational background you have, but if you have much, you instantly know how annoying that is: grepping through your logs and trying to filter down to find a particular transaction. It turns the human operator into the glue that's trying to glue all this data together, instead of emitting properly structured data where
Starting point is 00:31:08 those graph relationships are already modeled properly in the data, and your analysis tool can just automatically glue all that stuff together for you without you doing anything, which then frees you up to just look at all the data and do the human-centric work of trying to perform a subjective analysis about what the real problem might actually be. Okay, so the idea of OpenTelemetry is, since it can be pervasive across every single way you collect data, and as long as you thread through the right things, like a request ID or something like that, it can help you stitch all of the relevant data together for a particular transaction, as you mentioned, like a particular web request in a certain sense,
Starting point is 00:31:58 or a particular job. And then the next part, which is visualizing that or showing it in a way that's debuggable, can be handed off to a visualization tool. But because of the way all of the data is structured from OpenTelemetry, it makes putting all this data together easier. Does that sound correct? Yeah, it does. So to get into some of the details about how that works: when you have a distributed transaction, you have a request from the end user that ends up touching a bunch of computers in many tiers of backends,
Starting point is 00:32:48 like, say, 20 computers are involved, whatever. If you want to be able to index all of the logs that were part of that particular transaction, you need a transaction ID, right? You need every log that gets emitted as part of that transaction to be connected to that transaction ID. And in order to do that, you have to have some way of passing that transaction ID along the execution path. So as the code executes through your program, the transaction ID needs to follow along with it. And so we call that context. Context is, within a program runtime, some way of associating the execution context with what you might think of as a bag of environment variables that are specific to that particular execution context.
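To make that "bag of values" idea concrete, here is a minimal sketch using the context and baggage modules from the OpenTelemetry Python API (the opentelemetry-api package); the key and value shown are made-up placeholders, not anything the speaker prescribes:

```python
from opentelemetry import baggage, context

# Build a new context carrying a transaction-scoped value,
# then attach it to the current execution context.
ctx = baggage.set_baggage("transaction.id", "abc123")
token = context.attach(ctx)
try:
    # Anything called from here, including code that hops across threads
    # or coroutines via OTel's context integration, can read the value.
    print(baggage.get_baggage("transaction.id"))  # -> "abc123"
finally:
    # Restore the previous context when this unit of work finishes.
    context.detach(token)
```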
Starting point is 00:33:54 And that means when that context jumps to another thread, that context bag has to jump along with it. If there's some kind of userland concurrency, like Tornado or gevent or something, happening on top of the threading that's going on, that system needs to manage keeping track of these contexts when it's switching coroutines and stuff like that.
Starting point is 00:34:23 And that's tricky. Not very many programming languages have that concept fully baked into them. And the other thing you have to do is whenever you make a network request, so the work now is passed to another computer, and this computer is now sitting there idling, consuming resources, but idling, waiting for this other computer to do some work and give it some information back. You want all
Starting point is 00:34:59 the logs that are on that other computer when it's performing this transaction, which means you have to now take that transaction ID in that context and staple it to that network request. So if it's HTTP, you would put it in an HTTP header so that on the server side on the other end, the controller action or whatever it is that's handling that request can pull that transaction ID out of the header, attach it to the context, and then continue on its merry way. And so that fundamental system of context and then propagating that context to the other servers that are part of the transaction, that is fundamental to what is traditionally called distributed tracing.
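As a rough sketch of what that "stapling" looks like with the OpenTelemetry Python API: the configured propagators inject the current trace context into an outgoing carrier (for HTTP, a dict of headers) and extract it again on the receiving side. The URL and the exact header contents in the comments are illustrative; the header format follows the W3C Trace Context spec discussed here.

```python
import requests
from opentelemetry.propagate import inject, extract

# Client side: copy the current context into outgoing HTTP headers.
headers = {}
inject(headers)
# headers now contains something like:
#   traceparent: 00-<32-hex trace id>-<16-hex span id>-01
requests.get("https://backend.example.com/checkout", headers=headers)

# Server side (inside the handler for that request): rebuild the context
# from the incoming headers so work there joins the same trace.
def handle_request(incoming_headers):
    parent_ctx = extract(incoming_headers)
    return parent_ctx  # used as the parent context when starting spans
```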
Starting point is 00:35:53 And in OpenTelemetry, we've taken that distributed tracing concept and extracted it into a lower-level concept that's just called context propagation. So there's this lower-level system, and all it does is focus on being able to keep that bag of context attached to your execution, then serialize it and propagate it along your network requests, and deserialize it on the other end, and so on and so forth. And that's involved making changes to the HTTP spec. So we went to the W3C and helped design a standard header
Starting point is 00:36:36 for putting some of this context in. And it involved, in every programming language, building one of these context propagation systems, trying to leverage as much of what already existed in that language as possible. All of OpenTelemetry is built on top of that. So the most fundamental system in OpenTelemetry is actually the tracing system, because what that does is take this context propagation mechanism and, on top of that, allow you to record and keep track of operations. So you say, I'm in a controller action
Starting point is 00:37:23 operation, and then I call out to a database client. And so that database client starts a database client operation. In OpenTelemetry terms, we call those spans in a trace. And then all of the logs that might be occurring, those are all occurring in the context of those spans, those operations. And those spans are all linked together in a graph. So every operation knows the ID of the parent operation that called it and can be connected to the child operations that it spawned as part of doing its job. And those trace IDs, those span IDs, all of that stuff, get propagated to the next service. So they can
Starting point is 00:38:16 continue the graph, saying, I started an operation on this other computer, but my parent operation was this client HTTP call on this other system. So if people have worked with a tracing system before, that's just the fundamentals of how distributed tracing works. But instead of saying distributed tracing is this third system running off in a corner on its own, we're saying, no, no, no, that's the fundamental context that everything executing needs to happen in. And then what you'd call your logging system is able to access that context.
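A hedged sketch of those two ideas together in Python: nested spans form the parent/child graph being described, and a log record can pick up the current trace and span IDs so it is indexed by the same transaction. The service name, logger name, and span names are invented for the example.

```python
import logging
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")
logger = logging.getLogger("checkout")

with tracer.start_as_current_span("controller-action"):    # parent span
    with tracer.start_as_current_span("db-client-query"):  # child span
        ctx = trace.get_current_span().get_span_context()
        # Attach the IDs so this log line is indexed by the transaction.
        logger.error(
            "query failed",
            extra={
                "trace_id": format(ctx.trace_id, "032x"),
                "span_id": format(ctx.span_id, "016x"),
            },
        )
```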
Starting point is 00:38:56 So all of your logs can get that trace ID and that span ID. In other words, the transaction they're part of, the operation they're part of. So then when you store them, you have these indexes. Once you've got that trace ID, if you have one log, like an exception or an error from a backend system, and you're like, well, show me all the logs in the entire transaction, from the client all the way to this backend to any other service that had anything to do with this transaction, boom, they're all indexed by that trace ID, and instantly you can see
Starting point is 00:39:33 all of them. You don't have to do any filtering or grepping about to make that happen. So I think that right there maybe shows some of the fundamental difference between having these as totally separate systems versus having one coherent graph of data. And the metrics get involved in that graph too. But I think just talking about how tracing and logs are actually kind of one and the same system is a good starting point
Starting point is 00:40:05 to see how something like OpenTelemetry is a bit different from traditional observability. Well, yeah, absolutely. I had no idea you went all the way to HTTP to come up with the standard header. That means serious business, because I'm sure that would have taken a lot of time. I just looked it up, and it looks like a fairly recent draft, like November 23rd of last year. So you go all the way to HTTP, and that's how you can ensure interoperability, because now it's a standard. Exactly. This stuff takes time. There were prior de facto standards, I should point out. Zipkin was a very popular open source distributed tracing tool, and they had a set of headers called B3, the B3 Zipkin headers.
Starting point is 00:40:56 So those were pretty common, and they work pretty well. But it's a step up in standardization to actually put it into the HTTP spec. And that's sort of the point: when you're talking about these distributed systems and wanting to connect all this information up into a graph, modern distributed systems are not all owned and run by the same operator, right? You have different teams and different operational teams
Starting point is 00:41:28 and service owners within an organization, but those organizations are also potentially contracting a lot of software as a service. In other words, software services that other organizations are running, like databases that the cloud providers are providing, or a message queue that some other third-party provider is running for you. And if you have a standard data format like OpenTelemetry for describing logs and traces, and for propagating these indexes and identifiers, now it becomes possible for those third-party providers
Starting point is 00:42:18 to send you the rest of your trace that you could never access before, right? Because that was being stored in some third party's systems, where they have their logs and their traces. If you want to know some nitty-gritty details about how your database query or your usage of a message queue was causing latency or problems, you might be able to discern some of that just from the clients that you're using to talk to it. But there's even more data that you could get if you could actually get operations and events and metrics out of their system, just the portion of their resources that you as an organization are using and not anybody else.
Starting point is 00:43:17 But if you have a standard, now there's a way for them to say, well, we've done the work to add that instrumentation, and we will emit it as an OpenTelemetry firehose. And so you can ingest that as well as the stuff coming out of your own applications and services. Now you have an even deeper trace of your overall system, because it includes those third-party systems as part of it. It's kind of like if every system spoke its own language and didn't speak HTTP, you'd be reinventing that for each system. And I'm sure there definitely are RPC systems and all of that. But basically, for most systems in the world, you can probably just communicate with them over HTTP and hear back just fine. And this is taking it a step further: to understand your system no matter where the pieces are or who runs them,
Starting point is 00:44:10 you can probably understand them and say, okay, this operation in this third-party vendor is taking time, and that's why my requests are slow. I think I finally get the vision now, and it makes a ton of sense to me. It's also much more ambitious than what I originally thought it was. There's this idea you brought up around the braid of observability. Initially, there are these three pillars, the parts known as metrics, logging, and tracing. They're often thought of as separate things, and they really shouldn't be. And we spoke about how traces and logs can easily be correlated
Starting point is 00:44:46 and should really just be the same thing. How do metrics play into that? Yeah, that's a great question. So there's two really practical ways that metrics plays into that. One is just maybe a fundamental concept, which is that metrics are just aggregates of events. So you have events that happen, like an operation occurs, like an HTTP request, and you want to know things about that particular HTTP request and how it fits in to an overall transaction.
Starting point is 00:45:22 But you might want to know things in aggregate about that HTTP request. How long did it take? Not just an individual request, but all the requests like that. What is the spread of latency? You might want to count number of 500 status codes per minute or something like that. And one way to do that is to have a metrics instrumentation API where you create counts and gauges and histograms and things like that. And you embed that directly into your code.
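That "embed it directly in your code" approach looks roughly like this with the OpenTelemetry metrics API in Python; the instrument names, attributes, and the helper function are illustrative choices, not anything prescribed by the project:

```python
from opentelemetry import metrics

meter = metrics.get_meter("http-client")

# A counter for response status codes and a histogram for request latency.
status_counter = meter.create_counter("http.client.response.count")
latency_hist = meter.create_histogram("http.client.duration", unit="ms")

def record_response(status_code: int, elapsed_ms: float, route: str) -> None:
    attrs = {"http.status_code": str(status_code), "http.route": route}
    status_counter.add(1, attrs)          # count one response
    latency_hist.record(elapsed_ms, attrs)  # record its latency
```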
Starting point is 00:46:02 And then you get counts and histograms and gauges, very old school and traditional. But another way to do that is if the data, the event data that's coming out of your system is very regularized and well structured. In other words, it's not like an unstructured string blob that you have to parse and hunt around for content in, if it's all very regularized key-value pairs that have standard keys and standard value types, then it becomes much more feasible to create a lot of your metrics on the fly. So you could embed metrics API calls in your HTTP client to do things like count status codes and stuff like that. But if you're also emitting a span for that HTTP client request,
Starting point is 00:47:01 you could farther down your pipeline, like let's say in the collector component of OpenTelemetry or in your backend, just anywhere farther down the line, you could just dynamically generate those metrics based off of that span being emitted. So that's, I would say, like one fundamental thing for people to think about. And why that's actually important is, if every time you want to change what metrics you're collecting about your system, you have to go into your code and make a code change and then redeploy your application, that's a bummer, right? That means bothering a developer who has the capacity to make that particular code change.
Starting point is 00:47:53 That means recompiling and redeploying an application and doing stuff like that. That's kind of like a long path that has quite a large number of side effects compared to an operator, a system operator, wanting to get additional metrics and just changing the configuration of something in their telemetry pipeline to start emitting those metrics there and not touching the application services at all, like never restarting them. They don't even know that you're generating new metrics. You're doing this all farther down the pipeline.
Starting point is 00:48:35 So that's one fundamental way that metrics are tied in as part of the braid of data with traces and logs, which is perhaps you should start switching to generating more of your metrics dynamically from your traces and logs. And with a regularized, highly structured system like OpenTelemetry that's a lot more feasible than if you weren't really running tracing or you're running tracing, but it was very heavily sampled up front, or your logs had this information in it, but it was not consistently structured. All the different things emitting logs emit an HTTP request a little bit differently. It's hard to, it's expensive to parse that stuff, et cetera, et cetera. So OpenTelemetry
Starting point is 00:49:34 makes dynamic metrics creation a lot more feasible. It does have a metrics API, to be clear. So that's also there. Okay. Yeah. I think I need to understand this a little better. So let's say you have, again, the web request example, right? You have a request that starts at a front end. It maybe makes a request to one underlying service, it comes back, it does some computation, and returns that data to a user. So what you're saying is, the fact that there is a trace that captures that also enables me to generate a metric, and maybe generate more metrics. For example, I can automatically track things like HTTP status codes from the underlying service if I want
Starting point is 00:50:21 to, or for the front end if I want. I guess I didn't fully grasp how I can dynamically generate more metrics given this is how my trace looks, or this is what my request goes through. Well, when you're talking about a metric, fundamentally what you're talking about is an event that happens in your system that you want to look at in aggregate, right? You want to look at it in aggregate, and you want it scoped along a certain number of dimensions. The way you want to look at it might be counting something, or summing it, or putting it into histogram buckets, or making a gauge. But a lot of the information that you're looking at isn't a sampling of continuous information. You have some stuff, like RAM or CPU, where you do kind of need some kind of probe in there that's taking a sample.
Starting point is 00:51:30 But a lot of what we're making metrics about, especially in the context of our transactions, are things like HTTP requests or database requests or exceptions occurring, things of that nature. So in all of these cases, there is something in your tracing system and your logging system describing that specific event occurring. And you could, right next to the place that you're recording that specific event in OpenTelemetry, also add right there, using the metrics API, something that counts essentially the same information or otherwise emits a metric about that event. But you could also do that farther down your pipeline. If you're trying to count status codes, right, how many 500s, how many 403s, how many 200s, you're trying to count status codes in your system
Starting point is 00:52:40 based on some set of dimensions: which API endpoint you're talking about, or what route you're talking about, et cetera. If you're already emitting all of that information about those HTTP events happening, there's no need necessarily to bake all of that metrics gathering into your code. You could instead create a trace processor or an event processor, essentially, later on down the pipeline. This is one of the things the collector is very good at. It takes in all of your data, and you can write these processing pipelines to do things like transform the data or scrub sensitive information out of it, but you can also use it as a place to generate more data,
Starting point is 00:53:32 and one particularly useful thing you can do there is generate metrics out of your events. And given that there isn't one canonical good set of dimensions to capture a particular metric, there are what you might think of as default dashboards you might want to set up for particular services and particular libraries that you're using; there may be a default dashboard that captures some reasonable information about that. But as time goes on and your systems get bigger and you understand them more and the problems you're trying to solve with them become more specific, it's hard to predict what metrics you really want in the future and what dimensions you want those metrics
Starting point is 00:54:28 recorded by. So the ability to dynamically create more metrics on the fly, as an operator or as the analyst looking at that data, being like, dang, I really want this additional metric, or I want to change the dimensions that I'm recording this particular event across, and doing that just by going to your telemetry pipeline, making configuration changes to your collectors, and restarting your collectors, rather than having to make code changes to your applications and restarting your applications - that gives operators and the people farther down the line who care about the telemetry being emitted and the dashboards being set up the freedom to start generating the metrics they want without having to do these application restarts, or bother the specific developer who would need to make the change because that's their particular part of the code base or something like that.
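In a real deployment that derivation usually lives in the collector purely as configuration (for example, a pipeline that turns spans into request counts), with no application changes at all. As a hedged illustration of the same idea expressed in code rather than collector config, here is a sketch of an SDK-side span processor that counts finished HTTP spans by status code; the class name and attribute keys are assumptions for the example, not the collector's actual mechanism:

```python
from opentelemetry import metrics
from opentelemetry.sdk.trace import SpanProcessor

class StatusCodeCounter(SpanProcessor):
    """Derives a request-count metric from spans as they end."""

    def __init__(self):
        meter = metrics.get_meter("derived-metrics")
        self._counter = meter.create_counter("http.server.request.count")

    def on_end(self, span):
        status = span.attributes.get("http.status_code")
        route = span.attributes.get("http.route", "unknown")
        if status is not None:
            # Same event, new aggregate view: count by status code and route.
            self._counter.add(
                1, {"http.status_code": str(status), "http.route": route}
            )
```

Registering something like this is one line on the SDK's tracer provider, but the point being made here is that the collector lets operators do the equivalent with configuration instead of code changes and redeploys.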
Starting point is 00:55:48 Okay. And the collector is a daemon, right? It's not like a server-side component, it's actually a client-side component, so it can do those transformations pretty efficiently. Yeah. You can run the collector in a variety of pipeline roles. One common place to run it is as what's often called an agent. Basically, you can run it on the same machine, the same virtual machine, or as a sidecar if you're running Kubernetes. So it's local, on a local network connection to your application. The advantage of running it there is that it can collect a lot of additional data without the application having to do that. So that's a good place to configure the collector to collect things like CPU and memory and stuff like that. It can also collect additional information about the environment that the application might not be collecting, about the Kubernetes environment or the cloud environment or,
Starting point is 00:57:01 you know, just something about the resources being consumed by that particular application. And it can decorate all the data coming in with those additional resource attributes. So there's some good reasons for running it locally. The disadvantage for running it locally, of course, is that it's consuming the same resources as your application. So it's also feasible to run collectors on their own boxes. It's feasible to run collectors in a pool behind a load balancer. So what people often end up doing is having this sort of tiered pipeline where they have an application. That application is talking to a local collector. That local collector is doing a very minimal amount of work. It's maybe sampling machine metrics like CPU and RAM, and it's storing all of the telemetry data,
Starting point is 00:58:11 basically acting as a buffer between the application and the rest of the telemetry pipeline. And because it's on a local network connection with the application, that means you can configure your application to not really buffer that telemetry data. That's really helpful, because it means that if the application suddenly terminates, you're not losing a large batch of the telemetry data that you probably care most about. That's a problem if the network back pressure on your telemetry system reaches all the way into your application; then you start to run that risk. So by moving that to a sidecar or a local collector, the collector can act as a better buffer to handle any back pressure that might be happening in your telemetry system. The reason then to run these collector pools farther down the line is if you want to do more and more processing of your telemetry data
Starting point is 00:59:19 that doesn't need to be done locally; that means you could be doing it later, in a pool that's collecting data from many separate application sources. And playing back to one of the big advantages of this decoupling: I can have something like a simple structured log of something that I thought was important a year ago, that I just decided to log because I thought it'd be interesting to see, but today I think it's extremely important that I have a metric that comes out every time
Starting point is 00:59:56 that log line is hit, essentially, especially when a certain attribute of that structured log is true or false or something else. And OpenTelemetry just lets me do that without changing any client code. It lets me add a new metric or whatever just by tweaking the collector config to say, when you see this structured log event, generate a metric. Exactly, yes, that's exactly what you can do. And that's because OpenTelemetry has what we call semantic conventions, which is kind of a funny term; you might be better off calling it a schema, like a semantic schema. Elastic Common Schema is another example of one of these.
Starting point is 01:00:50 But there's a schema to describe all the common operations that machines do. So if you're recording an HTTP client request, if you're recording a SQL database call, all of the common things that a computer program might do, we have a strictly defined set of key values that should be emitted to describe that event. So it's not just that you can use the collector to, like, say, parse a log line and figure out how to emit a metric. You can do that. But it's also the fact that that data coming into the collector for many of the things you would want to collect metrics on is already in a very nice, regularized, well-structured data format.
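A hedged sketch of what that regularized data looks like at the instrumentation level: a span describing an HTTP client call carries well-known attribute keys from the semantic conventions, and the tracer can declare which schema version it follows. The exact key names have shifted across semantic-convention releases, and the instrumentation name here is invented, so treat this as illustrative; it assumes a reasonably recent opentelemetry-api that accepts a schema_url.

```python
from opentelemetry import trace

# schema_url records which version of the semantic conventions
# this instrumentation adheres to.
tracer = trace.get_tracer(
    "my-http-client-instrumentation",
    schema_url="https://opentelemetry.io/schemas/1.9.0",
)

with tracer.start_as_current_span("HTTP GET") as span:
    # Standardized key-value pairs, not a free-form log string.
    span.set_attribute("http.method", "GET")
    span.set_attribute("http.url", "https://api.example.com/cart")
    span.set_attribute("http.status_code", 200)
```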
Starting point is 01:01:46 So it's much more efficient to be generating metrics off of that kind of data. And it's also much more reliable, right? Because you can depend on what that data is going to look like. In fact, we even have schema versioning. Every instrumentation source indicates which version of the schema it's adhering to, so you can even do schema translations. That's one of the ways we handle backwards compatibility. If we figure out additional attributes we want to emit, or change the way data is split up in something we're reporting, all of those changes, if they come to
Starting point is 01:02:37 stable instrumentation, would have to be released along with a schema processor in the collector. So you can build your pipeline tools to expect data to be in a particular format, and if it's not in that precise format, if it's in a different format, then the schema processor just gets run to convert it to the format you want. So you're not breaking your dashboards just because you updated your instrumentation to a new version. That's interesting. I'm sure versioning must have been a pain to design and roll out. It seems like a tricky problem, with schema transformations and everything. I've had to do a similar data modeling problem at work, and we just said, for now, let's skip versioning,
Starting point is 01:03:31 because there are so many implications that you have to think about, like what if there are two conflicting versions? But when you're building a standard and there are so many different systems that need to interact together, I can see why you'd have to go all the way to build this. Yeah. I would say a lot of what we're doing really are best practices, maybe not so much for application code, but for anyone who is creating a shared library, something that is going to run in many different applications in many different
Starting point is 01:04:13 environments, especially if it's a cross-cutting concern like telemetry. It's worth it to care about things like backwards compatibility and upgrade paths and transitive dependency conflicts, where the dependencies that my thing depends on may conflict with the dependencies that other libraries depend on. If you think about those things up front, right at the beginning when you're designing your stuff, you have a much better chance of coming up with a system that you can adhere to as time goes on to maintain those qualities, and tests you can do
Starting point is 01:05:08 to ensure you're maintaining those qualities. It's much harder, in my opinion and experience, to add those qualities to a system later, where you didn't think about it at the beginning and bake it into the design and architecture of the system. Not impossible, but
Starting point is 01:05:33 it is worthwhile to think through the different ways you're going to want to mutate and update and improve the library that you're offering, and figure out what a good way to lay out those packages is, where the right places are to introduce loose coupling, things of that nature, to ensure that you're going to be able to say, once this particular piece is stable, that it will remain stable forever. That might limit the kinds of backwards compatible changes you can make there. But you also want a way to then introduce new experimental components in such a way that they aren't destabilizing the stable components, which in many languages
Starting point is 01:06:36 comes down to how you lay out your packages, for example. Coming up with a plan like that does require some work up front, but if you do that work up front, then implementing it becomes smooth. So that would be a best practice: if I were to talk more in the future about OpenTelemetry as an open source project, and some of the practices we follow that other open source projects that are kicking off would benefit from, I would say looking at how we handle versioning and backwards compatibility is a place where I'm really proud of the project. And as you mentioned, these things take time, right?
Starting point is 01:07:31 You go through these problems once and then new requirements come in. It takes time for the industry to say, yes, this is important, we need to make sure we have this, we've had this problem before. And the design iteration goes on: maybe one project doesn't get it quite right, but the next project has all those learnings, and then the industry says, yes, we can converge on this new standard or this new project because it ticks most of the boxes we need. Yeah, and so we have a development process that involves RFCs. We call them OTEPs, OpenTelemetry Enhancement Proposals, but they're very similar to, say, the RFC process from the IETF. We tend to require that OTEPs come with prototypes: here is a change that I'm proposing to make.
Starting point is 01:08:27 Here is an implementation of that change in two or three different languages if it's a client-level change. We're really trying to get a lot of that design work done up front. Basically, we don't want to have surprises show up after something is added to the spec. If you care about backwards compatibility in a strict sense, then things are very sticky once they've gone into the spec; it's hard to pull them out. If they're going into an experimental part of the spec, then obviously later we can say, whoops, we're making a breaking change to this part.
Starting point is 01:09:20 But even there, we actually do our best to try to avoid thrash. If for no other reason, then that just kind of dumps extra work on the different language maintainers. We're conscious about the fact that if we make a client change to OpenTelemetry, if we change an API or add an API or change how the client implementations work, then that's work that's going to then get repeated across like 11 different languages. So it's expensive to be like, build it this way. No, no, second thought, build it this other way. So for all those reasons, we've kind of developed a longer specification process that kind of involves doing more design and review work upfront than a lot of people are used to.
Starting point is 01:10:23 I think many people, including myself, are more used to an approach that come up with a good idea or what maybe sounds like a good idea, write some code that seems like probably it implements that idea, throw it over the wall and run it in production and see what happens. And there's definitely some advantages to doing that. Not every piece of code that gets written has the requirements, something like Open Telemetry has. But I do think for projects that are some other equivalent of OpenTelemetry, like this is a big shared library that's going to get embedded in lots of different important applications, or this is a platform that everything is running on, or yada yada. Something that's like code that's really going to get exercised in a lot of different environments lots of people are going to care about i i think it's worthwhile for projects like that to to come up with a more structured approach
Starting point is 01:11:33 to how they think about change. Yeah, as you mentioned, it depends on the life cycle of the project, who's using it, how many people there are. I can't even imagine a security vulnerability in something like OpenTelemetry, or a remote code execution issue like the Log4j stuff that happened a few months ago, if you're familiar with that. You have to be careful because there's a lot of impact, especially when you're a library that other applications depend on, right? You don't want to be a source of the supply chain issues that are inherent in open source development.
Starting point is 01:12:32 Not just open source development, but any form of development that involves leveraging code that you did not write, that is not of your provenance. It's really a conundrum. I honestly don't have a great answer for it, because through the whole history of software development this has been one of the big lauded examples, right? We don't all need to recreate everything from scratch. We can build libraries that do something useful, and then we can all depend on those libraries. The fact that we can reuse all of this code is the huge advantage most people think of when they think of software development. Code reuse and leveraging other people's code is a feature, not a bug. But it's definitely at odds
Starting point is 01:13:35 with a concept of strict security, right? So there is a fundamental mismatch there that is really unfortunate. And it's interesting to see how long we were able to fly without it becoming a truly widespread problem. It's always been a bit of a problem, but in the past it was maybe more restricted to things like state-sponsored actors targeting other states, very sensitive stuff, with those sensitive systems presumably adhering to stricter software practices in order to counter that.
Starting point is 01:14:36 But now it's all kind of mushed together, where everything is intermediated by networked computers. Everything's a computer program, code goes everywhere, and all of that code leans very heavily on a lot of publicly available open source code. We try to address some of that for OpenTelemetry, like scanning our dependencies and trying to make sure that we aren't sitting there as a vector for a supply chain issue. And we try to think hard about what dependencies we are taking on and where.
Starting point is 01:15:26 But the other thing is ensuring that it's possible for our end users to stay up to date. That's another form of, I don't know if I would call it forwards compatibility, but the idea is this: the way I often get stuck with things like web frameworks I've used in the past is that in order to get some security patch, I need to upgrade to a new version. But that new version changed things like the plugin interfaces for some plugins I use.
Starting point is 01:16:07 And the plugins I use haven't been updated to use those new interfaces. So now I'm in a jam, right? There's something I really want, maybe related to security, that I can get by rolling forward to the latest version. But to get there, I'm suddenly faced with potentially doing a lot of work. I either have to abandon these plugins, or those plugins effectively become my code and I have to go in there and somehow make the upgrade myself to keep them working. And it might not necessarily be plugin interfaces; it might be any interface that something presents. If it creates a breaking change, that means I'm going to have to do all of that work
Starting point is 01:16:59 before I can get the security patch. And as a side effect of that, people start to camp on old versions of software. Then they start to rightly demand that the people who develop that software maintain those older versions and backport security patches and other things, and the maintenance cost of all of that goes up over time. But if you work really hard to ensure that isn't happening, that you're avoiding the kinds of situations that would make it more difficult for your end user to update to the latest version of your client than just bumping the dependency version in their manifest, or if they feel like they can reliably pin their dependency version in their manifest to something that keeps them up to date with everything short of a major version bump, and then you never make a major version bump, then you're also creating a world where your end users aren't lagging behind or scared of updating.
Starting point is 01:18:22 one hopefully they are staying more up to date and they are avoiding you know um a situation where they're hanging around without applying security patches but also it means like when they do need to go up to date they aren't hitting some wall and thus being stuck on an old version and then raising the maintenance burden the maintenance burden on the open telemetry project because now we're like oh well we have a responsibility to maintain all these different ancient versions of this thing because uh we did something that made it genuinely difficult for those users to receive security patches and performance boosts. Yeah, like, this is just such a tricky problem that I've been that that has been going around,
Starting point is 01:19:12 Yeah, this is just such a tricky problem, and it keeps coming up more and more, especially with the NPM ecosystem. But I've thought about it a little, and it's not just an NPM problem. NPM might be one of the more egregiously bad ones, where nobody pins dependencies, but it's a problem across every programming language, in my opinion. Maybe in languages like Go people tend not to use dependencies as much, and the standard library is really strong, similar to Python, so you're a little safer there. But I really think the future is some kind of capability-based dependencies, where certain capabilities just aren't allowed: left-pad should not be allowed to contact the Internet. Specifying those kinds of capabilities for your modules may be an answer.
Starting point is 01:20:00 I really believe that's one part of it. The other part is more and more security tools that we use on a day-to-day basis that scan for issues and anomalies; I think those will become more and more commonplace. But instead of going further down this path, I have a question for you, which I think is a good wrap-up: what are you excited about next with OpenTelemetry? What is the one big thing that you're working on, or that you see coming up, that makes you think,
Starting point is 01:20:31 wow, this is great, and this is going to be a really good addition to the project? Well, it sounds a little boring, but stability is the main thing I'm really interested in. We're still putting the final touches on logs and metrics, and there's this somewhat delayed process of those things becoming stable in OpenTelemetry, the instrumentation we provide around them becoming robust, and then all these different backends and analysis tools starting to provide features that actually leverage the structured
Starting point is 01:21:16 data that OpenTelemetry provides. One quick example: we didn't talk about metrics exemplars, but one way metrics are tied to traces, besides dynamically generating metrics later, is that OpenTelemetry's data structure allows you to record a sampling of, say, the trace and span IDs that are associated with a particular metric. In other words, you have a range of metric values that you might be emitting, and you have these high values that represent something problematic.
Starting point is 01:22:02 Let's say you have an alert threshold on some metric, and the alert goes off. Your next question is going to be: what are the transactions that are generating these problematic values in this metric, right? That would be the next step in your investigation.
Starting point is 01:22:32 And in OpenTelemetry there has actually been a sampling of those different transactions that were emitting the values associated with that metric, which means you can build workflows that let you click directly through from that metrics dashboard into the logs and traces that were associated with those metric events. But it also means that machine analysis, machine learning and other kinds of statistical and automated tools, have that rich graph of data to perform their analysis on. So they're not using really crude heuristics to try to figure out what was related to what in your system, like "well, these things over here happened around the same time as those things over there." You actually have a real graph of data that's connecting all of this together.
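A rough sketch of what this looks like from the instrumentation side, using the OpenTelemetry Python API; the service, histogram, and attribute names are made up, and whether exemplars are actually attached depends on how the SDK and backend are configured:

```python
from opentelemetry import metrics, trace

tracer = trace.get_tracer("checkout.service")   # hypothetical service name
meter = metrics.get_meter("checkout.service")
latency = meter.create_histogram(
    "checkout.duration", unit="ms", description="Checkout latency"
)

def handle_checkout(duration_ms: float):
    # Recording the measurement while a sampled span is active lets an
    # exemplar-aware SDK attach this span's trace_id/span_id to the
    # exported histogram point, so a spike on a dashboard can link
    # straight back to example traces.
    with tracer.start_as_current_span("handle_checkout"):
        latency.record(duration_ms, attributes={"payment.method": "card"})
```

The exported data point can then carry those sampled trace IDs, which is what makes the click-through from a metrics dashboard into specific traces possible.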
Starting point is 01:23:25 And that means your machine analysis can become much more efficient and more accurate when it comes to finding correlations between different trends going on in your data. Being able to say, wow, when you're seeing this kind of exception over here, that's highly correlated with a small subset of IPs somewhere along the line, or something along those lines.
Starting point is 01:24:15 Or all of this latency is actually highly correlated with this particular Kafka node. I should stress that this is not root cause analysis; it's not the tool telling you what the root cause is. But just identifying those correlations is a really time-consuming process when you're fighting fires and triaging problems in your system. Realizing that this bad behavior over here correlates with a particular configuration option having a particular value,
Starting point is 01:24:46 those are the kinds of things where you spend hours hunting around before you even figure out that they might be a factor. But once you have a graph of data that graph analysis and statistical analysis can just walk and do their thing the way they're designed to, you can really start leveraging that stuff. So my big hope for the future is to start seeing more and more of these analysis tools pick up on these advanced features of OpenTelemetry and start offering value to people based on them.
Starting point is 01:25:24 Yeah, this is kind of similar to LightStep's Change Intelligence, but really platformized for everyone. Exactly. Change Intelligence is a perfect example of this kind of correlation analysis, which we do a good job of today. But if the data is not coming out of OpenTelemetry, then we have to fall back onto rougher heuristics. So you're still getting these correlations, but the chance that they're false connections starts to go up, right? It's harder for those systems to accurately
Starting point is 01:26:06 give you correlations, because past a certain point it just has to guess; it doesn't have the data being handed to it in any other way. Yeah, I think that's exciting. And the second thing you were mentioning, around interfaces that let you analyze the graph of data you're getting: maybe the system can be smart enough to find a correlation, but I'm also imagining different human interfaces that let you write some kind of query, or even a little bit of code, that actually lets you analyze all of these different changes. It's pretty exciting to think about. And now I'm thinking: we do all of this production monitoring, but I'm super interested in developer environments. Why don't we get this kind of analysis from our local machines when Git is acting up or NPM is being too slow?
Starting point is 01:27:05 We really need to have telemetry for all the things. Imagine if tomorrow all of our build systems were integrated with OpenTelemetry and we knew exactly why something was slowing down. Yes, that's actually a place people
Starting point is 01:27:21 are definitely using it: their CI/CD systems, trying to find latency problems and bottlenecks there. Another place, if we're talking about total geek-out stuff, is something I've been excited about for a long time and have kind of been predicting. I've done some talks about it in the past but haven't had a chance to implement much of it, and sure enough, some implementations are starting to show up. The idea is that if you have this kind of structured data
Starting point is 01:27:56 that OpenTelemetry is producing, you can start using that data as input into your tests. In other words, when you're developing software, you use unit tests, integration tests, these different kinds of tests to verify your software. But those tests have nothing to do with the kind of information you're getting about your system in production. So you could have high-quality tests,
Starting point is 01:28:27 but that doesn't tell you anything about the quality of, say, the logs coming out of your system. So there's this real disconnect between the tools we use to query what our system is doing in production and the tools we use to query and verify what our system is doing while we're developing it. I've felt for a while that there's a lot of room to build novel forms of integration testing on top of the production telemetry coming out of your system, and there are a lot of advantages that can come with that. And I've
Starting point is 01:29:14 been excited to see, in this past year, a couple of projects get started on top of OpenTelemetry to do exactly that. If you Google around for trace-driven development, trace-based testing, or OpenTelemetry testing tools, there are a couple of different projects starting up around that kind of thing. And I think that's a really rich source of potential for the future, because that hard split between how we test and verify our software when we're making it versus how we test and verify our software
Starting point is 01:29:57 when we're running it, I've seen that as one of those arbitrary walls that should get knocked down at some point. Yeah, there's so much to expand on there. You can have a unit test like "I expect at least one HTTP call to have status 200" that you check against a fake OpenTelemetry collector locally. And then in production you have the exact same test, but it's running continuously against
Starting point is 01:30:26 that collector, and it's basically an alarm. So you can have parity between your unit tests and even your production environment, in theory. It sounds a little out there, but it could work. It sounds wild at first, but it's totally a thing. I think if you start writing your integration tests against this kind of production data, it's possible to develop a testing query language, an assertion framework, that feels more like the kind of assertions you make when you're testing, but is also written in such a way that it's scalable; in other words, it can run against a stream of data. And once you have a language like that, I think you'll turn around and realize that the kind of alerting we do, quote-unquote alerting, is actually testing.
Starting point is 01:31:30 It's just very, very crude testing. It's like a testing framework where you only have one assertion you can make: thing passes a threshold, with some tolerance percentage. And that's super crude compared to the kinds of assertions we make in our integration and unit test environments. So I think there's a lot of room there, you know,
Starting point is 01:31:57 if you come up with something like that. Anyone who does that and can figure out a way to run it against production data is handing operators a way to get started. There's the classic joke that when you have a problem in production, can you solve it by just writing more tests or running your tests again? No, but actually, maybe you can. If you come up with an assertion language you can run against your production data, that's a powerful introspection tool. So I'm excited for somebody to build that.
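To make that concrete, here is a rough sketch, not taken from any of the projects Ted mentions, of the "expect at least one HTTP call with status 200" idea as a unit test, using the OpenTelemetry Python SDK's in-memory exporter as a stand-in for a fake collector. The span and attribute names are made up.

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Wire a tracer provider to an in-memory exporter: our "fake collector".
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = provider.get_tracer("tests")

def do_request():
    # Stand-in for real instrumented code; an HTTP client instrumentation
    # would normally create this span and set the status attribute.
    with tracer.start_as_current_span("GET /health") as span:
        span.set_attribute("http.status_code", 200)

def test_at_least_one_successful_http_call():
    do_request()
    spans = exporter.get_finished_spans()
    assert any(
        s.attributes.get("http.status_code") == 200 for s in spans
    ), "expected at least one HTTP span with status 200"

test_at_least_one_successful_http_call()
```

Run once locally, that is an ordinary unit test; run continuously against spans flowing through a real collector instead of an in-memory list, the same assertion becomes the richer kind of alert Ted is describing.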
Starting point is 01:32:38 tools and you think okay we've made a lot of progress and things are so much better than before but there's so much more to go there's so many things we haven't thought about the more you dig into the space the more and engineers as you mentioned in your original story like you're always hungry for like better tooling and things are never efficient enough and engineers at least right now are still super expensive so like every tool that can make them more efficient. Yes. Probably. Absolutely. Become a bit.
Starting point is 01:33:07 And it just makes our lives easier, too. A lot of the improvements here come from time savings, but it's also about saving a lot of scut work. Yeah. Well, Ted, thank you so much for being a guest. I think this was a lot of fun, and I hope I can ask you back for a round two at some point.
Starting point is 01:33:34 Absolutely. Happy to come back and talk more about observability anytime. Thank you for having me.
