Software at Scale - Software at Scale 47 - OpenTelemetry with Ted Young
Episode Date: May 26, 2022
Ted Young is the Director of Developer Education at Lightstep and a co-founder of the OpenTelemetry project.
This episode dives deep into the history of OpenTelemetry, why we need a new telemetry standard, all the work that goes into building generic telemetry processing infrastructure, and the vision for unified logging, metrics and traces.
Episode Reading List
Instead of highlights, I’ve attached links to some of our discussion points.
* HTTP Trace Context - new headers to support a standard way to preserve state across HTTP requests.
* OpenTelemetry Data Collection
* Zipkin
* OpenCensus and OpenTracing - the precursor projects to OpenTelemetry
Transcript
Hey, welcome to another episode of the Software at Scale podcast.
Joining me today is Ted Young, the Director of Developer Education at LightStep and the
co-founder of OpenTelemetry.
Welcome.
Thank you.
Glad to be here.
Yeah.
So I want to start with getting to know your background, right?
You've done so many things.
You used to do something with animation.
Now you're the director of developer education,
which is a role I haven't heard of before.
So how did you get here?
Yeah, I do have a funny background
of kind of switching what I'm up to every seven years
or something like that.
I used to be a computer animator, actually.
I got a computer science degree at Tufts but was mostly interested in film and animation, and helped run a small post-production studio for a number of years. That was a lot of fun, but I got really into environmental and civil rights causes and began working on internet software to help some of the social movements that I was part of. That turned into a full-time job at a small consulting group called Radical Designs. And that was a lot of fun.
Eventually, I wanted to get deeper into how computers actually work and start
solving some of the problems that I was having as an application developer working on the internet. And so I started working
on container scheduling systems, what we call platforms, but I think of as distributed operating systems. And I did that for a while. After that, I moved over here
to observability to scratch the other itch, which, you know, through my entire
career, whether I'm like rendering visual effects or, you know, trying to make scalable web apps or
trying to make a container scheduler work, you know, the question, why is it slow? Why is it slow?
Just keeps coming up over and over again.
And it's always such a pernicious question to answer. That's what got me involved with observability, and in particular it's how I met the founder of LightStep a while back.
And we really hit it off and kind of saw eye to eye about how, not just how you do that,
but what some industry standards needed to exist to really help make doing that easier.
And so I joined up at LightStep and have ended up with various titles that basically amount to go run the OpenTelemetry project and talk to people about it.
So that's basically what I do for LightStep and what I've been doing for like the past five years.
You know, the whole idea of changing up your career every seven years. It sounds like a great idea to me.
It definitely keeps things fresh.
That's for sure.
And going from computer animation to something as, I guess, platformy, like distributed operating
systems, as you basically said, that sounds tough.
Like, how did you manage that transition?
You know, it's funny.
On the surface, it seems very different.
But actually, in the world of computer animation and visual effects,
it's kind of like a small niche world.
But most of the problems of distributed systems and scalability
that people are hitting on the internet now are
problems that 3D shops and visual effects shops like Industrial Light & Magic have been dealing with for a long time. You are trying to do a massive amount of computation. When you think about rendering out, say, a movie or commercial or visual effects shot, you have all these different compositing layers, all these different 3D elements, and you have to render out all those layers for every frame of the thing you're trying to make, which is a massive amount of computational power. And so you end up with what's called a render farm, which is basically a server farm, to farm all of that work
out to. And when you do that, you run into all of these like pipelining and throughput and latency
problems of, I want to see these particular things really fast, or I want to get it all done in time to meet this deadline.
So I want the most throughput that I can have.
And you kind of end up having to solve these scheduling problems
that are sort of like what we call map reduce these days.
But the problem of I have to get all of the relevant bits to do the job to all the different machines that I want to work on it.
Getting the bits to the machines takes time and takes up space.
And then I need to figure out what the most efficient way is to share these common resources, CPU, RAM, memory, network, all of that, between all the different jobs that
I want to do. And so the kind of algorithms you come up with, the kind of approaches you take,
end up being pretty similar in practice to turning around and saying like, okay, now I want
to schedule a bunch of apps to run on a bunch of servers.
And I want to think about the efficient way to download the images for all of those apps.
And I also want to think about reliability and scalability and redundancy and stuff like that.
And also managing resources like CPU and RAM
across all the different apps and services I want to be running.
It ends up being a more similar set of problems
than it might look like on the surface.
Does that experience or does that interest,
is that what led to thinking about OpenTelemetry
or was there something else
in that story? How did you find yourself in the position to start an initiative like that?
Yeah, I mean, I think the same way I got into platform and container scheduling stuff, getting into observability was because I hated my tools. And I hated using them, and I wanted them to be better. And when I did some research, it seemed like there absolutely were known better ways to do some of these things. And in both cases it was also like an industry technology wave seemed to be coming along, where there was actually an opportunity to get out there and implement some of these better ideas and have them see some adoption in the field.
Okay, so I think that's the story with a lot of people who work on developer tools. It's just, I had my first job as XYZ engineer, but I felt like I could be so much more productive if the systems I worked on were better. And then you go down that rabbit hole more and more and more. That's certainly what happened to me. So maybe you could tell listeners a little bit about the project, what it's aiming to do, the kind of consensus it's trying to drive in the industry.
Just the background of the project to start would be interesting.
Yeah, absolutely. The problem we're trying to solve is coming up with a standard language for all computer programs to use to describe what it is that they're doing, with a focus on computer programs that are actually distributed systems: networked applications where you need to talk to a large number of computers in the process of getting anything done. The fact that you need to talk to a bunch of different computers, but also be able to trace back all of those interactions across all those computers, adds an extra layer of difficulty to capturing that information correctly. And that's actually the part that I found was missing the most in our traditional tooling, mainly because it's more painful to set up and install and otherwise get these what are called context propagation mechanisms working. So the OpenTelemetry project is a project to design and build all the tools you would need to emit that kind of modern telemetry.
And it's designed in such a way that it can be very stable and very amenable to being embedded
in shared libraries and shared services.
So this was kind of actually another problem I had,
which is I wrote a lot of open source software,
like open source libraries and services and things.
And you kind of hit this weird wall with observability, which is I would like to give you logs and traces and metrics,
but I don't know how because my library has to integrate with
all the other libraries you the application owner are running when you're compiling your
application together. So even though I might have a good idea about what a good metrics library is,
or good logging library, it's not very helpful for me to pick those things because it's really the
application owner who has to pick where all that data is going and what format it should be in and
all of that stuff. So one thing the OpenTelemetry project focuses on is being very friendly with all the existing observability and analysis tools.
In fact, there is no backend for OpenTelemetry.
The idea with the project is we focus on a set of observability pipeline tools.
So APIs, like client implementations, a thing called the collector,
which is sort of this awesome Swiss army knife that you can run for processing all of this
telemetry. But the end goal is to be able to, in addition to, you know, producing what we think is
next generation telemetry, be able to like receive and export telemetry in any format.
And to make all the different pieces of OpenTelemetry work together as a system, but also be useful as standalone pieces. So we really are trying to make something that is helpful for everybody and doesn't go around imposing a straitjacket on people
where if you want to use one part of it, then suddenly you're stuck running OpenTelemetryDB
somewhere or something like that.
Okay, so let me replay some of that to see if I've understood it correctly.
Let's say that I'm a vendor like Datadog, right? That's the one observability tool, I guess, that I'm familiar with.
It makes sense for me to integrate with OpenTelemetry because everyone else will be
speaking in this language, essentially, and then all of my customers can use
any system that integrates with OpenTelemetry automatically,
and I don't have to build a collector for each one. Is that roughly correct to start with?
Yeah, that's correct. So what this is solving for application developers and
open source library authors, end users, if you will, is that they don't really like vendor lock-in, right? Instrumentation in particular is this
cross-cutting concern that just gets embedded everywhere, right? You just have all these API
calls all over the place when you add logging
and metrics and all this stuff to your system. And if it's going to be a good system, all that
stuff needs to be coherent. And it's a bummer to have to rip all of that out just because you want to use a different analysis tool to look at your data.
So end users are interested in switching to OpenTelemetry because it's sort of a
write once, read anywhere kind of solution. There's enough industry adoption from all the major vendors and cloud providers.
Most of the big shops are involved directly in OpenTelemetry because we all kind of agreed that this would be better for everybody.
And so even if like you're not super interested in the project, what you're going to see over time is more and more of your end users coming to you saying, you know, we've already like instrumented everything with open telemetry and we've got a nice open telemetry pipeline that we like, and we just want to point the fire hose at you and have
you give us useful analysis. Can we do that, please? And I think over time, it's not even over time, it's already
the case that almost everyone is like, well, yeah, we would love to ingest open telemetry data in
that case, because it makes it really easy for us to onboard these new customers. And if instead
we tell them, no, no, no, you can't just send us your data. You
have to go back in and rip everything out and replace it with all our vendor specific stuff.
That is like a pattern that as time goes on, developers are less and less interested in.
I think just in general, proprietary code, even if it's quote unquote open source but it's de facto proprietary because it's really designed to work just for one company or one backend,
That's the kind of stuff people really don't want to put in their code base because it's kind of
tying them to a particular service. So I think that's one of the reasons.
There's a couple of reasons why open telemetry is really interesting,
but that's like one very practical reason is I want to instrument once,
have the data coming out of my system be very standard and regularized,
and then be able to send that data off to a variety of different tools that all know how to properly ingest that regularized standardized data.
Yeah, that makes a lot of sense.
I think any developer who's gone through a painful migration,
especially for something like telemetry,
where you have no idea how useful a particular log line is or a metric is,
but you do need to migrate all of them.
I've been through migration like that once.
It's like you never want to do that again.
Exactly.
And so with OTEL, part of my pitch for people is like,
it's like the final migration.
If you do the work to move over here,
and you can do that work progressively too. Again, with the collector, for example, you could just drop it in as a middleman for all of your existing data.
And now you have something more like a router or switcher that can convert and regularize data that's coming in from different kinds of sources.
And that's a like initial baby step.
So then as you are progressively swapping out your instrumentation
for OpenTelemetry clients and OpenTelemetry instrumentation,
it's still that new stuff is going to the same collectors
that are getting data from the services you haven't migrated yet.
And it makes it a little bit more like a smoother rollout.
So we put a lot of thought into the kind of practical aspects
of rolling out and managing telemetry.
And I think that's part of the reason why some of the open telemetry tools
are getting popular with people,
even if they haven't been able to fully migrate over to OpenTelemetry instrumentation.
Yeah, the final migration just, it sounds too good to be true
for like an infrastructure engineer, but it's a great pitch.
Never say final, right?
Like, I think that's the thing everyone learns,
like at some point, never label a document as final.
Final, final.
Yeah.
Final, final, final V2.
Really final.
Yeah.
Yeah, yeah.
But I do see it that way because we are really trying to make a standard.
And we have a lot of buy-in already from a lot of like the appropriate groups you
would want buy-in from.
We've got a lot of organizational structure that makes that effective.
And we've also put a surprising amount of thought into how the code is actually structured and packaged, in a way that makes maintaining very strict backwards compatibility and stability much more feasible. So basically, assume that instrumentation is never going to get updated again, but you want to be able to move to the latest version of the clients and the rest of the pipeline so that you can get new features, but also security updates and things like that, without ever having to touch old instrumentation when you do that. That's something that we care a lot about. Dependency conflicts are another one.
So the instrumentation API packages
don't come with any dependencies.
It's sort of just like an interface layer.
So if you are like an open source library
and you add open telemetry instrumentation to your library,
you're not secretly taking on like a gRPC dependency under the hood
that's then going to cause your library to conflict with some other library
that also has a gRPC dependency that's like incompatible.
That kind of thought is, I think, really important. If you are going to convince people that native instrumentation is a good idea, you really want to make sure you're not inflicting dependency conflicts and compatibility issues on them when you do.
If you've seen that xkcd where there's this idea of, oh, there's already 37 standards, I'm going to introduce one more, and now there's 38 competing standards: how do you kickstart a project like this without making things worse?
No, I've actually never seen that xkcd comic. No one has ever sent me a link to that xkcd comic, ever. Yeah, no, I definitely know the comic of which you speak, and it's a very important point. That is a question people ask: how are you not just making it worse? And one truism is that standards, real standards, take time. And one of the reasons they take time is that they require
consensus. And the only way I believe you avoid being just the 38th standard, well, I guess there's
two ways. One is just overwhelming force. You are some force within the industry ecosystem, whatever, that carries so much weight that you can just inflict your opinion on everyone, and they are just going to have to take it because it's easier to go along with what you say. I don't like that, but that does happen from time to time. You know, looking at you, systemd. But there's also a situation that's more like the IETF approach, where you just get all of the interested parties together and say, look, we really are making a forum to hear everybody out and try our best to come up with a solution that works for everybody without being compromised in the way where it's just kind of shoddy. And that takes work. It takes a willingness from the people at the core of the project to see their role differently. The way I see my design role is not, I come up with awesome ideas and then go push them on people. It's more like, I absorb everyone's requirements and then try to think hard about a design that would actually be clean, but also meet all of those requirements. Rather than saying, oh, your requirements are annoying, what if we just didn't do this thing and you had to change? So one part of the success of getting people involved
was a willingness of the core people to organize the project more in that manner. And that ended up attracting a large number of people, first to the OpenTracing project and then also the OpenCensus project, which was mostly a Google-led effort; Microsoft also got involved there. And then leaders in both of those projects started lobbying each other to merge the two projects, because that seemed like the final bit: if we can settle our differences and find a way to merge these two projects, then we can really have a standard. Because everyone at that point had sort of rolled up into one of two balls, and so now it was like, well, we've got to roll these two balls up into one ball, and then we'll have something that really looks like broad industry consensus on how this stuff should work. And that honestly ended up being the best of both worlds, because those two projects did have a very similar pedigree, but also a different enough way of looking at the world. In other words, both were looking at maybe a slightly different list of requirements, and when you put those two lists together, you've got the complete list of requirements. And that became the OpenTelemetry project. And the merging of those two groups was also the final starting gun that caused everyone who had not gotten involved yet, but cared, to come over and get involved.
I want to take the perspective of an engineer who's not an infrastructure engineer
at all, right? Like I'm a product engineer. My job is to deliver value to customers as quickly
as possible. Why should the way telemetry is being shipped from my app to whatever vendor or tool I'm
using, why should a change in that make me excited? Is there something that OpenTelemetry does that is just unique and different?
How is it pushing the boundary and helping us understand our software better?
There's a large transition happening in observability, from what I would call the traditional three pillars model of observability to a more modern observability model. I've been describing it as a single braid as opposed to three pillars. But first, the basic issue. And to be clear, it's very smart people who will go out there and talk about this three pillars model and say things that are very useful as to how you should set up and operate a traditional observability system.
So I don't want to imply that those people are wrong, or that what they're saying is bad in some way.
But what I don't like about saying the three pillars model of tracing, logging, and metrics,
each being like this pillar in the observability, the Parthenon of observability,
is it makes it sound like all of this is intentional somehow. Like there was a design plan in play and having these three totally siloed streams of data
is like anything close to a good idea or how you should build a telemetry system.
And the answer actually is that is not a good way to structure your data or build a telemetry system. And the answer actually is that is not a good way to
structure your data or build a telemetry system. It's a very bad way to do it by having all of
these data streams be totally isolated from each other. Having some of them like logging, for
example, be very unstructured. Having other ones like tracing being very crudely and heavily sampled
like all of this stuff ends up creating a lot of extra work for operators who are trying to
look at this data and figure out what their system is doing. Because we don't look at this data that way:
you don't go solve your logging problem by looking at your logs
and your metrics problem by looking at your metrics.
You're trying to figure out what your system is doing
and then come up with a root cause hypothesis and then go remediate it.
And in order to do that, you are trying to synthesize
information out of all of this data. You're using all of these tools together and you're kind of
moving around between all these different ways of looking at your system. And if you do that in a
world where this data is poorly structured, it's not organized into like a unified graph of data that represents
the topology of your system that represents the causality of operations in your system
that has a way of correlating between aggregate data like metrics and transactional data like all the logs in this particular
transaction, you have to do that work anyways. And if your tools can't do that work because the data
just isn't structured in such a way that your machines can put it all together for you,
you end up doing that all in your head, right?
So you end up trying to find correlations between different graphs by using your eyeballs
to look at squiggly lines on a screen.
We have like all this computing power and we're trying to find correlations between
metrics by looking at squiggly lines with our eyeballs. That's crazy. Or trying to find all the logs in one particular transaction. And by transaction, I don't mean database transaction. I mean someone clicks the checkout button on their mobile app, which triggers an HTTP request to some front-end service, which triggers a cascade
of requests to various back-end services, and even some services after that, and kicks
off some background jobs, and does all this other work.
And then you have some problem or error way down deep in that stack, and you want to just look at all the logs that
were part of that one transaction. But that means looking at logs that came from 12 out of 500
computers that you're running. I don't know how much operational background you have. But I think
if you have much, you instantly know that's really annoying: grepping through your logs and trying to filter down to find a particular transaction. It just turns the human operator into the glue that's trying to glue all this data together, instead of emitting properly structured data where
those graph relationships are already modeled properly in the data and your analysis tool
can just automatically glue all that stuff together for you without you doing anything,
which then frees you up to just look at all the data and do the
human-centric work of trying to perform a subjective analysis about what the real
problem might actually be. Okay, so the idea of open telemetry is since it can be pervasive across
every single way you collect data, and as long as you thread through the right things, like a request ID or something like that, it can help you stitch all of the relevant data together for a particular transaction, as you mentioned, like a particular web request in a certain sense, or a particular job. And then the next part, which is visualizing that or showing it in a way that's debuggable,
can be handed off to a visualization tool. But because of the way all of the data is structured
from OpenTelemetry, it makes putting all this data together easier. Does that sound correct?
Yeah, it does. So to get into some of the details about how that works,
when you have a distributed transaction,
so you have a request from the end user
that ends up touching a bunch of computers
in many tiers of backends,
like maybe 20 computers are involved, whatever. If you want to be able to index all of the logs that were part of that particular transaction, you need a transaction ID, right? You need every log that gets emitted as part
of that transaction to be connected to that transaction ID. And in order to do that, you
have to have some way of passing that transaction ID along the execution path. So as the code executes through your program,
the transaction ID needs to follow along with it. And so we call that context. So context is within
a program runtime, some way of associating the execution context with a bag of environment variables,
you might think of them as,
that are specific to that particular execution context.
And that means when that context jumps to another thread,
that context bag has to jump along with it.
If there's some kind of userland stuff like Tornado or gevent or something happening
on top of the threading that's going on,
you know, that system needs to manage
keeping track of these contexts
when it's switching coroutines and stuff like that.
And that's tricky.
Not very many programming languages
have that concept fully baked into them.
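As a concrete illustration of that idea, here is a minimal sketch using Python's contextvars module, which is one runtime facility for carrying a bag of values along the execution path. This is just the language-level concept; the OpenTelemetry SDKs wrap a mechanism like this for you, and the names here are purely illustrative.

```python
# Minimal sketch of execution context using Python's contextvars.
import asyncio
import contextvars

# The "bag of environment variables" attached to an execution context.
transaction_id = contextvars.ContextVar("transaction_id", default=None)

async def handle_request(req_id: str):
    # Set the context value for this particular request's execution path.
    transaction_id.set(req_id)
    await do_work()

async def do_work():
    # Even in a different coroutine, the context value follows the execution.
    print("working on transaction", transaction_id.get())

async def main():
    # Two concurrent requests each keep their own, isolated context value.
    await asyncio.gather(handle_request("txn-1"), handle_request("txn-2"))

asyncio.run(main())
```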
And the other thing you have to do
is whenever you make a network request,
so the work now is passed to
another computer, and this computer is now sitting there idling, consuming resources, but idling,
waiting for this other computer to do some work and give it some information back. You want all
the logs that are on that other computer when it's performing this transaction, which means you have to now take that transaction ID in that context
and staple it to that network request.
So if it's HTTP, you would put it in an HTTP header
so that on the server side on the other end,
the controller action or whatever it is that's handling that request
can pull that
transaction ID out of the header, attach it to the context, and then continue on its merry way.
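A rough sketch of that inject-and-extract flow with the OpenTelemetry Python API might look like the following. The URL and span names are illustrative, and it assumes an SDK and propagator have already been configured; by default the context is carried in the W3C traceparent header.

```python
# Hedged sketch: propagating trace context over an HTTP request in Python.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def call_backend():
    # Client side: serialize the current context into the outgoing headers.
    with tracer.start_as_current_span("checkout-client-call"):
        headers = {}
        inject(headers)  # adds e.g. "traceparent: 00-<trace-id>-<span-id>-<flags>"
        requests.get("http://backend.example/checkout", headers=headers)

def handle_request(incoming_headers: dict):
    # Server side: deserialize the context and continue the same trace.
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("checkout-server", context=ctx):
        ...  # logs and child spans here share the same trace ID
```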
And so that fundamental system of context and then propagating that context to the other servers that are part of the transaction, that is fundamental to what is traditionally called distributed tracing.
And in OpenTelemetry, we've taken that distributed tracing concept
and we've extracted it to a lower level concept
that's just called context propagation.
So there's this lower level
system that all it does is focus on being able to keep that bag of context attached to your
execution and then serialize it and propagate it along your network requests and deserialize it on
the other end and so on and so forth. And that's involved making changes to the HTTP spec.
So we went to the W3C and helped design a standard header
for putting some of this context in.
And it involved, in every programming language,
building one of these context propagation
systems trying to leverage as much of what already existed in that language
and all of open telemetry is built on top of that so uh
the most fundamental system in open telemetry is actually the tracing system
because what that does is it takes this context propagation mechanism and on top of that allows
you to record and keep track of operations so you say i'm in you know a controller action
operation and then i call out to a database client. And so
that database client starts like a database client operation. In OpenTelemetry terms,
we call those spans in a trace. And then all of the logs that might be occurring,
those are all occurring in the context of those spans, those operations.
And those spans are all linked together in a graph.
So every operation knows the ID of the parent operation that called it and can be connected
to the child operations that it spawned as part of doing its job. And those trace
IDs, those span IDs, all of that stuff, get propagated to the next service. So they can
continue the graph of saying I started an operation on this other computer. But my parent operation
was this client HTTP call on this other system.
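To make the parent-child relationship concrete, here is a minimal sketch with the OpenTelemetry Python API; the operation names and attributes are illustrative, and it assumes a TracerProvider has already been set up.

```python
# Sketch: nested spans record the parent/child graph automatically.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def controller_action():
    # Parent span: the "controller action" operation.
    with tracer.start_as_current_span("controller_action"):
        query_database()

def query_database():
    # Child span: started inside the parent's context, so it records the
    # parent span ID and joins the same trace without any extra wiring.
    with tracer.start_as_current_span("db.query") as span:
        span.set_attribute("db.system", "postgresql")
        ...  # run the query
```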
So if people have worked with the tracing system before, that's just kind of like the fundamentals
of how distributed tracing work. But instead of saying distributed tracing was this
third system, like running off in a corner on its own, we're saying, no, no, no, that's like the fundamental context
that everything executing needs to happen in.
And then what you'd call your logging system
is able to access that context.
So all of your logs can get that trace ID
and that span ID.
In other words, the transaction they're part of,
the operation they're part of.
So then when you store them, you have these indexes. So once you've got that trace ID, if you have one
log, like an exception or an error from a backend system, and you're like, well, show me all the
logs in the entire transaction, from the client all the way to this backend, to any other service that had anything to do with this transaction. Boom, they're all indexed by that trace ID, and so instantly you can see all of them. You don't have to do any filtering or grepping about to make that happen.
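As a hedged sketch of what that correlation looks like in code, the active span's context exposes the trace ID and span ID, so a log line can be stamped with them; the field names below are illustrative.

```python
# Sketch: attaching the current trace and span IDs to a log record.
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def log_with_trace(message: str):
    ctx = trace.get_current_span().get_span_context()
    # Hex formatting matches how tracing backends usually display these IDs.
    logger.error(
        message,
        extra={
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
        },
    )
```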
So I think that right there maybe shows some of the fundamental difference
between having these as like totally separate systems versus having one
coherent graph of data.
And the metrics get involved in that graph too.
But I think just talking about how tracing and logs are actually kind of one and the same system is a good starting point
to see how something like open telemetry is a bit different from traditional observability.
Well, yeah, absolutely. Like I had no idea you went all the way to like HTTP to come up with
the standard header. That means serious business, because I'm sure that would have taken a lot of
time. I just looked it up. And it looks like a fairly recent draft, like November 23rd of last year. So you go all the way to HTTP and
that's how you can ensure interoperability because now it's a standard. Exactly. Exactly. This stuff,
this stuff takes time. There were like prior de facto standards. I should call it like Zipkin
was a very popular open source,
distributed tracing tool. And they had a set of headers called B3, like the B3 Zipkin headers.
So those were pretty common, and they work pretty well. But it's a step up in standardization to actually put it into the HTTP spec. And that's sort of what we're saying when it comes to...
When you're talking about these distributed systems
and wanting to connect all this information up into a graph,
modern distributed systems are not all owned and run
by the same operator, right?
You have different teams
and different operational teams
and service owners within an organization,
but then those organizations are potentially
contracting a lot of software as a service.
In other words, software services
that other organizations are running, like
databases and things that the cloud providers are providing, or some other third party
message queue, you know, provider is providing for you. And if you have a standard data format like open telemetry for describing logs and traces and propagating
these indexes and identifiers, now it becomes possible for those third-party providers to send you the rest of your trace that you could never access before, right? Because that was being stored in some third party's systems, where they have their logs and their traces. But if you want to know some nitty-gritty details about how your database query or your usage of a message queue was causing latency or problems, you
might be able to discern some of that data just from the clients that you're using attached
to it.
But there's even more data that you could get if you could actually get operations and events and metrics out of their system.
That was just the portion of their resources that you as an organization are using and
not anybody else.
But if you have a standard, now there's a way for them to be like, well, we've done
the work to add that instrumentation, and we will emit it as an OpenTelemetry firehose.
And so you can ingest that as well as the stuff coming out of your own applications and services.
And now you have an even deeper trace of your overall system because it's including these third-party systems as part of it.
It's kind of like if every system spoke their own language and didn't speak in HTTP,
you'd be reinventing that for each system. And I'm sure there definitely are RPC systems and all of that. But basically, for most systems in the world, you can probably just communicate with them over HTTP and hear back just fine. And this is taking it a step further: to understand your system, no matter where the pieces are or who runs them, and be able to say, okay, this operation in this third-party vendor is taking time, and that's why my requests are slow. I think I finally get the vision now, and it makes
a ton of sense to me.
It's also much more ambitious than what I originally thought it was.
There's this idea you brought up around the braid of observability.
Initially, there's these three pillars.
There's the part known as metrics, logging and tracing.
They're often thought of as separate things and they really shouldn't be. And we spoke about how traces and logs can easily be correlated
and should really just be the same thing.
How do metrics play into that?
Yeah, that's a great question.
So there's two really practical ways
that metrics plays into that.
One is just maybe a fundamental concept,
which is that metrics are just aggregates of events.
So you have events that happen, like an operation occurs, like an HTTP request, and you want to know things about that particular HTTP request and how it fits in to an overall transaction.
But you might want to know things in aggregate about that HTTP request.
How long did it take?
Not just an individual request, but all the requests like that.
What is the spread of latency?
You might want to count number of 500 status codes per minute or something like that.
And one way to do that is to have a metrics instrumentation API where you create counts
and gauges and histograms and things like that.
And you embed that directly into your code.
And then you get counts and histograms and gauges,
very old school and traditional. But another way to do that is if the data, the event data
that's coming out of your system is very regularized and well structured. In other words,
it's not like an unstructured string blob that you have to parse and hunt around for content in,
if it's all very regularized key-value pairs that have standard keys and standard value types,
then it becomes much more feasible to create a lot of your metrics on the fly. So you could embed metrics API calls in your HTTP client
to do things like count status codes and stuff like that.
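That explicit, in-code route might look roughly like this in Python; this is a hedged sketch, the metric name and attribute keys are illustrative, and it assumes a MeterProvider has been configured by the SDK.

```python
# Sketch: the "old school" explicit metrics API route.
from opentelemetry import metrics

meter = metrics.get_meter(__name__)
request_counter = meter.create_counter(
    "http.server.request.count",  # illustrative metric name
    description="Count of handled HTTP requests",
)

def record_response(route: str, status_code: int):
    # The attributes are the dimensions the aggregate is scoped by.
    request_counter.add(1, {"http.route": route, "http.status_code": status_code})
```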
But if you're also emitting a span for that HTTP client request,
you could farther down your pipeline,
like let's say in the collector component of
OpenTelemetry or in your backend, just anywhere farther down the line, you could just dynamically
generate those metrics based off of that span being emitted. So that's, I would say, like one fundamental thing for people to think about. And
why that's actually important is, if every time you want to change what metrics you're collecting
about your system, you have to go into your code and make a code change and then redeploy your application, that's a bummer, right?
That means bothering a developer who has the capacity
to make that particular code change.
That means recompiling and redeploying an application
and doing stuff like that.
That's kind of like a long path that has quite a large number of side
effects compared to an operator, a system operator, wanting to get additional metrics
and just changing the configuration of something in their telemetry pipeline to start emitting
those metrics there
and not touching the application services at all, like never restarting them. They don't even know
that you're generating new metrics. You're doing this all farther down the pipeline.
So that's one fundamental way that metrics are tied in as part of the braid of data with traces and logs,
which is perhaps you should start switching to generating more of your metrics
dynamically from your traces and logs.
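One way to picture the "derive metrics from events" approach is a processor that watches finished spans and counts what it sees. In practice this is often done in the collector rather than in the application, but here is a hedged in-process sketch using the Python SDK's span processor hook; all names are illustrative.

```python
# Sketch: dynamically generating a metric from span data farther down the pipeline.
from opentelemetry import metrics
from opentelemetry.sdk.trace import SpanProcessor

class StatusCodeMetricsProcessor(SpanProcessor):
    def __init__(self):
        meter = metrics.get_meter(__name__)
        self._counter = meter.create_counter("http.status.count")

    def on_end(self, span):
        # If the finished span describes an HTTP call, count its status code.
        status = span.attributes.get("http.status_code")
        if status is not None:
            self._counter.add(1, {"http.status_code": status, "span.name": span.name})

# Registered on the tracer provider, e.g.:
#   trace.get_tracer_provider().add_span_processor(StatusCodeMetricsProcessor())
```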
And with a regularized, highly structured system like OpenTelemetry that's a lot more feasible than
if you weren't really running tracing or you're running tracing, but it was very heavily sampled
up front, or your logs had this information in it, but it was not consistently structured.
All the different things emitting logs describe an HTTP request a little bit differently, and it's expensive to parse that stuff, et cetera, et cetera. So OpenTelemetry makes dynamic metrics creation a lot more feasible. It does have a metrics API, to be clear, so that's also there.
Okay. Yeah. I think I need to understand this a little
better. So let's say you have, again, like the web request example, right? Like you have a request
that starts at a front end. It maybe makes a request to like one underlying service. It comes
back and then it returned, it does some computation, returns that data to a user.
So what you're saying is the fact that there is a trace that captures that
also enables me to generate a metric, and maybe generate more metrics. For example, I can automatically track things like HTTP status codes from the underlying service for the front end if I want. I guess I didn't fully grasp how I can dynamically generate more metrics given this is how my trace looks, or this is what my request goes through.
Well, when you're talking about a metric, fundamentally what you're talking about is an event that happens in your system that you want to look at in aggregate, right?
You want to look at it in aggregate.
You want it scoped along a certain number of dimensions. The way you want to look at it might be counting something, or summing it, or putting it into histogram buckets, or making a gauge. But a lot of that information that you're looking at isn't a sampling of continuous information. You have some stuff, like RAM or CPU, where you do kind of need some kind of probe in there that's taking a sample.
But a lot of what we're making metrics about, especially in the context of our transactions, are things like HTTP requests or database requests or exceptions occurring, things of that nature.
So in all of these cases, there is something in your tracing system
and your logging system describing that specific event occurring.
And you could, right next to the place that you're recording that specific event in OpenTelemetry,
also add right there, using the metrics API, something that counts essentially the same information
or otherwise emits a metric about that event.
But you could also do that farther down your pipeline. Like, if you're trying to count status codes, right, how many 500s, how many 403s, how many 200s, you're trying to count status codes in your system based on some set of dimensions: which API endpoint you're talking about, or what
route you're talking about, et cetera, et cetera.
If you're already emitting all of that information about those HTTP events happening, there's
no need necessarily to bake all of that metrics gathering into your code. You could instead
create a trace processor or an event processor, essentially, later on down the pipeline. This is
one of the things that Collector is very good at. It takes in all of your data and you can write
these processing pipelines to do things like transform the data, scrub sensitive information out of it. But you can also use it as a place to generate more data, and one particularly useful thing you can do there is generate metrics out of your events. And given that there isn't one canonical good set of dimensions to capture a particular metric,
given that there are what you might think of as a default dashboard you might want to set up
for particular services and particular libraries that you're using.
There may be a default dashboard that captured some reasonable information about that.
As time goes on and your systems get bigger and you understand them more
and the problems you're trying to solve with them become more specific,
it's hard to predict what metrics you really want in the future and what dimensions you want those metrics
recorded by. So the ability to dynamically create more metrics on the fly, as an operator or as the analyst looking at that data, being like, dang, I really want this additional metric, or I want to change the dimensions that I'm recording this particular event across, just by going to your telemetry pipeline, making configuration changes to your collectors and then restarting your collectors, rather than having to make code changes to your applications and restarting your applications, that gives operators and people who are farther down the line, as far as caring about the telemetry being emitted and the dashboards being set up and all of that, the freedom to start generating the metrics they want without having to do these application restarts or bother the specific developer who would need to do that because that's their particular part of the code base or something like that.
Okay. And the collector is a daemon, right? It's not like a server-side component. It's
actually a client-side component. So it can do those transformations pretty efficiently.
Yeah. You can run the collector in a variety of pipeline roles. So one common place to run it is something like what's often called an agent. Basically, you can run it on the same machine, same virtual machine, or as a sidecar if you're running Kubernetes. So it's local, on a local network connection to your application.
The advantage of running it there is it can collect a lot of additional data
without the application having to do that.
So that's a good place to configure the collector to collect things like CPU and memory and stuff like that.
It can also collect additional information about the environment that the application might not be
collecting about, you know, the Kubernetes environment or the cloud environment or,
you know, just something about the resources being consumed by that particular
application. And it can decorate all the data coming in with those additional resource attributes.
So there's some good reasons for running it locally. The disadvantage for running it locally,
of course, is that it's consuming the same resources as your application.
So it's also feasible to run collectors on their own boxes.
It's feasible to run collectors in a pool behind a load balancer. So what people often end up doing is having this sort of tiered pipeline
where they have an application. That application is talking to a local collector.
That local collector is doing a very minimal amount of work. It's maybe sampling machine metrics like CPU and RAM, and it's storing all of the telemetry data,
basically acting as a buffer between the application and the rest of the telemetry
pipeline. And because it's on a local network connection with the application, that means you can configure your application to not really buffer that telemetry data.
And that's really helpful because that means if the application suddenly terminates, you're not losing a large batch of the telemetry data that you probably care most about. That is a problem if the network back pressure on your telemetry system is reaching all the way into your application; then you start to run that risk.
And so by then moving that to a sidecar or a local collector, then the collector can act
like a better buffer to handle any back pressure that
might be happening in your telemetry system. The reason then to run these collector pools
farther down the line is if you're wanting to do more and more processing about your telemetry data
that doesn't need to be done locally, that means you could be doing it later
in kind of a pool that's collecting data
from many separate application sources.
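As a hedged sketch of the application side of that tiered setup, an SDK can be pointed at a local collector agent and hand off spans as quickly as possible. The code below uses the OpenTelemetry Python SDK; 4317 is the conventional OTLP/gRPC port, and the exact exporter package path may differ between SDK versions.

```python
# Sketch: exporting spans to a local collector "agent" over OTLP.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# The sidecar/agent collector on localhost buffers and forwards the data,
# so the application itself keeps very little telemetry in memory.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```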
And playing back to one of the big advantages
of this decoupling is that I can have something
like a simple structured log of something that I
thought was important a year ago, that I just decided to log because I thought it'd be interesting to see, but today I think it's extremely important that I have a metric that comes out every time that log line is hit, especially when a certain attribute of that structured log is true or false or something else. And OpenTelemetry just lets me do that without configuring or changing any client code. I can add a new metric just by tweaking the collector config to say, when you see this structured log event, generate a metric.
Exactly. Yes, exactly. That's exactly what you can do. And because OpenTelemetry has what we call semantic conventions, which is kind of a funny term, and you might be better off calling it a schema, like a semantic schema.
But Elastic Common Schema is another example of one of these.
But there's a schema to describe all the common operations that machines do.
So if you're recording an HTTP client request,
if you're recording a SQL database call,
all of the common things that a computer program might do, we have a strictly
defined set of key values that should be emitted to describe that event. So it's not just that you
can use the collector to, like, say, parse a log line and figure out how to emit a metric.
You can do that.
But it's also the fact that that data coming into the collector for many of the things you would want to collect metrics on is already in a very nice, regularized, well-structured data format.
So it's much more efficient to be generating metrics off of that kind of data.
And it's also much more reliable, right?
Because you can depend on what that data is going to look like.
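To illustrate what those semantic conventions look like in practice, here is a small Python sketch describing an HTTP call with convention-style attribute keys, so anything downstream can rely on the same names. The key names follow the HTTP conventions of that era; newer schema versions have renamed some of them, and the URL is illustrative.

```python
# Sketch: semantic-convention style attributes on an HTTP client span.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("HTTP GET") as span:
    span.set_attribute("http.method", "GET")
    span.set_attribute("http.url", "https://api.example.com/checkout")
    span.set_attribute("http.status_code", 200)
```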
In fact, we even have schema versioning. So every instrumentation source indicates which version of the schema it's adhering to.
So you can even do schema translations.
That's one of the ways we handle backwards compatibility.
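A hedged sketch of what declaring a schema version can look like in the Python API: an instrumentation source passes a schema URL when it acquires a tracer, so downstream tools know which version of the conventions it emits. The instrumentation name is illustrative, and exact keyword support varies by SDK version.

```python
# Sketch: declaring which semantic-convention schema this instrumentation emits.
from opentelemetry import trace

tracer = trace.get_tracer(
    "my.instrumentation.library",  # illustrative instrumentation name
    schema_url="https://opentelemetry.io/schemas/1.9.0",
)
```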
If we figure out additional attributes we want to emit, or change the way data is split up in something we're reporting, all of those changes, if they come to stable instrumentation, would have to be released along with a schema processor in the collector.
So you can build your pipeline tools to expect data to be in a particular format.
And if it's not in that precise format, if it's in a different format,
then the schema processor just gets run to convert it to the format you want
So you're not breaking your dashboards just because you updated your instrumentation to a new version.
That's interesting. I'm sure versioning must have been a pain to design and roll out. It seems like a tricky problem, and schema transformations and stuff. I've had to do a similar data modeling problem at work, and we were just like, for now, let's skip versioning, because there are so many implications that you have to think about, like what if there are two conflicting versions. But when you're building a standard and there are so many different systems that need to interact together, I can see why you'd have to go all the way to build this.
Yeah.
I would say a lot of what we're doing,
I think really are best practices, maybe not so much for application code, but if someone is creating a shared library, something that is going to run in many different applications in many different environments, especially if it's a cross-cutting concern like telemetry, it's worth it to care about things like backwards compatibility and upgrade
paths and transitive dependency conflicts where, you know, the dependencies that my thing depends
on may conflict with the dependencies that other libraries depend on. If you think about those things up front,
if right at the beginning when you're designing your stuff,
you have a much better chance of coming up with a system
that you can adhere to as time goes on
to maintain those qualities
and tests you can do
to ensure you're maintaining those qualities.
It's much harder, in my opinion and experience,
it's much harder to add those qualities
to a system later
where you didn't think about it at the beginning and bake it into the
design and architecture of the system.
Not impossible, but it is worthwhile to think through the different ways you're going to want to mutate and update and improve the library that you're offering, and just kind of figure out what is a good way to lay out those packages, where the right places are to introduce loose coupling, things of that nature, to ensure that you're going to be able to say, once this particular piece is stable,
that it will remain stable forever.
And that might limit the kinds of backwards compatible changes you can make there.
But if you also have a way to then introduce new experimental components
in such a way that they aren't destabilizing the stable components, which in many languages comes down to how you lay out your packages, for example. If you come up with a plan, it does require some work up front to figure that out, but if you do that work up front, then implementing it becomes smooth. So that would be a best practice. If I were to talk more in the future about OpenTelemetry as an open source project, and maybe some of the practices we do that other open source projects that are kicking off would benefit from, I would say looking at how we handle versioning and backwards compatibility is a place where I'm really proud of the project.
And as you mentioned, these things take time, right?
You go through these problems once and then new requirements come in. It takes time for the industry to be like, yes, this is important. We need to make sure we have this. We've had this
problem before. And one design iteration goes on. Maybe one project doesn't get it quite right,
but the next project has all those
learnings, and then the industry is like, yes, we can converge to this new standard or this new project because it ticks most of the boxes we need.
Yeah. And so we have a development process that involves RFCs. We call them OTEPs, OpenTelemetry Enhancement Proposals, but they're very similar to, say, the RFC process from the IETF. We tend to require that OTEPs come with prototypes. So, here is a change that I'm proposing to make.
Here is an implementation of that change
in two or three different languages
if it's a client-level change.
Really trying to get a lot of that design work done up front. Basically, we don't want to have surprises show up after something is added to the spec. If you care about backwards compatibility in a strict sense, then things are very sticky once they've gone into the spec. It's hard to pull them out.
If they're going into an experimental part of the spec, then obviously later we can say, like, whoops, we're making a breaking change to this part.
But even there, we actually do our best to try to avoid thrash, if for no other reason than that it just dumps extra work on the different language maintainers. We're conscious
about the fact that if we make a client change to OpenTelemetry, if we change an API or add an API or change how the client implementations work, then that's work that's going to then get repeated across like 11 different languages.
So it's expensive to be like, build it this way.
No, no, second thought, build it this other way.
So for all those reasons, we've kind of developed a longer specification process
that kind of involves doing more design and review work
upfront than a lot of people are used to.
I think many people, including myself, are more used to
an approach that come up with a good idea or what maybe sounds like a good idea,
write some code that seems like probably it implements that idea, throw it over the wall
and run it in production and see what happens. And there's definitely some advantages to doing that.
Not every piece of code that gets written has the requirements that something like OpenTelemetry has.
But I do think for projects that are some other equivalent of OpenTelemetry, like this is a big shared library that's going to get embedded in lots of different important applications, or this is a platform that everything is running on, or yada yada.
Something that's code that's really going to get exercised in a lot of different environments, that lots of people are going to care about, I think it's worthwhile for projects like that to come up with a more structured approach to how they think about change.
Yeah, as you mentioned, it depends on the life cycle of the project, who's using it, how many people there are. I can't even imagine a security vulnerability in something like OpenTelemetry, or a remote code execution thing. I don't know if you're familiar with the Log4j stuff that happened a few months ago. You have to be careful, because there is a lot of impact from this, especially when you're a library that other applications depend on, right?
Yeah. These are the supply chain issues
that are inherent in open source development.
Not just open source development,
but any form of development that involves leveraging code
that you did not write, that is not of your provenance.
It's really a conundrum. I honestly don't have a great answer for it, because through the whole history of software development, this has been one of the big lauded examples, right? We don't all need to recreate everything from scratch.
We can build libraries that do something useful, and then we can all depend on those libraries.
And the fact that we can reuse all of this code is this huge advantage that most people think of when they think of software development, right? Code reuse and leveraging other people's code is a feature, not a bug. But it's definitely at odds with a concept of strict security, right? So there is a fundamental mismatch there that is really unfortunate.
And it's interesting to see how long we were able to fly without it becoming truly a widespread problem. It's a thing that's always been a bit of a problem,
but maybe was more restricted in the past
to things like state-sponsored actors
targeting other states,
very sensitive stuff,
and then those sensitive things adhering to different stricter software patterns,
presumably, in order to counter that.
But now it's kind of all mushed together,
where everything is intermediated by network computers.
Everything's a computer program.
Code goes everywhere.
And all of that code everywhere leans very heavily
on a lot of publicly available open source code.
And we tried to solve some of that for OpenTelemetry, like scanning our dependencies and trying to make sure that we aren't hanging around as a vector for a supply chain attack. And we try to think hard about what dependencies we are taking on and where.
But the other thing is ensuring that it's possible
for our end users to stay up to date.
That's actually another form of, I don't know if I would call it forwards compatibility, but it's the same idea. A way I often get stuck with things like web frameworks I've used in the past is that in order to get some security patch or something, I need to upgrade to a new version.
But that new version changed things like the plugin interfaces for some plugins I use.
And the plugins I use haven't been updated to use those new interfaces.
So now I'm in a jam, right?
Where there's something that I really want, maybe related to security, that I can get by rolling forwards to the latest version.
But in order to get there, now suddenly I'm faced with potentially doing a lot of work,
right? I either have to abandon these plugins, or these plugins effectively become my code: I have to go in there and somehow make the upgrade myself to make them work.
And it might not necessarily be plugin interfaces, it might be any interface that something presents.
If it creates a breaking change, that means I'm now going to have to do all of that work
before I can get the security patch. And as a side effect of that, people then start to camp on old versions
of software. And then they start to rightly demand that the people who develop that software
maintain those older versions and backport security patches and other things. And the maintenance cost of all of that goes up over time. But if you work really hard to ensure that that's not happening,
that you're avoiding those kinds of situations that would make it more difficult for your end users to update to the latest version of your client than just bumping the dependency version in their manifest, or that would keep them from feeling like they can reliably pin their dependency version in their manifest to something that keeps them up to date with everything short of a major version bump, and then you never make a major version bump, then you're creating a world where your end users aren't lagging behind or being scared of updating. One, hopefully they are staying more up to date, and they're avoiding a situation where they're hanging around without applying security patches. But it also means that when they do need to get up to date, they aren't hitting some wall and being stuck on an old version, and then raising the maintenance burden on the OpenTelemetry project, because now we're like, oh well, we have a responsibility to maintain all these different ancient versions of this thing, because we did something that made it genuinely difficult for those users to receive security patches and performance boosts.
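To make the "pin to a compatible range, never ship a breaking release" idea concrete, here is a minimal sketch using Python's third-party `packaging` library to evaluate a version range; the package range and version numbers are illustrative, not OpenTelemetry's actual release history.

```python
# Illustrative only: a "stay current short of a major version bump" pin.
# If the library never ships a breaking 2.0, every security patch and minor
# release arrives automatically with a routine dependency update.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

compatible = SpecifierSet(">=1.12,<2.0")  # the range a user pins in their manifest

print(Version("1.12.0") in compatible)  # True: the version originally adopted
print(Version("1.19.4") in compatible)  # True: later minors and patches, including security fixes
print(Version("2.0.0") in compatible)   # False: a breaking major release is excluded
```

Under that kind of range, staying patched is just a routine dependency refresh rather than a migration project, which is the situation Ted is describing.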
Yeah, this is just such a tricky problem that has been coming up more and more, especially with the NPM ecosystem. But I've thought about it a little bit, and it's not just related to NPM. That might be one of the more egregiously bad ones, where nobody pins dependencies, but it is a problem across every programming language, in my opinion. Maybe in languages like Go people tend to not use dependencies as much, and the standard library is really strong, similar to Python, so you're a little safer there. But I really think the future is some kind of capability-based dependencies, where certain dependencies should not be allowed to do certain things. Like, left-pad should not be allowed to contact the Internet. Specifying those kinds of capabilities for your modules may be an answer.
I really believe that is one part of it. And the other part is more and more security tools that we just use on a day-to-day basis that scan for stuff, scan for anomalies. I think that'll become more and more commonplace. But instead of talking about this, I have a question for you, which I think is a good wrap-up: what are you excited about with OpenTelemetry next?
Like what is the one big thing that you're working on
or that you see coming up that makes you think,
wow, this is great,
and this is going to be a really good addition to the project?
Well, it sounds a little boring,
but stability is the main thing I'm really interested in.
We're still putting the final touches on logs and metrics, and there's, I think, this delayed process of those things becoming stable in OpenTelemetry, the instrumentation we provide around them becoming robust, and then all these different backends and analysis tools starting to provide features that actually leverage the structured data that OpenTelemetry provides. One quick example: we didn't talk about metrics exemplars, but one way metrics are tied to traces, besides dynamically generating metrics later, is that OpenTelemetry's data structure allows you to record a sampling of, say, trace and span IDs that are associated with a particular metric.
So in other words, you have a range of metric values that you might be emitting, and you have these high values that represent something problematic. Let's say you have an alert threshold on some metric, and then the alert goes off. Your next question is going to be: well, what are the transactions that are generating these problematic values in this metric, right? That would be the next step in your investigation. And in OpenTelemetry, there's actually been a sampling recorded of the different transactions that were emitting those values associated with that metric. Which means you can build, one, workflows that allow you to just click directly through from that metrics dashboard into your logs and traces that were associated with those metric events. But it also means that machine analysis, like machine learning and other kinds of statistical and automated tools, has that rich graph of data to perform its analysis on.
So they're not using really crude heuristics to try to figure out what was related to what in your system.
Like, well, these things over here happened around the same time as these things over there.
You actually have a real graph of data
that's connecting all of this together.
And that means your machine analysis
can become much more efficient and more accurate
when it comes to finding correlations, for example,
between different trends going on in your data.
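As a rough illustration of the exemplar idea described above, here is a sketch of the shape of that data. This is not OpenTelemetry's actual protobuf schema, just a simplified stand-in showing how sampled trace and span IDs can ride along with an aggregated metric point; the field names and values are made up.

```python
# A simplified stand-in for an exemplar-carrying metric point (not the real
# OTLP schema): the aggregated metric keeps a small sample of individual
# measurements, each tagged with the trace/span that produced it.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Exemplar:
    value: float    # one sampled measurement, e.g. a slow request's latency in ms
    trace_id: str   # the trace that produced that measurement
    span_id: str    # the span within that trace


@dataclass
class HistogramPoint:
    count: int
    sum: float
    exemplars: List[Exemplar] = field(default_factory=list)


# A latency histogram point whose worst sampled value links straight to a trace,
# which is what lets a dashboard or an analysis tool jump from metric to trace.
point = HistogramPoint(
    count=42,
    sum=3180.0,
    exemplars=[
        Exemplar(
            value=950.0,
            trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
            span_id="00f067aa0ba902b7",
        )
    ],
)
```

The real OpenTelemetry metrics data model expresses this in protobuf, but the relationship is roughly the same: each exemplar is a sampled measurement plus the IDs needed to jump to the trace that produced it.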
Being able to say, wow, when you're seeing this kind of exception over here, that's highly correlated with a small subset of IPs
somewhere along the line or something along those lines.
All of this latency is actually correlated highly with this particular Kafka
node.
I should stress that it's not root cause analysis.
It's not things telling you what the root cause is, but
just identifying those correlations is a really time-consuming process when fighting fires and
triaging problems in your system. Realizing this bad behavior over here correlates with a particular
configuration option having a particular value.
Those are the kinds of things where you spend hours hunting around before you figure out that that might even be a thing. But once you have a connected graph of data that graph analysis and statistical analysis can just walk and do their thing on, the way they're designed to do,
then you can really start leveraging that stuff.
So my hope in the future,
my big hope is to start seeing
more and more of these analysis tools
pick up on these advanced features of OpenTelemetry
and start offering value to people based on them.
Yeah, this is kind of similar to LightStep's change intelligence,
but really like platformized for everyone.
Exactly.
Change intelligence is like a perfect example
of this kind of correlation analysis,
which we do a good job of today,
but still, if the data is not coming out of OpenTelemetry, then we have to fall back onto rougher heuristics. So you're still getting these correlations, but the chances that they're false connections start to go up, right? It's harder for those systems to accurately give you correlations, because past a certain point it just has to guess; it doesn't have the data being handed to it in any other way.

Yeah, I think that is exciting. And the second thing you were mentioning, around these interfaces that maybe let you even analyze the graph of data that you're getting, right? Maybe the system can be smart enough to find a correlation, but just these different human interfaces I'm imagining now, which let you write some kind of queries, or even have a little bit of code, which actually lets you analyze all of these different changes. It's actually pretty exciting
to think about. And now I'm thinking, we do all of this
production monitoring, but I'm super interested in developer
environments. Why don't we get this kind of analysis from our local machines when Git is acting up or NPM is acting too slow? We really need to have telemetry for all the things. Like, imagine tomorrow all of our build systems are integrated with OpenTelemetry and we know exactly why something is slowing down.
Yes, that's actually a place people are definitely using it: their CI/CD systems, trying to find latency problems and bottlenecks there.
Another place, if we're talking about total geek-out stuff, is something I've been excited about for a long time and have kind of been predicting.
I've done some talks about it in the past,
but haven't had a chance to implement much.
But sure enough, some implementations are starting to show up,
which is if you have this kind of structured data
that OpenTelemetry is producing,
you can start using that data as input into your tests.
In other words, starting when you are developing software,
you use tests, you use unit tests, you use integration tests,
you use these different kinds of tests to test your software.
But those tests have nothing to do with the kind of information
you're getting about your system in production.
So you could have high-quality tests,
but that doesn't tell you anything about the quality of, say,
the logs coming out of your system.
And so there's this real disconnect between the tools we're using
to query what our system is doing in production
compared to the tools we use to query and
verify what our system is doing when we're developing it. And I felt for a while there's
a lot of room to build novel forms of integration testing that are being done on top of the production telemetry coming out of
your system. And there's a lot of different advantages that can come with that. And I've
been excited to actually see, in this past year, a couple of projects get started, built on top of OpenTelemetry, to do exactly that. So I think if you Google around for trace-driven development
or trace-based testing or OpenTelemetry testing tools,
there's a couple different projects that are getting started up
around doing that kind of stuff.
And I think that's a really rich source of future potential,
because that hard split between how we test and verify our software
when we're making it versus how we test and verify our software
when we're running it, I've seen that as one of those arbitrary walls
that should get knocked down at some point.
Yeah, there's so much to expand on that.
You could have a unit test that expects at least one HTTP call to have status 200, and you check that against a fake OpenTelemetry collector locally. And then in production, you have the exact same test, but it's running continuously against that collector, and it's basically an alarm. So you can have parity between your unit tests and even your production environment, in theory. Oh my God, it sounds a little out there.
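As a sketch of the first half of that idea, the local test against a fake destination, here is roughly what it could look like using the OpenTelemetry Python SDK's in-memory span exporter as a stand-in for a fake collector; the handler, span name, and attribute values are made up for illustration.

```python
# A sketch of the "assert on telemetry in a unit test" idea, using the Python
# SDK's InMemorySpanExporter as a stand-in for a fake local collector.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = provider.get_tracer("example.tests")


def handle_request():
    # Hypothetical code under test: it records a span for an HTTP call.
    with tracer.start_as_current_span("GET /healthz") as span:
        span.set_attribute("http.status_code", 200)


def test_at_least_one_http_call_returned_200():
    handle_request()
    finished = exporter.get_finished_spans()
    # The assertion from the conversation: at least one HTTP call with status 200.
    assert any(s.attributes.get("http.status_code") == 200 for s in finished)
```

The second half of the idea is then pointing the same kind of assertion at the telemetry your real collector sees in production, continuously, which is what the conversation turns to next.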
But it could be. It sounds wild at first, but it's totally a thing. I think if you start to think about writing your integration tests against this kind of production data, it's possible to develop a testing query language, an assertion framework, that feels more like the kind of assertions you're making when you're testing, but is also written in such a way that it's scalable. In other words, it could be run against a stream of data.
And once you start having a language like that,
I think you'll turn around and start to realize that the kind of alerting we do, quote unquote alerting, is actually testing.
It's just very, very crude testing.
It's like a testing framework where you just have one assertion you can make: thing passes a threshold, with some tolerance percentage. And that's super crude compared to the kind of assertions we make in our integration and unit test environments.
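To make that contrast concrete, here is a speculative sketch, not any existing tool: the same stream of telemetry checked first by a classic threshold-style alert, then by a richer, test-like assertion. The function names and data shapes are hypothetical.

```python
# Hypothetical illustrations of the contrast being drawn: an alert is a single,
# crude assertion over a stream; a test-like assertion over the same stream can
# express much more specific expectations. Nothing here is a real framework.
from typing import Iterable, Mapping


def alert_assertion(latencies_ms: Iterable[float], threshold_ms: float, tolerance: float) -> bool:
    """The classic alert: pass unless more than `tolerance` of values cross the threshold."""
    values = list(latencies_ms)
    breaches = sum(1 for v in values if v > threshold_ms)
    return not values or breaches / len(values) <= tolerance


def checkout_assertion(spans: Iterable[Mapping]) -> bool:
    """A richer, integration-test-style assertion over the same telemetry:
    every checkout span must succeed and call the payment service exactly once."""
    return all(
        span.get("status") != "ERROR" and span.get("payment_calls") == 1
        for span in spans
        if span.get("name") == "POST /checkout"
    )
```

The point of the sketch is only that the second function reads like something out of an integration test suite, yet it could in principle be evaluated continuously against production telemetry, which is the direction being described here.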
So I think there's a lot of room there. If you come up with something like that, anyone who does that and can figure out a way to run that thing against production data, you're now handing operators a way to start... You know, there's the classic saying that when you have a problem in production, can you solve it by just writing more tests, or running your tests again? No. But actually, maybe you can. If you come up with an assertion language you can run against your production data, that's a powerful introspection tool. So I'm excited for somebody to build that.

Yeah. When you first look at developer
tools, you think, okay, we've made a lot of progress and things are so much better than before. But there's so much more to go, so many things we haven't thought about, the more you dig into the space. And engineers, as you mentioned in your original story, are always hungry for better tooling, things are never efficient enough, and engineers, at least right now, are still super expensive. So every tool that can make them more efficient...

Yes. Absolutely. And it just makes our lives easier, too. A lot of the improvements here come from time savings, but it's also just saving a lot of scut work.

Yeah.
Well, Ted, thank you so much for being a guest.
I think this was a lot of fun.
And I hope I can ask you again for like a round two at some point.
Absolutely.
Happy to come back and talk more about observability anytime.
Thank you for having me.