PurePerformance - The State of OpenTelemetry with Jaana Dogan
Episode Date: April 26, 2021

Google's Census, OpenCensus, OpenTelemetry, and AWS Distro for OpenTelemetry. Our guest Jaana Dogan, Principal Engineer at AWS, has been working in observability for many years and has definitely had a positive impact on where OpenTelemetry is today. In this episode, Jaana (@rakyll) explains which problems the industry, and especially cloud vendors, try to solve with their investment in open source standards such as OpenTelemetry. She gives an update on where OpenTelemetry is, the next upcoming milestones such as metrics and logs, and what a bright future with OpenTelemetry being widely adopted could bring.

https://twitter.com/rakyll

If you are interested in learning more, here are the links we discussed during the podcast:
https://github.com/open-telemetry
https://github.com/open-telemetry/opentelemetry-specification
https://github.com/open-telemetry/opentelemetry-proto
https://github.com/open-telemetry/opentelemetry-collector
https://github.com/open-telemetry/community
https://o11yfest.org/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always my co-host Andy Grabner.
Hey Andy!
Hey!
I guess I... What?
I was going to say, it's funny getting to be able to see you while we do this, and have you see when you make fun of me.
Probably you do it all the time, and now I'm finally getting to see it for the first time.
Brian, I would never make fun of you.
It was really the first time. I tried to make some facial expressions, seeing if I could get you out of your talking flow. But I guess, no, it was really the first time.
You did though.
You see, if you go back and listen,
you'll notice I've made a couple of mistakes, especially coming out of it.
I'm also a little flustered today, Andy.
It's, you know, performance, you know, everybody always thinks about work.
It's like, oh, you know, sometimes I like work, sometimes I don't.
It's a challenge and all this.
But sometimes, I've got to tell you, you know, having to deal with not computer problems, not deployment problems, but arguing with my 11-year-old on why she needs to brush her teeth in the morning before going to school, and her fighting back with me, because kids just want to fight back over the dumbest crap they can, and trying to hold it together and not just, like, throw her through the window, you know. And you're just like, just brush your teeth.
You know, I could have gotten into the argument that it's the conspiracy of fluoride, you know, but I didn't want to. Anyhow,
I'm glad to be here because this is an oasis as I was saying,
and it's always a great time. And I hope our listeners feel that way.
And most importantly, Andy,
I'm glad to be here with you because your smile always brings a smile to my
face.
I'm happy.
So, hey, Brian, before going on to our guest, I also have to say thank you to somebody else that is not on the show today. I wanted to say thank you to the folks at Neotys, and especially Henrik Rexed, who, yesterday and today, at the time of the recording, the 24th of March, did a 24-hour performance marathon with, I think, 20 different presenters around the globe. He got up at five o'clock in the morning and he did it 24-7. Well, 24 hours. So it was really great to have all these folks. And he allowed me to talk about Keptn. Very good that he keeps doing it. I had to bring it in.
But now to something different.
Yes.
Our guest.
I'm very happy to. It took about a year from the first tweet, I think, that I sent to Jaana about, hey, I read this blog post, and it was called "Things I Wished More Developers Knew About Databases."
And I was like, oh, this is awesome.
Because Brian and I constantly talk about application performance problems
and we always blame it on the database.
But we always try to bring up the points that if you are writing good code
that is making efficient usage of the database,
then the problems might not be in the database.
But anyway, I reached out to Jaana.
A year later, Jaana, welcome to the show.
How are you?
Hi.
Yeah, sorry for making you wait.
But to be honest, I also table flipped and left databases
since we had our initial conversation.
So I'm kind of coming back to performance observability to that field.
So I'm great.
As far as I can understand from this conversation,
I'm super happy not to have kids.
Not yet, not before this pandemic ends.
So yeah, I'm happy.
Thanks a lot for the wait.
And I'm a bit ashamed.
The way I look at it is it's like, do you ever hear of The Muppet Show? Kermit the Frog and all that, right?
When that show was on, so I was a young'un when that show initially aired.
And their first season, they were having trouble getting guests.
And then once people started seeing how good it was, that's when everyone was like, oh yeah, I'll be on the show.
And I think you were just looking to wait to see,
is this show any good?
Do they have good people on?
You're just being judicious and smart about it.
I had no questions about that.
I had no questions about it.
It was just really the matter of me being a bit busy last year
because of the pandemic and everything.
And then I switched jobs and onboarding at a large company
takes a long time.
So sorry for that.
I was trying to give you an excuse.
But Jaana, you bring up a good point. Can you remind folks of your background? Because you said databases were just a little intermezzo, and you actually did observability and performance before, and now you're back to observability?
Yeah, I can give a bit of background.
I'm working at AWS right now. I'm a principal engineer working mainly on observability and our overall instrumentation stack, roadmap, and so on.
But before this, you know, I was at Google for a long time.
My last, you know, endeavor at Google was databases,
and that's how we kind of initiated our initial
conversations. But
before databases, I was
working on the instrumentation team.
Maybe some of you have heard about this project called OpenCensus, which was kind of inspired by the instrumentation library Google uses, called Census, that's linked into every production binary. So we decided to build a vendor-agnostic instrumentation library that could work for multiple backends. Some of the concepts were inspired by Census, but it was really a project from scratch.
That project became OpenTelemetry
after its merge with OpenTracing.
And before, you know, all of that,
I actually ended up coming back to this field after many years
because I was working on the Go programming language.
And the last two years I was on the team,
I was mainly focused on like performance, diagnostic tools.
You know, there's a bunch of different things in Go.
And the community is very, very, I think, motivated to, you know,
use these tools and understand like there's a, I think,
a good culture in the language community.
At some point I realized, oh, some of the areas that I want to work on, like, for example, distributed tracing, you just cannot scope to a programming language. It's a larger effort, and you have to have consensus from all the parties in order to make anything happen. That's how I kind of pivoted myself more into this field. But, yeah, going backwards, I just kind of pivoted back and forth into observability and ended up being here.
So it's interesting.
Great. I mean, talking about Go,
I just spent four hours today trying to,
well, let's say that I have a love-hate relationship with Go.
In the end, love always prevails.
And so literally five minutes before we started the recording,
I finally got my code to work
and I just deployed it on my cluster.
It's a little Go service, but it is hilarious.
I feel that love and hate, right? Especially, I think, in the beginning, when I became a user in the earlier days, it was much more difficult, because I was always comparing it to my experience with other languages.
I came from a JVM background, so in terms of diagnostic tools, Go was not in good shape at the time. And it's still not in good shape; there's a lot of work going on.
But I kind of stayed for the simplicity of the language, and I can just get things done with it, so that kind of works for me. But, you know, I'm trying not to be religious. I think I'm too biased, because I've been very involved in Go. But I took a break. It's been four years that I haven't been working on Go as a full-time thing. So it's just been very healthy.
Now, to your current role. You said you switched from Google to AWS, and there you are responsible for observability and OpenTelemetry. I would like to get maybe an overview: what's the current state of OpenTelemetry at AWS?
Yeah, so actually, like I said, I generally work on instrumentation, and OpenTelemetry is a piece of it. A big piece of it, actually, but there's a lot more going on. I can give you maybe a brief summary of what we're trying to do with OpenTelemetry.
So, in the last five to ten years, we had all these different vendors building their different solutions, and they all had different instrumentation and telemetry-generation client libraries. One of the difficulties was, you go to a customer who doesn't want to know too much about instrumentation or doesn't want to understand the telemetry points, and they don't want to re-instrument everything. So this was an initial problem. And then there are all these open source projects where you want to put in some out-of-the-box instrumentation, but there's no vendor-agnostic way of doing it, so they end up doing nothing or inventing their own proprietary formats. This was a huge issue because, most of the time,
people just want to get things out of the box.
So that's how OpenTelemetry came around. And at AWS, given the scale and the diversity of our customers, there's so much, right? Our customers usually want to use three to four, or sometimes four to five, different solutions. They want to see their data on multiple products, and not just on AWS products but on other products as well, right? And us being able to collect, produce, maybe analyze, and also stream the telemetry data in a vendor-agnostic format is very important for us, and important for all the other partners and companies that we're working with in the APM and observability space. So AWS has been, I think, very smart in terms of saying, hey, maybe we should try to align with what OpenTelemetry is doing, because that's going to be the format we can speak and everybody can understand, right?
So the goal right now is to make OpenTelemetry collection much easier on all our compute platforms: EC2, EKS, ECS, Lambda. If you have your own instrumentation, you should be able to export it to a sink, like a collector available in the platform, so you don't have to deal with running the collector yourself.
So that's one of the goals that we're trying to achieve.
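[A minimal sketch of what that export wiring can look like with the OpenTelemetry Go SDK. The endpoint address and the rest of the setup are illustrative assumptions, not AWS's documented configuration.]

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export OTLP traces to a collector; "localhost:4317" is a placeholder
	// for wherever the platform-provided (or self-run) collector listens.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("creating OTLP exporter: %v", err)
	}

	// Batch spans in-process and hand them to the exporter.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// ... application code creating spans via otel.Tracer(...) ...
}
```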
The next thing is, we also have a lot of managed services that produce telemetry data. Now, historically, this type of data has been very vendor-specific: we would use CloudWatch for metrics, X-Ray for traces. Now we're looking at cases where it can be produced vendor-agnostically, so we can push that data as well to everyone else, or to whatever tool the customer wants to use. So OpenTelemetry may end up being this common telemetry format for us.
And I think I came to a realization a couple of years ago that being a cloud provider is really like being a telemetry provider in some sense. So if we're a telemetry provider, this is the format we want to use, and we want to communicate not just internally to ourselves but externally as well. And I can give a bit more of what else is going on.
So I think OpenTelemetry is very limited right now to metrics and traces, and logs is coming up, maybe next year. It's in the early stages. But we care about other things. Database performance, for example, is one of them: can we propagate OpenTelemetry labels all the way to the databases? Because then they can do accounting based on labels, and users can go and break things down.
There's a lot of effort going on for eBPF.
How are we going to be enabling eBPF?
Can we generate aggregated data and output it to OpenTelemetry?
So that's the other piece of thing that I'm working on.
So there's a lot going on, to be honest.
Yeah, I got a couple of questions here.
So it feels to me, and at least I assume, that the first focus of all of this is to enable your users that deploy their applications or services on AWS infrastructure, making it very easy for them to expose their own telemetry data, or maybe to get some telemetry data off the runtimes they already run on.
And actually, that's a question now for me.
So if I would, let's say, use a Lambda function
and I want my telemetry, you make it easy for me
so that my code, my Lambda function,
will expose telemetry data,
make it easy to figure out if something is wrong in my code.
Now, are you also planning on exposing your own telemetry data to the end user within
your own runtime code and within your service code?
Because eventually, maybe it's not my code, but because I use your services wrong, or
maybe there's a problem with your services.
Is this also something that you are thinking of exposing to the end user,
or is this just something you would use internally?
Yes, I briefly mentioned that we have so many managed services and we want to expose their telemetry data, but I didn't go into detail. So thanks for asking.
So, you know, most of the time, actually, most of the customers can do their own instrumentation. But what is valuable to them is being able to see into all these black boxes, right? The Lambda runtime, for example, does a lot of interesting things, and that's not visible; the user cannot do anything by just instrumenting their Lambda function, because there's a lot else going on outside of that function block. By using OpenTelemetry, we want to be able to produce our telemetry data in the same format as well. There's already been some work going on. For example, some of the databases and some of the other managed services, like S3, are trying to expose more internal traces and stuff. But it's been in our proprietary formats, so you still have to go and find that data in X-Ray. We're thinking about what would happen if we could produce all that data in one format and give it to you as a streaming service, for example.
So you can see everything end-to-end, and you don't have to necessarily start by instrumenting yourself. You should be able to see the big picture, maybe end-to-end, from traces coming from us automatically. And then, if you want to participate in that, you might be able to, using either OpenTelemetry or something compatible with its propagation and data format. That's sort of the goal: we actually don't want people to have to start themselves. We just want them to rely on the data coming from us, plus maybe some of the framework integrations and other stuff that we can still provide. So Lambda is an interesting example, because the Lambda runtime framework is also us, right? We can automatically create a trace span for you, for example, for every Lambda function. And if you still want to participate in that, you are free to bring your instrumentation library, OpenTelemetry or an OpenTelemetry-compatible library, and add your custom things. But I think the overall vision is that we should provide you as much as possible out of the box, and if you want to do something on top of that, you should be free to do it.
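[As a hedged sketch, "adding your custom things" on top of an auto-created span can look like this with the OpenTelemetry Go API. The handler, tracer name, span name, and attribute are all illustrative, not a documented AWS pattern.]

```go
package example

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// handleOrder is a hypothetical handler; ctx would already carry the span
// created by the platform or framework auto-instrumentation.
func handleOrder(ctx context.Context, orderID string) error {
	tracer := otel.Tracer("example.com/orders") // illustrative instrumentation name

	// Start a child span under whatever span is already in ctx.
	ctx, span := tracer.Start(ctx, "process-order")
	defer span.End()

	// Attach custom attributes for later querying in the backend.
	span.SetAttributes(attribute.String("order.id", orderID))

	// ... business logic using ctx so downstream calls join the trace ...
	return nil
}
```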
And this brings another question, because you touched on one of the things that I wanted to ask you today: how can we get developers to actually leverage OpenTelemetry? I was wondering, are we moving towards a world where we ask developers to manually put in their traces and their spans? Do they have to think about where to call the OpenTelemetry library? Or is it more the other way, which I think it is? I just watched your video on the AWS Distro; it's a seven-minute intro video on how to get started instrumenting a Java app that is running in a container on EC2, and I think the video at least shows auto-instrumentation. So my question really is: is the idea that it's you as AWS, or the OpenTelemetry project, providing not only the common protocol, but also automated instrumentation for hopefully a large number of runtimes, so that we make that easier, right? And then, on top of that auto-instrumentation, you can still add two or three additional metric points that you may need.
Yeah, auto-instrumentation has always been the goal since the OpenCensus days. That's why we came up with this vendor-agnostic thing. Everybody was tired of providing all the instrumentation integrations, so we thought, hey, if there's a vendor-agnostic thing, we can actually have everyone doing the integrations, along with out-of-the-box instrumentation for common frameworks. I can give a concrete example, I think that's much easier: gRPC could provide traces automatically, without you doing anything, if they used OpenTelemetry today to instrument the framework. But they've never been able to do that, because there hasn't been an industry standard where you would use it and there's a way to get data out of it in a well-known format.
The idea was that when OpenTelemetry becomes successful,
we're first going to start with providing some of these integrations ourselves.
But eventually, all these projects will take OpenTelemetry, import it,
and provide instrumentation in the framework without us doing any work.
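[A hedged illustration of that end state, using the published Go contrib instrumentation for gRPC. The package path is real; the service registration is left as a placeholder, and older versions of the package exposed interceptors instead of a stats handler.]

```go
package example

import (
	"net"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
)

func serve(lis net.Listener) error {
	// One option at construction time is enough for every RPC handled by
	// this server to get a span from the globally registered tracer.
	srv := grpc.NewServer(
		grpc.StatsHandler(otelgrpc.NewServerHandler()),
	)

	// ... register your services here (placeholder) ...

	return srv.Serve(lis)
}
```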
And the same thing applies, and maybe this is easier for libraries, but the bigger problem was the binaries. For example, databases: how are they going to utilize some of those concepts? Similarly, we now have this very well-understood, exposed data format. So they can, again, import the OpenTelemetry libraries, produce that data, and then write it to either an OpenTelemetry collector or somewhere else. And then the user can understand it and push it to whatever backend.
The idea was always like,
we should provide out of the box automatic instrumentation as much as possible.
Nobody wants to do any instrumentation work.
Yeah.
And I think this will also make life easier from vendors like us, right?
At Dynatrace, we've been building instrumentation agents for the last 15 years.
And the challenge is always, I think we've become good at it, but still, there are so many new technologies, new versions, new frameworks, and you always have to adapt your instrumentation. And if we can all agree on a standard, then obviously the value of such an agent that a commercial vendor like us provides goes down, because eventually, over the next couple of years, it becomes kind of obsolete as OpenTelemetry hopefully takes over.
But it makes life for the end users much easier because you don't have to think about, hey,
I don't get the right telemetry data or traces.
But now you do, because you hopefully can be sure that this library, this software that you're using, has already been properly instrumented, either by the vendor itself or because these open source agents have become so good at auto-instrumentation that you really get exactly what you need in order to do some troubleshooting, some performance profiling, and so on.
Yeah, exactly.
And I was just going to say, Andy, that brings a smile to my face, in a way, kind of a snarky smile. I just think of the idea of auto-instrumentation; I think that's a wonderful goal. But what you all then have to deal with is what we have to deal with on a regular basis: new versions, users using the frameworks or the code in ways that you hadn't planned for, these things breaking and then turning into support cases. The auto-instrumentation side of it is where it gets really sticky. The smile, I'd say, is a little bit more of an evil smile, because it's kind of like, great, now other people will see what goes into the auto-instrumentation side.
Because, yeah, I mean, manual instrumenting would be a real pain,
so it would be awesome if you all can take care of that.
It's just always that staying on top of it and finding, you know,
the funny thing that you always see is that you have a framework,
you have best practices, and then you put it in the hands of developers
and everything goes out the window. And you start looking at what they're doing. You're like, what are you doing?
Well, I can do that. And I want to be like, okay. So it gets fun. It gets fun when you get on that
side of it.
Yeah, I'm actually very pessimistic as well. I'm trying to be more optimistic, but there's also a lot of legacy. We have this super nice vision that maybe eventually we'll have a stable data format that everybody may be speaking. But the data format itself is complex; there are already 25 different data formats. The instrumentation libraries and their stability are the other thing; there's a lot of complexity there. I gave the example of why gRPC is not importing OpenTelemetry: it's difficult to rely on a dependency like that and make sure that your versions are compatible with the upcoming versions of OpenTelemetry, and so on. So all of these are very hard problems at the end of the day.
I've seen, though, a couple of projects that became very successful. One example is pprof. And I was very opinionated that we should have a very stable export data format, regardless of what the instrumentation libraries will be like. As soon as we have a good, documented spec and people know what type of data they should be exposing in the OpenTelemetry protocol, this project can succeed. That's because of what I learned from pprof.
pprof, the profiling project from Google, is actually a data format and a couple of tools, and pretty much everybody goes and writes the instrumentation piece themselves. For example, the Go runtime exposes pprof: it does all of its own work and then produces the pprof export format. pprof is everywhere. It's maybe not globally accepted as the one true profiling format or anything, but it's a widely adopted thing, used very close to even the language runtimes, because it has this very stable data format, and they don't necessarily care which instrumentation library you use. You're free to do whatever, as long as you produce that format.
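[The Go runtime example is easy to see first-hand; a minimal sketch, with an arbitrary listen address.]

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// The runtime does the instrumentation itself and serves the results
	// in the stable pprof format over HTTP.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

[Any pprof-speaking tool can then consume it, e.g. `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` for a 30-second CPU profile.]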
The other example, I think, is Prometheus. The Prometheus exposition format is just everywhere right now. The entire world relies on that, even though there's this new upcoming thing, OpenMetrics. They've been able to achieve it by having the exposition format and keeping it very stable, making no compromises about the stability.
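[The stability she's describing is in the plain-text exposition format itself; a hedged Go sketch, with an illustrative metric name and port.]

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var requests = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "http_requests_total", // illustrative metric name
	Help: "Total number of HTTP requests handled.",
})

func main() {
	prometheus.MustRegister(requests)
	requests.Inc()

	// A scrape of /metrics returns stable, human-readable lines like:
	//   # HELP http_requests_total Total number of HTTP requests handled.
	//   # TYPE http_requests_total counter
	//   http_requests_total 1
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```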
Yeah, and I think you're right. We have to keep optimistic about this, right? Because if we take that pessimistic approach, we'll never get started.
So it's got to start somewhere and it's got to get built out.
And I think the cloud vendors...
It's good to be skeptical about this, I think, because there are also a lot of projects that fail.
Right, but if you go too skeptical,
it'll never take off, right?
You got to start building it somewhere.
And I think the cloud vendors taking some of this approach
have a little bit of an advantage
because it's a lot easier for the cloud vendor to say,
let's say if you're going to use Lambda, for example,
you're going to use Lambda.
Here's how you use Lambda.
Here's how you'll get the metrics from it
if you use it this way.
If you don't use it this way, you're not going to get it.
Whereas when people come to vendors like us,
they're expecting they're going to pay for us
to get this data from whatever they have
because we're a vendor doing it
as opposed to being that cloud provider
which says this is the scope in which it works.
We're expected to work within any scope.
So you have a little bit of an advantage in that case
because you can more easily set the guardrails
of the conditions for leveraging it.
So you definitely have an advantage there.
Yeah, yeah, yeah.
One of the, I think, interesting aspects of OpenTelemetry is that it's a forum between vendors like you and cloud providers at this point.
and cloud providers at this point.
So I just felt like it was very difficult
to kind of have some of these conversations
because I think as a cloud provider,
I was always having these conversations one by one,
but there hasn't been a single place
where people would come and agree
and sort of like, hey, you are a vendor.
What do you expect from Lambda?
There was not an open forum for that.
It maybe works against a vendor that wants to do differentiated products. But at the same time, it helps in terms of at least having some consensus. So at least we know what we should cover as a baseline, and OpenTelemetry is providing that forum, which is, I think, unique in the space, at least.
Now, Jaana, collecting data or exposing traces is one thing.
The question is, where do we send the data to?
And then also what happens with the data there?
I mean, we have been and other vendors in our space as well,
we've been not only talking about auto-instrumentation,
but then automatically detecting problems,
automatically detecting anomalies.
And it goes beyond what you typically see.
And again, I'm referring back to the video that you have on the distro page, which is great, but it shows: I'm a single user and I'm a developer.
I'm clicking on a link and then I want to see that trace.
And then I turn the database off and then I see the database has a problem.
But obviously, at large scale, if you want to use this for production use cases, then collecting data is one thing.
But then where do I send it to? And the analysis on top.
Now, here's my question to you: is this something that AWS is also moving towards? Are you also moving towards the data backend and also the data analytics on top of it?
Yeah, so we have a bunch of different initiatives in this area. None of them are very well understood yet, as far as I can tell, but I can try to give a brief summary.
So maybe we can talk about distributed tracing; we have similar initiatives for metrics as well, though. In distributed tracing, our backend has been X-Ray, and X-Ray has just been a tool where you can query, visualize, and all that. On top of that, there are a lot of different initiatives around anomaly detection at AWS. This is a hard topic, so it's not like there's an easy solution, but there are a lot of teams trying to figure out seasonal differences and things like that.
One of the difficulties, and I specifically want to talk about distributed tracing here: distributed traces are usually downsampled, so you don't always have full data. You may try to have it, but there's a performance cost to that. And we realized that our customers, most of the time, 90% of the time, end up collecting very boring traces, but missing all these edge cases, like the 99th and 95th percentiles and above.
So one thing that we want to do is also make collection a bit smarter. Let's try to collect as much data as we can and analyze it in the cluster, before it's exported to any backend, to see if it's an interesting case, a 95th-percentile-and-above type of case. So this is one of the initiatives we're trying to do.
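[One existing open source mechanism in this spirit is the tail-based sampling processor in the OpenTelemetry Collector's contrib distribution, which buffers traces briefly and keeps only the interesting ones. The thresholds and policy names below are illustrative, and this is a sketch of the contrib processor, not of AWS's internal pipeline.]

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans until the whole trace is seen
    policies:
      - name: keep-slow-traces    # illustrative policy name
        type: latency
        latency:
          threshold_ms: 500       # keep traces slower than 500 ms
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
```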
The other one is the anomaly detection that I described. And there are some other initiatives, such as: we simultaneously want to be able to look at multiple signals, not just traces but also metrics. And this is one of the other reasons we are interested in OpenTelemetry, because we want to be able to correlate things. So we want to have more consistent labeling across all the telemetry data we collect.
And if there are any interesting cases that we capture in metrics, for example,
we want to be able to enable maybe more tracing collection or tweak our sampling strategy and stuff like that.
These are all initiatives in flight.
But as you mentioned, the difficulty is actually making sense of this data. It's so much data, and most of the time it's not very useful, because it's just repetitive; you're collecting similar data again and again. So what's valuable is being able to tell people: these are the interesting cases, and I have, like, five categories of interesting things for this particular critical path, so you can allow them to see what else is going on in their critical path. If you can alert on them, via anomaly detection, that would be the next step. But anomaly detection, in my experience, has been a very, very difficult problem.
Well, that gives me hope, Brian,
that what we are doing at Dynatrace
still has a long way to go.
We have a long way to go, too.
I'm just making it sound like
we actually have been...
I mean, this is a hard topic
for the entire industry.
Andy, I was going to say...
Sorry, go on, Jaana. You go ahead.
I was just going to say this: you said it gives you hope, Andy.
It gives me fear that once we have the big arm
of the cloud vendors going beyond collecting the traces
and processing them, what will become of vendors like us?
Oh, yeah, but the reason why I said I have hope
is because we have been focusing on this
for a long, long time.
And yes, obviously, you know,
cloud vendors and others will catch up
or will provide services.
But I think in the end,
it's collaborative what we're doing here.
I feel like cloud vendors don't necessarily want to do all the work that you want to do. There are so many things that we could do, but at the end of the day, we care about our core platform, right? Maybe that's why we're more focused on collecting at this point, because we want to be able to collect and provide that data, and enable other companies to do the work with the data.
Exactly. I see it the same way, right? I mean, you can focus on what you're strong in and why people come to you.
And people may not come to you because you have the best end user monitoring and anomaly detection, but you can give the data.
But essentially, you help your customers to run the services and platforms reliably on your platform.
Then you also have the data that they can then use with other tools.
That's why I think we are super focused on core reliability.
We want to have the core metrics, traces, being able to alert on those things.
These are very fundamental for us because you as a customer should be able to come to
our platform and be able to do things reliably as a baseline.
If you're looking for very advanced cases, you might also be in charge of analyzing your data yourself. That's the other case that we never discussed. One of the interesting things about why we want to speak OpenTelemetry is: hey, maybe we will also be able to export to S3, and you will be able to read your telemetry data in raw format and do whatever you want to do with it, right? So that's sort of the other goal.
Hey, I brought up the term AWS Distro a couple of times. Can you maybe quickly give an overview of what that is all about? Because I think we never mentioned what it really is.
A lot of people have been asking me questions about what the AWS Distro is. I mentioned a couple of times that we are thinking about deploying the collector to our platforms.
The AWS Distro came as a necessity because of that.
There are a couple of reasons that the Distro came around. Distro is an open source project.
I've seen other providers doing similar things,
but in a closed source fashion.
So let me briefly explain, I think, some of the technical challenges first,
and then I can also explain why we are doing the distro.
So first of all, the collector in OpenTelemetry is written in Go. And the collector already supports a bunch of things: the proper upstream collector repo has the open source projects represented. You have Jaeger support, and you have Zipkin and Prometheus support there. But everything that is related to vendors lives in a different contrib repo, and in Go, you have to build a static binary. You can't really say, hey, I just want this particular vendor's component to be dynamically linked. So it became a necessity for us to actually have a binary, like a main Go function.
So we can pull in all the important bits that we care about and we want to support.
And that's how the distro was born.
And then there were a couple of other things. AWS really cares about reliability, performance regressions, especially if we're going to deploy the collector on behalf of the user, and also security reviews. So everything that we do in the distro goes through this release process. We take the upstream; when we write and contribute code, we put it in the upstream. And then, once in a while, when we're making a distro, everything goes through these reviews: performance regressions, security reviews, reliability issues, whether anything is backwards-incompatible or not. And then we cut the release based on that.
We're trying to follow the upstream very closely; our cadence is very similar to upstream's. It's just that there's that process, and that's the distro. The other thing is, this enables us to also link the partners' exporters or other components into the distro, so you don't have to go and rely on the contrib repo, which is not always well tested, or not going through the same process. So it kind of helps the customers to at least be able to rely on the collector: they know that it's going to work.
That's where the distro came from; it came from these technical challenges. And I'm glad that it's an open source project. I mean, it doesn't do much; it's just upstream, distributed in a different way. But it basically comes from the fact that it's difficult to vet all these components, because this is a very huge project. And the other thing is, I think people at least know that we provide support. People at least know that if there's an issue with any of the components there, we will be able to go in and fix it upstream if necessary. We can't make this promise for the entire project; it's huge, and there are hundreds of different components to be reviewed, by tens of companies. So that's how the distro came around. It's trying to be just OpenTelemetry, but distributed after going through some of the reviews that I mentioned.
If I'm a developer and I want to get started today writing my app and deploying it on AWS, let's pick a combination of Lambda, Fargate, and maybe some database service in the back: can all of this be handled today with OpenTelemetry? Do I get my end-to-end traces and my metrics from Lambda making calls to my microservices in Fargate and then to the database?
So, no, not yet, because we're still working on the end-to-end use cases.
If you just want to do your user traces at this point, everything works. If you want to make calls to AWS services, you still have to use AWS's trace context, for which we also provide OpenTelemetry support: in OpenTelemetry, we provide propagators. If you do that connection, you will be able to see end-to-end traces on X-Ray. But there are still things that we want to improve in terms of providing more detailed traces from databases. So if anyone is using this and wants to give us feedback, we are at the stage of prioritizing what else we want to expose from the managed services. And the next thing that we want to do is: right now, we only understand our own context propagation header, but we want to switch to W3C Trace Context, so you don't have to do anything in terms of caring about the propagation headers or anything. Just give us whatever OpenTelemetry produces, and at some point we will be able to understand that. That's a very complicated project; I'm working on it this year and probably next year, and it may take longer than you think.
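[A hedged sketch of "doing that connection" with the published Go propagator packages; an illustration, not AWS's prescribed setup.]

```go
package example

import (
	"go.opentelemetry.io/contrib/propagators/aws/xray"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func setupPropagation() {
	// Inject and extract both the W3C traceparent header and AWS's
	// X-Amzn-Trace-Id header, so traces also line up on X-Ray.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, // W3C Trace Context
		xray.Propagator{},          // AWS X-Ray header format
	))
}
```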
Yeah.
So I would assume the API gateway is a perfect service, right?
It's going to, yeah.
It's going to be parsed.
It's going to be the starting point.
At ingestion, at the entry boundaries, we're thinking about being able to accept trace context and convert it into the internal format for the moment, because it takes all these teams a very long time to switch to different trace context headers. So maybe initially, at the API Gateway or any other entry point, we will be able to translate it to our internal format while it's internal. And if you are exposing it to the user, we want to translate it back to the trace context header. So that's sort of the goal initially. But eventually, we want to use it internally everywhere as well. It won't be visible to the user, but it's a technical detail for us that we also want to be aligned with Trace Context.
Is there anything else we've missed in this conversation? Remember, we have people listening that are new to OpenTelemetry, and people that want to know the status of where it is and where it's going. Any other things we should mention, or maybe some links where people can get started, maybe even contributing or checking the status? What's a good way?
Yeah, maybe we can talk a bit about also the status of the stability of the project, for example.
That comes up as a question as well.
So OpenTelemetry, given the number of companies involved, has been trying to establish stability for a while, and it's now coming in phases. The trace spec, for example, has been stabilized, and the instrumentation libraries will be stabilized. The next thing is the metrics spec; the metrics spec is soon to be stabilized. And we're trying to make the collector stable at the end of May. So starting with May, traces and metrics will be stable in terms of spec, collector, and instrumentation libraries. Maybe metrics may require some work, but otherwise things will be stable.
This has been an adoption blocker, because people didn't want their data or their entire pipeline to break, and they didn't want to invest too much in building tools around this type of data when they didn't know if it was going to break or not. So it's a huge milestone for the project. The next thing we're going to do is logs. It's in the early stages; maybe the data format is not going to change that much, but we're going to formalize a few other things and make sure that the work is going well in the client libraries. And we want to build some sort of parsers for well-known formats and so on. So there's a lot of conversation going on for logs at this point, in terms of what the next steps are.
But yeah, we can give a couple of links.
Please take a look.
Documentation might be lacking a bit because the project was trying to figure out the stable version.
So I think a lot of work will be done from this point on in terms of contributing to docs.
There will be better examples, more end-to-end examples, and so on.
So if you see anything, you don't have to do code contributions.
Even just reporting those type of cases would be a good contribution.
And Jaana, yeah, we will make sure that we collect some links and put them in the podcast summary as well. We'll do this.
Yeah. Let me give you a couple of links. We can do it now or later.
Perfect. Yeah, the OpenTelemetry GitHub org obviously is a great start.
Obviously. Yeah, that one's too obvious. But you know, the spec is the important repo.
The data model is in the proto repo, which is also an important one. The client libraries,
you know, each of them are there.
There's so much going on, so it's kind of hard to...
I mean, I can give you all the links, but...
That's fine. I think an entry point is good for people.
I think these three, spec, proto, and collector, and maybe a couple of links to the client libraries for some of the languages at least, are really good entry points. And the SIG meetings are actually super useful if people want to contribute; they are all in the community repo, and you probably know about this repo as well. There's a calendar section where you can see all the meetings. And yeah, there's KubeCon coming up, and there's another conference, o11yfest, where there will be a day focused on OpenTelemetry. Maybe we can also mention that; the second day will be OpenTelemetry-specific.
Thanks for that. So, folks that are listening, we'll add all these links
to the podcast summary so you can click on it.
Hey, Brian, do you have anything else?
Yeah, I just wanted to ask: from what I understand, X-Ray is using OpenTelemetry, right? It's one of the pieces. Now, that's obviously not going to have any auto-instrumentation in it yet, but Andy, I believe we have, what is it, a propagator or a collector for X-Ray as well?
Yeah, an exporter.
Yeah, an exporter, yeah.
So if anyone's running anything on that and they want to play around with OpenTelemetry, and if they own Dynatrace and want us to analyze all the data, they can get started already with those two.
I think so, right? It seems like I've seen a blog or two of ours on that already. I just haven't had a chance to play with it because, well, I'm as busy as anything.
But it looks really interesting.
I'm really excited to see what comes out of this.
Although I know earlier I mentioned my fears
of what happens when the cloud providers try to take over,
but I think on this tracing level,
the automation instrumentation side of things,
I think it's going to be really awesome as things go along.
Because as Andy said,
it will make our side of the house a little bit easier.
Let's say we even got to a world
where all the cloud providers did all the instrumentation.
And we didn't have anybody on-prem anymore.
And I'm speaking fantasy world, right?
But that would give the vendors and all of us the ability to solely concentrate on analyzing and alerting and AI for the data and all that.
So I don't really see a net loss in anything for these things.
As we see when automation and pipelines come in,
it's all about moving forward and moving with the thing.
So just another example of the ever evolving technology space that we live
in, which is just crazy.
As time goes on, it just gets crazier and crazier.
I tell you, I sound like an old man, but it's just insane.
I feel like, you know, I'm tired of these things, right?
Like in the last five years, I was like, I just want to retire.
It's like the same thing again and again, right?
You feel like you're not sure if it's any progress
or it's just the same thing repeating again with different people.
Well, that's the funny thing we see with Kubernetes, right?
There was the idea of like, oh, Kubernetes comes out,
you just throw everything up there and it runs.
Well, no, it doesn't.
And then you start having your Kubernetes code
as the new infrastructure to maintain.
Everything just moves. But as we discussed on previous episodes, there are definite benefits that come along with these changes.
The workload shift, the model shift, there's a lot of similarities because it's just a
morphing model.
But the benefits, I mean, if you think about Kubernetes and containers and all that, the
world of automation that it opened up has just been absolutely incredible.
But now you've got to write all your automation scripts.
Yeah.
And, you know, I was super skeptical about Kubernetes, because, hey, we're trying to do this again. It's a complex API surface, because it tries to do everything at once, and most people just don't care about all these use cases. This is more like an API for cloud providers. But it kind of enabled this big ecosystem and this entire new area of automation, cluster-wide automation that just didn't exist before. So it kind of changed a lot of things. There's a lot of complexity in Kubernetes, but that complexity enables some of these things that were not possible before, right?
It enabled Keptn.
There you go. That's our open source project that brings automation.
And Andy always has to mention it. It wouldn't be an episode if Andy didn't mention it. When we first started with Keptn, I didn't get to do it because I was recording in the morning, but there was the idea of a drinking game every time Andy mentions Keptn.
So I'll just drink some of my water here.
All right.
I think it's time to wrap it up.
Jaana, thank you so much for being on the show.
I hope to have you back.
Thank you for waiting for a long time.
It was worth it.
Absolutely worth it.
I'm happy to wait for one more year, and then we'll have you back for what happened in the last year and where we are with OpenTelemetry. I think that would be a good checkpoint. Unless I start working on databases again, and then we can talk about databases.
We don't have many shows on databases, so that would be great. If you're up for another show, we can record it in the next couple of weeks and say, let's refresh your memory on database performance.
No, let's do it if I pivot back to databases.
Okay.
So, you know, I sound legit enough. Like, I need to work in that area. I have too many opinions on that field, but I think let's do it when I'm back on databases.
It's good to be
an opinionated platform.
But I'll challenge any of our listeners.
If our listeners are very knowledgeable in,
or I shouldn't even say very knowledgeable
because no one ever thinks they're knowledgeable enough.
If you have knowledge on database performance,
let us know.
Maybe we can have you on and discuss it
because I think it's a topic that doesn't get enough attention.
Any final thoughts, Jaana?
I mean, it was great to be here,
but we just really pointed out a lot of the pain points.
Like you, I'm also a pessimist, but I'm trying to be an optimist at the same time, because I've seen some of these projects
actually shift things in the long term.
So I hope that
it turns out to be a good
outcome.
Critical optimist is a better
word than pessimist.
I was just going to say I was a pessimistic optimist.
I hope for the best but expect the worst.
This way I'm never disappointed.
Yeah.
Yeah. All right, well, thanks for being on the show. I'd like to thank our listeners for listening.
If anybody has any questions or comments, please reach out to us at @pure_DT.
Jaana, do you do any social media that you wanted to share?
Yeah, I have a Twitter account. If people want to reach out, they can use that.
Alright, we'll have that in the show notes
for anyone who wants to follow that.
And thanks everyone for listening. We'll be back
soon.
Bye-bye.
Thank you. Bye.