PurePerformance - Open Observability: The limits of the 3 pillars with Dotan Horovits
Episode Date: January 31, 2022

"Whether open source or commercial – just focusing on logs, traces and metrics is limiting our conversation and missing the point what observability really is!", says Dotan Horovits, Tech Evangelist at Logz.io, in his opening statement in this podcast. Listen in and learn more about why observability is not about collecting data. Observability is rather a data analytics problem, as it needs to give humans answers to DevOps, SRE and business questions. To learn more beyond what was discussed in this podcast, listen in to OpenObservability Talks, stay up to date on OpenTelemetry, or follow Dotan at @horovits.

Show Links
Dotan Horovits on LinkedIn: https://www.linkedin.com/in/horovits/
Open Observability Talks: https://openobservability.io/
OpenTelemetry Project: https://opentelemetry.io/
Dotan Horovits on Twitter: https://twitter.com/horovits
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Welcome everyone to another episode of Pure Performance.
Unfortunately, the second time I think in a row, I don't have the fun introduction from Brian.
Actually, I think he probably will put a fun introduction to the whole thing.
But unfortunately, he couldn't make it today. So I have the honor again to be here just by myself.
Well, and obviously with our great guest today, a newcomer on the podcast. I always try not to mess up the pronunciation of the name: Dotan Horovits. I hope I said it correctly. Dotan, welcome to the show. Did I get your name correctly?
Yeah, it's Dotan Horovits, and yeah, glad to be here on the show. Thank you for inviting me, and I hope it will be entertaining enough to make up for Brian's absence.
Yeah, he's always pretty good with jokes and he always brings in a special mix to the show.
Yeah, let's see if you can live up to his standards.
Dotan, can you give the audience a little background about yourself, what you do, but not only what you do right now, but especially where you came from? I think that's also very interesting.
Yeah. So these days I'm a principal developer advocate at a company called Logz.io, which provides a cloud native observability platform. I come from an engineering background: many years as an engineer, then a systems architect, a solutions architect, consulting with customers about architecture, design, implementation, tuning, and so on.
I even had an episode as a product manager for developer platforms and cloud orchestration
and things like that.
And these days as a developer advocate, startups, enterprises,
so all the range of goodies there.
If I look at your LinkedIn profile, and by the way,
we will link to your LinkedIn in the podcast proceedings.
So if you're listening in, if you want to follow up with him,
just follow the links. I see you worked at GigaSpaces for a while.
Yeah, I was the solutions architect there. For those in the audience who don't know, GigaSpaces provides an in-memory data grid, sort of a distributed in-memory database, so to speak. And yeah, I was there and I consulted customers about how to build and architect distributed applications on top of that, and cool things like that.
Yeah, because maybe we have actually worked in the past together because I remember in
the early days of Dynatrace.
So I've been with Dynatrace since 2008.
I remember when I lived in Boston, I had quite some exchange with your colleagues back then in GigaSpaces,
instrumenting GigaSpaces with our AppMon product
of kind of previous generation distributed tracing.
And we were kind of, you know, instrumenting apps,
but then also instrumenting GigaSpaces itself
and kind of following traces end to end.
And this is why I looked at your profile and said,
hey, I need to bring this up at least. A small world. I think back then it wasn't even called distributed tracing as a
discipline. So it's a pre-incarnation of the current distributed tracing. But yeah, definitely
a challenge we've been encountering wherever it was in the distributed applications realm for
quite some time now.
Yeah. And in your current work, I mean, you work for Logz.io, you're a principal developer advocate, but you also have your own podcast, videocast.
Can you tell us a little bit about that?
So beyond my role, I'm also very passionate about open source and communities. I'm involved with the CNCF, the Cloud Native Computing Foundation. I'm a co-organizer of the local CNCF chapter here in Israel. And as part of reaching out to the community,
I also have my podcast called Open Observability Talks. So I invite all your listeners, if they
are interested in this topic, open source, DevOps, observability, maybe they'll find it interesting as well.
It's a monthly cadence.
And I get guests that are maintainers,
committers, end users,
all the perspectives around these topics.
Yeah, it's perfect.
I'll have you on one of the episodes soon.
So stay tuned.
Yeah, well, thank you.
I mean, you reached out initially to me, right,
in regards to getting on the show.
And then we said, let's do a show on both sides.
So that's great.
So that also kind of brings me now to the topic of today,
because we want to talk about the role of open source
for better observability.
I think observability is a hot topic.
There's many things going on.
And in preparation for today's recording, I sent you a link to a recent Twitter space.
I'm not sure if people are familiar with Twitter spaces, but it's a way on Twitter where people
can join in into a discussion.
That's typically a couple of speakers that discuss about it, but then people can bring
in their questions, but they can also join the conversation.
And it was a Twitter space on open telemetry.
And I thought what was really interesting
was the quote that initially triggered that Twitter space.
It was from Adolf Reitbauer.
And he said, he had a tweet that says,
I again had a call where I had to explain that just sending
some open telemetry data does not give you observability.
To be clear, I'm a big fan of open telemetry, but observability is, however, a much wider concept, because we need
to not only collect the data, we need to figure out how to collect data in context. What do we
do with the data? And I also agree with him on this thread. It's just one aspect of it.
But I would really love to hear it from you, especially as you're so engaged in the open
source space.
You've been helping the community to make sure that observability is thriving through
open source.
So what I would like to know is a little bit of like, where are we right now?
And where do you think we will go?
And what can we do in our podcast today to help people better understand what observability is,
what OpenTelemetry is today and where it goes
so we make sure that we really have something
that is truly delivering value to organizations.
So first of all, yeah, thanks for highlighting this discussion.
I think it brings a lot of very good points.
Let's start even before open source, with observability itself. I think the very basic point is that many people out there limit the discussion around observability to what is known as the three pillars of observability, namely metrics, logs, and traces. And while I think these signals are important, and maybe the formulation of the three pillars of observability as a term helped kickstart the conversation, I think it has now come to the point where it in fact limits the conversation. Because ultimately, and again, I find myself hearing the very same questions over and over again, and you see companies collecting logs, metrics, traces.
They're sure, they're confident that they have observability because they have all the signals.
And no, they don't have observability.
And it's disappointing because it's just a wrong setting of expectations.
So I would say maybe, you know, maybe even the definition that we use for observability,
the one that we took from control theory is the one to blame because it talks about
how does it go, a way to
track the state of our system
based on the signals it produces, right?
And it puts a lot of emphasis on the
signals.
But then the other piece of this definition, the inference piece, somehow gets, I guess, lost, or at least gets less focus.
And I think this is the critical part, actually.
Actually, someone said that definition, I don't remember who, but I like it very much. I use another definition of observability, which is simply the capability to allow a human to ask and answer questions about your system.
And the reason I like this definition much better is that it makes it very, very clear that observability is ultimately a data analytics problem. The more questions you can ask and answer about your
system, the better, more observability. So I use that definition because it makes it much clearer.
And it's not just semantics of, okay, Dotan, you're using one definition rather than the other.
I think it's fundamental to the way that we implement. It's in a way changing the mindset to thinking,
let's say, about more like BI analysts rather than, I don't know, reactive monitoring, sysadmin,
classical type of thing. So I definitely resonate with that.
So maybe I can try to put it into very simple terms. You're basically saying observability is not about how we are capturing data; the most important thing is what this data can help us do as a next step. If I'm bringing an example from our regular, let's say, physical world: if I'm in a car and my car tells me I'm driving 180 kilometers per hour in the city, and I don't know what to do with this information, that maybe I should slow down because otherwise I'll get a ticket or cause an accident, then the data doesn't do anything good for me, right? If I'm running out of gas soon and I still need to go 100 kilometers but only have gas for 10 kilometers, and I don't make a good decision, then obviously, what is this all good for?
Yeah.
Again, I don't want to make it sound as if the data is not important.
Just that we need to remember that the signals, metrics, logs, traces, first of all, these are not the only signals. We as humans like the number three, so, three pillars.
But, you know, my last episode on the show was about continuous profiling as an emerging signal.
And people are talking about events and formalizing them and others.
So first of all, it's not just three.
And secondly, this is the raw data.
We need the data.
We need the data structured.
We need the data in many ways, enriched.
We need the data.
But we need the data for a reason.
The data is only a means to an end.
Ultimately, we want to be able to understand our system. And the more we go into the current
architectures that are like cloud native, Kubernetes, microservices, and so on, and it
becomes much more dynamic and much more high cardinality and so on and so forth, the set of permutations that you need to address
is such that you can't foresee the questions.
So we can't pre-aggregate, we can't do pre-calculations,
we can't put assumptions.
We need to support the ad hoc questions
much more extensively than we used to in the past.
And that drives, I think this is the driver actually for observability in general.
And this is why I put so much emphasis on the data analytics type of things,
because we can't anticipate.
So we can't do all these preparations in advance.
Yeah.
But on the other side, I completely agree with you on this.
But on the other side, I think a good observability platform
or whatever you want to call it then
is also anticipating certain things
because as a human being,
I may not know all the questions I need to ask
because I know what I know, what I think I need to ask.
But I think as an observability platform,
it should be smart enough to make me aware of certain things
that I may not even ask because I don't know about it.
I think it goes a little beyond that, but I assume this is also what you mean.
No, definitely. I agree. I think vendors such as Dynatrace and Logz.io and many others bring a lot of experience from seeing so many customers. To create that, the buzzwords are around AI and machine learning and stuff like that. But ultimately it's about models that capture the aggregated knowledge of what the typical signals are and the correlation between signals, not just the telemetry signals themselves, but what abnormal behavior, let's say, we should be paying attention to.
But then again, if we do that on the collection side
or even the initial side,
and we don't even send part of this data,
then we won't be able to ask these questions ultimately,
or at least we won't be able to answer these questions
if I actually don't have the raw data.
If I use sampling that is not intelligent
and I just don't have the traces
or if my metric aggregation doesn't provide that,
Now suddenly I want to ask about the error rate across all my servers, and it was pre-aggregated by, I don't know, clusters or something else. And I can't do the P99 across everything; you know, you can't do a P99 of P99s, things like that. Then I've lost the data.
Yeah, no, I completely agree with you. Now,
then let me ask another question. Do you think then there's a misconception
still out there that we hopefully can address
if it's out there that just by, let's say,
looking into open telemetry
or in just looking into Prometheus
will solve all of your problems?
Because I think that comes back to the question
that Alois has raised, right?
It's just sending open telemetry data
is not really observability.
And you'll get the questions as well.
So what can we do?
What can we tell people what else
it means to really build a good observability platform?
It's not just collecting the raw data.
I think you mentioned if you collect the raw data,
you have to be very smart with, because it's a lot of data,
how you aggregate and what you aggregate, if you aggregate.
Because I think certain things need to be aggregated,
because otherwise it's just a sheer volume,
but you have to be very smart on what you aggregate.
And I guess you need to store the data somewhere.
It needs to be analyzed somewhere.
That means you need storage, you need resource,
you need compute.
It's not just for free. What else can we do?
What else do we, what other misconceptions are out there maybe?
So first of all, again, the conversations, if they don't go to signals, they go to tooling: okay, Prometheus this, or OpenTelemetry that. I want to put that aside; maybe we should spend some time later on OpenTelemetry as a platform, I'm quite involved there, so I'd be glad to share. But again, before a specific tool, we need to understand what exactly is the problem that we're trying to solve. And as you said, what are the best practices in doing that? And as I said, the mindset of: once you understand it's a data analytics problem, it
impacts the whole pipeline. It starts from collecting the data from different signals on different sources and being able to aggregate.
And by the way, across the different signals, and it's not just open telemetry, you see all
the industry heading in this direction. If we talk about the open source sphere, then you see Fluentd, which used to specialize in log collection, now expanding into collecting metrics. And you have Telegraf, which started from metrics, now expanding to logs and events. And Elastic, which had like a gazillion Beats, Filebeat, Metricbeat, Packetbeat, whatever, now they have one aggregated collector agent. So first of all, the aggregation of the different signals
is one pain that needs to be addressed,
especially, it's not just about one way of collecting,
it's also about one standardized way of representing the data.
And this goes to the way that you structure the data,
especially logs.
It's a nightmare seeing these plain-text, freeform logs, as if we as humans are going to sit down and read a gazillion lines of logs to understand them.
If you understand it's a data analytics problem,
and you understand that the machine is actually going to ingest that and parse that
and be able to derive data,
you immediately understand we need structured logs. We need to export them not as plain text,
but as, I don't know, JSON maybe. We need to maybe enrich the data with certain things such as,
I don't know, the trace ID to enable log-trace correlation. Maybe we need to use a consistent data model, because if any other piece of my stack
will call the service differently,
one will call it service,
one will call it service with a capital,
one will call it service name,
one will call it service underscore name,
how would I know it's actually the same entity?
So the data modeling is another important thing.
The data enrichment, as I said.
So all of these, and that's just the, let's say, the ingestion part
and the very beginning of the pipeline.
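To make that concrete, here is a minimal sketch, assuming the OpenTelemetry Java API and a hypothetical checkout service, of the kind of structured, trace-enriched JSON log line a machine-first pipeline wants instead of free-form text:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import java.time.Instant;

public final class StructuredLog {
    // Hypothetical helper: emits one JSON log line enriched with the active trace context,
    // so a backend can correlate this log with the corresponding distributed trace.
    static String logLine(String level, String message) {
        SpanContext ctx = Span.current().getSpanContext();
        return String.format(
            "{\"timestamp\":\"%s\",\"level\":\"%s\",\"service.name\":\"checkout\","
                + "\"trace_id\":\"%s\",\"span_id\":\"%s\",\"message\":\"%s\"}",
            Instant.now(), level, ctx.getTraceId(), ctx.getSpanId(), message);
    }

    public static void main(String[] args) {
        // Outside an active span this prints all-zero IDs, which is expected.
        System.out.println(logLine("INFO", "order accepted"));
    }
}
```

In practice a logging library and an OpenTelemetry-aware appender would produce this for you; the point is simply that the primary reader of the line is a machine, not a human.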
And, of course, again, going back to tooling, the tools address that.
So if you look at OpenTelemetry,
OpenTelemetry puts an effort into standardizing the payload, both in the transport, you know, using a protobuf format, also working, by the way, on JSON, and in formalizing the data model for traces, logs, metrics, and so on. So there's work in these projects that will help converge the industry.
But let's first understand the problem. The problem is that when we don't have that,
it's very, very difficult to understand that we are actually even talking about the same entity.
So that's one, I guess, piece of the puzzle.
Then if you go further down the line,
you talk about also querying and visualization and alerting.
Because again, you use one language,
I don't know, PromQL to query your time series data.
And you use maybe Lucene to query your log data.
But what if I want to ask a question across?
So all of these are, I guess, the challenges
that we as an industry face.
They don't necessarily have a boxed answer to everyone,
I have to say upfront.
We're still learning that as an industry, but just to say a few.
But then let me ask you something. If I hear you correctly, you said the important thing is that we don't have individual data silos, because each individual data silo, first of all, won't be able to decide how to best aggregate, because it doesn't have the holistic view. Also, each individual data silo is not able to answer questions across the different pillars, right? And there's more than three pillars, as we understand now.
So isn't that then, I mean, I guess this is exactly why
all the vendors, it seems, are moving and expanding
into all regions, right?
Somebody starts with logs and goes into metrics and traces.
Somebody starts with metrics and now goes into other areas.
So if this is kind of the ultimate direction
that we need to cover everything,
what does this again mean for open source? Does this mean, and especially the do-it-yourself,
I think I see a lot of organizations that just pick some open source logging framework here,
some open source tracing here, some open source metrics there. Does this then mean that these organizations that do it themselves and use these individual pillars still need to solve the overall problem of really combining all that data and then putting an analytics engine on top of it that understands everything?
And then if everybody's doing this, aren't we then again duplicating a lot of work?
Because every organization then has to set up a team that fully understands the problem that they need to solve.
It sounds very strange to me.
You're perfectly right. Actually, my own company, Logz.io, we were debating exactly the same thing, and this is why we said, on the one hand, we identified these very, very popular open source projects out there that people love to use. But then again, the challenge is that each one is a distinct silo, and how can we as a vendor help people use best-of-breed open source, but still with interaction between them. So we offer a suite that combines, let's say, the ELK stack alongside Jaeger for tracing, alongside Prometheus for metrics, and then overlay it with features to correlate.
But putting Logz.io aside, generally this is a challenge for the entire industry.
I think there are very important moves in the open source sphere in that direction.
So let's divide the observability pipeline, let's say, into its different parts. If we look at the ingestion part, or even the very basic instrumentation, as I said, you see open source projects that before used to specialize in specific signals, like Telegraf, like Fluentd that I mentioned before, that are expanding. So you see that these open source projects and the communities behind them realize that they need to cover more, otherwise they become less relevant, or it becomes difficult for the users to use them disconnected from the rest of the stack.
Then there's open telemetry.
And maybe let's talk a bit about open telemetry, because I think this is...
It doesn't address the full pipeline.
It addresses only the telemetry generation
and the telemetry collection side of things.
But still, it looks at it as an aggregate
or as a holistic platform.
So one specification for the APIs and SDKs
for generating logs, metrics, traces,
one standardized collector for collecting the signals
and exporting to whichever backend you'd like.
And by having a unified platform
and also a protocol for transmitting it,
OTLP, OpenTelemetry Protocol,
that again is one way to represent the data model
as we said before,
and one way to transmit it standardized.
It could be the transmission between the SDK
and the client library and the collector.
It could be between the collector and the backend.
It's just an agnostic, general-purpose telemetry transport protocol.
So by having that under one project with one holistic view
of all the signals together, that is a very,
very important step in, as you said, breaking out of these silos, at least on the side, as we said,
of the generation and the collection. So imagine again, we're not there yet, but imagine that,
you know, you have a Java backend and, I don't know, a Node.js frontend application.
And with open telemetry, unlike what we used to have in the past,
you'll have one API and one SDK for Java and one for Node.js,
but they are under the same specification.
And that's it.
You don't need the many, many libraries that we used to have
in order to instrument different pieces of the puzzle.
So that's the vision, at least.
We're not there yet, but the realization is there,
not just with the vendors, but also with the open source community.
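To illustrate the one-SDK, one-protocol idea in code, here is a minimal sketch assuming the OpenTelemetry Java SDK with the OTLP exporter; the endpoint is just a placeholder for whichever collector or OTLP-capable backend you choose:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class OtlpSetup {
    static OpenTelemetry init() {
        // Export spans over OTLP/gRPC; the same protocol works toward a collector
        // or directly toward any backend that understands OTLP.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
            .setEndpoint("http://localhost:4317") // placeholder collector endpoint
            .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
            .build();

        return OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            .build();
    }
}
```

Swapping the backend then becomes a configuration change rather than re-instrumenting with a different vendor library.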
I mean, as you mentioned, right, your organization is obviously betting on OpenTelemetry, as many others do.
My organization, Dynatrace, we do the same thing, right?
We're obviously understanding that this is a major step forward
in making it easier, especially covering other technologies under one umbrella. Because we've been, I mean, I've been with Dynatrace
for 14 years and we've been doing automated instrumentation for even longer because the
company was founded in 2005. And this is now another question I want to get to. And I think
you have an update for me because I'm not as deep into OpenTelemetry as you
are.
So please give me an update here.
I always thought, at least, and I think this is still true for many of the technologies
and SDKs already available, but I always thought that OpenTelemetry means developers need to
manually instrument their code.
That's what I thought.
And manual instrumentation means a lot of additional work.
Now I know there is some auto-instrumentation already going on.
Can you just give me, as a novice,
can you give an update on what the status of automated
instrumentation is, and also how it works
and how I can use automated instrumentation?
Just would be interested in for which technologies it's
currently supported, if you know.
Yeah, so maybe just for the audience that is not familiar,
the idea is that, again, OpenTelemetry,
we think about this one project.
It's actually a mega project.
It's the second most active project now in the CNCF
after Kubernetes.
And so it's essentially many, many projects under it.
So it's very important to say
because people often treat it as one aggregate
and different projects are in different states of maturity and have different focus areas.
So it's very important to say that upfront.
Now, on the telemetry generation side, or the instrumentation side, there is, as I said before, the specification for the API, the SDK, and the data model, which is cross-language; those are the cross-language requirements. And then each group per programming language develops its own reference implementation, if you'd like, of that API and SDK in that specific language.
And the maintainers and contributors there look at what facilities each language has to offer: if it's a bytecode-based language, if it's just-in-time compilation, if it's, you know, Go, which is very, very explicit and doesn't allow any hooks; each language with its own tricks and shticks, as we say. And then they find the best way to do that.
The range is from the manual instrumentation that you mentioned, which is just like we used to do with logs: the developer needs to explicitly open a span at the start of a section and end the span at the end of it, and say, I want this to be a defined span. It's very, very explicit, but it dirties the code, so to speak, mixing instrumentation with the business logic, and it requires all the developers to know the stuff around the instrumentation and so on.
As you said, this is the most advanced usage. The other end of the spectrum is that each programming language group works on auto-instrumentation agents. And the agent, as I said, could be based on bytecode injection, or it could be other ways of hooking in, so that it's codeless. And there's everything in between.
So there are all sorts of language-specific integrations with popular web frameworks, storage clients, RPC libraries, and so on, that make it possible to automatically capture relevant traces and metrics and handle the context propagation for these libraries.
So, for example, if we work with Java on Spring, you have the integration with Spring. If we use Node.js with Hapi and Express, then we have the integrations that we use there.
So you have the full range from the fully manual
to the fully automatic.
And my recommendation when I guide my customers and users and the community members is to leverage auto-instrumentation as much as they can to get a baseline, and in many languages and SDKs it's pretty advanced, which is nice. But then oftentimes you'll find yourself still needing to augment with manual instrumentation, because, I don't know, you have some sort of very sophisticated calculation or algorithm that you want to measure specifically. It's not the full function. It's just a piece of code that only you, knowing your source code, know is something you want a specific measurement on, or things like that.
So that's how I view it, or I think this is how we view it as OpenTelemetry. And this is why we definitely put a lot of emphasis on automatic instrumentation. This is a known problem: it is a barrier to entry for many if they need to resort to only manual instrumentation.
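As a rough illustration of augmenting auto-instrumentation with a manual span, here is a minimal sketch using the OpenTelemetry Java API; the tracer name and the wrapped calculation are hypothetical:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public final class PricingService {
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("com.example.pricing"); // hypothetical instrumentation name

    double quote(long itemCount) {
        // Manually wrap only the specific calculation the auto-agent cannot know about.
        Span span = tracer.spanBuilder("calculate-quote").startSpan();
        try (Scope scope = span.makeCurrent()) {
            span.setAttribute("items.count", itemCount);
            return expensiveCalculation(itemCount);
        } finally {
            span.end(); // always end the span, even if the calculation throws
        }
    }

    private double expensiveCalculation(long itemCount) {
        return itemCount * 9.99; // stand-in for the real algorithm
    }
}
```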
I agree.
Just out of curiosity, I guess, for a language like Java that you know well: do you have a set of rules, like a rule base of what you're instrumenting, or is this just hard-coded in the agent? Or is there any way that I, as a user, can kind of define the rule set that you then instrument?
How does this work in terms of auto-instrumentation?
You have configuration of the agent, if that's your question.
I don't have it off the top of my head to say exactly
which features you can configure and which not.
But remember that one of the advantages of having it as open source is that you can actually take the project, and if you need some very advanced tweaking of the agent, you can actually fork it off the main one, and hopefully contribute it upstream so that the rest of the community will benefit from that.
So very advanced users might go into the agent's bytecode.
I am sure that people like Dynatrace probably have a lot to contribute.
And that's a very important thing to mention.
The architecture is very modular.
So what I talked about is the extreme case of having to open up the source code.
The pluggability of the architecture, by the way, both on the SDK side and on the collector side, is such that you can actually inject, so to speak, your own pieces.
For example, you can put your own exporter from the SDK and during the export phase, you can do
some sort of logic that you apply that can do additional, I don't know, filtering or sampling or something
like that on the SDK side. And then on the collector side, maybe again for the audience that doesn't know: you have the SDK, let's say, in your application, and then you have a collector that collects from the SDKs and also from the infrastructure. So, you know, you have Kafka, you have Redis, you have Mongo or MySQL, whatever, and it's collecting all of that telemetry. And there you also have a data pipeline. So you have receivers, supporting many receivers actually, in many protocols.
And then you can plug in processors that can manipulate the data.
They can do filtering, batching, sampling, whatever.
And there again, you can use the pluggable APIs to put in your own,
define your own processors and plug them
into the process.
So I would advise first to go with the built-in hooks within the SDK or the collector, depending on where the right point in the process is to inject. And only very, very advanced users, such as maybe yourself, will maybe resort to even further tuning the actual underlying framework.
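As one concrete example of that pluggability, here is a minimal sketch, assuming the OpenTelemetry Java SDK, of a custom exporter that filters spans before handing them to the real exporter; the health-check rule is only illustrative:

```java
import io.opentelemetry.sdk.common.CompletableResultCode;
import io.opentelemetry.sdk.trace.data.SpanData;
import io.opentelemetry.sdk.trace.export.SpanExporter;

import java.util.Collection;
import java.util.stream.Collectors;

// Hypothetical wrapper that drops health-check spans before delegating to a real exporter.
final class FilteringSpanExporter implements SpanExporter {
    private final SpanExporter delegate;

    FilteringSpanExporter(SpanExporter delegate) {
        this.delegate = delegate;
    }

    @Override
    public CompletableResultCode export(Collection<SpanData> spans) {
        Collection<SpanData> kept = spans.stream()
            .filter(span -> !span.getName().startsWith("GET /health"))
            .collect(Collectors.toList());
        return delegate.export(kept);
    }

    @Override
    public CompletableResultCode flush() {
        return delegate.flush();
    }

    @Override
    public CompletableResultCode shutdown() {
        return delegate.shutdown();
    }
}
```

You would register it in place of the plain exporter, for example BatchSpanProcessor.builder(new FilteringSpanExporter(otlpExporter)).build(); the collector offers the analogous extension point with custom processors.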
Cool.
I got to ask a couple of other questions because I've been doing distributed tracing for the
last 15 years since I'm with the company, or 14 years.
And I know we had always challenges.
First of all, there was always questions from our users.
In auto instrumentation, what do you really do?
What's the overhead going to be?
How do I know you're not capturing data
that you're not supposed to capture?
And obviously, every vendor in that space, whether it's us, whether it's, I don't know, back in the day, Wily, AppDynamics, New Relic, Datadog.
And I think we all had the same challenges
where we got asked the same questions.
We always had to defend and prove
that we are collecting the right data and not the wrong data
and that we don't have a lot of overhead.
But at least for our customers, they had to come to one entity.
Now it seems with OpenTelemetry, I
have for every single technology,
I have different teams responsible for it
with different status of maturity.
And I guess if I would now,
if I'm an enterprise
and I have five major technologies
and I'm betting on open telemetry,
that means I need to go
to five different stakeholders
and basically ask the same questions.
Overhead, what do you instrument?
How does it work?
Or is there also something
where I can go to one
entity and get these
questions addressed? Because I think, especially as an
enterprise, I would probably
like to have one entity to go
to and not play with five
different parties.
Yeah, that's a challenge in the industry. I think there are some advantages of having an open source mentality or an open source project, because for many of the questions that were very prominent with black-box, closed-source things, you can just say it's all out there in plain sight. You can see it. So you can see that we're not stealing any information, you know, sending it to the backend to gather some information about you.
So that's on this side of things. On the other hand, you're right that, you know,
a monolithic or one-stop solution like all the vendors that you mentioned used to provide had the convenience, the amenity of having, okay, I have one agent to rule them all, I know that they're fine-tuning it and they're doing the thing, and I have my support, and that's it. That's a trade-off. So when you go with do-it-yourself and when you go with a platform, it means that you prefer to put aside some of the convenience for the flexibility of defining your own.
You can plug in, again, what I said to you I also say to customers, you can plug in your own processing logic. You can enhance it, you can tune it to your specific organization's workload types, data modeling, schemas, and so on and so forth.
So you have this flexibility, obviously.
I want to make sure that people understand it's not either OpenTelemetry do-it-yourself or going with a closed-source vendor monolith. There is something in between, which we as the OpenTelemetry project encourage, which is vendors participating, not just in the sense of contributing upstream, but also, and this is the pluggability that I talked about before, where vendors can create differentiation and value-add. So vendors are more than welcome to take OpenTelemetry, as you said you at Dynatrace do, as we at Logz.io do, and others, and actually wrap it, add logic on top of that, or maybe add managed services or services of other types, to provide enterprises that prefer the simplicity over the flexibility with one vendor or one commercial entity that will assume ownership, help them tune, help them configure, escort them, professional services, support, and everything around it.
So OpenTelemetry does not exclude the vendors. From my perspective, it actually enables them, because some of the questions that you used to face, you don't need to face now. You say, I'm based on OpenTelemetry.
I'm just giving you the amenities on top of that.
Yeah, and I think, you know, I took some notes earlier
when you said, I think it's the chance
for all the vendors to say,
we are not only doing data collection, which is where
OpenTelemetry comes in, but we're really providing what is really a true observability platform.
We are giving you the answers to the questions that you have, and the answers come from the
data that we collect.
And now we have a new way of collecting the data in a better, hopefully better way than
in the past.
Exactly.
Yeah.
Really great.
Dotan, is there anything else that you want to make sure our listeners take away from this discussion?
I think we talked a lot. First of all, I really like the way you said how you see observability, that there's more than three pillars, and that observability is really about being able to ask pressing questions and get answers to them.
I think that's a really great definition.
Talked a lot about open telemetry.
What else do we miss in this conversation?
So maybe just to give a very brief overview of where OpenTelemetry currently stands, because let's say that we convinced people that it's interesting for them to look into it. But as we said, because it's not one monolithic project, but rather many sub-projects, it's important to say the vision is what I said, but where it currently stands is that the tracing signal is what in the CNCF is called stable.
And stable is the equivalent of GA, generally available.
So if you're looking for, if you're now starting a project and you're looking for distributed tracing,
I would highly recommend looking at open telemetry as a mature production ready way of doing that.
Metrics is very soon to have this GA. We were hoping to have it by the end of this year. It will probably spill over to the beginning of 2022. But maybe by the time this podcast goes live, it will already be announced.
So it's really there. The API is already stable, the SDK is in feature freeze and soon to be stable, and the collector is nearly there. So it's really, really there.
And one important thing is a great collaboration
with the Prometheus working group
to get the collector to support Prometheus.
That's one of the advantages of the open source community, having both under the CNCF. So again, for metrics, I would also highly recommend looking into that if you're now starting a project.
And logs is unfortunately still behind.
Still behind.
And I guess the focus there is less about formalizing a new API and SDKs, because we do realize that most, if not all, customers already have some logging frameworks there. So the first focus is to get the integration with existing logging systems, ingesting from existing log appenders, and only later on formalizing a new API and things like that. There was the Stanza project that was contributed to the OpenTelemetry Collector, which brought lots of log processing in many data formats and pushed the collector forward, if you're familiar with Stanza.
So this is where logging stands.
So less ready for production, but also looking promising.
Just to give a very, very high-level and brief overview of where that stands.
Yeah, perfect. I've got to ask two more questions. The first one is,
you know, it feels like we've been shifting a lot of responsibilities to developers, because now we're asking them to know more about what they need to instrument, what data to collect, and what questions to ask.
Just a very brief answer from you.
Do you think we are pushing too much on developers?
Because we've also, the whole shifting left,
meaning testing earlier, doing more things earlier,
asking especially developers to do more with less time.
Do you think we're asking too much?
Or do you think it's just a natural kind of wave
where this is just the, with every kind of evolution
of a new, let's say, paradigm, in the beginning,
you have a lot of work that needs to be done,
especially by the technical folks like the developers.
But then as we mature, it will get easier
because we kind of then standardize it, and it kind of becomes a commodity.
I think it puts a lot more responsibility and awareness on developers.
I definitely agree. Developers now need to be fully aware of how the application behaves in production as well, and of non-functional requirements around performance and things like that.
On the other hand, I think at least the developers that I work with in my current company and customer companies and users and others,
I think they like it in the sense that before they felt disconnected
from how it's being used.
And I think the developer, ultimately, as the father and the mother, the parent of this piece of functionality, likes seeing how it's being put to use and how it can be implemented better for better use.
I think that many of the time-consuming tasks
will become much, much easier
the more we get the dev tools that are easier, better
user experience, maybe AI behind the scenes that will help provide more effective insights
and less having to dig into raw data and find it yourself.
So it will become much easier.
That's part of our job as the observability vendors: to make their lives easier. As you said, with auto-instrumentation and in every piece of the journey, we should try to make it easier for them, to enable their focus on the main core business and writing the business logic.
Now another, and my last question,
and I think, I don't know where I saw it.
It was either a tweet from you or it was a LinkedIn posting.
But earlier you mentioned that observability is not only three pillars, but many other pillars as well. Yet you were posting or highlighting a presentation that somebody gave where they said the only thing that they need is distributed traces. It was a presentation, I think, from Tel Aviv or from Israel, from somebody. And I thought that was kind of interesting, because here we are and we talk
about metrics, logs, traces, all types of events, end user, everything. And then I see your post
and say, hey, the only thing we need is to do the traces. How does that work together?
Well, yeah, I think my tweet was misunderstood. I said that it was interesting for me to hear. It was a talk by a very young startup called Velocity, based in Tel Aviv. I saw them at the conference, and their engineer, or head of engineering, who was talking there said that they specifically decided to go all in with distributed tracing and not do logging. Or let's say the logging is part of the
span payload. I found it very interesting just because actually it's a very rare decision to
make these days. Not many organizations can do that. And yeah, it started some discussion there.
Some people were asking me,
so how can that be?
It doesn't fit all organizations.
And I agree.
I don't think that every organization
can work that way.
Just to give a very simple example that came up in that discussion: if you use sampling, which is a very, very common practice in tracing, and you only send 1% or 0.1% of your span data to the backend, which is enough for performance and latency types of use cases, which is the classic use case for tracing; but then again, if you base your logs on the spans and you drop these spans because of the sampling policy, if you now have an error, you don't have these logs. And by the way, I was asking Velocity that same question in their
case, and they said they're using 100% sampling in their case.
They have the luxury of not dropping anything and sending everything to the backend.
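For context, the 1% scenario described here corresponds to a head-based, ratio sampler; a minimal sketch with the OpenTelemetry Java SDK, where the ratio is the only meaningful knob:

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

public final class SamplingConfig {
    static SdkTracerProvider headSampledProvider() {
        // Keep roughly 1 in 100 traces; spans (and any log-like data carried on them)
        // for the other 99% never reach the backend.
        return SdkTracerProvider.builder()
            .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.01)))
            .build();
    }
}
```

Velocity's 100% sampling would be the same configuration with a ratio of 1.0, which is why they can afford to carry their logging in span payloads.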
So again, not every organization works under the same workload situations.
So I don't think that we're there as an industry.
However, and that's important to say,
I do hear more and more people talking about shifting some of the information that they used to deliver via plain logs into the payload of the traces. So it is something that people are now debating. It's no longer a clear-cut, okay, I just throw it in the log line. Let's think for a moment whether this piece of information is actually better served in a span, where it's under the context of that specific request, and so on and so forth.
And even if not, my best practice, again, or my recommendation, would be to enrich the logs with a trace ID so that I can anyway make the correlation from the log to the trace, to do this observability that we talked about before. It was an interesting talk. I hope that it comes out as a recording, and I recommend everyone to check it out.
I'll try to put the link to the tweet in the proceedings. Hey, Dotan, thank you so much for this talk. For me, it's always interesting to learn from my guests
because my guests are always experts in a particular topic,
and you're clearly much more versed in observability
when it comes to open source,
when it comes especially to open telemetry.
So thank you so much.
Also, thank you for allowing me to ask
maybe the one or the other stupid question
or coming up with a strange idea. But I think I'm just trying to learn, right? And people that learn sometimes ask questions where others say, ah, why does he ask this question? But, you know, I just want to know. So thank you so much. What's the best way to get a hold of you in case people want to follow up? I know LinkedIn.
What's your Twitter handle?
So my Twitter handle is Horovits, H-O-R-O-V-I-T-S.
And in fact, it's my name everywhere.
So, you know, Medium and WordPress
and LinkedIn and GitHub and everywhere.
Probably if you just search for Horovits, H-O-R-O-V-I-T-S, you'll probably find me.
And yeah, just reach out to me.
Any feedback on this chat, on the links that I'll post, that we'll post here on the episode,
anything else.
I'm more than curious to hear feedback and engaging conversation following this.
So that will be bidirectional.
And by the way, your questions were excellent, not stupid at all.
And these are the right questions that we should be asking ourselves as a community.
And also, listeners, make sure that you are watching his Open Observability Talks. Really great guests, great conversations. Check out the work that he's doing for Logz.io.
We're all in the same space and it's an exciting space. And it's a space that is evolving and in
the end making lives of our users easier, hopefully,
because then they can get answers to the questions that they have,
spanning everything from logs to traces to events to end-user data,
whatever it is, right?
I think I like that.
Yeah, amazing.
Thank you very much for inviting me and for the interesting chat
and looking forward to having you in the open observability talks as well
to carry on the chats about some of your open source work.
And Brian, very sorry that you couldn't be with us.
It would have been great to have you as well, but pretty sure we'll have him back and then
Brian will be back as well.
Okay.
Thank you.
Bye bye.
Bye bye.