PurePerformance - The Road to OpenTelemetry Adoption at Booking with Anton Timofieiev
Episode Date: December 23, 2024
For the past 10 years Anton has been working at Booking.com - one of the leading digital travel companies based out of Amsterdam. The journey that started as System Administrator has led Anton to become an Engineering Manager for Site Reliability, where over the past 3 years he led the rollout and adoption of OpenTelemetry as the standard for getting observability into new cloud native deployments. Tune in and learn how Anton saw R&D grow from 300 to 2000, why they replaced their home-grown Perl-based observability framework with OpenTelemetry, how they tackle adoption challenges and how they extend and contribute back to the open source community.
Links we discussed:
Anton's LinkedIn Profile: https://www.linkedin.com/in/antontimofieiev/
Observability & SRE Summit: https://www.iqpc.com/events-observability-sre-summit/speakers/anton-timofieiev
OpenTelemetry: https://opentelemetry.io/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always I have with me my fantastic co-host, Mr. Denver in November, Andy Grabner. Now it's not November, Andy, right?
But I bring it up because we had a great time having a few beers and a few whiskeys together in Denver
and meeting in person for the first time, not the first time, but the first time in a long time,
just not at a work event, especially.
No, thanks for spending the time. I know it was a weekend.
It is the best time. I did some posts on
LinkedIn as well to make sure people know we actually get together,
the hosts of Pure Performance.
And yeah, thanks for showing me, especially the whiskey bar.
That was really cool.
That was cool.
You had a busy few weeks of travel there,
so it's good to see you back home.
Back now.
And coming back to the whiskey bar, just one thing.
I observed a lot of things I've never seen
before in my life.
And I think the topic
of today is also something like getting visibility
into things that you otherwise may
not have visibility in.
Right, because you got to pick which whiskey you wanted to use.
Exactly. And in fact,
it wasn't even a pre-done flight. It wasn't
a pre-created flight of whiskey
that you had. You picked each individual one and custom-tailored it exactly to your needs, right?
Yeah, yeah. And in the end I was able to observe it from the outside and from the inside. But now let's stop
with the whiskey before we bore our guest to death, because he's been waiting for a while now.
I'm very happy. It's been almost two months since I bumped into Anton.
Anton, thank you so much for making it to Pure Performance.
We've met at the Observability and SRE Summit in London,
where we both joined the stage at a panel around open telemetry
and open telemetry adoption.
Before I start asking a couple of questions,
could you just introduce yourself quickly to our audience,
maybe a little bit of a background of who you are,
who you work for, and what kind of drives and motivates you?
Yeah, sure.
Nice to see you again, Andreas, and nice to meet you, Brian.
Hello, everyone. My name is Anton.
I work for Booking.com as an engineering manager in the observability team.
So predominantly what we do is work on building open telemetry infrastructure for Booking.com
and also running some other related systems, both homegrown and open source.
Cool.
And looking at your LinkedIn profile, and folks, as always, we link to details like
the LinkedIn profile.
There might also be some additional links we'll post.
Anton, you said your colleagues wrote some papers or blogs around some of the stuff you
do.
So folks, check out the description.
But looking at your LinkedIn profile shows me something
that is, I would say, rather unusual for many out there.
You've spent more than 10 years at Booking.com,
which is phenomenal because a lot of people in our industry
are jumping around all the time.
Booking.com is obviously a well-known entity.
I'm sure many of the listeners have used Booking.com
to book their trips, their hotels, their cars, whatever you book, obviously, to have a nice vacation.
Just going back, how did you start there, and how did you end up
being in the role right now to drive observability
Oh yeah, it's indeed been a long time, although it doesn't feel like it sometimes. Time flies.
I came to Booking, like basically, I joined as a system administrator, was working on
some non-observability stuff in the beginning. And after a few years, couple years, I moved
into one of the observability teams. Originally, that was a team that owned kind of like in-house custom sort of open telemetry.
Like the system which does the same stuff as open telemetry, but it was completely in-house.
And that was pretty fun.
It's a big system.
It was using, back in the day, Riak storage.
It was all written in Perl, had a lot of interesting code inside it, a lot of technology that was built through hard work, sweat, and tears.
And when I joined, I started to work on that and helped move it towards the next phase of the architecture, which was
replacing Riak storage with Kafka. Back at that time, I think around six years ago,
Kafka was all the hype and everybody was really excited about it, and we saw that it would
also work well for us. Kafka is a simpler system to run compared to Riak. It has a different set of pros and cons, but it seemed like a good fit for
us.
So then we were working on that.
And at that part, there were some changes in the team, there were different opportunities,
and eventually I started working also as a team lead, meanwhile keeping my technical
role as well.
And throughout the years, also the organization transformed a lot and
SRE practices were introduced inside Booking.com.
SLOs were also introduced for most teams, and all of those things, so we were also working on that.
So I was kind of working both on the SRE side and a little bit also on development side, a little bit
on pure sysadmin side, a bit of everything. And then also team leading. So eventually
I kind of progressed from sysadmin to senior sysadmin. And now, sorry, I guess it was sysadmin
to SRE. Well, you can double-check it on my LinkedIn, but I guess it was sysadmin to SRE and
then senior SRE, and then senior SRE and team lead at the same time, and then basically engineering manager.
But that's not necessarily just my path, it's also the path of the company, because of the transformations: the introduction of SRE, the migration of people from either development or sysadmin roles to SRE, and also transformations
between teams with an engineering manager. So the company was growing: when I joined, it
was around 300 people in tech in 2014, and now it's 2,000-plus people in tech. So the company was kind
of maturing, moving from startup mode, and also following the industry standards.
When Google published the SRE book, that was like all the hype inside the company,
and we really started to believe in it. We had different attempts, didn't get it right the
first time, but never gave up. And now it's kind of built into our operations.
Fantastic. It's fascinating to hear from a company that long ago already implemented their own kind of observability solution for their Perl-based
system.
I know Brian and I have been around for quite a bit in the observability space.
Auto-instrumentation of different runtimes has been around for a long, long time.
But I think Perl and Brian, I don't know, I think Perl was never really on our radar,
at least not for me.
No.
In fact, it just came up recently: one of our prospects was using Perl.
And maybe you can correct me, Anton,
but my understanding is there is OpenTelemetry code for Perl,
but there's no exporter.
But that's what I just heard from somebody.
But yeah, no, Perl has always been there, but just kind of hiding in the corner.
Every once in a while you'd see it, like, oh, hey, Perl, how are you doing?
And maybe it's also because we always talk about this, we always live in our own little bubble, and in our bubble it never materialized.
But obviously at Booking, Perl has been powering all of your systems.
And so do you remember why they decided to write their own Perl-based observability solution?
Did you guys look around back then into what's available?
Obviously OpenTelemetry wasn't there yet, but do you remember the details?
Yeah, so basically it was kind of a path of evolution.
Basically, from the first days of the company,
like the first version of the website
and the first systems were written in Perl.
And from that day on, everything was written in Perl. There was nothing else.
And from the early years, the people who were long before my time at Booking, I think early
2000s, they started building a powerful kind of observability culture, basically, which
looked like this back in the day: it was mostly a monolith application, which would write structured logs into, let's say, files, and then they would be picked up and sent to other systems.
So from that point on, it gradually grew into an ecosystem of consumers of these structured logs.
And then there were systems which would build metrics out of these logs.
There would be a system which would put it into Hadoop for analysis, and over time more and more data were added into the structured
logs and it grew to a point where you would have half a megabyte, one megabyte single log messages.
And basically it was kind of spread all over.
More systems started to be created over time, but they were using the same libraries, the same approaches, the same observability ecosystem.
So when I joined the team in 2014, that was already well spread, well developed, well matured and super powerful. I haven't seen anything like this in other places where I worked because it was kind of
it's like an overabundance of data, and it's available with one CLI
command, and you can do a lot of things with it. That was super powerful.
When I joined the team
basically for us the question was not about...
The system was already there, so for us it was more about evolving it to meet the new challenges and to allow more scale. So we focused more on the infrastructure side, basically replacing Riak
with Kafka and also introducing some new components in Go. So over time, we replaced
all the Perl components with Go components
and also replaced Riak storage with Kafka.
But that's more about the pipe itself.
The consumers of the data were still a mix of different things, mainly Perl based.
But there also was some JavaScript
on the front end and different like Hadoop and other things.
Yeah, and it's just fascinating to hear these stories, how things evolve. And now I remember,
when we had our discussion back in London a couple of weeks back, you said it,
right: there was obviously a moment when you had to decide that something new was needed.
Because obviously you said you had a monolithic system first.
And not that monoliths are easy, but still it is, I guess, easier,
especially in an existing system that is providing a lot of great structured logs
to then get the telemetry out of it that you need.
But then you were looking into new architectures,
you broke the monolith into smaller pieces,
call it microservice or whatever.
And this was then also the time, I think,
if I remember correctly, where you said,
so what do we do now?
How do we get observability
in this much more complex distributed environment?
Do we want to implement and write our own or keep
implementing our own or do we do something, take something that is
existing that has good community adoption? I guess this was the time we
actively looked into OpenTelemetry. Do I remember this correctly?
Yeah, exactly. Basically once we started splitting off parts of the business logic
into separate applications, then we desperately needed
some kind of tracing capability. And that was never part of the original deal because we never
needed it. So first what we did, we started looking what we could do in-house and then we
quickly realized that it's not a small problem to tackle. So I started also looking at vendors and we basically started sending
our structured logs information into one of the vendor solutions and it kind of worked
out of the box because structured logs already had lots of rich information inside it. We
just defined which excerpt of those messages we would ship, because it's not free. So we wouldn't send everything. Inside one message, we wouldn't send all the fields. We also added a few
fields like root ID and parent ID, which were in many places already present, and where they were
not present, we would add them. And then this vendor tool would match up all the spans into traces. And it was basically super easy to adopt
just because we already had all the data in the stream. So we just
send the stream to a new place.
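To make that idea concrete, here is a minimal sketch in Go of the kind of structured log event Anton describes. The field names (root_id, span_id, parent_id) and values are hypothetical, not Booking.com's actual schema; the point is that once every event carries a shared root ID plus its own span and parent IDs, a backend can stitch the events of one request into a trace.

```go
// Illustrative sketch only: a structured log event carrying correlation fields
// so a downstream tool can stitch individual events into a trace.
package main

import (
	"encoding/json"
	"os"
	"time"
)

type Event struct {
	Timestamp  time.Time         `json:"timestamp"`
	Service    string            `json:"service"`
	Endpoint   string            `json:"endpoint"`
	RootID     string            `json:"root_id"`   // shared by every event of one request
	SpanID     string            `json:"span_id"`   // this unit of work
	ParentID   string            `json:"parent_id"` // the caller's span, empty at the root
	DurationMS float64           `json:"duration_ms"`
	Fields     map[string]string `json:"fields,omitempty"` // only the excerpt worth shipping
}

func main() {
	enc := json.NewEncoder(os.Stdout)
	// One event per unit of work; a pipeline tails these and forwards them.
	enc.Encode(Event{
		Timestamp:  time.Now().UTC(),
		Service:    "search",
		Endpoint:   "/hotels",
		RootID:     "req-7f3a",
		SpanID:     "span-01",
		ParentID:   "",
		DurationMS: 12.4,
		Fields:     map[string]string{"status": "200"},
	})
}
```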
But that means you really had one critical component. That means in the structured logs
you had some type of ID, some type of a correlation
ID or a transaction ID
that you can then use to actually stitch everything together.
Because this is often the hardest part, that you have a distributed system
and you have maybe one correlation ID here and another one over here,
or you don't even know what correlates well.
But it seems with your structured logging approach, you had the basics covered.
Yeah, also we have basically
default libraries for most languages. So fast forward to today, the main languages in the company are Java and TypeScript/Node.js, but then we also have some Go, some Python, and still
a lot of Perl, and maybe other things in some places, some corners as well.
So what we have is, for each language, we would have a library
which would wrap some industry-standard, let's say, HTTP library,
and it would add those fields.
So it would add those IDs, and then you would end up with everybody using those libraries.
So as long as people just keep updating versions, they will always have the current standard.
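As a rough illustration of that wrapper-library idea (not Booking.com's actual library), here is a small Go net/http middleware that reuses an incoming correlation ID or mints a new one; the header name is invented for the example.

```go
// Sketch of a wrapper around a standard HTTP stack: every request gets a
// correlation ID that can be logged and forwarded to downstream calls.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

const rootIDHeader = "X-Request-Root-Id" // hypothetical header name

func newID() string {
	b := make([]byte, 8)
	rand.Read(b)
	return hex.EncodeToString(b)
}

// WithRootID reuses the caller's ID if present, otherwise mints a new one,
// and echoes it back so downstream calls and logs can reference it.
func WithRootID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(rootIDHeader)
		if id == "" {
			id = newID()
		}
		w.Header().Set(rootIDHeader, id)
		log.Printf("root_id=%s path=%s", id, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", WithRootID(mux)))
}
```

As long as teams keep pulling in the shared library, the IDs and their propagation stay consistent across services.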
Now in that transformation story, at which point were the logs
no longer sufficient? At which point did you have to look into open telemetry?
And then also for me, and this was I think also a big thing that we
discussed in London,
how much did you then manually have to instrument?
Did you go in with a manual instrumentation?
Did you use existing libraries that you already had to then do the instrumentation for you?
Did you look into auto-instrumentation agents in OpenTelemetry? Just curious, because I think a lot of our listeners
are also in the transformation phase right now,
where they're looking into OpenTelemetry
for one or another reason,
but they're coming from a stage where they are right now.
So I would like to know a little bit about
what made you look into OpenTelemetry,
what were the decision points,
and then also how did you implement it?
Yes, so at around the same time that we were splitting the monolith, we were also
adopting new kinds of runtime environments. So we moved beyond all bare metal.
We tried different things like Mesos and other things, but eventually we settled on Kubernetes. And
after that, we also started exploring using Lambda, Fargate, ECS, everything.
So at this moment, basically, we have lots of different places where people can run their code.
And we were thinking, okay, how do we integrate all of that? Because our homegrown system was built just for bare metal infra,
and we had nothing for the other runtime environments.
And the question was, do we invest the time and build it,
or we look what is already there in the industry?
And then OpenTelemetry was, that was, I think, around three years ago.
So it was kind of still new and just emerging but at the same time we saw that
the momentum was there and a lot of people were really pushing it and it was evolving very fast
and even though most of the things would be marked as alpha, they would just work. So we started playing
with it, looking into it, and we saw, yeah, it's working, even though it's very young and new. It kind of seemed to work.
So we decided to build out the POC, the pipeline, and send some traffic to see how it goes.
Focused on that for the first phase, and that worked quite nicely.
Basically, our experience is that we could integrate it with everything that we needed
to integrate.
We didn't need to develop anything at all.
It was mainly just configuring it, running it, integrating it with queues,
integrating it with our databases and all of those things.
And
yeah, at that point we decided, okay, then we would start migrating towards OpenTelemetry and phase out our kind of
previous system over time, even though it's not a short-term project. It's not something that we can achieve in a month or in a year, but
that's our plan for now. And was this,
obviously, it's always challenging
to move from something existing,
especially because you said earlier,
the people that were using the data,
there was people running their reports,
there were people that were making decisions
based on this data.
Now you're changing the way you collect the data.
You're changing the data that is collected because OpenTelemetry
just gives you, by default, probably just
some different data, hopefully more even.
Was this a tough thing, a tough sell as well for the people that consumed the data to say,
hey, we need to change something and maybe in that migration phase
we have some data gaps, so we
don't really know how to compare what you had before with what we have now. Was this a challenge?
Well, from one perspective, it is a challenge that you need people kind of to learn new systems and
to, let's say, switch from between different query languages and kind of adopt different
ways of doing things. For example, for metrics, our main metrics database is Graphite.
It has its own query language, its own
capabilities and limitations.
And now with OpenTelemetry, we're also moving to kind of Prometheus compatible
database, and that's a different query language.
And also it has new things that were never in Graphite, but
on the other hand, it doesn't have some of the things that Graphite has. So people
need to learn new tricks and kind of learn how to do the same thing
differently. So that was one thing: sharing information, sharing tips and
best practices, and also kind of selling to people why they need to try new tools.
At the same time, right now, basically,
the way we plan migration is we're starting with new components.
So we don't take some existing data and replace it with new data.
We just say, for everybody who's building new services,
they're adopting the new systems.
So they're not moving from the old way to the new way, they're just starting with the new way. And over time, as we build more
confidence in the new tooling, in the new systems, we will start moving pre-existing older data also.
Once that happens, then we will exactly have the problems that you mentioned that we would need to have sort of a full feature
kind of replacement for each thing that was existing before. We need to have it in the new way
even though it's not going to be technically possible for everything, because, let's say,
Graphite and Mimir work differently, and also we switch from Elasticsearch to Loki for logs.
They also have
absolutely different query languages, different query modes.
So for us it's going to be like a multi-stage process with different
challenges at each stage, but we don't want to rush into it and do everything at once. We just go
in little steps, one by one, as we
become more mature in it and we ourselves learn more about both OpenTelemetry and also
Prometheus, Loki, and other things. Then we can explain to users better how to migrate, how
to move, how to get things done in the new way.
Yeah, and I think it's a great approach.
Maybe I phrased it wrong earlier
because I see some organizations
that try to really kind of rebuild
and replace the old stuff that they already have.
But your strategy is the new stuff that is built.
By default, it is OpenTelemetry.
And then over time,
as you are updating your stack anyway,
more and more will then be just using OpenTelemetry by default.
You made a statement earlier that you also made in the panel discussion we had, because
you said even though you started early with OpenTelemetry and a lot of things were still
marked with alpha, it worked exceptionally well.
And I believe, and correct me if I'm wrong,
you also said that while a lot of things were there by default,
there was also a lot of stuff that you contributed back
that you extended and you built on top of OpenTelemetry
and your engineers extended it.
Do I also remember this correctly?
And if so, I was just interested in hearing: did you also
contribute back upstream to OpenTelemetry, or was this just internal extensions, an internal version
that you built? How does that work?
Right, so we do build our own processors. Mostly it's for
kind of integrations and things which are internal, so we don't push them
upstream, because they're only relevant in the context of our systems. We did have, I think,
one bug fix that was accepted upstream, and I think at this moment that was more or less the extent of
our involvement upstream. But internally we use the OpenTelemetry Collector Builder, and we build our own image with a lot of processors that we build ourselves.
In general, we are willing to
contribute upstream whatever would be relevant. So we are
evaluating, let's say, something we built, is it generally useful?
Then we would consider putting it upstream. If it's company-specific,
then we would leave it in an internal repository
and just build it into our images.
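For listeners curious what such a company-internal processor might do, here is a minimal, hypothetical sketch of a span-enrichment function using the Collector's pdata API as found in recent Collector versions. The attribute names and the owner lookup are invented, and the wiring into a full processor component via the OpenTelemetry Collector Builder is omitted.

```go
// A sketch of internal-only span processing: stamping every resource with
// ownership metadata from a hypothetical internal registry, so downstream
// tools can slice telemetry by team. Not Booking.com's actual code.
package enrichprocessor

import (
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// enrich looks up the owning team for each resource's service.name and adds
// it as a resource attribute before the data is exported.
func enrich(td ptrace.Traces, ownerOf func(service string) string) ptrace.Traces {
	rss := td.ResourceSpans()
	for i := 0; i < rss.Len(); i++ {
		res := rss.At(i).Resource()
		owner := "unknown"
		if svc, ok := res.Attributes().Get("service.name"); ok {
			owner = ownerOf(svc.Str())
		}
		res.Attributes().PutStr("internal.owning_team", owner) // hypothetical attribute
	}
	return td
}
```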
And also for the folks that are listening,
I think we've covered, Brian, OpenTelemetry quite a bit
over the last months and years,
but just for clarification for those people that might be new,
when you talk about your receivers that you build,
the OpenTelemetry Collector, which is a central component,
has a receiver concept where you can basically receive
different data sources or pull in data from various systems.
In other observability platforms,
these might be called plugins or extensions or however.
It's basically pulling in data from certain
systems that you have in your infrastructure where
maybe it's an API call that they provide or
some type of special file format and then you can pull this data in.
Do you have
any numbers on the current adoption of OpenTelemetry?
Would you say, hey, at Booking we are at x percent
of adoption, and our goal is,
by, I don't know, 2025, 2026, we want to
reach a certain level of adoption? Or how do you measure this?
Like right now we have, I think, more than 300 services using OpenTelemetry.
In total, that's probably around 3%, 4%, 5% of number of services.
It's mainly kind of not super high volume ones because
it's mostly new services that are not getting a lot of traffic. But we want to grow it aggressively
as we move forward because also we don't want to maintain two kind of ecosystems for a long time. So we want to adopt the new ecosystem
and the new stack more and more as we move on.
For 2025, we want to see that gradually improve
to much higher numbers.
Like they do depend on how it goes,
but like 20, 30, 40, 50%.
Right now we already...
So it also depends on the signal type.
So for logging, we are currently in the process of moving away
from Elasticsearch, and we want around, yeah, like more than half
of the logging volume to be on OpenTelemetry and the new database
within, let's say, three, four months.
Our metrics is different because we have more than a hundred million metrics in Graphite and
also we have more than 10,000 Grafana dashboards working with this data. So that's going to be a
different story, and we need to plan carefully how we would migrate all of that. For traces, I think about 5-10%
also is already on OpenTelemetry, and that also we plan to increase
a lot in 2025. Cool. So that's already good numbers
of 5-10% of traces. And thanks for that clarification. That's really
important that, obviously, you are adopting the different types of signals
at various speeds. So that's good to point out, because people sometimes just put out
a number and say this is our OpenTelemetry adoption, but then it very much depends on what
you're really talking about. Because while I think many people think about OpenTelemetry as distributed traces,
because that's just I think where it became very popular initially
logs and metrics are obviously
the two other major signals that we see there.
With your size of the company
and you said you're pretty much still getting started,
still in the early stages of adopting it,
are there any challenges that you have?
Are there any things that you,
especially for people that listen in and say,
hey, cool, open telemetry is ready for prime time,
start with new services maybe,
and then grow from there.
But are there any challenges that you foresee or have already overcome that you wanted to
maybe share with us for our listeners to try to also go down that path?
Yeah, I think the main challenge for us is migrating applications, and migrating the teams owning these applications, to new tools.
For example, there are two parts. So one part is the pipeline itself, which is something users
are not necessarily exposed to too much. Like maybe they need to update a library, maybe they need to
use a different library, but other than that, the path of the data from their application to, let's say, your logging
or tracing or metrics database, that's internal and doesn't necessarily make a difference for a
user. So as we move from our legacy, or let's say our previous, pipeline to OpenTelemetry, that
is a smaller change for a user compared to the fact that we are at the same time moving to new
databases. So we are moving from, let's say, Graphite to a Prometheus-compatible database,
we are moving from Elasticsearch to Loki. That is a much bigger change, because then it affects
all the Grafana dashboards, all the alerts, and people need to learn how to use the new tools effectively.
So what we learned is that
we need to prepare training materials for engineers, we need to share best practices, provide recipes for how to solve certain use cases that existed in previous tools with the new tools, communicate a lot, announce changes, help
people prepare for it, and also explain to the company and kind of get them excited.
Then they will get people around them excited, and then it's easier to get adoption across the company.
Because we ourselves, for example, we can reach a certain amount of people,
but if there are ambassadors across the company also reaching out to their
colleagues, then you can spread the word and you can build interest.
And then it will go easier and then people will be more interested in meeting you halfway.
This reminds me so much of community work that you have to do in any type of open source
project, or I guess in any type of community, because you might be one, two people who have a great idea,
you put out a new project
and then you cannot assume that just two people
can basically change the whole world
and convert them into adopters.
You need to start with the adoption
with some teams that really see the value
and then you have to convert them into your advocates.
You have to make sure that the way you scale
is through other people sharing your passion and sharing their stories.
I think that's a big point.
And also what's so important is you talked about creating educational material,
keeping them updated,
providing newsletters, sessions, office hours.
I'm not sure what you're providing as a team to make sure that your engineers have everything
that they need in order to become successful
as you as an organization are transitioning over
to the new way of observing your applications.
That's a head shake, yes.
So I might have missed it earlier,
because this leads into my next question,
but I think I missed it earlier.
So you're collecting the OpenTelemetry data.
What is that being fed into?
Is that something you built,
or are you using something like,
I don't know,
any third party, to ingest and consume the OpenTelemetry data?
Or is that something you built?
Yeah.
So currently we are feeding into
like a SaaS vendor for tracing. We are also feeding into
also a SaaS vendor for a Prometheus-compatible database, but we are also running
Grafana Mimir ourselves. Well, not on-prem, we run it ourselves on Amazon Cloud. And for logs,
we're also running Loki on AWS.
Okay.
Sorry, go ahead.
Yeah, that's like a new stack. The old stack is kind of
completely different, but
we are moving to this new stack.
Okay, so most of the work you're doing is
more on the ingestion as opposed
to the consumption of the data, correct?
Like you have to get everybody adopting OTEL.
I was just curious how many, like you guys have a very large, you know, going back to the idea of Andy's question earlier of, you know, what advice for people, right?
How big of a team does it take to execute what you're doing?
Is it like you and one other person supporting the entire
organization? Is it sort of part time that here and there you get things going? Maybe you're more
busy. But for people who are looking to, you know, maybe they hear this like, hey, we're going to
start it today. What are they getting themselves into in terms of time commitment to do this?
Is it something that's, yeah, I guess I'll leave the question there.
Yeah, there are like... So in total, at Booking, there are a lot of people working
on observability. Like for example, our org is around 20-25 people. But you could say there are
other people working on it as well, because it kind of touches everybody and every application
needs some amount of observability.
Open telemetry is not the only thing we are doing, so not everybody is working on it.
The team I'm on, as you rightfully mentioned, is about ingestion.
So we are predominantly working on that.
We are like four or five people mostly at one time and we are not only
working on OpenTelemetry, because we also have other things.
So I'd say you can start with one, two people building a POC, and actually connecting everything is not hard, because you have the recipes online.
There is a Helm chart; everything is there, everything is ready.
You basically just put a few ready YAMLs and Helm charts on Kubernetes and it will already work
right out of the box. Of course, then you start to tweak and tune it to your environment, to your
context, add things that you need. And then once you start actually integrating it with real data,
for example, you need to receive data in different formats.
You need to figure out all the security integrations.
You need to figure out the data formats
and integrations with client applications that will take much more time than just
setting up the OpenTelemetry pipeline. Setting up the OpenTelemetry pipeline,
that's going to be something that you could just get going after reading online some YAMLs
from the official website.
It's more about fitting into the organizational needs at that point, then.
It's kind of cool because, Andy, if you think back, and obviously, Anton, you saw the earlier
days of OpenTelemetry, it seems like it was a lot more difficult in the early days.
Right, I mean, it makes sense. But now, as you said, there are all these recipes and all this other
stuff that you can use to just get going on it and running it. It sounds like
there'll be that hurdle in the beginning for all the, for lack of a better word, bureaucracy
you have to build around it just to make it compliant within the company, but once you get past that,
it's sort of,
I don't want to say strictly maintenance, but it's
not full-time, as you said. And then it's the trick of getting people
just to adopt and start using it.
But you could say that about any tool, right?
For anything.
Hey, we've got this new tool.
Use it.
All right, so it doesn't seem like it's too high of a hill to climb.
Yeah, definitely.
I think the community is doing a great job
making it very easy to start with.
Great.
Hey, Anton, a quick question on your team
and especially the responsibilities.
You mentioned earlier you currently have about 300 services using OTEL.
Do you as a team, as a platform team,
provide them and also operate their collectors and everything?
Or do you just provide the guidance and the description and maybe some self-service
for them to stand up the whole observability stack,
including the collectors?
I'm just curious because I've seen organizations
doing it differently where some are just like,
hey, you just instrument your app
and you just point us to a collector
and we make sure of scaling the collector
and then taking the data pipeline from there. But I've also seen other models where teams also own
the OpenTelemetry Collector, like part of that pipeline, before it gets stored into
a backend.
Yeah, right, so there are like some teams who were very early adopters of OpenTelemetry
before it became like a company thing.
And some teams were running their own collectors before there was a centralized solution.
And also we had some other sort of parts of the business where they were actually running OpenTelemetry before the bigger part of the business was doing it.
In some places, people were running the OTel Collector as a sidecar in their apps' Kubernetes deployments, but now we are
converging towards a centralized solution where the application owners
would not need to do that. Instead there would be either a BKS deployment or a DaemonSet, and that will
be owned by a centralized team.
Or the application just writes, for example, logs to stdout, or metrics are exposed on a Prometheus
HTTP endpoint, and then we would take the logs from the files written by Kubernetes from
stdout, or we would take metrics from the Prometheus endpoint, and then all of that will be owned by
central teams. So the goal for us is that application owners don't need to become
maintainers of observability systems, don't need to learn the configuration of OpenTelemetry,
own the configuration, and handle different edge cases.
So we want basically just application owners to send the data somewhere, and then we'll take it from there and own the whole pipeline.
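A minimal sketch of what that model can look like from the application owner's side, assuming Go and the standard Prometheus client library: the app only writes structured logs to stdout and exposes a /metrics endpoint, and a centrally owned collector picks both up. Metric and field names are made up for the example.

```go
// Sketch of an app that only emits data: JSON logs to stdout and a
// Prometheus /metrics endpoint. The centralized pipeline owns everything else.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var requests = prometheus.NewCounterVec(
	prometheus.CounterOpts{Name: "app_requests_total", Help: "Handled requests."},
	[]string{"path"},
)

func main() {
	prometheus.MustRegister(requests)
	logger := json.NewEncoder(os.Stdout)

	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		requests.WithLabelValues(r.URL.Path).Inc()
		// One structured log line per request, written to stdout for Kubernetes
		// to capture; a collector tails the resulting log files.
		logger.Encode(map[string]any{
			"ts": time.Now().UTC(), "level": "info", "msg": "handled", "path": r.URL.Path,
		})
		w.Write([]byte("ok"))
	})
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```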
Yeah.
And I think that's also the more frictionless thing to do because then people don't need
to worry about things
that they shouldn't worry about
because really what you provide is observability,
is a self-service.
So send me the data to that endpoint
and I'll take care of it.
But do you then, again, this is stuff that I see
on a regular basis, do you provide feedback
to those teams that consume your service about,
hey, your logging volume has just tripled. We see bad
logging patterns. You're logging things with the
wrong log level. I don't know. Do you provide any
guidance and feedback? Yeah, so
when we know just certain things, we can reach out to user teams. Of course,
we can't kind of investigate and follow up on each and every issue.
The bigger the scale, the less we can kind of control all of it.
What we do first, about the volume and the metric explosions and the log volume explosions:
we give everybody a quota.
So every service has a quota.
And if you go above the quota, then data is going to be dropped.
So, for example, we can notify users about that happening.
But even if users don't get notification from us,
they would either notice and reach out to us and ask for, say, increased quota,
or we would investigate what else they can do,
or they're not looking into it and they wouldn't know.
So I think both situations are happening,
depending on the team and depending on the communication.
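For illustration only, here is a tiny Go sketch of how such a per-service quota check might be enforced inside an ingestion pipeline; the numbers and types are invented and this is not Booking.com's implementation.

```go
// Sketch of a per-service byte quota: data above the budget for the current
// window is dropped, and the owning team could be notified.
package main

import (
	"fmt"
	"sync"
	"time"
)

type Quota struct {
	mu      sync.Mutex
	limit   int64 // bytes allowed per window
	used    int64
	resetAt time.Time
	window  time.Duration
}

func NewQuota(limit int64, window time.Duration) *Quota {
	return &Quota{limit: limit, window: window, resetAt: time.Now().Add(window)}
}

// Admit returns false when the service is over quota for the current window.
func (q *Quota) Admit(bytes int64) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if time.Now().After(q.resetAt) {
		q.used, q.resetAt = 0, time.Now().Add(q.window)
	}
	if q.used+bytes > q.limit {
		return false // drop, and optionally notify the owning team
	}
	q.used += bytes
	return true
}

func main() {
	q := NewQuota(10_000, time.Minute) // 10 KB per minute for a demo service
	fmt.Println(q.Admit(8_000))        // true
	fmt.Println(q.Admit(8_000))        // false: over quota, would be dropped
}
```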
For incorrect formatting, it's kind of also like that.
So at the end of the day, we as an ingestion team,
we can't look at each and every log line and look, oh maybe this is a multi-line message that is actually not joined together, or maybe that's a warning and not an info and it's marked as an info
and things like that. So at the end, if an application
owner is looking at these logs and he notices something is wrong, then he would initiate
like an investigation and in most cases, let's say they may need our help or may not need
our help. If they need our help, we'll also support. But if a user, for example, doesn't
notice an issue, then most probably we will also not notice it
just because of the sheer scale, because those things require eyeballing, and we can't
eyeball, let's say, hundreds and thousands of applications one by one. So if you notice
something, then we would follow up, but it's not a goal to double-check that each and every service is formatting the logs correctly.
So one additional question that I have,
because this comes up on a regular basis for me,
is as you're pushing the power back to the engineers
to build in their own instrumentation,
they can basically control what they capture.
Is there something that you enforce or help them to make sure
they're not capturing the wrong data?
So for instance, personal identifiable data,
any confidential information, credit card details, something like this.
Is this something that would fall into the same thing
what you just explained earlier in a way that you obviously look around,
but it's not that you can look
into every single trace and every single log.
Or do you have some things in place where you validated instrumentation is correct and
they're not capturing things that they shouldn't capture?
Yeah, that's a big challenge. Basically, at the moment, the responsibility is on application owners at the end of the day, but again, that's a best-effort scenario.
For example, we might have some regexes for emails, we might have some regexes for credit card numbers,
but that's not going to be a super exhaustive list. And also, the more we add, the more expensive it gets on the CPU side.
So what we are looking to add in the future is some limited, like, credit card number scrubbing and things like that for applications which are running in, let's say, restricted environments like SOX-compliant or PCI-compliant.
So for those, we will selectively enable some scrubbing. But other than that, you could still have
applications running in any environment kind of leaking sensitive information.
There is no kind of generic solution for it. It's more on application owners. But
there is also a question of application design: in most cases an application
doesn't need to use PII directly. Instead it could use some IDs from internal systems, and that's kind of the approach that we are
also taking in the company: a lot of data you can't just easily get access to,
and you can't work with it directly. Instead you need to use kind of IDs from
specific, specially built storage systems for that. Cool.
I'm just curious because these questions come up.
How do you make sure that the right data
and not the wrong data makes it through all of the data pipelines
until it ends up in a backend system
where people who shouldn't have access to it have access to it?
It's a tough problem to tackle
and also as you said, it's a trade-off
between how much do you put on your data pipeline,
how many regexes can you add,
especially as the volume of ingest grows at some point.
You need to make a trade-off as well.
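As a hedged illustration of that best-effort scrubbing, here is a small Go sketch with two redaction regexes (emails and card-like digit runs). Real rules would need to be far more careful, and, as noted above, every added pattern costs CPU on the pipeline.

```go
// Sketch of best-effort PII redaction applied to a log or attribute value
// before export. Patterns are illustrative only.
package main

import (
	"fmt"
	"regexp"
)

var (
	emailRe = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)
	cardRe  = regexp.MustCompile(`\b(?:\d[ -]?){13,16}\b`)
)

// Scrub replaces likely PII in a value; anything the patterns miss gets through,
// which is why this stays a best-effort safety net, not a guarantee.
func Scrub(value string) string {
	value = emailRe.ReplaceAllString(value, "[email-redacted]")
	value = cardRe.ReplaceAllString(value, "[card-redacted]")
	return value
}

func main() {
	fmt.Println(Scrub("payment failed for jane.doe@example.com card 4111 1111 1111 1111"))
	// prints: payment failed for [email-redacted] card [card-redacted]
}
```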
Hey, so now you've been on your journey.
Well, you've been at Booking.com for 10 years.
You've been on the OpenTelemetry journey for about three years.
To kind of conclude this podcast,
where do you see yourself, if that's even possible,
where do you see yourself in a year and two years from now
as it comes to providing OpenTelemetry-based observability
to your organization?
Yeah, I think for us, the ambition at the moment is maturing the system, driving adoption
much higher, eventually making it kind of the only observability stack available. And the goal for next year is going to be to cover all the different
cloud-based runtime environments as much as possible to have first-class kind of support
for Lambda, first-class support for ECS on Fargate, while also maintaining first-class support for Kubernetes and ensuring that the user experience for all three signals,
like metrics, logs, and traces, is kind of very good.
So for us, it's about maturing that
and ensuring that the user experience will not suffer
as we move from something that you've been building for, say, 15, 20 years,
and that up until now
is super mature, like many problems were solved over the years, and kind of replicating that
maturity on the new observability stack.
Cool. This almost sounds like I should have asked the question,
what is your wish list for Christmas? Because I just noticed that this episode
will actually air
on the 23rd, I believe,
of December,
which is just
two days before Christmas.
It's a good future
to have in mind,
and what you want to achieve.
Did we miss anything, Anton?
For people that are listening in,
is there anything else
that we missed to ask?
No, I think for anybody who's interested in the topic, I would definitely encourage people to try out OpenTelemetry.
I mean, there are a lot of good solutions on the market. There are a lot of
very mature kind of systems which
have lots of functionality and their own pros and cons. But what is, I think, potentially very
alluring with OpenTelemetry is that you have a lot of community support.
It's very vibrant. You have super fast kind of evolution. For example,
just last year, I think logs were not officially stable yet, even though
support was already there, but everything is moving very fast. And now profiling is being
added to OpenTelemetry, like Elastic kind of contributed the profiler.
And I think that the OpenTelemetry story,
although it's already quite capable,
I think it's just starting.
And I think every year there's going to be
major innovations happening
and major new capabilities added.
So as you start using it,
you're just going to get a lot of cool new stuff
and that is going to be very easy to add to your stack.
Yeah, and I can only echo that.
I was just at KubeCon in Salt Lake City, and observability
and OpenTelemetry, with the whole ecosystem that was built around it
or is currently being built around it, had just a very big spotlight at KubeCon.
They have their own Observability Day, which is a co-located event that happens one day
prior to the main conference.
Many observability-related talks.
And folks, if you are listening to this and you want to gather and connect with other
folks that are interested in observability, there's many conferences around.
We have just been at the Observability
and SRE Summit in London.
KubeCon London is also coming up.
I think it's the first week of April.
So for everybody that is in Europe,
it might be an easier destination
to get to.
I'm pretty sure there's
many other different
local events, meetups,
Kubernetes, KCDs,
Kubernetes Community Days,
or Cloud Native Days,
where you can hear all these stories
and also connect with the community
that is actually building these frameworks and tools.
So basically just keep your eyes open
and observe where the local meetings are?
Exactly.
Look at that.
We should write an OpenTelemetry receiver that can scrape all of these conference websites
and then make you aware of new upcoming conferences and CFPs and speakers. I don't know, emitting
some metrics about the number of speakers on a conference for observability.
And then we need like an exporter which would send emails to people with like digests.
Exactly, yeah. See, that's a Christmas project folks, if you're bored over Christmas.
Get working on it, Andy.
I expect it by January 1st.
Yeah.
No, very cool.
Brian, any final words from you?
No, I was mostly observing this because I'm never this deep into OTEL, right?
So I was being more of an observer.
But it's fascinating to hear what you've been able to accomplish with all this, Anton, especially
where the team started with the Perl setup and everything,
and how, as the organization
changed and your needs
changed for observability, you didn't just throw,
I don't want to say throw in the towel, because heck, I'm one of the vendors, right? But you didn't just say,
well, I guess we have to go to a vendor now, right? You said, you know, we've got
a good thing going here. Let's take some time and effort,
especially in the earlier days of OpenTelemetry, where you were using
the alpha code and all.
But you stuck with it, and you've got this fantastic practice that you've built around it.
So it's inspiring to see what can be done with it.
And I think the future is definitely bright with open telemetry.
And as long as people like you and your team and others that are really pushing for large adoption keep going with it,
it's going to go some really wonderful places. So thank you for, you know, just using it and,
um, keeping it, I shouldn't say keeping it alive, it's not like it's in danger of dying, you know
what I mean, but, like, you know, being an inspiration on OpenTelemetry. That's a better way to say it.
Awesome.
All right.
Good.
Then I would say, folks, thanks for listening in.
And anyway, yeah, happy whatever.
I think this is the last episode of the year.
Okay.
Right?
So I'm looking forward to 2025, and also obviously looking back to all these great episodes.
It was a great end-of-the-year episode, and I'm sure there will be many, many more to come on topics that are adjacent to OpenTelemetry.
Yep. And thank you to all our listeners for a wonderful year, and look forward to the next year of Pure Performance.
All right, Anton, thank you very much.
It was a pleasure to meet you.
Thanks, Brian and Andreas.
Thank you.
Bye-bye. Thank you.
Bye-bye.
Peace, bye.