PurePerformance - The Road to OpenTelemetry Adoption at Booking with Anton Timofieiev
Episode Date: December 23, 2024
For the past 10 years Anton has been working at Booking.com - one of the leading digital travel companies based out of Amsterdam. The journey that started as System Administrator has led Anton to become an Engineering Manager for Site Reliability, where over the past 3 years he led the rollout and adoption of OpenTelemetry as the standard for getting observability into new cloud native deployments. Tune in and learn how Anton saw R&D grow from 300 to 2000, why they replaced their home-grown Perl-based observability framework with OpenTelemetry, how they tackle adoption challenges and how they extend and contribute back to the open source community.
Links we discussed:
Anton's LinkedIn Profile: https://www.linkedin.com/in/antontimofieiev/
Observability & SRE Summit: https://www.iqpc.com/events-observability-sre-summit/speakers/anton-timofieiev
OpenTelemetry: https://opentelemetry.io/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always I have with me my fantastic co-host, Mr. Denver in November, Andy Grabner. Now it's not November, Andy, right?
But I bring it up because we had a great time having a few beers and a few whiskeys together in Denver
and meeting in person for the first time, not the first time, but the first time in a long time,
just not at a work event, especially.
No, thanks for spending the time. I know it was a weekend.
It is the best time. I did some posts on
LinkedIn as well to make sure people know we actually get together,
the hosts of Pure Performance.
And yeah, thanks for showing me, especially the whiskey bar.
That was really cool.
That was cool.
You had a busy few weeks of travel there,
so it's good to see you back home.
Back now.
And coming back to the whiskey bar, just one thing.
I observed a lot of things I've never seen
before in my life.
And I think the topic
of today is also something like getting visibility
into things that you otherwise may
not have visibility in.
Right, because you got to pick which whiskey you wanted to use.
Exactly. And in fact,
it wasn't even a pre-done flight. It wasn't
a pre-created flight of whiskey
that you had. You picked each individual one and custom-tailored it exactly to your needs, right?
Yeah, yeah. And in the end I was able to observe it from the outside and from the inside. But now let's stop
with the whiskey before we bore our guest to death, because he's been waiting for a while now.
I'm very happy. It's been almost two months since I bumped into Anton.
Anton, thank you so much for making it to Pure Performance.
We've met at the Observability and SRE Summit in London,
where we both joined the stage at a panel around open telemetry
and open telemetry adoption.
Before I start asking a couple of questions,
could you just introduce yourself quickly to our audience,
maybe a little bit of a background of who you are,
who you work for, and what kind of drives and motivates you?
Yeah, sure.
Nice to see you again, Andreas, and nice to meet you, Brian.
Hello, everyone. My name is Anton.
I work for Booking.com as an engineering manager in the observability team.
So predominantly what we do is work on building open telemetry infrastructure for Booking.com
and also running some other related systems, both homegrown and open source.
Cool.
And looking at your LinkedIn profile, and folks, as always, we link to details like
the LinkedIn profile.
There might also be some additional links we'll post.
Anton, you said your colleagues wrote some papers or blogs around some of the stuff you
do.
So folks, check out the description.
But looking at your LinkedIn profile shows me something
that is, I would say, rather unusual for many out there.
You've spent more than 10 years at Booking.com,
which is phenomenal because a lot of people in our industry
are jumping around all the time.
Booking.com is obviously a well-known entity.
I'm sure many of the listeners have used Booking.com
to book their trips, their hotels, their cars, whatever you book, obviously, to have a nice vacation.
Just going back, how did you start there, and how did you end up
being in the role right now to drive observability
Oh yeah, it's indeed been a long time, although it doesn't feel like it sometimes. Time flies.
I came to Booking, like basically, I joined as a system administrator, was working on
some non-observability stuff in the beginning. And after a few years, couple years, I moved
into one of the observability teams. Originally, that was a team that owned kind of like in-house custom sort of open telemetry.
Like the system which does the same stuff as open telemetry, but it was completely in-house.
And that was pretty fun.
It's a big system.
It was using, back in the day, Riak storage.
It was all written in Perl, had a lot of interesting code inside it, a lot of technology that was built through hard work, sweat, and tears.
And when I joined, I started to work on that and helped move it towards the next phase of the architecture, which was
replacing Riak storage with Kafka. Back at that time, I think around six years ago,
Kafka was all the hype and everybody was really excited about it, and we saw that it would
also work well for us. Kafka is a simpler system to run compared to Riak. It has a different set of pros and cons, but it seemed like a good fit for
us.
So then we were working on that.
And at that part, there were some changes in the team, there were different opportunities,
and eventually I started working also as a team lead, meanwhile keeping my technical
role as well.
And throughout the years, also the organization transformed a lot and
SRE practices were introduced inside Booking.com.
SLOs were also introduced for most teams, and all of those things, so we were also working on that.
So I was kind of working both on the SRE side and a little bit also on development side, a little bit
on pure sysadmin side, a bit of everything. And then also team leading. So eventually
I kind of progressed from sysadmin to senior sysadmin. And now, sorry, I guess it was sysadmin
to SRE. Well, you can double-check it on my LinkedIn, but I guess it was sysadmin to SRE and
then senior SRE, and then senior SRE and team lead at the same time, and then basically engineering manager.
But that's not necessarily just my path, it's also the path of the company, because of the transformations: the introduction of SRE, the migration of people from either development or sysadmin roles to SRE, and also transformations
between teams with an engineering manager. So the company was growing: when I joined, it
was around 300 people in tech in 2014, and now it's 2,000-plus people in tech. So the company was kind
of maturing, moving from startup mode, and also following the industry standards.
When Google published the SRE book, that was like all the hype inside the company,
and we really started to believe in it. We had different attempts, didn't get it right the
first time, but never gave up. And now it's kind of built into our operations.
Fantastic. It's fascinating to hear from a company that long ago already implemented their own kind of observability solution for their Perl-based
system.
I know Brian and I have been around for quite a bit in the observability space.
Auto-instrumentation of different runtimes has been around for a long, long time.
But I think Perl and Brian, I don't know, I think Perl was never really on our radar,
at least not for me.
No.
In fact, it just came up recently: one of our prospects was using Perl.
And maybe you can correct me, Anton,
but my understanding is there is OpenTelemetry code for Perl,
but there's no exporter.
But that's what I just heard from somebody.
But yeah, no, Perl has always been there, but just kind of hiding in the corner.
Every once in a while you'd see it, like, oh, hey, Perl, how are you doing?
And maybe it's also because we always talk about this, we always live in our own little bubble, and in our bubble it never materialized.
But obviously at Booking, Perl has been powering all of your systems.
And so do you remember why they decided to write their own Perl-based observability solution?
Did you guys look around back then into what's available?
Obviously OpenTelemetry wasn't there yet, but do you remember the details?
Yeah, so basically it was kind of a path of evolution.
Basically, from the first days of the company,
like the first version of the website
and the first systems were written in Perl.
And from that day on, everything was written in Perl. There was nothing else.
And from the early years, the people who were long before my time at Booking, I think early
2000s, they started building a powerful kind of observability culture, basically, which
looked like this back in the day: it was mostly a monolith application, which would write structured logs into, let's say, files, and then they would be picked up and sent to other systems.
So from that point on, it gradually grew into an ecosystem of consumers of these structured logs.
And then there were systems which would build metrics out of these logs.
There would be a system which would put it into Hadoop for analysis, and over time more and more data were added into the structured
logs and it grew to a point where you would have half a megabyte, one megabyte single log messages.
And basically it was kind of spread all over.
More systems started to be created over time, but they were using the same libraries, the same approaches, the same observability ecosystem.
So when I joined the team in 2014, that was already well spread, well developed, well matured and super powerful. I haven't seen anything like this in other places where I worked because it was kind of
it's like an overabundance of data, and it's available with one CLI
command, and you can do a lot of things with it. That was super powerful.
When I joined the team
basically for us the question was not about...
The system was already there, so for us it was more about evolving it to meet the new challenges and to allow more scale. So we focused more on the infrastructure side, basically replacing Riak
with Kafka and also introducing some new components in Go. So over time, we replaced
all the Perl components with Go components
and also replaced Riak storage with Kafka.
But that's more about the pipe itself.
The consumers of the data were still a mix of different things, mainly Perl based.
But there also was some JavaScript
on the front end and different like Hadoop and other things.
Yeah, and it's just fascinating to hear these stories, how things evolve. And now I remember,
when we had our discussion back in London a couple of weeks back, you said it,
right: there was obviously a moment when you had to decide that something new was needed.
Because obviously you said you had a monolithic system first.
And not that monoliths are easy, but still it is, I guess, easier,
especially in an existing system that is providing a lot of great structured logs
to then get the telemetry out of it that you need.
But then you were looking into new architectures,
you broke the monolith into smaller pieces,
call it microservice or whatever.
And this was then also the time, I think,
if I remember correctly, where you said,
so what do we do now?
How do we get observability
in this much more complex distributed environment?
Do we want to implement and write our own or keep
implementing our own or do we do something, take something that is
existing that has good community adoption? I guess this was the time we
actively looked into OpenTelemetry. Do I remember this correctly?
Yeah, exactly. Basically once we started splitting off parts of the business logic
into separate applications, then we desperately needed
some kind of tracing capability. And that was never part of the original deal because we never
needed it. So first what we did, we started looking what we could do in-house and then we
quickly realized that it's not a small problem to tackle. So I started also looking at vendors and we basically started sending
our structured logs information into one of the vendor solutions and it kind of worked
out of the box because structured logs already had lots of rich information inside it. We
just defined which excerpt of those messages we would ship, because it's not free. So we wouldn't send everything. Inside one message, we wouldn't send all the fields. We also added a few
fields like root ID and parent ID, which were in many places already present, and where they were
not present, we would add them. And then this vendor tool would match up all the spans into traces. And it was basically super easy to adopt
just because we already had all the data in the stream. So we just
send the stream to a new place.
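To make that idea concrete, here is a minimal sketch in Go of the kind of structured log event Anton describes. The field names (root_id, span_id, parent_id) and values are hypothetical, not Booking.com's actual schema; the point is that once every event carries a shared root ID plus its own span and parent IDs, a backend can stitch the events of one request into a trace.

```go
// Illustrative sketch only: a structured log event carrying correlation fields
// so a downstream tool can stitch individual events into a trace.
package main

import (
	"encoding/json"
	"os"
	"time"
)

type Event struct {
	Timestamp  time.Time         `json:"timestamp"`
	Service    string            `json:"service"`
	Endpoint   string            `json:"endpoint"`
	RootID     string            `json:"root_id"`   // shared by every event of one request
	SpanID     string            `json:"span_id"`   // this unit of work
	ParentID   string            `json:"parent_id"` // the caller's span, empty at the root
	DurationMS float64           `json:"duration_ms"`
	Fields     map[string]string `json:"fields,omitempty"` // only the excerpt worth shipping
}

func main() {
	enc := json.NewEncoder(os.Stdout)
	// One event per unit of work; a pipeline tails these and forwards them.
	enc.Encode(Event{
		Timestamp:  time.Now().UTC(),
		Service:    "search",
		Endpoint:   "/hotels",
		RootID:     "req-7f3a",
		SpanID:     "span-01",
		ParentID:   "",
		DurationMS: 12.4,
		Fields:     map[string]string{"status": "200"},
	})
}
```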
But that means you really had one critical component. That means in the structured logs
you had some type of ID, some type of a correlation
ID or a transaction ID
that you can then use to actually stitch everything together.
Because this is often the hardest part, that you have a distributed system
and you have maybe one correlation ID here and another one over here,
or you don't even know what correlates well.
But it seems with your structured logging approach, you had the basics covered.
Yeah, also we have basically
default libraries for most languages. So fast forward to today, the main languages in the company are Java and TypeScript/Node.js, but then we also have some Go, some Python, and still
a lot of Perl, and maybe other things in some places, some corners as well.
So what we have is, for each language, we would have a library
which would wrap some industry-standard, let's say, HTTP library,
and it would add those fields.
So it would add those IDs, and then you would end up with everybody using those libraries.
So as long as people just keep updating versions, they will always have the current standard.
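As a rough illustration of that wrapper-library idea (not Booking.com's actual library), here is a small Go net/http middleware that reuses an incoming correlation ID or mints a new one; the header name is invented for the example.

```go
// Sketch of a wrapper around a standard HTTP stack: every request gets a
// correlation ID that can be logged and forwarded to downstream calls.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

const rootIDHeader = "X-Request-Root-Id" // hypothetical header name

func newID() string {
	b := make([]byte, 8)
	rand.Read(b)
	return hex.EncodeToString(b)
}

// WithRootID reuses the caller's ID if present, otherwise mints a new one,
// and echoes it back so downstream calls and logs can reference it.
func WithRootID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(rootIDHeader)
		if id == "" {
			id = newID()
		}
		w.Header().Set(rootIDHeader, id)
		log.Printf("root_id=%s path=%s", id, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", WithRootID(mux)))
}
```

As long as teams keep pulling in the shared library, the IDs and their propagation stay consistent across services.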
Now in that transformation story, at which point were the logs
no longer sufficient? At which point did you have to look into open telemetry?
And then also for me, and this was I think also a big thing that we
discussed in London,
how much did you then manually have to instrument?
Did you go in with a manual instrumentation?
Did you use existing libraries that you already had to then do the instrumentation for you?
Did you look into auto-instrumentation agents in OpenTelemetry? Just curious, because I think a lot of our listeners
are also in the transformation phase right now,
where they're looking into OpenTelemetry
for one or another reason,
but they're coming from a stage where they are right now.
So I would like to know a little bit about
what made you look into OpenTelemetry,
what were the decision points,
and then also how did you implement it?
Yes, so at around the same time that we were splitting the monolith, we were also
adopting new kinds of runtime environments. So we moved beyond all bare metal.
We tried different things like Mesos and other things, but eventually we settled on Kubernetes. And
after that, we also started exploring using Lambda, Fargate, ECS, everything.
So at this moment, basically, we have lots of different places where people can run their code.
And we were thinking, okay, how do we integrate all of that? Because our homegrown system was built just for bare metal infra,
and we had nothing for the other runtime environments.
And the question was, do we invest the time and build it,
or we look what is already there in the industry?
And then OpenTelemetry was, that was, I think, around three years ago.
So it was kind of still new and just emerging but at the same time we saw that
the momentum was there and a lot of people were really pushing it and it was evolving very fast
and even though most of the things would be marked as alpha, they would just work. So we started playing
with it, looking into it, and we saw, yeah, it's working, even though it's very young and new. It kind of seemed to work.
So we decided to build out the POC, the pipeline, and send some traffic to see how it goes.
Focused on that for the first phase, and that worked quite nicely.
Basically, our experience is that we could integrate it with everything that we needed
to integrate.
We didn't need to develop anything at all.
It was mainly just configuring it, running it, integrating it with queues,
integrating it with our databases and all of those things.
And
yeah, at that point we decided, okay, then we would start migrating towards OpenTelemetry and phase out our kind of
previous system over time, even though it's not a short-term project. It's not something that we can achieve in a month or in a year, but
that's our plan for now. And was this,
obviously, it's always challenging
to move from something existing,
especially because you said earlier,
the people that were using the data,
there was people running their reports,
there were people that were making decisions
based on this data.
Now you're changing the way you collect the data.
You're changing the data that is collected because OpenTelemetry
just gives you, by default, probably just
some different data, hopefully more even.
Was this a tough thing, a tough sell as well for the people that consumed the data to say,
hey, we need to change something and maybe in that migration phase
we have some data gaps, so we
don't really know how to compare what you had before with what we have now. Was this a challenge?
Well, from one perspective, it is a challenge that you need people kind of to learn new systems and
to, let's say, switch from between different query languages and kind of adopt different
ways of doing things. For example, for metrics, our main metrics database is Graphite.
It has its own query language, its own
capabilities and limitations.
And now with OpenTelemetry, we're also moving to kind of Prometheus compatible
database, and that's a different query language.
And also it has new things that were never in Graphite, but
on the other hand, it doesn't have some of the things that Graphite has. So people
need to learn new tricks and kind of learn how to do the same thing
differently. So that was one thing: sharing information, sharing tips and
best practices, and also kind of selling to people why they need to try new tools.
At the same time, right now, basically,
the way we plan migration is we're starting with new components.
So we don't take some existing data and replace it with new data.
We just say, for everybody who's building new services,
they're adopting the new systems.
So they're not moving from the old way to the new way, they're just starting with the new way. And over time, as we build more
confidence in the new tooling, in the new systems, we will start moving pre-existing older data also.
Once that happens, then we will exactly have the problems that you mentioned that we would need to have sort of a full feature
kind of replacement for each thing that was existing before. We need to have it in the new way
even though it's not going to be technically possible for everything, because, let's say,
Graphite and Mimir work differently, and also we switch from Elasticsearch to Loki for logs.
They also have
absolutely different query languages, different query modes.
So for us it's going to be like a multi-stage process with different
challenges at each stage, but we don't want to rush into it and do everything at once. We just go
in little steps, one by one, as we
become more mature in it and we ourselves learn more about both OpenTelemetry and also
Prometheus, Loki, and other things. Then we can explain to users better how to migrate, how
to move, how to get things done in the new way.
Yeah, and I think it's a great approach.
Maybe I phrased it wrong earlier
because I see some organizations
that try to really kind of rebuild
and replace the old stuff that they already have.
But your strategy is the new stuff that is built.
By default, it is OpenTelemetry.
And then over time,
as you are updating your stack anyway,
more and more will then be just using OpenTelemetry by default.
You made a statement earlier that you also made in the panel discussion we had, because
you said even though you started early with OpenTelemetry and a lot of things were still
marked with alpha, it worked exceptionally well.
And I believe, and correct me if I'm wrong,
you also said that while a lot of things were there by default,
there was also a lot of stuff that you contributed back
that you extended and you built on top of OpenTelemetry
and your engineers extended it.
Do I also remember this correctly?
And if so, I was just interested in hearing: did you also
contribute back upstream to OpenTelemetry, or was this just internal extensions, an internal version
that you built? How does that work?
Right, so we do build our own processors. Mostly it's for
kind of integrations and things which are internal, so we don't push them
upstream, because they're only relevant in the context of our systems. We did have, I think,
one bug fix that was accepted upstream, and I think at this moment that was more or less the extent of
our involvement upstream. But internally we use the OpenTelemetry Collector Builder, and we build our own image with a lot of processors that we build ourselves.
In general, we are willing to
contribute upstream whatever would be relevant. So we are
evaluating, let's say, something we built, is it generally useful?
Then we would consider putting it upstream. If it's company-specific,
then we would leave it in an internal repository
and just build it into our images.
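For listeners curious what such a company-internal processor might do, here is a minimal, hypothetical sketch of a span-enrichment function using the Collector's pdata API as found in recent Collector versions. The attribute names and the owner lookup are invented, and the wiring into a full processor component via the OpenTelemetry Collector Builder is omitted.

```go
// A sketch of internal-only span processing: stamping every resource with
// ownership metadata from a hypothetical internal registry, so downstream
// tools can slice telemetry by team. Not Booking.com's actual code.
package enrichprocessor

import (
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// enrich looks up the owning team for each resource's service.name and adds
// it as a resource attribute before the data is exported.
func enrich(td ptrace.Traces, ownerOf func(service string) string) ptrace.Traces {
	rss := td.ResourceSpans()
	for i := 0; i < rss.Len(); i++ {
		res := rss.At(i).Resource()
		owner := "unknown"
		if svc, ok := res.Attributes().Get("service.name"); ok {
			owner = ownerOf(svc.Str())
		}
		res.Attributes().PutStr("internal.owning_team", owner) // hypothetical attribute
	}
	return td
}
```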
And also for the folks that are listening,
I think we've covered, Brian, OpenTelemetry quite a bit
over the last months and years,
but just for clarification for those people that might be new,
when you talk about your receivers that you build,
the OpenTelemetry Collector, which is a central component,
has a receiver concept where you can basically receive
different data sources or pull in data from various systems.
In other observability platforms,
these might be called plugins or extensions or however.
It's basically pulling in data from certain
systems that you have in your infrastructure where
maybe it's an API call that they provide or
some type of special file format and then you can pull this data in.
Do you have
any numbers on the current adoption of OpenTelemetry?
Would you say, hey, at Booking we are at x percent
of adoption, and our goal is,
by, I don't know, 2025, 2026, we want to
reach a certain level of adoption? Or how do you measure this?
Like right now we have, I think, more than 300 services using OpenTelemetry.
In total, that's probably around 3%, 4%, 5% of number of services.
It's mainly kind of not super high volume ones because
it's mostly new services that are not getting a lot of traffic. But we want to grow it aggressively
as we move forward because also we don't want to maintain two kind of ecosystems for a long time. So we want to adopt the new ecosystem
and the new stack more and more as we move on.
For 2025, we want to see that gradually improve
to much higher numbers.
Like they do depend on how it goes,
but like 20, 30, 40, 50%.
Right now we already...
So it also depends on the signal type.
So for logging, we are currently in the process of moving away
from Elasticsearch, and we want around, yeah, like more than half
of the logging volume to be on OpenTelemetry and the new database
within, let's say, three, four months.
Our metrics is different because we have more than a hundred million metrics in Graphite and
also we have more than 10,000 Grafana dashboards working with this data. So that's going to be a
different story, and we need to plan carefully how we would migrate all of that. For traces, I think about 5-10%
also is already on OpenTelemetry, and that also we plan to increase
a lot in 2025. Cool. So that's already good numbers
of 5-10% of traces. And thanks for that clarification. That's really
important that, obviously, you are adopting the different types of signals
at various speeds. So that's good to point out, because people sometimes just put out
a number and say this is our OpenTelemetry adoption, but then it very much depends on what
you're really talking about. Because while I think many people think about OpenTelemetry as distributed traces,
because that's just I think where it became very popular initially
logs and metrics are obviously
the two other major signals that we see there.
With your size of the company
and you said you're pretty much still getting started,
still in the early stages of adopting it,
are there any challenges that you have?
Are there any things that you,
especially for people that listen in and say,
hey, cool, open telemetry is ready for prime time,
start with new services maybe,
and then grow from there.
But are there any challenges that you foresee or have already overcome that you wanted to
maybe share with us for our listeners to try to also go down that path?
Yeah, I think the main challenge for us is migrating applications, and migrating the teams owning these applications, to new tools.
For example, there are two parts. So one part is the pipeline itself, which is something users
are not necessarily exposed to too much. Like maybe they need to update a library, maybe they need to
use a different library, but other than that, the path of the data from their application to, let's say, your logging
or tracing or metrics database, that's internal and doesn't necessarily make a difference for a
user. So as we move from our legacy, or let's say our previous, pipeline to OpenTelemetry, that
is a smaller change for a user compared to the fact that we are at the same time moving to new
databases. So we are moving from, let's say, Graphite to a Prometheus-compatible database,
we are moving from Elasticsearch to Loki. That is a much bigger change, because then it affects
all the Grafana dashboards, all the alerts, and people need to learn how to use the new tools effectively.
So what we learned is that
we need to prepare training materials for engineers, we need to share best practices, provide recipes for how to solve certain use cases that existed in previous tools with the new tools, communicate a lot, announce changes, help
people prepare for it, and also explain to the company and kind of get them excited.
Then they will get people around them excited, and then it's easier to get adoption across the company.
Because we ourselves, for example, we can reach a certain amount of people,
but if there are ambassadors across the company also reaching out to their
colleagues, then you can spread the word and you can build interest.
And then it will go easier and then people will be more interested in meeting you halfway.
This reminds me so much of community work that you have to do in any type of open source
project, or I guess in any type of community, because you might be one, two people who have a great idea,
you put out a new project
and then you cannot assume that just two people
can basically change the whole world
and convert them into adopters.
You need to start with the adoption
with some teams that really see the value
and then you have to convert them into your advocates.
You have to make sure that the way you scale
is through other people sharing your passion and sharing their stories.
I think that's a big point.
And also what's so important is you talked about creating educational material,
keeping them updated,
providing newsletters, sessions, office hours.
I'm not sure what you're providing as a team to make sure that your engineers have everything
that they need in order to become successful
as you as an organization are transitioning over
to the new way of observing your applications.
That's a head shake, yes.
So I might have missed it earlier,
because this leads into my next question,
but I think I missed it earlier.
So you're collecting the OpenTelemetry data.
What is that being fed into?
Is that something you built,
or are you using something like,
I don't know,
any third party, to ingest and consume the OpenTelemetry data?
Or is that something you built?
Yeah.
So currently we are feeding into
like a SaaS vendor for tracing. We are also feeding into
also a SaaS vendor for a Prometheus-compatible database, but we are also running
Grafana Mimir ourselves. Well, not on-prem, we run it ourselves on Amazon Cloud. And for logs,
we're also running Loki on AWS.
Okay.
Sorry, go ahead.
Yeah, that's like a new stack. The old stack is kind of
completely different, but
we are moving to this new stack.
Okay, so most of the work you're doing is
more on the ingestion as opposed
to the consumption of the data, correct?
Like you have to get everybody adopting OTEL.
I was just curious how many, like you guys have a very large, you know, going back to the idea of Andy's question earlier of, you know, what advice for people, right?
How big of a team does it take to execute what you're doing?
Is it like you and one other person supporting the entire
organization? Is it sort of part time that here and there you get things going? Maybe you're more
busy. But for people who are looking to, you know, maybe they hear this like, hey, we're going to
start it today. What are they getting themselves into in terms of time commitment to do this?
Is it something that's, yeah, I guess I'll leave the question there.
Yeah, there are like... So in total, at Booking, there are a lot of people working
on observability. Like for example, our org is around 20-25 people. But you could say there are
other people working on it as well, because it kind of touches everybody and every application
needs some amount of observability.
Open telemetry is not the only thing we are doing, so not everybody is working on it.
The team I'm on, as you rightfully mentioned, is about ingestion.
So we are predominantly working on that.
We are like four or five people mostly at one time and we are not only
working on OpenTelemetry, because we also have other things.
So I'd say you can start with one, two people building a POC, and actually connecting everything is not hard, because you have the recipes online.
There is a Helm chart; everything is there, everything is ready.
You basically just put a few ready YAMLs and Helm charts on Kubernetes and it will already work
right out of the box. Of course, then you start to tweak and tune it to your environment, to your
context, add things that you need. And then once you start actually integrating it with real data,
for example, you need to receive data in different formats.
You need to figure out all the security integrations.
You need to figure out the data formats
and integrations with client applications that will take much more time than just
setting up the OpenTelemetry pipeline. Setting up the OpenTelemetry pipeline,
that's going to be something that you could just get going after reading online some YAMLs
from the official website.
It's more about fitting into the organizational needs at that point, then.
It's kind of cool because, Andy, if you think back, and obviously, Anton, you saw the earlier
days of OpenTelemetry, it seems like it was a lot more difficult in the early days.
Right, I mean, it makes sense. But now, as you said, there are all these recipes and all this other
stuff that you can use to just get going on it and running it. It sounds like
there'll be that hurdle in the beginning for all the, for lack of a better word, bureaucracy
you have to build around it just to make it compliant within the company, but once you get past that,
it's sort of,
I don't want to say strictly maintenance, but it's
not full-time, as you said. And then it's the trick of getting people
just to adopt and start using it.
But you could say that about any tool, right?
For anything.
Hey, we've got this new tool.
Use it.
All right, so it doesn't seem like it's too high of a hill to climb.
Yeah, definitely.
I think the community is doing a great job
making it very easy to start with.
Great.
Hey, Anton, a quick question on your team
and especially the responsibilities.
You mentioned earlier you currently have about 300 services using OTEL.
Do you as a team, as a platform team,
provide them and also operate their collectors and everything?
Or do you just provide the guidance and the description and maybe some self-service
for them to stand up the whole observability stack,
including the collectors?
I'm just curious because I've seen organizations
doing it differently where some are just like,
hey, you just instrument your app
and you just point us to a collector
and we make sure of scaling the collector
and then taking the data pipeline from there. But I've also seen other models where teams also own
the OpenTelemetry Collector, like part of that pipeline, before it gets stored into
a backend.
Yeah, right, so there are like some teams who were very early adopters of OpenTelemetry
before it became like a company thing.
And some teams were running their own collectors before there was a centralized solution.
And also we had some other sort of parts of the business where they were actually running OpenTelemetry before the bigger part of the business was doing it.
In some places, people were running the OTel Collector as a sidecar in their apps' Kubernetes deployments, but now we are
converging towards a centralized solution where the application owners
would not need to do that. Instead there would be either a BKS deployment or a DaemonSet, and that will
be owned by a centralized team.
Or the application just writes, for example, logs to stdout, or metrics are exposed on a Prometheus
HTTP endpoint, and then we would take the logs from the files written by Kubernetes from
stdout, or we would take metrics from the Prometheus endpoint, and then all of that will be owned by
central teams. So the goal for us is that application owners don't need to become
maintainers of observability systems, don't need to learn the configuration of OpenTelemetry,
own the configuration, and handle different edge cases.
So we want basically just application owners to send the data somewhere, and then we'll take it from there and own the whole pipeline.
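A minimal sketch of what that model can look like from the application owner's side, assuming Go and the standard Prometheus client library: the app only writes structured logs to stdout and exposes a /metrics endpoint, and a centrally owned collector picks both up. Metric and field names are made up for the example.

```go
// Sketch of an app that only emits data: JSON logs to stdout and a
// Prometheus /metrics endpoint. The centralized pipeline owns everything else.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var requests = prometheus.NewCounterVec(
	prometheus.CounterOpts{Name: "app_requests_total", Help: "Handled requests."},
	[]string{"path"},
)

func main() {
	prometheus.MustRegister(requests)
	logger := json.NewEncoder(os.Stdout)

	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		requests.WithLabelValues(r.URL.Path).Inc()
		// One structured log line per request, written to stdout for Kubernetes
		// to capture; a collector tails the resulting log files.
		logger.Encode(map[string]any{
			"ts": time.Now().UTC(), "level": "info", "msg": "handled", "path": r.URL.Path,
		})
		w.Write([]byte("ok"))
	})
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```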
Yeah.
And I think that's also the more frictionless thing to do because then people don't need
to worry about things
that they shouldn't worry about
because really what you provide is observability,
is a self-service.
So send me the data to that endpoint
and I'll take care of it.
But do you then, again, this is stuff that I see
on a regular basis, do you provide feedback
to those teams that consume your service about,
hey, your logging volume has just tripled. We see bad
logging patterns. You're logging things with the
wrong log level. I don't know. Do you provide any
guidance and feedback? Yeah, so
when we know just certain things, we can reach out to user teams. Of course,
we can't kind of investigate and follow up on each and every issue.
The bigger the scale, the less we can kind of control all of it.
What we do first, about the volume and the metric explosions and the log volume explosions:
we give everybody a quota.
So every service has a quota.
And if you go above the quota, then data is going to be dropped.
So, for example, we can notify users about that happening.
But even if users don't get notification from us,
they would either notice and reach out to us and ask for, say, increased quota,
or we would investigate what else they can do,
or they're not looking into it and they wouldn't know.
So I think both situations are happening,
depending on the team and depending on the communication.
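For illustration only, here is a tiny Go sketch of how such a per-service quota check might be enforced inside an ingestion pipeline; the numbers and types are invented and this is not Booking.com's implementation.

```go
// Sketch of a per-service byte quota: data above the budget for the current
// window is dropped, and the owning team could be notified.
package main

import (
	"fmt"
	"sync"
	"time"
)

type Quota struct {
	mu      sync.Mutex
	limit   int64 // bytes allowed per window
	used    int64
	resetAt time.Time
	window  time.Duration
}

func NewQuota(limit int64, window time.Duration) *Quota {
	return &Quota{limit: limit, window: window, resetAt: time.Now().Add(window)}
}

// Admit returns false when the service is over quota for the current window.
func (q *Quota) Admit(bytes int64) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if time.Now().After(q.resetAt) {
		q.used, q.resetAt = 0, time.Now().Add(q.window)
	}
	if q.used+bytes > q.limit {
		return false // drop, and optionally notify the owning team
	}
	q.used += bytes
	return true
}

func main() {
	q := NewQuota(10_000, time.Minute) // 10 KB per minute for a demo service
	fmt.Println(q.Admit(8_000))        // true
	fmt.Println(q.Admit(8_000))        // false: over quota, would be dropped
}
```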
For incorrect formatting, it's kind of also like that.
So at the end of the day, we as an ingestion team,
we can't look at each and every log line and look, oh maybe this is a multi-line message that is actually not joined together, or maybe that's a warning and not an info and it's marked as an info
and things like that. So at the end, if an application
owner is looking at these logs and he notices something is wrong, then he would initiate
like an investigation and in most cases, let's say they may need our help or may not need
our help. If they need our help, we'll also support. But if a user, for example, doesn't
notice an issue, then most probably we will also not notice it
just because of the sheer scale, because those things require eyeballing, and we can't
eyeball, let's say, hundreds and thousands of applications one by one. So if you notice
something, then we would follow up, but it's not a goal to double-check that each and every service is formatting the logs correctly.
So one additional question that I have,
because this comes up on a regular basis for me,
is as you're pushing the power back to the engineers
to build in their own instrumentation,
they can basically control what they capture.
Is there something that you enforce or help them to make sure
they're not capturing the wrong data?
So for instance, personal identifiable data,
any confidential information, credit card details, something like this.
Is this something that would fall into the same thing
what you just explained earlier in a way that you obviously look around,
but it's not that you can look
into every single trace and every single log.
Or do you have some things in place where you validated instrumentation is correct and
they're not capturing things that they shouldn't capture?
Yeah, that's a big challenge. Basically, at the moment, the responsibility is on application owners at the end of the day, but again, that's a best-effort scenario.
For example, we might have some regexes for emails, we might have some regexes for credit card numbers,
but that's not going to be a super exhaustive list. And also, the more we add, the more expensive it gets on the CPU side.
So what we are looking to add in the future is some limited, like, credit card number scrubbing and things like that for applications which are running in, let's say, restricted environments like SOX-compliant or PCI-compliant.
So for those, we will selectively enable some scrubbing. But other than that, you could still have
applications running in any environment kind of leaking sensitive information.
There is no kind of generic solution for it. It's more on application owners. But
there is also a question of application design: in most cases an application
doesn't need to use PII directly. Instead it could use some IDs from internal systems, and that's kind of the approach that we are
also taking in the company: a lot of data you can't just easily get access to,
and you can't work with it directly. Instead you need to use kind of IDs from
specific, specially built storage systems for that. Cool.
I'm just curious because these questions come up.
How do you make sure that the right data
and not the wrong data makes it through all of the data pipelines
until it ends up in a backend system
where people who shouldn't have access to it have access to it?
It's a tough problem to tackle
and also as you said, it's a trade-off
between how much do you put on your data pipeline,
how many regexes can you add,
especially as the volume of ingest grows at some point.
You need to make a trade-off as well.
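As a hedged illustration of that best-effort scrubbing, here is a small Go sketch with two redaction regexes (emails and card-like digit runs). Real rules would need to be far more careful, and, as noted above, every added pattern costs CPU on the pipeline.

```go
// Sketch of best-effort PII redaction applied to a log or attribute value
// before export. Patterns are illustrative only.
package main

import (
	"fmt"
	"regexp"
)

var (
	emailRe = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)
	cardRe  = regexp.MustCompile(`\b(?:\d[ -]?){13,16}\b`)
)

// Scrub replaces likely PII in a value; anything the patterns miss gets through,
// which is why this stays a best-effort safety net, not a guarantee.
func Scrub(value string) string {
	value = emailRe.ReplaceAllString(value, "[email-redacted]")
	value = cardRe.ReplaceAllString(value, "[card-redacted]")
	return value
}

func main() {
	fmt.Println(Scrub("payment failed for jane.doe@example.com card 4111 1111 1111 1111"))
	// prints: payment failed for [email-redacted] card [card-redacted]
}
```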
Hey, so now you've been on your journey.
Well, you've been at Booking.com for 10 years.
You've been on the OpenTelemetry journey for about three years.
To kind of conclude this podcast,
where do you see yourself, if that's even possible,
where do you see yourself in a year and two years from now
as it comes to providing OpenTelemetry-based observability
to your organization?
Yeah, I think for us, the ambition at the moment is maturing the system, driving adoption
much higher, eventually making it kind of the only observability stack available. And the goal for next year is going to be to cover all the different
cloud-based runtime environments as much as possible to have first-class kind of support
for Lambda, first-class support for ECS on Fargate, while also maintaining first-class support for Kubernetes and ensuring that the user experience for all three signals,
like metrics, logs, and traces, is kind of very good.
So for us, it's about maturing that
and ensuring that the user experience will not suffer
as we move from something that you've been building for, say, 15, 20 years,
and that up until now
is super mature, like many problems were solved over the years, and kind of replicating that
maturity on the new observability stack.
Cool. This almost sounds like I should have asked the question,
what is your wish list for Christmas? Because I just noticed that this episode
will actually air
on the 23rd, I believe,
of December,
which is just
two days before Christmas.
It's a good future
to have in mind,
and what you want to achieve.
Did we miss anything, Anton?
For people that are listening in,
is there anything else
that we missed to ask?
No, I think for anybody who's interested in the topic, I would definitely encourage people to try out OpenTelemetry.
I mean, there are a lot of good solutions on the market. There are a lot of
very mature kind of systems which
have lots of functionality and their own pros and cons. But what is, I think, potentially very
alluring with OpenTelemetry is that you have a lot of community support.
It's very vibrant. You have super fast kind of evolution. For example,
just last year, I think logs were not officially stable yet, even though
support was already there, but everything is moving very fast. And now profiling is being
added to OpenTelemetry, like Elastic kind of contributed the profiler.
And I think that the OpenTelemetry story,
although it's already quite capable,
I think it's just starting.
And I think every year there's going to be
major innovations happening
and major new capabilities added.
So as you start using it,
you're just going to get a lot of cool new stuff
and that is going to be very easy to add to your stack.
Yeah, and I can only echo that.
I was just at KubeCon in Salt Lake City, and observability
and OpenTelemetry, with the whole ecosystem that was built around it
or is currently being built around it, had just a very big spotlight at KubeCon.
They have their own Observability Day, which is a co-located event that happens one day
prior to the main conference.
Many observability-related talks.
And folks, if you are listening to this and you want to gather and connect with other
folks that are interested in observability, there's many conferences around.
We have just been at the Observability
and SRE Summit in London.
KubeCon London is also coming up.
I think it's the first week of April.
So for everybody that is in Europe,
it might be an easier destination
to get to.
I'm pretty sure there's
many other different
local events, meetups,
Kubernetes, KCDs,
Kubernetes Community Days,
or Cloud Native Days,
where you can hear all these stories
and also connect with the community
that is actually building these frameworks and tools.
So basically just keep your eyes open
and observe where the local meetings are?
Exactly.
Look at that.
We should write an OpenTelemetry receiver that can scrape all of these conference websites
and then make you aware of new upcoming conferences and CFPs and speakers. I don't know, emitting
some metrics about the number of speakers on a conference for observability.
And then we need like an exporter which would send emails to people with like digests.
Exactly, yeah. See, that's a Christmas project folks, if you're bored over Christmas.
Get working on it, Andy.
I expect it by January 1st.
Yeah.
No, very cool.
Brian, any final words from you?
No, I was mostly observing this because I'm never this deep into OTEL, right?
So I was being more of an observer.
But it's fascinating to hear what you've been able to accomplish with all this, Anton, especially
where the team started with the Perl setup and everything,
and how, as the organization
changed and your needs
changed for observability, you didn't just throw,
I don't want to say throw in the towel, because heck, I'm one of the vendors, right? But you didn't just say,
well, I guess we have to go to a vendor now, right? You said, you know, we've got
a good thing going here. Let's take some time and effort,
especially in the earlier days of OpenTelemetry, where you were using
the alpha code and all.
But you stuck with it, and you've got this fantastic practice that you've built around it.
So it's inspiring to see what can be done with it.
And I think the future is definitely bright with open telemetry.
And as long as people like you and your team and others that are really pushing for large adoption keep going with it,
it's going to go some really wonderful places. So thank you for, you know, just using it and,
um, keeping it, I shouldn't say keeping it alive, it's not like it's in danger of dying, you know
what I mean, but, like, you know, being an inspiration on OpenTelemetry. That's a better way to say it.
Awesome.
All right.
Good.
Then I would say, folks, thanks for listening in.
And anyway, yeah, happy whatever.
I think this is the last episode of the year.
Okay.
Right?
So I'm looking forward to 2025, and also obviously looking back to all these great episodes.
It was a great end-of-the-year episode, and I'm sure there will be many, many more to come on topics that are adjacent to OpenTelemetry.
Yep. And thank you to all our listeners for a wonderful year, and look forward to the next year of Pure Performance.
All right, Anton, thank you very much.
It was a pleasure to meet you.
Thanks, Brian and Andreas.
Thank you.
Bye-bye. Thank you.
Bye-bye.
Peace, bye.