PurePerformance - Adopting Open Observability Across Your Organization with Daniel Gomez Blanco

Episode Date: April 24, 2023

Organizations that experience Monitoring Data Obesity – having too many arbitrary logs or metrics without context – are suffering twice: high cost for storage and not getting the answers they need! OpenTelemetry, the cloud native standard for observability, solves those challenges and therefore sees rapid adoption from both startups and established enterprises.

In this episode we have Daniel Gomez Blanco (@dan_gomezblanco), Principal Software Engineer at Skyscanner and author of the recently published book Practical OpenTelemetry. Tune in and learn about the latest status of OpenTelemetry, lessons learned from adopting OpenTelemetry in a large organization, considerations between metrics and traces, the difference between statistical and tail-based sampling, and much more.

Here are the links we discussed during the episode:
Chat I had with M. Hausenblas on his podcast the other day: https://inuse.o11y.engineering/episode/meet-daniel-skyscanner
Link to QCon talk (although I believe the video won't be made available till later in the year): https://qconlondon.com/presentation/mar2023/effective-and-efficient-observability-opentelemetry
Recent InfoQ interview covering the talk: https://www.infoq.com/news/2023/03/effective-observability-otel/
Video on a talk I did with Ted Young a couple years ago during our tracing migration to OpenTelemetry: https://youtu.be/HExcLWA2b8M
Talk at o11yfest 2021 on our tracing migration to OTel: https://vi.to/hubs/o11yfest/videos/3143
Mastodon: https://mas.to/@dan_gomezblanco

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me my co-host Andy Grabner, who's making fun of my mustache. Andy, I gotta tell you, I had another dream. Another nightmare or was it a pleasure dream? It was more of an interesting dream. So yeah, it was not a nightmare. I wouldn't call it a pleasure dream because you were in it, and that's closer to a nightmare, I'd say. But there
Starting point is 00:00:53 was nothing actually terrible. So I was outside and I was looking through my telescope at the moon, and suddenly I see a little Andy Grabner running around on the moon, without a helmet even. I yelled up to the moon, Andy, what are you doing on the moon? And you yelled back, and somehow I could hear you, and you were like, oh Brian, I am up here on the moon. I'm like, yeah, I can see that, Andy. How did you get there? What's going on? He said, well, everybody's going to the moon now, everyone's making their own rocket ship. So I made my own rocket ship and I went to the moon.
Starting point is 00:01:28 And I was like, yeah, but what are you doing on the moon? Are those my kids with you? So you still have my kids from two dreams ago. And I was just really confused as to how you could even breathe without the helmet. But really, you know, what dawned on me is, as I was observing you through this telescope on the moon, I could somehow communicate with you, get information back from you, and
Starting point is 00:01:51 was just astounded that you did this yourself, and you made your own rocket ship, and you got there. And then I woke up, and I was just confused about why I keep having weird dreams about you, Andy. Did I leave a trace on the surface of the moon? No, you magically left no footprints.
Starting point is 00:02:08 No. It was, yeah, because I don't know. You're just a magical being, Andy. You know, but for some reason, I think you made this up. I don't know why. Well, there was, I see where you're going. I see where you're going. Yeah, yeah. I messed up your trace,
Starting point is 00:02:24 but sorry about that. That's okay. You're too smart for me. Well, that's a good one. And hopefully, if Roman is listening, because I talked with him the other day.
Starting point is 00:02:34 He messaged me, which inspired me to... Oh, come up with a dream. Okay, so this is your dream to start again. But now, enough with the dreams. Back to reality.
Starting point is 00:02:42 Back to observability. Back to tracing. Back to observability. We didn't even go there yet. Yeah, well, you observed me on the moon. Back to reality. Back to observability. Back to tracing. Back to observability. We didn't even go there yet. Yeah, well, you observed me on the moon. I did. We have an awesome guest today. He's not only, I think he's not only an expert in the field that we're talking about today.
Starting point is 00:02:57 He's also a book author. And just recently published the Practical Open Telemetry book, which everybody should definitely take a look at if you're interested in open telemetry. But now I think it's really time that I hand the mic over to Daniel Gomez-Blanco. Welcome to the show, Daniel. Thank you so much for being here and do us the favor, please introduce yourself to the audience. Hello. Yeah, thanks for inviting me to your show. My name is Daniel. I'm a principal engineer at Skyscanner and I'm the tech lead for observability and our observability strategy long term. And recently I've been just leading a really exciting project that has got to do with us
Starting point is 00:03:41 adopting open standards and trying to rethink how we do observability at Skyscanner. And what do you think about the strange dream story? I can see where that was going. I was getting the tracing part of it. Yeah, I missed the tracing part. I was still on the moon. Daniel, I think we, as I said, we will also leave all the links.
Starting point is 00:04:09 So folks, if you are interested in OpenTelemetry, it's been a topic we've been discussing for quite a while now, off and on with different experts in the field. But I think we can never stop educating people about it because it still needs a lot of education. Like Brian and I, we've been talking about performance engineering and performance patterns for so many years
Starting point is 00:04:30 and still we look at applications and are still wondering, does anybody listen to us because they still make the same mistakes. So for me, one of the ways I actually got in touch with you is because I found some of your presentations that you recently did at QCon, SRECon. And for me, it seems that we've been in observability for... I had my 15 years anniversary, so we've been doing this for a while.
Starting point is 00:04:59 Brian, you've been around also for very long in our industry. And for us, it's just with the tracing and the company we work for, it's nothing earth-shattering new. We've been doing this for a while. But my question to you, Daniel, is why all of a sudden do you think we see such a huge uptick of observability? Everybody talking about distributed traces. Why the sudden new hype about this whole thing? I think if we go back, I guess, 10 years, I think it's something that perhaps should have happened before when we started to adopt cloud native deployments, microservice architectures.
Starting point is 00:05:37 There are way more components that are moving pieces in these architectures. I think over the last 10 years, we've been adding more and more data to our systems in a way that sometimes, if we don't think about tracing, we think about metrics and logs, for example. If you take a monolith that you had 20 years ago and you try to break that down into microservices, but you're still using metrics and logs, then you start to see that the data that you produce starts to skyrocket. And then your engineers can no longer make any sense of it, really. So I think there was a case where companies that got themselves into that position
Starting point is 00:06:15 see that the amount of data they've got is too much. It's not useful. There's no actionable data most of the time. And I think it's been more like the moment that organizations start to look at that and the value that they can get from data, that's when they think, okay, so we need something better, something that can describe systems better and allow us to debug regressions faster.
Starting point is 00:06:41 So I think it's that, it's just basically we're getting more deployments, we're getting faster in the way that we deploy. We're deploying thousands of times a day, where before maybe some other companies in the past used to deploy once every two months. Now we do it dozens or hundreds of times a day. And then with more moving pieces, you need something better. You need observability. Yeah, so basically things got out of hand almost and then we needed to find a way to better control, better specify and standardize the way we do it.
Starting point is 00:07:14 I mean, I actually thought you went into a different direction when you started talking about in the old days, right? I think we had obviously simple systems and I remember, Brian, when we started, we started observability and kind of distributed tracing back in the days when the world was dominated by Java.
Starting point is 00:07:32 I mean, it was... And .NET, but yeah, it was dominated by Java. Yeah, but in the beginning, everybody in our industry started building agents for Java and with that, you covered a huge ground, right?
Starting point is 00:07:46 And then maybe added .NET. And then Daniel, what you said, I thought was really interesting because now with kind of containerized systems and you're having Kubernetes and everybody can basically pick from not hundreds, but many, many different runtimes that fit their needs. It's impossible to write, I guess,
Starting point is 00:08:05 agents for all these technologies. And therefore, I really like that we have come to a situation where we say, okay, we need to standardize, first of all, what we collect, how we collect it, but also, I think, need to give the power back to the people that actually build these systems
Starting point is 00:08:21 and build these frameworks and applications and put their instrumentation in instead of relying on some vendors to figure out a way how to reverse engineer runtimes. Vendors or even the engineers themselves. If you're only relying on open source software, it may be up to the engineer that is using a particular library
Starting point is 00:08:43 to add the telemetry to it, because that library doesn't produce any metrics. And it makes sense, because if you're the author of a library, what do you use? Do you use Prometheus? Do you use StatsD? It's very difficult for a library author to decide on what SDK, what metrics client to use, for example. But when you think about open telemetry, you've got libraries that can describe themselves
Starting point is 00:09:06 and then have the user of the library decide how they want to export that data. Which is what this is all about. When you've got all those different moving parts and you've got systems and libraries that describe themselves in the way that the author intended, there's no need for someone else to come after and try to add telemetry on top. So that's an interesting feature of open telemetry.
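To make that separation concrete, here is a minimal Python sketch of the pattern being described: the library depends only on the OpenTelemetry API, and the application that uses it wires up the SDK and picks an exporter. The library name, span name, and console exporter below are illustrative, not anything mentioned in the episode.

```python
from opentelemetry import trace

# Inside the library: API only, no SDK, no exporter, no vendor dependency.
tracer = trace.get_tracer("acme.http.client")  # hypothetical library name

def fetch(url: str) -> str:
    # The library author describes the operation the way they intended.
    with tracer.start_as_current_span("fetch") as span:
        span.set_attribute("url.full", url)
        return "response body"  # placeholder for the real work

# Inside the application: decide how the library's telemetry gets exported.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

fetch("https://example.com")  # the span goes wherever the application configured
```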
Starting point is 00:09:40 What I would be interested in, so we talked about this a little bit in the preparation for this. So you have the book out there, Practical OpenTelemetry, and writing a book on a topic that is so new and is so fast moving. I mean, isn't it, how do you, what is covered in the book that won't age? Well, it's three pages long, right? It's three pages long and then he just publishes a new addendum. It was challenging.
Starting point is 00:10:08 From the moment that I started writing the book to the moment that I finished, things moved. Especially in the logs, on the logging side, which is the area of OpenTelemetry that is still experimental in some ways, not fully stable, at least from the point of view of APIs. But yeah, so there are parts of the book that are stable, like the metrics API, the tracing API, the baggage API, all those parts of the OpenTelemetry API are stable.
Starting point is 00:10:40 The SDKs that come with them as well are stable, and OpenTelemetry does provide some really strong stability guarantees on these libraries. So you can take long-term dependencies on it. And in a way, my book is taking long-term dependencies on these APIs. I mean, it does, in the intro, mention at what point it was written and to what point it was compatible with. But yeah, so the stabilization of metrics and traces
Starting point is 00:11:08 was probably what made, at that point, it made sense to create a book. Now we've got the logging API that is a lot more stable than when I started writing it. But yeah, so I think if you look at the concepts around the metrics API and the concepts around traces, all those will be stable. How you integrate with OpenTelemetry collectors will be stable as well. And the protocols that are used to publish and to export this data are all stable. And then there is a bit in the book that is related to OpenTelemetry, but the start and the beginning of the book is related to observability in general and how you can roll out an observability function
Starting point is 00:11:49 within your organization and how to make engineers adopt best practices. So there are bits in it that are not specific to OpenTelemetry as well. And that's actually a perfect segue almost, because this is what I wanted to talk about. I mean, obviously for OpenTelemetry itself, there's a lot of information out there online in the communities. Folks, if you're listening in and you want to learn more about OpenTelemetry and you don't know where to start, we put a lot of links in the description of the podcast, and obviously
Starting point is 00:12:18 you know, start with the book Practical OpenTelemetry. But as you said, many organizations are currently trying to figure out, what does this all mean? You just said how to build an observability function. How do you convert from maybe something that you used to do to using open standards now? Where do you start? What are best practices? How do you get started? Looking at your own history at Skyscanner, can you share some of the thoughts that went into what was a good starting point?
Starting point is 00:12:50 What type of people did you need? What type of practices? Did you set some goals to say, hey, in a year we need to be there? Or anything that could be really helpful? Yes, we had a really complex... At Skyscanner, we always relied on open source libraries and open source tooling for anything related to telemetry. We were integrated with some vendors in the
Starting point is 00:13:13 past, but in general we tried to basically run everything in house. And when we started to adopt OpenTelemetry, we started with tracing. Now, two reasons for that. The first one being it was the most stable signal at the time. The second one being that we were already relying on OpenTracing. So, if you recall, OpenTelemetry is the merger of OpenTracing and OpenCensus, two other projects.
Starting point is 00:13:41 So if you're a user of OpenCensus or of OpenTracing, you're going to have a really easy migration path. And this is intended, but as well, it shows the value of the API design of OpenTelemetry. So you've got that separation between the API and the implementation. OpenTracing already had that same design. And for us to basically adopt OpenTelemetry, we could sort of do it pretty much under the hood. So you can just apply a shim,
Starting point is 00:14:10 and that says anytime you call the OpenTracing API, that will be translated into the OpenTelemetry API call. So there is a lot of, as well in the industry, in the tech industry, you've probably had guests on your podcast as well that talk about platform engineering as a discipline, right? At Skyscanner, we have been invested in platform engineering for a long time, and when we were approached
Starting point is 00:14:35 with the decision to adopt OpenTelemetry, we already had things in place, like, for example, a set of internal libraries that we can roll out across the company, and they contain some standard and default config for libraries, right? So for us, migrating the tracing part to OpenTelemetry was pretty much a minor version bump of that internal library that said, well, now you configure the OpenTracing API like this. And then people could start to basically gradually move to OpenTelemetry.
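As a rough illustration of that shim idea (a sketch, not Skyscanner's actual internal library), the Python opentelemetry-opentracing-shim package exposes an OpenTracing-compatible tracer that creates OpenTelemetry spans under the hood, so unchanged OpenTracing code keeps working; the module path and function name may differ slightly between versions.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.shim.opentracing_shim import create_tracer

# Configure OpenTelemetry as usual (exporter wiring omitted for brevity).
trace.set_tracer_provider(TracerProvider())

# The shim implements the OpenTracing interface on top of OpenTelemetry.
ot_tracer = create_tracer(trace.get_tracer_provider())

# Legacy code keeps calling the OpenTracing API...
with ot_tracer.start_active_span("legacy-operation"):
    pass  # ...while the span is actually recorded and exported by OpenTelemetry.
```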
Starting point is 00:15:10 Now, that was the easy part, let's say. You've got a new technology, you've got tracing, you've got support for open telemetry, and everything is all rainbows and it's all nice. Now the hardest part I think to adopt is when you've got a legacy system or a system that's been running for years and their engineers are used to using metrics and logs and that's it, no tracing.
Starting point is 00:15:39 So then how do you go about instrumenting a service like this one? And this is where there's a lot of education that's needed, a lot of enablement as well, but education on what is the best signal to use for a specific concept. Because when you've got something like OpenTelemetry that allows you to use any of these three main signals, like traces, metrics, and logs, how do you know which one to use for something?
Starting point is 00:16:05 And this is where I wanted to take my book into that sort of aspect of, okay, you've got all these things, but if you want to, I don't know, look at requests for a particular service, do you add it as a metric? Do you add it as a trace? What sort of information do you put in each of the signals?
Starting point is 00:16:22 And then basically that's the part where we're still going through now in SkyScanner is adopting metrics and then having to review some of the metrics that were produced by your application. Some of them could have been really high cardinality metrics because there was nothing else and they were used to debugging that.
Starting point is 00:16:42 Maybe you can stop producing those really high cardinality metrics and then link to traces, which have that really granular view that you're interested in for debugging. And then maybe drop some of the logs and also rely on tracing. So that sort of re-instrumentation of services is quite really, it's challenging for teams, but there is a lot of return on investment here on using each signal for its purpose. So I hope I don't divert now, but I think you just triggered something in my head. Talking about what to choose, metrics and traces.
Starting point is 00:17:20 And I give you a practical example, the metric of response time. We are interested in response time of transactions. So I think there's two ways we could do this. We are capturing a metric, and for every single request that comes in, we then say, hey, this is the response time. And then we have high cardinality and we can calculate all of your min, your max, your averages, your percentiles, and all that stuff. Or you don't choose a metric for this, because you say, why do I need a metric?
Starting point is 00:17:50 Because I have it on the distributed trace anyway, because if I start a trace at the beginning of the transaction, then the duration of the trace basically gives me the response time as well. So that would be, I guess, an argument to say, why capture things twice if you already have it on the trace? Yes, I think there is a reason for that. I think there is as well a view of, okay, you only need tracing. We've got tracing now and we only need tracing.
Starting point is 00:18:16 I personally disagree with that a little bit because when you've got metrics, when you've got a service that is producing thousands of spans per second, but you've got metrics that can aggregate a lower, sort of like a lower cardinality, the pipelines that you use to export these metrics or the retries that you can have on a specific data point,
Starting point is 00:18:40 or even if you use Prometheus, for example, or cumulative counters, the way that the metrics are designed, they will produce a more stable signal than traces or spans. And if you've got sampling applied on top of spans, which I think we can go into later, as a way to only keep the data that you care about, you can aggregate those metrics, have all the events counted in one aggregated data point, and then link to individual examples that are in traces that will allow you to debug further, deeper down, and save costs or save transfer payloads and so on. Basically optimize what you're actually storing, what you're actually sending over the wire.
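A hedged Python sketch of that "both signals" idea: every request contributes to a low-cardinality histogram that stays stable and cheap, while the span keeps the per-request detail and may later be sampled away. The service, instrument, and attribute names are illustrative only.

```python
import time
from opentelemetry import metrics, trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service name
meter = metrics.get_meter("checkout-service")
request_duration = meter.create_histogram(
    "http.server.request.duration", unit="ms",
    description="Request duration, aggregated in-process before export")

def handle_request(route: str) -> None:
    start = time.monotonic()
    with tracer.start_as_current_span(route):
        pass  # the real handler work would run here
    elapsed_ms = (time.monotonic() - start) * 1000
    # Low-cardinality attributes only; high-cardinality detail stays on the span.
    request_duration.record(elapsed_ms, attributes={"http.route": route})
```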
Starting point is 00:19:06 Well, thanks for the answer, because I was hoping you were going this direction, because that's exactly what we have seen in our experience, right? Because as sampling at some point comes into play, you would basically lose a certain level of visibility and accuracy. And I think this is something where I believe what you're doing with the book, what you're
Starting point is 00:19:44 doing in the community, what we're doing here with the podcast is so important that we make people aware of this because I think a lot of people are now entering the observability space and may not have thought about these implications. Because on a Hello World app, everything works perfectly. But if you look at, I guess, in your environment, at Skyscanner, you have high transaction volume, and then the question is, do we really need to capture?
Starting point is 00:20:11 Can we even capture? Is it cost-efficient to capture every span? Yeah, it's not. I can give you some details off the top of my head. Because you think about Skyscanner, we have an average of over 100 million monthly active users, right? But then when you think about the internal systems,
Starting point is 00:20:35 because it's a travel search engine for flights, hotels, car rentals, every time that someone searches for something, you need to call multiple partners, multiple airlines. So it's basically like a fan out sort of thing. So the amount of traffic that goes on internally in the systems could be hundreds of times higher than you see when you normally interact with that.
Starting point is 00:21:02 If you've got a service or a system where it's like a normal user interaction with the system itself, we here have thousands of partners that could be called. So in terms of we produce around 1.8 million spans per second that goes through our collectors. Now that is a lot of data and a lot of data that most of that is probably not that useful for debugging. Because it is normally,
Starting point is 00:21:29 it would correspond to successful requests, or to requests that completed in an acceptable amount of time. So the ones that are not really that useful for debugging. You want to keep those in the metrics, you want to know about the general state, but then, you know, not that useful for debugging. So yeah, so at Skyscanner, we use
Starting point is 00:21:49 tail-based sampling, which is where, in traces, you can look at the whole trace and then basically make a decision looking at the whole trace instead of just one particular service. And we tend to keep around between 4 and 5% of all the traces. And this is enough. This is enough because it contains all the traces that contain a single error, it contains the slow traces, and a random percentage of the rest. And if we need to store more, we can store more. But generally, that's what we keep. And we've seen as well teams that migrated from logging to tracing and got better observability. It's not like they stopped doing logging,
Starting point is 00:22:33 but, for example, they stopped debug-level transactional logging, and they started to rely on tracing, and they didn't just get better observability. They reduced their operational costs as well, which is a benefit on both sides. I think there's another big aspect with the metric versus the trace there. It's just the simplicity. I mean, metrics, if you're going to get a response time from the traces,
Starting point is 00:23:00 you're basically going to turn that into your own metric, which means now you have to add some calculation layer to it. And you're also going to not have every single data point on a chart or something. So you're going to do maybe a one-second, 10-second, 30-second resolution, and the metric's already there. It's already done. So why would you need to reinvent that? Yeah, and even with the logging thing too, this comes up all
Starting point is 00:23:27 the time. There's the idea of, well, logs are still relevant, right? People still use logs plenty. I know back in the old days Andy and I used to hear, like, oh, we don't need logs, and then our developers were like, no, we still need logs. And we've had some podcasts about, like, when we started adding logs back into our platform. It's like, well, yeah, people still need logs, but you need logs for certain things. If you have traces, you can get that information from the trace, which you're already collecting. This way you can save on your log storage by saying, what are the things that we can only get from logs, what are the use cases that are specific to logs, and let's keep the logs focused on that, and let's leverage these other things that are going to have a bunch of other rich additional contextual data in it for those other pieces.
Starting point is 00:24:11 I don't need that in a debug log, for instance. Exactly. There's as well another benefit of logs and how they integrate with OpenTelemetry, which is that you may have legacy systems whose libraries are not instrumented for OpenTelemetry or for tracing. But thanks to this correlation and the semantic conventions as well, you can then instrument your logs so that they appear next to your traces. Which is not a use case per se, but it's something that may ease the way
Starting point is 00:24:46 for a lot of people that say, well, I mean, you still get that information from logs attached to your traces, which I think is great, to basically give more context around these anomalies that you can find from tracing data that you couldn't find from logs.
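For the log-correlation point, here is a small hedged example with the Python opentelemetry-instrumentation-logging package: it injects the current trace and span IDs into ordinary log records so a legacy service's logs can show up next to its traces. The exact injected field names can vary by version, and the service and span names are made up.

```python
import logging
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.instrumentation.logging import LoggingInstrumentor

trace.set_tracer_provider(TracerProvider())

# Rewrites the log format so each record carries the current trace and span IDs.
LoggingInstrumentor().instrument(set_logging_format=True)

tracer = trace.get_tracer("legacy-service")  # illustrative name
with tracer.start_as_current_span("partner-call"):
    # Emitted inside an active span, so a log backend can correlate this line
    # with the trace that produced it.
    logging.getLogger("legacy-service").warning("partner quote timed out")
```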
Starting point is 00:25:07 Hey, I've got to jump back to something you said earlier, because, remember, maybe not everybody's completely familiar with all the terminology. You said you're capturing about 4% to 5% of your spans because they contain enough of the errors. You mentioned tail-based sampling. Could you just explain what tail-based sampling means again? Remember, there are some people that may have never heard of it. Yeah, there are generally two forms of sampling that you can do, right? The first one is
Starting point is 00:25:31 probability sampling. So you've got, you can think about logs, for example. The only way that you can do sampling on logs is probability sampling. So you've got a percentage of logs that you want to keep. In the same way that you can apply that to your traces, percentage of traces that you want to keep. Now, you may be able to say, well, if there are error, for example, error level logs,
Starting point is 00:25:59 you may want to keep 100% of them, but only 20% of the debug logs. But then the problem is that when there's an error, you actually want the debug logs. So that's why people don't end up sampling debug logs, it's because they actually need them. So with tracing, when you think about probabilistic sampling, it does give you a bit more, which is that you can actually say, if you want to store a trace, you can store the whole trace.
Starting point is 00:26:26 And that means that spans from every service that a transaction went through. So you're going to keep the whole trace. The way that tracing does it is you can propagate that decision. So in the same way that you propagate a trace, you can propagate that decision. So if you have a server that is accepting a request and says, well, I want to sample 20% of the traces, then when it makes the decision to sample, it can then
Starting point is 00:26:52 propagate that decision downstream. And that's powerful by itself, but it's still probabilistic. You're basically saying, I want to keep 20% of the traces and that's it. With tail-based sampling, it is a bit more difficult to implement because it does require you to have an external component that all your spans for a particular trace go through. So think about you've got three services in a trace, each of them will be producing their spans. You need to feed all the spans through one single replica of something, of a collector or some other way of like,
Starting point is 00:27:29 multiple vendors will provide their own tail-based sampling as well. But you need to send all the spans to one single point, all the spans for a trace. So when one of those samplers, which could be an open telemetry collector, gets a trace, gets the first span for the trace, it starts a counter,
Starting point is 00:27:47 basically. And then it starts to keep all the spans in memory for that particular trace, and when the counter comes to the end, the OpenTelemetry collector, or whatever sampler it is, can say, okay, I'm looking at all the spans in this trace, and I can
Starting point is 00:28:04 say, well, if there was any error, for example, in this whole trace, I keep the trace. If there's no errors, well, then I look at the duration of the whole trace. If it goes over a particular threshold, then I'll keep it. If not, then I'll discard it.
Starting point is 00:28:19 And then, you know, you can as well do a bit of probabilistic sampling there. But the idea here is that you can look at the whole trace, because it has spans from multiple services at the same time, and then decide if you want to keep them or not. So we think that is very powerful, because you end up basically only storing the data that is useful and then discarding the rest.
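To make that decision logic concrete, here is a purely conceptual Python sketch of tail-based sampling as described above. In practice this lives in an OpenTelemetry Collector's tail-sampling processor or in a vendor's sampler, not in application code, and the threshold and keep-rate below are illustrative.

```python
import random
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool

def keep_trace(spans: list[Span],
               latency_threshold_ms: float = 1000.0,
               baseline_rate: float = 0.05) -> bool:
    """Decide once per trace, after all of its spans have been buffered."""
    if any(s.is_error for s in spans):
        return True                           # always keep traces with errors
    if max(s.duration_ms for s in spans) > latency_threshold_ms:
        return True                           # keep slow traces (longest span ~ whole trace)
    return random.random() < baseline_rate    # plus a small share of "normal" traces
```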
Starting point is 00:28:49 Would you also keep, I imagine you'd probably also keep ones that are running nominally, so that you can compare bad to the standard, right? So it's going to be a mix of types, not just keeping bad, right? You'd want to keep some of the standard ones so you understand what it should look like, right? Yeah, so you want to keep a percentage of the good ones as well, just to understand what good looks like. Coming back to something that we also discussed in preparation for this, the people that are building in OpenTelemetry, or any type of observability data, are in the end going to be the developers. The developers
Starting point is 00:29:26 that have their own, that they want some information from their own code or developers that are building frameworks that is then used, they're the ones that know their code best and they know what they need. Are there, is there good enough best practices out there to either tell developers what is good to instrument and also what is not good to instrument or is there any tooling out there that actually looks at traces and says, hey, you know what, this is actually useless information or it's too much or it's dangerous information.
Starting point is 00:29:59 It might be sensitive, like confidential information. Because I always worry that, what we've seen in our world 15 years ago, or 10 years ago, when our product was much different, we had the option to say instrument everything,
Starting point is 00:30:15 so we could specify method matches or rules, and you could do what we call shotgun instrumentation. Shotgun meant everything in a certain package, or maybe somebody could do a star dot star and then everything got instrumented. And basically that was extremely powerful and great for some in a certain environment, but in production it obviously would kill everything because it
Starting point is 00:30:37 just caused a lot of overhead, costs, you name it. Are there good enough best practices out there, A, for developers to find, or B, what can developers do to validate if they're doing a good enough job? Is there any validation of the instrumentation? Yeah, that's a difficult question actually, because I think, if you look at some of the OpenTelemetry instrumentation libraries, right, you take the OpenTelemetry Java agent and by default it will auto-instrument every single library that it knows how to auto-instrument, right? That could be a lot of information. That could be certainly something that you want to use
Starting point is 00:31:25 to test it and to roll it out if you're perhaps in a small volume or small traffic. But when you think about rolling it out at an organizational level, then you've got a decision to make, which is which ones of these libraries you want to instrument or which ones of these settings you want to apply. Now, there is good news here as well. The tooling is there. OpenTelemetry does provide the tooling for you to enable and disable things. In general, what I've seen is that anything that could potentially be
Starting point is 00:32:05 sensitive information is normally disabled by default, like storing headers, for example; things like that are not normally stored by default. There are a lot of things that you could maybe decide to disable. So the way that we do it at Skyscanner is we've got our internal, basically like an OpenTelemetry distro. For those that don't know what an OpenTelemetry distro is, you will see multiple vendors have their own OpenTelemetry distro.
Starting point is 00:32:31 It's not a different implementation of OpenTelemetry, it's just a sort of packaged configuration of, you know, what components are going to integrate better with that particular vendor. The way that we do it at Skyscanner is we have our own default. And when we roll out OpenTelemetry, we roll it out with the minimal required. And by minimal required, we mean we want the HTTP clients to be instrumented, we want the servers to be instrumented, we want some internal, like
Starting point is 00:33:00 middleware, for example, internal libraries to be instrumented, but in general we don't instrument it all. There is a lot that we want to allow by default, and then we allow service owners to add their own, to be able to configure their own. And the way to look at it, I guess, is basically what's useful to you to debug your service; that's going to be very much dependent on each service owner. There's also good news on the metric side. I was talking mostly about traces here, but on the metric side, one of my favorite things, actually,
Starting point is 00:33:37 of OpenTelemetry metrics is the concept of metric views, which allow you to, let's say that you're a library author, and you say, I want to instrument this client, this HTTP client, and I want to put every single URL in it, which is not a semantic convention anymore, but let's say that someone says, I want to put every single URL in a metric as a tag.
Starting point is 00:34:00 Now, that could be a really high-cardinality tag, if you've got the URL in it. Now, with metric views, what you can do is you can just go and remove that and basically re-aggregate the metric at the service level without having to change how the metric is instrumented. So that leaves the developer of the library with the option to say, these are all the things that you should care about, but then you as a service owner can just say, well, actually, I want you to re-aggregate that into fewer dimensions
Starting point is 00:34:28 and then push those data points out. So this is the work that, if you've got a big organization, it's probably good to invest in some, I like to call it telemetry enablement. It's basically, you need a set of people that know what they're doing and that know how things are instrumented and then can provide these defaults and say, this is a default metric view for this particular library. And then that's something that you can roll out internally across multiple teams.
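Here is a hedged sketch of such a default metric view with the Python SDK: it keeps only two dimensions of a hypothetical client-duration histogram and drops anything else, such as a raw URL tag, re-aggregating the data points before export. The instrument and attribute names are invented, and the SDK class names may shift slightly between versions.

```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import View

# Keep only low-cardinality attributes; everything else (for example a full URL
# tag) is dropped and the metric is re-aggregated at the service level.
reaggregate = View(
    instrument_name="http.client.request.duration",  # illustrative instrument name
    attribute_keys={"http.request.method", "http.response.status_code"},
)

provider = MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())],
    views=[reaggregate],
)
```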
Starting point is 00:35:02 Because not everyone has the time to go and look at this. And then it's, yeah. I'm of the approach that it's good to have that minimal instrumentation, minimal telemetry, and then allow service owners to enable what they need then later. Do you see any testing that is necessary for the telemetry data? Meaning, are there, like we do functional testing and unit testing,
Starting point is 00:35:31 do we need to do telemetry testing so that we actually make sure we get the right telemetry before we push it into production? Have you seen this? Yeah, I have, yeah. I think, especially there,
Starting point is 00:35:44 a good practice is to extrapolate, right? So you can instrument your service with a particular library, you run it in your test environments, see what telemetry they produce, you run it then if you've got a pre-prod environment or like canary deployment, something like that, you can run it there and then say, well, if this is producing this amount of telemetry, is that useful for me to debug? Is it useful data? And then try to basically get the value of how much is this going to cost me? Compared to, is it really useful data that is produced? It's kind of like observing the observer.
Starting point is 00:36:24 There is as well things that you can do when people are instrumenting their services and adding spans or manual instrumentation. I think probably more related to spans is the ability to test that in your unit tests as well. Something that we do at Skyscanner, you can have an in-memory exporter. So then you can use that in testing and say,
Starting point is 00:36:51 what spans were generated during this unit test? Is that what you're expecting? Or is context being propagated correctly? And you can start to look at it that way as well. Especially because telemetry data is now also meant to be used by not only the developers for troubleshooting or for ops teams,
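A hedged sketch of that unit-test pattern using the Python SDK's in-memory exporter; the import path may differ slightly by version, and the span and attribute names are made up for the example.

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

def test_spans_are_emitted():
    # Capture finished spans in memory instead of sending them anywhere.
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    tracer = provider.get_tracer("unit-test")

    with tracer.start_as_current_span("do-work") as span:
        span.set_attribute("order.id", "123")

    spans = exporter.get_finished_spans()
    assert len(spans) == 1
    assert spans[0].name == "do-work"
    assert spans[0].attributes["order.id"] == "123"
```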
Starting point is 00:37:13 Especially because telemetry data is now also meant to be used not only by the developers for troubleshooting or by ops teams, but also by your DevOps tooling itself. We mentioned canary deployments, right? We have automated canary analysis tools, where your KEDA, your HPA, your auto-scaling, they rely on accurate telemetry data. And if your telemetry data is not good, then these tools may make wrong decisions. I think that's a big point. So kind of the quality of observability,
Starting point is 00:37:40 how you need to validate that the quality is right and that you have the right data in the right quality. The metrics view was an interesting concept, and I would like to ask one more question, because you said you can define a view and then the aggregation
Starting point is 00:37:58 happens. Does the aggregation happen in the app itself then? That means it's on the client side where it's collected, or does it happen in the collector? Where does it happen? It would happen on the client side, so on the individual service's replica
Starting point is 00:38:13 where that is being generated. So it's basically the way that the API works is quite interesting. For me, that was a new concept as well, which is the separation between measurements and their aggregation. Because when you think about, I don't know, the usual Prometheus clients, you basically get a counter, you add it, and then just get aggregated there in memory, and then you export
Starting point is 00:38:38 it. Well, it gets scraped in the case of Prometheus. With the new OpenTelemetry metrics API, what you're doing is you're adding a measurement, but the way that these measurements are aggregated is configured at startup, at application startup. So that's what the view does, really. It just informs the SDK, the OpenTelemetry SDK, to say, well, this is how you will aggregate these measurements.
Starting point is 00:39:03 And you could even change the type. Basically, if you've got a histogram, because whatever library author said, I'm going to instrument this as a histogram, but you may not need a histogram and you just need a counter, you can even change the aggregation type. So you can say, I just need a counter for this. Normally that's a bit of a niche use case, probably not that common, but changing the type of aggregation is quite powerful.
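And a short hedged sketch of that niche case: a view that tells the SDK, at startup, to aggregate a histogram instrument as a plain sum instead, without touching the library that records the measurements. The instrument name is invented and the aggregation classes are assumed from the Python opentelemetry-sdk.

```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import SumAggregation, View

as_counter = View(
    instrument_name="queue.wait.duration",  # recorded as a histogram by its library
    aggregation=SumAggregation(),           # but exported as a simple running sum
)

provider = MeterProvider(views=[as_counter])
```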
Starting point is 00:39:42 I've got two or three more quick questions for you. Also coming back to what you said in the beginning, we also discussed this in our prep call. From an adoption perspective, I see, especially in enterprises that have had software running for a long, long time, a lot of, let's say, their quote-unquote legacy software, for the lack of a better term, that may never... Either they don't have the source code.
Starting point is 00:40:17 Well, maybe also some, I think I brought up this use case that I ran into with a client where they said, Hey, our software is using this library, but we cannot use the latest version because we're running on an older version of another library and therefore we have a certain dependency so we cannot upgrade to something that
Starting point is 00:40:33 is instrumented with OpenTelemetry already. So we have a gap. And so this is also why then we still need auto instrumentation. And fortunately, there are some auto-instrumentation agents available in non-open telemetry as well, which may fill this gap. But what do you think needs to happen in order to completely fill this gap? Or what can we do to encourage people to also maybe touch some old code
Starting point is 00:41:01 and actually instrument those libraries? Because we want to make sure that enterprises that have these legacy systems running around, that they can also benefit from OpenTelemetry, from distributed tracing end-to-end. Yeah, so I think, well, the first thing is communicating value, right? Every single one of these discussions should start with, what is the value in doing this? And that's why there's so much good content out there, basically, to say this is the value of open standards, this is the value of correlation of signals, of using OpenTelemetry semantic conventions, and so on. So I guess if you've got an old framework
Starting point is 00:41:46 that needs to be instrumented, well, try to basically start those discussions with different vendors or different library maintainers to make them instrumented with OpenTelemetry. Now, it's not always possible, but in some cases there are ways that you can still use OpenTelemetry semantic conventions with, for example, things like logs, right? We just said you can start to use some semantic conventions around the
Starting point is 00:42:18 logs that you produce so they can get added to or correlated to traces. And there are also the OpenTelemetry collectors, which allow you to receive some of that data that may be produced in, I don't know, some of the older ways of exporting metrics, for example StatsD, or, well, not old, some other ways like Prometheus, or other clients basically.
Starting point is 00:42:50 There are like 80 different receivers, OpenTelemetry collector receivers, which can either scrape or receive data in a format. And then you can do a lot of processing within these collectors to add those semantic conventions, those attributes that will allow you to not just, you know, produce it in a standard format, but also be able to put it in context
Starting point is 00:43:15 with other open telemetry data. From an operational perspective, what is more challenging? What do you see? Is it to get open telemetry into the client app to actually generate the traces or to operate and run all of the collectors and the infrastructure that stores the data? I was going to go to a third thing, which is the hardest thing. But in terms of, I think the open telemetry collectors are easy to run.
Starting point is 00:43:45 They're just super efficient. We run them, as I said, you know, we run a set of collectors as a gateway that accept those 1.8 million spans per second, and metrics are now ramping up as well. And they run with, I don't know, in total across all of our clusters, I think it's less than 125 cores. So it's quite small CPU utilization for the amount of data that they churn through. And even, generally, for metrics and spans, they're not just basically passing data through, they're actually doing things with the data. But I was going to say the most difficult part, perhaps, is the human part of things. As usual, technology is easy, it's humans that are hard.
Starting point is 00:44:33 It's how to change those practices of debugging. We've got, for example, at Skyscanner, we made it super easy for people to go and use tracing data. Most teams use it. But there will be, you know, we've got a lot of different teams, and there will be some teams that, even though they've got the tracing data, they've got a runbook that says you go to this dashboard and you go to logs, and, you know, they're not used to it because that's the way that they've been operating their service for years. So it's that sort of, like, changing those patterns of
Starting point is 00:45:07 using observability data is probably the most challenging part, I think. Yeah. Definitely see that one all the time, where we're engaging with customers and they get
Starting point is 00:45:24 these traces, but yet they say, well, we really haven't looked at it much, because every time there's an issue, we just jump back to logs, where we know what to do. And there's that idea, okay, well, if you practice the new style for a little bit, there might be a little bit of a pace. It's that transition. It's the culture that's got to change in the transition and try this. And then once you get to that and you get comfortable with that, you're going to find it so much more efficient. But how do you get somebody, especially
Starting point is 00:45:48 in the heat of the moment when something's blowing up? Like, I know I can go here, it might take me three times as long, but I know what I'm doing, as opposed to trying this new thing. And that, yeah, that's extremely challenging. So I think the practice there is finding a time for people to play or to experiment and poke around when there's not a fire. It's to find dedicated time to look into those situations. I actually have a quote. I think there's a quote in my book about that. It says, like, the firefighter doesn't wait until the house is on fire to learn how to use a fire engine. So there is some training needed as well. And I think we've started to do that,
Starting point is 00:46:30 so try to apply that concept at Skyscanner as well, having some sort of game days, or just basically teams looking at their telemetry. Something that I'm really keen on starting to do is having multiple teams looking at their telemetry together. Because then, I've seen that happening, not with the whole team, but with one engineer from one particular area and another engineer
Starting point is 00:46:59 from another area. Now they're both part of the same transaction, but the services are part of the same transaction, but they're different teams. And then when they start to talk about telemetry, they're all like, oh, actually, I didn't know my endpoint was being called by your service, or that this endpoint was being called by your service and that was part of this particular user journey. And it's really interesting the conversations that can come out of that. And I think out of that actually comes another cool thing. If you start to learn about your
Starting point is 00:47:29 quote unquote users or customers from your API and you see when something breaks, why it breaks, you can also start thinking about kind of system boundaries and then define good SLAs and SLOs and actually start measuring the right things and then get right alerting in and all that, right? I mean, that's a big topic. Daniel, I know we are getting close to the end of the hour here, but if you look ahead a year from now, and if you would, if, you know,
Starting point is 00:47:59 let's assume you will write a second edition of your book, not because you need to keep it up to date, but you're writing new stuff. What do you think would be in your book? What are the things that are still struggling in your current role or in your current situation? What are the things that you have not solved yet that you need to solve this year
Starting point is 00:48:21 that you want to write about how you solved it? I think it would probably be, and this is an area where OpenTelemetry has put in a lot of effort now, the client-side part of things, the front end, browser, mobile instrumentation. And then how do we get that, basically, how do we get what we have in the back end in terms of correlation, context, and all that, to mobile and to browser in a way that is integrated with open standards and so on?
Starting point is 00:48:54 So how do we basically move that context as well to the back end, so propagate all that context and start to basically get an idea of what our users are really experiencing. For example, we've got Core Web Vitals as something that we all care about because Google says so. It is a really genuine, good way of assessing user experience, those three metrics. But there's so much more that could be done to basically assess what users really care about, and then correlate that to backend performance as well. So I think that's an area where I'm really looking forward to next
Starting point is 00:49:36 year, next couple of years, what is coming. And then get the whole view of contextualized data from the customer, from the user, to the database. Sounds like you're describing a product that we work with on a daily basis. No, but that's the standard, right? I mean, that's the nice thing. Yeah, the standard, I think. And also, this is another area that is related to this as well. You mentioned SLOs; try to get SLOs that are basically product-driven. You've got your SLOs that product people, product owners, product managers care about. And then you start basically looking at product health in a way that is driven by SLOs
Starting point is 00:50:28 and then get that sort of view of there is an SLO that's about to break and then you get into the metrics and then vendors start to basically give that. We were talking about workflows or runbooks, basically get those without having to write a runbook.
Starting point is 00:50:52 Because I think writing detailed runbooks is a losing game. You're never going to write the perfect runbook. But if you can have a data that is correlated that a vendor can use to drive you through the SLO to metrics, to traces, to logs,
Starting point is 00:51:08 and all that in context. There's still a lot of work to do there, I think. Daniel, did we miss anything? Did we miss any topic that you wanted to make sure all listeners are aware of as they embark on their journey towards adopting OpenTelemetry. Besides obviously looking at your book
Starting point is 00:51:30 because that's where all the wisdom is. Yeah. I don't think so. I think perhaps the cost distribution, but I think we talked about that. Yeah. I know you presented at SRECon and at QCon.
Starting point is 00:51:52 I didn't actually. I didn't present at SRECon. I presented at, well, recently at QCon. I presented at Ollifest. Oh, Ollifest. Sorry, that was my mistake. Any other conference presentations coming up this year? Not yet.
Starting point is 00:52:09 I've been invited to a couple, but I also have a lot of work to do at Skyscanner to adopt open standards. Yeah. Awesome. Brian, any final words from you? Any new dream that came up in your mind? Well, the only dream I have is still one I thought I saw in your notes and we didn't
Starting point is 00:52:31 really tackle is I remember when OpenTelemetry was first coming out, there was this promise of all these vendors were going to be baking OpenTelemetry code into their runtimes and everything. And if it's happening, I haven't seen it yet, but I haven't seen everything obviously, but not seen it in the common stuff yet. And hoping that still comes because I think that'll just make it so much easier for everyone, you know, and it'll be based on the vendor's best practices for what's going to be most useful. The other thing, I like what you said, Daniel, earlier, I remember reading an article about this way back as you kept on talking about collecting all this data
Starting point is 00:53:06 and collecting all this data, and it could be overload from a data point of view. And I think one of the most important components of observability is turning the data into information, which there's some great articles out there. I can't remember the one I read way back, but it was just going through the difference between data, which is just data points and information,
Starting point is 00:53:28 which is turning into something useful. And OpenTelemetry, obviously, is going to help you capture all the data. And it's going to be up to you to turn that into useful information and find ways that you can leverage that within your organization to make those improvements. So that's the other key component to it. My hope is that we can get
Starting point is 00:53:49 vendors, like the effort that vendors used to basically get on running their own instrumentation agents and their own APIs and SDKs. When vendors move towards open standards for instrumentation and the API and SDK. Hopefully all that extra
Starting point is 00:54:07 engineering time is put onto using that data, that standardized data, and making it valuable. Awesome. That's all I had though, Andy. Hopefully I'll have another dream for the next episode. Which I believe is actually also another session on OpenTelemetry because it's going to be practical implementation tips from one of our users out there and how they implemented OpenTelemetry. So OpenTelemetry is
Starting point is 00:54:36 a hot topic and we are trying to do whatever we can to educate the world. A final statement I want to make, and we discussed this also in the prep call. I think we as vendors that have been in the space for so long, we're really looking forward to Open Telemetry because I think it helps us to really focus our engineering efforts on where we can really make a difference because the data collection problem is something that will be solved with OpenTelemetry,
Starting point is 00:55:06 like how and what data, so we can really figure out and solve the problem that, Brian, you just mentioned, how can we then convert this data into answers, and refocus and then also differentiate there, right? And that's going to be interesting. And yeah, I think there's still a way to go, right? Because even though OpenTelemetry kind of looks like it has skyrocketing adoption, it will still take a while until we have it in a state
Starting point is 00:55:31 where we can just purely rely on open telemetry. But yeah, it's going to be good. Yep. Good future. All right, awesome. Daniel, thank you so much
Starting point is 00:55:42 for joining us today. I was a bit quiet today, but it was, I think, a lot for me to learn on today's episode for sure. Plus, I was still stuck. I couldn't stop thinking about you dancing around on the moon doing the salsa, the dancing on the moon. But to any of our listeners, if you have any questions, comments, topics, feel free to send us an email at pure underscore DT. And we'd love to hear from you. And thank you, everybody.
