PurePerformance - Pitfalls to avoid when going all-in on OpenTelemetry with Hans Kristian Flaatten

Episode Date: September 2, 2024

Hans Kristian is a Platform Engineer for NAV's Kubernetes platform Nais, hosting Norway's welfare services. With 10 years on Kubernetes, 2,000 apps and 1,000 developers across more than 100 teams, there was a need to make OpenTelemetry adoption as easy as possible. Tune in as we hear from Hans Kristian, who is also a CNCF Ambassador and hosts Cloud Native Day Bergen, why OpenTelemetry was chosen by the public sector, why it took much longer to adopt, which challenges they had scaling the observability backend, and how they are tackling the "noisy data problem".

Links we discussed in the episode:
Follow Hans Kristian on LinkedIn: https://www.linkedin.com/in/hansflaatten/
From 0 to 100 OTel Blog: https://nais.io/blog/posts/otel-from-0-to-100/?foo=bar
Cloud Native Day Bergen: https://2024.cloudnativebergen.dev/
Public Money, Public Code. How we open source everything we do! (https://m.youtube.com/watch?v=4v05Huy2mlw&pp=ygUkT3BlbiBzb3VyY2Ugb3BlbiBnb3Zlcm5tZW50IGZsYWF0dGVu)
State of Platform Engineering in Norway (https://m.youtube.com/watch?v=3WFZhETlS9s&pp=ygUYc3RhdGUgb2YgcGxhdGZvcm0gbm9yd2F5)

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me my co-host Andy Grabner who today is not mocking me. I know on the last podcast recording you probably heard me mentioning him mocking me and today he's just smiling quietly, very pleased with himself that he's not mocking me. So Andy, how are you doing today? I'm very good. I just, uh, wanted to, uh, make sure that people know that you're not a fortune teller, because I think you assumed I am again. I did think I was getting ready for it. Yeah, so I'm no longer
Starting point is 00:01:00 a fortune teller the gig is up or the jig is up what's the uh the saying whatever it's probably an american thing and you know what i could also have uh i think not even a fortune teller could have predicted that today we have two norwegians oh yeah that's funny i mentioned it because i'm so minorly norwegian but yes yes i do have a little bit of background, but our guest is full on, I assume. I don't know. Maybe he'll tell us he just moved there two weeks ago. Who knows?
Starting point is 00:01:31 But then he maybe also changed his name to something that sounds very Norwegian. I wonder what would his name be? I don't know. Hans Christian, not Andersen, but maybe Flaten? I don't know. Yeah. How do I pronounce it correctly? Hi, Hans Christian. How are you? Hans Christian. Hans Christian, not Andersen, but maybe Flaten? I don't know. Yeah. How do I pronounce it correctly? Hi, Hans Christian, how are you?
Starting point is 00:01:48 Hans Christian, yes. I think it's very spot on, Andy. It's also easy for, I guess, easier for me as a German speaker, like an Austrian who speaks German, Hans Christian. And then Flaten, how do you pronounce the double T? Yeah, so the double A is a Norwegian syllable, Ã…, so it's flotten. Oh, flotten.
Starting point is 00:02:11 Ah, okay, yes. Yes, so it's a little bit old. Actually, it comes more from Danish language. And then, so this guy that I believe you mentioned, Hans Christian Andersen, he was sort of like, he went around and collected fairy tales in Norway and wrote sort of like the book on fairy tales from Norway. That's, yeah.
Starting point is 00:02:33 Well, so this is the first thing people learn when they listen to our podcast today. First of all, how to correctly pronounce your name. It's not flattened, like I said. And then Hans Christian Andersen, if in case you have escaped the stories, then you should check him out. Maybe we'll add a link to it in the description too.
Starting point is 00:02:51 But Hans-Christian, to get into a more serious topic, the reason why we talk today is because, first of all, you reached out on Cloud Native Bergen, which is a conference you run in Norway. Maybe you want to quickly explain since when you do it and why you love doing it. Yeah, I'll do that. So Bergen is sort of the westernmost city of Norway.
Starting point is 00:03:15 It's an old Hanseatic town. So we had close relations with the German trade routes. It went through and for a long time, it was actually sort of the capital of Norway. So yeah, we're very proud of the heritage we have over there. And we have a great cloud-native community as well. There's a lot of financial fintech companies in Bergen.
Starting point is 00:03:37 So there's a strong technical community there. And we are organizing for the first time ever cloud-native Bergen. So by the time the podcast is out, the CFP is closed and the agenda has been published, but you can still get your tickets because
Starting point is 00:03:56 the conference is not until October 30th. So it will be sort of like a pre-Halloween. Maybe there will be some horror stories from production. Maybe. Actually, it's a nice theme though. Hopefully you
Starting point is 00:04:11 have actually, even though we don't some people like horror stories, but typically in an IT sense we don't wish them on ourselves, only to our enemies or competition maybe. But still, I think stories about what is not going well is actually something that I would actually like to dive into today.
Starting point is 00:04:31 Not meaning that things fail, but what we can learn of things that we should have done differently. So thank you so much for this, for the Cloud Native Bergen folks. If you want to attend that conference, the links and the details are in the description another thing that you know once we connected and talked I then stumbled across a blog from you and it's called open telemetry from zero to 100 the story of how we adopted open telemetry at NAV the Norway's largest government agency and I read through their blog, and OpenTelemetry is obviously, especially Brian, where you and I are, right?
Starting point is 00:05:08 We are constantly reminded that this is a big topic. We see a lot of people looking into OpenTelemetry, adopting it. And there's obviously, we hear a lot of great stories. But today, before hearing the great story and how you adopted it at the Norwegian agency, I want to first hear the things that you typically don't hear when you listen to a presentation at a conference where everybody's just saying, we achieved X, Y, Z and where everything is perfect.
Starting point is 00:05:36 So, Hans Christian, if I could ask you, maybe let's start off with some of the challenges you have with adopting some of the things people should know that you would have liked to known before starting on the journey yeah so right off the bat these things take way longer than you anticipate uh so we were already sort of we were preparing mentally for this and we have been through a number of sort of transformations and changes over the past and they are still ongoing, many of them, as in most organizations. And this is no exception. And it's so intricate. There's a lot of moving parts. The standard isn't really standardized for all the different areas. The documentation is certainly not complete. So we have to dig through source code and sort of figuring out,
Starting point is 00:06:28 oh, this markdown is not sort of like, it's slightly different from what's on the website, which one is actually the one that we are supposed to follow for what the version we are. So it's a huge area. And as you all know,
Starting point is 00:06:43 sort of observability and then adding on a new standard and standard to rule them all makes it just infinitely more complex. So I wish that more people would have prepared me for the complexity there before digging in. So that would definitely be one of the first areas there. So, and then we have sort of different sort of smaller, more concise areas, sort of like area, a lot of noisy data when we started. Because we went through, just to give a little bit of context,
Starting point is 00:07:35 NAV being the Norwegian Labor and Welfare Administration, and we have been doing Kubernetes for close to 10 years. So by now we have close to 2,000 individual applications or microservices or whatever you want to call them in our clusters. So a fairly large amount. We have roughly 1,000 developers across 100 plus teams. So it's a fairly sizable operation here.
Starting point is 00:08:01 And sort of having developers do anything is sort of a challenge in itself. Nevertheless, sort of like, oh, throw out the sort of the instrument with all of these SDKs and libraries and so forth. So we knew that we had to provide sort of an easier on-ramp at least. We had to sort of cater to both worlds, those that, oh, we want to be completely in control and sort of like know every single binary or SDK or library or package we want because we have those teams as well. But the majority, they are sort of like, oh, I've already sort of backlogged for several years
Starting point is 00:08:38 of development. Please don't sort of add more to the backlog. make this just work out of the box. So having sort of a way to automatically instrument and actually sort of, I was a strong sort of like, I wouldn't say against Java, but sort of I've come to like Java more and more because the JVM at least has some really, really solid hooks. And recent versions of Java is actually becoming quite nice. But the way that you can actually hook into the JVM and get in there without completely breaking the application actually works. So we were able to leverage the auto instrument part and the majority of our applications are Java. So they just mainly worked out of the box there to sort of get them
Starting point is 00:09:35 instrumented with OpenTelemetry SDKs. But that sort of was only sort of the beginning because then suddenly sort of you had influx of so much data. So we met sort of every step of the way, sort of like scaling issues when sort of setting up this platform sort of on our own infrastructure. That's sort of like, oh, the ingester need to be scaled. So the backend needs to be scaled. So the backend needs to be scaled. The query engine needs to be scaled and so forth.
Starting point is 00:10:07 Sort of the list never ended. So there was constantly a new sort of bottleneck. And then sort of once you had all of the data, sort of making sense of it all, sort of how do we actually explain to the teams that this isn't sort of like, well, there is certainly advantages and some really, really
Starting point is 00:10:25 strong motivations on adopting open telemetry, adopting a standard. It's certainly sort of like, it's not a silver bullet. It never is. The tool is never a silver bullet. It just doesn't magically just solve all your problems. It makes some problems easier to solve, but there's still work. There's still sort of, you need to learn to use the tools and sort of how to use the data and get them making sense of the data correctly. So that's again where we prepared ourselves mentally, but we could have been much, much better prepared not only how the teams were supposed to instrument and get their applications into the OpenTelemetry ecosystem, but then on the other
Starting point is 00:11:16 end, the day two operations, how are you actually going to use this data? And to a certain extent, I don't think the OpenTelemetry community don't know And to a certain extent, I don't think the OpenTelemetry community don't know that to a certain extent using this. So there were no prior sort of like, oh, this is how you, this is the questions that you are solving. And sort of these are, and at least we didn't find them. So then they're very, very well hidden. So that's telling. Can I quickly inject here? Because I think that's an interesting thing that you say.
Starting point is 00:11:55 Observability is more than just gathering the data, right? Open telemetry, the SDKs are solving the, how do we get the data from the apps, from the frameworks? Then we have the open telemetry collector who can collect it, transform it and then send it to a backend. But this is only half of the story. And already here, it's challenging as you said, right? It's challenging to instrument the right things. It's challenging to scale the data collection, the collectors to make sure
Starting point is 00:12:24 you're not losing data. It's challenging. And I look in your blog post, you have some examples of when you then came up with a lot of rules of reducing the noise to basically not capture things that are not essential.
Starting point is 00:12:36 And what is interesting now, if there is no one, and maybe I'm wrong because I'm not as deep maybe into the whole community as you are because you're working with this for so long. But if everyone that adopts open telemetry as the pure thing and then doing everything themselves needs to learn these lessons,
Starting point is 00:12:56 need to configure the same exception rule, the same exclusion rules, needs to figure out a way how to best scale your infrastructure, then it feels like a lot of duplicated effort so then my question to you is why did you or why did the norwegian government even decide to go that route and not just say this is not our field of expertise we don't want to build our own observability platform why did they go this route and not go with a vendor that has maybe five, 10, 15, 20 years of experience and just giving you an out-of-the-box solution? Yeah, very good question. And the primary one was that we really, really believe in open standards.
Starting point is 00:13:42 So the goal here wasn't that much of building a platform, but that was just an artifact of proving that the standard works, more or less. I'm not married to any of my tools. I'm married to the protocol and the format of my telemetry data. Because what we have been through over and over again was that we needed to ask our developers to not only change the tools, and that's hard enough, and sort of where do you view and what are you logging into,
Starting point is 00:14:15 but also instrumenting their code. And that's just a so time-consuming task. So that was the underlying goal here was to, okay, say that hopefully if our bet is correct here, we don't need to instrument the code again. We can swap out whatever tracing backend and log storage backend and viewer for whatever makes most sense
Starting point is 00:14:42 and gives us the best visualization and the best sort of understanding of the data, not sort of like locking down the data itself or the format. But we felt that in order to understand the underlying technology and sort of the nitty-gritty details, we needed to sort of get our hands dirty to a certain extent, but this is not the final chapter. This is not the end of the road. Hopefully, it's the end of the
Starting point is 00:15:15 road when it comes to instrumenting, and then we can use our time better when it comes to sort of making sense and where would we like to have this data. And then the second part there also is that in government, things like procurement is really, really slow. And since we really didn't know what we needed, it was merely impossible to sort of go out and sort of make a procurement and make a public tender to sort of like, oh, we don't even know what we need and what we should ask for. So again, it's proving sort of that this has its worth.
Starting point is 00:15:54 And then once we sort of have, so we are still sort of treating it as sort of a proof of concept, even though it's running in production and we are making it as production ready as we are able to do, at some point we will have the time and ability to take a step back and say, where do we go next? Is Open Telemetry working as intended? And is there better options when it comes to the tooling, to the storage, to the analysis, and what extra can we do with this data here? Hopefully by then, the ecosystem has matured when it comes to integrations with OpenTelemetry, and there will be different options for us to choose from. Maybe we have, because due to our size and sort of maybe there's not one option fits all and maybe certain domains, certain teams, technologies,
Starting point is 00:16:53 or what have you sort of will favor sort of these tools here and some other will favor there. And there will be some overlap because we can send the data to multiple sources because we can send the data to multiple sources. Because we have the collector and we can just say that, oh now we want to send it here. And maybe the teams can in larger degree choose their own journey in that sense. Because that's the one thing that even though we are a large organization, we actually have very disjoint domains. We have these, instead of sort of a bank would have a very, very sort of like one customer,
Starting point is 00:17:34 and then there would be some core services that are used by all of them. And there would be additional services, but it's very intertwined. The domain is still very financial and there are subdomains, of course, and nuances. But in all of the services we provide, they're actually quite independent. Your paternity leave has nothing to do with your disability payments. There might be some calculations here and there, but sort of the act of delivering the whatever user interface the user would be interfacing and the caseworkers workers will be completely different in most cases. Allows us to treat them a little bit different. I'm taking a lot of notes as always. And one of the things that I really thought
Starting point is 00:18:47 I think should be a quote out of here is the goal that you try to achieve with going with open telemetry and kind of proving it out is that you don't want your developers to constantly with every cycle uh of your kind of like tooling also have to change the way they do coding they do instrumentation and i think this is why betting on the standard makes a lot of sense. And I really like that. So don't force any tool decisions that you do in the backend onto the way developers are actually instrumenting their applications. And I think that's why going with OpenTelemetry makes a lot of sense. And I also agree with you.
Starting point is 00:19:21 I mean, OpenTelemetry has come a long way and so has adoption also of with you. I mean, OpenTelemetry has come a long way and so has adoption also of commercial vendors. You know, whether it's us, Datadog, New Relics, Blank, we are all in the observability space. And I think we all take OpenTelemetry very seriously. We also see it as a big benefit because it means that we, while we still obviously have our agents,
Starting point is 00:19:42 OpenTelemetry is a great source of information for us. And like what you do, 100%, you try to go to 100% OpenTelemetry, that's what we see more and more people asking for. On the Java side, just jumping quickly back because you said you kind of fell in love with Java. I just recorded a video this week on GraalVM. I'm not sure a video this week on GraalVM. I'm not sure, have you looked into GraalVM at all? I believe we have teams looking at it, experimenting it.
Starting point is 00:20:13 Maybe we even have it running in production. I'm not entirely sure because there's just so many applications. But that's really interesting. But of course, you're trading your startup and memory and efficiency during runtime into a longer build cycle. And I know that the same teams are also trying to get their builds as short, as quick as possible. There are some teams that are really, really to the extreme sort of focusing on sort of having when you commit to when you get sort of a thumbs up or a thumbs down with regards to that
Starting point is 00:20:52 commit to sort of not add on additional seconds there. So the only thing I bring it up here is because I also, I learned this this week when I listened to the video or I recorded the video with one of our colleagues. Java Graal or Graal compiles Java code into native images, which means the whole JIT compilation, the just-in-time compilation, where you can actually use an agent-based approach. Modifying bytecode does no longer work. And this is what traditionally we use for auto- auto instrumentation at runtime. And so the new approaches need to be found. And I know that in the OpenTelemetry community, there's also initiatives going on to also instrument
Starting point is 00:21:35 Corral native VMs. We also, from our engineering team, they figured out a really elegant solution of actually instrumenting the compiler. And then as the compiler creates the native images, we're instrumenting the compiler, so the compiler produces a native image that then has the instrumented code in there. So a really
Starting point is 00:21:55 interesting approach. It's interesting you mention that. Only thinking, like we've had a bunch of conversations in the past about when people are choosing which platform to use, if they're going to use Kubernetes, if they're going to use serverless, right? Thinking about more than just what they want to deploy to, but if they're going to be are developing these new, I don't know, these new variations of code and compilation and all are not thinking about how anything else is going to be done on it.
Starting point is 00:22:33 Right? And we almost need to shift over to them to say, please, yeah, find cool new ways that are going to be more efficient, but we also need to observe it. And you have to have something built in for observability so that suddenly anybody who uses that is not blind. It's just an interesting conundrum. It's the first time I'm hearing that and thinking of it. Because it's usually on the other side, right? Hans Christian, people will pick something because they think it's what's going to be best for them. And then they come through, okay, now we need to observe it, now we need to do security
Starting point is 00:23:03 on it. And it's really, really difficult because what they chose might not be suitable for that, whereas that should have been part of the consideration early. It feels like that consideration has to go into these new things, too. Interesting, very interesting. Yeah, absolutely. And we
Starting point is 00:23:19 are trying to sort of lift or embed the focus of observability into the teams because they still have a lot of teams that sort of treat it sort of way they have no sort of thought or sort of it's just an afterthought about the whole observability thing so still trying to create that culture into all of our teams. To a large degree, we have been able to do it with security. We have a security champions program within our organization that's been really, really successful once they sort of found out that, oh, it needed to
Starting point is 00:23:58 be sort of like an opt-in and sort of more FOMO afterwards, sort of like, oh, you're missing out and there are some cool events, et cetera, instead of sort of like a mandatory sort of more FOMO afterwards, sort of like, oh, you're missing out and there are some cool events, et cetera, instead of sort of like a mandatory sort of top-down, sort of, oh, you need to have a security role and something on your team. So trying to do the same when it comes to observability, but it's a long road there. Yeah.
Starting point is 00:24:21 One additional question on, so you said earlier you've been 10 years on Kubernetes. Are you the platform engineering team? Is everything Kubernetes or do you also have any other systems that you connect to that are kind of running in the
Starting point is 00:24:38 quote-unquote traditional more legacy world? We have lots of legacy systems. It's just that it's not under my team. So the platform team that I'm a part of, the Nice platform, it's 100% Kubernetes. So we do
Starting point is 00:24:53 have a fairly large on-premise environment. So we have almost 50-50 split between our on-premise clusters and our Google Cloud ones. But it's purely Kubernetes at that point. And that is sort of the when
Starting point is 00:25:09 services within our organization are modernized, they are placed into Kubernetes. We have all the way back to mainframes still running as per organization and then everything in between middleware servers,
Starting point is 00:25:26 application services, running on bare metal servers or virtualized servers. Is there ever a need from an end-to-end kind of responsibility perspective to get consistent observability from your, let's say, new stack from Kubernetes into the mainframe? Is this ever a need or is it just completely different silos in the organization and they don't touch base? Well, to a certain degree, the large majority
Starting point is 00:25:52 is sort of these very sort of isolated pockets. They have their own services that they are responsible for and then not integrating too much with the legacy services. And these larger things that are not modernized are also sort of a little bit in the same boat there. So of course, there are never sort of like rules without exception. There are certainly some services that still call more of their legacy or other external services, definitely.
Starting point is 00:26:30 And that was, but that's not the second order sort of thought, but from our sort of initial evaluation of telemetry, they even have sort of a mainframe working group. So even on the mainframes, there could be the possibility of us instrumenting it or getting OpenTelemetry data from it. So it felt like a safe bet because the OpenTelemetry,
Starting point is 00:27:00 while they are good Kubernetes and container support for those applications, it has no ties into the Kubernetes community. It's sort of, it doesn't really concern itself with where you're running it. It's more, it's on a higher order or depending on where you are, of course, but it's more application focused
Starting point is 00:27:21 or it's network focused and sort of like, yeah, it works great in communities, but it works just as great outside communities as well. I remember we had a podcast around mainframe and open telemetry with one of our colleagues. So it is, it is pretty cool to see what what's been happening in that space and what open telemetry and also other open source initiatives have made happen, right?
Starting point is 00:27:47 I mean, have triggered, as you said, the great thing about being so open and also defining standards is that once you have a standard and people can agree on it, then you all of a sudden potentially have a really cool ecosystem that evolves and develops.
Starting point is 00:28:05 Hans-Christian, Evan, one more question. You mentioned the challenge of too much data, or you think you call it, how do you deal with the noise of the data? And this is the same problem that we've had since the dawn of observability, right? So it's either not enough data or it is too much data. The question is, what is the right data and my question would be how do you educate or do you have an answer on educating engineers on what is the right amount of data like do you do people have to go through a training program or do they you've pressed practices do you have any checks in your pipeline to make sure that nobody is logging
Starting point is 00:28:43 sensitive data nobody's logging too much data. Any insights on this would be really interesting. Yeah, absolutely. And that's that, but it's sort of where I wished we put more effort in earlier, because this is the hard part. Of course, the fun and technical stuff, it's not really the hard part. It's sort of getting people to use it correctly. So what we do is that we ask back, what is the question? What is it you want to answer? And then focusing on
Starting point is 00:29:22 what are the important user journeys that you want to be able to make sure actually works. And surprisingly, while they can, the teams often can say that these are the characteristics of the system, they are not that certain of what's the critical user journey of this one application here. So they actually then have to go back to their own stakeholders to sort of, oh, but we have all of these features here, but what's actually the most, where should we sort of start to make sure that this is actually working?
Starting point is 00:30:00 And it's working not only correctly, but it's performing to a level that we feel is satisfactory to our users. So it's sort of a long journey here to sort of, as you mentioned, sort of the realm of observability. It's the technical part and the data parts, only it's just a small part of it. And it's so much more about sort of the mindset and sort of what do you want to get out of it. So that is where we are now increasing our effort
Starting point is 00:30:37 to sort of have these training programs to enrich our documentations and sort of to ask these questions here and sort of get the developers more into a state of mind of thinking about this more rationally or specific to what the user might be doing at a given moment and what can go wrong and how should we actually sort of react to that and
Starting point is 00:31:06 where do we need to know if something goes wrong. And more than just sort of, oh, just instrument, just get as much data as possible. And because then sort of we end in just the same position that we are today, where we have a lot of data, but we really don't know, or the developers don't know what should they expect. What is the sort of, oh, here are lots of noise. Should we react on it? We don't know. Hey, Brian, this feels like a lot of the conversation that we have with the people that we interact
Starting point is 00:31:46 with. I mean, on our end, we are capturing a lot of data by default without people having to think about what to capture. But in the end, it comes back to this educational piece. What is it really that is important? And then what data do I then look at? And for me, it's fascinating. It's also interesting, too, because a lot of people
Starting point is 00:32:09 have ideas of what data is important, right? And they might be missing the big picture. This is a bit of a stretch of an example, but you might have people coming in saying, well, we need to know what the cpu utilization of each thing is at every single time we need to know how many times these methods are being fired off we need to know x y and z right and then when you turn around and ask well why so well that's what's going to help us fix right um then it's like fix what what? And they say, well, if something goes wrong, well,
Starting point is 00:32:46 how do you know something goes wrong? And they stop and think, right? Because if you think about it from like the SRE point of view, is it starts with what is it that you're delivering? And are you delivering it okay? Sometimes, especially if you're in a dynamic situation like Kubernetes, where it can auto-scale and add more pods to handle the load, it doesn't necessarily matter what the CPU of each pod is, so long as it's scaling properly and it's delivering error-free and in a timely response, right? Those are maybe nice-to-know things, but the more important thing is, first of all, are you delivering against your SLOs? And then when you have that front end,
Starting point is 00:33:33 the user-facing piece defined, that I believe helps you fill in all those back ends. Because as you find out from game days or chaos experiments, what are the things then that help you troubleshoot those things and that can help you backfill into what are the important things we need to know on the backside. But a lot of people start very granular because, again, most developers are working in that little space of theirs.
Starting point is 00:33:57 So unless you have somebody outside of that looking bigger picture to say, this is our goal, what do we need to do to be able to achieve our goal? And we can define what we need, what information or data we need to monitor that properly. I think that's one of the key pieces to that. Yeah, I don't know. That's just my thoughts on that. I don't...
Starting point is 00:34:14 Yeah, and in many cases our stakeholders doesn't really know, or at certain areas, certain points, not have not matured enough because it's uh is it the the the thought is still like but why do i need to concern myself about sort of individual pieces shouldn't everything work why isn't this this sounds so easy you're making this system here and it should work it should be binary and to a certain degree
Starting point is 00:34:45 it's sort of like a little bit about sort of our legacy mindset and it's also sort of working in public sector where we have all of these rules and regulation and it's fairly binary if we we need to abide by all of them the very few things that we sort of need to concern ourselves that are optionals you know uh all of these sort of criterias needs to be checked and then you get your the service or sort of it's not optional to just say that oh you we were able to check four out of five that's good enough and maybe sort of while private sector certainly has sort of has rules and regulations that they need to abide by, there are still a lot more flexibility when it comes to decisions from the company. You need to go high enough up to have decision-making authority,
Starting point is 00:35:38 but saying that, okay, at this point, we feel that it's good enough. We are certain enough, and we are willing to take the risk. That risk mindset isn't really translated very well into the public government because it's, again, there are no one there to say that we can actually go outside or disregard some rules or regulations because that's all we need to care about is these things. There are no sort of like, oh, but our company is sort of, we have a mission and sort of like, you know, sort of like it's created to serve some purpose.
Starting point is 00:36:16 We are not sort of, we are sort of created to cater to whatever the politicians and sort of the government has decided that this is this is how it should be yeah when you talk about filtering out the noise i'm curious right you know and i'm going to use these words from their more their definition point of view the difference between data and information right information is processed data that's made sense of um so is it a matter of we still want to collect? Because it's new to us, right?
Starting point is 00:36:48 To Andy and I, it's like our tool collects what it collects, right? And that's it. When you're doing open telemetry, you're sitting down with a blank canvas and you can collect whatever data you want, right? And you'd be figuring out what overhead you're adding and doing all that kind of stuff. But I'm thinking, I don't know why I had a Star Trek, I'm not even a big Trekkie or nothing, but if you're on the bridge of the Enterprise, you have certain
Starting point is 00:37:11 alarms that are going off that certain things are going wrong. And it could be Scotty, Mr. Scotty in the engine room is having an issue. It's saying the engine is having a problem. Well, on the bridge there, you would want that to be information. You would want that to be process data. You would want that to be process data. You'd want that to be something that somebody looked at. What are the key data points to alert us to when something is wrong in the engine room? And what key area of the engine is it? Maybe it's not going to be the granular bit because you're not going to be looking at every single piece of data at that point. But then when it goes to say, Mr. Scotty in the engine room, he might need all
Starting point is 00:37:48 those other data points to figure out exactly where that is. So, but it wouldn't be necessarily in an information format. It would be, you know, if you think about sensors on every little single thing, okay, we know it's coming from this area. And now that I'm looking at the data itself, I can see this sensor is telling me it's this little piece. So when you're talking about filtering out the noise of the data, is it you still want to collect a bunch of the data, but you're only going to info process certain pieces? Or are you also looking to say, what data do we just not collect anymore?
Starting point is 00:38:20 Or is there a balance between that? It's just a new concept. I mean, as it's hitting me as you're talking, it's just having a hard time. Like, how do you even tackle that? Like, it's hurting my brain thinking, how do you tackle that problem, you know? Absolutely.
Starting point is 00:38:32 So it's mostly about sort of dropping sort of signals that we don't care about. They are just noise in its purest sense. Because very little of what's coming out of OpenTelemetry is refined or anything. It's very raw. So we need to sort of collect it. And then the different signals are,
Starting point is 00:39:01 if done right, you can correlate them. And they are linked together. And then the tool where you are storing it and viewing it will provide the bigger picture. And that's what we are still learning about, what's actually collected here. Because to a large degree, we are relying on this auto-instrumentation and that's just a black box. We don't know what it's collecting and we need to figure out what data do we have here? What data should we expect to be there and how can we sort of make it into more valuable sort of signals
Starting point is 00:39:37 and alerts that's actually actionable. But, Christian, this now brings me back and I want to challenge one of the things you said earlier. Earlier you said you're looking into open telemetry because you want to make sure that the developers don't have to change the tools every couple of years. But if you look into auto
Starting point is 00:39:58 instrumentation and you're heavily relying on auto instrumentation, currently you don't know what is coming out. You also don't know what is coming out. You also don't know what is coming out in a month or in a year because somebody else is changing the auto instrumentation for OpenTelemetry. Isn't that then giving developers a similar challenge? Because all of a sudden with an update to the instrumentation, things may go away. A lot of things come into the picture they don't even know. Isn't that a big challenge? Yeah, absolutely. And again, we are placing a bet that OpenTelemetry will follow in the
Starting point is 00:40:33 footprints of Kubernetes where things are graduating and becoming stable. And that's saying that in and that will apply similarly to the auto instrumentations that okay once this signal here and this sort of integration or instrumentation once it's sort of we have sort of it's good enough it will graduate or become deprecated and so we can rely more and more that this will continue to provide us the data that we are expecting and not sort of certainly. But yes, there are moving pieces here. And to a certain degree, we need to, or we are willing to sort of live with that as well. And because we need to find this delicate balance between sort of like locking everything and saying that this will never and can never change and sort
Starting point is 00:41:25 of changing everything all the time. And sort of the hope here is that it would provide a certain balance there. But of course, things will change. That's sort of inevitable. But it's just a matter of how much pain does it involve and it needs to change. But yeah, good point. Then one more thing.
Starting point is 00:41:55 How many people would you say are working in your team to make sure that observability is actually available in your platform for your developers? How many people take care of observability? In total, we are close to 20 people working on the platform. And then two of us are primarily concerned on the observability side. Cool. Awesome. And do you have a measure of success of adoption? Do you know, like, hey, something happened, either, I don't know, either OpenTelemetry messed up with the latest
Starting point is 00:42:34 deployment or something happened and developers all of a sudden don't use it anymore. Do you measure somehow and look into adoption rates and then also kind of act on that data? We are following closely adoption rates and then sort of working with teams and applications that are experiencing problems, often sort of suggesting that they start adopting OpenTelemetry and sort of the tools that we are providing, when we see that this is a prime example where if we had better insights, we could at least have understood the issue faster and maybe even alerted once you had been aware that this issue here could arise.
Starting point is 00:43:26 So we have a lot of, since we are such a large organization, our first responders have a hard time figuring out when someone calls about some part of the organization not working correctly, figuring out what team is actually responsible, where would the root cause actually, or where would it be likely. So most of that is done via intuition today. So it's a huge respect to that team that knows a lot of what's the all of the old systems and then keeping up with all of the new ones as well but we want to give them tools that sort of better um say that once the user in this area here
Starting point is 00:44:13 have reported an error you can go and you can trace it back and see which other maybe seemingly unrelated area is causing here or or correlating a bunch of errors and seeing that, oh, but this is due to a network issue between these two sites here, for instance. Brian, for me, it's fascinating to hear all this because it's just things we deal with and have dealt with over the last 10, 15 years since we've been in observability. And the whole ownership discussion is just also fascinating. Or fault domain isolation, as we call it. First of all, where's the fault domain?
Starting point is 00:44:50 If you look at a large complex system, pinpointing the fault domain and then trying to figure out what additional data do you have. I also really like, Brian, your Star Trek analogy. I think that's because the fault domain will be on the bridge. You may have five you know, like five lights and they should always be green but if one goes red, you know, the fault domain and then, you know, who is responsible
Starting point is 00:45:12 for that fault domain? You go to Scotty or you go to somebody else. Sorry for the Star Trek fans. I only went back to the original. I know there's a lot of people passionate about Star Trek and all the iterations of it, so I'll just keep it there. I think the other really interesting thing, too, is hearing this from the point of view of government.
Starting point is 00:45:35 Because when you think about, as we said before, there's two sides to observability. There's data collection, and then there's doing something with that data. And open telemetry has gotten really good with the data collection side. And I think from, as you've illustrated and others, Hans, Christian, doing something with the data is somewhat the more difficult part. People like Andy and I are going to be biased to saying, use a vendor, right? Because we've done it all. But in a government situation,
Starting point is 00:46:06 a couple of other factors play in where, number one, they don't have to be as budget conscious, right? So to set up a team, not that governments want to waste money or anything, but they're not beholden to shareholders, right? Private sector or companies, they're looking at every penny and nickel spent.
Starting point is 00:46:24 You might have some people who are you know looking at that in the government side but it's not as drastic as that you're not you know you are beholden to the voters but something like this isn't going to be so noticeable so you can create a team that's going to create the back end for the processing for it without as much of an impact because it does always boggle my mind when we have companies who, you know, the DIY approach. I understand the desire. I understand the creativity. I just understand the curiosity, but it's like you're paying for all these employees as a company to work on this thing that has no direct tie to your revenue, no direct benefit to your
Starting point is 00:47:03 shareholders or anything else like that when it exists. So I think governments have a lot more leeway to play that game. But I can also see the importance of not necessarily being tied to a vendor as well, because on the government side, you got to think about, OK, we're friendly. This vendor is from country A. We're friendly with them right now. And now tomorrow we're no're friendly with them right now. And now tomorrow we're no longer friendly with them. And half of our systems are in their software. So it does give a compelling reason,
Starting point is 00:47:34 especially at least in the government sector, to do things this way. But fortunately for us, it doesn't give a compelling reason for a private sector. Yeah, and in most, I completely agree with you, Brian, for sort of, as you said, in private sector, especially sort of the return on shareholder value, that this is undifferentiated, heavy lifting. But these concerns that you are touching on, sort of being vendor agnostic to a certain degree.
Starting point is 00:48:08 And then, as I also mentioned, sort of the procurement process here is also ridiculously long, sort of securing budgets, et cetera, et cetera. So this was really the only way that we could sort of get this off the ground and then prove to the people that we could get this off the ground and then prove to the people that we need to prove it to
Starting point is 00:48:27 in order to get the proper budget for saying that now we have good faith in this, we have learned this, we have fixed these issues and this is helping our developers and other areas in our
Starting point is 00:48:43 organization and our end users, of course, getting a better product, better services. Now it's the time for us to sort of tackle all the other problems and sort of see what of this stack here can we now outsource and purchase as a service to a large degree. Hey, Hans-Christian, it's amazing how time flies because at least on my clock, it is almost at the top of the hour
Starting point is 00:49:10 and we typically record just about an hour. By the way, you are one hour ahead of me, right? In Norway, the Central European time? It's six o'clock here. Oh, it's six o'clock too, so we're on the same time zone. I do have one final question though for you because I know I had a recent discussion with one of our customers on this
Starting point is 00:49:32 and he was talking about vendor lock-in versus community lock-in. So going all in into a community, which is great, right? Communities, especially if they're diverse, but is there a fear, is there a potential that you're locking yourself in into a community that you also cannot control? And especially if that community might be controlled by some entities
Starting point is 00:49:56 that put a lot of people into development in the future, that you might again be kind of dependent on them? Yeah, absolutely. I don't think we have any models or evaluation that sort of takes that into account, but it's certainly sort of an aspect there. I've been part of the CNCF community, being a CNCF ambassador this year,
Starting point is 00:50:18 but being a member for a long, long time. And sort of a lot of the faith in OpenTelemetry is that it's well-structured. It's one of these, the Apache foundation would be another, and sort of it's, there are certain checks and balances in place, even on the community side to make sure that this isn't hijacked or suddenly sort of just reversing course or pulling the rug under you. So we do have a certain confidence there, but it's definitely something that we need to balance and also take into account that, yeah, we have a dependency.
Starting point is 00:50:57 Absolutely. And to a certain degree, it's uncertain as well, but we at least from the specific to sort of open telemetry and Kubernetes and et cetera, we feel that it's diverse enough that it's harder to sort of just pull the rug and suddenly change the license and sort of something else would be, it would be completely different if it was just one vendor sort of controlling the whole community and sort of the whole project and just deciding from one day to another that we are going to change the license. So we are very, very risk averse when it comes to sort of these, when they are a single vendor that can sort of just, even if it's open source, regardless if it's closed or proprietary or whatnot, if they can just
Starting point is 00:51:56 from one day to another decide that now we are going in this direction here, we will take that into account, at least my team building the application platform and the observability platform. Yeah, I was hoping for that answer because I wanted to make and give the people that are listening the confidence that OpenTelemetry is a mature enough, diverse, and well-established project. But still, these questions come up, and I'd like to hear this also from other fellow C&GF investors.
Starting point is 00:52:29 Brian, any final thoughts from you? No, that other one was my last thought. It's been very educational, as always. I guess we'll... Yeah. Did we cover everything? No, I was just asking Hans Christian, too. Did we cover everything? No, I was just asking Hans-Christian too. Did we miss anything?
Starting point is 00:52:46 Did we miss anything important for people, especially that want to follow your path? I'm mostly on LinkedIn these days, sort of dropping out the whole Twitter X when it just became so toxic. So please connect with me there. Really, really enjoyed being invited here. And there are so many other facets of the platform
Starting point is 00:53:13 that we could have spoken about, talked about. So yeah, maybe I will come here for another episode. Exactly. And I'm looking forward to being in Bergen because while we will probably see a lot of containers in the Kubernetes clusters, I do hope that we also see some containers on real ships because you mentioned that Bergen is a big Hansa city.
Starting point is 00:53:35 So there will be a lot of big ships. Yeah, let's hope so. Great. All right, well, thank you again, Hans Christian. Sorry, I just totally blew your name I called you Han wow it's like Han Solo Han Christian
Starting point is 00:53:49 Han Solo I get it no worries so we went from Star Trek to Star Wars now so full circle now we just gotta somehow get in Blade Runner and the Dune Dune stuff so to be fully
Starting point is 00:54:01 fully sci-fi nerdy anyway thank you very much for being on today it's been amazing to have you uh great stories it's it's probably one of the uh more i hate saying more unique because that's not a real phrase because you can't say more unique but i'll say it anyway it's one of the more unique discussions we've had around open telemetry especially around like the the government push behind it and and how that's all working so um and really thanks for the sharing some of of the pitfalls you encountered along the way. That's always key.
Starting point is 00:54:31 We always hear the sunshine story, but we don't hear the troubles along the way. And I think if more people hear the troubles along the way, they'll be able to look out for them to learn from others and not have to repeat the same lessons. Everyone else has, has learned the hard way. Anyway, thank you very much.
Starting point is 00:54:49 Thank you. Our listeners. Thank you, Andy, as always. And we'll see you all next time. Bye. Bye.
Starting point is 00:54:56 Thank you.
