PurePerformance - Pitfalls to avoid when going all-in on OpenTelemetry with Hans Kristian Flaatten
Episode Date: September 2, 2024
Hans Kristian is a Platform Engineer for NAV's Kubernetes platform Nais, hosting Norway's welfare services. With 10 years on Kubernetes, 2,000 apps, and 1,000 developers across more than 100 teams, there was a need to make OpenTelemetry adoption as easy as possible.
Tune in as we hear from Hans Kristian, who is also a CNCF Ambassador and hosts Cloud Native Day Bergen, on why OpenTelemetry was chosen by the public sector, why it took much longer to adopt, which challenges they had scaling the observability backend, and how they are tackling the "noisy data problem".
Links we discussed in the episode:
Follow Hans Kristian on LinkedIn: https://www.linkedin.com/in/hansflaatten/
From 0 to 100 OTel Blog: https://nais.io/blog/posts/otel-from-0-to-100/?foo=bar
Cloud Native Day Bergen: https://2024.cloudnativebergen.dev/
Public Money, Public Code. How we open source everything we do! (https://m.youtube.com/watch?v=4v05Huy2mlw&pp=ygUkT3BlbiBzb3VyY2Ugb3BlbiBnb3Zlcm5tZW50IGZsYWF0dGVu)
State of Platform Engineering in Norway (https://m.youtube.com/watch?v=3WFZhETlS9s&pp=ygUYc3RhdGUgb2YgcGxhdGZvcm0gbm9yd2F5)
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me my co-host Andy Grabner who today is not mocking me.
I know on the last podcast recording you probably heard me mentioning him mocking me
and today he's just smiling quietly, very pleased with himself that he's not mocking me.
So Andy, how are you doing today?
I'm very good. I just, uh, wanted to, uh, make sure that people know that you're not a fortune teller, because I think you assumed I am again.
I did think I was getting ready for it, yeah.
So I, I'm no longer a fortune teller. The gig is up, or the jig is up, what's the, uh, the saying? Whatever, it's probably an American thing. And you know what, I could also have, uh, I think not even a fortune teller could have predicted that today we have two Norwegians.
Oh yeah, that's funny, I mentioned it because I'm so minorly Norwegian. But yes, yes, I do have a little bit of background, but our guest is full on, I assume.
I don't know.
Maybe he'll tell us he just moved there two weeks ago.
Who knows?
But then he maybe also changed his name to something that sounds very Norwegian.
I wonder what would his name be?
I don't know.
Hans Christian, not Andersen, but maybe Flaten?
Yeah.
How do I pronounce it correctly?
Hi, Hans Christian, how are you?
Hans Christian, yes.
I think it's very spot on, Andy.
It's also easy for, I guess, easier for me as a German speaker,
like an Austrian who speaks German, Hans Christian.
And then Flaten, how do you pronounce the double T?
Yeah, so the double A is a Norwegian syllable,
Ã…, so it's flotten.
Oh, flotten.
Ah, okay, yes.
Yes, so it's a little bit old.
Actually, it comes more from Danish language.
And then, so this guy that I believe you mentioned,
Hans Christian Andersen, he was sort of like,
he went around and collected fairy tales in Norway
and wrote sort of like the book on fairy tales from Norway.
That's, yeah.
Well, so this is the first thing people learn
when they listen to our podcast today.
First of all, how to correctly pronounce your name.
It's not flattened, like I said.
And then Hans Christian Andersen,
if in case you have escaped the stories,
then you should check him out.
Maybe we'll add a link to it in the description too.
But Hans-Christian, to get into a more serious topic,
the reason why we talk today is because,
first of all, you reached out on Cloud Native Bergen,
which is a conference you run in Norway.
Maybe you want to quickly explain since when you do it
and why you love doing it.
Yeah, I'll do that.
So Bergen is sort of the westernmost city of Norway.
It's an old Hanseatic town.
So we had close relations with the German trade routes.
It went through and for a long time,
it was actually sort of the capital of Norway.
So yeah, we're very proud of the heritage
we have over there.
And we have a great cloud-native community as well.
There's a lot of financial fintech companies in Bergen.
So there's a strong technical community there.
And we are organizing for the first time ever
cloud-native Bergen. So by
the time the podcast is out,
the CFP
is closed and
the agenda has been published, but you can
still get your tickets because
the conference is not until
October 30th.
So it will be sort of like a pre-Halloween.
Maybe there will be some
horror stories from production.
Maybe.
Actually, it's a nice theme though. Some people like horror stories, but typically in an IT sense we don't wish them on ourselves, only on our enemies or competition maybe.
But still, I think stories about what is not going well are actually something that I would like to dive into today.
Not meaning that things fail, but what we can learn from things that we should have done differently.
So thank you so much for this, for the Cloud Native Bergen folks.
If you want to attend that conference, the links and the details are in the description.
Another thing: once we connected and talked, I stumbled across a blog from you, and it's called "OpenTelemetry from 0 to 100: the story of how we adopted OpenTelemetry at NAV", Norway's largest government agency. I read through the blog, and OpenTelemetry is obviously,
especially Brian, where you and I are, right?
We are constantly reminded that this is a big topic.
We see a lot of people looking into OpenTelemetry, adopting it.
And there's obviously, we hear a lot of great stories.
But today, before hearing the great story and how you adopted it
at the Norwegian agency,
I want to first hear the things that you typically don't hear when you listen to a presentation
at a conference where everybody's just saying, we achieved X, Y, Z and where everything is
perfect.
So, Hans Christian, if I could ask you, maybe let's start off with some of the challenges you had with adopting it, some of the things people should know, that you would have liked to have known before starting on the journey.
Yeah, so right off the bat, these things take way longer than you anticipate. So we were already preparing mentally for this, and we have been through a number of transformations and changes over the past, and they are still ongoing, many of them,
as in most organizations. And this is no exception. And it's so intricate. There's
a lot of moving parts. The standard isn't really standardized for all the different areas. The
documentation is certainly not complete. So we had to dig through source code and figure out, oh, this markdown is slightly different from what's on the website; which one is actually the one that we are supposed to follow for the version we are on?
So it's a huge area.
And as you all know,
sort of observability
and then adding on a new standard, a standard to rule them all, makes it just infinitely more complex.
So I wish that more people would have prepared me for the complexity there
before digging in.
So that would definitely be one of the first areas there.
And then we have different, smaller, more concrete areas, like a lot of noisy data when we started.
Because we went through, just to give a little bit of context,
NAV being the Norwegian Labor and Welfare Administration,
and we have been doing Kubernetes for close to 10 years.
So by now we have close to 2,000 individual applications
or microservices or whatever you want to call them
in our clusters.
So a fairly large amount.
We have roughly 1,000 developers across 100 plus teams.
So it's a fairly sizable operation here.
And getting developers to do anything is a challenge in itself, never mind, oh, go out and instrument with all of these SDKs and libraries and so forth. So we knew that we had to provide an easier on-ramp at least. We had to cater to both worlds: those that say, oh, we want to be completely in control and know every single binary or SDK or library or package we pull in, because we have those teams as well. But the majority, they are like, oh, I already have a backlog of several years of development, please don't add more to the backlog, make this just work out of the box. So we needed a way to automatically instrument. And I was, I wouldn't say against Java, but I've come to like Java more and more, because the JVM at least has some really, really solid hooks. And recent versions of Java are actually becoming quite nice. But the way that you can hook into the JVM and get in there without completely breaking the application actually works. So we were able to leverage the auto-instrumentation part, and the majority of our applications are Java, so they mainly just worked out of the box to get them instrumented with the OpenTelemetry SDKs.
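To illustrate the two worlds Hans Kristian describes, here is a minimal, hypothetical Java sketch (class, tracer, and attribute names are invented for illustration, not NAV's actual code): the out-of-the-box path attaches the OpenTelemetry Java agent at startup with no code changes, while full-control teams instrument against the vendor-neutral OpenTelemetry API themselves.

// Out-of-the-box path: no code changes, just attach the OpenTelemetry Java agent at startup,
// e.g. java -javaagent:/otel/opentelemetry-javaagent.jar -jar app.jar
// and point it at a collector via OTEL_EXPORTER_OTLP_ENDPOINT.
//
// Full-control path: instrument explicitly against the OpenTelemetry API.
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;

public class CaseService {
    // Resolves to whatever SDK or agent is installed at runtime; a no-op if none is present.
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("example.case-service"); // hypothetical scope name

    public void processCase(String caseId) {
        Span span = tracer.spanBuilder("process-case").startSpan();
        try (var scope = span.makeCurrent()) {
            span.setAttribute("case.id", caseId); // hypothetical attribute
            // ... business logic ...
        } finally {
            span.end();
        }
    }
}

Either way, the code only depends on the OpenTelemetry API and protocol, so the choice of backend stays a configuration concern rather than a code change.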
But that was only the beginning, because then suddenly you had an influx of so much data.
So we met scaling issues every step of the way when setting up this platform on our own infrastructure. It was like, oh, the ingester needs to be scaled. Then the backend needs to be scaled. The query engine needs to be scaled, and so forth.
Sort of the list never ended.
So there was constantly a new sort of bottleneck.
And then sort of once you had all of the data,
sort of making sense of it all,
sort of how do we actually explain to the teams
that this isn't sort of like,
well, there is certainly advantages
and some really, really
strong motivations on adopting open telemetry, adopting a standard. It's certainly sort of like,
it's not a silver bullet. It never is. The tool is never a silver bullet. It just doesn't magically
just solve all your problems. It makes some problems easier to solve, but there's still
work. There's still sort of, you need to learn to use the tools and sort of how to use the data and
get them making sense of the data correctly. So that's again where we prepared ourselves
mentally, but we could have been much, much better prepared not only how the teams
were supposed to instrument and get their applications
into the OpenTelemetry ecosystem, but then on the other
end, the day two operations, how are you actually going to use
this data? And to a certain extent, I don't think the OpenTelemetry community knows this either. There was no prior guidance like, oh, these are the questions that you are solving. At least we didn't find it, so then it's very, very well hidden. So that's telling.
Can I quickly interject here?
Because I think that's an interesting thing that you say.
Observability is more than just gathering the data, right?
Open telemetry, the SDKs are solving the,
how do we get the data from the apps, from the frameworks?
Then we have the open telemetry collector who can collect it, transform it
and then send it to a backend. But this is only half of the
story. And already here, it's challenging
as you said, right? It's challenging to instrument the right things. It's challenging
to scale the data collection, the collectors to make sure
you're not losing data.
It's challenging.
And I look in your blog post,
you have some examples of
when you then came up with a lot of rules
of reducing the noise
to basically not capture things
that are not essential.
And what is interesting now,
if there is no one,
and maybe I'm wrong
because I'm not as deep
maybe into the whole community as you are
because you're working with this for so long.
But if everyone that adopts open telemetry as the pure thing
and then doing everything themselves needs to learn these lessons,
need to configure the same exception rule, the same exclusion rules,
needs to figure out a way how to best scale your infrastructure,
then it feels like a lot of duplicated effort. So then my question to you is, why did you, or why did the Norwegian government, even decide to go that route and not just say, this is not our field of expertise, we don't want to build our own observability platform? Why did they go this route and not go with a vendor that has maybe five, 10, 15, 20 years of experience and just giving you an out-of-the-box solution?
Yeah, very good question. And the primary one was that we really, really believe in open standards.
So the goal here wasn't that much of building a platform,
but that was just an artifact of proving that the standard works, more or less.
I'm not married to any of my tools.
I'm married to the protocol and the format of my telemetry data.
Because what we have been through over and over again
was that we needed to ask our developers
to not only change the tools, and that's hard enough,
and sort of where do you view and what are you logging into,
but also instrumenting their code.
And that's just such a time-consuming task.
So that was the underlying goal here was to,
okay, say that hopefully if our bet is correct here,
we don't need to instrument the code again.
We can swap out whatever tracing backend
and log storage backend and viewer
for whatever makes most sense
and gives us the best visualization
and the best sort of understanding of the data, not sort of
like locking down the data itself or the format.
But we felt that in order to understand
the underlying technology and sort of the
nitty-gritty details, we needed to sort of get our hands dirty
to a certain extent, but
this is not the final chapter. This is not the end of the road. Hopefully, it's the end of the
road when it comes to instrumenting, and then we can use our time better when it comes to sort of
making sense and where would we like to have this data. And then the second part there also is that in government,
things like procurement is really, really slow.
And since we really didn't know what we needed,
it was nearly impossible to go out and make a procurement
and make a public tender to sort of like,
oh, we don't even know what we need
and what we should ask for. So again, it's proving sort of that this has its worth.
And then once we sort of have, so we are still sort of treating it as sort of a proof of concept,
even though it's running in production and we are making it as production ready as we are able to do, at some
point we will have the time and ability to take a step back and say, where do we go next? Is Open
Telemetry working as intended? And are there better options when it comes to the tooling, to the
storage, to the analysis, and what extra can we do with this data here?
Hopefully by then, the ecosystem has matured when it comes to integrations with OpenTelemetry,
and there will be different options for us to choose from. Maybe, due to our size, there's not one option that fits all, and maybe certain domains, certain teams, technologies, or what have you, will favor these tools here and some others will favor those there. And there will be some overlap, because we can send the data to multiple sources.
Because we have the collector and we can just say that, oh now we want to send it here.
And maybe the teams can in larger degree choose their own journey in that sense. Because that's
the one thing that even though we are a large organization, we actually have very disjoint domains.
Unlike, say, a bank, which would have one customer, and then some core services that are used by all of them, and additional services on top, but all very intertwined, the domain still being very financial, with subdomains and nuances of course. All of the services we provide are actually quite independent. Your paternity leave has nothing to do with your disability payments. There might be some calculations here and there, but the act of delivering whatever user interface the user and the caseworkers are interfacing with will be completely different in most cases. That allows us to treat them a little bit differently.
I'm taking a lot of notes as always.
And one of the things that I really think should be a quote out of here is the goal that you try to achieve with going with OpenTelemetry and kind of proving it out: you don't want your developers, with every cycle of your tooling, to also have to change the way they do coding, the way they do instrumentation. And I think this is why betting on the standard makes a lot of sense.
And I really like that.
So don't force any tool decisions that you do in the backend
onto the way developers are actually instrumenting their applications.
And I think that's why going with OpenTelemetry makes a lot of sense.
And I also agree with you.
I mean, OpenTelemetry has come a long way, and so has adoption also by commercial vendors.
You know, whether it's us, Datadog, New Relic, you name it,
we are all in the observability space.
And I think we all take OpenTelemetry very seriously.
We also see it as a big benefit
because it means that we,
while we still obviously have our agents,
OpenTelemetry is a great source of information for us. And like what you do, 100%,
you try to go to 100% OpenTelemetry, that's
what we see more and more people asking for.
On the Java side, just jumping quickly back because you said
you kind of fell in love with Java. I just recorded a video this week on GraalVM. I'm not sure, have you looked into GraalVM at all?
I believe we have teams looking at it, experimenting it.
Maybe we even have it running in production.
I'm not entirely sure because there's just so many applications.
But that's really interesting.
But of course, you're trading your startup and memory and efficiency during runtime into a longer build cycle.
And I know that the same teams are also trying to get their builds
as short, as quick as possible.
There are some teams that are really, really focused, to the extreme, on the time from when you commit to when you get a thumbs up or a thumbs down with regards to that commit, to not add on additional seconds there.
So the only reason I bring it up here is because I learned this this week when I recorded the video with one of our colleagues. GraalVM compiles Java code into native images, which means the whole JIT compilation, the just-in-time compilation, where you can actually use an agent-based approach to modify bytecode, no longer works. And this is what we traditionally use for auto-instrumentation at runtime. And so new approaches
need to be found. And I know that in the OpenTelemetry
community, there are also initiatives going on to instrument GraalVM native images. We also, from our engineering team,
they figured out a really elegant solution of actually instrumenting the
compiler. And then as the compiler
creates the native
images, we're instrumenting the
compiler, so the compiler produces
a native image that then has
the instrumented code in there. So a really
interesting approach.
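As a rough sketch of what that constraint means for application code (an assumption-laden illustration, not the compiler-level instrumentation Andy describes): since a GraalVM native image cannot be modified by a -javaagent at runtime, one common fallback is to wire the OpenTelemetry SDK into the application itself, for example via the autoconfigure module, so the instrumentation is part of what gets compiled ahead of time.

// Build-time wiring of OpenTelemetry, usable where the runtime -javaagent approach is not,
// e.g. in a GraalVM native image. Assumes opentelemetry-sdk-extension-autoconfigure is on the classpath.
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;

public class NativeImageApp {
    public static void main(String[] args) {
        // Reads the standard OTEL_* environment variables (exporter endpoint, service name, ...)
        // and builds an SDK that is compiled into the image together with the application.
        OpenTelemetry otel =
            AutoConfiguredOpenTelemetrySdk.initialize().getOpenTelemetrySdk();

        Tracer tracer = otel.getTracer("native-image-demo"); // hypothetical scope name
        Span span = tracer.spanBuilder("startup").startSpan();
        try {
            // ... application logic ...
        } finally {
            span.end();
        }
    }
}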
It's interesting you mention that.
Only thinking, like we've had a bunch of
conversations in the past about when people are
choosing which platform to use, if they're going to use Kubernetes, if they're going to use serverless, right? Thinking about more than just what they want to deploy to. But the people who are developing these new, I don't know, these new variations of code and compilation and all, are not thinking about how anything else is going to be done on it.
Right?
And we almost need to shift over to them to say, please, yeah, find cool new ways that
are going to be more efficient, but we also need to observe it.
And you have to have something built in for observability so that suddenly anybody who uses that is not
blind. It's just an interesting conundrum. It's the first time
I'm hearing that and thinking of it. Because it's usually on the other side, right? Hans Christian,
people will pick something because they think it's what's going to be
best for them. And then they come through, okay, now we need to observe it, now we need to do security
on it. And it's really, really difficult
because what they chose might not be
suitable for that,
whereas that should have been part of the consideration
early. It feels like that consideration has
to go into these new things, too.
Interesting, very interesting.
Yeah, absolutely. And we
are trying to sort of
lift or embed
the focus of observability into the teams, because we still have a lot of teams where it gets no thought, or it's just an afterthought, the whole observability thing. So we're still trying to create that culture in all of our teams. To a large degree,
we have been able to do it with security. We have a security champions program within our
organization that's been really, really successful once they sort of found out that, oh, it needed to
be sort of like an opt-in with more FOMO, sort of like, oh, you're missing out and there are some cool events, et cetera, instead of sort of like a mandatory top-down,
sort of, oh, you need to have a security role
and something on your team.
So trying to do the same when it comes to observability,
but it's a long road there.
Yeah.
One additional question on,
so you said earlier you've been 10 years
on Kubernetes. Are you
the platform
engineering team? Is everything
Kubernetes or do you also have
any other systems that you connect to that
are kind of running in the
quote-unquote traditional more legacy
world?
We have lots of legacy systems.
It's just that it's not
under my team. So the
platform team that I'm a part of,
the Nais platform, it's
100% Kubernetes. So we do
have a fairly large on-premise
environment. So we have almost
50-50 split between our on-premise
clusters and our Google Cloud
ones. But it's
purely Kubernetes
at that point. And that is sort of the pattern: when services within our organization are modernized, they are placed into Kubernetes.
We have everything all the way back to mainframes still running in our organization, and then everything in between: middleware servers, application services, running on bare metal servers or virtualized servers.
Is there ever a need from an end-to-end kind of responsibility perspective
to get consistent observability from your, let's say,
new stack from Kubernetes into the mainframe?
Is this ever a need or is it just completely different silos
in the organization and they don't touch base?
Well, to a certain degree, the large majority
is sort of these very sort of isolated pockets.
They have their own services that they are responsible for
and then not integrating too much with the legacy services.
And these larger things that are not modernized
are also sort of a little bit in the same boat there.
So of course, there are never sort of like rules without exception.
There are certainly some services that still call
more of their legacy or other external services, definitely.
And that was, well, that's more of a second-order thought, but from our initial evaluation of OpenTelemetry, they even have a mainframe working group.
So even on the mainframes,
there could be the possibility of us instrumenting it
or getting OpenTelemetry data from it.
So it felt like a safe bet
because OpenTelemetry, while there is good Kubernetes and container support for those applications, has no hard ties into the Kubernetes community.
It's sort of, it doesn't really concern itself
with where you're running it.
It's more, it's on a higher order
or depending on where you are, of course,
but it's more application focused
or it's network focused and sort of like,
yeah, it works great in Kubernetes, but it works just as great outside Kubernetes as well.
I remember we had a podcast around mainframe and open telemetry with one of
our colleagues. So it is pretty cool to see what's been happening in that space and what
open telemetry and also other open source initiatives
have made happen, right?
I mean, have triggered, as you said,
the great thing about being so open
and also defining standards
is that once you have a standard
and people can agree on it,
then you all of a sudden
potentially have a really cool ecosystem
that evolves and develops.
Hans-Christian, I have one more question.
You mentioned the challenge of too much data, or I think you called it, how do you deal with the noise of the data?
And this is the same problem that we've had since the dawn of observability, right?
So it's either not enough data or it is too much data.
The question is, what is the right data? And my question would be, how do you educate, or do you have an answer on educating engineers on what is the right amount of data? Like, do people have to go through a training program, or do you have best practices? Do you have any checks in your pipeline to make sure that nobody is logging sensitive data, nobody's logging too much data? Any insights on this would be really interesting.
Yeah, absolutely. And that's where I wished we had put more effort in earlier, because this is the hard part. Of
course, the fun and technical stuff, it's not really the hard part. It's sort of getting people to use it correctly.
So what we do is that
we ask back, what is the question?
What is it you want
to answer? And then focusing on
what are the important user journeys that you want to
be able to make sure actually works. And surprisingly, while they can,
the teams often can say that these are the characteristics of the system,
they are not that certain of what's the critical user journey of this one application here.
So they actually then have to go back to their own stakeholders
to sort of, oh, but we have all of these features here,
but what's actually the most, where should we sort of start
to make sure that this is actually working?
And it's working not only correctly,
but it's performing to a level that we feel
is satisfactory to our users.
So it's sort of a long journey here. As you mentioned, in the realm of observability, the technical part and the data part is only a small part of it. And it's so much more about the mindset
and sort of what do you want to get out of it.
So that is where we are now increasing our effort
to sort of have these training programs
to enrich our documentations
and sort of to ask these questions here and sort of
get the developers more into a state of mind
of thinking about this more rationally or specific
to what the user might be
doing at a given moment and what can go wrong and how should we
actually sort of react to that and
where do we need to know if something goes wrong. And more than just sort of, oh, just instrument,
just get as much data as possible. Because then we end up in just the same position that we are in today, where we have a lot of data, but we really don't know, or the developers don't know, what they should expect. There's lots of noise; should we react on it? We don't know.
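To make that question-first idea concrete, here is a hedged Java sketch (the journey, metric, and attribute names are invented for illustration, not NAV's conventions): instead of switching everything on, the team gives one agreed critical user journey its own span and measurement, which is what alerts are then built on.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;

public class SubmitApplicationJourney {
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("journey-demo");
    // One explicit measurement per journey the team has agreed actually matters.
    private static final DoubleHistogram journeyDuration =
        GlobalOpenTelemetry.getMeter("journey-demo")
            .histogramBuilder("journey.submit_application.duration")
            .setUnit("s")
            .build();

    public void submitApplication(String benefitType) {
        long start = System.nanoTime();
        Span span = tracer.spanBuilder("submit-application").startSpan();
        boolean ok = false;
        try (var scope = span.makeCurrent()) {
            span.setAttribute("journey.name", "submit-application");
            span.setAttribute("benefit.type", benefitType); // hypothetical attribute
            // ... the actual journey ...
            ok = true;
        } finally {
            if (!ok) {
                span.setStatus(StatusCode.ERROR);
            }
            span.end();
            journeyDuration.record((System.nanoTime() - start) / 1e9,
                Attributes.builder()
                    .put("journey.name", "submit-application")
                    .put("success", ok)
                    .build());
        }
    }
}

Alerting on the duration and success rate of that one journey, rather than on every raw signal, is essentially the SLO-style thinking Brian gets into a bit later in the conversation.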
Hey, Brian, this feels like a lot of the conversation
that we have with the people that we interact
with.
I mean, on our end, we are capturing a lot of data by default without people having to
think about what to capture.
But in the end, it comes back to this educational piece.
What is it really that is important?
And then what data do I then look at?
And for me, it's fascinating.
It's also interesting, too, because a lot of people
have ideas of what data is important, right?
And they might be missing the big picture.
This is a bit of a stretch of an example,
but you might have people coming in saying,
well, we need to know what the CPU utilization of each thing is at every single time, we need to know how many times these methods are being fired off, we need to know X, Y, and Z, right? And then when you turn around and ask, well, why? Well, that's what's going to help us fix it, right? Then it's like, fix what? And they say, well, if something goes wrong. Well, how do you know something goes wrong? And they stop and think, right? Because if you think about it from like the SRE point of view, it starts with, what is it that you're delivering? And are
you delivering it okay? Sometimes, especially if you're in a dynamic situation like Kubernetes,
where it can auto-scale and add
more pods to handle the load, it doesn't necessarily matter what the CPU of each pod is, so long as
it's scaling properly and it's delivering error-free and in a timely response, right?
Those are maybe nice-to-know things, but the more important thing is, first of all, are you delivering against your SLOs?
And then when you have that front end,
the user-facing piece defined, that I believe
helps you fill in all those back ends. Because as you find out from game days
or chaos experiments,
what are the things then that help you troubleshoot those things
and that can help you backfill into what are the important things
we need to know on the backside.
But a lot of people start very granular because, again,
most developers are working in that little space of theirs.
So unless you have somebody outside of that looking bigger picture
to say, this is our goal, what do we need to do to be able to achieve our goal?
And we can define what we need,
what information or data we need
to monitor that properly.
I think that's one of the key pieces to that.
Yeah, I don't know. That's just my thoughts on that.
I don't...
Yeah, and in many cases our stakeholders don't really know, or in certain areas, at certain points, have not matured enough, because the thought is still like, but why do I need to concern myself about individual pieces? Shouldn't everything work? Why isn't this, this sounds so easy, you're making this system here and it should work, it should be binary. And to a certain degree it's a little bit about our legacy mindset, and it's also about working in the public sector, where we have all of these rules and regulations, and it's fairly binary: we need to abide by all of them. There are very few things that we need to concern ourselves with that are optional. All of these criteria need to be checked, and then you get the service. It's not optional to just say that, oh, we were able to check four out of five, that's good enough. And while the private sector certainly has rules and regulations that they need to abide by, there is still a lot more flexibility when it comes to decisions from the company.
You need to go high enough up to have decision-making authority,
but saying that, okay, at this point, we feel that it's good enough.
We are certain enough, and we are willing to take the risk.
That risk mindset isn't really translated very well into the public government
because it's, again, there are no one there to say that we can actually go outside
or disregard some rules or regulations because that's all we need to care about is these things.
There is no, oh, but our company has a mission, you know, it's created to serve some purpose. We are created to cater to whatever the politicians and the government have decided is how it should be.
Yeah. When you talk about filtering out the noise, I'm curious, right? And I'm going to use these words more from their definition point of view: the difference between data and information, right? Information is processed data that's been made sense of.
So, is it a matter of, we still want to collect?
Because it's new to us, right?
To Andy and I, it's like our tool collects what it collects, right?
And that's it.
When you're doing open telemetry, you're sitting down with a blank canvas
and you can collect whatever data you want, right?
And you'd be figuring out what overhead you're adding
and doing all that kind of stuff.
But I'm thinking, I don't know why I had a Star Trek, I'm not even a big Trekkie or
nothing, but if you're on the bridge of the Enterprise, you have certain
alarms that are going off that certain things are going wrong. And it could be Scotty, Mr.
Scotty in the engine room is having an issue. It's saying the engine is having a problem.
Well, on the bridge there, you would want that to be information. You would want that to be processed data. You'd want that to be something that somebody looked at.
What are the key data points to alert us to when something is wrong in the engine room?
And what key area of the engine is it?
Maybe it's not going to be the granular bit because you're not going to be looking at every single piece of data at that point. But then when it goes to say, Mr. Scotty in the engine room, he might need all
those other data points to figure out exactly where that is. So, but it wouldn't be necessarily
in an information format. It would be, you know, if you think about sensors on every little single
thing, okay, we know it's coming from this area. And now that I'm looking at the data itself,
I can see this sensor is telling me it's this little piece.
So when you're talking about filtering out the noise of the data,
is it that you still want to collect a bunch of the data, but you're only going to process certain pieces into information?
Or are you also looking to say,
what data do we just not collect anymore?
Or is there a balance between that?
It's just a new concept.
I mean, as it's hitting me as you're talking,
it's just having a hard time.
Like, how do you even tackle that?
Like, it's hurting my brain thinking,
how do you tackle that problem, you know?
Absolutely.
So it's mostly about sort of dropping
sort of signals that we don't care about.
They are just noise in its purest sense.
Because very little of what's coming out of OpenTelemetry
is refined or anything.
It's very raw.
So we need to sort of collect it.
And then the different signals are,
if done right, you can correlate them.
And they are linked together.
And then the tool where you are storing it and viewing it will provide the bigger picture.
And that's what we are still learning about, what's actually collected here.
Because to a large degree, we are relying on this auto-instrumentation and that's just
a black box. We don't know
what it's collecting and we need to figure out what data do we have here? What data should
we expect to be there, and how can we make it into more valuable signals and alerts that are actually actionable.
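For a sense of what "dropping signals we don't care about" can look like at the SDK level, here is a hedged Java sketch (the endpoint names are placeholders, not necessarily NAV's actual rules): a custom sampler that never records health-probe spans, registered when building the tracer provider.

import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.data.LinkData;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import io.opentelemetry.sdk.trace.samplers.SamplingResult;

import java.util.List;

// Drops spans for endpoints that only produce noise; keeps everything else.
// Register with: SdkTracerProvider.builder().setSampler(new DropNoiseSampler())...
public class DropNoiseSampler implements Sampler {
    private final Sampler delegate = Sampler.parentBased(Sampler.alwaysOn());

    @Override
    public SamplingResult shouldSample(Context parentContext, String traceId, String name,
                                       SpanKind spanKind, Attributes attributes,
                                       List<LinkData> parentLinks) {
        // Hypothetical noise rule: never record health or readiness probes.
        if (name.contains("/isalive") || name.contains("/isready")) {
            return SamplingResult.drop();
        }
        return delegate.shouldSample(parentContext, traceId, name, spanKind, attributes, parentLinks);
    }

    @Override
    public String getDescription() {
        return "DropNoiseSampler";
    }
}

The same filtering is often done centrally in the collector instead, which keeps the rule in one place for every language; that is generally how a platform team would prefer to roll it out.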
But, Christian, this now brings me back and I want to challenge one of the things you said
earlier. Earlier you said
you're looking into open telemetry
because you want to make
sure that the developers
don't have to change the tools every couple of
years. But if you look into auto
instrumentation and you're heavily relying
on auto instrumentation,
currently you don't know what is coming out. You also don't know what is coming out in a month or in a year, because somebody else is changing the auto
instrumentation for OpenTelemetry. Isn't that then giving developers a similar challenge?
Because all of a sudden with an update to the instrumentation, things may go away. A
lot of things come into the picture they don't even know about. Isn't that a big challenge?
Yeah, absolutely. And again, we are placing a bet that OpenTelemetry will follow in the footsteps of Kubernetes, where things are graduating and becoming stable. And that will apply similarly to the auto-instrumentations: once this signal here and this integration or instrumentation is good enough, it will graduate, or become deprecated, and we can rely more and more on it continuing to provide us the data that we are expecting. But yes, there are moving pieces here.
And to a certain degree, we need to, or we are willing to sort of live with that as well.
And because we need to find this delicate balance between sort of like locking everything
and saying that this will never and can never change and sort
of changing everything all the time.
And sort of the hope here is that it would provide a certain balance there.
But of course, things will change.
That's sort of inevitable.
But it's just a matter of how much pain it involves when it needs to change.
But yeah, good point.
Then one more thing.
How many people would you say are working in your team
to make sure that observability is actually available in your platform for your developers?
How many people take care of observability?
In total, we are close to 20 people working on the platform.
And then two of us are primarily concerned on the observability side.
Cool. Awesome.
And do you have a measure of success of adoption? Do you know, like,
hey, something happened, either, I don't know, either OpenTelemetry messed up with the latest
deployment or something happened and developers all of a sudden don't use it anymore. Do you
measure somehow and look into adoption rates and then also kind of act on that data?
We are following closely adoption rates and then sort of working with teams and applications that
are experiencing problems, often sort of suggesting that they start adopting OpenTelemetry
and sort of the tools that we are providing, when we see that this
is a prime example where if we had better insights, we could at least have understood
the issue faster and maybe even alerted once you had been aware that this issue here could
arise.
So we have a lot of, since we are such a large organization,
our first responders have a hard time figuring out
when someone calls about some part of the organization
not working correctly, figuring out what team is actually responsible, where would
the root cause actually, or where would it be likely. So most of that is done via intuition
today. So huge respect to that team that knows all of the old systems and keeps up with all of the new ones as well. But we want to give them tools so that once a user in this area here has reported an error, you can go and trace it back and see which other, maybe seemingly unrelated, area is causing it, or correlate a bunch of errors and see that, oh, this is due to a network issue between these two sites here, for instance.
Brian, for me, it's fascinating to hear all this because it's just things we deal with
and have dealt with over the last 10, 15 years since we've been in observability.
And the whole ownership discussion is just also fascinating.
Or fault domain isolation, as we call it.
First of all, where's the fault domain?
If you look at a large complex system, pinpointing the fault domain and then trying to figure
out what additional data do you have.
I also really like, Brian, your Star Trek analogy.
I think that's because the fault domain will be on the bridge. You may have, you know, like five lights and they should always be green,
but if one goes red, you know, the fault domain
and then, you know, who is responsible
for that fault domain? You go to Scotty
or you go to somebody else.
Sorry for the Star Trek fans.
I only went back to the original.
I know
there's a lot of people passionate about Star Trek
and all the iterations of it, so I'll just keep it there.
I think the other really interesting thing, too, is hearing this from the point of view of government.
Because when you think about, as we said before, there's two sides to observability.
There's data collection, and then there's doing something with that data.
And open telemetry has gotten really good with the data collection side.
And I think, as you and others have illustrated, Hans Christian, doing something with the data is somewhat the more difficult part. People like Andy and I are going to be biased to
saying, use a vendor, right? Because we've done it all.
But in a government situation,
a couple of other factors play in
where, number one,
they don't have to be as budget conscious, right?
So to set up a team,
not that governments want to waste money or anything,
but they're not beholden to shareholders, right?
Private sector or companies,
they're looking at every penny and nickel spent.
You might have some people who are, you know, looking at that on the government side, but it's not as drastic as that. You are beholden to the voters, but something like this isn't going to be so noticeable, so you can create a team that's going to create the back end for the processing for it without as much of an impact. Because it does always boggle my mind when we have companies who,
you know, the DIY approach. I understand the desire. I understand the creativity. I just
understand the curiosity, but it's like you're paying for all these employees as a company
to work on this thing that has no direct tie to your revenue, no direct benefit to your
shareholders or anything else like that when it exists.
So I think governments have a lot more leeway to play that game.
But I can also see the importance of not necessarily being tied to a vendor as well, because on the government side, you got to think about, OK, we're friendly.
This vendor is from country A. We're friendly with them right now.
And now tomorrow we're no longer friendly with them.
And half of our systems are in their software.
So it does give a compelling reason,
especially at least in the government sector,
to do things this way. But fortunately for us,
it doesn't give a compelling reason for a private sector.
Yeah, and in most,
I completely agree with you, Brian, for sort of, as you said, in private
sector, especially with the return on shareholder value, this is undifferentiated heavy lifting.
But these concerns that you are touching on, sort of being vendor agnostic to a certain degree.
And then, as I also mentioned,
sort of the procurement process here
is also ridiculously long,
sort of securing budgets, et cetera, et cetera.
So this was really the only way
that we could get this off the ground and then prove to the people that we need to prove it to
in order to get the proper budget
for saying that now
we have
good faith in this, we have learned
this, we have fixed
these issues and
this is helping our developers
and other areas in our
organization and our end users,
of course, getting a better product, better services.
Now it's the time for us to sort of tackle all the other problems
and sort of see what of this stack here can we now outsource
and purchase as a service to a large degree.
Hey, Hans-Christian, it's amazing how time flies
because at least on my clock,
it is almost at the top of the hour
and we typically record just about an hour.
By the way, you are one hour ahead of me, right?
In Norway, the Central European time?
It's six o'clock here.
Oh, it's six o'clock too, so we're on the same time zone.
I do have one final question though for you
because I know I had a recent discussion
with one of our customers on this
and he was talking about vendor lock-in
versus community lock-in.
So going all in into a community,
which is great, right?
Communities, especially if they're diverse,
but is there a fear, is there a potential that you're locking
yourself in into a community that you also cannot
control? And especially if that community might be controlled by some entities
that put a lot of people into development in the future, that you might
again be kind of dependent on them?
Yeah, absolutely.
I don't think we have any models or evaluation
that sort of takes that into account,
but it's certainly sort of an aspect there.
I've been part of the CNCF community,
being a CNCF ambassador this year,
but being a member for a long, long time.
And sort of a lot of the faith in OpenTelemetry
is that it's well-structured.
It's one of these, the Apache foundation would be another, and sort of it's, there are certain
checks and balances in place, even on the community side to make sure that this isn't
hijacked or suddenly sort of just reversing course or pulling the rug under you. So we do have a certain confidence there,
but it's definitely something that we need to balance
and also take into account that, yeah, we have a dependency.
Absolutely.
And to a certain degree, it's uncertain as well. But at least specific to OpenTelemetry and Kubernetes and so on, we feel that it's diverse enough that it's harder to just pull the rug and suddenly change the license. It would be completely different if it was just one vendor controlling the whole community and the whole project and deciding from one day to another that
we are going to change the license.
So we are very, very risk averse when it comes to sort of these, when they are a single vendor that can sort of
just, even if it's open source, regardless if it's closed or proprietary or whatnot, if they can just
from one day to another decide that now we are going in this direction here, we will take that
into account, at least my team building the application
platform and the observability platform.
Yeah, I was hoping for that answer, because I wanted to give the people that are listening the confidence that OpenTelemetry is a mature enough, diverse, and well-established project. But still, these questions come up, and I'd like to hear this also from other fellow CNCF ambassadors.
Brian, any final thoughts from you?
No, that other one was my last thought.
It's been very educational, as always.
I guess we'll...
Yeah.
Did we cover everything?
No, I was just asking Hans-Christian, too.
Did we miss anything?
Did we miss anything important for people,
especially that want to follow your path?
I'm mostly on LinkedIn these days,
sort of dropping out of the whole Twitter/X thing when it just became so toxic.
So please connect with me there.
Really, really enjoyed being invited here.
And there are so many other facets of the platform
that we could have spoken about, talked about.
So yeah, maybe I will come here for another episode.
Exactly.
And I'm looking forward to being in Bergen
because while we will probably see a lot of containers
in the Kubernetes clusters,
I do hope that we also see some containers on real ships
because you mentioned that Bergen is a big Hansa city.
So there will be a lot of big ships.
Yeah, let's hope so.
Great.
All right, well, thank you again, Hans Christian.
Sorry, I just totally blew your name
I called you Han
wow it's like Han Solo
Han Christian
Han Solo
I get it
no worries
so we went from Star Trek to Star Wars now
so full circle
now we just gotta somehow get in Blade Runner and the Dune stuff, to be fully sci-fi nerdy.
Anyway, thank you very much for being on today. It's been amazing to have you, great stories. It's probably one of the, I hate saying more unique because that's not a real phrase, you can't say more unique, but I'll say it anyway, one of the more unique discussions we've had around OpenTelemetry, especially around the government push behind it and how that's all working. And really, thanks for sharing some of the pitfalls you encountered along the way.
That's always key.
We always hear the sunshine story, but we don't hear the troubles along the way.
And I think if more people hear the troubles along the way,
they'll be able to look out for them to learn from others
and not have to repeat the same lessons everyone else has learned the hard way.
Anyway,
thank you very much.
Thank you.
Our listeners.
Thank you,
Andy,
as always.
And we'll see you all next time.
Bye.
Bye.
Thank you.