PurePerformance - Educating the next generation of Observability Heroes with Rainer Schuppe
Episode Date: May 20, 2024
Making observability available to everyone! This noble goal needs superhero powers in an IT world where there is so much chatter and confusion about what observability is, how to sell its value beyond a glorified troubleshooting tool, and how OpenTelemetry will disrupt the landscape. In our latest episode we have Rainer Schuppe, an observability veteran with more than 20 years in the space who has worked for the majority of the observability vendors. He shares his observability expertise through workshops in his home town on Mallorca, teaching organizations everything from basic to strategic observability implementations. Tune in and learn about the typical adoption and maturity path of observability within enterprises: from fixing the problem at hand, to justifying the cost to keep it, to enabling companies to become information-driven digital organizations! Also check out his OpenTelemetry journey in his blog post series.
Here are the links we discussed today:
Observability Heroes Website: https://observability-heroes.com/
Observability Heroes Community: https://observability.mn.co/
Cloud Native Mallorca Meetup: https://www.meetup.com/cloud-native-mallorca/
OpenTelemetry: https://opentelemetry.io/
Rainer on LinkedIn: https://www.linkedin.com/in/rainerschuppe/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always I have with me my mocking co-host Andy Grabner.
How are you doing Andy?
Good, just talking about mocking.
Mocking services, I don't know why this just comes to mind, but for whatever reason: mocking, mockingbirds, mocking services, mocking you. I have no idea why this came up.
You just had a brainstorm of mocking.
Yeah, yeah. Well, anyway, I'm good, thanks for asking.
Oh good, I did ask you.
Yeah, no, okay, I thought you were saying that sarcastically. I'm like, what are you talking about? So anyway, you know, we have a lot of rabbits around here. I like to call them bunnies because it's cuter, and they're cute little animals. Every day I find myself looking out the window at them, seeing them hopping around and trying to figure out what they're getting at, what they're trying to do. I'm always making mental notes of what they're doing, and I'm like, this is really an inefficient way
to track their
behavior and see what's going on
and try to understand if there's any
attack vectors of the neighborhood cat coming
in and
it's reminding me of something I just can't put my
finger on. I was hoping, Andy, you
might be able to help me remember what
this reminds me of.
This reminds you of... of... of... observing. Observing. Observability. That's right.
Maybe, you know what would be cool? If we had an observability hero that could help us and actually shed some light, and then maybe also tell us how we can apply the stuff we've been talking about for the last 16 years, or eight years on the podcast, not only to microservices, but maybe in a broader sense.
It would be amazing if we could do that.
It would be amazing. What if we just clap our hands? And, whoa, look at that,
we have a guest here on the podcast, Rainer Schuppe. Servus, Rainer.
Servus, Andi. And hello Brian, and thanks for having me here. And double thanks for calling me an observability hero. Well, I'm definitely a veteran in that space, doing that for 21 years officially for lots of APM companies, and unofficially for, I would think, 25 years. But that's stuff that you don't want to know about. That's like Stone Age manual observability, with logs and Perl scripts to extract some kind of metric from them.
Yeah, you could put that in a haunted house or something like that.
Yeah, I had to call you the Observability Hero because, guess what, you put it on your website: Observability Heroes.
He's very humble. I guess he's very
humble, yes. That's where
Observability Heroes are made. I'm not saying I'm
one, but I can help you to become
one.
That's also the name of the community that I just founded. And yeah, I hope it can be some kind of Justice League or Avengers for, you know, observability.
That would be the
goal. But it's a long
way. As you know, Andy,
I mean, we know each other for
16 years
or so. Yeah, it's been 16 years. You joined Dynatrace
while I had my stint there. So after working for Wily Technology, with the first APM solution out there that used bytecode instrumentation, I joined Dynatrace afterwards and then hopped on to other generations with AppDynamics and, most recently, Instana. But
what I really like is the
OpenTelemetry now, which
is kind of
the
common denominator for
observability that we can all
benefit from.
I hope it will become a street name
out there and will make it easier
to deploy observability solutions
in organizations because everybody needs it.
I mean, it's, yeah.
If you don't have it,
you surely run into troubles at one point in time.
So coming back to the superhero kind of theme: does this mean the OpenTelemetry trace is the secret weapon, or the secret craft, of the superhero? What makes an observability hero a superhero? What is it that you need to know? What are you teaching people in your, you know, in your classes? And it's been great to bump into you last week, just by chance.
Maybe a quick story
for those of you, right? I've been
fortunate enough to travel to Barcelona
last week giving a talk at a
cloud-native meetup about platform engineering
and I posted about it on LinkedIn. All of a sudden
Rainer sends me a LinkedIn message.
Hey, I'm just around the corner.
I'm on Mallorca. And I said, I thought,
hmm, Mallorca is not just the next town over.
But I guess geographically-wise, it's not that far away.
No, it's a 30-minute flight.
It's an island, so I have to hop over.
It's either a couple of hours on the ferry or just jump on a plane, get over to Barcelona.
And guess what? Next week
we have a CNCF meetup
on Mallorca in a
very nice location in Palma.
So whoever is on the island
and likes to know more about it, it's not
specifically observability, it's CNCF.
But I know that
Pera is there, one of the Dynatrace
engineers. And it's actually
again like the one in
Barcelona,
sponsored and initiated by
some Dynatrace folks.
So, yeah.
What does it take to...
Sorry, I'll answer your question.
I love Spain, so I
wanted to talk about Spain, but we're not here to talk
about Spain.
But you can come here anytime.
My wife and I, we are having a co-working space,
so we have a very good internet connection,
a really good coffee, and a very
relaxed atmosphere around.
So as soon as you step out of the office,
you're like in a 4,000-people village
in the south of Mallorca,
and everything goes away
and then everything comes back.
And for me, for example,
the idea about observability heroes.
Because when I go back in time,
my father was a mechanic
and he repaired things.
And that's what people applauded him for
because he had the endurance
to look for stuff that others didn't.
So he was like,
this was his passion,
like finding the root cause of a problem and then fixing it.
And I have inherited that.
And because I never give up when somebody presents me with a problem,
it's my obligation to solve it.
Unless it's really that bad that I said,
sorry, I'm the absolute wrong person for that.
But as long as I see a chance that I could solve it,
I'll give my best. And
being 30 years in the IT industry, I have seen everything. I started at the end of a hotline, answering end customer calls, then did some operations for PDP-10 mainframes from DEC at that time, and then went into consulting and architecture and ended up in pre-sales for APM companies. And all the time, people called me when there was something wrong, because I had an understanding of all these systems.
And what it takes to become an observability hero is basically: you need to have an understanding of the systems,
And that's different from organization to organization.
You need different tools for, let's say, the old ones with monolithic or three tier applications.
And then for microservices and serverless, you need a different tool set; generically the same: logs, traces, and metrics.
But in different, how would you call it, Brian?
Different flavors.
Yeah.
Varieties, flavors, yeah.
And what I teach people in my coachings
and the workshops is
how to get where they want to be.
And the first step is
finding out where they want to be.
Are they simply troubleshooters?
Are they in the organization
just to make sure that the operation
doesn't totally break down? Or are they already at the forefront, establishing kind of a platform engineering type of thing where they need to get more people on board? Or do they need to do some troubleshooting first to get the funding for some wider observability? Because observability is basically,
what I've seen in the last 20 years
is always an overlay function.
It's a cost center.
It's not something that you pay up front for
because you see lots of value.
It's something that you need to pay for
because you just need it to make systems run.
And if you need a, in former times we called it a crit sit, a critical situation, to get the funding for your observability tools, that's something that's already good. But it's better if you get the funding upfront, because you can convince people that they need observability to make good use of, and not overpay in, Kubernetes, and not overpay in the cloud, in the cloud services. Because all of these microservices are nice to deploy: hey, deploy here, deploy there. With monoliths, you already started to throw hardware at performance problems: oh, we ran out of memory? Okay, well, put in more memory. Now, in Kubernetes or in cloud-native deployed applications,
it's kind of the same.
You just don't notice it
because Kubernetes takes care of that for you,
and it can cost you an arm and a leg.
So if you do it right,
and that's also what I teach people,
is you can justify the cost for observability
with the cost reduction you have in the...
How should I say that?
Operational costs?
Yeah, in preventing over-provisioning,
which is something that lots of people do
because, well, let's say GCP does that for you,
or AWS or Microsoft Azure,
they do that for you
as a kind of an
operational security, which is nice,
but, well,
sometimes you pay through your nose
to get that type of security.
And
that's where observability can
help as well, but that's usually step two.
First step is, we do have a massive
problem somewhere and we need fast help
and then, well, where do we get the
data from? And then people look around. And you've probably seen this, Andy, and Brian as well, Andy in the pre-sales space: these companies that want to run a proof of concept just to get your tool to solve a specific problem. And then you start selling.
Andy, I was going to say, it sounds like the secret weapon of the superhero is similar to Batman.
It's the utility belt.
You mentioned having the tool set, the belt, and also the passion.
Now, Batman's a little dark there, but he still has a passion for justice, right? And you mentioned passion, which I really love, because I think a lot of people, when they're treating performance and observability as a requirement, don't always do it so well. It's more: if we need to get something in there, we have it, we check the box. But when you have, A, the right tool belt, and B, the passion to make sure you're doing observability right, the passion to find ways to save money in the cloud or to increase performance and all that, that's when you can really take off and become that superhero, because you're putting everything you have into improving the entire ecosystem, not only for the end users, but for the organization.
And then that's when you end up looking like that superhero.
And you feel like one too, right?
If you don't have that in any job you have, if you don't have that passion for it, you're
just going through the day.
So I think we found the secrets.
I think we found the secrets of being that superhero there.
I want to recap something quickly.
I'm not sure if you noticed, but I always take a lot of notes
when we have these podcasts, and then I use them, A,
to reflect afterwards, but also to write the summary.
But I wanted to go back to what you said earlier.
You said traditionally the need for observability comes in
because people are having a critical situation, a problem.
Then get me any tool that I need to solve the problem.
Cool, problem solved.
Then the next step typically is, okay, how can we justify the costs?
If you want to keep the observability tool for longer, this is typically where we address the cost-saving potential, right? You know, optimizing your systems, combining it with performance testing,
fixing hot spots in your application, right sizing and everything like that.
For me, the last step, and I think you mentioned this in the beginning,
is then, however, how can we change the mindsets of organizations
to think about the potential of observability from the start
so that they actually can become data-driven companies. Because if I'm a new startup,
I guess, and I have a product idea, I typically have assumptions. I have an assumption about how
many people will use my product, how much money will I make? I probably also do experiments with my features that I put out there.
And for monitoring and evaluating my assumptions and run my experiments, I need some type of
data, observability data.
Because we're all digital organizations, ideally, we get this observability data straight from
the digital services.
So are these the three steps, or are there also things in the middle? What's in the middle between "I'm using observability for cost optimization" and "I'm using observability for really making strategic, data-driven decisions"? Are there any other reasons why people would use observability for something in between?
Probably many reasons, but I would say they are niche, and
they all depend
on the data.
And thanks for
reminding me.
I mean, the
biggest problem
in troubleshooting
is not having
data.
And that
means if you
put in
observability
when the
problem
happened, you
are at a loss.
You have to
hope that the
problem happens
again.
So having the
data collected
from the beginning
gives you not only performance indicators
that you won't see in development or testing.
I once gave a talk at a conference
about the different types of performance problems,
and they only reveal themselves usually in production.
And this is what you cannot test
but you need the data.
But you only know what data you need
when you actually have the issues.
So I think that's also one of the things
that people have to realize
that it's a constant learning process.
And that's not, let's say, a step.
It's a learning step.
You have to constantly apply your know-how, your expertise
to collect the right data
and scrap the data that you don't need, and find a good... The American way of saying it is probably: solve the Goldilocks problem, with just the right amount, not too much, not too little. And this is, for me, in between step two and three, but it's not the kind of thing that you can learn and certify. It has to come from the inside, which is really hard to measure.
Yeah.
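A minimal sketch of that "just right" idea, assuming the OpenTelemetry Java SDK: a ratio-based sampler keeps a fixed fraction of traces rather than everything or nothing. The 10% value and the class name are purely illustrative, not anything recommended in the episode.

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

public class GoldilocksSampling {
    public static void main(String[] args) {
        // Keep roughly 10% of traces: head-based sampling as one way to avoid
        // collecting "too much" while still having data when a problem recurs.
        Sampler tenPercent = Sampler.parentBased(Sampler.traceIdRatioBased(0.10));

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .setSampler(tenPercent)
                .build();

        // The provider would then be wired into an OpenTelemetrySdk instance;
        // exporters are omitted here to keep the sketch focused on sampling.
        tracerProvider.close();
    }
}
```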
And coming back to your question about other use cases: security. Observability gives you data about volatile things, about logins, about possible attacks that are happening, because all of a sudden you get a ton of 404 errors on different endpoints and you see there could be an attack in there. So also security; I mean, in our times that's a big, big thing to think about. And a couple more that I just can't come up with off the top of my head at the moment.
Yeah, no worries. In your blog, if I look at, and folks,
all the links, as always, you'll find them
in the description of the blog post,
but observability-heroes.com
definitely gets you to the website.
And then looking at the blog,
I saw that, first of all, you did a good job
with explaining what is observability
from your perspective.
But then the next thing, and I know last week when we were in Barcelona you said, hey, let's spend another hour or two after the meetup and let's sit down with a beer, with a cerveza. And we tried, at least I tried, to do my best in ordering everything in Spanish. We ended up not, uh, not dehydrated; we had plenty of beers.
But I remember you said, hey, this OpenTelemetry thing,
and you mentioned this earlier, this is really something interesting.
And you started your own OpenTelemetry journey.
And you also started now to write a blog about your journey, right?
You started... I'm looking here at "My OTel Journey, Week One". OpenTelemetry, as you said, is hopefully going to be an amazing tool for all of us to really enable developers and organizations to get the data that they need, because they are in control of what type of data gets generated. I think there's also still a little bit of a misconception of what OpenTelemetry really is, and this is where a very educational piece comes in, right? OpenTelemetry is not, like, magic. It's not a magic wand that you sprinkle some stars over and then everything automatically works and you don't need anything else. OpenTelemetry only solves the problem of what type of data do we capture, what type of insights do we want. But you still need to get this data to some endpoint, you need to store it somewhere, you need to analyze it somewhere.
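To make that concrete, here is a hedged sketch using the OpenTelemetry Java SDK: the SDK produces spans and hands them to an OTLP exporter pointed at some collector or backend endpoint (the localhost URL below is a placeholder), but everything beyond that point, storage, querying, visualization, is whatever backend you choose.

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class OtelEndpointSketch {
    public static void main(String[] args) {
        // OTLP exporter: OpenTelemetry's job essentially ends at handing data to this
        // endpoint. "http://localhost:4317" is a placeholder for your collector/backend.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://localhost:4317")
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();

        OpenTelemetry otel = OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .build();

        // Produce one span; storing, querying, and visualizing it is the backend's job.
        Tracer tracer = otel.getTracer("demo");
        Span span = tracer.spanBuilder("checkout").startSpan();
        span.end();

        tracerProvider.close();
    }
}
```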
But open telemetry,
many people are walking through that same journey that you are currently going through.
You're learning this new technology. What are your lessons learned
so far and what would you like people to take away
from your learnings?
Obviously, besides reading your blog series,
what else is there?
OpenTelemetry is a good start.
And as you said, it's all about data gathering.
And that's my biggest learning.
You need to have a good backend to analyze the data.
So getting the data is one thing. That's a big thing.
If we have one common platform that can gather the data in a specific format, and that's why context and, like, the wording, the right nomenclature, is essential, then you can actually
find the stuff in other systems.
But that takes away a big, big, big chunk of work in actually getting the data.
But then you have to store it, you have to analyze it, and this is where it gets tricky.
And after working for four big APM companies where we provided that,
and now I have to actually,
hats off to all the developers who stored that stuff,
who made it easy to analyze,
easy to go through and to query all of that stuff
because this is what's bothering me most in OpenTelemetry.
And it's not what OpenTelemetry is all about.
That's what I realized.
It's not about analyzing the data.
It's about gathering, and then you use whatever tool you need to get your job done.
If you're an operator, you go with Grafana dashboards and you put the metrics in there and get your alerts.
If you're a developer, you want your fine-grained service traces and all of that to dig into how your stuff actually works and maybe make it perform better.
When you are testing QA,
you want to see where are the things
that need to be optimized
or just give it a go to release in production or not.
But that means you also need different visualization.
And this is what OpenTelemetry doesn't help you with.
This is where Dynatrace,
Instana, Honeycomb,
all of the
others out there help you
to do that work. And seriously, I looked it over, and it's open on the internet, so this is no trade secret: the installation of the Instana
backend. When you read the
documentation, how to install it,
you realize that this is a thing consisting of 30 different services
with three databases.
So there's Cassandra, there's Elasticsearch, there's ClickHouse,
there's also, there's no Postgres anymore.
Cockroach was also, well, some SQL database,
but that's only for some auditable stuff.
But all these services that have to work together,
this is so much know-how, so much expertise flowing into it
that this is the really hard part then for you
to implement that in your organization.
And if you're doing your first steps, that's fine.
Take Jaeger, take Zipkin,
take all the others,
take a Grafana or Elastic and get your feet wet.
And if you're fine with
administering that stuff,
if you're fine with administering
an Elasticsearch cluster,
a Clickhouse cluster,
maybe on Kubernetes,
then okay.
But if you're not,
you can still have the OpenTelemetry data.
You don't have to touch that, but you can switch your backend.
And that's what I really love about OpenTelemetry,
is that you don't have to touch that stuff,
because it is all the same format.
It's a format that most of the commercial vendors out there support at the moment.
Some more, some less.
And this is where I would like to see OpenTelemetry go, really.
You get the foundation right.
And then, with trial and error,
you find the tool that you want.
Some like... I mean, take a look at Datadog. There's a lot of stuff that you could do with Datadog, but you have to have the passion for, you know, playing with it and running the queries and putting the dashboards together and all of that. If that's your passion, if that's what you like to do, go with it.
If not, go with Dynatrace,
go with Instana, because they do
a lot of things automatically that
you don't have to do.
And
yeah, that's basically the biggest learning.
And the other learning that I had,
and I currently still have some AWS instances
running with FluentBit,
and that's not OpenTelemetry,
but I'm trying to get syslog data
from two servers that I have here
in our co-working space
to report into a collector,
and I'm miserably failing.
I don't know why,
but it is some finicky work that's going on in OpenTelemetry.
It's a good start, but if you go with one of the vendors,
you can have a positive experience within half an hour.
I always liked, when I was at Instana, that we could set you up in a Kubernetes cluster in five minutes. The agent was discovering everything, and then boom, there you were.
This was just jaw-dropping like
six years ago.
This is not what you're going to have with OpenTelemetry.
Definitely not.
You're going to have a lot of work.
You pretty much understand what this stuff is all about.
But it's a lot of
manual work and
if you want to sell this to your
bosses as a free solution,
it may be
free of license costs,
but it's not free of work.
And it's going to be a lot of work
that you put into it.
So there's a lot of automation already going on.
But still, you have to get familiar
with that. And then,
of course, you have to store the data somewhere.
And this is where also some costs are coming towards you.
I think one of the important differentiations you're making there,
at least if I look at the English definitions of the words,
is the difference between data and information.
Data is your supermarket full of food.
And as you're talking about, you want to maybe go in with your shopping list and buy the food that you need, which is selecting which data you want to collect. But you still don't have the recipe to cook a meal, right? You have to turn that data into information, which is useful, and that's where those backends come in. And yeah, I agree, it'd be awesome if something like OpenTelemetry can start expanding into that section, but that's where that backend comes in, right? And I even forget, you know, not that I forgive, but, like, I understand that OpenTelemetry has not been around for very long, right? Most of us vendors, in terms of figuring out how to make it easy, what data is relevant to collect and all that,
We've all been doing this for years and years and years.
I think the strides that open telemetry has made
in such a short amount of time
is definitely something to marvel at.
But again, it's only been a few years
and once you have a community of people,
everyone putting their input in,
now you have to battle, not battle, but everyone has to agree upon what to do and it gets sticky.
So it will be really, really interesting to see what happens on that side over the coming years, to see it get refined and more complete.
And I believe they're even doing a lot of work on ease of deployments, right? There's some of these automatic,
a lot of the projects have these automatic instrumenters built in.
So yeah, but it really comes down to that,
that information versus data.
And what are you going to do with that data when you have it?
That's the key.
That's actually the,
I have two blog posts that I'm working on right now.
One is called Data vs. Information, or Data and Information.
It's exactly that point.
And the other one is about how easy it is to monitor and how hard it is to find the root cause.
I mean, if we all go back to three-tier architecture,
it would be a lot easier, right?
Yeah, but even then,
even in a microservice architecture,
it's easy to spot the problem.
It's easy to, because there are only two symptoms.
It's either too slow, or it doesn't work at all, throws an error at you.
And, well, bonus: it's slow and throws an error, which everybody is annoyed by.
But those are the things that you alert on, that you monitor, and that's kind of easy.
You have some baselines, you have a threshold, some experience from the past, and you apply it to a service.
This is all done automatically, and then something
turns red.
Now the hard part comes: taking the data that you need to get information about where the root cause of the thing is, and who is supposed to resolve that problem. And once you nail that down... at one point in time, we called it the blame game.
And we actually had a blame game, which was like a spinner.
And then you just flipped or spun the arrow.
And it turned out to be database operators, developers, you know.
And that was as effective as a blame game is.
Because you need data and then you need information out of that.
So monitoring, finding out that you have a problem is easy, but then comes the
need for the information to resolve the issue.
This is the hard part.
And this is where the backends are needed
and the visualizations, according to
your job. And
with the whole DevOps SRE
thing, it's
a different job description that you have there.
You need different data
and especially a different communication culture.
And that's what, during the meetup last week, Almudena said. I was able to follow the Spanish a little bit, because she was very fast. She said, DevOps es cultura, which means DevOps is the culture. It's not a question of the tools that you have, it's a question of how you're dealing with it, and then you pick the right tools. Coming back to the tool belt: if you don't know what tools you need, you throw a Batarang at, I don't know, a fridge to open it. It's, you know, a fool with a tool is still a fool. So it's your way of doing things, the way your organization does things, that is demanding the tools for that. That's why OpenTelemetry opens up a lot of different options, but you have to choose, and you have to choose what is best for your job and the strategy that you're following.
Rainer, I'm just looking at your website again because you are offering different types of boot camps that you have.
I remember you told me about this.
You made it very appealing to fly to Mallorca and then spend some time with you, and I also really like the different models that you have, where you basically also get coaching sessions when you do something like a vacation camp. So folks, if you are listening to this, really check out observability-heroes.com, check out the boot camps. I know a lot of organizations are doing these vacation camps where their teams kind of get together in a remote location.
but also the infrastructure that people need.
You also provide them with your coaching on observability.
My question to you is: if you do these workshops on observability and people come to you, are there any kind of big revelations where you say, man, why are we going down to these basics? Are there certain very basic things that people haven't thought about at all when they come to you around observability? Or do people already come to you because they already know they need observability? Or do some people come to you and are then completely blown away, because they never thought about these things?
It's all of the above.
So there are people
who want to know
about observability in general.
They know that they need something,
but they have no idea
where to start.
And for those, of course,
we go through the basics.
But even the ones that are already, let's say,
senior or advanced observability practitioners,
they sometimes need to widen their scope
because they work with a solution or with their setup for quite some time
and they want to expand it.
And every time I switched my companies,
it was mainly for the reason that there was a new technology coming.
So be it microservices suddenly being implemented, and it was a pain to deploy other solutions because there were so many installations to be done, so something automatic would be nice. It also boils down to what we are doing. For example, we need tracing. Tracing needs instrumentation. Instrumentation is added code in the applications, and it always sounds like, oh yeah, we just instrument it, but you forget that you add code. And if you add code to the wrong part of your code, then you can add massive overhead, and that can happen very quickly. During my Dynatrace times, we killed systems with one wrongly instrumented method. But that was before we had all these automatic ones, so we had to search for stuff; that's all way better now. But when you do custom instrumentation, you should know how it is applied, so you can make a good decision about what to instrument and what not, what to instrument in production and what you do in development, and maybe add a feature flag for your observability to have it turned on only in development and, occasionally, if need be, turn it on on the fly in production. But this is what also long-time practitioners sometimes don't realize: what overhead actually means. Because it's multifaceted. Using additional CPU and adding to the response time is one thing,
but having shared resource contention, deadlocks, and other things
is also something that you could easily do when you apply it the wrong way.
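As a rough sketch of that idea (custom instrumentation is code you add, so guard it), assuming the OpenTelemetry Java API and a hypothetical tracing flag: the manual span is only created when the flag is on, so a hot method does not pay instrumentation cost unless you deliberately turn it on.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class GuardedInstrumentation {
    // Hypothetical feature flag; in a real system this would come from your
    // configuration or feature-flag service rather than a system property.
    private static final boolean TRACING_ENABLED =
            Boolean.getBoolean("demo.tracing.enabled");

    private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("demo");

    static int priceItem(int quantity) {
        if (!TRACING_ENABLED) {
            return quantity * 7; // hot path: no spans, no extra allocations
        }
        Span span = TRACER.spanBuilder("priceItem").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            return quantity * 7;
        } finally {
            span.end(); // every span is work: CPU, memory, and export traffic
        }
    }

    public static void main(String[] args) {
        System.out.println(priceItem(3));
    }
}
```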
So these are the things that we talk about,
and I want to be sure that people understand that,
because then they can make the right decisions going further.
So if I talk about the basics,
I make sure that people have a need
or, well, should know about them
and just checking where they are,
which is something that I learned from my hotline days
when people came to me with problems.
I first established a baseline
of what their level is. Is this the grandma who wants to send an email to their grandchild over in Europe, which I actually had? Or is this someone very versed in technology who desperately needs to get online in a very remote country somewhere, and his or her modem didn't work?
Different levels.
I ask different questions.
And that's what I check first.
Where are we?
And then we move on from there.
And as you said, the vacation camp, that's one week in our co-working space with very good connections, so you can do your regular work and you can focus on observability, if you can't do this in your office at home because people come in; you know, a regular day's work is a lot of people coming in and asking how you are doing. Here you can focus on your project and you can ask me. That's part of the offering: there are some coaching hours in there, or mentoring
hours. You can ask me about specific
problems. We develop a strategy together
or I'll
give them something
to do. By the way, I'm
also financing the
AWS
instances needed for that.
So if you need a big server to test something out,
not a problem, that's included.
And then we check where everybody is
so that at the end of the week,
they know either the basic concepts of observability
or they have a good idea about the strategy
that they're going to implement
when they're back in their office.
Hey Brian,
what do you do next week? Should we sign up for
a location camp? Oh yeah,
absolutely, please.
Are you based on the East Coast,
Brian? No, I'm in Denver.
I used to be in New Jersey, but I'm now in Denver, Colorado, smack in the middle. Well, not smack in the middle, but you know, no ocean near me. I used to have an ocean near me.
I've been there, back in the Wily times, because my boss was residing there; he actually came from Boulder, but he showed us around Denver. Nice town, but very far from the sea. There is a direct flight from New York to Palma de Mallorca, but not from Denver, unfortunately.
Yeah, I think the closest I got to the islands was Tarifa and Nerja, but never the islands themselves.
Yeah, folks, listeners, if you think this is a commercial for Mallorca, well, it sounds like it, and it's a really beautiful place. We will add some links to the description.
Rainer, what you said earlier was really good
because, you know, we have great tools.
What's the saying?
With great power comes great responsibility.
With the power of instrumenting code comes great responsibility,
and we as an industry, we have, like in your case, 20 plus years of experience
of what it takes to bring down an application with bad instrumentation.
In the Appmon days, we called it the shotgun instrumentation.
Or even if you are instrumenting a method that is called a million times,
then of course, the relative overhead becomes really big.
And then you can really bring down systems.
And I also feel that it needs more education.
Fortunately, the tools get better and I'm pretty sure there will be more built-in features
in OpenTelemetry, more tools for developers that actually make them aware of the potential
impact.
But folks, if you're listening, think about it: this is code, right? It only collects a little bit, but it gets executed every time this method gets executed, and this data is then sent off to your OpenTelemetry endpoint or your OpenTelemetry collector. There is additional overhead in that dimension too: the overhead is not just in the running thread, it's on the network, or first in memory, because the data needs to be buffered and then sent over to the next component. This might be a collector or directly the backend, but it means network overhead. So yeah, think about it, there are a lot of things here, and there are reasons why organizations like you mentioned, right, the New Relics, the Datadogs, the Instanas, the Dynatraces of the world, have been doing this for many years and have all these lessons learned. This is where I think we're doing a good job, and you should still look at what we've done. Maybe we should, you know, uncover and, like, repost some of our old blog posts from back then, because I'm pretty sure there are many blog posts we wrote about, you know, proper instrumentation best practices and worst practices that people should look into again.
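A small sketch of the buffering just mentioned, assuming the OpenTelemetry Java SDK: a BatchSpanProcessor holds spans in an in-memory queue and ships them to the exporter in batches, so memory (queue size) and network (batch size, schedule) are knobs you are implicitly turning. The numbers and endpoint below are illustrative placeholders, not recommendations.

```java
import java.time.Duration;

import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class BufferingSketch {
    public static void main(String[] args) {
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://localhost:4317") // placeholder endpoint
                .build();

        // Spans are buffered in memory first, then sent over the network in batches.
        BatchSpanProcessor processor = BatchSpanProcessor.builder(exporter)
                .setMaxQueueSize(2048)                   // memory overhead lives here
                .setMaxExportBatchSize(512)              // network payload per export call
                .setScheduleDelay(Duration.ofSeconds(5)) // how often data leaves the JVM
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(processor)
                .build();

        tracerProvider.close(); // flushes whatever is still buffered
    }
}
```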
Yeah, absolutely.
And instrumentation is one thing
that can kill an application.
If you kill it once,
then your management is going to be very wary
of letting you do this again.
But as you also said,
with automation,
there is lots of things
that you can just get out of the box and
you don't mess with, which is usually safe. But going back to blog posts, that's exactly what I did after our meeting, and there are some blog posts from me from 2010 on the codecentric blog, with the company that I worked for at that time, about having no time for monitoring. So if you don't invest time in monitoring, then you probably spend more time somewhere else. But automation also helps a lot in saving that time, getting it up and running in no time,
and then another one was actually not from me,
but from Fabian, who gave a very good overview of sampling and tracing, or rather profiling and tracing, and the different overhead that is incurred there.
And with some practical examples,
but it's 14 years old, so
it's probably on an old Java version that
probably has that already optimized out.
But I read today an article about memory consumption in OpenTelemetry in Java, where they have
a new module that allows you to reuse memory.
And it's not an immutable object anymore, but a mutable object, and that reduces the memory impact of the agent, or of the instrumentation, to almost zero, which is massive when you have lots of, like, Kafka or Apache Pulsar, with lots of topics. And each topic created one kind of metric, and that created a ton of objects in memory, which then kind of blew up the JVMs when you monitored them. And now, with this new approach, which is officially accepted, I think, by OpenTelemetry in the Java SDK, it's basically zero and this is gone.
But this is something that you wouldn't realize until you have that problem,
until you see what is happening here.
Why is my JVM blowing up?
I didn't do that before.
Well, it's just a multiplication of things with the topics
and the metrics that you get out of it.
This is just one of the things that make open telemetry
kind of that complex.
On the other hand,
if you don't deal with that,
if you don't have
queueing mechanisms,
messaging systems
with lots of topics,
then you don't need to bother.
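A hedged illustration of how that multiplication happens, using the OpenTelemetry Java metrics API with a hypothetical topic label: one counter instrument, but every distinct topic value becomes its own series, so thousands of topics mean thousands of attribute sets living in memory.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class TopicCardinalitySketch {
    // Hypothetical attribute name; real code would follow the semantic conventions.
    private static final AttributeKey<String> TOPIC = AttributeKey.stringKey("messaging.topic");

    public static void main(String[] args) {
        Meter meter = GlobalOpenTelemetry.getMeter("demo");
        LongCounter messages = meter.counterBuilder("messages.consumed").build();

        // One instrument, but each distinct topic value is a separate series:
        // with thousands of topics this is thousands of attribute sets kept in memory.
        for (int i = 0; i < 5_000; i++) {
            Attributes perTopic = Attributes.of(TOPIC, "orders-" + i);
            messages.add(1, perTopic);
        }
    }
}
```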
This is what we can talk about.
My approach was always
we start with your problem at hand, we solve that, and we take the next step. We're not boiling the ocean up front if we don't have to, and we get the results quickly in that week, make you realize what you actually need, get you successful with that, and then take the next step. And even then, with that approach, you can get the most complex situation handled and, yeah, easy to deal with.
Hey Andy, I think I have a title for this episode: Observing of Observing Observable... I can't even say it. Observing Observability.
Right. One quick anecdote I wanted to mention about the instrumentation.
This happened, I think, 2012.
I'm not going to mention the vendor or the customer.
But sometimes when observability takes down your system, even the most best designed ones,
it could be revealing a core problem.
So we went to a potential customer.
They were complaining that login was taking like six to eight seconds.
It was a really long login.
It was a commerce kind of site.
So we put Dynatrace on the system, and login went up to 15 seconds.
And we're like, what the heck is going on?
And this was with auto-instrumentation already done.
This was beyond that.
Well, it turned out that they were running upwards of 14,000 database queries upon login.
So like any observability tool, you're going to be instrumenting the execute calls or whatever in the database driver to just track that bit, which is a standard, normal thing.
I just remember it was really funny because they had a guy from the software vendor there. He was
like, oh, I don't know. But we're like, yeah, this is the problem. You're running 14,000 queries.
Of course we killed it. No matter what we do, that's just like, talk about instrumenting the
wrong thing. That was the right thing, but because of their setup,
it took it down. So we then tried to play that game of, well, not the game, but like, look, we didn't take your system down. Well, we did take your system down, but it's revealing what
your problem is by doing that. But yeah, it can be really, really tricky in those situations.
And this just reminded me of another anecdote, just a quick one, where a vendor actually blocked our agent from working.
So the customer was actually a Silk Performer customer.
So guess who was in there for monitoring?
And they told us it doesn't work.
Let's put an agent in.
Let's put at that time Dynatrace in.
We did.
And we found something.
And they went back to the vendor saying,
yeah, this doesn't look very good.
Can you solve that?
And the vendor came back,
no reverse engineering in our code.
It's forbidden.
And they were like, no, no, no, no.
You don't.
What?
The next time we tried, they gave us a patch
and then we tried to put the agent in it.
We got no results at all.
And we found out that they actually stripped the -agentpath variable from the JVM options. So their code was rewritten to prevent any monitoring of the Java process.
It was like, how desperate as a vendor must you be to do that?
That's amazing.
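A small, hedged sketch of how you might verify something like that from inside the JVM, using standard JMX APIs: list the JVM's actual input arguments and check whether the -javaagent or -agentpath flags you configured actually survived.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class AgentFlagCheck {
    public static void main(String[] args) {
        // The JVM's actual startup arguments, after any wrapper scripts had their say.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        boolean agentPresent = jvmArgs.stream()
                .anyMatch(a -> a.startsWith("-javaagent:") || a.startsWith("-agentpath:"));

        System.out.println("JVM args: " + jvmArgs);
        System.out.println(agentPresent
                ? "Monitoring agent flag is present."
                : "No -javaagent/-agentpath flag reached the JVM (it may have been stripped).");
    }
}
```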
And it also reminds me, and this comes back to the years and years of experience and how commercial vendors have optimized their systems: the database example, these problems that we then saw by capturing every single execution of these 14,000 database queries, led to what we called database aggregation back then, where instead of capturing 14,000 instances of the same query, we capture it maybe just once.
And these are things that I hope
will also make it into every
other instrumentation for open
telemetry or make it to the
blogs about best practices
because these are things we need to avoid; we've run into these problems before
and we've solved them before.
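As a sketch of the aggregation idea, not any vendor's actual implementation: instead of recording 14,000 individual query executions, keep one counter and duration sum per normalized statement and report the aggregates once. The normalization here is deliberately naive.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class QueryAggregationSketch {
    // One entry per normalized statement instead of one record per execution.
    private static final Map<String, LongAdder> CALL_COUNTS = new ConcurrentHashMap<>();
    private static final Map<String, LongAdder> TOTAL_MICROS = new ConcurrentHashMap<>();

    static void recordQuery(String sql, long micros) {
        String key = normalize(sql);
        CALL_COUNTS.computeIfAbsent(key, k -> new LongAdder()).increment();
        TOTAL_MICROS.computeIfAbsent(key, k -> new LongAdder()).add(micros);
    }

    // Deliberately naive normalization: strip literal values so repeated
    // executions of the "same" query collapse into one key.
    static String normalize(String sql) {
        return sql.replaceAll("'[^']*'", "?").replaceAll("\\d+", "?");
    }

    public static void main(String[] args) {
        for (int userId = 0; userId < 14_000; userId++) {
            recordQuery("SELECT * FROM prefs WHERE user_id = " + userId, 120);
        }
        // One aggregate line instead of 14,000 captured executions.
        CALL_COUNTS.forEach((sql, count) ->
                System.out.println(sql + " -> " + count.sum() + " calls, "
                        + TOTAL_MICROS.get(sql).sum() + " us total"));
    }
}
```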
Yep, we did. And this is something
that needs to go into open telemetry
and the data gathering, absolutely.
Rainer,
I fear we're closing
at the top of the hour here from a recording
perspective.
I hope we did. I think
we did a good job in promoting
Mallorca and your
co-working space, your
boot camps.
I also offer that online, if you're not getting the vacation approved by anyone.
No, the thing is
my goal is to make observability available
for everyone to take it to the next level.
So we cover the ground stuff
and then we care about what a former colleague of mine called the really big, hairy, audacious problems. So
get rid
of the easy to solve issues and then
focus on the ones that are really tough to solve
because it's much more fun.
But having observability in all of the
places. So I
have my experience. I can share them.
Of course, yeah, I can
be hired. I can be bought, so to say.
But it's also good to talk with you guys about how,
because I was always seeing it through the same pair of glasses,
as the Germans say, always from a vendor perspective.
And now I'm getting this from the technology perspective, and this opens up a whole different perspective
on the whole thing.
And unfortunately, after 20 years, I reach the same conclusion over and over again: you need data, you need stability, and you need the right data,
and you need to turn it into information
that solves your problems,
whatever those may be in your role.
It's still not in the mainstream.
It's slowly getting there, but
it's still not really there.
This is my mission: to get funding for observability from the ground
up. Not as an overlay,
not as a, oh god, this doesn't
work, we have to throw money at it to make it go
away, or make the problem go away.
So, yeah. And, yeah, thanks
for inviting me here.
And, as
stated, there is a
CNCF meetup in Palma next
week. And also, there
is the Web Engineering Unconference
in September, also happening
in Palma. This is always a nice
event. It evolved from the
PHP
unconference
to now
everything
around web
engineering.
It's a
pretty cool
event taking
place in one
of the hotels
here.
So the
whole island
is really
good for IT
people.
We have a
great connection.
Enough of the
commercial here.
Thanks again.
That's good.
And talking about
events, maybe one
last thing because
this is a global
thing.
In the first week
of June, I think
it's actually June
6th.
I think this one,
oh, you're talking
about an event.
I was going to
say I think this
one airs next
week.
No, what I'm
saying is that
we have,
Kubernetes is turning 10 years.
So we have Kubernetes birthday party globally.
The main party will happen on the West Coast.
But every CNCF, every cloud native local community
is encouraged to run their own birthday bashes.
I know in Barcelona, our friends from Lidl, from Schwarz.it,
they are doing the birthday bash in Barcelona
we are hosting a party in Vienna in the Dynatrace office and I'm sure many places around the world
so Kubernetes is 10 years old it's no longer a baby it's already a teenager it's amazing
Oh my God, it's going to be a rampaging teenager. Yeah, that's the question.
puberty is coming.
Oh my God.
Acne.
I think at the parties,
you're going to have to have people walking around
trading hors d'oeuvres and seeing if you can get
the hors d'oeuvre to the right end destination point
by passing it from person to person.
You mean you have to order your hors d'oeuvre
with a YAML file?
Yeah.
Okay, great.
Well, I hope people, if you ever wanted to get to Mallorca,
I hope they get there really soon,
because I think as a result of this episode,
Mallorca is probably going to be the hot spring break type of destination
for observability people.
So before it gets, you know, overrun by technologists, if you want to see it nice, get there soon. Of course, your house is always open to anybody who shows up. If you just, you know, give us your address on the episode, people can just freely walk in. I'm kidding.
When you take a look at the picture at the beginning, you see our
roll-up in the background and that
is our co-working space.
And if people Google that, they'll
find my home address.
We have locks and, you know...
And big dogs.
Actually, just a cat, but
yeah. He's fierce.
Oh, he's fierce.
cats can be scarier
than dogs
because you never know
what they're going to do
or what they're thinking
Alright, Andy, any last words, wrap-up?
Alright, all good.
I'm really happy that
after so many years
we met each other again
and then
that this worked out
that well
so
all the best with
your, with your ops, with making everybody an observability hero and making observability available to everyone. I think that's a really nice statement. And if you can say observability three times in a row without twisting your tongue, you get a beer from me.
Observability, observability, observability. Next one is on me.
He's been practicing.
Alright, well thank you for being on the show today.
We really, really, really, really appreciate it.
This was an awesome show.
I just want to say thanks to all of our listeners again.
And as we were discussing before we started,
I think we're like somewhere over eight years on the show.
So thanks for everyone who's made that all possible.
And thanks to wonderful guests like you.
I'm going to try it.
Rainer?
Rainer?
Rainer is perfect.
Rainer, there you go.
All right.
I've been having trouble with that name, people who are listening.
It doesn't seem like you should, but anyway, it's just me.
I'm a stupid American. Thank you all for listening. Thanks, everyone, for being part of this, and we will see you next episode.
Bye-bye.