PurePerformance - Observability that is Battle tested by Millions with Marco Sussitz and Wolfgang Ziegler
Episode Date: August 12, 2024
When your code runs on more than 6 million systems - many of them business critical - then this is really exciting news for Marco and Wolfgang, Dynatrace OneAgent Java Team members. Their code powers auto-instrumentation and collection of all observability signals of Java-based applications running on every possible stack: containers in k8s, serverless, VMs, on your workstation or even the mainframe.
Tune in as we sat down with Marco and Wolfgang to learn what it means to continuously innovate on agent-based instrumentation with 160+ other engineers across the globe that also focus on OneAgent. They share insights on how they develop their observability code, how they continuously test across all supported environments, what the processes at Dynatrace look like to avoid situations like the recent CrowdStrike outage, and how they integrate and collaborate with other communities and tools such as OpenTelemetry!
Things we discussed during the episode:
Dynatrace OneAgent: https://www.dynatrace.com/platform/oneagent/
Dynatrace for Java: https://www.dynatrace.com/technologies/java-monitoring/
OpenTelemetry and Dynatrace: https://docs.dynatrace.com/docs/extend-dynatrace/opentelemetry
Jobs at Dynatrace: https://careers.dynatrace.com/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always I have with me my fantastic co-host Andy Grabner.
How are you doing today Andy?
I'm very good and I'm so happy that you're back because I had to record the last episodes without you.
Yeah.
You don't even know about it because you just came back from vacation and don't even know what I did while you were gone.
Well, I saw there was some back and forth on the thread. So I was like, oh, I should be getting some links from Andy pretty soon. But I have not
heard it yet. I don't know what the topic is. It'll be as much a surprise to me as it is to
the listener. But I guess I'll get to hear it first before then. But before we jump into the
topic, I hope you had an enjoyable vacation even though
you came back with something that you probably didn't want to come back with.
I thought we weren't going to talk about my, you know, shore leave. Now I'm kidding, I was trying to make a sailor joke, what sailors come back from leave with, if anyone gets that reference. Yeah, no, I went to New York with my daughter, met some friends there, but then when I was taking some photographs I slipped and broke my collarbone. So I'm in a good amount of pain, but I'm here for you, for the podcast and for all of our listeners, because I could never ever let our community down.
And I'm pretty sure by now our guests wonder
when do they finally stop talking and let us talk.
And I think now is actually the moment.
With us today, two guests, Wolfgang and Marco.
And I would actually start in this case with Marco
because your first name comes earlier in the alphabet.
So Marco, maybe you can just quickly introduce yourself,
who you are, what you do. We all work at Dynatrace, so this is a special episode, but I would
like to know what you do at Dynatrace and also what brought you to Dynatrace.
Yeah, of course. So I work as a Java agent developer in our agent team, for about a year now. Before that, I was just a
normal Java developer. I worked at a video encoding company and did a little bit of
C++ and some Java, and then I switched to Dynatrace. I think one of the
things that made Dynatrace very interesting to me was the agent
team, of course. I like the dynamic nature of Java, and I like being able to do certain things with
the language, like the runtime code manipulation that we do. Those are things that interest me a
lot, and that was especially why I wanted to join Dynatrace.
Very cool. I like the way you phrased it. You said you're on the OneAgent development team for Java, but you used to be a regular Java developer. I thought that was an interesting way to phrase it.
But yeah, really cool.
Then to our next guest, Wolfgang.
How about you?
Yeah. Hi, everyone. My name is Wolfgang.
And Brian, really sorry to hear about your accident.
I went through the same thing a couple of years ago
with a snowboarding accident,
so I can really relate to the pain of breaking a collarbone.
Yeah, thanks.
Yeah.
Yeah, so my name is Wolfgang.
I'm a team captain on one of the Dynatrace OneAgent teams,
actually on one of the Dynatrace Java OneAgent teams.
So we have more than one Java OneAgent team now,
but I also did and still do regular development work there.
And where I'm coming from,
it's almost embarrassing when you talk about
like your career or experience in the industry
in terms of decades,
but yeah, that's what I've reached by now.
So I'm looking back on actually exactly two decades of being in the software industry.
So I graduated here in Linz in 2004.
And from my regular office, I almost have a direct line of sight to my former university.
And yeah, back then things were different.
So we were still, there were no apps like, I don't know,
iPhones or something like that. And not everything was on the web. So we were really, we were still writing Windows applications.
And I was starting as a .NET developer doing Windows Forms
for Austrian e-government software.
I did this for a couple of years and then I landed at a company which back then was still quite famous for its name: Borland.
I think Andy also spent some time there.
So it was not the Borland some of us knew back then, so no compilers, no IDEs,
but mainly testing software. So we had a
functional testing product. We had a load testing product,
test management software. And I was working on the load testing
product, on Silk Performer. I was actually listening to some of the back
episodes here, and Ernst, of course, our chief architect at Dynatrace, worked on it
and mentioned it.
But also, what was her name?
Modena, I think.
She said she was using the product.
So this really brought back very warm memories of the product I've been working on for quite
a while.
Almost 10 years, actually.
But then things went into a downward spiral there at Borland.
It has been acquired by a company called Micro Focus in the meantime.
And it was time for a change.
And since I've known Dynatrace already,
so we had some plugins and integrations with them.
It was always a company.
I knew it was interesting.
It was roughly the same domain as we were in with load testing.
And yeah, that's when I switched jobs to Dynatrace
and initially started there as a.NET developer working on the agent.
And after a year or so, I figured it was time to take on more responsibility and took over as team captain of the SDK team.
And then we figured out, well, SDKs aren't what's really looked for in the industry anymore.
Open telemetry started developing.
So we moved in that direction.
So we were jumping around quite a bit, technology-wise.
And yeah, that's where I then spent a lot of time, in SDK and OpenTelemetry work. And then
I did something else for a short time. And yeah, now I'm in the Java team and yeah, being team
captain of that team.
Yeah, I think both Brian and I were very happy to hear that we have similar backgrounds, especially the load testing background. When were you at Borland? Do you remember which years?
It was, let me think, 2008 until 2017. Yeah, so almost 10 years, as I said.
Yeah. Both of you, thank you so much for the introduction.
So the episode today, we typically talk about what people do with observability. We had a lot
of different talks about obviously performance engineering, how observability data helps. Like
the episode that you mentioned with Almodena, that was interesting. We also talked about things like how people bake observability into their platform engineering.
Hardly ever do we talk about what actually has to be done to get the observability data.
And that's why I think it's great to get a little bit of a glimpse into what actually happens within Dynatrace, as one of the vendors out there, and how we are building agents and
also ensuring that our agents not only deliver the data that
our users need, but also do it in a secure and a reliable
way. And the reason why I say secure and reliable,
just at the time of the recording, just a couple of days ago, the world was
struck by CrowdStrike,
by that big incident.
And some would ask the question,
how does this work in Dynatrace?
Because we are building an agent
that gets installed on thousands of machines out there.
And we have a lot of privileges on some of these machines
because we're capturing a lot of data.
How does software engineering work within Dynatrace so that we ensure that our agent,
that your agent, that the two of you are producing, is not causing the next
CrowdStrike incident? And I don't know who
wants to take it first, but I would just be interested in hearing some thoughts because
I want to just learn. What do we do for software quality within the agent?
Mind if I take this, Marco?
Of course, I think you know more than I do.
Well, we both know a lot about it, I guess, because, I mean, testing is key, right?
So this is really, I want to say more than 50% of the daily business of a developer here at Dynatrace.
So as you said, something like CrowdStrike, of course,
is the absolute worst case scenario that could happen.
But on a smaller scale, if we did something really wrong,
we could cause similar issues if we made a big mistake. So currently, for example, the Java agent runs on about 6 million installations.
So if you have a crucial bug that would prevent JVM from booting,
you have a similar scenario to CrowdStrike.
I mean, you don't take down the whole machine,
but you take down a JVM, which is, in the end,
for a customer who's running their
business transactions or business load, it's the same result. So to prevent that from happening, as I said, testing is key, and this is what we take really seriously. Every day we have
something like a hundred thousand or so instances of test cases
that are running.
I say instances here
because it's not like individual tests,
but if you run tests in different permutations
like operating systems,
operating system flavors,
even like 32-bit and 64-bit
and Linux and Windows
and esoteric platforms like Solaris and AIX.
And then you want to run different versions
of the product you're supporting.
And you can easily see how this multiplies
into a huge matrix of test permutations.
And that's what you're doing on a daily basis.
So, of course, this cannot run
with each and every product build, but it runs on its own cadence
at least daily. And then we take our time to look at these test results. So it's a fixed
schedule, a fixed part of our weekly work, where we sit together, analyze the tests that may have failed, and look at the results.
But the bottom line is, if something really terrible were to happen, we would see it immediately.
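As a rough illustration of how such a permutation matrix multiplies out, here is a minimal parameterized JUnit 5 sketch; the dimensions, values and class names are made up for illustration and are not Dynatrace's actual test framework:

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

import java.util.List;
import java.util.stream.Stream;

class AgentMatrixTestSketch {

    // Illustrative dimensions only; a real matrix also covers OS flavors,
    // 32/64-bit builds, and platforms like Solaris or AIX.
    static final List<String> OPERATING_SYSTEMS = List.of("linux", "windows", "aix");
    static final List<String> ARCHITECTURES = List.of("x86_64", "aarch64");
    static final List<String> JVM_VERSIONS = List.of("8", "11", "17", "21");

    // Cartesian product of all dimensions: 3 x 2 x 4 = 24 permutations
    // for this single test alone.
    static Stream<Arguments> permutations() {
        return OPERATING_SYSTEMS.stream().flatMap(os ->
                ARCHITECTURES.stream().flatMap(arch ->
                        JVM_VERSIONS.stream().map(jvm -> Arguments.of(os, arch, jvm))));
    }

    @ParameterizedTest
    @MethodSource("permutations")
    void agentInstrumentsSampleApp(String os, String arch, String jvm) {
        // A real test would start a sample application on the given stack,
        // attach the agent, and assert that the expected traces show up.
    }
}

Three operating systems, two architectures and four JVM versions already yield 24 runs for this one test, which is why the full matrix runs on its own cadence rather than with every build.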
Yeah, I remember some time ago in our weekly meeting, one of the tests that I had written failed on a really obscure machine, I think some IBM machine or something, and I was really glad that somebody else immediately knew what was up, because I think figuring that out would have taken me a couple of days at least.
Exactly. That's really helpful, because of course
every developer does their due diligence
and runs the tests in the local environment and maybe sometimes even in a virtual machine
you have on your computer or like WSL or something like that.
But no one can run a Solaris test in there, because you just don't have the hardware,
right, or the operating system.
So you are dependent on the CI and this is where we really
have the broad coverage that you need. Marco, maybe one additional question for you on this,
because you said you just recently joined Dynatrace, quote-unquote recently, like you've
been here a year. And you said that when you had this test scenario that failed, you had other people that could jump in, because they have probably been with the product for longer.
I remember when I started, it was like 16 and a half years ago.
Maybe just to hear it from you: how many of your colleagues do you have in the team
that have been there for so many years
that you can then actually also learn from
or ask for advice because it's impossible for somebody
that just starts and has only a year under his belt
to know all this?
I think we are quite an old team, correct me if I'm wrong, Wolfgang.
But... old in terms of experience.
Experienced, not old.
So we have a lot of very experienced people that help a lot.
I remember when I started, I came from a smaller startup of 115 people.
And there the scales are completely different.
And also the amount of testing we do, of course, was something I had never
seen before at that scale. Back then I think we had a few thousand
end-to-end tests, but now it's a lot more.
Now I need to ask you one question because I used to be a developer myself
in the early days and testing was not the thing that got me excited
in the morning when I got up and started working, just to be honest with you, right? I mean, I was
obviously often like you working for a testing company. So it was obviously in our DNA, but still
it was also not something that I learned in my education. Test-driven development back then was,
I think, something that we didn't
really do. And this was like 25, 30 years ago when I got educated. Marco, as you said,
how is it now for an engineer? Do you get, with all this testing that happens, how can you still
focus and get excited about creating new things? Or what's the ratio? What does a day look like?
It really depends on what I'm working on.
So usually we work on one feature at a time.
And then, of course, a huge part of that feature is spent with testing.
But for me, usually the workflow is to create some very simple test case to get
the first workflow running, to see how I'm doing or what
needs to be changed, and then cover the rest of the testing at the end. So for me it's usually
a bit at the beginning and at the end of the development, and in between I
don't do that much.
I think what also makes it
easy, or natural, for an agent developer
to write a lot of tests
is that it's most often the only way to actually see
if your code is really working and doing what it's supposed to do.
Back then, as I said, it was easy. You had a UI application and you saw the results
and you had something to interact with.
When you're writing agent code,
you're basically placing your code in a customer's application
and you need some kind of application, right,
that you can instrument and run the agent code in.
And then you need to verify in some way
that your code is doing what it's supposed to
do, like extracting data and tracing data. So you need to write a test in the first place to verify
that. So there's no additional burden to writing tests
in that sense. And what also helps is that we have a really good framework that makes it easy
to run things in containers or things like that.
So there's no hassle, no additional setup needed.
You really want to lower the barrier for everyone to write good tests.
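As a rough sketch of the general idea of such container-based end-to-end tests, here is what one might look like using the open-source Testcontainers library; the image name, port and endpoint are hypothetical, and this is not the internal framework being described:

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ContainerizedAgentTestSketch {

    public static void main(String[] args) throws Exception {
        // Spin up a throwaway container with a sample service to instrument
        // (image name is hypothetical, purely for illustration).
        try (GenericContainer<?> app =
                     new GenericContainer<>(DockerImageName.parse("sample-java-service:latest"))) {
            app.withExposedPorts(8080);
            app.start();

            String url = "http://" + app.getHost() + ":" + app.getMappedPort(8080) + "/hello";

            // Send one request through the (instrumented) service ...
            HttpClient client = HttpClient.newHttpClient();
            HttpResponse<String> response = client.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println("status = " + response.statusCode());

            // ... a real end-to-end test would now assert that the agent
            // produced the expected traces or metrics for that request.
        }
    }
}

The point of such a framework is exactly what is described here: the container lifecycle and setup are handled for you, so writing one more test stays cheap.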
I think that's the next thing I noticed when joining
Dynatrace, a bigger company: how developed the testing infrastructure is. Back then,
one team was doing everything, but nowadays at Dynatrace we have these really developed
processes, these really developed frameworks that already do a lot for you.
And I think, Brian,
right, we both are really happy with the output and the level of quality you guys are
producing. Because so many times, and Brian, you even more so in your line of work on the sales
side, right, when you do a PoC, a proof of concept, you have the magic OneAgent that we install on very critical systems.
And obviously, if people don't know the magic of the OneAgent and the level of
quality it has, they might be frightened to install it on a production environment.
Oh, yeah.
I mean, that's, I think everyone's first question is, well, not as much these days, right?
I think these days people are a lot more used to
observability through agents. But going back even not too long ago, it was always, well, is this going to break my system? Is this going to break my system?
And when you stop and think about it, it's a miracle that it doesn't. But it's because of that testing. And I've been here with Dynatrace since 2011, and I can barely think of any time that the agent had some core functionality bug in it that would really do that. That makes it easier for us because we know,
and it's one of the things we always say too,
just so I don't know if you guys ever hear this,
it's like 99.99% of the time,
if there's a problem when we deploy in someone's environment,
it's usually because of some wonky thing in their environment. It's not because of our agent.
They're doing something quote-unquote illegal.
And that's just really comforting.
Yeah.
Yeah, and also thanks, Wolfgang, for the number that you mentioned earlier: 6 million installations of our Java agent alone.
That's also a cool statement for,
I guess, even a quote-unquote
young developer like Marco:
knowing that your new features
potentially get used by 6 million
different instances of the OneAgent that is observing business-critical Java-based
applications. That's a really cool thing. I actually wanted to follow up on this because
if you think about it and think about the responsibility that comes with that, so
it's easy to say, yeah, maybe we have this little bug that only affects maybe 0.1% of our customers. But what does this mean? 6 million installations is a lot of support cases, right? So you can't even have that kind of uncertainty in your code, saying, yeah, maybe this affects someone,
but not all of them.
You should always be trying to make sure
that no one is affected, right?
So if you develop software for a smaller audience,
maybe you can live with such a percentage.
In our case, it would mean we get flooded
by support cases.
Yeah, and by now,
we've kind of
seen our product and our
company grow, and also the customers
that we have and where they install it.
So we really run on business-critical apps.
If they don't work anymore because
of our mistake, then
something at the magnitude of CrowdStrike
could almost happen, right? So if you think
about airlines,
e-commerce sites,
insurance companies, governments,
they all use our software,
your software that you guys are putting out there.
And just phenomenal.
I wanted to touch upon that point
one more time because early on
when Wolfgang was explaining
some of the testing,
I got really concerned for a minute
until I thought it through some more, right?
The idea that you can't test
on like every single OS,
there's all this whole plethora,
there's a whole matrix of OSs or whatever.
And that got me concerned,
which my first concern,
which I still think is somewhat valid,
is the quest for getting software out fast
versus the test,
the compromises we have to make in complete testing.
Right. You know, most of us probably come from, I don't know about you, Marco, but from waterfall, right:
deployment where it had to be fully tested before anything.
And that was a compliment to you, Marco, because you look younger.
So that leaves like these gaps. Right.
And you said maybe it's running every day, but it's not running with every release because
maybe you have a release every two hours.
So that made me think, well, that's exactly how something like CrowdStrike could
get out.
Right.
But when I thought about it again, the reality is any of the regular testing that's going to go on, the release testing, is going to be hitting the most common pieces, the most common platforms, because it's going to be run on what we're targeting.
So CrowdStrike happened, at least I think, because I have no idea, but that was like every single OS instance.
It was almost like it was never deployed onto any machine, right?
Whereas in this case, you're at least hitting the most major ones.
It's all those edges and all those, the complete matrix that you're not doing.
So even if you skip that full matrix, if you don't do the full OS test on every release,
you at least have your major ones covered just by your general pipeline testing, because that's what it's going to run on. So that at least counters the idea
that it's a huge risk. There's a risk, but it's more for the edge-type stuff.
And then to your point, if that edge is 5%, that's still quite a lot
if you take a look at how many people are putting it on. But to create a CrowdStrike level event,
that kind of thing isn't going to happen if you're doing this sort of thing.
Anyway, it was just an observation because I was mentally going through it thinking,
oh my gosh, this is the end of the world and talking myself off of the ledge of,
no, this covers most cases.
So it's an interesting concept and one that you have to consider too of what coverage do we have
if we're not doing the full suite, right?
And how much do we slow down releases for full coverage versus not?
And I think that's a question
we're going to have to tackle in the future
if more of these kinds of things happen.
Re-evaluating speed of release
versus completeness of testing.
Obviously, you can't have 100% completeness of testing,
otherwise we'd be back where we were.
But have we gone too fast?
And I don't mean us Dynatrace,
but the industry.
Maybe I misspoke before a little bit. So this not covering all the operating systems and flavors of
operating systems, that's what happens on a developer's machine before they set up a pull request. So as soon as it's in the
source code repository
and the tests are running, everything
is covered.
As soon as
a new agent is
ready for
being rolled out into production,
every possible test has been
run. So there's even a separate stage.
We call this the hardening stage. So even though we work in Scrum sprints or iterations,
finishing a sprint doesn't mean the agent is rolled out to customers immediately.
So there is a hardening phase
where we still have the chance to identify
something that might have slipped through
through manual testing.
Or also what's crucial is to have longer running tests, right?
Because the tests I mentioned,
you can't really afford to have a 24-hour test
for every possible permutation there. So you have
to select a few certain scenarios where you do long-running tests and see maybe memory is
accumulating or maybe there is like a weird CPU spike if you run for a longer time. So you also
have to look out for these things, and the hardening stage is the perfect
opportunity to identify those kinds of bugs and, yeah, delay a rollout or fix something before
you roll out to a customer.
And this brings me to another question on this before I want to move on to another topic. Our podcast is called Pure Performance, so it has performance in its name.
And I remember in the very early days of distributed tracing,
agent-based instrumentation of apps,
the number one concern was,
how much overhead do you guys add to my application?
I don't want to install this because then my application gets slower
and then it impacts my business and things like that.
So I guess, with the level of testing you were just describing, with the long-running tests, not only do
we detect whether we have memory leaks or things like this, but we also do intensive
performance testing of our components as well, just seeing what impact a code change has on the OS.
Maybe Marco, you, especially from a developer perspective, I would be interested in how
you deal with this.
Like, are you aware of this as an engineer when you're building a new feature?
Yeah, I think it was actually something I just worked on recently.
So the past week and a half I spent on performance testing for
one of the features that I did, just to make sure that everything is in order. And of course there's
a performance impact to having the agent. I think everyone expects that, but we try to keep it to a
minimum.
Yeah, and I think you're doing a pretty good job at this, because it's phenomenal to see what type of data we collect and how little of an impact it has.
It was just before your time, before both of you joined, we had, as an industry, we had to do a lot of work to prove that agent-based instrumentation is not adding the level of overhead that would actually
negatively impact the app. Or let's put it that way: there is a level of impact,
because as you correctly say, Marco, every instrumentation, whether it's agent-based,
whether you are doing manual instrumentation by writing logs, whether
using OpenTelemetry, whatever it is, you're adding code to the existing
business code and that just by default generates overhead.
But the question is, what's the cost benefit of this?
And you need to capture enough data without having the overhead,
but you need enough data to then be able to see what you need to see.
And on that point, Andy, the consistency of the negligible overhead, not only by Dynatrace, but I think by all
vendors out there, to keep that overhead low has been critical.
And as you mentioned, we always used to hear that question.
We rarely hear that question anymore.
How much overhead is it going to add?
That's not a thought on people's mind because as observability has become
more widespread, it's just been taken for granted. Yeah, it's just going to be a teeny bit for what
we get. But that all comes from that hard work, right? It's because you all didn't fail in that,
that it goes unnoticed. Same thing, I always say, going back to during COVID: well, the vaccine works and people stop getting sick.
People are going to say, well, was it really the vaccine? You know, because people just didn't get sick, right?
But if people continued to get sick and die, then people would say, oh, it didn't
work, right?
So the lack of having incidents from overhead, while less visible, has made it a non-question for
the most part within the industry.
So that's a huge, huge task and a huge, huge accomplishment from you all and from all the
other agent developers too.
And there is no glory in prevention, I think they said.
That's what I was getting at.
Yeah, exactly.
Thanks. And maybe a last sentence to this.
What was interesting,
because observability was always there,
but back then people were maybe just writing log files,
but nobody looked into the overhead of creating a log
because this was just part of software engineering.
And then the observability vendors came in,
or back in the days,
we called ourselves APM.
And then you all of a sudden add something to your app.
And then if things change, then you say, of course, it's the APM product or it's the agent.
But there was already overhead anyway, but nobody talked about it.
Nobody looked into it by just writing logs.
I mean, that's an interesting aspect as well that nobody thought about.
I think you could say that observability is so omnipresent now that back in the day it
was the question, have it or not?
And now it's which vendor or which product to choose because having it is a given.
Marco, one last follow-up for you, because you mentioned that in the last week
and a half you had to do some load and performance testing on your code. Can you just
quickly explain what that looks like for an engineer? What do you do? What testing
tools do you use? What do you look at? What metrics?
Yeah, of course. So the test setup is very similar to our regular end-to-end tests.
So we have our whole framework and everything.
And we already have quite a good setup for the common performance tests that we want to do.
So for me, that was an HTTP route, and there's already a pre-written
class that you can just use and that does the
setup for you.
And then we time how many requests do I get out?
How long do the requests take?
And some other metrics if you want to.
And those are then run with different permutations.
So we run with observability enabled and disabled on different JVMs, and we can see how enabling
those features that you might want affects the system.
Okay, so you basically have a baseline without observability, and then you have different
levels of turning observability on, and you see how much this changes the throughput
or the performance of the app.
Exactly.
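As a minimal sketch of that baseline-versus-instrumented comparison: the same request loop is run once against a plain JVM and once against a JVM started with the agent, and the requests per second are compared. The endpoint and request count below are made up for illustration:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ThroughputSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint of the application under test; in the real setup
        // the same app is run once without and once with the agent attached.
        String url = args.length > 0 ? args[0] : "http://localhost:8080/hello";
        int requests = 10_000;

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();

        long start = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            client.send(request, HttpResponse.BodyHandlers.discarding());
        }
        double seconds = (System.nanoTime() - start) / 1_000_000_000.0;

        // Requests per second; comparing this value between the baseline run and
        // the instrumented run gives a rough measure of the agent's overhead.
        System.out.printf("%.1f requests/second (%.3f ms avg latency)%n",
                requests / seconds, seconds * 1000.0 / requests);
    }
}

The real framework also varies JVMs and feature flags and captures more metrics, but the core comparison works like this.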
Besides throughput, what else do you look for?
For me, it was actually only throughput. I'm not sure if you know
whether we capture some other information as well?
We often look for the usual suspects like memory and CPU, but it very often is really dependent on the actual feature or
instrumentation you're writing, what you are looking at.
So, but yeah, very often the metric is some kind of transaction rate or something like that.
And Marco, I'm not sure if you're allowed to say this, because you just worked on a feature that might not yet be released.
But if you can talk about it:
what feature were you working on?
Because for me, maybe from the outside world, they say, well, they had a Java agent for so many years.
What new features do we need?
Yeah, let's see what I can say.
So we try to support different libraries, to provide very rich observability out of the box for those libraries that you might want to add. And for me, that was something that would enable you to add logs to your PurePath and to see, okay, these logs belong to that.
Okay, so that's logs in context
of distributed traces for certain frameworks. Yeah, that's awesome, very cool. That's
actually one of the exciting things that happened over the last couple of years,
when we really brought out the ability to connect an individual log that was otherwise
in isolation with the actual trace that generated that log.
That was huge.
Very useful, of course.
Yeah, of course.
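The OneAgent does this linking automatically, but as a rough illustration of the underlying idea, this is how trace context can be attached to log lines by hand with the OpenTelemetry API and SLF4J's MDC; class and attribute names are purely illustrative:

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class LogTraceContextSketch {

    private static final Logger LOG = LoggerFactory.getLogger(LogTraceContextSketch.class);

    static void handleRequest() {
        // Grab the currently active span (created by an agent or by manual instrumentation).
        SpanContext ctx = Span.current().getSpanContext();

        // Put the trace/span IDs into the logging context so every log line written
        // while handling this request can be joined with the distributed trace.
        MDC.put("trace_id", ctx.getTraceId());
        MDC.put("span_id", ctx.getSpanId());
        try {
            LOG.info("processing payment"); // this log line now carries the trace context
        } finally {
            MDC.remove("trace_id");
            MDC.remove("span_id");
        }
    }
}

With the IDs on each log line, a log that would otherwise sit in isolation can be joined with the trace that generated it, which is exactly the feature being described here.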
Speaking of instrumentation, I want to switch topics a little bit now,
because we've been doing agent-based instrumentation since our existence, right?
Dynatrace, dynamic tracing is in our name.
So we started in the beginning with distributed tracing for Java applications
and maybe a little trivia for you, because you haven't
been with the company that long: initially, the product's
internal project name was JLT, Java Load Tracing.
So if you still see some Jira tickets floating around, the project JLT stood for
Java Load Tracing, so generating traces on Java applications while they're under load.
And the first load testing tool that we partnered with back then was Silk Performer.
So we've been doing this for a long, long time, but now we are in 2024.
OpenTelemetry, it seems, has taken at least the cloud-native space by storm.
OpenTelemetry is a great framework for developers to put in their own instrumentation with the
idea that vendors like us, we don't need to reverse engineer certain frameworks because we assume these frameworks
are already instrumented by the developers
of those frameworks
because they know their frameworks best
and what type of telemetry data it should produce.
Can you give us,
and I'm not sure who wants to start,
but a little bit of insights
on how we deal with OpenTelemetry,
and has this changed anything
from our agent-based approach? What do we do with this data? Is there anything you could talk about?
Yeah, I can start. I can take this again if it's fine, Marco. So for OpenTelemetry, we have several answers to that, several ways to work with OpenTelemetry.
And the simplest one is just on the server side, right?
So we have OpenTelemetry, they have their custom or they have their own communication protocol.
It's called OTLP.
So just the way they exchange tracing data with the backend system.
It's an open specification, open source, like everything in open
telemetry. And Dynatrace just offers an endpoint to ingest
OpenTelemetry. So this puts the whole agent
discussion completely out of the question, because
some agent is running, some open telemetry agent is running, and we
can ingest it. Then you might have a situation where you already
have a Dynatrace agent and, for example, an OpenTelemetry
agent or, in another situation, a Dynatrace agent
and an application that has OpenTelemetry
API calls.
So someone manually instrumented their applications.
And we have basically solutions for all of these scenarios.
So, in a way, you could think of the Dynatrace agent as an SDK for OpenTelemetry.
If I just write a piece of software and
add OpenTelemetry API
calls, then usually,
without any agent or SDK
being present, they are dormant.
So they add little to no overhead
because they're just empty implementation
stubs. If our agent sees
those, we can
instrument them, and suddenly
you light up all these OpenTelemetry API calls,
so additional instrumentation there.
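A rough sketch of what such dormant API calls look like: with only the OpenTelemetry API on the classpath and no SDK or agent configured, the tracer below is a no-op stub, and the spans only become real once an agent or SDK is present. The service and span names are made up for illustration:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class ManuallyInstrumentedService {

    // Without an SDK or agent, GlobalOpenTelemetry returns a no-op implementation,
    // so these calls are essentially empty stubs with little to no overhead.
    private static final Tracer TRACER =
            GlobalOpenTelemetry.getTracer("com.example.payment-service");

    void processOrder(String orderId) {
        Span span = TRACER.spanBuilder("processOrder").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("order.id", orderId);
            // ... business logic ...
        } finally {
            // When an agent "lights up" these calls, the span is recorded and exported;
            // otherwise this whole block is effectively a no-op.
            span.end();
        }
    }
}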
And together with the OpenTelemetry agent, we can run in a side-by-side scenario so we
don't step on each other's toes.
That's another scenario that we support. So we're fully embracing open
telemetry. It's not like a competition or something. We are even active in the whole
specification and even in the language groups and the implementation groups and try to contribute to
open telemetry. Yeah, I know we have a couple of colleagues
that are, as you said, actively contributing back to Upstream, also some of our instrumentation
technology, and that's great. Marco, anything from your side to add on that topic?
Yeah, I think I have to deflect on that one. I haven't really worked with
OpenTelemetry at Dynatrace yet.
Yes, thank you. Yeah. But maybe
a side question on this, because, you know, when we talk about OpenTelemetry, we talk about
open source, we talk about seeing what's out there, eventually, you know, contributing back or
using open source. I'm just interested, also from my background as an engineer: are you, in your
role as an engineer on the Java agent, getting in touch with open source projects? Are
you getting in touch with open source communities in any way?
Communities, a little bit. So besides
being an agent developer, I also work as a tech evangelist or tech advocate
or whatever you want to call it.
And so we have been in contact with some open source communities, but on the regular job,
not that much.
So of course, we need to look at the source code when we do an instrumentation for a new
framework.
And if this is open source, it makes it a lot easier.
Yeah.
What's the tech advocacy thing?
Can you say more, just out of curiosity?
Yes, of course.
So I spend some of my time being active in the community,
doing talks at meetups or at conferences
and also engage with some open source communities
and work like that.
Well, we should talk, because I do the same and I'm very happy to hear what you do.
Yeah, I'm quite aware of that.
Yeah, cool.
I mean, one last thing on OpenTelemetry
and I just had a podcast
where I was a guest on,
recorded when I was at KubeCon in Paris.
And I just want to rephrase this. As you said, Wolfgang,
I think OpenTelemetry is a blessing for our industry, because we can assume that we are getting
better telemetry data, because we assume that developers from runtimes, from frameworks,
from projects will instrument their code because they know it best.
So that's great. So we can ingest it and kind of
marry it with the data that we also have.
But I think there's also a misconception
out there that I highlighted in the
recent podcast. Some people believe open
telemetry is all you need. And that's
it. You install the open telemetry
collector and you install one of the
agents. And
while, just like installing the Java agent,
it gives you a lot of data, you still need to send the data to
some endpoint, where the data gets
analyzed and sanitized, and where you make sure that
only the right people have access to the data, because with the Java agent
and with OpenTelemetry you can collect a lot of data, also potentially
confidential data. And you want to make sure that the whole end
to end lifecycle of data, of observability data
is properly managed. So that's ingest, that's transport, secure transport,
that's analytics, and that's also making sure that only the right people
and the right other tools have access to the right amount of data.
And these are just discussions that I think most of you listeners know,
but I also know some people are not aware of this.
So OpenTelemetry is great because it gives us a great source of additional data,
but you still need a backend where this data gets sent.
So whatever tool you choose, choose it to your liking.
But just making sure that this misconception is understood.
Yeah, that's a good point.
So the agent, the whole agent part of the
story, is just one part of the equation.
It's only the data source. So the data sink, the destination,
and all the analytics that are run there, or the artificial intelligence engine that's running in
the background and creating those smart alerts. So that's where also a lot of the power is. So
collecting data is only useful if you can run analytics on it.
And if you're talking about big data, you cannot drill down into single traces or look at the spans.
You need to have a more coarse-grained view on things.
Wolfgang, in the very beginning,
when you introduced yourself,
you mentioned that you are a team captain,
but not only from one agent team,
but from multiple one agent teams,
which means, if you look at the OneAgent,
we just talked with Marco and
you about the Java area,
but there are many different technologies
we support
through our OneAgent.
And I assume this also means we have,
I don't know, do you have a rough estimate of
how many engineers we have
working on agent
technologies?
It's embarrassing that I don't
know the number now.
Don't quote me on it.
I mean, you're on a podcast.
I'm quoting myself on it.
Say 200.
I don't know.
But I would say,
I would,
sorry,
Marco,
go ahead.
I would have also said
in that ballpark,
I think we had a meeting
yesterday,
or a few weeks ago,
and that was
160 under our manager
or something like that.
No.
No.
Think about that investment, right? We have,
and I think this is the exciting piece, not only Java but all sorts of technologies that we support with our OneAgent, which automatically instruments
apps without you having to think about how to instrument it. And we continuously invest to make sure that these agents are up to date,
that they're well tested, they don't have any negative overhead,
and that the quality is so good that you can be confident
that we are not the next CrowdStrike or whatever other incident happens.
So I think that's just phenomenal.
I assume, with the 160 people, I know you both live in Austria, so you all work in our
labs here. Do we have other team members
that are in other parts of Europe, other parts of the world?
Yeah, we became
more and more distributed.
So all the negative impact that COVID and the pandemic had, the one benefit was that
we suddenly embraced remote employees and remote working models.
And this has really benefited us, I think, because sometime in the past,
it felt like the pool of engineers in Austria and especially in Linz was completely depleted.
And now we have brilliant people just recently joining in Tel Aviv and we have remote employees
in Berlin, in Switzerland. So just in the Java team, I don't even know where
all the people live, or where the other agent teams are sitting. And so this has really benefited us,
attracting talent from all over the world, almost, I want to say.
Yeah, I'm also one of the beneficiaries of that. So I'm in Klagenfurt and there are only two agent developers here
because the rest of my team,
the biggest part of my team is in Linz.
Yeah, I think that's nice.
I mean, if you think about how we started,
as you said, Linz is obviously the center
of our engineering organization
but just in Austria alone, with Vienna, with Graz, with Klagenfurt, Innsbruck, Hagenberg. And then beyond
Austria we have, you mentioned, Tel Aviv, we have Estonia, we have Poland, we have Barcelona.
I didn't know about Berlin, that we also have remote people sitting in Berlin. That's also pretty cool.
So obviously there's no limit anywhere, because
if we have good talent that wants to support our
cause of building the technology that makes sure
that our customers can themselves build software and operate software that runs perfectly,
then, wherever you are, you know, look at Dynatrace.
Marco, again, for my own benefit, because I used to be a developer and I used to write code,
in your regular day, what types of tools are we using internally? Do we have,
I assume, a lot of homegrown tools, obviously, but how do you develop? What do you use for your IDE?
What do you use? Just give me a couple of things. It's just interesting to talk
about tools as well.
Yeah, let me think. I think I actually have a very vanilla
setup. So I have IntelliJ and a
terminal. I like to use
Zsh, and, maybe a little bit special, Emacs for my log files,
if I need to edit something.
Besides that, not really.
How long did it take you?
You said you started a year ago.
How long did it take you?
I know it's a very specialized team and the agent team is very critical software for us.
But how long did it take you to actually then get started and actually contribute code?
So I think the whole setup took me about a week, week and a half around that.
And after that, I think one of the first issues that I actually worked on was a bug fix already.
So about a month after joining, I was working on that.
Yeah.
And that's, again, phenomenal, if you think about it.
The software that you guys are producing is extremely business critical.
And then having a short onboarding time, where you already work on bug fixes
after just a matter of days or weeks,
is, I think, phenomenal.
And that's really also kudos to, I guess,
team leads or team captains like Wolfgang
and the rest of the organization
that have provided a good framework
to make this happen.
I do have to admit, I think you never stop learning.
And I think there's still a lot of stuff that I don't know,
which I just noticed a few weeks ago.
So I guess there's always more to learn, especially with such a big organization.
Yeah.
Totally. And I think it also speaks for...
We have a very diligent code review process
that allows especially new joiners
to really have confidence in the code they are contributing to our codebase
because experienced developers have looked over it
and given their stamp of approval and so do the tests. So this gives a lot of
confidence when contributing. It was the same for me when I joined the Java team because I
did other things before. And even for, as I said, an old guy like me with 20 years of working in software, things were still new for me then.
I think compared to Brian and myself, you're also young.
Yeah, thanks for that.
Awesome. Did we miss anything? Is there anything else? Today is really about getting a little bit of a glimpse behind the scenes of a
particular part of our organization, a particular technology that is very critical to us and the
observability world. Anything else we missed? Or is that an indicator that we've covered it all?
Yeah, I'm drawing a blank right now.
No, it's all good.
We don't need to stretch it.
I think I just want to maybe then finish with something that I said earlier, but I cannot say this enough.
You make the life of many of us other Dynatracers super easy.
Because we can, with confidence, go to somebody that is either an existing Dynatrace customer or a partner or a prospect.
And with confidence, we can give them the one agent.
And the one agent is just picking up everything.
And we have a confidence that we're not crashing any system.
We have the confidence that we get the data that we want.
And we then have the confidence to show them
what they can do with the data
with the rest of what Dynatrace provides.
And I think that's just amazing.
And knowing there's so many people behind the scenes,
like you said, 160 people alone on the one agent side
that make sure that these agent technologies work
in all these environments,
because the world is not just running Kubernetes.
The world is not just running on Windows.
The world is a big, diverse world
where we have everything from serverless
all the way to the
mainframe and just having the support through one single agent that works flawlessly is a blessing
for us that sell this product. Well, can't follow that up so we'll land it there. Thank you again to the team for all you're doing
and for all the people who you're representing on here. And I'll even give a shout out to
agent developers on all the products, right? Because if one product is giving people a
bad experience, that's going to taint the experience
or the openness to using products like ours for all of it.
So there is definitely a unity between teams
for making sure it's all good for all of our sakes.
So thanks for all that you do.
And hopefully people found this as informative as Andy and I did.
And very happy to have you on.
See, Andy, I'm not even on anything strong.
I'm here with the pain and the painkillers.
They just told me,
take Tylenol and ibuprofen,
or ibuprofen and acetaminophen.
But I'm just not myself today.
I'm just stumbling around.
So I will let that stumble, stumble us to the end.
Thank you all of our listeners.
We'll see you next time.
Bye-bye.
Thank you, guys.
Bye-bye.