PurePerformance - Observability that is Battle tested by Millions with Marco Sussitz and Wolfgang Ziegler

Episode Date: August 12, 2024

When your code runs on more than 6 million systems - many of them business critical - then this is really exciting news for Marco and Wolfgang, Dynatrace OneAgent Java Team members. Their code powers auto-instrumentation and collection of all observability signals of Java-based applications running on every possible stack: containers in k8s, serverless, VMs, on your workstation, or even the mainframe.

Tune in as we sat down with Marco and Wolfgang to learn what it means to continuously innovate on agent-based instrumentation with 160+ other engineers across the globe that also focus on OneAgent. They share insights on how they develop their observability code, how they continuously test across all supported environments, what the processes at Dynatrace look like to avoid situations like the recent CrowdStrike outage, and how they integrate and collaborate with other communities and tools such as OpenTelemetry!

Things we discussed during the episode:
Dynatrace OneAgent: https://www.dynatrace.com/platform/oneagent/
Dynatrace for Java: https://www.dynatrace.com/technologies/java-monitoring/
OpenTelemetry and Dynatrace: https://docs.dynatrace.com/docs/extend-dynatrace/opentelemetry
Jobs at Dynatrace: https://careers.dynatrace.com/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me my fantastic co-host Andy Grabner. How are you doing today Andy? I'm very good and I'm so happy that you're back because I had to record the last episodes without you. Yeah. You don't even know about it because you just came back from vacation and don't even know what I did while you were gone. Well, I saw there was some back and forth on the thread. So I was like, oh, I should be getting some links from Andy pretty soon. But I have not
Starting point is 00:00:52 heard it yet. I don't know what the topic is. It'll be as much a surprise to me as it is to the listener. But I guess I'll get to hear it first before then. But before we jump into the topic, I hope you had an enjoyable vacation, even though you came back with something that you probably didn't want to come back with. I thought we weren't going to talk about my, you know, offshore leave. Now I'm kidding, I was trying to make a sailor joke, what sailors come back from leave with, if anyone gets that reference. Yeah, no, I went to New York with my daughter, met some friends there, but then when I was taking some photographs I slipped and
Starting point is 00:01:29 broke my collarbone so in a good amount of pain and uh but i'm here for you for the podcast and for all of our listeners because i could never ever let let this let our community down so and i'm pretty sure by now our guests wonder when do they finally stop talking and let us talk. And I think now is actually the moment. With us today, two guests, Wolfgang and Marco. And I would actually start in this case with Marco because your first name is higher in the alphabet.
Starting point is 00:02:02 So Marco, maybe you can just quickly introduce yourself, who you are, what you do. We all work at Dynatrace, so this is a special episode, but I would like to know what you do at Dynatrace and also what brought you to Dynatrace. Yeah, of course. So I work as a Java agent developer in our agent team, for about a year now. Before that I was just a normal Java developer, I worked at a video encoding company and did a little bit of C++ and some Java, and then I switched to Dynatrace. And I think one of the reasons, or one of the things that made Dynatrace very interesting to me, was the agent team, of course. I like the dynamic nature of Java and I like to be able to do certain things with
Starting point is 00:02:50 the language to do the one-time code manipulation that we do. Those are things that interest me a lot and that was especially why I wanted to join Dynatrace. Very cool. So I like the way you phrased it. You said you know on, I'm on the Java development, the one-agent development team for Java, but you used to be a regular Java developer. So I thought it was an interesting way to phrase it. But yeah, really cool. Then to our next guest, Wolfgang.
Starting point is 00:03:16 How about you? Yeah. Hi, everyone. My name is Wolfgang. And Brian, really sorry to hear about your accident. I went through the same thing a couple of years ago with a snowboarding accident, so I can really relate to the pain of breaking a collarbone. Yeah, thanks. Yeah.
Starting point is 00:03:33 Yeah, so my name is Wolfgang. I'm a team captain on one of the Dynatrace One Agent teams, actually on one of the Dynatrace Java One Agent teams. So we have more than one Java 1 agent team now, but I also did and still do regular development work there. And where I'm coming from, it's almost embarrassing when you talk about like your career or experience in the industry
Starting point is 00:04:02 in terms of decades, but yeah, that's what I've reached by now. So I'm looking back on actually exactly two decades of being in the software industry. So I graduated here in Linz in 2004. And on my regular office, I almost have a direct line of sight to my former university. And yeah, back then things were different. So we were still, there were no apps like, I don't know, iPhones or something like that. And not everything was on the web. So we were really, we were still writing Windows applications.
Starting point is 00:04:39 And I was starting as a.NET developer doing Windows forms for an Austrian e-government software. Did this for a couple of years and then I landed a company which was back then still quite famous for its name, Borland. I think Andy also spent some time there and was working. So it was not the Borland some of us knew back then, so no compilers, no IDEs, but mainly testing software. So we had a functional testing product. We had a load testing product, test management software. And I was working on the load testing
Starting point is 00:05:19 product on Silk Performer, which I actually was listening to some of the back episodes here. And that Ernst, of course, our chief architect at Dynatrace, was working on it and mentioned it. But also, what was her name? Modena, I think. She said she was using the product. So this really brought back very warm memories of the product I've been working on for quite a while.
Starting point is 00:05:44 Almost 10 years, actually. But then things went into a downturn spiral there at Borland. It's been acquired by a company called Microfocus in the meantime. And it was time for a change. And since I've known Dynatrace already, so we had some plugins and integrations with them. It was always a company. I knew it was interesting.
Starting point is 00:06:07 It was roughly the same domain like we were with the load testing. And yeah, that's when I switched jobs to Dynatrace and initially started there as a.NET developer working on the agent. And after a year or so, I figured out it was time to take on more responsibility and took over as a team captain, the SDK team. And then we figured out, well, SDKs aren't what's really looked for in the industry anymore. Open telemetry started developing. So we moved in that direction. So we were jumping quite around like technology
Starting point is 00:06:45 wise. And yeah, that's where I spent a lot of time then in like SDK and OpenTelemetry work. And then I did something else for a short time. And yeah, now I'm in the Java team and yeah, being team captain of that team. Yeah. I think both Brian and I were very happy to hear that we have similar backgrounds, especially with the load testing background. When were you at Borland? Which time do you remember? Which years? It was, let me think, 2008 until 2017. Yeah, so almost 10 years, as I said. Yeah. Both of you, thank you so much for the introduction. So the episode today, we typically talk about what people do with observability. We had a lot of different talks about obviously performance engineering, how observability data helps. Like
Starting point is 00:07:41 the episode that you mentioned with Almodena, that was interesting. We also talked about things like how people bake observability into their platform engineering. Hardly ever do we talk about what actually has to be done to get the observability data. And that's why I think it's great to get a little bit of a glimpse into what actually happens within Dynatrace as one of the vendors out there and how we are building agents and also ensuring that our agents not only deliver the data that our users need, but also do it in a secure and a reliable way. And the reason why I say secure and reliable: just at the time of the recording, just a couple of days ago, the world was struck by CrowdStrike,
Starting point is 00:08:26 by that big incident. And some could answer, or some would ask the question, how does this work in Dynatrace? Because we are building an agent that gets installed on thousands of machines out there. And we have a lot of privileges on some of these machines because we're capturing a lot of data.
Starting point is 00:08:44 How does software engineering work within Dynatrace so that we ensure that our agent, that your agent, that the two of you are producing, is not causing the next CrowdStrike incident? And I don't know who wants to take it first, but I would just be interested in hearing some thoughts because I want to just learn. What do we do for software quality within the agent? Mind if I take this, Marco? Of course, I think you know more than I do. Well, both know a lot about it, I guess, because it's, I mean, key is testing, right?
Starting point is 00:09:18 So this is really, I want to say, more than 50% of the daily business of a developer here at Dynatrace. So as you said, something like CrowdStrike, of course, is the absolute worst case scenario that could happen. But on a smaller scale, if we did something really wrong, if we made a big mistake, we could cause similar issues. So currently, for example, the Java agent runs on about 6 million installations. So if you have a crucial bug that would prevent the JVM from booting, you have a similar scenario to CrowdStrike. I mean, you don't take down the whole machine,
Starting point is 00:10:00 but you take down a JVM, which is, in the end, for a customer who's running their business transactions or business load, the same result. So to prevent that from happening, as I said, testing is key, and this is what we really take so seriously. Every day we have like a hundred thousand or so instances of test cases that are running. I say instances because it's not like individual tests, but if you run tests in different permutations
Starting point is 00:10:34 like operating systems, operating system flavors, even like 32-bit and 64-bit and Linux and Windows and esoteric platforms like Solaris and AIX. And then you want to run different versions of the product you're supporting. And you can easily see how this multiplies
Starting point is 00:10:55 into a huge matrix of test permutations. And that's what you're doing on a daily basis. So, of course, this cannot run with each and every product build, but it runs on its own cadence, at least daily. And then we take our time to look at these test results. So it's a fixed schedule, it's a fixed part of our weekly work, where we sit together and analyze the tests that may have failed and look at the results. But bottom line is, if really something terrible would happen, we would see this immediately. Yeah, I remember some time ago we were looking at our weekly meeting and for one of the tests that I've written, it
Starting point is 00:11:45 failed on a really obscure, I think some IBM machine or something, and I was really glad that somebody else immediately knew what was up, because I think figuring that out would have taken me a couple of days at least. Exactly, and that's really helpful, because of course every developer does their due diligence and runs the tests in the local environment, and maybe sometimes even in a virtual machine you have on your computer, or like WSL or something like that. But no one can run like a Solaris test in there because you just don't have the hardware, right, or the operating system.
Starting point is 00:12:21 So you are dependent on the CI, and this is where we really have the broad coverage that you need. Marco, maybe one additional question for you on this, because you said you just recently joined Dynatrace, quote unquote recently, like you've been here a year. And you said that when you had this test scenario that failed, you had other people that could jump in because they have been probably with the product for longer. I remember when I started, it was like 16 and a half years ago. Maybe, hearing from you: how many of your colleagues do you have in the team that have been there for so many years, that you can then actually also learn from
Starting point is 00:13:11 or ask for advice because it's impossible for somebody that just starts and has only a year under his belt to know all this? I think we are quite an old team, if I can correct me'm wrong, Wolfgang. But... There's one reason in experience. Experience, not all. So we have a lot of very experienced people that help a lot.
Starting point is 00:13:36 I remember when I started, I came from a smaller startup of 115 people. And there the scales are completely different. And also the amount of testing we do, of course, which was something I've never seen before in that scale, back then I think we had a few thousand end to end tests, but now it's a lot more. Now I need to ask you one question because I used to be a developer myself in the early days and testing was not the thing that got me excited in the morning when I got up and started working, just to be honest with you, right? I mean, I was
Starting point is 00:14:11 obviously often like you working for a testing company. So it was obviously in our DNA, but still it was also not something that I learned in my education. Test-driven development back then was, I think, something that we didn't really do. And this was like 25, 30 years ago when I got educated. Marco, as you said, how is it now for an engineer? Do you get, with all this testing that happens, how can you still focus and get excited about creating new things? Or what's the ratio? What does a day look like? It really depends on what I'm working on. So usually we work on one feature at a time.
Starting point is 00:14:54 And then, of course, a huge part of that feature is spent with testing. But for me, usually the workflow is to create some very simple test case to get the first workflow running, to how i'm doing um or what needs to be changed and then do to cover the rest of the testing at the end so i think it's usually for me it's a bike at the beginning at and at the end of of the development and between that i don't do that much i think what also makes it easy for an agent developer or a natural to write a lot of tests
Starting point is 00:15:32 is it's most often the only way to actually see if your code really is working and doing what it's doing. Back then, as I said, it was easy. You had a UI application and you saw the results and you had something to interact with. When you're writing agent code, you're basically placing your code in a customer's application and you need some kind of application, right, that you can instrument and run the agent code in.
Starting point is 00:16:00 And then you need to verify in some way that your code is doing what it's supposed to do like extracting data and tracing data so you need to write a test in the first place to verify that so it's uh yeah that burden or that doesn't there's no additional burden to writing tests in in that sense and what also helps is we have a really good framework that makes it easy to run things in containers or things like that. So you have no hassle for additional setup or so is needed. So you really need to lower the barrier for everyone to write good tests.
Starting point is 00:16:43 I think that's the next thing i noticed with joining dynatrace with a bigger company um how developed the testing infrastructure so back then one team is doing everything but nowadays we have in dynatrace we have those really developed processes this really developed frameworks that already do a lot for you making these and and i think brian right we both are really happy with with the output and the level of quality you guys are producing because so many times and brian you even more so in your line of work on this on the sales side right when you do a poc a proof of concept and have the magic one agent that we install on very critical systems. And obviously people are, if they don't know the magic of the one agent and the level of
Starting point is 00:17:32 quality it has, you might be frightened to install it on a production environment. Oh, yeah. I mean, that's, I think everyone's first question is, well, not as much these days, right? I think these days people are a lot more used to observability through agents. But going back even not too long ago, it was always, well, is this going to break my system? Is this going to break my system? And when you stop and think about it, it's a miracle that it doesn't. But it's because of that testing. And I've been here since 2011 with Dynatrace and I can barely think of any time that the agent had some core functionality or core bug in it that would really do that. makes it easier for us because we know, and it's one of the things we always say too, just so I don't know if you guys ever hear this,
Starting point is 00:18:31 it's like 99.99% of the time, if there's a problem when we deploy in someone's environment, it's usually because of some wonky thing in their environment. It's not because of our agent. They're doing something quote-unquote illegal. And that's just really comforting. Yeah. Yeah, I think what's also, I mean, also thanks,
Starting point is 00:18:49 welcome for the number that you mentioned earlier, 6 million installations of our Java agent alone. That's also a cool statement for, I guess, even a quote unquote young developer like Marco, you are knowing that your new features
Starting point is 00:19:04 get potentially used by 6 million different instances of the one agent that is observing business critical Java-based applications. That's a really cool thing. I actually wanted to follow up on this because if you think about it and think about the responsibility that comes with that, so it's easy to say, yeah, maybe we have this little bug that only affects maybe 0.1% of our customers. But what does this mean? Like 6 million installations is a lot of support cases, right? So you can't even have like an uncertainty part in your code. You say, yeah, maybe this affects someone, but not all of them. You should always be trying to make sure that no one is affected, right?
Starting point is 00:19:51 So if you develop software for a smaller audience, maybe you can live with such a percentage. In our case, it would mean we get flooded by support cases. Yeah, and by now, we will kind of see our product and our company grow and also the customers
Starting point is 00:20:10 that we have and where they install it. So we really run on business-critical apps. If they don't work anymore because of our mistake, then something at the magnitude of CrowdStrike could almost happen, right? So if you think about airlines, e-commerce sites,
Starting point is 00:20:27 insurance companies, governments, they all use our software, your software that you guys are putting out there. And just phenomenal. I wanted to touch upon that point one more time because early on when Wolfgang was explaining some of the testing,
Starting point is 00:20:38 I got really concerned for a minute until I thought it through some more, right? The idea that you can't test on like every single OS, there's all this whole plethora, there's a whole matrix of OSs or whatever. And that got me concerned, which my first concern,
Starting point is 00:20:55 which I still think is somewhat valid, is the quest for getting software out fast versus the test, the compromises we have to make in complete testing. Right. You know, we're all from probably most of us. I don't know about you, Marco, but Marco, but from Waterfall, right. Deployment where it had to be fully tested before anything. And that was a compliment to you, Marco, because you look younger.
Starting point is 00:21:22 So that leaves like these gaps. Right. And you said maybe it's running every day, but it's not running with every release because maybe you have a release every two hours. However, so that to me thought, well, that's exactly how something like CrowdStrike could get out. Right. Um, but when I thought about it again, the reality is any of the regular testing that's going to go on, the release testing, is going to be hitting those most common pieces, the most common platforms, because it's going to be going on what we're targeting. So CrowdStrike happened, at least I think, because I have no idea, but that was like every single OS instance.
Starting point is 00:22:04 It was almost like it was never deployed onto any machine, right? Whereas in this case, you're at least hitting the most major ones. It's all those edges and all those, the complete matrix that you're not doing. So the idea of even if you skip that full, if you don't do the full OS test on every release, you at least have your major ones covered just by your general pipeline testing and all that because that's going to be on that. So at least counter the idea that it's a huge risk. There's a risk, but it's more for the edge type stuff. And then to your point, if that edge is 5%, that's still quite a lot if you take a look at how many people are putting it on. But to create a CrowdStrike level event,
Starting point is 00:22:41 that kind of thing isn't going to happen if you're doing this sort of thing. Anyway, it was just an observation because I was mentally going through it thinking, oh my gosh, this is the end of the world and talking myself off of the ledge of, no, this covers most cases. So it's an interesting concept and one that you have to consider too of what coverage do we have if we're not doing the full suite, right? And how much do we slow down releases for full coverage versus not? And I think that's a question
Starting point is 00:23:07 we're going to have to tackle in the future if more of these kinds of things happen. Re-evaluating speed of release versus completeness of testing. Obviously, you can't have 100% completeness of testing, otherwise we'd be back where we were. But have we gone too fast? And I don't mean us Dynatrace,
Starting point is 00:23:24 but the industry. Maybe I misspoke before a little bit. So this not covering all the operating systems and flavors of operating systems, that's what happens on a developer's machine before they set up a pull request. So as soon as it's in the source code repository and the tests are running, everything is covered. As soon as a new agent is
Starting point is 00:23:55 ready for being rolled out into production, every possible test has been run. So there's even a separate stage. So we call's even a separate stage. So we call this the hardening stage. So even though we work in scrum sprints or iterations, finishing a sprint doesn't mean the agent is rolled out to customers immediately. So there is a hardening phase
Starting point is 00:24:25 where we still have the chance to identify something that might have slipped through through manual testing. Or also what's crucial is to have longer running tests, right? Because the tests I mentioned, you can't really afford to have a 24-hour test for every possible permutation there. So you have to select a few certain scenarios where you do long-running tests and see maybe memory is
Starting point is 00:24:53 accumulating or maybe there is like a weird CPU spike if you run for a longer time. So you also have to look out for these things to happen and the learning stage is the perfect opportunity to identify those kinds of bugs and uh yeah delay or a rollout or fix something before you roll out to a customer and this brings me to another question on on this before i want to move on to another topic but our podcast is called pure. So it has performance in its name. And I remember in the very early days of distributed tracing, agent-based instrumentation of apps, the number one concern was,
Starting point is 00:25:34 how much overhead do you guys add to my application? I don't want to install this because then my application gets slower and then it impacts my business and things like that. So I guess the level of testing you were just saying with the long-running tests, not only do we detect if long-running, we have memory leaks or things like this, but we also do intensive performance testing of our components as well and just seeing what is the impact that a code change has on the OS. Maybe Marco, you, especially from a developer perspective, I would be interested in how you deal with this.
Starting point is 00:26:13 Like, are you aware of this as an engineer when you're building a new feature? Yeah, I think it was actually something I just worked on recently. So the past week or half, week and a half I spent with performance testing for one of the features that I did just to make sure that everything is in order. And of course there's a performance impact to having the agent. I think everyone expects that but we try to keep that minimum. Yeah and I think you're doing a pretty good job in this because it's phenomenal to see what type of data we collect and how little of an impact it has. It was just before your time, before both of you joined, we had, as an industry, we had to do a lot of work to prove that agent-based instrumentation is not adding the level of overhead that would actually negatively impact the app. Or let's say that way the level of impact it has
Starting point is 00:27:10 because as you correctly say Marco, every instrumentation whether it's agent-based, whether you are doing manual instrumentation by writing logs, whether using OpenTelemetry, whatever it is, you're adding code to the existing business code and that just by default generates overhead. But the question is, what's the cost benefit of this? And you need to capture enough data without having the overhead, but you need enough data to then be able to see what you need to see. And on that point, Andy, the consistency of the negligible overhead, not only by Dynatrace, but I think by all
Starting point is 00:27:49 vendors out there, to keep that overhead low has been critical. And as you mentioned, we always used to hear that question. We rarely hear that question anymore. How much overhead is it going to ask? That's not a thought on people's mind because as observability has become more widespread, it's just been taken for granted. Yeah, it's just going to be a teeny bit for what we get. But that all comes from that hard work, right? It's because you all didn't fail in that, that it goes unnoticed. Same thing if you go back to, I always say back when we had,
Starting point is 00:28:21 during COVID, well, the vaccine works and people stop getting sick. People are going to say, well, was it really the vaccine or, you know, but because it would, there was no, people just didn't get sick, right? But if it, if people continued to get sick and die, then people would say, oh, it didn't work, right? So the lack of having incidents from overhead, while less visible, has made it a non-question for the most part within the industry. So that's a huge, huge task and a huge, huge accomplishment from you all and from all the
Starting point is 00:28:55 other agent developers too. And there is no glory in prevention, I think they said. That's what I was getting at. Yeah, exactly. Thanks. And maybe a last sentence to this. What was interesting, because observability was always there, but back then people were maybe just writing log files,
Starting point is 00:29:14 but nobody looked into the overhead of creating a log because this was just part of software engineering. And then the observability vendors came in, or back in the days, we called ourselves APM. And then you all of a sudden add something to your app. And then if things change, then you say, of course, it's the APM product or it's the agent. But there was already overhead anyway, but nobody talked about it.
Starting point is 00:29:38 Nobody looked into it by just writing logs. I mean, that's an interesting aspect as well that nobody thought about. I think you could say that observability is so omnipresent now that back in the day it was the question, have it or not? And now it's which vendor or which product to choose because having it is a given. Marco, last one on last follow up for for you because you mentioned in the last week and a half you had the test to do some load and performance testing on your code. Can you just quickly explain what that looks like for an engineer? What do you do? Do you use what testing
Starting point is 00:30:20 tools to use? What do you look at? What metrics? Yeah, of course. So the test setup is very similar to our regular end-to-end tests. So we have our whole framework and everything. And we already have quite a good setup to do the common performances that we want to do. So for me, that was an HTTP route and there's already a pre-written class that you just can use and that's the setup for you. And then we time how many requests do I get out? How long do the requests take?
Starting point is 00:30:54 And some other metrics if you want to. And those are then run with different permutations. So we have with the observability enabled, disabled on different JVMs and we can see how enabling those features that you might want would affect the system. Okay, so you basically have a baseline without observability and then you have different levels of turning observability on it and you see how much does this change the throughput or the performance of the app. Exactly.
Starting point is 00:31:23 Besides throughput, what else do you look for? For me, it was actually not only throughput. I'm not sure if you know, if we capture some other information as well? We often look for the usual suspects like memory and CPU, but it very often is really dependent on the actual feature or instrumentation you're writing, what you are looking at. So, but yeah, very often the metric is some kind of transaction rate or something like that. And Marco, I know you're not sure if you're allowed to say this, but because you just work on a feature that might not yet be released. But if you can talk about it, this is a feature.
Starting point is 00:32:09 What feature were you working on? Because for me, maybe from the outside world, they say, well, they had a Java agent for so many years. What new features do we need? Yeah, let's see what I can say. So we try to support different libraries to have already a very rich observability for those libraries that you might want to add. And for me, that was something that would enable you to add logs to your pure path and to see, okay, this logs belong to that. Okay, so that's the uh the locks in context of distributed traces for certain frameworks yeah that's awesome very cool yeah that's um actually one of the the exciting things that happened over the last couple of years when we
Starting point is 00:32:56 when we really uh brought up the ability to connect an individual log that was otherwise in isolation with the actual trace that generated that log. That was huge. Very useful, of course. Yeah, of course. When you talk about instrumentation, I want to switch now topics a little bit because we've been doing agent-based instrumentation since our existence, right? Dynatrace, dynamic tracing is in our name.
Starting point is 00:33:27 So we started in the beginning with distributed tracing for Java applications. And maybe a little trivia for you, because you're not that long with the company: initially the product, the internal project name, was called JLT, Java Load Tracing. So if you still see maybe some Jira tickets floating around, the project JLT stood for Java Load Tracing, so generating traces on Java applications as they're under load. And the first load testing tool that we partnered with back then was Silk Performer. So we've been doing this for a long, long time, but now we are in 2024.
Starting point is 00:34:07 OpenTelemetry it seems has taken the cloud native space, at least by storm. OpenTelemetry is a great framework for developers to put in their own instrumentation with the idea that vendors like us, we don't need to reverse engineer certain frameworks because we assume these frameworks are already instrumented by the developers of those frameworks because they know their frameworks best and what type of telemetry data it should produce. Can you give us,
Starting point is 00:34:37 and I'm not sure who wants to start, but a little bit of insights on how do we deal with open telemetry and has this changed anything from our agent-based approach? What do we deal with OpenTelemetry and has this changed anything from our agent-based approach? What do we do with this data? Is there anything you could talk about? Yeah, I can start. I can take this again if it's fine, Marco. OpenTelemetry is, so we have several answers to that, several ways to work with OpenTelemetry. And the simplest being, it's just on the server side, right?
Starting point is 00:35:18 So we have OpenTelemetry, they have their custom or they have their own communication protocol. It's called OTLP. So just the way they exchange tracing data with the backend system. It's an open specification, open source, like everything in open telemetry. And Dynadrace just offers an endpoint to ingest open telemetry. So this puts the whole agent discussion completely out of question because some agent is running, some open telemetry agent is running, and we
Starting point is 00:35:47 can ingest it. Then you might have a situation where you already have a Dynatrace agent and, for example, an OpenTelemetry agent or, in another situation, a Dynatrace agent and an application that has OpenTelemetry API calls. So someone manually instrumented their applications. And we have basically solutions for all of these scenarios. So if in a way you could think of the Dynatrace agent as an SDK for OpenTelemetry.
Starting point is 00:36:22 So if I just write a piece of software and add OpenTelemetry. So if I just write a piece of software and add OpenTelemetry API calls, they usually without any agent or SDK being present, they are dormant. So they add little to no overhead because they're just empty implementation stops. If our agent sees
Starting point is 00:36:39 those, we can instrument them and suddenly you light up all these OpenTelemetry API calls, so additional instrumentation there. And together with the OpenTelemetry agent, we can run in a side-by-side scenario so we don't step on each other's toes. It's another way that we support. So we're fully embracing open telemetry. It's not like a competition or something. We are even active in the whole
Starting point is 00:37:13 specification and even in the language groups and the implementation groups and try to contribute to open telemetry. Yeah, I know we have a couple of colleagues that are, as you said, actively contributing back to Upstream, also some of our instrumentation technology, and that's great. Marco, anything from your side to add on that topic? Yeah, I think I have to deflect on that one. I haven't really worked with OpenTelemetry at Dynatrace yet yes thank you yes yeah but maybe uh maybe a side question on this because you know when we talk about open telemetry we talk about open source we talk about um seeing what's out there eventually you know contributing back or
Starting point is 00:37:59 using using open source often i'm just interested also from my background as an engineer. Are you in your role as an engineer on the Java agent? Are you getting in touch with open source projects? Are you getting in touch with open source communities in any way? Communities a little bit. So besides being an agent developer, I also work as a tech evangelist or tech advocate or whatever you want to call it. And so we have been in contact with some open source communities, but on the regular job, not that much. So of course, we need to look at the source code when we do an instrumentation for a new
Starting point is 00:38:39 framework. And if this is open source, it makes it a lot easier. Yeah. What's the tech advocacy thing? Can you just out of curiosity? Yes, of course. So I spend some of my time being active in the community, doing talks at meetups or at conferences
Starting point is 00:38:56 and also engage with some open source communities and work like that. Well, we should talk because I do the same and I'm very happy to hear what you do. Yeah, I'm quite aware like that. Well, we should talk because I do the same and I'm very happy to hear what you do. Yeah, I'm quite aware of that. Yeah, cool. I mean, one last thing on OpenTelemetry and I just had a podcast
Starting point is 00:39:18 where I was a guest on, recorded when I was at KubeCon in Paris. And I just want to rephrase this. As you said, Wolfgang, I think open telemetry is a blessing for our industry because we can assume that we are getting better in telemetry data because we assume that developers from runtimes, from frameworks, from projects will instrument their code because they know it best. So that's great. So we can ingest it and kind of marry it with the data that we also have.
Starting point is 00:39:50 But I think there's also a misconception out there that I highlighted in the recent podcast. Some people believe open telemetry is all you need. And that's it. You install the open telemetry collector and you install one of the agents. And while I know this is like just installing the Java agent,
Starting point is 00:40:08 it gives me a lot of data, but without sending the data to some endpoint, without there the data being analyzed, sanitized, making sure that only the right people have access to the data, because with the Java agent and with OpenTelemetry, you can collect a lot of data and also potentially confidential data. And you want to make sure that the whole end to end lifecycle of data, of observability data is properly managed. So that's ingest, that's transport, secure transport,
Starting point is 00:40:40 that's analytics, and that's also making sure that only the right people and the right other tools have access to the right amount of data. And these are just discussions that I think most of you listeners know, but I also know some people are not aware of this. So OpenTelemetry is great because it gives us a great source of additional data, but you still need a backend where this data gets sent. So whatever tool you choose, choose it on your liking. But just making sure that this misconception is understood.
Starting point is 00:41:13 Yeah, that's a good point. So it's an agent, the whole agent part of the entire trace is just one part of the equation. It's only the data source, So the data sink, the destination, and all the analytics that are run there are all artificial intelligence engine that's running in the background and creating those smart alerts. So that's where also a lot of the power is. So collecting data is only useful if you can run analytics on it. And if you're talking about big data, you cannot drill down into single traces or look at the spans.
Starting point is 00:41:54 You need to have a more coarse-grained view on things. Wolfgang, in the very beginning, when you introduced yourself, you mentioned that you are a team captain, but not only from one agent team, but from multiple one agent teams, which means if you look at the one agent, we just talked about like Marco,
Starting point is 00:42:20 you and the Java area, there is many different technologies we support through our one agent. And I assume this also means we have, I don't know, do you have a rough estimate of how many engineers we have working on agent
Starting point is 00:42:35 technologies? It's embarrassing that I don't know the number now. Don't quote me on it. I mean, you're on a podcast. I'm quoting myself on it. Say 200. I don't know.
Starting point is 00:42:50 But I would say, I would, sorry, Marco, go ahead. I would have also said in that ballpark, I think we had a meeting
Starting point is 00:42:55 yesterday, a few weeks ago, and that was 160 under our manager or something like that. No. No. Think about that investment right we have uh
Starting point is 00:43:14 and i think this is the this is the exciting piece that that we have uh not only java but all sorts of technologies that we support with our one agent that automatically instruments apps without you having to think about how to instrument it. And we continuously invest to make sure that these agents are up to date, that they're well tested, they don't have any negative overhead, and that the quality is so good that you can be confident that we are not the next crowd strike or whatever other incident happens. So I think that's just phenomenal. I assume with the 160 people, I know you both, you are you live in Austria, so you all work in our
Starting point is 00:43:51 labs here. Do we have other team members that are in other parts of Europe, other parts of the world? Yeah, we became more and more distributed. So all the negative impact that COVID and the pandemic had, the one benefit was that we suddenly embraced remote employees and remote working models. And this has really benefited us, I think, because sometime in the past, it felt like the pool of engineers in Austria and especially in Linz was completely depleted.
Starting point is 00:44:33 And now we have brilliant people just recently joining in Tel Aviv and we have remote employees in Berlin, in Switzerland. So just in the Java team, I don't even know where the people live and the other agent teams are sitting. And so this has really benefited us, attracting talent from all over the world, almost, I want to say. Yeah, I'm also one of the beneficiaries of that. So I'm in Klagenfurt and there are only two agent developers here because the rest of my team, the biggest part of my team is in Linz. Yeah, I think that's nice.
Starting point is 00:45:16 I mean, if you think about how we started, as you said, Linz is obviously the center of our engineering organization but just in austria alone with vienna with gas with klagenfurt innsbruck hagenberg and then beyond austria we have you mentioned tel aviv we have estonia we have poland we have barcelona um i didn't know about berlin that we also have remote people sitting in Berlin that's also pretty cool so obviously there's no limit anywhere because if we have good talent that want to support our
Starting point is 00:45:53 cause of building the technology that makes sure that our customers can themselves build software and operate software that runs perfectly then wherever you are you know look at Dynatrace. Marco, again, for my own benefit, because I used to be a developer and I used to write code, in your regular day, what types of tools are we using internally? Do we have, I assume, a lot of homegrown tools, obviously, but how do you develop? What do you use for your IDE? What do you use? Just give me a couple of things. It's just interesting to talk about tools as well. Yeah, let me think. I think I've actually a very vanilla
Starting point is 00:46:36 setup. So I have some IntelliGy terminal. I like to use ZSR. I think that's maybe a little bit special Emacs for my log files. So if I need to edit something. Besides that, not really. How long did it take you? You said you started a year ago. How long did it take you?
Starting point is 00:46:59 I know it's a very specialized team and the agent team is very critical software for us. But how long did it take you to actually then get started and actually contribute code? So I think the whole setup took me about a week, week and a half around that. And after that, I think one of the first issues that I actually worked on was a bug fix already. So about a month after joining, I was working on that. Yeah. And that's, again, it's phenomenal because we need to think about it. The software that you guys are producing is extremely business critical.
Starting point is 00:47:42 And then having a short onboarding time where you already work on bug fixes of just a matter of days or weeks is, I think, phenomenal. And that's really also kudos to, I guess, team leads or team captains like Wolfgang and the rest of the organization that have provided a good framework to make this happen.
Starting point is 00:48:03 I do have to admit, I think you never stop learning. And I think there's still a lot of stuff that I don't know, which I just noticed a few weeks ago. So I guess there's always more to learn, especially with such a big organization. Yeah. Totally. And I think it also speaks for... We have a very diligent code review process that allows especially new joiners
Starting point is 00:48:32 to really have confidence in the code they are contributing to our codebase because experienced developers have looked over it and given their stamp of approval and so do the tests. So this gives a lot of confidence when contributing. It was the same for me when I joined the Java team because I did other things before. And even with, as I said, an old guy like me with 20 years of working in software, things were still new then for me. I think compared to Brian and myself, you're also young. Yeah, thanks for that. Awesome. Did we miss anything? Is there anything else? Today is really about getting a little bit of a glimpse behind the scenes of a
Starting point is 00:49:28 particular part of our organization, a particular technology that is very critical to us and the observability world. Anything else we indicator that we've covered it all. Yeah, I'm drawing a blank right now. No, it's all good. We don't need to stretch it. I think I just want to maybe then finish with something that I said earlier, but I cannot say this enough. You make the life of many of us other Dynatracers super easy. Because we can, with confidence, go to somebody that is already either existing Dynatracer customer or a partner or a prospect.
Starting point is 00:50:24 And with confidence, we can give them the one agent. And the one agent is just picking up everything. And we have a confidence that we're not crashing any system. We have the confidence that we get the data that we want. And we then have the confidence to show them what they can do with the data with the rest of what Dynatrace provides. And I think that's just amazing.
Starting point is 00:50:41 And knowing there's so many people behind the scenes, like you said, 160 people alone on the one agent side that make sure that these agent technologies work in all these environments, because the world is not just playing Kubernetes. The world is not just running on Windows. The world is a big, diverse world where we have everything from serverless
Starting point is 00:51:04 all the way to the mainframe and just having the support through one single agent that works flawlessly is a blessing for us that sell this product. Well, can't follow that up so we'll land it there. Thank you again to the team for all you're doing and for all the people who you're representing on here. And I'll even give a shout out to agent developers on all the products, right? Because if one product is giving people a bad experience, that's going to taint the experience or the openness to using products like ours for all of it. So there is definitely a unity between teams
Starting point is 00:51:53 for making sure it's all good for all of our sakes. So thanks for all that you do. And hopefully people found this as informative as Andy and I did. And very happy to have you on. See, Andy, I'm not even on. I'm with the pain and the painkillers. It's not even, they just told me, give me Tylenol and ibuprofen,
Starting point is 00:52:14 but it's, or ibuprofen is iminophine. But I'm just not myself today. I'm just stumbling around. So I will let that stumble, stumble us to the end. Thank you all of our listeners. We'll see you next time. Bye-bye. Thank you, guys.
Starting point is 00:52:28 Bye-bye.
