PurePerformance - Educating the next generation of Observability Heroes with Rainer Schuppe
Episode Date: May 20, 2024
Making observability available to everyone! This noble goal needs superhero powers in an IT world where there is so much chatter and confusion about what observability is, how to sell its value beyond a glorified troubleshooting tool, and how OpenTelemetry will disrupt the landscape. In our latest episode we have Rainer Schuppe, an observability veteran with more than 20 years in the space who has worked for the majority of the observability vendors. He shares his observability expertise through workshops in his home town on Mallorca, teaching organizations everything from basic to strategic observability implementations. Tune in and learn about the typical adoption and maturity path of observability within enterprises: from fixing the problem at hand, to justifying the cost to keep it, to enabling companies to become information-driven digital organizations! Also check out his OpenTelemetry journey in his blog post series.
Here are the links we discussed today:
Observability Heroes Website: https://observability-heroes.com/
Observability Heroes Community: https://observability.mn.co/
Cloud Native Mallorca Meetup: https://www.meetup.com/cloud-native-mallorca/
OpenTelemetry: https://opentelemetry.io/
Rainer on LinkedIn: https://www.linkedin.com/in/rainerschuppe/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always I have with me my mocking co-host Andy Grabner.
How are you doing Andy?
Good, just talking about mocking.
Mocking services, I don't know why this just comes to mind, but for whatever reason: mocking, mockingbirds, mocking services, mocking you. I have no idea why this came up.
You just had a brainstorm of mocking.
Yeah, yeah. Well, anyway, I'm good, thanks for asking.
Oh good, I did ask you.
Yeah, no, okay, I thought you were saying that sarcastically. I'm like, what are you talking about? So anyway, you know, we have a lot of rabbits around here. I like to call them bunnies because it's cuter, and they're cute little animals. Every day I find myself looking out the window at them, seeing them hopping around and trying to figure out what they're getting at, what they're trying to do. I'm always making mental notes of what they're doing, and I'm like, this is really an inefficient way
to track their
behavior and see what's going on
and try to understand if there's any
attack vectors of the neighborhood cat coming
in and
it's reminding me of something I just can't put my
finger on. I was hoping, Andy, you
might be able to help me remember what
this reminds me of.
This reminds you of... of... of... observing. Observing. Observability. That's right.
Maybe, you know what would be cool? If we had an observability hero that could help us and actually shed some light, and then maybe also tell us how we can apply the stuff we've been talking about for the last 16 years, or eight years on the podcast, not only to microservices, but maybe in a broader sense.
It would be amazing if we could do that.
It would be amazing. What if we just clap our hands? And, whoa, look at that,
we have a guest here on the podcast, Rainer Schuppe. Servus, Rainer.
Servus, Andi. And hello Brian, and thanks for having me here. And double thanks for calling me an observability hero. Well, I'm definitely a veteran in that space, doing that for 21 years officially for lots of APM companies, and unofficially for, I would think, 25 years. But that's stuff that you don't want to know about. That's like Stone Age manual observability, with logs and Perl scripts to extract some kind of metric from them.
Yeah, you could put that in a haunted house or something like that.
Yeah, I had to call you the Observability Hero because, guess what, you put it on your website: Observability Heroes.
He's very humble. I guess he's very
humble, yes. That's where
Observability Heroes are made. I'm not saying I'm
one, but I can help you to become
one.
That's also the name of the community that I just founded. And yeah, I hope it can be some kind of Justice League or Avengers for, you know, observability.
That would be the
goal. But it's a long
way. As you know, Andy,
I mean, we know each other for
16 years
or so. Yeah, it's been 16 years. You joined Dynatrace
while I had my stint there. So after working for Wily Technology, with the first APM solution out there that used bytecode instrumentation, I joined Dynatrace afterwards and then hopped on to other generations with AppDynamics and, most recently, Instana. But
what I really like is the
OpenTelemetry now, which
is kind of
the
common denominator for
observability that we can all
benefit from.
I hope it will become a street name
out there and will make it easier
to deploy observability solutions
in organizations because everybody needs it.
I mean, it's, yeah.
If you don't have it,
you surely run into troubles at one point in time.
So coming back to the superhero kind of theme: does this mean the OpenTelemetry trace is the secret weapon, or the secret craft, of the superhero? What makes an observability hero a superhero? What is it that you need to know? What are you teaching people in your, you know, in your classes? And it's been great to bump into you last week, just by chance.
Maybe a quick story
for those of you, right? I've been
fortunate enough to travel to Barcelona
last week giving a talk at a
cloud-native meetup about platform engineering
and I posted about it on LinkedIn. All of a sudden
Rainer sends me a LinkedIn message.
Hey, I'm just around the corner.
I'm on Mallorca. And I said, I thought,
hmm, Mallorca is not just the next town over.
But I guess geographically-wise, it's not that far away.
No, it's a 30-minute flight.
It's an island, so I have to hop over.
It's either a couple of hours on the ferry or just jump on a plane, get over to Barcelona.
And guess what? Next week
we have a CNCF meetup
on Mallorca in a
very nice location in Palma.
So whoever is on the island
and likes to know more about it, it's not
specifically observability, it's CNCF.
But I know that
Pera is there, one of the Dynatrace
engineers. And it's actually
again like the one in
Barcelona,
sponsored and initiated by
some Dynatrace folks.
So, yeah.
What does it take to...
Sorry, I'll answer your question.
I love Spain, so I
wanted to talk about Spain, but we're not here to talk
about Spain.
But you can come here anytime.
My wife and I, we are having a co-working space,
so we have a very good internet connection,
a really good coffee, and a very
relaxed atmosphere around.
So as soon as you step out of the office,
you're like in a 4,000-people village
in the south of Mallorca,
and everything goes away
and then everything comes back.
And for me, for example,
the idea about observability heroes.
Because when I go back in time,
my father was a mechanic
and he repaired things.
And that's what people applauded him for
because he had the endurance
to look for stuff that others didn't.
So he was like,
this was his passion,
like finding the root cause of a problem and then fixing it.
And I have inherited that.
And because I never give up when somebody presents me with a problem,
it's my obligation to solve it.
Unless it's really that bad that I said,
sorry, I'm the absolute wrong person for that.
But as long as I see a chance that I could solve it,
I'll give my best. And
being 30 years in the IT industry, I have seen everything. I started at the end of a hotline, answering end customer calls, then did some operations for PDP-10 mainframes from DEC at that time, and then went into consulting and architecture and ended up in pre-sales for APM companies. And all the time, people called me when there was something wrong, because I had an understanding of all these systems.
And what it takes to become an observability hero is basically: you need to have an understanding of the systems,
And that's different from organization to organization.
You need different tools for, let's say, the old ones with monolithic or three tier applications.
And then for microservices and serverless, you need a different tool set; generically the same: logs, traces, and metrics.
But in different, how would you call it, Brian?
Different flavors.
Yeah.
Varieties, flavors, yeah.
And what I teach people in my coachings
and the workshops is
how to get where they want to be.
And the first step is
finding out where they want to be.
Are they simply troubleshooters?
Are they in the organization
just to make sure that the operation
doesn't totally break down? Or are they already at the forefront, establishing kind of a platform engineering type of thing where they need to get more people on board? Or do they need to do some troubleshooting first to get the funding for some wider observability? Because observability is basically,
what I've seen in the last 20 years
is always an overlay function.
It's a cost center.
It's not something that you pay up front for
because you see lots of value.
It's something that you need to pay for
because you just need it to make systems run.
And if you need a, in former times we called it a crit sit, a critical situation, to get the funding for your observability tools, that's something that's already good. But it's better if you get the funding upfront, because you can convince people that they need observability to make good use of, and not overpay in, Kubernetes, and not overpay in the cloud, in the cloud services. Because all of these microservices are nice to deploy: hey, deploy here, deploy there. With monoliths, you already started to throw hardware at performance problems: oh, we ran out of memory? Okay, well, put in more memory. Now, in Kubernetes or in cloud-native deployed applications,
it's kind of the same.
You just don't notice it
because Kubernetes takes care of that for you,
and it can cost you an arm and a leg.
So if you do it right,
and that's also what I teach people,
is you can justify the cost for observability
with the cost reduction you have in the...
How should I say that?
Operational costs?
Yeah, in preventing over-provisioning,
which is something that lots of people do
because, well, let's say GCP does that for you,
or AWS or Microsoft Azure,
they do that for you
as a kind of an
operational security, which is nice,
but, well,
sometimes you pay through your nose
to get that type of security.
And
that's where observability can
help as well, but that's usually step two.
First step is, we do have a massive
problem somewhere and we need fast help
and then, well, where do we get the
data from? And then people look around. And you've probably seen this, Andy, and Brian as well, Andy in the pre-sales space: these companies that want to run a proof of concept just to get your tool to solve a specific problem. And then you start selling.
Andy, I was going to say, it sounds like the secret weapon of the superhero is similar to Batman.
It's the utility belt.
You mentioned having the tool set, the belt, and also the passion.
Now, Batman's a little dark there, but he still has a passion for justice, right? And you mentioned passion, which I really love, because I think a lot of people, when they're treating performance and observability as a requirement, don't always do it so well. It's more: if we need to get something in there, we have it, we check the box. But when you have, A, the right tool belt, and B, the passion to make sure you're doing observability right, the passion to find ways to save money in the cloud or to increase performance and all that, that's when you can really take off and become that superhero, because you're putting everything you have into improving the entire ecosystem, not only for the end users, but for the organization.
And then that's when you end up looking like that superhero.
And you feel like one too, right?
If you don't have that in any job you have, if you don't have that passion for it, you're
just going through the day.
So I think we found the secrets.
I think we found the secrets of being that superhero there.
I want to recap something quickly.
I'm not sure if you noticed, but I always take a lot of notes
when we have these podcasts, and then I use them, A,
to reflect afterwards, but also to write the summary.
But I wanted to go back to what you said earlier.
You said traditionally the need for observability comes in
because people are having a critical situation, a problem.
Then get me any tool that I need to solve the problem.
Cool, problem solved.
Then the next step typically is, okay, how can we justify the costs?
If you want to keep the observability tool for longer, this is typically where we address the cost-saving potential, right? You know, optimizing your systems, combining it with performance testing,
fixing hot spots in your application, right sizing and everything like that.
For me, the last step, and I think you mentioned this in the beginning,
is then, however, how can we change the mindsets of organizations
to think about the potential of observability from the start
so that they actually can become data-driven companies. Because if I'm a new startup,
I guess, and I have a product idea, I typically have assumptions. I have an assumption about how
many people will use my product, how much money will I make? I probably also do experiments with my features that I put out there.
And for monitoring and evaluating my assumptions and run my experiments, I need some type of
data, observability data.
Because we're all digital organizations, ideally, we get this observability data straight from
the digital services.
So are these the three steps, or are there also things in the middle? What's in the middle between "I'm using observability for cost optimization" and "I'm using observability for really making strategic, data-driven decisions"? Are there any other reasons why people would use observability for something in between?
Probably many reasons, but I would say they are niche, and
they all depend
on the data.
And thanks for
reminding me.
I mean, the
biggest problem
in troubleshooting
is not having
data.
And that
means if you
put in
observability
when the
problem
happened, you
are at a loss.
You have to
hope that the
problem happens
again.
So having the
data collected
from the beginning
gives you not only performance indicators
that you won't see in development or testing.
I once gave a talk at a conference
about the different types of performance problems,
and they only reveal themselves usually in production.
And this is what you cannot test
but you need the data.
But you only know what data you need
when you actually have the issues.
So I think that's also one of the things
that people have to realize
that it's a constant learning process.
And that's not, let's say, a step.
It's a learning step.
You have to constantly apply your know-how, your expertise
to collect the right data
and scrap the data that you don't need, and find a good... The American way of saying it is probably: solve the Goldilocks problem, with just the right amount, not too much, not too little. And this is, for me, in between step two and three, but it's not the kind of thing that you can learn and certify. It has to come from the inside, which is really hard to measure.
Yeah.
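A minimal sketch of that "just right" idea, assuming the OpenTelemetry Java SDK: a ratio-based sampler keeps a fixed fraction of traces rather than everything or nothing. The 10% value and the class name are purely illustrative, not anything recommended in the episode.

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

public class GoldilocksSampling {
    public static void main(String[] args) {
        // Keep roughly 10% of traces: head-based sampling as one way to avoid
        // collecting "too much" while still having data when a problem recurs.
        Sampler tenPercent = Sampler.parentBased(Sampler.traceIdRatioBased(0.10));

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .setSampler(tenPercent)
                .build();

        // The provider would then be wired into an OpenTelemetrySdk instance;
        // exporters are omitted here to keep the sketch focused on sampling.
        tracerProvider.close();
    }
}
```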
And coming back to your question about other use cases: security. Observability gives you data about volatile things, about logins, about possible attacks that are happening, because all of a sudden you get a ton of 404 errors on different endpoints and you see there could be an attack in there. So also security; I mean, in our times that's a big, big thing to think about. And a couple more that I just can't come up with off the top of my head at the moment.
Yeah, no worries. In your blog, if I look at, and folks,
all the links, as always, you'll find them
in the description of the blog post,
but observability-heroes.com
definitely gets you to the website.
And then looking at the blog,
I saw that, first of all, you did a good job
with explaining what is observability
from your perspective.
But then the next thing, and I know last week when we were in Barcelona you said, hey, let's spend another hour or two after the meetup and let's sit down with a beer, with a cerveza. And we tried, at least I tried, to do my best in ordering everything in Spanish. We ended up not, uh, not dehydrated; we had plenty of beers.
But I remember you said, hey, this OpenTelemetry thing,
and you mentioned this earlier, this is really something interesting.
And you started your own OpenTelemetry journey.
And you also started now to write a blog about your journey, right?
You started... I'm looking here at "My OTel Journey, Week One". OpenTelemetry, as you said, is hopefully going to be an amazing tool for all of us to really enable developers and organizations to get the data that they need, because they are in control of what type of data gets generated. I think there's also still a little bit of a misconception of what OpenTelemetry really is, and this is where a very educational piece comes in, right? OpenTelemetry is not, like, magic. It's not a magic wand that you sprinkle some stars over and then everything automatically works and you don't need anything else. OpenTelemetry only solves the problem of what type of data do we capture, what type of insights do we want. But you still need to get this data to some endpoint, you need to store it somewhere, you need to analyze it somewhere.
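To make that concrete, here is a hedged sketch using the OpenTelemetry Java SDK: the SDK produces spans and hands them to an OTLP exporter pointed at some collector or backend endpoint (the localhost URL below is a placeholder), but everything beyond that point, storage, querying, visualization, is whatever backend you choose.

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class OtelEndpointSketch {
    public static void main(String[] args) {
        // OTLP exporter: OpenTelemetry's job essentially ends at handing data to this
        // endpoint. "http://localhost:4317" is a placeholder for your collector/backend.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://localhost:4317")
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();

        OpenTelemetry otel = OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .build();

        // Produce one span; storing, querying, and visualizing it is the backend's job.
        Tracer tracer = otel.getTracer("demo");
        Span span = tracer.spanBuilder("checkout").startSpan();
        span.end();

        tracerProvider.close();
    }
}
```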
But open telemetry,
many people are walking through that same journey that you are currently going through.
You're learning this new technology. What are your lessons learned
so far and what would you like people to take away
from your learnings?
Obviously, besides reading your blog series,
what else is there?
OpenTelemetry is a good start.
And as you said, it's all about data gathering.
And that's my biggest learning.
You need to have a good backend to analyze the data.
So getting the data is one thing. That's a big thing.
If we have one common platform that can gather the data in a specific format, and that's why context and, like, the wording, the right nomenclature, is essential, then you can actually
find the stuff in other systems.
But that takes away a big, big, big chunk of work in actually getting the data.
But then you have to store it, you have to analyze it, and this is where it gets tricky.
And after working for four big APM companies where we provided that,
and now I have to actually,
hats off to all the developers who stored that stuff,
who made it easy to analyze,
easy to go through and to query all of that stuff
because this is what's bothering me most in OpenTelemetry.
And it's not what OpenTelemetry is all about.
That's what I realized.
It's not about analyzing the data.
It's about gathering, and then you use whatever tool you need to get your job done.
If you're an operator, you go with Grafana dashboards and you put the metrics in there and get your alerts.
If you're a developer, you want your fine-grained service traces and all of that to dig into how your stuff actually works and maybe make it perform better.
When you are testing QA,
you want to see where are the things
that need to be optimized
or just give it a go to release in production or not.
But that means you also need different visualization.
And this is what OpenTelemetry doesn't help you with.
This is where Dynatrace,
Instana, Honeycomb,
all of the
others out there help you
to do that work. And seriously, I looked it over, and it's open on the internet, so this is no trade secret: the installation of the Instana
backend. When you read the
documentation, how to install it,
you realize that this is a thing consisting of 30 different services
with three databases.
So there's Cassandra, there's Elasticsearch, there's ClickHouse,
there's also, there's no Postgres anymore.
Cockroach was also, well, some SQL database,
but that's only for some auditable stuff.
But all these services that have to work together,
this is so much know-how, so much expertise flowing into it
that this is the really hard part then for you
to implement that in your organization.
And if you're doing your first steps, that's fine.
Take Jaeger, take Zipkin,
take all the others,
take a Grafana or Elastic and get your feet wet.
And if you're fine with
administering that stuff,
if you're fine with administering
an Elasticsearch cluster,
a Clickhouse cluster,
maybe on Kubernetes,
then okay.
But if you're not,
you can still have the OpenTelemetry data.
You don't have to touch that, but you can switch your backend.
And that's what I really love about OpenTelemetry,
is that you don't have to touch that stuff,
because it is all the same format.
It's a format that most of the commercial vendors out there support at the moment.
Some more, some less.
And this is where I would like to see OpenTelemetry go, really.
You get the foundation right.
And then, with trial and error,
you find the tool that you want.
Some like... I mean, take a look at Datadog. There's a lot of stuff that you could do with Datadog, but you have to have the passion for, you know, playing with it and running the queries and putting the dashboards together and all of that. If that's your passion, if that's what you like to do, go with it.
If not, go with Dynatrace,
go with Instana, because they do
a lot of things automatically that
you don't have to do.
And
yeah, that's basically the biggest learning.
And the other learning that I had,
and I currently still have some AWS instances
running with FluentBit,
and that's not OpenTelemetry,
but I'm trying to get syslog data
from two servers that I have here
in our co-working space
to report into a collector,
and I'm miserably failing.
I don't know why,
but it is some finicky work that's going on in OpenTelemetry.
It's a good start, but if you go with one of the vendors,
you can have a positive experience within half an hour.
I always liked, when I was at Instana, that we could set you up in a Kubernetes cluster in five minutes. The agent was discovering everything, and then boom, there you were.
This was just jaw-dropping like
six years ago.
This is not what you're going to have with OpenTelemetry.
Definitely not.
You're going to have a lot of work.
You pretty much understand what this stuff is all about.
But it's a lot of
manual work and
if you want to sell this to your
bosses as a free solution,
it may be
free of license costs,
but it's not free of work.
And it's going to be a lot of work
that you put into it.
So there's a lot of automation already going on.
But still, you have to get familiar
with that. And then,
of course, you have to store the data somewhere.
And this is where also some costs are coming towards you.
I think one of the important differentiations you're making there,
at least if I look at the English definitions of the words,
is the difference between data and information.
Data is your supermarket full of food.
And as you're talking about, you want to maybe go in with your shopping list and buy the food that you need, which is selecting which data you want to collect. But you still don't have the recipe to cook a meal, right? You have to turn that data into information, which is useful, and that's where those backends come in. And yeah, I agree, it'd be awesome if something like OpenTelemetry can start expanding into that section, but that's where that backend comes in, right? And I even forget, you know, not that I forgive, but, like, I understand that OpenTelemetry has not been around for very long, right? Most of us vendors, in terms of figuring out how to make it easy, what data is relevant to collect and all that,
We've all been doing this for years and years and years.
I think the strides that open telemetry has made
in such a short amount of time
is definitely something to marvel at.
But again, it's only been a few years
and once you have a community of people,
everyone putting their input in,
now you have to battle, not battle, but everyone has to agree upon what to do and it gets sticky.
So it will be really, really interesting to see what happens on that side over the coming years, to see it get refined and more complete.
And I believe they're even doing a lot of work on ease of deployments, right? There's some of these automatic,
a lot of the projects have these automatic instrumenters built in.
So yeah, but it really comes down to that,
that information versus data.
And what are you going to do with that data when you have it?
That's the key.
That's actually the,
I have two blog posts that I'm working on right now.
One is called Data vs. Information, or Data and Information.
It's exactly that point.
And the other one is about how easy it is to monitor and how hard it is to find the root cause.
I mean, if we all go back to three-tier architecture,
it would be a lot easier, right?
Yeah, but even then,
even in a microservice architecture,
it's easy to spot the problem.
It's easy to, because there are only two symptoms.
It's either too slow, or it doesn't work at all, throws an error at you.
And, well, bonus: it's slow and throws an error, which everybody is annoyed by.
But those are the things that you alert on, that you monitor, and that's kind of easy.
You have some baselines, you have a threshold, some experience from the past, and you apply it to a service.
This is all done automatically, and then something
turns red.
Now the hard part comes: taking the data that you need to get information about where the root cause of the thing is, and who is supposed to resolve that problem. And once you nail that down... at one point in time, we called it the blame game.
And we actually had a blame game, which was like a spinner.
And then you just flipped or spun the arrow.
And it turned out to be database operators, developers, you know.
And that was as effective as a blame game is.
Because you need data and then you need information out of that.
So monitoring, finding out that you have a problem is easy, but then comes the
need for the information to resolve the issue.
This is the hard part.
And this is where the backends are needed
and the visualizations, according to
your job. And
with the whole DevOps SRE
thing, it's
a different job description that you have there.
You need different data
and especially a different communication culture.
And that's what, during the meetup last week, Almudena said. I was able to follow the Spanish a little bit, because she was very fast. She said, DevOps es cultura, which means DevOps is the culture. It's not a question of the tools that you have, it's a question of how you're dealing with it, and then you pick the right tools. Coming back to the tool belt: if you don't know what tools you need, you throw a Batarang at, I don't know, a fridge to open it. It's, you know, a fool with a tool is still a fool. So it's your way of doing things, the way your organization does things, that is demanding the tools for that. That's why OpenTelemetry opens up a lot of different options, but you have to choose, and you have to choose what is best for your job and the strategy that you're following.
Rainer, I'm just looking at your website again because you are offering different types of boot camps that you have.
I remember you told me about this.
You made it very appealing to fly to Mallorca and then spend some time with you, and I also really like the different models that you have, where you basically also get coaching sessions when you do something like a vacation camp. So folks, if you are listening to this, really check out observability-heroes.com, check out the boot camps. I know a lot of organizations are doing these vacation camps where their teams kind of get together in a remote location.
but also the infrastructure that people need.
You also provide them with your coaching on observability.
My question to you is: if you do these workshops on observability and people come to you, are there any kind of big revelations where you say, man, why are we going down to these basics? Are there certain very basic things that people haven't thought about at all when they come to you around observability? Or do people already come to you because they already know they need observability? Or do some people come to you and are then completely blown away, because they never thought about these things?
It's all of the above.
So there are people
who want to know
about observability in general.
They know that they need something,
but they have no idea
where to start.
And for those, of course,
we go through the basics.
But even the ones that are already, let's say,
senior or advanced observability practitioners,
they sometimes need to widen their scope
because they work with a solution or with their setup for quite some time
and they want to expand it.
And every time I switched my companies,
it was mainly for the reason that there was a new technology coming.
So be it microservices suddenly being implemented, and it was a pain to deploy other solutions because there were so many installations to be done, so something automatic would be nice. It also boils down to what we are doing. For example, we need tracing. Tracing needs instrumentation. Instrumentation is added code in the applications, and it always sounds like, oh yeah, we just instrument it, but you forget that you add code. And if you add code to the wrong part of your code, then you can add massive overhead, and that can happen very quickly. During my Dynatrace times, we killed systems with one wrongly instrumented method. But that was before we had all these automatic ones, so we had to search for stuff; that's all way better now. But when you do custom instrumentation, you should know how it is applied, so you can make a good decision about what to instrument and what not, what to instrument in production and what you do in development, and maybe add a feature flag for your observability to have it turned on only in development and, occasionally, if need be, turn it on on the fly in production. But this is what also long-time practitioners sometimes don't realize: what overhead actually means. Because it's multifaceted. Using additional CPU and adding to the response time is one thing,
but having shared resource contention, deadlocks, and other things
is also something that you could easily do when you apply it the wrong way.
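As a rough sketch of that idea (custom instrumentation is code you add, so guard it), assuming the OpenTelemetry Java API and a hypothetical tracing flag: the manual span is only created when the flag is on, so a hot method does not pay instrumentation cost unless you deliberately turn it on.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class GuardedInstrumentation {
    // Hypothetical feature flag; in a real system this would come from your
    // configuration or feature-flag service rather than a system property.
    private static final boolean TRACING_ENABLED =
            Boolean.getBoolean("demo.tracing.enabled");

    private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("demo");

    static int priceItem(int quantity) {
        if (!TRACING_ENABLED) {
            return quantity * 7; // hot path: no spans, no extra allocations
        }
        Span span = TRACER.spanBuilder("priceItem").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            return quantity * 7;
        } finally {
            span.end(); // every span is work: CPU, memory, and export traffic
        }
    }

    public static void main(String[] args) {
        System.out.println(priceItem(3));
    }
}
```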
So these are the things that we talk about,
and I want to be sure that people understand that,
because then they can make the right decisions going further.
So if I talk about the basics,
I make sure that people have a need
or, well, should know about them
and just checking where they are,
which is something that I learned from my hotline days
when people came to me with problems.
I first established a baseline
of what their level is. Is this the grandma who wants to send an email to their grandchild over in Europe, which I actually had? Or is this someone very versed in technology who desperately needs to get online in a very remote country somewhere, and his or her modem didn't work?
Different levels.
I ask different questions.
And that's what I check first.
Where are we?
And then we move on from there.
And as you said, the vacation camp, that's one week in our co-working space with very good connections, so you can do your regular work and you can focus on observability, if you can't do this in your office at home because people come in; you know, a regular day's work is a lot of people coming in and asking how you are doing. Here you can focus on your project and you can ask me. That's part of the offering: there are some coaching hours in there, or mentoring
hours. You can ask me about specific
problems. We develop a strategy together
or I'll
give them something
to do. By the way, I'm
also financing the
AWS
instances needed for that.
So if you need a big server to test something out,
not a problem, that's included.
And then we check where everybody is
so that at the end of the week,
they know either the basic concepts of observability
or they have a good idea about the strategy
that they're going to implement
when they're back in their office.
Hey Brian,
what do you do next week? Should we sign up for
a location camp? Oh yeah,
absolutely, please.
Are you based on the East Coast,
Brian? No, I'm in Denver.
I used to be in New Jersey, but I'm now in Denver, Colorado, smack in the middle. Well, not smack in the middle, but you know, no ocean near me. I used to have an ocean near me.
I've been there, back in the Wily times, because my boss was residing there; he actually came from Boulder, but he showed us around Denver. Nice town, but very far from the sea. There is a direct flight from New York to Palma de Mallorca, but not from Denver, unfortunately.
Yeah, I think the closest I got to the islands was Tarifa and Nerja, but never the islands themselves.
Yeah, folks, listeners, if you think this is a commercial for Mallorca, well, it sounds like it, and it's a really beautiful place. We will add some links to the description.
Rainer, what you said earlier was really good
because, you know, we have great tools.
What's the saying?
With great power comes great responsibility.
With the power of instrumenting code comes great responsibility,
and we as an industry, we have, like in your case, 20 plus years of experience
of what it takes to bring down an application with bad instrumentation.
In the Appmon days, we called it the shotgun instrumentation.
Or even if you are instrumenting a method that is called a million times,
then of course, the relative overhead becomes really big.
And then you can really bring down systems.
And I also feel that it needs more education.
Fortunately, the tools get better and I'm pretty sure there will be more built-in features
in OpenTelemetry, more tools for developers that actually make them aware of the potential
impact.
But folks, if you're listening, think about it: this is code, right? It only collects a little bit, but it gets executed every time this method gets executed, and this data is then sent off to your OpenTelemetry endpoint or your OpenTelemetry collector. There is additional overhead in that dimension too: the overhead is not just in the running thread, it's on the network, or first in memory, because the data needs to be buffered and then sent over to the next component. This might be a collector or directly the backend, but it means network overhead. So yeah, think about it, there are a lot of things here, and there are reasons why organizations like you mentioned, right, the New Relics, the Datadogs, the Instanas, the Dynatraces of the world, have been doing this for many years and have all these lessons learned. This is where I think we're doing a good job, and you should still look at what we've done. Maybe we should, you know, uncover and, like, repost some of our old blog posts from back then, because I'm pretty sure there are many blog posts we wrote about, you know, proper instrumentation best practices and worst practices that people should look into again.
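A small sketch of the buffering just mentioned, assuming the OpenTelemetry Java SDK: a BatchSpanProcessor holds spans in an in-memory queue and ships them to the exporter in batches, so memory (queue size) and network (batch size, schedule) are knobs you are implicitly turning. The numbers and endpoint below are illustrative placeholders, not recommendations.

```java
import java.time.Duration;

import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class BufferingSketch {
    public static void main(String[] args) {
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://localhost:4317") // placeholder endpoint
                .build();

        // Spans are buffered in memory first, then sent over the network in batches.
        BatchSpanProcessor processor = BatchSpanProcessor.builder(exporter)
                .setMaxQueueSize(2048)                   // memory overhead lives here
                .setMaxExportBatchSize(512)              // network payload per export call
                .setScheduleDelay(Duration.ofSeconds(5)) // how often data leaves the JVM
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(processor)
                .build();

        tracerProvider.close(); // flushes whatever is still buffered
    }
}
```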
Yeah, absolutely.
And instrumentation is one thing
that can kill an application.
If you kill it once,
then your management is going to be very wary
of letting you do this again.
But as you also said,
with automation,
there is lots of things
that you can just get out of the box and
you don't mess with, which is usually safe. But going back to blog posts, that's exactly what I did after our meeting, and there are some blog posts from me from 2010 on the codecentric blog, with the company that I worked for at that time, about having no time for monitoring. So if you don't invest time in monitoring, then you probably spend more time somewhere else. But automation also helps a lot in saving that time, getting it up and running in no time,
and then another one was actually not from me,
but from Fabian, who gave a very good overview of sampling and tracing, or rather profiling and tracing, and the different overhead that is incurred there.
And with some practical examples,
but it's 14 years old, so
it's probably on an old Java version that
probably has that already optimized out.
But I read today an article about memory consumption in OpenTelemetry in Java, where they have
a new module that allows you to reuse memory.
And it's not an immutable object anymore, but a mutable object, and that reduces the memory impact of the agent, or of the instrumentation, to almost zero, which is massive when you have lots of, like, Kafka or Apache Pulsar, with lots of topics. And each topic created one kind of metric, and that created a ton of objects in memory, which then kind of blew up the JVMs when you monitored them. And now, with this new approach, which is officially accepted, I think, by OpenTelemetry in the Java SDK, it's basically zero and this is gone.
But this is something that you wouldn't realize until you have that problem,
until you see what is happening here.
Why is my JVM blowing up?
I didn't do that before.
Well, it's just a multiplication of things with the topics
and the metrics that you get out of it.
This is just one of the things that make open telemetry
kind of that complex.
On the other hand,
if you don't deal with that,
if you don't have
queueing mechanisms,
messaging systems
with lots of topics,
then you don't need to bother.
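A hedged illustration of how that multiplication happens, using the OpenTelemetry Java metrics API with a hypothetical topic label: one counter instrument, but every distinct topic value becomes its own series, so thousands of topics mean thousands of attribute sets living in memory.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class TopicCardinalitySketch {
    // Hypothetical attribute name; real code would follow the semantic conventions.
    private static final AttributeKey<String> TOPIC = AttributeKey.stringKey("messaging.topic");

    public static void main(String[] args) {
        Meter meter = GlobalOpenTelemetry.getMeter("demo");
        LongCounter messages = meter.counterBuilder("messages.consumed").build();

        // One instrument, but each distinct topic value is a separate series:
        // with thousands of topics this is thousands of attribute sets kept in memory.
        for (int i = 0; i < 5_000; i++) {
            Attributes perTopic = Attributes.of(TOPIC, "orders-" + i);
            messages.add(1, perTopic);
        }
    }
}
```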
This is what we can talk about.
My approach was always
we start with your problem at hand, we solve that, and we take the next step. We're not boiling the ocean up front if we don't have to, and we get the results quickly in that week, make you realize what you actually need, get you successful with that, and then take the next step. And even then, with that approach, you can get the most complex situation handled and, yeah, easy to deal with.
Hey Andy, I think I have a title for this episode: Observing of Observing Observable... I can't even say it. Observing Observability.
Right. One quick anecdote I wanted to mention about the instrumentation.
This happened, I think, 2012.
I'm not going to mention the vendor or the customer.
But sometimes when observability takes down your system, even the most best designed ones,
it could be revealing a core problem.
So we went to a potential customer.
They were complaining that login was taking like six to eight seconds.
It was a really long login.
It was a commerce kind of site.
So we put Dynatrace on the system, and login went up to 15 seconds.
And we're like, what the heck is going on?
And this was with auto-instrumentation already done.
This was beyond that.
Well, it turned out that they were running upwards of 14,000 database queries upon login.
So like any observability tool, you're going to be instrumenting the execute calls or whatever in the database driver to just track that bit, which is a standard, normal thing.
I just remember it was really funny because they had a guy from the software vendor there. He was
like, oh, I don't know. But we're like, yeah, this is the problem. You're running 14,000 queries.
Of course we killed it. No matter what we do, that's just like, talk about instrumenting the
wrong thing. That was the right thing, but because of their setup,
it took it down. So we then tried to play that game of, well, not the game, but like, look, we didn't take your system down. Well, we did take your system down, but it's revealing what
your problem is by doing that. But yeah, it can be really, really tricky in those situations.
And this just reminded me of another anecdote, just a quick one, where a vendor actually blocked our agent from working.
So the customer was actually a Silk Performer customer.
So guess who was in there for monitoring?
And they told us it doesn't work.
Let's put an agent in.
Let's put at that time Dynatrace in.
We did.
And we found something.
And they went back to the vendor saying,
yeah, this doesn't look very good.
Can you solve that?
And the vendor came back,
no reverse engineering in our code.
It's forbidden.
And they were like, no, no, no, no.
You don't.
What?
The next time we tried, they gave us a patch
and then we tried to put the agent in it.
We got no results at all.
And we found out that they actually stripped the -agentpath variable from the JVM options. So their code was rewritten to prevent any monitoring of the Java process.
It was like, how desperate as a vendor must you be to do that?
That's amazing.
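A small, hedged sketch of how you might verify something like that from inside the JVM, using standard JMX APIs: list the JVM's actual input arguments and check whether the -javaagent or -agentpath flags you configured actually survived.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class AgentFlagCheck {
    public static void main(String[] args) {
        // The JVM's actual startup arguments, after any wrapper scripts had their say.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        boolean agentPresent = jvmArgs.stream()
                .anyMatch(a -> a.startsWith("-javaagent:") || a.startsWith("-agentpath:"));

        System.out.println("JVM args: " + jvmArgs);
        System.out.println(agentPresent
                ? "Monitoring agent flag is present."
                : "No -javaagent/-agentpath flag reached the JVM (it may have been stripped).");
    }
}
```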
And it also reminds me, and this comes back to the years and years of experience and how commercial vendors have optimized their systems: the database example, these problems that we then saw by capturing every single execution of these 14,000 database queries, led to what we called database aggregation back then, where instead of capturing 14,000 instances of the same query, we capture it maybe just once.
And these are things that I hope
will also make it into every
other instrumentation for open
telemetry or make it to the
blogs about best practices
because these are things we need to avoid; we've run into these problems before
and we've solved them before.
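As a sketch of the aggregation idea, not any vendor's actual implementation: instead of recording 14,000 individual query executions, keep one counter and duration sum per normalized statement and report the aggregates once. The normalization here is deliberately naive.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class QueryAggregationSketch {
    // One entry per normalized statement instead of one record per execution.
    private static final Map<String, LongAdder> CALL_COUNTS = new ConcurrentHashMap<>();
    private static final Map<String, LongAdder> TOTAL_MICROS = new ConcurrentHashMap<>();

    static void recordQuery(String sql, long micros) {
        String key = normalize(sql);
        CALL_COUNTS.computeIfAbsent(key, k -> new LongAdder()).increment();
        TOTAL_MICROS.computeIfAbsent(key, k -> new LongAdder()).add(micros);
    }

    // Deliberately naive normalization: strip literal values so repeated
    // executions of the "same" query collapse into one key.
    static String normalize(String sql) {
        return sql.replaceAll("'[^']*'", "?").replaceAll("\\d+", "?");
    }

    public static void main(String[] args) {
        for (int userId = 0; userId < 14_000; userId++) {
            recordQuery("SELECT * FROM prefs WHERE user_id = " + userId, 120);
        }
        // One aggregate line instead of 14,000 captured executions.
        CALL_COUNTS.forEach((sql, count) ->
                System.out.println(sql + " -> " + count.sum() + " calls, "
                        + TOTAL_MICROS.get(sql).sum() + " us total"));
    }
}
```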
Yep, we did. And this is something
that needs to go into open telemetry
and the data gathering, absolutely.
Rainer,
I fear we're closing
at the top of the hour here from a recording
perspective.
I hope we did. I think
we did a good job in promoting
Mallorca and your
co-working space, your
boot camps.
I also offer that online, if you're not getting the vacation approved by anyone.
No, the thing is
my goal is to make observability available
for everyone to take it to the next level.
So we cover the ground stuff
and then we care about what a former colleague of mine called the really big, hairy, audacious problems. So
get rid
of the easy to solve issues and then
focus on the ones that are really tough to solve
because it's much more fun.
But having observability in all of the
places. So I
have my experience. I can share them.
Of course, yeah, I can
be hired. I can be bought, so to say.
But it's also good to talk with you guys about how,
because I was always seeing it through the same pair of glasses,
as the Germans say, always from a vendor perspective.
And now I'm getting this from the technology perspective, and this opens up a whole different perspective
on the whole thing.
And unfortunately, after 20 years, I reach the same conclusion over and over again: you need data, you need stability, and you need the right data,
and you need to turn it into information
that solves your problems,
whatever those may be in your role.
It's still not in the mainstream.
It's slowly getting there, but
it's still not really there.
This is my mission: to get funding for observability from the ground
up. Not as an overlay,
not as a, oh god, this doesn't
work, we have to throw money at it to make it go
away, or make the problem go away.
So, yeah. And, yeah, thanks
for inviting me here.
And, as
stated, there is a
CNCF meetup in Palma next
week. And also, there
is the Web Engineering Unconference
in September, also happening
in Palma. This is always a nice
event. It evolved from the
PHP
unconference
to now
everything
around web
engineering.
It's a
pretty cool
event taking
place in one
of the hotels
here.
So the
whole island
is really
good for IT
people.
We have a
great connection.
Enough of the
commercial here.
Thanks again.
That's good.
And talking about
events, maybe one
last thing because
this is a global
thing.
In the first week
of June, I think
it's actually June
6th.
I think this one,
oh, you're talking
about an event.
I was going to
say I think this
one airs next
week.
No, what I'm
saying is that
we have,
Kubernetes is turning 10 years.
So we have Kubernetes birthday party globally.
The main party will happen on the West Coast.
But every CNCF, every cloud native local community
is encouraged to run their own birthday bashes.
I know in Barcelona, our friends from Lidl, from Schwarz.it,
they are doing the birthday bash in Barcelona
we are hosting a party in Vienna in the Dynatrace office and I'm sure many places around the world
so Kubernetes is 10 years old it's no longer a baby it's already a teenager it's amazing
Oh my God, it's going to be a rampaging teenager. Yeah, that's the question.
puberty is coming.
Oh my God.
Acne.
I think at the parties,
you're going to have to have people walking around
trading hors d'oeuvres and seeing if you can get
the hors d'oeuvre to the right end destination point
by passing it from person to person.
You mean you have to order your hors d'oeuvre
with a YAML file?
Yeah.
Okay, great.
Well, I hope people, if you ever wanted to get to Mallorca,
I hope they get there really soon,
because I think as a result of this episode,
Mallorca is probably going to be the hot spring break type of destination
for observability people.
So before it gets, you know, overrun by technologists, if you want to see it nice, get there soon. Of course, your house is always open to anybody who shows up. If you just, you know, give us your address on the episode, people can just freely walk in. I'm kidding.
When you take a look at the picture at the beginning, you see our
roll-up in the background and that
is our co-working space.
And if people Google that, they'll
find my home address.
We have locks and, you know...
And big dogs.
Actually, just a cat, but
yeah. He's fierce.
Oh, he's fierce.
cats can be scarier
than dogs
because you never know
what they're going to do
or what they're thinking
Alright, Andy, any last words, wrap-up?
Alright, all good.
I'm really happy that
after so many years
we met each other again
and then
that this worked out
that well
so
all the best with
your, with your ops, with making everybody an observability hero and making observability available to everyone. I think that's a really nice statement. And if you can say observability three times in a row without twisting your tongue, you get a beer from me.
Observability, observability, observability. Next one is on me.
He's been practicing.
Alright, well thank you for being on the show today.
We really, really, really, really appreciate it.
This was an awesome show.
I just want to say thanks to all of our listeners again.
And as we were discussing before we started,
I think we're like somewhere over eight years on the show.
So thanks for everyone who's made that all possible.
And thanks to wonderful guests like you.
I'm going to try it.
Rainer?
Rainer?
Rainer is perfect.
Rainer, there you go.
All right.
I've been having trouble with that name, people who are listening.
It doesn't seem like you should, but anyway, it's just me.
I'm a stupid American. Thank you all for listening. Thanks, everyone, for being part of this, and we will see you next episode.
Bye-bye.