PurePerformance - Performance Engineering for Hybrid Cloud re-platforming with Klaus Kierer

Episode Date: April 4, 2022

When moving to the cloud - have you thought of the performance difference between App Gateway and Application Load Balancers? The disk speed and disk cache limitations impacting Cassandra and/or Elasticsearch performance? Challenges with pre-built containers or resource limits on pods impacting Java garbage collection behavior?

These are all performance considerations Klaus Kierer, Senior Software Engineer in the Cluster Performance Engineering Team at Dynatrace, has learned over the past months as he helped performance-optimize the Dynatrace Platform as it was expanded from running on AWS Compute to run on Kubernetes hosted in Azure (AKS) or Google Cloud (GKE).

Listen in and learn why performance engineering is more important than ever as you are moving your workloads to the "hyper-hybrid-cloud".

Show Links:

Klaus on LinkedIn: https://www.linkedin.com/in/klaus-kierer-67b83a81/

Blog - When to use Azure Load Balancer or Application Gateway: https://blog.siliconvalve.com/2017/04/04/when-to-use-azure-load-balancer-or-application-gateway/

K8ssandra performance benchmarks on cloud managed Kubernetes: https://k8ssandra.io/blog/articles/k8ssandra-performance-benchmarks-on-cloud-managed-kubernetes/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me here my co-host who's mocking me today, Andy Grabner. And hello Andy, Mr. Funny Guy. I know, I wanted to make you laugh but instead I'm laughing now. Yeah, it's great to be here. Yeah, I'd say that.
Starting point is 00:00:46 It's always great to be here. I know. Because it's always such a fun learning experience. Not to segue right away, but I think today is a brand new learning experience for us. So hopefully it'll be a brand new learning experience for our listeners. It is. And as I think we mentioned a couple of times, one of the reasons why we keep doing this podcast is because we learn much more than i guess most of what definitely than our guests because they already know all this stuff yeah but uh today's topic we've we've at least i've been seeing a lot of
Starting point is 00:01:14 blog posts out there recently from stephen townsend uh from uh new zealand and i've been with him on a newtispec conference once and i think he also wrote a blog series about performance engineering in the cloud, kind of does the cloud make performance engineering obsolete? Yes or no. And I think today we definitely have a MythBuster session. Who's Jamie and who's Adam? We'll figure this out. But without further ado i wanna i wanna make sure to introduce our guest he is one of our own dino tracers a performance engineer and klaus killer his name and i actually want to klaus i want to pass it over to you if you could quickly say hi and introduce yourself
Starting point is 00:02:02 to the audience hi brian and andy thanks for the invitation to this great podcast. I'm sometimes listening to it and it is really great to take part in it and hopefully I can shed some light on some of the issues we had in load testing of our own Dynatrace clusters. I'm with Dynatrace since 2019. And since the beginning, I'm a member of the cluster performance engineering team where we cover 24-7 load testing, regression load testing of our own clusters with Dynatrace.
Starting point is 00:02:41 And that's, first of all, thank you for actually helping us make this podcast great, because guests like you allow us. Also great to hear kind of what you just said, 24-7 performance engineering. I was, I'm always the fortunate guy, I think, that can talk about these stories, because I talked about also what you and your team, Thomas Steinmacher,
Starting point is 00:03:06 I think that you're reporting to have done over the years. It's really fascinating. And I want to quickly give a quick background on how we came about this episode. Because I was invited to speak at an Azure conference and then the idea came up. Hey, we as Dynatrace just recently announced we moved our SaaS offering to also Azure and GCP so we are providing our SaaS service not only from AWS where we started but in all the other hyperscalers and then the thought came up so was this just an easy smooth transition how does this work what had to be done any performance challenges and then the two of us got to talk because Thomas, you know, he said, hey, talk with Klaus.
Starting point is 00:03:47 And then we sat down over lunch. Fortunately, this is possible again now in the office. And then you brought up some really fascinating points on what things you have learned as you were moving Dynatrace, our offering into Azure. And this is where I took a lot of notes. And then I said, let's come back to the podcast and let's discuss it one by one. Really fascinating. And with this, Klaus, before we get started, I think you said another thing. And before we started to hit the recording button, for you, it was a lot of new first things, right?
Starting point is 00:04:24 The cloud was also new for you. Kubernetes was new. Exactly. I had some experience with Docker, running it on some of my own servers privately, but no need for me to learn Kubernetes actually, because there is some overhead involved and now i last summer we had a chance to really step into that in the cluster performance engineering team and i took that over and uh learned a lot of new things it was really uh great things i learned and uh it was sometimes a hard time as well yeah because i i got a new dynatrace cluster running in kubernetes and now do some load test with it yeah and it's interesting the idea of load testing right so we've had conversations andy and i in the past with mark tomlinson about load testing in the cloud, right?
Starting point is 00:05:27 And that's a whole different topic. We won't go into that, but just to retouch upon that, you are testing your application that's running in, let's say, Kubernetes. But you also have to be aware of making sure you're testing the capabilities and functionality of, let's say, Kubernetes to support that load. Like if a node goes down, is scaling working? Are they scaling at the right time and all? So that on its own is even just a lot to go into. But I think we're going to be taking a little bit of a look at a little bit even different side of that than today,
Starting point is 00:05:59 because as we discovered or as Andy discovered talking to you, there's even another dimension that needs to be considered. But I'll shut up and let Andy go. No, that's fine. And I want to highlight why this is so relevant for all of our listeners, because like Klaus, what you just said, we had Diamond Trace for years now, right?
Starting point is 00:06:16 And the way we run it, but we basically moved it not only to another cloud vendor, but we also moved to Kubernetes. That means we did what a lot of organizations are trying to figure out. How can Kubernetes become the new orchestration platform for running container-based workloads? Which means, because it helps us to easily, at least in theory, move things around from one vendor to another, from one hyperscaler to another, because the common denominator is Kubernetes.
Starting point is 00:06:47 But as you've learned, there's definitely certain things to be aware of. And Klaus, if you're okay, I would jump into the first thing because I thought this was fascinating. You talked, you told me about one of the things that you played around with, and this is kind of the gateway into everything that runs in Kubernetes is kind of the ingress controllers. And there's different ways in Azure and also I think in AKS and GKE on how you can expose your services to the outside world. What did you learn as you moved to Azure in terms of ingress controllers and the load balancers? From a load testing perspective, we use our own load test generator,
Starting point is 00:07:34 the Cluster Workload Simulator, where we simulate typical load, which also would come from real-life systems. For example, pre-pass with database calls and servlet calls, stuff like that, or log ingest, all things which would possibly come into a cluster and has to be handled accordingly.
Starting point is 00:08:01 And for example, we simulate on our root test cluster now about 20 or 25,000 agents, which have about 5,000 hosts, which are simulated at the moment. Wow. So the cluster workload simulator has some kind of workload pattern which can be configured. And we originally set this up. We took a cluster we have running in our SaaS offerings and scaled the cluster accordingly to have something to compare with. And then we enabled our cluster workload simulator and ingested some load.
Starting point is 00:08:50 Initially, it looked really fine. And as soon as we increased the load, we saw that on the ingress nodes, also our active gates were co-located. We had quite a huge amount of cpu usage which was quite atypical now i investigated a little bit into that had some talks with colleagues from glansk and yeah took quite some time to really get into it and what was a root cause for our issues we had because in Asia you have different scenarios how you can interest the
Starting point is 00:09:38 data for example the application gateway or the application load balancer. One is on layer four and the other one on layer seven. And initially we used the application gateway. We didn't have a clue that this could cause any troubles because we saw the high CPU usage mainly on the ingress nodes where Nginx is used. So it took quite some time to investigate and one of the first steps was that we separated our active gates from the actual ingress to have an own layer in front of the active gates to better investigate what is causing the high CPU load. Didn't improve the situation very much,
Starting point is 00:10:33 and we really tried to kill the problem with throwing hardware at it. Didn't work out. Finally, a colleague from Glansk said, let's try to remove the application load balancer and simply use the application gateway and try to use the application load balancer. And finally, the high CPU usage was gone. We were on levels which were fine, still a bit high, but actually something we could work with. And the reason, and this is fascinating,
Starting point is 00:11:19 and by the way, for the listeners, Klaus, you also sent me a couple of blog posts. For instance, one on when to use Azure Load Balancer or the Application Gateway. You mentioned earlier these different, let's say, ingress controllers work on different levels of the OC stack. One is on layer 7. The other one is on level 4. Obviously, the higher you go up, the more work needs to be done by the load balancer
Starting point is 00:11:49 because they're analyzing the HTTP payload, making, I guess, even SSL termination and handshaking. And I mean, there's so many things that happen there. And in high load environments, I mean, you mentioned you're simulating workloads of 25,000 one agents where you're sending data in from rather large environments. And these are workloads that we see all the time and they obviously need to perform. And we want to make sure we have enough resources
Starting point is 00:12:15 in the real core of the cluster and not just eating up all of the CPU already in the front where the traffic comes in. And then lessons learned. But I think in hindsight if you look if you look at this now if i look at it now after you explain it to me i would say well of course this makes sense but still for somebody that is moving to the cloud and you just turn on the default features and you say hey this just looks all great and everything works
Starting point is 00:12:41 magically anyway and everything is scalable by default it's a great lesson learned just like how you can optimize them performance and what impact it has yeah yeah and i had a question regarding this so do i understand correctly that this was discovered in the azure kubernetes setup so this this didn't exist and this didn't exist in AWS, right? Like you didn't have this problem with that bottleneck. In AWS, we don't use Kubernetes. Oh, that's right. Okay, so that's, okay. That's very, yeah, yeah, yeah, I know. I'm just trying to think, you know, Andy, you talk about lessons learned.
Starting point is 00:13:18 So obviously this wasn't a, it worked fine here. We moved it to the other one. It changed. This was, let's say for lack of a better term, greenfield in a new cloud provider. So just thinking the complications that come into this project, you're taking a new product or new software architecture and a new cloud both at the same time.
Starting point is 00:13:42 So now when you're trying to troubleshoot, you're looking at, is it the way we've set it up in Kubernetes? Is it Kubernetes or is it the cloud? Which is, I don't know, it just sounds like a lot to tackle. I wonder if in hindsight, would it have made sense to first set that up in AWS
Starting point is 00:13:59 where you had your at least known conditions? Or do you think that wouldn't have mattered in this situation? Thinking if somebody is going to do something similar, would it be a worthwhile step to try it in the environment they're used to first? Or did that environment change, you think, have a little, very little impact on what happened? That might indeed make sense to try it out in an environment which you know. But on the other hand,
Starting point is 00:14:29 you might then build a system which runs fine on AWS. And when migrating to TCP or Azure, you again run into troubles because you use features of the cloud provider which are implemented differently at every cloud provider right so it sounds like it'd be better to test it in the environment you want to launch it in I guess then so I think that the goal was to get it running in Azure first so it just made
Starting point is 00:14:59 sense to start there anyway okay that makes. But then also what I want to reiterate here, our AWS environment where we didn't use Kubernetes, but we used obviously our current architecture that we've used for several years. This was your baseline though, right? This was the baseline. We know exactly under certain load conditions, this is what you take. And then what we did as Dynatrace, we moved to the different cloud vendors and with this we replatformed because we knew Kubernetes is the future.
Starting point is 00:15:31 So we had to figure out how to run on Kubernetes anyway. But then it was a great assumption to make because Kubernetes, we should get Kubernetes running as efficiently as if we would run our application on the old infrastructure definition. Because in the end, you want to move over and get the benefit of Kubernetes
Starting point is 00:15:52 without getting any overhead that we can optimize or that should be optimized. Because I'm pretty sure, Klaus, while you've been with the company since 2019, even when we started with AWS, I'm sure over the last 7, 10 years, we've made many performance improvements on AWS compared to where we started. And when we now look at what we do in Kubernetes, we are still in the early stages and we need to figure all this out
Starting point is 00:16:18 and how to really optimize this environment. That's indeed true. There are quite a lot of optimizations on AWS. For example, there we use on the ActiveGate nodes, we have not NGINX in place, but AJA
Starting point is 00:16:35 proxy. And for example, there we directly use the local ActiveGate to direct the traffic, to have some traffic equality. And we don't have it like in Kubernetes where you have an ingress pod
Starting point is 00:16:57 which talks to any active gate in the cluster. And in Kubernetes, we don't have that traffic locality, actually. We worked on that, but it's not really to have node locality, but at least to be in the same availability zone. Yeah.
Starting point is 00:17:24 For me, it's fascinating because as you said, Klaus, you started with Kubernetes just a little while ago. I started with Kubernetes, I think, three years ago when we started our journey with Captain. And it's still a mystery to me,
Starting point is 00:17:40 certain aspects at least, especially the whole networking, if you're not a network expert. And it seems a lot of the things you're explaining right now is a lot of network specialities that just come with moving to something like Kubernetes. Yes, of course. There are quite some network issues. For example, with the initial traffic we had, one issue was that in Dynatrace itself you can monitor where the traffic is going, which process gets traffic and which process it descends to. You can monitor that within Dynatrace. But on the Kubernetes cluster, all of that seemed somewhat strange to me.
Starting point is 00:18:27 It looked like the traffic is wandering around from one availability zone to the next, to one part. And really, it was amazing how the traffic on the cluster went. And there were also quite some improvements to influence how the traffic is processed within Kubernetes. For example, the low balancer knows all nodes within the cluster. And basically, you have traffic coming into the cluster
Starting point is 00:19:04 and it can potentially go to any node. And then it is by Calicrew, I think Calicrew is responsible for that, which directs it to the target part where it should actually go. Now with the health matrix We worked a little bit around that to tell the load balancer that actually only the ingress nodes are up for it and want the traffic. So we had it again in our hand how the traffic flows through the cluster.
Starting point is 00:19:44 And this is, again, for many of us, including me, the network is always a little magic kingdom. But as you said, this is exactly what coming back to my initial opening statement, performance engineering is not going away with the cloud.
Starting point is 00:20:02 Performance engineering becomes tougher with the cloud and especially more important because if you just go with the defaults, then you are in the end paying a whole lot of money because you just throw virtual or cloud hardware on the problem. I'm pretty sure you can scale some way. The question is, is then the software that you're providing and running, is it still sustainable? And can you're providing and running, is it still
Starting point is 00:20:25 sustainable and can you actually make money out of it? Yeah. And additionally, depending on the cloud provider, cross availability zone traffic might cost much more than local traffic. And if it is ended from the interest node of one availability zone to an active gate in another availability tone, and then again to a server in another one, you have a lot of cross availability tone traffic, which on the one hand costs money, and on the other hand also costs performance. The other complication I'd add is just the level of transparency that the cloud provider provides to what they're doing underneath. So you go ahead and run these things in a cloud-managed Kubernetes environment.
Starting point is 00:21:13 What do you have access to finding out how they're routing this stuff, what protocols they're using? If we abstract this out to serverless, you don't even necessarily know what's running behind there. You may know, but the level of transparency comes into question so that when you do run into these issues where we think it might be something with this network, can you get down to that level to understand which layer of network is being used so that you know whether or not you would have to pull that out or maybe re-architect because it's something you're doing so that's always just another consideration is can you find out from the cloud provider from what they're providing you how they're doing what they're doing so that you can get to the root of what's causing that bottleneck and come up with
Starting point is 00:21:59 a good solution in the perfect world you wouldn't even use the Azure portal or GCP portal for that. You would see everything within Dynatrace. At the moment, we are not on that level and have to improve some things, especially in the topic of networking. For example, the load balancer is not really visible within Anadres. You might see it in the traffic UI as an unmonitored node. But you really don't know, for example, where traffic gets lost or is blocked. That's stuff we still might improve in the future.
Starting point is 00:22:50 And I think this is also kind of repeating for people that are listening in and maybe don't understand the full context. We are moving and we're running our Dynatrace in Kubernetes on hyperscalers like Azure, but we are monitoring everything also with Dynatrace in Kubernetes on hyperscalers like Azure, but we are monitoring everything also with Dynatrace. And there's teams like Klaus's team, the cluster performance engineering team, or I'm just pointing, I know nobody can see me,
Starting point is 00:23:18 but I'm pointing to the other room. My wife, Gaby, she sits over there. They're using Dynatrace to monitor these environments to make sure systems run stable in production. We are using our own product and how we want our customers to use it. And we're using it at a very tremendous scale across all the different hyperscalers, across all the different permutations
Starting point is 00:23:38 that Kubernetes gives us. And I think this is really great. And then also improving our product as we see where as you said earlier where we may have blind spots right now where it's not perfect that's really cool hey uh klaus i wanna switch topics because i remember in our discussion when we were sitting down for lunch you brought up a couple of uh points on only network performance or network issues but also disk a lot of things you learned about you know disk performance uh caching and things like that can
Starting point is 00:24:13 you just enlighten us a little bit on some of their lessons learned that everybody should know within the different cloud providers, there are quite some differences on the disks you can use and what they cost. For example, we have a lot of experience with AWS and with millisecond response times for disk access, write, read,
Starting point is 00:24:44 really perform well. And years ago, we also set up some managed clusters within Asia where we already had some issues with disks. For example, the premium SSD disks have quite a different IOPS profile than we were experiencing on Asia. So latencies were different and what I found out is that Asia is optimized for high parallelism. And our cluster software has some issues writing, for example, the session storage or for elastic search,
Starting point is 00:25:42 we had also issues and we had to find solutions for that. Because on Asia, for example, the read disk caching is only supported to about 4 terabyte to be exact, 4095 gigabyte, one gigabyte more and you cannot enable read caching. And this has a tremendous effect on Elastic, for example, when you want to read logs. Cassandra is also highly involved when reading data. We haven't seen that issues that much with writing. The idea is only for the session storage, for example, where we had to increase the parallelism we use for writing,
Starting point is 00:26:37 where we had no problems on AWS. We had to modify our software to run with good performance on Azure. And we have seen this with Kubernetes as well. And I've found some blog posts about that, where comparisons have been done between different cloud providers, for example, for Cassandra. And the result of that was exactly what we have seen. And for Azure, you would need UltraDiscs to be on the same level as, for example, with GCP or AWS.
Starting point is 00:27:23 By the way, for those listeners that are interested in this, the link to the blog post, it's on Cassandra.io, a blog post called Cassandra Performance Benchmarks on Cloud-Managed Kubernetes. We'll put it into the summary of the blog post so that people can easily find it. And this is just fascinating, right? Because in the end, I always thought if you are,
Starting point is 00:27:46 you know, like you're buying a car with a certain horsepower here or there, it's kind of the same. It's like you're buying a certain compute power on this vendor and the other. And if they look the same, they should feel the same, but it's not the same. And I think these lessons learned are really fascinating. I wasn't unmuted on my real recording. It's also another factor to take in when, as a lot of companies are moving to multi-cloud, when you want to just lift and shift it over from Azure to GCP or Azure. Let me start that all over. Wow, Andy, it's me today. This is another factor to consider when moving to multi-cloud, right? Where if you want to lift and shift
Starting point is 00:28:37 from one cloud to another, you would expect that you could just order up same thing, pop it in and get the same performance. But if you skip those testing steps, if you're not looking at the hardware it's running at, you're just looking maybe at the software performance and not looking at all those other layers, but also not being conscious of it. In these situations, you might have to switch up what you're using because you're not going to get the same performance. And this, to me, is the most fascinating thing about today's episode, is the difference in performance between the different cloud providers.
Starting point is 00:29:14 Not necessarily that one has better performance than the other, but that you might have to use different aspects of it to get the same performance. You can't just toss it over and expect it to be the same. Your example, right? It's the same performance. You can't just toss it over and expect it to be the same. Your example, right? It's the same car. The example I like is if you go back to the earlier days of Internet Explorer, Firefox, and Chrome, as a web developer, you couldn't necessarily
Starting point is 00:29:37 just put your page in the other browser and expect it to work. Because the engine running it is different. And it's the whole new level of complexity. And I can understand why when you were talking with Kelsey Hightower, I can understand why this sort of thing is giving the whole serverless movement more momentum because you go from hosting your own infrastructure and dealing with all this stuff to revisiting it all again in the cloud where you thought you might have had
Starting point is 00:30:06 an abstraction from it where you don't have to care about it as much. But as you're revealing clouds, you do have to continue to be very in tune to these components still. Yeah. And the question though is, and I like your bringing in serverless again,
Starting point is 00:30:27 the challenge that I had over with serverless, at least here with Kubernetes, Klaus has the option to tweak certain knobs, right? And optimize with serverless. Well, yeah, you're all in the hands. You're powerless in a lot of ways, which is even... Yeah, a lot of ways. Yeah. So Klaus, we talked about network, like the load balancers.
Starting point is 00:30:52 We talked about the disks. I have a couple of additional points here. When we talked about, I think, again, sitting down for lunch and he said, hey, kernel tuning on Kubernetes. There was also really, really important or certain tuning sessions or sections that you have to do. Can you enlighten us here as well, especially for people, I think, that are running or trying to do what we do with Elastic and Cassandra,
Starting point is 00:31:18 running those on Kubernetes? Yes, there are always some issues with kernel tunings. For example, Cassandra or Datastacks itself, they recommend to set some parameters to get the performance they want. In most cases, you need kernel tunings to enable memory mapped files, to really use the memory for caching. And if you don't
Starting point is 00:31:49 raise the default values for Cassandra, you will sooner or later run into memory problems and out-of-memory errors. And because of that, restarting pods and really,
Starting point is 00:32:08 you could potentially run into bigger problems with the set issues and there it's always it's quite hard in multi cloud environment to find one solution for kernel tuning because Because for example, kernel tuning parameters which are allowed on Azure are not whitelisted within GCP. So you have to find other solutions to really enable that kernel tuning. You have different options for that. You can use scripts directly on the node or use daemon sets to enable kernel tuning. You could also use in-it containers, but that always depends on the software you are running. For example, for Cassandra, we use CCop as an operator. And because of that, we cannot use init containers because they have their own init containers running and you cannot add additional init containers, to my knowledge.
Starting point is 00:33:16 And so on GCP, we had to use an own daemon set to enable the kernel tuning parameters, which we are working on Asia without any problems with our Terraform automation. Which, again, Brian, comes back to what you said earlier about why serverless gets this momentum, because right now what I hear here, Klaus, is that you're building individual fixes,
Starting point is 00:33:45 workarounds for the individual things you just figure out as you go. And then you need to maintain this. And maybe, who knows, GCP, Azure, they are changing their default settings or what they allow. And then you constantly have to test it. And this additional, like serverless,
Starting point is 00:34:00 brings this additional abstraction layer on top of this. But I always thought it's it i always thought it's more easy it's easier in this case it would have been possible to use a demon set on asia as well so we would have one solution which works but i think for our performance reasons and overhead the colleagues in tansk looked for another solution which worked on Azure, but unfortunately not on GCP. Now we have actually two solutions for the same problem. Might be reflected to use daemon sets for every cloud provider, but we have to discuss this internally and find a solution for that.
Starting point is 00:34:46 Or maybe on GCP it changes in the future because there are already better features available to have more influence on that. But we don't want to use better features in production.
Starting point is 00:35:02 Yeah, of course. You want to use better features, but not beta production. Yeah, of course. You want to use beta features, but not beta features. Yeah, exactly. Andy, this also makes me wonder, if you think about what our friends over at Akamasa are doing with JVM tuning and all that, I hear all this kernel tuning,
Starting point is 00:35:20 the disk options and all this, and it sounds ripe for me for some AI layer to, if it can understand all the inputs and tweaks available in the different cloud providers, look at the performance and make these tweaks of the kernel tuning or switching it over to a different disk to find that performance for you. And I know that's a bit of a simplification
Starting point is 00:35:42 of what Akamas is doing, but it sounds ripe for someone to throw an AI layer at. Exactly. And I think that's what Akamas is doing. And I will definitely make sure once this recording is on air to send Stefano
Starting point is 00:35:57 the link so that he looks at this, what we're doing here. Klaus, there was one more thing that I had in my list uh kind of goes into again settings and just brian you mentioned akamas i think they started with jvm tuning um there's also some things you have you had to tune for uh garbage collection because dynatrace is heavily depending on running on j, at least our cluster nodes. What's the story there?
Starting point is 00:36:26 There are at least two topics I know of. The one is be aware of which containers you use. If you don't build them yourself, for example, you might run into issues because a pre-built container has some settings applied which are not production-like, simply because it's easier to get started with them. For example, with Elastic, we had an issue again with memory mapped files where a feature was simply disabled by default to allow to run a container without privileged mode. And that caused that we had quite high system CPU usage with elastic and indexing times with the load which took more than two minutes for indexing and as soon as we switched the flag to use
Starting point is 00:37:39 the memory files indexing times were below 10 seconds. That's an improvement from, let's say, two minutes is like 120 seconds down to, what did you say, 10 seconds? 10 seconds, yeah. Yeah, that's like more than a 90% improvement in performance. Can I ask a question on that? Because it boggles my mind. I'm used to analyzing performance from a transaction code level or looking at things like something else stealing the CPU. But when you talk about a flag for running in privileged mode,
Starting point is 00:38:12 how the heck does somebody go about discovering that that flag is set and maybe we should turn that off? Because it's not like it's obvious. It's not like there's a list of heroes, all the things that are turned on and you might want to consider. This is like some deep setting. I remember Andy Mark-Thompson a while back was talking about a situation where they had some Intel chips with a flag that can go turn hyper-threading on or off,
Starting point is 00:38:35 and that was the issue. So it's someone, I guess, has it just come down to someone looking and thinking, oh, this might be it, let's try it? Or how does that work over with the team? What is that knowledge like to know to go look at that? In this case, it was a colleague in the cluster performance engineering team. It was Markus Farnberger who actually looked into that
Starting point is 00:39:00 and said, look at the I.O. pattern. That's crazy how much I.O. is happening on that machine without any disk access or with extremely low disk access. And that was making me think about what could cause this. And from Cassandra, I already knew that with memory mapped files and stuff like that, that you produce a lot of I.O. if it's not correctly tuned without doing actually anything.
Starting point is 00:39:35 Because you simply load something into the cache, remove it from the cache if the cache is too low. And in this case, it was really good that a colleague of mine took a look at that because I didn't spot it at the first chance. But Brian, I think you bring up a perfect point, right? We have lived so much in our APM transactional world for that many years and always analyzed, you know, which method calls, which other methods, how many database queries are executed. And therefore we focus so much on where can we optimize performance there. And there's a huge potential, as we all know. But also if you think about your traditional pure path,
Starting point is 00:40:19 also going back to the App Mondays, we then always had, there's a pure path and all of a sudden it just spends time in io because we always had either cpu time we had wait sync or the rest was just io and io means i'm waiting and obviously io to complete and now the question is right if i ramp up the load and i see more pure paths coming in all of a sudden io increases more than normal, then the question is either I'm really doing too much of IO or I have a problem with my IO on another end that I need
Starting point is 00:40:52 to optimize. But then it needs experts like Klaus and others in performance engineering to then figure out, okay, what can we do? In this case, we also had issues with the pre-built container we used,
Starting point is 00:41:09 which used a Java version, which we officially do not support anymore. So basically at the beginning, we don't even have monitoring for the Java processes in this case. It made things much more difficult and we had to find a solution how we could enable the monitoring. In this case also other team members helped
Starting point is 00:41:34 on that to enable some flex in our debug UI which enabled monitoring of unsupported versions. And unsupportedorted just that i get it right this was basically again elastic or cassandra you know we ran it on we had older versions of or versions where cassandra and elastic used all the versions of java exactly we didn't support out of the box in this case it was java 15 which isn't supported anymore because it is quite old and yeah and you know what this reminds me of one more thing so it's interrupted but this reminds me of a big discussion we had just not too long ago and brian and i will have our colleagues from open telemetry on one of the upcoming episodes fact is that a lot of new software is built fact is that we're using a lot of software from third
Starting point is 00:42:26 party vendors that might not be instrumented yet with open telemetry that right are like using jvms where even our product by default says we don't support it fortunately we have all these hidden features where we can turn support on but this just tells me that it's really great that we have tools like us and also our competition that has been building also agents over the last years and decades to really get insights into applications that are not manually instrumented yet with something like OpenTelemetry.
Starting point is 00:43:01 Because you cannot just go to Elastic and say, I need OpenTelemetry now for this version that is five years old. They wouldn't do it. And what I wanted to bring up basically with this topic is that you have to take care which containers you use. Especially if you don't build them yourself. You lose influence on what is actually running. For example, if we use Cassandra 3.11 and when a new Java version comes up, they simply rebuild their Cassandra image and provide the same
Starting point is 00:43:46 tag with another Java version. So it might depend on when you download an image on what you're actually running on a system. I was going to say they might not be doing the Java tuning if there's some new features in Java. Andy, it's funny. This sounds to me like a container equivalent
Starting point is 00:44:08 of copying code from Stack Overflow. Yeah, exactly. Here's the container popping in. A cool thing with the latest Java 11 version was 11.0.14, I think. They released the Java version. And a few days after that, they discovered that with the HTTP classes, you couldn't connect to Google.
Starting point is 00:44:38 Because they changed the header handling and submitted a host header and I don't remember it exactly, another header and Google didn't allow that. They returned a 400 error. They had to rebuild the whole Java from scratch to fix that really small bug. And if you have that within the container, the wrong version, you depend on someone else who rebuilds that container and fixes that for you. That's fascinating.
Starting point is 00:45:19 First of all, I do hope that the Java community has added this now as a standard test case that they can request certain things from these domains. But this is a very good point. You're depending on somebody else providing you the right software. This is also why I've seen some of our customers, they are actually not asking for the container from you. They're asking for the whole source code
Starting point is 00:45:46 and then being able to build these containers themselves. I just learned this through our work that we do with Captain, where we have some customers that just say, hey, we don't need your images. We just build it ourselves. I think also for this reason. And also then have their own scans and making sure that
Starting point is 00:46:08 it all aligns with their policies that they have. It sounds like the use for containers would be if you download the pre-built container, that's good for prototyping. But once you're set on it, build your own container then.
Starting point is 00:46:24 Does that sound like it would be a smart way to go? I mean, if you wanted to use the preset one, right? Like, let's see if this is even going to be feasible. Download that and then... The pre-built container image, a great starting point. You get stuff running really fast, but they have potential risks. Because also you have to
Starting point is 00:46:46 take care of which Linux operating system is running on it and are the bugs fixed security holes fixed all this stuff you have to track all this and
Starting point is 00:47:01 deploy all updates also to your own environment and load test with that. And everything we are doing in our use cases with 24-7 tests, you always have to do that with containers built by someone else as well. Wow. Klaus, it's amazing how fast time flies. I wanted to make sure we didn't forget anything, though.
Starting point is 00:47:31 Is there anything else where you say, hey, this is another lesson learned that I want to make sure that people know? Did we cover pretty much everything? The one thing you mentioned before we talked about the containers, the GC settings, for example,
Starting point is 00:47:48 which can cause quite some troubles in production when you don't load test the behavior up front. Because we ran into issues with our ActiveGate, where we use CMS garbage collection. And depending on how much CPU you assign to a container, request and limit, you get different amounts of memory, for example, for Eden space. In our load tests, we had issues
Starting point is 00:48:25 because the active gate couldn't cope with the traffic as soon we increased it. And we ran into garbage collection issues and we had to dig into that to find a root cause for it. Because with similar amount of memory,
Starting point is 00:48:43 for example, for a gigabyte, on our SaaS offerings, we have a certain amount of hidden space and yeah we have our recommendations how on the sizing of an active gate and yeah the sizing was actually fine but But what we didn't take into account was that there was limits set on the container which restricted the Eden space. Because if you look at the Java source code, you can then find that the Eden space is calculated based on some settings which influence that and one of them is how many
Starting point is 00:49:28 garbage collection threads I actually used and that depends on how many CPUs are available for the container. And as soon as we increased the CPU limits those issues were gone. Because it is a huge difference if you assign one CPU or 1.1 CPU, you get twice as much hidden space for the objects which are only short-lived. Just need to take notes here. And this is, I mean, I remember we had podcasts in the past, or we remember I wrote blogs about this, that the JVM is looking at all of these settings and then with that decides how many threads you have. But it's just so amazing that we have these layers and layers of the runtime.
Starting point is 00:50:29 And then there's so many tweaks we can do and have to do. And then sometimes we don't know where we have to set things. I remember that Henrik Rexit, our colleague, he has been doing a couple of sessions in his Is It Observable channel on setting proper resource limits on pods. Because if you don't do them, first of all, you don't know what you really get. But you can also do... It's just like he keeps reminding people how important these settings are, not only for the pod itself, but everything that runs inside and like your JVM.
Starting point is 00:51:04 Sounds like you have to game the system. that runs inside and like your JVM. This would be a great... I was going to say, it sounds like you have to game the system. Right, Klaus? You said going to 1.1 CPU gives you that extra. You don't have to use a whole 2 CPU so you can cut down your cost by using 1.1 but then you get the Eden space and it's...
Starting point is 00:51:19 I don't know. It's absurd in a way that... How do we tweak just enough to get that next level? I don't know. It's absurd in a way that, like, how do we tweak just enough to get that next level? I don't know. It's crazy, crazy, crazy stuff. And there's, again, the issue with if you build the containers yourself,
Starting point is 00:51:35 you have more influence over the actual heap settings, memory settings, garbage collector used, than if you use pre-built container images, you possibly cannot influence it. This is a great trivia question, just writing this down. It would be a trivia question for a pub quiz or maybe like, what's the show, Who Wants to Be a Millionaire? It's the million-dollar question.
Starting point is 00:52:03 When you are changing your resource limits from one CPU to 1.1, how much more Eden Hibbs space do you get? I bet you that's what a lot of tech interviews feel like. You're on who wants to be a millionaire with no phone home help. Then there's the question with which Java version might influence that as well.
Starting point is 00:52:27 And container awareness enabled or disabled makes a big difference. Yeah. Klaus, thank you so much for reminding us, us as Brian and myself, us as the performance, the global performance engineers that are listening, hopefully, to this podcast, reminding us that performance engineering is not dead with the cloud. Performance engineering is more important than ever. And it's more challenging, but also more interesting, I think,
Starting point is 00:52:58 because there's so many cool new things to play around with. But the impact is just phenomenal on what you have because you make sure that our software runs perfectly on the cloud and efficient to make sure we actually make money in the end and make our end users happy. Thanks to that. Yeah, and I think it's just to reiterate again that the complexity of performance, this just highlights how complex it really is. Now moving to different cloud providers. But also, I think what you just highlighted with all this,
Starting point is 00:53:29 if we go back to that idea of moving to serverless, Andy, you have to, the only, this is almost a case against serverless, because if the cloud providers can do all this stuff fantastically, then you'll get the performance you need out of it. But that is a leap of faith. You have to have faith that the cloud providers are looking at all these things, are tweaking these,
Starting point is 00:53:53 are considering all these different options. And maybe not just if you, let's say, you're going to use some sort of a serverless Cassandra. Are they using something with some weird container? And as we all know, faith in companies to do these things can be very dangerous because we don't know what they're doing.
Starting point is 00:54:11 You don't know what's behind the scenes. So at least, you know, you have the controls if you're doing it yourself. But on the other side, if you're going to do all these things yourself with Kubernetes, then you have to have the expertise like you have to know how to look at these different things.
Starting point is 00:54:23 And so there's, you know, there's a balance between what you have available, what can do, what control you give up, what you have to hope and pray that they're going to do it right and you'll get that performance and you're not just going to be resorting to spending more cloud spend to get the performance that you need. It's just such a complex...
Starting point is 00:54:43 Performance gets more and more complex, it seems. Or at least the more aware of it, I should say. The more aware of it we are, the more complex it gets. I think it's probably always been complex. But now that we're... All these things are being brought up to the surface so much more. Just shows you it's a never-ending thing and I don't see how... Yeah, I don't see performance engineering going away any time.
Starting point is 00:55:03 But we may give it fancier names, right? Yeah. That's the only thing. From performance tester to performance engineer to SRE to... Klaus, what is your official title again? I'm a senior software developer. See, there you go. That's the new software engineer name is senior software developer.
Starting point is 00:55:24 But that's, you know what, not a joke, and now we're wrapping up, but that should always be on the forefront of developers' minds, right? Is the performance. Obviously, you can't do all jobs at once, but developing with performance in mind is a key aspect. Anyway, really appreciate you being on today. It's been a pleasure. It's been a fascinating
Starting point is 00:55:39 topic. Just, yeah, thank you. Thanks a ton. My mind is blown today so thank you thanks for the invitation brian and uh andy it was a pleasure to talk to you and it's great that you took that you took the leap of faith as brian just said because i know you said in the very beginning it's a new challenge for you to speak in a podcast also it's not your native language that we all speak in here. At least Brian. I mean, English kind of Brian.
Starting point is 00:56:10 So that's why so much. Thank you so much for doing this. Welcome. Thank you, everyone, for listening. Have a great day, everyone. Bye-bye. Thank you. Bye-bye.
Starting point is 00:56:21 You too. Bye.
