PurePerformance - Performance Engineering for Hybrid Cloud re-platforming with Klaus Kierer
Episode Date: April 4, 2022

When moving to the cloud, have you thought of the performance difference between App Gateway and Application Load Balancers? The disk speed and disk cache limitations impacting Cassandra and/or Elasticsearch performance? Challenges with pre-built containers, or resource limits on pods impacting Java garbage collection behavior?

These are all performance considerations Klaus Kierer, Senior Software Engineer in the Cluster Performance Engineering Team at Dynatrace, has learned over the past months as he helped performance-optimize the Dynatrace platform as it was expanded from running on AWS compute to running on Kubernetes hosted in Azure (AKS) or Google Cloud (GKE).

Listen in and learn why performance engineering is more important than ever as you move your workloads to the "hyper-hybrid-cloud".

Show Links:
Klaus on LinkedIn: https://www.linkedin.com/in/klaus-kierer-67b83a81/
Blog - When to use Azure Load Balancer or Application Gateway: https://blog.siliconvalve.com/2017/04/04/when-to-use-azure-load-balancer-or-application-gateway/
K8ssandra performance benchmarks on cloud managed Kubernetes: https://k8ssandra.io/blog/articles/k8ssandra-performance-benchmarks-on-cloud-managed-kubernetes/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always I have with me here my co-host who's mocking me today, Andy Grabner.
And hello Andy, Mr. Funny Guy.
I know, I wanted to make you laugh but instead I'm laughing now.
Yeah, it's great to be here.
Yeah, I'd say that. It's always great to be here.
I know.
Because it's always such a fun learning experience.
Not to segue right away, but I think today is a brand new learning experience for us.
So hopefully it'll be a brand new learning experience for our listeners.
It is.
And as I think we mentioned a couple of times, one of the reasons why we keep doing this podcast is because we learn much more than our guests do, because they already know all this stuff. But on today's topic: I've been seeing a lot of blog posts out there recently from Stephen Townsend from New Zealand. I've been with him at a Neotys PAC conference once, and I think he also wrote a blog series about performance engineering in the cloud, essentially asking: does the cloud make performance engineering obsolete? Yes or no.
And I think today we definitely have a MythBuster session.
Who's Jamie and who's Adam?
We'll figure this out. But without further ado, I want to make sure to introduce our guest. He is one of our own Dynatracers, a performance engineer, and Klaus Kierer is his name. And Klaus, I want to pass it over to you: if you could quickly say hi and introduce yourself to the audience.
Hi Brian and Andy, thanks for the invitation to this great podcast.
I sometimes listen to it, and it is really great to take part in it. Hopefully I can shed some light on some of the issues we had in load testing our own Dynatrace clusters. I've been with Dynatrace since 2019, and since the beginning I've been a member of the Cluster Performance Engineering team, where we cover 24/7 load testing and regression load testing of our own clusters with Dynatrace.
First of all, thank you for actually helping us make this podcast great, because it's guests like you who allow us to do that. It's also great to hear what you just said: 24/7 performance engineering. I'm always the fortunate guy, I think, who gets to talk about these stories, because I've also talked about what you and your team under Thomas Steinmaurer, who I think you're reporting to, have done over the years.
It's really fascinating.
And I want to quickly give some background on how we came about this episode.
Because I was invited to speak at an Azure conference, and then the idea came up: hey, we as Dynatrace just recently announced we moved our SaaS offering to Azure and GCP as well, so we are providing our SaaS service not only from AWS, where we started, but on all the other hyperscalers. And then the question came up: was this just an easy, smooth transition? How does this work? What had to be done? Any performance challenges? And then the two of us got to talk, because Thomas, you know, he said, hey, talk with Klaus.
And then we sat down over lunch.
Fortunately, this is possible again now in the office.
And then you brought up some really fascinating points on what things you have learned as you were moving Dynatrace, our offering into Azure.
And this is where I took a lot of notes.
And then I said, let's come back to the podcast and let's discuss it one by one.
Really fascinating.
And with this, Klaus, before we get started, I think you said another thing.
And before we started to hit the recording button, for you, it was a lot of new first things, right?
The cloud was also new for you.
Kubernetes was new.
Exactly. I had some experience with Docker, running it on some of my own servers privately, but there was no need for me to learn Kubernetes, because there is some overhead involved. Then last summer we had a chance to really step into that in the Cluster Performance Engineering team. I took that over and learned a lot of new things; it was really great, what I learned, and it was sometimes a hard time as well, because I got a new Dynatrace cluster running in Kubernetes and: now do some load tests with it.
Yeah, and it's interesting, the idea of load testing, right? So we've had conversations, Andy and I, in the past with Mark Tomlinson about load testing in the cloud, right?
And that's a whole different topic.
We won't go into that, but just to retouch upon that, you are testing your application that's running in, let's say, Kubernetes.
But you also have to be aware of making sure you're testing the capabilities and functionality of, let's say, Kubernetes to support that load.
Like if a node goes down, is scaling working?
Are they scaling at the right time and all?
So that on its own is even just a lot to go into.
But I think today we're going to be taking a look at an even different side of that,
because as we discovered or as Andy discovered talking to you,
there's even another dimension that needs to be considered.
But I'll shut up and let Andy go.
No, that's fine.
And I want to highlight why this is so relevant
for all of our listeners,
because like Klaus, what you just said,
we've had Dynatrace for years now, right?
And the way we run it,
but we basically moved it not only to another cloud vendor,
but we also moved to Kubernetes.
That means we did what a lot of organizations are trying to figure out.
How can Kubernetes become the new orchestration platform for running container-based workloads?
Which means, because it helps us to easily, at least in theory, move things around from
one vendor to another, from one hyperscaler to another, because the common denominator
is Kubernetes.
But as you've learned, there's definitely certain things to be aware of.
And Klaus, if you're okay, I would jump into the first thing
because I thought this was fascinating.
You told me about one of the things that you played around with, and this is kind of the gateway into everything that runs in Kubernetes: the ingress controllers.
And there's different ways in Azure and also I think in AKS and GKE on how you can expose your services to the outside world.
What did you learn as you moved to Azure in terms of ingress controllers and the load balancers?
From a load testing perspective,
we use our own load test generator,
the Cluster Workload Simulator,
where we simulate typical load,
which also would come from real-life systems.
For example, PurePaths with database calls and servlet calls, stuff like that, or log ingest: all things which would possibly come into a cluster and have to be handled accordingly. And for example, on our load test cluster we now simulate about 20,000 to 25,000 agents on about 5,000 simulated hosts.
Wow.
So the cluster workload simulator has some kind of workload pattern which can be configured.
And we originally set this up.
We took a cluster we have running in our SaaS offerings and scaled the cluster accordingly to have something to compare with. And then we enabled our cluster workload simulator
and ingested some load.
Initially, it looked really fine.
And as soon as we increased the load, we saw that on the ingress nodes, where our ActiveGates were also co-located, we had quite a huge amount of CPU usage, which was quite atypical. I investigated a little bit into that, had some talks with colleagues from Gdansk, and it took quite some time to really get into what the root cause of our issues was. Because in Azure you have different scenarios for how you can ingest the data, for example the Application Gateway or the Azure Load Balancer. The Application Gateway works on layer seven and the load balancer on layer four.
And initially we used the application gateway.
We didn't have a clue that this could cause any troubles because we saw the high
CPU usage mainly on the ingress nodes where Nginx is used. So it took quite some time
to investigate, and one of the first steps was that we separated our ActiveGates from the actual ingress, to have a separate layer in front of the ActiveGates to better investigate what was causing the high CPU load.
Didn't improve the situation very much,
and we really tried to kill the problem with throwing hardware at it.
Didn't work out. Finally, a colleague from Gdansk said: let's try to remove the Application Gateway and simply use the Azure Load Balancer instead. And finally, the high CPU usage was gone.
We were on levels which were fine,
still a bit high,
but actually something we could work with.
And the reason, and this is fascinating,
and by the way, for the listeners,
Klaus, you also sent me a couple of blog posts.
For instance, one on when to use Azure Load Balancer or the Application Gateway.
You mentioned earlier that these different, let's say, ingress options work on different levels of the OSI stack: one on layer 7, the other one on layer 4.
Obviously, the higher you go up, the more work needs to be done by the load balancer
because they're analyzing the HTTP payload, doing, I guess, even SSL termination and handshaking.
And I mean, there's so many things that happen there.
And in high load environments, I mean, you mentioned you're simulating workloads of 25,000 OneAgents where you're sending data
in from rather large environments.
And these are workloads that we see all the time
and they obviously need to perform.
And we want to make sure we have enough resources
in the real core of the cluster
and not just eating up all of the CPU already in the front
where the traffic comes in.
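To make that layer-7 cost concrete, here is a minimal sketch, purely illustrative and not from the episode, that times a bare TCP connect against a full TLS handshake in Java; the target host is a placeholder:

```java
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;
import java.net.Socket;

public class HandshakeCost {
    public static void main(String[] args) throws Exception {
        String host = "example.com"; // placeholder target

        // What a layer-4 load balancer pays per connection: a TCP handshake.
        long t0 = System.nanoTime();
        try (Socket tcp = new Socket(host, 80)) { }
        System.out.printf("TCP connect: %.1f ms%n", (System.nanoTime() - t0) / 1e6);

        // What a layer-7 proxy that terminates TLS pays on top: the full TLS handshake,
        // including the CPU-heavy asymmetric crypto.
        t0 = System.nanoTime();
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket tls = (SSLSocket) factory.createSocket(host, 443)) {
            tls.startHandshake();
        }
        System.out.printf("TCP + TLS handshake: %.1f ms%n", (System.nanoTime() - t0) / 1e6);
    }
}
```

At tens of thousands of agents reconnecting, that per-connection difference is exactly where ingress CPU goes.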
And then, lessons learned. I think in hindsight, if I look at it now, after you explained it to me, I would say: well, of course this makes sense. But still, for somebody who is moving to the cloud, you just turn on the default features and say, hey, this all looks great, everything works magically anyway, and everything is scalable by default. So it's a great lesson learned in how you can optimize their performance and what impact that has.
Yeah. And I had a question regarding this. Do I understand correctly that this was discovered in the Azure Kubernetes setup? So this didn't exist in AWS, right? Like, you didn't have this problem with that bottleneck.
In AWS, we don't use Kubernetes.
Oh, that's right.
Okay, so that's, okay.
That's very, yeah, yeah, yeah, I know.
I'm just trying to think, you know, Andy, you talk about lessons learned.
So obviously this wasn't a, it worked fine here.
We moved it to the other one.
It changed.
This was, let's say for lack of a better term,
greenfield in a new cloud provider.
So just thinking the complications that come into this project,
you're taking a new product or new software architecture
and a new cloud both at the same time.
So now when you're trying to troubleshoot,
you're looking at,
is it the way we've set it up in Kubernetes?
Is it Kubernetes or is it the cloud?
Which is, I don't know,
it just sounds like a lot to tackle.
I wonder if in hindsight,
would it have made sense to first set that up in AWS
where you had your at least known conditions?
Or do you think that wouldn't have mattered in this situation?
Thinking if somebody is going to do something similar,
would it be a worthwhile step to try it in the environment they're used to first?
Or do you think that environment change had very little impact on what happened?
That might indeed make sense
to try it out in an environment which you know.
But on the other hand,
you might then build a system
which runs fine on AWS.
And when migrating to GCP or Azure,
you again run into troubles
because you use features of the cloud provider which are implemented differently at every cloud provider.
Right. So it sounds like it'd be better to test it in the environment you want to launch it in, I guess. And I think the goal was to get it running in Azure first, so it just made sense to start there anyway. Okay, that makes sense.
But then also, what I want to reiterate
here, our AWS environment where we didn't use Kubernetes, but we used obviously our current
architecture that we've used for several years. This was your baseline though, right? This was
the baseline. We know exactly, under certain load conditions, this is what it takes. And then what
we did as Dynatrace, we moved
to the different cloud vendors and with this we
replatformed because we knew
Kubernetes is the future.
So we had to figure out how to run
on Kubernetes anyway.
But then there was a great assumption to make: we should get Kubernetes running as efficiently as if we ran our application on the old infrastructure definition. Because in the end, you want to move over and get the benefit of Kubernetes
without getting any overhead that we can optimize or that should be optimized. Because I'm pretty
sure, Klaus, while you've been with the company since 2019, even when we started with AWS, I'm sure over the last 7, 10 years,
we've made many performance improvements
on AWS compared to where we started.
And when we now look at
what we do in Kubernetes,
we are still in the early stages
and we need to figure all this out
and how to really optimize this environment.
That's indeed true.
There are quite a lot of
optimizations on AWS.
For example, on the ActiveGate nodes there we have not NGINX in place but HAProxy, and there we directly use the local ActiveGate to direct the traffic, to have some traffic locality. We don't have it like in Kubernetes, where you have an ingress pod which talks to any ActiveGate in the cluster. In Kubernetes we don't have that traffic locality, actually. We worked on that; it's not really node locality, but at least we are in the same availability zone.
Yeah.
For me, it's fascinating because
as you said, Klaus,
you started with Kubernetes
just a little while ago. I started
with Kubernetes, I think, three
years ago when we started our journey with
Captain. And
it's still a mystery to me,
certain aspects at least, especially the whole
networking, if you're not a network expert.
And it seems a lot of the things you're explaining right now are network specialties that just come with moving to something like Kubernetes.
Yes, of course.
There are quite some network
issues. For example, with the initial traffic we had,
one issue was that in Dynatrace itself you can monitor where the traffic is going, which process gets traffic and which
process it descends to. You can monitor that within Dynatrace. But on the Kubernetes cluster, all of that seemed somewhat strange to me.
It looked like the traffic is wandering around from one availability zone to the next, to one part.
And really, it was amazing how the traffic on the cluster went. And there were also quite some improvements
to influence how the traffic is processed
within Kubernetes.
For example, the low balancer
knows all nodes within the cluster.
And basically, you have traffic
coming into the cluster
and it can potentially go to any node.
And then it is by Calicrew, I think Calicrew is responsible for that, which directs it to the
target part where it should actually go. Now with the health matrix We worked a little bit around that
to tell the load balancer
that actually only the ingress nodes are up for it
and want the traffic.
So we had it again in our hand
how the traffic flows through the cluster.
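What Klaus describes, telling the cloud load balancer that only ingress nodes should receive traffic, matches what Kubernetes exposes as externalTrafficPolicy: Local on a LoadBalancer Service: nodes without a ready local ingress pod fail the load balancer's health checks. A minimal sketch using the fabric8 Kubernetes client; the client library and all names here are illustrative assumptions, not the team's actual setup:

```java
import io.fabric8.kubernetes.api.model.IntOrString;
import io.fabric8.kubernetes.api.model.Service;
import io.fabric8.kubernetes.api.model.ServiceBuilder;
import io.fabric8.kubernetes.client.utils.Serialization;
import java.util.Map;

public class IngressServiceSketch {
    public static void main(String[] args) {
        // With externalTrafficPolicy=Local, kube-proxy answers the cloud load
        // balancer's health checks only on nodes running a ready ingress pod,
        // so the load balancer sends traffic only to the ingress nodes.
        Service svc = new ServiceBuilder()
            .withNewMetadata().withName("ingress-nginx").endMetadata() // illustrative name
            .withNewSpec()
                .withType("LoadBalancer")
                .withExternalTrafficPolicy("Local")
                .withSelector(Map.of("app.kubernetes.io/name", "ingress-nginx"))
                .addNewPort()
                    .withName("https")
                    .withPort(443)
                    .withTargetPort(new IntOrString(443))
                .endPort()
            .endSpec()
            .build();
        System.out.println(Serialization.asYaml(svc));
    }
}
```

A side effect of Local is that the client source IP is preserved; the trade-off is less even spreading across pods on different nodes.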
And this is, again,
for many of us, including
me, the network is always
a little magic kingdom.
But as you said, this is exactly what
coming back to my initial
opening statement, performance engineering
is not going away with the cloud.
Performance engineering becomes tougher
with the cloud and especially more important
because if you just go with the defaults,
then you are in the end paying a whole lot of money
because you just throw virtual or cloud hardware on the problem.
I'm pretty sure you can scale some way.
The question is: is the software that you're providing and running still sustainable, and can you actually make money out of it?
Yeah. And additionally, depending on the cloud provider, cross-availability-zone traffic might cost much more than local traffic. And if traffic is sent from the ingress node in one availability zone to an ActiveGate in another availability zone, and then again to a server in yet another one, you have a lot of cross-availability-zone traffic, which on the one hand costs money, and on the other hand also costs performance.
The other complication I'd add is just the level of transparency that the cloud provider provides to what they're doing underneath.
So you go ahead and run these things in a cloud-managed Kubernetes environment.
What do you have access to finding out how they're routing this stuff, what protocols they're using?
If we abstract this out to serverless, you don't even necessarily know what's running behind there. You may know, but the level of transparency comes into question so
that when you do run into these issues where we think it might be something with this network,
can you get down to that level to understand which layer of network is being used so that
you know whether or not you would have to pull that out or maybe re-architect
because it's something you're doing. So that's always just another consideration: can you find out from the cloud provider, from what they're providing you, how they're doing what they're doing, so that you can get to the root of what's causing that bottleneck and come up with a good solution?
In a perfect world, you wouldn't even use the Azure portal or GCP portal for that.
You would see everything within Dynatrace.
At the moment, we are not on that level and have to improve some things, especially in
the topic of networking.
For example, the load balancer is not really visible within Dynatrace.
You might see it in the traffic UI as an unmonitored node.
But you really don't know, for example, where traffic gets lost or is blocked.
That's stuff we still might improve in the future.
And I think this is also kind of repeating for people that are listening in and
maybe don't understand the full context.
We are running our Dynatrace in Kubernetes on hyperscalers like Azure, but we are monitoring everything also with Dynatrace.
And there's teams like Klaus's team,
the cluster performance engineering team,
or I'm just pointing, I know nobody can see me,
but I'm pointing to the other room.
My wife, Gaby, she sits over there. They're using Dynatrace to monitor these environments
to make sure systems run stable in production.
We are using our own product the way we want our customers to use it.
And we're using it at a very tremendous scale
across all the different hyperscalers,
across all the different permutations
that Kubernetes gives us.
And I think this is really great.
And then also improving our product as we see where, as you said earlier, we may have blind spots right now, where it's not perfect. That's really cool.
Hey, Klaus, I want to switch topics, because I remember in our discussion, when we were sitting down for lunch, you brought up a couple of points on not only network performance or network issues but also disk: a lot of things you learned about disk performance, caching, and things like that. Can you enlighten us a little bit on some of those lessons learned that everybody should know?
Within the different cloud providers, there are quite some differences in the disks you can use and what they cost. For example, we have a lot of experience with AWS, where with millisecond response times for disk access, writes and reads really perform well. And years ago we also set up some managed clusters within Azure, where we already had some issues with disks. For example, the premium SSD disks have quite a different IOPS profile than we were experiencing on AWS. So latencies were different, and what I found out is that Azure is optimized for high parallelism.
And our cluster software had some issues writing, for example, the session storage, and for Elasticsearch we also had issues and had to find solutions for that. Because on Azure, for example, read disk caching is only supported up to about 4 terabytes (4095 gigabytes, to be exact); one gigabyte more and you cannot enable read caching. And this has a tremendous effect on Elastic, for example, when you want to read logs. Cassandra is also highly involved when reading data. We haven't seen those issues that much with writing. The write issue was only with the session storage, for example, where we had to increase the parallelism we use for writing, where we had no problems on AWS. We had to modify our software to run with good performance on Azure.
And we have seen this with Kubernetes as well.
And I've found some blog posts about that,
where comparisons have been done between different cloud providers,
for example, for Cassandra.
And the result of that was exactly what we have seen.
And for Azure, you would need Ultra Disks to be on the same level as, for example, GCP or AWS.
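One way to see such disk differences for yourself is a tiny synchronous-write latency probe, in the spirit of what benchmarking tools like fio measure. This is only an illustration, not the benchmark from the post mentioned below, and the file path is a placeholder:

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class FsyncProbe {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("latency.probe", "rw"); // placeholder path
             FileChannel ch = raf.getChannel()) {
            ByteBuffer buf = ByteBuffer.allocateDirect(4096); // one 4 KiB block, like a commit-log write
            long total = 0, worst = 0;
            int rounds = 1_000;
            for (int i = 0; i < rounds; i++) {
                buf.clear();
                long t0 = System.nanoTime();
                ch.write(buf, 0);
                ch.force(true); // force the write through to the device, past any cache
                long dt = System.nanoTime() - t0;
                total += dt;
                worst = Math.max(worst, dt);
            }
            System.out.printf("avg %.2f ms, worst %.2f ms per synced 4 KiB write%n",
                total / (double) rounds / 1e6, worst / 1e6);
        }
    }
}
```

Running the same probe on comparable disk tiers of two providers makes the IOPS-versus-latency trade-off Klaus describes visible quickly.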
By the way, for those listeners that are interested in this,
the link to the blog post, it's on k8ssandra.io, a blog post called K8ssandra Performance Benchmarks on Cloud-Managed Kubernetes. We'll put it into the episode summary
so that people can easily find it.
And this is just fascinating, right?
Because in the end, I always thought if you are,
you know, like you're buying a car with a certain horsepower here or there, it's kind of the same.
It's like you're buying a certain compute power on this vendor and the other. And if they look
the same, they should feel the same, but it's not the same. And I think these lessons learned are really fascinating.
I wasn't unmuted on my recording. Let me start that all over. Wow, Andy, it's me today.
This is another factor to consider
when moving to multi-cloud, right?
Where if you want to lift and shift
from one cloud to another,
you would expect that you could just order up same thing,
pop it in and get the same performance.
But if you skip those testing steps, if you're not looking at the hardware it's running at,
you're just looking maybe at the software performance and not looking at all those other
layers, but also not being conscious of it. In these situations, you might have to switch up
what you're using because you're not going to get the same performance. And this, to me, is the most fascinating thing about today's episode,
is the difference in performance between the different cloud providers.
Not necessarily that one has better performance than the other,
but that you might have to use different aspects of it to get the same performance.
You can't just toss it over and expect it to be the same. Your example, right? It's the same car.
The example I like is if you go back to the earlier days
of Internet Explorer, Firefox, and Chrome,
as a web developer, you couldn't necessarily
just put your page in the other browser
and expect it to work.
Because the engine running it is different.
And it's the whole new level of
complexity. And I can understand why when you were talking with Kelsey Hightower, I can understand
why this sort of thing is giving the whole serverless movement more momentum because you go
from hosting your own infrastructure and dealing with all this stuff to revisiting it all again in the cloud
where you thought you might have had
an abstraction from it
where you don't have to care about it as much.
But as you're revealing, with clouds you do have to continue to be very in tune with these components still.
Yeah.
And the question, though, and I like you bringing in serverless again: the challenge that I'd have with serverless is that at least here, with Kubernetes, Klaus has the option to tweak certain knobs, right, and optimize. With serverless...
Well, yeah, you're all in their hands. You're powerless in a lot of ways, which is even...
Yeah, in a lot of ways.
Yeah.
So Klaus, we talked about network, like the load balancers.
We talked about the disks.
I have a couple of additional points here.
When we talked, I think, again sitting down for lunch, you said: hey, kernel tuning on Kubernetes. That was also really, really important, certain tuning settings that you have to apply.
Can you enlighten us here as well, especially for people,
I think, that are running or trying
to do what we do with Elastic and Cassandra,
running those on Kubernetes?
Yes, there are always some issues with kernel tunings.
For example, Cassandra, or DataStax itself, recommends setting some parameters to get the performance they want. In most cases, you need kernel tunings to enable memory-mapped files, to really use the memory for caching. And if you don't raise the default values for Cassandra, you will sooner or later run into memory problems and out-of-memory errors, and because of that restarting pods, and you could potentially run into bigger problems with these issues. And it's quite hard in a multi-cloud environment to find one solution for kernel tuning, because kernel tuning parameters which are allowed on Azure are not whitelisted within GCP. So you have to find other solutions to really
enable that kernel tuning. You have different options for that. You can use scripts directly on the node or use daemon sets to
enable kernel tuning. You could also use init containers, but that always depends on the software you are running. For example, for Cassandra we use cass-operator as an operator, and because of that we cannot use additional init containers, because it has its own init containers running and you cannot add more, to my knowledge. And so on GCP we had to use our own DaemonSet to enable the kernel tuning parameters, which were working on Azure without any problems with our Terraform automation.
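For context on the memory-mapped-files tuning: Elasticsearch, for example, documents vm.max_map_count >= 262144 as the minimum for its mmap-backed storage. A small illustrative check of that sysctl from inside a Linux container (a sketch, not the team's actual tooling):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class MaxMapCountCheck {
    public static void main(String[] args) throws Exception {
        // On Linux the kernel exposes the sysctl as a plain file.
        long maxMapCount = Long.parseLong(
            Files.readString(Path.of("/proc/sys/vm/max_map_count")).trim());
        // 262144 is the minimum Elasticsearch documents for mmapfs indices.
        if (maxMapCount < 262_144) {
            System.err.println("vm.max_map_count=" + maxMapCount
                + " is too low for mmap-heavy stores like Elasticsearch");
        } else {
            System.out.println("vm.max_map_count=" + maxMapCount + " looks fine");
        }
    }
}
```

Whether a DaemonSet, an init container, or a node script is allowed to raise that value is exactly the part that differs per managed Kubernetes offering.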
Which, again, Brian, comes back to what you said earlier
about why serverless gets this momentum,
because right now what I hear here, Klaus,
is that you're building individual fixes,
workarounds for the individual things
you just figure out as you go.
And then you need to maintain this.
And maybe, who knows, GCP, Azure,
they are changing their default settings
or what they allow.
And then you constantly have to test it.
And this additional, like serverless,
brings this additional abstraction layer
on top of this.
But I always thought it's easier...
In this case, it would have been possible to use a DaemonSet on Azure as well, so we would have one solution which works. But I think for performance and overhead reasons the colleagues in Gdansk looked for another solution, which worked on Azure but unfortunately not on GCP. So now we actually have two solutions for the same problem. It might make sense to use DaemonSets for every cloud provider, but we have to discuss this internally and find a solution for that.
Or maybe on GCP
it changes in the future
because there are already
better features available
to have more
influence on that.
But we don't want to use beta features in production.
Yeah, of course. You want to use better features, but not beta features.
Yeah, exactly.
Andy, this also makes me wonder,
if you think about what our friends over at Akamas
are doing with JVM tuning and all that,
I hear all this kernel tuning,
the disk options and all this,
and it sounds ripe for me for some AI layer
to, if it can understand all the inputs
and tweaks available in the different cloud providers,
look at the performance and make these tweaks
of the kernel tuning or switching it over
to a different disk to find that performance for you.
And I know that's a bit of a simplification
of what Akamas is doing,
but it sounds ripe
for someone to throw an AI layer at.
Exactly.
And I think that's what Akamas is doing.
And I will definitely make sure once this
recording is on air
to send Stefano
the link so that he looks at
this, what we're doing here.
Klaus,
there was one more thing that I had on my list. It kind of goes, again, into settings. And Brian, you mentioned Akamas; I think they started with JVM tuning. There were also some things you had to tune for garbage collection, because Dynatrace heavily depends on running on Java, at least our cluster nodes.
What's the story there?
There are at least two topics I know of.
The one is be aware of which containers you use.
If you don't build them yourself, for example, you might run into issues because a pre-built container has some settings applied
which are not production-like, simply because it's easier to get started with them.
For example, with Elastic we had an issue again with memory-mapped files, where a feature was simply disabled by default to allow running the container without privileged mode. And that caused quite high system CPU usage with Elastic, and indexing times under load took more than two minutes. As soon as we switched the flag to use the memory-mapped files, indexing times were below 10 seconds.
That's an improvement from, let's say, two minutes, which is 120 seconds, down to, what
did you say, 10 seconds?
10 seconds, yeah.
Yeah, that's like more than a 90% improvement in performance.
Can I ask a question on that?
Because it boggles my mind.
I'm used to analyzing performance from a transaction code level or looking at things like something else stealing the CPU.
But when you talk about a flag for running in privileged mode,
how the heck does somebody go about discovering
that that flag is set and maybe we should turn that off?
Because it's not like it's obvious.
It's not like there's a list of: here are all the things that are turned on that you might want to consider.
This is like some deep setting.
I remember, Andy, Mark Tomlinson a while back was talking about a situation where they had some Intel chips with a flag that turns hyper-threading on or off, and that was the issue.
So it's someone, I guess,
has it just come down to someone looking and thinking,
oh, this might be it, let's try it?
Or how does that work over with the team?
What is that knowledge like to know to go look at that?
In this case, it was a colleague in the cluster performance engineering team.
It was Markus Farnberger who actually looked into that
and said, look at the I.O. pattern.
That's crazy how much I.O. is happening
on that machine without any disk access
or with extremely low disk access. And that
was making me think about what could cause this.
And from Cassandra, I already
knew that with memory mapped files and stuff like that,
that you produce a lot of I.O. if it's not correctly tuned without doing actually anything.
Because you simply load something into the cache, remove it from the cache if the cache is too low. And in this case, it was really good that
a colleague of mine took a look at that because I didn't spot it at the first chance.
But Brian, I think you bring up a perfect point, right? We have lived so much in our
APM transactional world for that many years and always analyzed, you know, which method calls, which other methods,
how many database queries are executed.
And therefore we focus so much on where can we optimize performance there.
And there's a huge potential, as we all know.
But also, if you think about your traditional PurePath, going back to the AppMon days: there's a PurePath and all of a sudden it just spends time in I/O, because we always had either CPU time, wait time, sync time, and the rest was just I/O. And I/O means I'm waiting, obviously, for I/O to complete. And now the question is, right, if I ramp up the load and I see more PurePaths coming in and all of a sudden I/O increases more than normal, then either I'm really doing too much I/O or I have a problem with my I/O on another end that I need to optimize. But then it needs
experts like Klaus
and others in
performance engineering to then figure out, okay,
what can we do?
In this case, we also
had issues
with the pre-built container we used,
which used a Java version,
which we officially do not support anymore.
So basically, at the beginning we didn't even have monitoring for the Java processes in this case. It made things much more difficult, and we had to find a solution for how we could enable the monitoring. In this case other team members also helped, enabling some flags in our debug UI which allowed monitoring of unsupported versions.
And unsupported, just so I get it right: this was basically again Elastic or Cassandra, where we ran versions where Cassandra and Elastic used old versions of Java that we didn't support out of the box?
Exactly. In this case it was Java 15, which isn't supported anymore because it is quite old.
And you know what this reminds me of, one more thing, sorry to interrupt, but this reminds me of a big discussion we had just not too long ago. Brian and I will have our colleagues from OpenTelemetry on one of the upcoming episodes. Fact is that we're using a lot of software from third-party vendors that might not be instrumented yet with OpenTelemetry, that are, right, using JVMs where even our product by default says we don't support it. Fortunately, we have all these hidden features where we can turn support on. But this just tells me that it's really great that we have tools like ours, and also our competition, that have been building agents over the last years and decades to really get insights into applications that are not manually instrumented yet with something like OpenTelemetry.
Because you cannot just go to Elastic and say,
I need OpenTelemetry now for this version that is five years old.
They wouldn't do it.
And what I wanted to bring up basically with this topic is
that you have to take care which containers you use.
Especially if you don't build them yourself.
You lose influence on what is actually running. For example, if we use Cassandra 3.11 and when a new Java
version comes up, they simply rebuild their Cassandra image and provide the same
tag with another Java version.
So it might depend on when you download
an image on what you're actually
running on a system.
I was going to say they might not be doing the Java
tuning if there's some new features in Java.
Andy, it's funny.
This sounds to me like a container equivalent
of copying code from Stack Overflow.
Yeah, exactly.
Here's the container popping in.
A cool thing happened with the latest Java 11 version, 11.0.14 I think. They released the Java version, and a few days after that they discovered that with the HTTP classes you couldn't connect to Google, because they changed the header handling and submitted a Host header and, I don't remember exactly, another header, and Google didn't allow that. They returned a 400 error. They had to rebuild
the whole Java from scratch to fix that really small bug. And if you have that within the container, the wrong version,
you depend on someone else who rebuilds that container
and fixes that for you.
That's fascinating.
First of all, I do hope that the Java community has added this
now as a standard test case
that they can request certain things from these domains.
But this is a very good point.
You're depending on somebody else providing you the right software.
This is also why I've seen some of our customers,
they are actually not asking for the container from you.
They're asking for the whole source code
and then being able to build these containers themselves.
I just learned this through our work that we do with Captain,
where we have some customers that just say,
hey, we don't need your images.
We just build it ourselves.
I think also for this reason.
And also they then have their own scans, making sure that it all aligns with the policies that they have.
It sounds like
the use for containers would be
if you download the pre-built container,
that's good for prototyping.
But once you're set on it,
build your own container then.
Does that sound like it would be a smart way to go?
I mean, if you wanted to use the preset one, right?
Like, let's see if this is even going to be feasible.
Download that and then...
The pre-built container image is a great starting point. You get stuff running really fast, but it has potential risks. Because you also have to take care of which Linux operating system is running in it, and whether the bugs are fixed, security holes fixed, all this stuff. You have to track all this and deploy all updates to your own environment and load test with them. And everything we are doing in our use cases with 24/7 tests, you always have to do that with containers built by someone else as well.
Wow.
Klaus, it's amazing how fast
time flies. I wanted
to make sure we didn't forget anything, though.
Is there anything else where you say,
hey, this is another lesson learned that I want
to make sure that
people know? Did we cover pretty much
everything?
The one thing you mentioned before
we talked about the containers,
the GC settings, for example,
which can cause quite some troubles in production
when you don't load test the behavior up front.
Because we ran into issues with our ActiveGate,
where we use CMS garbage collection.
And depending on how much CPU you assign to a container,
request and limit, you get different amounts of memory,
for example, for Eden space.
In our load tests, we had issues
because the active gate
couldn't cope with the traffic
as soon as we increased it.
And we ran into
garbage collection issues
and we had to dig into that
to find a root cause for it.
Because with a similar amount of memory, for example four gigabytes, on our SaaS offerings we have a certain amount of Eden space, and we have our recommendations on the sizing of an ActiveGate. The sizing was actually fine, but what we didn't take into account was that there were limits set on the container which restricted the Eden space. Because if you look at the Java source code, you can find that the Eden space is calculated based on some settings which influence it, and one of them is how many garbage collection threads are actually used, and that depends on how many CPUs are available to the container. And as soon as we increased the CPU limits, those issues were gone. Because it makes a huge difference whether you assign one CPU or 1.1 CPUs: you get twice as much Eden space for the objects which are only short-lived.
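A minimal way to see what Klaus describes is to ask the JVM what it actually ended up with. A sketch; the Eden pool name varies by collector, for example "Par Eden Space" under CMS and "G1 Eden Space" under G1:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class EdenReport {
    public static void main(String[] args) {
        // In a container (with container awareness enabled) this reflects the CPU
        // limit, and it drives the default number of parallel GC threads.
        System.out.println("CPUs visible to the JVM: "
            + Runtime.getRuntime().availableProcessors());
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Eden")) {
                long max = pool.getUsage().getMax(); // -1 means undefined for this pool
                System.out.println(pool.getName() + " max: "
                    + (max < 0 ? "undefined" : (max / (1024 * 1024)) + " MiB"));
            }
        }
    }
}
```

Running this under a 1-CPU limit and then a 1.1-CPU limit should show the jump, because 1.1 CPUs round up to two visible processors and, with that, more GC threads.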
Just need to take notes here.
And this is, I mean, I remember we had podcasts in the past, or I wrote blogs about this: that the JVM looks at all of these settings and with that decides how many threads you have.
But it's just so amazing that we have these layers and layers of the runtime.
And then there's so many tweaks we can do and have to do.
And then sometimes we don't know where we have to set things.
I remember that Henrik Rexed, our colleague, has been doing a couple of sessions on his Is It Observable channel about setting proper resource limits on pods.
Because if you don't do them, first of all, you don't know what you really get.
But you can also do...
It's just that he keeps reminding people how important these settings are, not only for the pod itself but for everything that runs inside it, like your JVM.
I was going to say, it sounds like you have to game the system.
Right, Klaus? You said
going to 1.1 CPU
gives you that extra. You don't have to use a whole 2 CPU
so you can cut down your cost by using 1.1
but then you get the Eden space and
it's...
I don't know. It's absurd in a way: like, how do we tweak just enough to get that next level? I don't know.
It's crazy, crazy, crazy stuff.
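To Henrik's point about limits: you can also read back, from inside the pod, what CPU budget the container actually got. A sketch for cgroup v1 paths; cgroup v2 exposes /sys/fs/cgroup/cpu.max instead:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class CpuLimitCheck {
    public static void main(String[] args) throws Exception {
        // cgroup v1 files; a quota of -1 means "no limit set".
        long quota = Long.parseLong(
            Files.readString(Path.of("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")).trim());
        long period = Long.parseLong(
            Files.readString(Path.of("/sys/fs/cgroup/cpu/cpu.cfs_period_us")).trim());
        if (quota < 0) {
            System.out.println("No CPU limit set on this container");
        } else {
            System.out.printf("CPU limit: %.2f cores%n", (double) quota / period);
        }
    }
}
```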
And there's, again, the issue that if you build the containers yourself, you have more influence over the actual heap settings, memory settings, and the garbage collector used; with pre-built container images, you possibly cannot influence that.
This is a great trivia question, just writing this down.
It would be a trivia question for a pub quiz or maybe like,
what's the show, Who Wants to Be a Millionaire?
It's the million-dollar question.
When you are changing your resource limits
from one CPU to 1.1,
how much more Eden heap space do you get?
I bet you that's what a lot of tech interviews feel like.
You're on Who Wants to Be a Millionaire with no phone-a-friend lifeline.
Then there's the question of which Java version you're on, which might influence that as well.
And container awareness enabled or disabled makes a big difference.
Yeah.
Klaus, thank you so much for reminding us, us as Brian and myself,
us as the performance, the global performance engineers that are listening,
hopefully, to this podcast,
reminding us that performance engineering is not dead with the cloud.
Performance engineering is more important than ever.
And it's more challenging, but also more interesting, I think,
because there's so many cool new things to play around with.
But the impact is just phenomenal, because you make sure that our software runs perfectly and efficiently on the cloud, to make sure we actually make money in the end and make our end users happy. Thanks for that.
Yeah. And I think, just to reiterate again,
the complexity of performance, this just highlights how complex it really is. Now moving to
different cloud providers.
But also, I think what you just highlighted with all this,
if we go back to that idea of moving to serverless, Andy, this is almost a case against serverless, because only if the cloud providers do all this stuff fantastically will you get the performance you need out of it.
But that is a leap of faith.
You have to have faith that the cloud providers
are looking at all these things, are tweaking these,
are considering all these different options.
And maybe not just if you, let's say,
you're going to use some sort of a serverless Cassandra.
Are they using something with some weird container?
And as we all know,
faith in companies to do these things
can be very dangerous
because we don't know what they're doing.
You don't know what's behind the scenes.
So at least, you know,
you have the controls if you're doing it yourself.
But on the other side,
if you're going to do all these things yourself
with Kubernetes,
then you have to have the expertise like you have
to know how to look at these different things.
And so there's, you know, there's a balance between what you have available,
what can do, what control you give up,
what you have to hope and pray that they're going to do it right
and you'll get that performance
and you're not just going to be resorting
to spending more cloud spend
to get the performance that you need.
It's just such a complex...
Performance gets more and more complex, it seems.
Or at least the more aware of it, I should say.
The more aware of it we are, the more complex it gets.
I think it's probably always been complex.
But now that we're...
All these things are being brought up to the surface so much more.
Just shows you it's a never-ending thing and I don't see how...
Yeah, I don't see performance engineering going away any time soon.
But we may give it fancier names, right?
Yeah.
That's the only thing.
From performance tester to performance engineer to SRE to...
Klaus, what is your official title again?
I'm a senior software developer.
See, there you go.
That's the new performance engineer name: senior software developer.
But you know what, that's not a joke, and now we're wrapping up, but that should always be at the forefront of developers' minds, right? The performance. Obviously, you can't do all
jobs at once, but
developing with performance in mind is a key
aspect. Anyway, really appreciate you
being on today. It's been a pleasure. It's been a fascinating
topic.
Just, yeah, thank you. Thanks a ton. My mind is blown today, so thank you.
Thanks for the invitation, Brian and Andy. It was a pleasure to talk to you.
And it's great that you took the leap of faith, as Brian just said, because I know you said in the very beginning it's a new challenge for you to speak on a podcast, and also it's not your native language that we all speak here. At least for Brian it is. Well, kind of English, in Brian's case. So thank you so much for doing this.
Welcome.
Thank you, everyone, for listening.
Have a great day, everyone.
Bye-bye.
Thank you.
Bye-bye.
You too.
Bye.