PurePerformance - 035 When Multi-Threading, Micro Services and Garbage Collection Turn Sour
Episode Date: May 8, 2017

For our one year anniversary episode, we go “back to basics”, or, better said, “back to problem patterns”. We picked three patterns that have come up frequently in recent “Share Your PurePath” sessions from our global user base and try to give some advice on how to identify, analyze and mitigate them:

· Bad Multi-threading: Multi-threading is not a bad thing – but if done wrong it doesn’t allow your application to scale. We discuss key server metrics and how to correctly read multi-threaded asynchronous PurePaths. Also see the following blog: https://www.dynatrace.com/blog/how-to-analyze-problems-in-multi-threaded-applications/

· When Micro Services become Nano Services: This was inspired by a blog from Steven Ledoux ( https://www.dynatrace.com/blog/micro-services-when-micro-becomes-nano/ ). It's important to keep a constant eye on your micro-service architecture to avoid too tightly coupled or too fine grained architectures.

· Garbage Collection Impact: GC is important, but bad memory management and heavy GC can potentially impact your critical transactions. We discuss different approaches on how to correctly measure the impact of garbage collection suspension. If you want to learn more, check out the Java Memory Management section of our online performance book: https://www.dynatrace.com/resources/ebooks/javabook/impact-of-garbage-collection-on-performance/
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance.
My name is Brian Wilson and as always I have with me my co-host Andy Grabner
for a subdued episode of Pure Performance. Hello Andy.
Hello Brian. You have a very interesting, subtle voice today.
Well, we're both kind of feeling a little slow today,
so I figured we can have more of a public radio sort of version of our podcast today.
Yeah. Why are you slow today? What's wrong?
Oh, it's just been a long day already and, you know, stuff with my child
and just a lot of work coming in
and feeling overwhelmed, Andy, overwhelmed by life.
But that is the way it goes, or "so it goes," as Kurt Vonnegut used to say.
How about you?
You seem very, when I contacted you today, you seemed a little subdued.
Well, it might be related with if I look out the window and it is April 6th today and it's totally foggy in Boston and it's very strange.
It just feels like November and it feels like I have to go to sleep.
Maybe that's part of the reason.
But no, other than that, I think I'm excited about what we do today because I think this
time we do not have a guest of honor.
Well, you are my guest and I'm your guest, kind of, I would say, in this case.
And we thought, what are we going to talk about today?
We're going to, you know, way back when we first started this podcast, just about a year
ago, if you believe it or not, it was May of last year.
And I believe this one is airing in May.
So we're about somewhere at our one-year anniversary.
We had talked about some common problem patterns.
We did an episode on common Java problem patterns and common .NET problem patterns.
And, you know, it's always very – there's a lot of them.
Not a million of them, right?
But there's a finite set of very common problem patterns, as you've seen over and over again, and we've all seen over and over again.
But in those early podcasts, we can only cover so many of them.
So we figured, you know what, it's about time to get back to some of the common problem patterns.
And I think, you know, we have some interesting ones today because with the changing architecture that's been going on,
we're seeing some of what we might call new problem patterns, but we might even call them the new old problem patterns.
So we're going to touch on a couple of those today and continue from there.
Yeah, cool. What do you want to get started with?
Well, let's see. It's not like we
didn't discuss this before, so we'll pretend, I don't know. Let's, uh, you know, the one thing
I've been hearing quite a lot about is multi-threading, and, you know, we have, you know, we're not going to get too deep into it, but a lot of people are probably familiar with, I believe it's Akka, right?
There's a lot of very highly asynchronous, multi-threaded applications out there that are very efficient and very awesome and very confusing
when you try to look at what they're doing.
But multi-threading in itself is not bad, correct, Andy?
No, not at all.
And I think we also need to differentiate a little bit between multi-threading and, I think, other approaches where you talk about event-driven, where you then have multiple threads taking on the work, basically really constantly working and nobody's waiting on them per se, but then there are going to be callbacks where you just call the next chain of events. So I think this is also one thing to understand.
There's a difference in just spawning threads
and doing something asynchronously
versus some of the frameworks
that have really been optimized
for really squeezing the best out of multiple threads
that you have available
and just coming up with new development models
and making, hopefully, code more efficient
and execution more efficient, right?
Right, and I think it goes without saying, too,
that even with the, if we're not talking about
the event-driven, highly optimized versions,
when you're talking, if we're just saying
transaction comes in and spawns multiple threads
for some asynchronous activity,
that's not necessarily even bad on its own as well but there are dangers and pitfalls that can arise
And I guess, from what you've been seeing a lot in your Share Your PurePath program, which, by the way, shameless plug here, Andy, runs as part of our free trial program, so that when you download our AppMon product you can share your PurePaths with him and he'll take a look at them, which is where he sees all these problems over and over again, you're seeing some patterns when it comes to multi-threading, what these problems are that we come across. So let us discuss what that is then.
Yeah, so first of all, I'm not actively looking for these. I'm not going in and saying, does this have the multi-threading issue? I typically start with, you know, do we have a performance problem, do we have a resource issue? And typically you see this by load on the system going up. So I'm looking at the incoming number of transactions, and then I'm looking at response time. And typically what we see is that at some point in time response time simply goes through the roof, but at the same time we see that these transactions are actually not doing a whole lot. They're mainly spending time either waiting on something else, or they are actually not doing anything other than, I mean, not actively waiting on an object that is then woken up by a framework, but actually waiting on the availability of a background thread that they require. So what I've seen more and more, especially with frameworks that make this kind of asynchronous programming easier, where you can create a work item and put it in a queue, and then some background thread picks it up, which is great for developers. But what I've been seeing is that sometimes people are misusing that. So one blog post that I just have in front of me, and you can read it if you want, it's on the Dynatrace blog and we'll link to it.
Yes, yeah.
It's detecting the N plus 1 asynchronous thread problem pattern.
Did you just say N plus 1?
I know I said N plus 1.
It's always N plus 1, right?
No matter where we go.
But that's not the heart of it, right?
Continue.
I just wanted to acknowledge that we did say N plus 1 yet once again.
But please, people.
Yeah.
So, but in this case, and thanks to our colleagues in Gdansk, they shared this PurePath with me because they've been using our product to actually analyze the work that they are doing.
And then they sent me a pure path, and they had a problem that they wanted to solve, which was some of their reports were just running too slow.
And the reports were querying a lot of data from the database and then doing, you know, some calculation. And so their developers thought, well, if we have to go off to the database multiple times to query data, why not just, asynchronous, I mean, parallelize the whole thing? And so the loop that we have right now, where we crunch through the elements, we just spawn multiple threads for every element that we are iterating through in the loop, and then multiple threads will take care of it much faster.
And then at the end, we just wait for the result
and everything will be fine and faster.
So classical, I have a problem divided into smaller pieces,
put it into background threads.
Now, this worked well in a development environment, right?
Because if you're the only one on the system, that's not an issue.
But what this really meant is that for every incoming thread that was working on that report request, you had two, five, 10, 50 threads being spawned additionally depending on how many data items they had to crunch through.
And that's the classical N plus one query problem again. You have one main thread, and then for every data item that this main thread tries to process asynchronously, you have another thread, another worker thread, an asynchronous worker thread.
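To make the pattern concrete, here is a minimal Java sketch of the kind of code Andy is describing; the class and method names (ReportBuilder, queryDatabase and so on) are invented for illustration and are not from the customer's actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class ReportBuilder {

    // Anti-pattern: one new thread per data item, per incoming request.
    // Under load, N concurrent report requests with M items each try to run
    // roughly N * M threads at once -- the "N+1 asynchronous thread" problem.
    // (A shared, bounded ExecutorService would avoid the per-item thread,
    // but the point here is recognizing the pattern.)
    List<Result> buildReport(List<Item> items) throws InterruptedException {
        List<Result> results = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();

        for (Item item : items) {
            Thread worker = new Thread(() -> {
                Result r = queryDatabase(item);      // remote call per item
                synchronized (results) {
                    results.add(r);
                }
            });
            worker.start();                          // extra thread per item
            workers.add(worker);
        }

        for (Thread worker : workers) {
            worker.join();                           // main thread just waits
        }
        return results;
    }

    Result queryDatabase(Item item) { return new Result(); } // stand-in for the real query

    static class Item {}
    static class Result {}
}
```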
And obviously, there's two problems with this.
First of all, it is a load-related problem.
That means the more load you have on the system, at some point you run out of threads if every incoming thread consumes more than one. So that means you cannot scale. But even worse, or actually in combination with this, this is also a data-driven problem, because what if you don't have to crunch through 10 items but a thousand items? Do you then automatically spawn a thousand threads, and do you have a thousand threads available?
And I can imagine if you have to spawn a thousand threads, you would also really need the memory for those thousand threads as well.
Yeah, or you run into an exception where the JVM probably tells you, well, I cannot give you these threads, or at some point maybe even the operating system tells you it's not possible anymore. So definitely you run into resource issues. Now, the intent is understandable, why you want to parallelize some of this heavy lifting activity, but done this way it's not good. And the way I see these problems now, when I analyze the data, I look at these long-running PurePaths, and it's also explained in the blog. What I like about the PurePath is not only the end-to-end traceability, but we also show you which thread is actually executing a certain method, so I can see the thread switches when it switches from one thread to another. And then there's another column, which we call the elapsed time. It's like a timestamp. It basically tells me when this method was executed, the next, the next, and it's always relative to the entry point of the transaction.
So the entry point is zero, right?
Timestamp zero.
And there you can easily always see, hey, it's interesting.
We are trying to make a call from the main thread and then passing some work to the background thread.
And there I have a gap from one second, five seconds, ten seconds.
So that means the system or the main thread is just waiting for the work to be picked up by the next thread.
And the question is, why is that?
And typically the answer is, well, there are simply none of these background threads currently available, because they are busy, because they're used by other main threads that are trying to do a similar thing.
Right. What strikes me about this one is that it ties very directly to another common problem pattern we spoke of in that first or second episode.
Another threading issue?
I think threading issues all probably come from a very similar problem,
but it makes sense, right, that when we were looking at time spent
on the web server before getting to the JVM or the app or whatever,
where you could say, all right, we're trying to make a synchronous call
back to the JVM,
and there's a long wait time before it gets to that.
And again, you could see that from that elapsed time on the thread.
But again, it's just not enough threads.
Yeah, and so what I typically do now, I think, is a lesson learned from analyzing these problems.
So I always look at, first of all, number of incoming requests on the front, right?
How many requests are coming in on Tomcat, then?
How many threads are active?
And I'll try to actually calculate a ratio.
So how many threads are active per certain transaction type?
Because then you can immediately see, wow, this transaction,
they only need one thread, so everything is synchronous, no problem.
Oh, this transaction is consuming three threads.
This one, 50 threads.
So I kind of look at this and then better understand the real performance
and resource characteristic of an implementation.
So this is one good thing that I learned.
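As a rough illustration of that ratio (this is plain arithmetic, not a Dynatrace API; the transaction names and numbers are made up):

```java
import java.util.Map;

public class ThreadRatio {
    // Active worker threads divided by incoming requests, per transaction type.
    // A value near 1 suggests mostly synchronous work; 3, 10 or 50 means every
    // request fans out into that many background threads.
    static double threadsPerRequest(long activeThreads, long incomingRequests) {
        return incomingRequests == 0 ? 0.0 : (double) activeThreads / incomingRequests;
    }

    public static void main(String[] args) {
        Map<String, long[]> samples = Map.of(
                "/report", new long[]{480, 12},   // {active threads, requests}
                "/login",  new long[]{25, 25});
        samples.forEach((tx, v) -> System.out.printf(
                "%s -> %.1f threads per request%n", tx, threadsPerRequest(v[0], v[1])));
    }
}
```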
Also, another thing that I learned,
if you have a transaction that is heavily using multiple threads,
then what tools like Dynatrace provide
is it gives you the total execution time of all the threads.
Because even though, let's assume you have endless threads available and everything is fast and they never run into a threading issue, eventually your transaction is consuming resources, CPU cycles, on the different threads. So, in order not to be fooled by, hey, this transaction is so fast, it only takes 100 milliseconds. But if I know that it takes 100 milliseconds on the main thread and it also consumes 100 milliseconds on 50 other threads, that multiplies up to five seconds plus 100 milliseconds. That's why, when you try to optimize multi-threaded applications,
then you always need to look at the total execution time of all the threads that are involved.
And then especially track this over time because you want to know when a code change actually changes the dynamic behavior. And even though the performance perceived by the end user is still good or maybe even better,
it may be that you're now involving so many more background threads
and you're in the end consuming a lot of more resources.
Right. And just to give a kind of a pro tip to anybody who uses Dynatrace who might be listening,
when you're looking at PurePaths, or pure PATHS, I put the accent on the wrong syllable, that's the Christopher Walken way of saying it.
When you're looking at PurePaths, you have three different times you can look at, right?
And they all reveal different things about these patterns that you're talking about.
Your response time is your basic time on the transaction, right? Your response time, let's say it's 50 milliseconds. That's how long it took for the entry point method to complete, right? If it spawns off any asynchronous threads, though, that is not covered by that,
right? Your execution sum for a pure path is more along the lines of what you're talking about, right? Where it's a total of the entry point plus the sum of all the different
threads,
even if they're running in parallel,
right?
So it's not,
it doesn't reflect a true time that anybody or any of the systems are being
parallelly impacted by,
but it's the amount of time that transaction consumes.
And then the third one we have is duration, right? And duration is going to cover the entry point of that method, plus your async
times for, but it's not going to cover parallel threads, right? So if you have two asynchronous
threads fire off at the same time, the duration is going to include the length of the longest one.
Exactly. The duration basically is from when the transaction initially hits your JVM, for instance, until the last asynchronous thread is really done. And it's actually great that you bring this up, because it's sometimes misleading if people only look at response time
and say, well, look at this.
This code is super fast.
It responds in 50 milliseconds.
But then duration all of a sudden says, well, a second, two seconds, five seconds.
Ten minutes.
Ten minutes, yeah.
And basically what that means, the initial request comes in.
It's super fast.
It responds back to the end user.
But what this request actually does, it spawns off something asynchronously
that then goes off and does something.
But to the end user, the transaction might be completed in 50 milliseconds.
But on the system itself, certain things are happening.
So that's why, very good, Brian.
So just a little pro tip to any users out there, to pepper that in, in case you haven't looked at that.
Yeah, so response time, this is the time perceived by whoever initiated the transaction.
Duration is how long does it really take from end to end, including all the asynchronous activity.
And then execution time or total execution time includes all the time of all the threads combined. So we can actually see the real resource footprint of that particular
transaction.
Yeah.
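As a toy illustration of those three numbers (the sleeps, the pool size and the names are invented; in practice an agent measures this rather than the application timing itself):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TimingExample {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        long start = System.currentTimeMillis();

        // The entry point does ~50 ms of work, fires two async tasks, and responds.
        sleep(50);
        CompletableFuture<Void> a = CompletableFuture.runAsync(() -> sleep(400), pool);
        CompletableFuture<Void> b = CompletableFuture.runAsync(() -> sleep(600), pool);
        long responseTime = System.currentTimeMillis() - start;   // ~50 ms (caller's view)

        CompletableFuture.allOf(a, b).join();
        long duration = System.currentTimeMillis() - start;       // ~650 ms (longest async path)

        long executionSum = 50 + 400 + 600;                       // ~1050 ms (all threads added up)

        System.out.printf("response=%d ms, duration=%d ms, executionSum=%d ms%n",
                responseTime, duration, executionSum);
        pool.shutdown();
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```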
And in terms of the,
in terms of the,
the threading,
right.
As we mentioned,
threading is multi,
you know,
multiple threading,
multi-threading is not necessarily bad,
right.
But being too heavy or having these delays
is where it does get bad.
And another thing,
so you had mentioned
some of the things that you do
to look for it manually, right?
You're looking at the number of requests,
how many threads it's spawning,
the amount of time it's being spent there.
But again, and it's funny
that it's me doing the plugs this time,
but we just want to mention that
with a lot of these common problem patterns,
if you are using Dynatrace, we will identify currently in 6.5,
we're identifying heavily asynchronous transactions.
And then you were mentioning, I believe,
there's something else coming in the next release.
Yeah, in 7.
Yeah, so what we do on a PurePath-to-PurePath basis, just what I did manually all the time, basically looking at how many threads are involved. So we tag a PurePath and say this PurePath is synchronous or this PurePath is asynchronous. And if it's asynchronous, we actually say thread heavy or thread medium. And this has been there since 6.5 and makes it easy to say, show me those PurePaths that are very heavy on threading.
But what we now have with 7, we give you this view over a timeline.
So you can say, hey, under a certain load condition, if I look at my load pattern over
the course of the day, then I can see when load goes up, I all of a sudden see the number
of pure paths that have asynchronous issues also go up.
Or if you have constant load and you are deploying a new version of your app into that environment,
if you do some type of continuous performance engineering, continuous performance testing,
and you look at Dynatrace and you can see, oh, right now 5% of my threads are heavy on asynchronous, heavy on threading.
And after the deployment, it's not 5%, it's 50%.
So immediately you know you probably introduced a regression.
And that makes it just so much easier, especially for architects.
Or for code reviews, to say, oh, we probably don't want to let this into production.
Because first we want to sit down and figure out was this intentional or not.
Great. All right, any final words on threading?
No, I think what I'd just like to add, what I always look at, is just some charts where, I think I mentioned earlier, I typically look at incoming transactions overall, then I look at the number of threads involved, and always look at this ratio.
So that'd be the total number of threads, right?
Exactly, the total number of active threads and the total number of transactions that come in. And obviously, if you have a chance and if you have a tool that can do that, then split the transactions up into something that makes more sense for you from a business, from a feature perspective, right?
Because typically in an application, you have certain requests that you see a lot, but typically very fast ones.
I don't know, static resource requests or some heartbeats, and they are typically then diluting your averages. So if you have a chance and say,
show me the number of requests that come in on that particular transaction
and how many threads are involved, that will be perfect.
And then trace this over time.
And you know, with that information you can also track how much CPU time it's using or how many memory resources it's using, and then find out, well, how much is this costing us in our cloud infrastructure that we're spinning up?
Exactly. And to give another Dynatrace tip, we would use business transactions for that, right? You create a business transaction for a particular feature, and then as a result measure we can say total CPU time,
total execution time,
total number of threads,
total number of web requests and all that stuff.
This is, yeah.
Yeah, because again,
as Andy brings up quite often,
one of the things to be looking out for,
the newest thing to look out for
is what is the cost of a feature, right?
And are you getting, and especially as we were talking to, uh,
Karenka the other day, you know, uh, maybe you put it out, you find out people like it,
and then you go ahead and optimize it. Right. And, and, and these are all ways you can,
you can do that. So, all right. So that's, um, the multi-threading, right. And, uh, we'll go
into our next problem pattern now. And this one is everyone's favorite topic.
The big buzzword microservices, right? Microservices. And I guess the, maybe the
problem pattern or whatever, it could be considered nano services, but there are,
you know, there's a whole bunch that we can talk about in microservices, but we're, you know,
limit on time. So we're going to, we're going to just start covering a couple pieces, right?
One thing to point out is a lot of the same problem patterns that we have without microservices get repeated with microservices, right?
But with that introduction, Andy, what is the microservice du jour that you want to bring up today?
Yeah, well, you know, I think, again, it comes back to, and I think you mentioned it in your intro, you took the word from the blog post.
When microservices become nano, that's actually the title of a blog post that one of our users wrote, Stephen Ledoux. He showed us, and again, you can go on the blog, and maybe we'll post it.
Yeah, we'll have it up there.
But basically, he was explaining that they were basically ripping apart the monolith
into microservices, which is obviously great, but they were going too granular. And what too granular actually meant is that they had one call, again, coming in at the front, and everything that used to be done in the monolith, maybe in the same thread
the front and everything that used to be done in the monolith, maybe in the same thread
and the same process, obviously.
Now, instead of processing all of this in the monolith in one container, they were having many, many calls going to these new services. They were so fine-grained that they simply had a cascading effect, almost, where instead of one front-end microservice calling a back-end microservice, the front-end calls the back-end microservice 20 times, and that microservice calls the next microservice another 20 times, because they were just not thinking correctly about how to structure the services and how granular they should be, or which functionality should actually be combined in one service and where you need another one. And I think this is also where it's so critical, before you sit down and do your microservice architecture, that you actually do some dependency analysis. So if you really rip your monolith apart, across maybe class lines or functions even, then first think about: what are the dependencies?
How often do these classes call each other?
How often do these functions call each other?
And are you aware of the fact that when you then make a call that you have a round trip over network that you have to marshal and unmarshal a call?
Depending on which protocols you use, it can be quite heavy.
So these are the things that we've seen.
So just a misuse, too busy, too chatty microservices.
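A hedged Java sketch of that chattiness, with a hypothetical price service standing in for the too-fine-grained dependency (the URLs and the plain-text response format are invented):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class PriceLookup {
    private final HttpClient http = HttpClient.newHttpClient();

    // Too chatty: one remote round trip (marshal, network, unmarshal) per item --
    // the N+1 pattern, now between services instead of between app and database.
    long totalChatty(List<String> itemIds) throws Exception {
        long total = 0;
        for (String id : itemIds) {
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://price-service/prices/" + id)).build();
            total += Long.parseLong(http.send(req, HttpResponse.BodyHandlers.ofString()).body());
        }
        return total;
    }

    // Coarser contract: one batch call for all items keeps the network out of the loop.
    // The service boundary stays; the chattiness goes.
    long totalBatched(List<String> itemIds) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("http://price-service/prices/total?ids=" + String.join(",", itemIds))).build();
        return Long.parseLong(http.send(req, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```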
And I think it seems like it would be something very common, right?
Because everyone, as we've discussed before,
a lot of people are still moving in this direction, right?
So everybody is falling in love with the idea of we're going to,
we're going to break up our monolithic application.
We're going to go to microservices.
We're going to do a great CICD pipeline.
We're going to, you know,
we're going to do everything DevOps and we're going to be the next,
well, they're not unicorns anymore, but stallions or horses or whatever.
Right.
We, and the,
I almost feel that internally at an organization, there's a great
risk to see how far they can break it down. Like, we could take this giant monolithic thing and
break it down to 5,000 services. Look how cool this is. They're all individual, right? We're
all individuals if you ever watched Life of Brian. But that almost feels like it's an accomplishment,
right?
How far we could break it down.
But the issue then comes in, exactly what you're saying: especially in the case of a one-to-one relationship, you went too far.
And now you're adding slowness and overhead and resource consumption to that because now you have to make a network call.
And that analysis. Yeah, that's very important.
I mean, again, you can maybe look at your code,
you can look at some of these things and get an idea,
but I think the next important step is then to properly,
when you make your first attempt at your break apart,
when you first break it apart and start testing it through,
take a look at those dependencies at that point as well
and monitor the calls between them and everything else.
Yeah, go on.
Yeah.
No, and maybe again, you know, we have to give another, I mean, the way I would do this,
if I have a tool like Dynatrace available, you know, obviously the PurePath gives you
a great overview to see which method and which component calls which other component.
But in this case, I would actually think the sequence diagram that we have in dynatrace is actually
really great because it shows you very nicely how components communicate with each other and
can be a good indicator on which components are more isolated already and therefore are better candidates of ripping them out of the monolith
and putting it into a service.
Right.
So that's one thing to keep in mind.
So do your dependency analysis before you do anything.
Like just building microservices just because you can build microservices, it doesn't make sense.
Right.
And then you also mentioned the one-to-one relationship.
I think Martin, he wrote a blog post about this.
Right.
Where he also talks about detecting these very tightly coupled services.
And if they are really tightly coupled, then again, think about: does this really need to be an extra microservice? If it has to be, then at least be smart about the deployment of those, because if you have very tightly coupled microservices, then also please deploy them close to each other to actually avoid the overhead of potential network latency.
So that would be a bad candidate for putting one in the private cloud and the other in the public cloud.
Yeah, something like that. That's for sure. And these are all things that you have to consider, obviously, and that you can already analyze before you actually go into production. And this is something that I would hope software architects are doing, and I'm sure that most people do, but it seems it happens, right? And that's why we have these blog posts from customers to tell us about it. The world is not perfect, right? And so, yeah, this is one of the problem patterns: just too fine-grained. And then, in a way, to name it again, the N plus 1 query problem, because if you are making this so fine-grained and you have existing code that makes calls in a loop to a certain class, and this class is now a microservice, then those calls in the loop become microservice calls. This was one of my first blog posts I did on going from monolith to microservices, the N plus 1 query problem between microservice calls, the Swedish company with the search service. And so be aware of that.
It's amazing how much the N plus 1 query pops up everywhere.
Yeah.
You know, it's almost like a plague. I mean, it really is like there's no getting rid of it. It's just astounding.
Yeah, and the thing is also, I mean, I understand it too, because I think development frameworks, you know, may make it easy for developers to build stuff in a very fast
way and sometimes they hide away
complexity and these frameworks
are also not always perfect and also
sometimes generic.
And if you use it in a generic
way and you don't think about how
to adapt them and configure them
for your use case, then you just end up with
things like that.
We used to talk about hibernate
over the last couple of years, and it's the same. Hibernate is a great framework, but if you don't configure it and adapt it to your use cases, then it can do some horrible things.
Yeah. And the one warning always with the N plus 1 query is, you know, you might even look at a transaction and say, hey, this transaction only takes, you know, 200 milliseconds to execute from end to end. I don't care that there's an N plus 1 query. And in fact, my N plus 1, either query or service, is only contributing 20 milliseconds to the whole thing, right? But don't let that fool you, because there's a lot more than just the amount of time, right? There's the amount of threads, there's the amount of connections, there are the bytes. There are also the unexpected changes in traffic or changes in usage that can suddenly blow that up on you. So, you know, there was a really great example, I wish I remembered it as we were talking about it, I remember hearing about it with one of our customers, where they had one of those exact situations where it was so minor, and then the next day it blew up on them for a completely unpredictable reason. But just because it existed, it was a liability. So I would just say don't ignore the N plus 1 query or the N plus 1 pattern problem.
Yeah. And I just, you know, I always like analogies. The analogy with microservices that always comes to mind for me is if you go to a supermarket,
and if you go to, you know, you have your basket of items and then you go to the checkout
and then let's say you have five items. And if the cashier, if, if he's a service and the only
thing he does well is basically putting the final price into the system and then giving it a total.
But every time he needs to get the price for a single item,
he needs to call another service to tell him the price.
So in this case, I give him my first item.
Then he needs to walk over to the person that tells him the price,
comes back, puts the price in, and then I give him the second item.
That's basically kind of the analogy for me.
So the question is how fine granular do we need to have these services?
Do we really always need to go to that other service to get a certain part of information?
And also how far away should this person be or this service be to optimize the paths?
That's the end, what we have to care about.
You know, I used to work at a supermarket.
Oh yeah.
Yeah.
Yeah.
Yeah.
And my favorite was always having to do a price check.
That's always the worst, because then suddenly everybody in line groans, but that's what your customers are going to do if you have to do this piece, right? They're going to groan.
Yeah. But the cool thing with this is, right, if you see the cashier as a microservice, you can scale it out by just adding more cashiers to it.
That's the way, that's the beauty of it.
All right.
We're at about 30 minutes.
Do we want to go ahead and tackle the last one we were going to?
Sure, why not?
Sure.
Okay, so we had one more
problem pattern we wanted to discuss today. We are not going to open up an entire Pandora's box, because we're going to discuss a little bit about memory, right? And obviously, as soon as we say memory, there are about, you know, a million things we can talk about, and very, very in-depth. Memory is very, very complex, and it's very scary. Anytime you see a memory problem,
that's when I think everybody kind of starts shaking a little bit because they can just be
so painful. But today we're going to take a look at one that's hopefully not as painful,
not too painful. We're going to take a look at garbage collection, right? And we all know
garbage collection is necessary. It has a great function. It's part of life, right? It's,
it's, it's like breathing. It's an application breathing and has to have, or I should say,
really, it's kind of like going to the bathroom for the application or that's why they call it
a dump, right? Um, we have all the fun terminology, right? But the point is garbage collection
happens, right? And no matter what, there's going to be some kind of impact to your application.
And again, that's the cost of running an application that does garbage collection.
The problem is when that has an impact on your application that is beyond,
or that's a negative impact that you can feel and, you know, or your customers can feel.
Right?
So let's get into a little bit of GC.
Yeah. Well, yeah, GC, or I think what we call it, suspension, because it's really when the garbage collector kicks in and actually suspends the JVM, or the runtime, and actually suspends your current critical transactions that are executing. And that's, I think, what you're trying to get to: garbage collection itself has to run, and it will obviously, when it has major GCs, block your JVM for a while. The question is how long these blocks are and who they are impacting, and maybe they are impacting transactions where you have SLAs, right?
And you have a certain service level agreement or you know if it goes beyond a certain threshold,
then you are going to make your users unhappy or whoever else unhappy.
So that's why what we always look at is suspension time impact on transaction response time.
So in our terminology, from a Dynatrace perspective,
we always talk about runtime suspension and the portion of time it contributes
to the overall response time of a transaction.
Interestingly enough, let me cut you off there,
because just before we were talking about response time, duration, and execution time total, right? And I believe
when we look at this, we're talking about the suspension impact on the duration.
Exactly. But I mean, it's the suspension duration on everything that happens within the execution of a transaction, that's true.
Yeah. I mean, it's not clear which one of those we're looking at when we take a look at it from that point of view.
Yeah.
Well, I think it will impact response time and duration.
Of course.
Of course.
Yeah.
And so I think the thing that I'm always looking at is when we are analyzing performance-related problems due to garbage collection,
we always look at the percentage of time that the runtime suspends the threads
and which threads are currently executing business-critical features
and how long they are actually suspended and what the percentage is
of that suspension time to the overall execution time of the transaction.
And so I typically put graphs up when I analyze load tests or systems in production,
and I look at overall response time, and then I'll look at response time without garbage collection impact, right? Because this is actually then the ratio. And what I like, again, I'll be doing a lot of plugs today with Dynatrace, but in Dynatrace you can also do a ratio with a ratio measure.
I'm just going to say, Reinhardt and I do that a lot too in our things.
Exactly, yeah. So the ratio measure is great because it calculates the ratio between the response time and the response time without garbage collection, and so you can actually see what the real percentage impact really is, right? And if you chart this over time, then you can actually see under which load you hit a certain threshold where garbage collection really becomes a problem for you.
Also, at which time of the day,
when you have certain load patterns,
you may hit a certain threshold.
Or if the code changed.
It's a code change, yeah, exactly.
So if you are deploying a new version
and then all of a sudden you see memory behavior changes
and now you have more impact.
And then obviously the remediating actions can be different ones, right?
It could be your memory behavior changes for a good reason because you changed some code and it just requires more memory.
But maybe you forgot about to adapt your GC settings, your memory settings, or maybe you really have memory issues
where you are simply consuming too much memory,
causing too many garbage collection runs in the end.
And then you obviously need to do your memory diagnostics
and figure out how to be more efficient with memory usage.
Maybe you're keeping those threads alive too long.
You're keeping the threads alive too long, yeah.
So the interesting thing about this one, this is a pretty difficult one to look at, right? Because this one requires using JVMTI, correct? Well, in order to see, yeah, I mean, we could see a GC run, we could see, you know, some of those pieces, but some of the details...
Yeah, so I think what you're getting to is that there are different approaches on how to measure this. Let's say there are different approaches on how to instrument JVMs and CLRs. And for Java, there are two options, right? You have a Java agent, which is convenient and easy. The only problem with the Java-based agent is that it runs within the JVM. So the code that the tool uses to actually analyze performance is also suspended if the garbage collector kicks in.
And therefore, this tool doesn't even know that a garbage collection just happened.
And there's the other approach that has been around for a little longer, which is using the native interfaces of the JVM, the JVM TI, the tooling interface, where you kind of sit outside of the runtime and therefore actually understand when the GC is triggered, how long it takes, and also who is impacted by it. And this is then the approach where you can really figure out
what is the real impact of a GC suspension on which threads.
And if you know the threads and what they're executing,
then you can also assign it to your business features.
And you can say, oh, this feature was impacted by 50 milliseconds,
which is 60% of our overall response time.
And it's a critical feature, and that's not acceptable.
Right.
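The arithmetic behind that percentage is straightforward; a minimal sketch, assuming your monitoring tool already gives you the suspension time per transaction:

```java
public class GcImpact {
    // Share of a transaction's response time that was spent suspended by the GC.
    static double suspensionPercent(double responseTimeMs, double suspensionMs) {
        return responseTimeMs == 0 ? 0.0 : (suspensionMs / responseTimeMs) * 100.0;
    }

    public static void main(String[] args) {
        // e.g. roughly 83 ms response time with 50 ms of suspension -> ~60% GC impact
        System.out.printf("GC impact: %.0f%%%n", suspensionPercent(83, 50));
    }
}
```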
And the interesting thing, too, is you could, with much less success,
if you're looking at some of the JMX metrics type and all,
you could see, all right, there was a garbage collection, it ran for 200 milliseconds, and it ran at 10:05 today. And at 10:05 today we also saw a spike in response time, so we'll assume that that was because of the GC, right? But that, again, is kind of, you're correlating.
Yeah, you're correlating. Yeah.
You're tying two events together that look like they're related, but they might not be.
So getting to that JVMTI tooling aspect is where you get that definitive answer.
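For reference, the JMX-level view Brian alludes to looks roughly like this, using the standard java.lang.management API. It gives aggregate GC counts and times you can correlate against response-time spikes, but not the per-transaction suspension that a JVMTI-based agent can attribute:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // JVM-wide aggregates: how often each collector ran and for how long in total.
        // Good for spotting "a GC ran around 10:05 for 200 ms", but it cannot tell you
        // which transactions were suspended by it.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, totalTime=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```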
Yeah.
So it's just – I never thought about that.
That's a good one.
That's a very good one, yeah.
If you just look at the... actually, yeah, there could be plenty of other things going on, right?
Yeah. But you also wouldn't know the impact to the business functions as well. So that JVMTI piece is really important. I know it sounds like I'm plugging, but, you know, when I learned about this piece, I think about four and a half years ago or something, or maybe five now.
It's been a while.
I just understood the criticality of it.
So I'd love to.
And by the way, I think we had, if people want to learn a little more, we have.
You did a great performance clinic on that, didn't you?
I did a performance clinic too. But what I also wanted to say, we have the book with the Java Enterprise book.
Oh, that's right.
We wrote a book a while ago.
So if you Google it on the website, it's a free book.
We'll put a link up.
We'll put a link up, yeah.
I think it was Michael Kopp back then.
He wrote several – I think he wrote all the chapters on memory management.
He explained how garbage collection works and also the impact of garbage collection. There was a phenomenal piece that he wrote. It's under, I believe, Java Enterprise Performance. I just googled for Java Enterprise Performance Dynatrace, and it's under dynatrace.com/resources/ebooks/javabook.
Yeah. So we'll put the link up, but for people who don't know the link, what did you search for? Dynatrace, Java, what did you search for?
Dynatrace Java book memory.
It comes up like eBook Dynatrace. Yeah. So exactly. Yeah.
So you'll see that there. It's definitely good.
That's where I actually started really learning about memory, from the early version of that book that was being put up there. So it's a great thing, and the nice thing is, even though it was written, you know, several years ago, memory doesn't change too much, so it holds true, right? So it's definitely still worth going into. Anything else on the memory there?
Well, I think memory, as you said in the beginning,
is a huge topic, right?
I think we should have probably other sessions where we can talk about object churning,
where we can talk about memory leaks
and different ways to analyze memory problems.
But for this particular one here,
I think it's just important to understand
that try to figure out
what the real impact of garbage collection is on your business functions
and figure out if this is load related, if it comes in through a different deployment that you had, or, you know, in case certain memory-aggressive features are executed and then impacting all the other features that are currently executing in parallel. But yeah, always try to figure out what the impact of garbage collection suspension is.
Absolutely. And for anybody listening, if there are certain aspects of memory you would like to hear us discuss, by all means, please tweet it at us. You can tweet at Pure underscore DT
or at either one of our regular Twitter handles.
So at Grabner Andy or at Emperor Wilson.
Or you can email us at pureperformance@dynatrace.com.
Any other kind of problem pattern type of things
that you may be interested in that we haven't covered yet,
please as well, let us know. And we'll, we'll, we'll dive into them.
We're always looking for finding out what is interesting to you.
I'm still sounding all public radio. I think I'll call myself silky Brian. So, you know, Andy, as always, the problem patterns are great, and I'm so glad that you see so many of them. We all see them quite often, and I'm sure a lot of our listeners run into them all the time, and hopefully we're doing a little to help you identify or be aware of some of the more common ones. Because, as you've told myself and many audiences over and over again, right, we call them common because they are common. They happen everywhere, and they happen in the best of shops and they happen in the worst of shops. There's just no getting away from them. It's kind of like mosquitoes. They're always there somewhere. So again, great stuff there. Anything that you'd like to, any final words from you, Andy?
No, not really. Just, you know, keep sending us data and keep approaching us with things that we can then take, analyze, and then bring it to a
broader audience, either through the podcast or through blog posts.
I think with that, we can all make our little contribution
in improving the engineering world.
Excellent. Well said.
Give back.
There you go. That's what it is.
And we're having our fund drive.
Yeah.
All right.
Well, with that, I'd like to thank you, Andy.
Thank you.
For doing this for me.
And, you know, again, it's about a year.
So let's wax nostalgically for half a second.
And thank you for all of our listeners who have been enabling us to do this. We do this as part of our day job.
Right.
So because we have listeners like you wonderful people, we get to continue doing this, and we really enjoy it. So just a big shout out to everyone who listens and everyone who's encouraged us to do this, especially the one, the only, Mark Tomlinson, who's the one who kept pushing the two of us to do this. So shout out to Mark. And I guess we'll give a little shout out to Steve.
Yeah, Steve is listening to us on the way into work, he told us. Yeah.
And also Brett. Yes, hi Brett.
Guest of the show.
Anyhow, and alright, well we'll talk
to everybody. We'll see you all soon
and
until next time. Ciao, ciao.
Ciao.