PurePerformance - 035 When Multi-Threading, Micro Services and Garbage Collection Turn Sour
Episode Date: May 8, 2017

For our one year anniversary episode, we go “back to basics”, or, better said, “back to problem patterns”. We picked three patterns that have come up frequently in recent “Share Your PurePath” sessions from our global user base and try to give some advice on how to identify, analyze and mitigate them:

· Bad Multi-threading: Multi-threading is not a bad thing – but if done wrong it doesn’t allow your application to scale. We discuss key server metrics and how to correctly read multi-threaded asynchronous PurePaths. Also see the following blog: https://www.dynatrace.com/blog/how-to-analyze-problems-in-multi-threaded-applications/

· When Micro Services become Nano Services: This was inspired by a blog from Steven Ledoux ( https://www.dynatrace.com/blog/micro-services-when-micro-becomes-nano/ ). It's important to keep a constant eye on your micro-service architecture to avoid too tightly coupled or too fine grained architectures.

· Garbage Collection Impact: GC is important, but bad memory management and heavy GC can potentially impact your critical transactions. We discuss different approaches on how to correctly measure the impact of garbage collection suspension. If you want to learn more, check out the Java Memory Management section of our online performance book: https://www.dynatrace.com/resources/ebooks/javabook/impact-of-garbage-collection-on-performance/
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance.
My name is Brian Wilson and as always I have with me my co-host Andy Grabner
for a subdued episode of Pure Performance. Hello Andy.
Hello Brian. You have a very interesting, subtle voice today.
Well, we're both kind of feeling a little slow today,
so I figured we can have more of a public radio sort of version of our podcast today.
Yeah. Why are you slow today? What's wrong?
Oh, it's just been a long day already and, you know, stuff with my child
and just a lot of work coming in
and feeling overwhelmed, Andy, overwhelmed by life.
But that is the way it goes, or "so it goes," as Kurt Vonnegut used to say.
How about you?
You seem very, when I contacted you today, you seemed a little subdued.
Well, it might be related with if I look out the window and it is April 6th today and it's totally foggy in Boston and it's very strange.
It just feels like November and it feels like I have to go to sleep.
Maybe that's part of the reason.
But no, other than that, I think I'm excited about what we do today because I think this
time we do not have a guest of honor.
Well, you are my guest and I'm your guest, kind of, I would say, in this case.
And we thought, what are we going to talk about today?
We're going to, you know, way back when we first started this podcast, just about a year
ago, if you believe it or not, it was May of last year.
And I believe this one is airing in May.
So we're about somewhere at our one-year anniversary.
We had talked about some common problem patterns.
We did an episode on common Java problem patterns and common .NET problem patterns.
And, you know, it's always very – there's a lot of them.
Not a million of them, right?
But there's a finite set of very common problem patterns, as you've seen over and over again, and we've all seen over and over again.
But in those early podcasts, we can only cover so many of them.
So we figured, you know what, it's about time to get back to some of the common problem patterns.
And I think, you know, we have some interesting ones today because with the changing architecture that's been going on,
we're seeing some of what we might call new problem patterns, but we might even call them the new old problem patterns.
So we're going to touch on a couple of those today and continue from there.
Yeah, cool. What do you want to get started with?
Well, let's see. It's not like we
didn't discuss this before, so we'll pretend, I don't know. Let's, uh, you know, the one thing
I've been hearing quite a lot about is multi-threading, and, you know, we have, you know, we're not going to get too deep into it, but a lot of people are probably familiar with, I believe it's Akka, right?
There's a lot of very highly asynchronous, multi-threaded applications out there that are very efficient and very awesome and very confusing
when you try to look at what they're doing.
But multi-threading in itself is not bad, correct, Andy?
No, not at all.
And I think we also need to differentiate a little bit between multi-threading and, I think, other approaches where you talk about event-driven, where you then have multiple threads taking on the work, basically really constantly working and nobody's waiting on them per se, but then there are going to be callbacks where you just call the next chain of events. So I think this is also one thing to understand.
There's a difference in just spawning threads
and doing something asynchronously
versus some of the frameworks
that have really been optimized
for really squeezing the best out of multiple threads
that you have available
and just coming up with new development models
and making, hopefully, code more efficient
and execution more efficient, right?
Right, and I think it goes without saying, too,
that even with the, if we're not talking about
the event-driven, highly optimized versions,
when you're talking, if we're just saying
transaction comes in and spawns multiple threads
for some asynchronous activity,
that's not necessarily even bad on its own as well but there are dangers and pitfalls that can arise
And I guess, from what you've been seeing a lot in your Share Your PurePath program, which, by the way, shameless plug here, Andy, runs as part of our free trial program, so that when you download our AppMon product you can share your PurePaths with him and he'll take a look at them, which is where he sees all these problems over and over again, you're seeing some patterns when it comes to multi-threading, what these problems are that we come across. So let us discuss what that is then.
Yeah, so first of all, I'm not actively looking for these. I'm not going in and saying, does this have the multi-threading issue? I typically start with, you know, do we have a performance problem, do we have a resource issue? And typically you see this by load on the system going up. So I'm looking at the incoming number of transactions, and then I'm looking at response time. And typically what we see is that at some point in time response time simply goes through the roof, but at the same time we see that these transactions are actually not doing a whole lot. They're mainly spending time either waiting on something else, or they are actually not doing anything other than, I mean, not actively waiting on an object that is then woken up by a framework, but actually waiting on the availability of a background thread that they require. So what I've seen more and more, especially with frameworks that make this kind of asynchronous programming easier, where you can create a work item and put it in a queue, and then some background thread picks it up, which is great for developers. But what I've been seeing is that sometimes people are misusing that. So one blog post that I just have in front of me, and you can read it if you want, it's on the Dynatrace blog and we'll link to it.
Yes, yeah.
It's detecting the N plus 1 asynchronous thread problem pattern.
Did you just say N plus 1?
I know I said N plus 1.
It's always N plus 1, right?
No matter where we go.
But that's not the heart of it, right?
Continue.
I just wanted to acknowledge that we did say N plus 1 yet once again.
But please, people.
Yeah.
So, but in this case, and thanks to our colleagues in Gdansk, they shared this PurePath with me because they've been using our product to actually analyze the work that they are doing.
And then they sent me a pure path, and they had a problem that they wanted to solve, which was some of their reports were just running too slow.
And the reports were querying a lot of data from the database and then doing, you know, some calculation. And so their developers thought, well, if we have to go off to the database multiple times to query data, why not just, asynchronous, I mean, parallelize the whole thing? And so the loop that we have right now, where we crunch through the elements, we just spawn multiple threads for every element that we are iterating through in the loop, and then multiple threads will take care of it much faster.
And then at the end, we just wait for the result
and everything will be fine and faster.
So classical, I have a problem divided into smaller pieces,
put it into background threads.
Now, this worked well in a development environment, right?
Because if you're the only one on the system, that's not an issue.
But what this really meant is that for every incoming thread that was working on that report request, you had two, five, 10, 50 threads being spawned additionally depending on how many data items they had to crunch through.
And that's the classical N plus one query problem again. You have one main thread, and then for every data item that this main thread tries to process asynchronously, you have another thread, another worker thread, an asynchronous worker thread.
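To make the pattern concrete, here is a minimal Java sketch of the kind of code Andy is describing; the class and method names (ReportBuilder, queryDatabase and so on) are invented for illustration and are not from the customer's actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class ReportBuilder {

    // Anti-pattern: one new thread per data item, per incoming request.
    // Under load, N concurrent report requests with M items each try to run
    // roughly N * M threads at once -- the "N+1 asynchronous thread" problem.
    // (A shared, bounded ExecutorService would avoid the per-item thread,
    // but the point here is recognizing the pattern.)
    List<Result> buildReport(List<Item> items) throws InterruptedException {
        List<Result> results = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();

        for (Item item : items) {
            Thread worker = new Thread(() -> {
                Result r = queryDatabase(item);      // remote call per item
                synchronized (results) {
                    results.add(r);
                }
            });
            worker.start();                          // extra thread per item
            workers.add(worker);
        }

        for (Thread worker : workers) {
            worker.join();                           // main thread just waits
        }
        return results;
    }

    Result queryDatabase(Item item) { return new Result(); } // stand-in for the real query

    static class Item {}
    static class Result {}
}
```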
And obviously, there's two problems with this.
First of all, it is a load-related problem.
That means the more load you have on the system, at some point you run out of threads if every incoming thread consumes more than one. So that means you cannot scale. But even worse, or actually in combination with this, this is also a data-driven problem, because what if you don't have to crunch through 10 items but a thousand items? Do you then automatically spawn a thousand threads, and do you have a thousand threads available?
And I can imagine if you have to spawn a thousand threads, you would also really need the memory for those thousand threads as well.
Yeah, or you run into an exception where the JVM probably tells you, well, I cannot give you these threads, or at some point maybe even the operating system tells you it's not possible anymore. So definitely you run into resource issues. Now, the intent is understandable, why you want to parallelize some of this heavy lifting activity, but done this way it's not good. And the way I see these problems now, when I analyze the data, I look at these long-running PurePaths, and it's also explained in the blog. What I like about the PurePath is not only the end-to-end traceability, but we also show you which thread is actually executing a certain method, so I can see the thread switches when it switches from one thread to another. And then there's another column, which we call the elapsed time. It's like a timestamp. It basically tells me when this method was executed, the next, the next, and it's always relative to the entry point of the transaction.
So the entry point is zero, right?
Timestamp zero.
And there you can easily always see, hey, it's interesting.
We are trying to make a call from the main thread and then passing some work to the background thread.
And there I have a gap from one second, five seconds, ten seconds.
So that means the system or the main thread is just waiting for the work to be picked up by the next thread.
And the question is, why is that?
And typically the answer is, well, there are simply none of these background threads currently available, because they are busy, because they're used by other main threads that are trying to do a similar thing.
Right. What strikes me about this one is that it ties very directly to another common problem pattern we spoke of in that first or second episode.
Another threading issue?
I think threading issues all probably come from a very similar problem,
but it makes sense, right, that when we were looking at time spent
on the web server before getting to the JVM or the app or whatever,
where you could say, all right, we're trying to make a synchronous call
back to the JVM,
and there's a long wait time before it gets to that.
And again, you could see that from that elapsed time on the thread.
But again, it's just not enough threads.
Yeah, and so what I typically do now, I think, is a lesson learned from analyzing these problems.
So I always look at, first of all, number of incoming requests on the front, right?
How many requests are coming in on Tomcat, then?
How many threads are active?
And I'll try to actually calculate a ratio.
So how many threads are active per certain transaction type?
Because then you can immediately see, wow, this transaction,
they only need one thread, so everything is synchronous, no problem.
Oh, this transaction is consuming three threads.
This one, 50 threads.
So I kind of look at this and then better understand the real performance
and resource characteristic of an implementation.
So this is one good thing that I learned.
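As a rough illustration of that ratio (this is plain arithmetic, not a Dynatrace API; the transaction names and numbers are made up):

```java
import java.util.Map;

public class ThreadRatio {
    // Active worker threads divided by incoming requests, per transaction type.
    // A value near 1 suggests mostly synchronous work; 3, 10 or 50 means every
    // request fans out into that many background threads.
    static double threadsPerRequest(long activeThreads, long incomingRequests) {
        return incomingRequests == 0 ? 0.0 : (double) activeThreads / incomingRequests;
    }

    public static void main(String[] args) {
        Map<String, long[]> samples = Map.of(
                "/report", new long[]{480, 12},   // {active threads, requests}
                "/login",  new long[]{25, 25});
        samples.forEach((tx, v) -> System.out.printf(
                "%s -> %.1f threads per request%n", tx, threadsPerRequest(v[0], v[1])));
    }
}
```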
Also, another thing that I learned,
if you have a transaction that is heavily using multiple threads,
then what tools like Dynatrace provide
is it gives you the total execution time of all the threads.
Because even though, let's assume you have endless threads available and everything is fast and they never run into a threading issue, eventually your transaction is consuming resources, CPU cycles, on the different threads. So, in order not to be fooled by, hey, this transaction is so fast, it only takes 100 milliseconds. But if I know that it takes 100 milliseconds on the main thread and it also consumes 100 milliseconds on 50 other threads, that multiplies up to five seconds plus 100 milliseconds. That's why, when you try to optimize multi-threaded applications,
then you always need to look at the total execution time of all the threads that are involved.
And then especially track this over time because you want to know when a code change actually changes the dynamic behavior. And even though the performance perceived by the end user is still good or maybe even better,
it may be that you're now involving so many more background threads
and you're in the end consuming a lot of more resources.
Right. And just to give a kind of a pro tip to anybody who uses Dynatrace who might be listening,
when you're looking at PurePaths, or pure PATHS, I put the accent on the wrong syllable, that's the Christopher Walken way of saying it.
When you're looking at PurePaths, you have three different times you can look at, right?
And they all reveal different things about these patterns that you're talking about.
Your response time is your basic time on the transaction, right? Your response time, let's say it's 50 milliseconds. That's how long it took for the entry point method to complete, right? If it spawns off any asynchronous threads, though, that is not covered by that,
right? Your execution sum for a pure path is more along the lines of what you're talking about, right? Where it's a total of the entry point plus the sum of all the different
threads,
even if they're running in parallel,
right?
So it's not,
it doesn't reflect a true time that anybody or any of the systems are being
parallelly impacted by,
but it's the amount of time that transaction consumes.
And then the third one we have is duration, right? And duration is going to cover the entry point of that method, plus your async
times for, but it's not going to cover parallel threads, right? So if you have two asynchronous
threads fire off at the same time, the duration is going to include the length of the longest one.
Exactly. The duration basically is from when the transaction initially hits your JVM, for instance, until the last asynchronous thread is really done. And it's actually great that you bring this up, because it's sometimes misleading if people only look at response time
and say, well, look at this.
This code is super fast.
It responds in 50 milliseconds.
But then duration all of a sudden says, well, a second, two seconds, five seconds.
Ten minutes.
Ten minutes, yeah.
And basically what that means, the initial request comes in.
It's super fast.
It responds back to the end user.
But what this request actually does, it spawns off something asynchronously
that then goes off and does something.
But to the end user, the transaction might be completed in 50 milliseconds.
But on the system itself, certain things are happening.
So that's why, very good, Brian.
So just a little pro tip to any users out there, to pepper that in, in case you haven't looked at that.
Yeah, so response time, this is the time perceived by whoever initiated the transaction.
Duration is how long does it really take from end to end, including all the asynchronous activity.
And then execution time or total execution time includes all the time of all the threads combined. So we can actually see the real resource footprint of that particular
transaction.
Yeah.
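As a toy illustration of those three numbers (the sleeps, the pool size and the names are invented; in practice an agent measures this rather than the application timing itself):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TimingExample {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        long start = System.currentTimeMillis();

        // The entry point does ~50 ms of work, fires two async tasks, and responds.
        sleep(50);
        CompletableFuture<Void> a = CompletableFuture.runAsync(() -> sleep(400), pool);
        CompletableFuture<Void> b = CompletableFuture.runAsync(() -> sleep(600), pool);
        long responseTime = System.currentTimeMillis() - start;   // ~50 ms (caller's view)

        CompletableFuture.allOf(a, b).join();
        long duration = System.currentTimeMillis() - start;       // ~650 ms (longest async path)

        long executionSum = 50 + 400 + 600;                       // ~1050 ms (all threads added up)

        System.out.printf("response=%d ms, duration=%d ms, executionSum=%d ms%n",
                responseTime, duration, executionSum);
        pool.shutdown();
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```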
And in terms of the,
in terms of the,
the threading,
right.
As we mentioned,
threading is multi,
you know,
multiple threading,
multi-threading is not necessarily bad,
right.
But being too heavy or having these delays
is where it does get bad.
And another thing,
so you had mentioned
some of the things that you do
to look for it manually, right?
You're looking at the number of requests,
how many threads it's spawning,
the amount of time it's being spent there.
But again, and it's funny
that it's me doing the plugs this time,
but we just want to mention that
with a lot of these common problem patterns,
if you are using Dynatrace, we will identify currently in 6.5,
we're identifying heavily asynchronous transactions.
And then you were mentioning, I believe,
there's something else coming in the next release.
Yeah, in 7.
Yeah, so what we do on a PurePath-to-PurePath basis, just what I did manually all the time, basically looking at how many threads are involved. So we tag a PurePath and say this PurePath is synchronous or this PurePath is asynchronous. And if it's asynchronous, we actually say thread heavy or thread medium. And this has been there since 6.5 and makes it easy to say, show me those PurePaths that are very heavy on threading.
But what we now have with 7, we give you this view over a timeline.
So you can say, hey, under a certain load condition, if I look at my load pattern over
the course of the day, then I can see when load goes up, I all of a sudden see the number
of pure paths that have asynchronous issues also go up.
Or if you have constant load and you are deploying a new version of your app into that environment,
if you do some type of continuous performance engineering, continuous performance testing,
and you look at Dynatrace and you can see, oh, right now 5% of my threads are heavy on asynchronous, heavy on threading.
And after the deployment, it's not 5%, it's 50%.
So immediately you know you probably introduced a regression.
And that makes it just so much easier, especially for architects.
Or for code reviews, to say, oh, we probably don't want to let this into production.
Because first we want to sit down and figure out was this intentional or not.
Great. All right, any final words on threading?
No, I think what I'd just like to add, what I always look at, is just some charts where, I think I mentioned earlier, I typically look at incoming transactions overall, then I look at the number of threads involved, and always look at this ratio.
So that'd be the total number of threads, right?
Exactly, the total number of active threads and the total number of transactions that come in. And obviously, if you have a chance and if you have a tool that can do that, then split the transactions up into something that makes more sense for you from a business, from a feature perspective, right?
Because typically in an application, you have certain requests that you see a lot, but typically very fast ones.
I don't know, static resource requests or some heartbeats, and they are typically then diluting your averages. So if you have a chance and say,
show me the number of requests that come in on that particular transaction
and how many threads are involved, that will be perfect.
And then trace this over time.
And you know, with that information you can also track how much CPU time it's using or how many memory resources it's using, and then find out, well, how much is this costing us in our cloud infrastructure that we're spinning up?
Exactly. And to give another Dynatrace tip, we would use business transactions for that, right? You create a business transaction for a particular feature, and then as a result measure we can say total CPU time,
total execution time,
total number of threads,
total number of web requests and all that stuff.
This is, yeah.
Yeah, because again,
as Andy brings up quite often,
one of the things to be looking out for,
the newest thing to look out for
is what is the cost of a feature, right?
And are you getting, and especially as we were talking to, uh,
Karenka the other day, you know, uh, maybe you put it out, you find out people like it,
and then you go ahead and optimize it. Right. And, and, and these are all ways you can,
you can do that. So, all right. So that's, um, the multi-threading, right. And, uh, we'll go
into our next problem pattern now. And this one is everyone's favorite topic.
The big buzzword microservices, right? Microservices. And I guess the, maybe the
problem pattern or whatever, it could be considered nano services, but there are,
you know, there's a whole bunch that we can talk about in microservices, but we're, you know,
limit on time. So we're going to, we're going to just start covering a couple pieces, right?
One thing to point out is a lot of the same problem patterns that we have without microservices get repeated with microservices, right?
But with that introduction, Andy, what is the microservice du jour that you want to bring up today?
Yeah, well, you know, I think, again, it comes back to, and I think you mentioned it in your intro, you took the word from the blog post.
When microservices become nano, that's actually the title of a blog post that one of our users wrote, Stephen Ledoux. He showed us, and again, you can go on the blog, and maybe we'll post it.
Yeah, we'll have it up there.
But basically, he was explaining that they were basically ripping apart the monolith
into microservices, which is obviously great, but they were going too granular. And what too granular actually meant is that they had one call, again, coming in at the front, and everything that used to be done in the monolith, maybe in the same thread
the front and everything that used to be done in the monolith, maybe in the same thread
and the same process, obviously.
Now, instead of processing all of this in the monolith in one container, they were having many, many calls going to these new services. They were so fine-grained that they simply had a cascading effect, almost, where instead of one front-end microservice calling a back-end microservice, the front-end calls the back-end microservice 20 times, and that microservice calls the next microservice another 20 times, because they were just not thinking correctly about how to structure the services and how granular they should be, or which functionality should actually be combined in one service and where you need another one. And I think this is also where it's so critical, before you sit down and do your microservice architecture, that you actually do some dependency analysis. So if you really rip your monolith apart, across maybe class lines or functions even, then first think about: what are the dependencies?
How often do these classes call each other?
How often do these functions call each other?
And are you aware of the fact that when you then make a call that you have a round trip over network that you have to marshal and unmarshal a call?
Depending on which protocols you use, it can be quite heavy.
So these are the things that we've seen.
So just a misuse, too busy, too chatty microservices.
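A hedged Java sketch of that chattiness, with a hypothetical price service standing in for the too-fine-grained dependency (the URLs and the plain-text response format are invented):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class PriceLookup {
    private final HttpClient http = HttpClient.newHttpClient();

    // Too chatty: one remote round trip (marshal, network, unmarshal) per item --
    // the N+1 pattern, now between services instead of between app and database.
    long totalChatty(List<String> itemIds) throws Exception {
        long total = 0;
        for (String id : itemIds) {
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://price-service/prices/" + id)).build();
            total += Long.parseLong(http.send(req, HttpResponse.BodyHandlers.ofString()).body());
        }
        return total;
    }

    // Coarser contract: one batch call for all items keeps the network out of the loop.
    // The service boundary stays; the chattiness goes.
    long totalBatched(List<String> itemIds) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("http://price-service/prices/total?ids=" + String.join(",", itemIds))).build();
        return Long.parseLong(http.send(req, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```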
And I think it seems like it would be something very common, right?
Because everyone, as we've discussed before,
a lot of people are still moving in this direction, right?
So everybody is falling in love with the idea of we're going to,
we're going to break up our monolithic application.
We're going to go to microservices.
We're going to do a great CICD pipeline.
We're going to, you know,
we're going to do everything DevOps and we're going to be the next,
well, they're not unicorns anymore, but stallions or horses or whatever.
Right.
We, and the,
I almost feel that internally at an organization, there's a great
risk to see how far they can break it down. Like, we could take this giant monolithic thing and
break it down to 5,000 services. Look how cool this is. They're all individual, right? We're
all individuals if you ever watched Life of Brian. But that almost feels like it's an accomplishment,
right?
How far we could break it down.
But the issue then comes in, exactly what you're saying: especially in the case of a one-to-one relationship, you went too far.
And now you're adding slowness and overhead and resource consumption to that because now you have to make a network call.
And that analysis. Yeah, that's very important.
I mean, again, you can maybe look at your code,
you can look at some of these things and get an idea,
but I think the next important step is then to properly,
when you make your first attempt at your break apart,
when you first break it apart and start testing it through,
take a look at those dependencies at that point as well
and monitor the calls between them and everything else.
Yeah, go on.
Yeah.
No, and maybe again, you know, we have to give another, I mean, the way I would do this,
if I have a tool like Dynatrace available, you know, obviously the PurePath gives you
a great overview to see which method and which component calls which other component.
But in this case, I would actually think the sequence diagram that we have in dynatrace is actually
really great because it shows you very nicely how components communicate with each other and
can be a good indicator on which components are more isolated already and therefore are better candidates of ripping them out of the monolith
and putting it into a service.
Right.
So that's one thing to keep in mind.
So do your dependency analysis before you do anything.
Like just building microservices just because you can build microservices, it doesn't make sense.
Right.
And then you also mentioned the one-to-one relationship.
I think Martin, he wrote a blog post about this.
Right.
Where he also talks about detecting these very tightly coupled services.
And if they are really tightly coupled, then again, think about: does this really need to be an extra microservice? If it has to be, then at least be smart about the deployment of those, because if you have very tightly coupled microservices, then also please deploy them close to each other to actually avoid the overhead of potential network latency.
So that would be a bad candidate for putting one in the private cloud and the other in the public cloud.
Yeah, something like that. That's for sure. And these are all things that you have to consider, obviously, and that you can already analyze before you actually go into production. And this is something that I would hope software architects are doing, and I'm sure that most people do, but it seems it happens, right? And that's why we have these blog posts from customers to tell us about it. The world is not perfect, right? And so, yeah, this is one of the problem patterns: just too fine-grained. And then, in a way, to name it again, the N plus 1 query problem, because if you are making this so fine-grained and you have existing code that makes calls in a loop to a certain class, and this class is now a microservice, then those calls in the loop become microservice calls. This was one of my first blog posts I did on going from monolith to microservices, the N plus 1 query problem between microservice calls, the Swedish company with the search service. And so be aware of that.
It's amazing how much the N plus 1 query pops up everywhere.
Yeah.
You know, it's almost like a plague. I mean, it really is like there's no getting rid of it. It's just astounding.
Yeah, and the thing is also, I mean, I understand it too, because I think development frameworks, you know, may make it easy for developers to build stuff in a very fast
way and sometimes they hide away
complexity and these frameworks
are also not always perfect and also
sometimes generic.
And if you use it in a generic
way and you don't think about how
to adapt them and configure them
for your use case, then you just end up with
things like that.
We used to talk about hibernate
over the last couple of years, and it's the same. Hibernate is a great framework, but if you don't configure it and adapt it to your use cases, then it can do some horrible things.
Yeah. And the one warning always with the N plus 1 query is, you know, you might even look at a transaction and say, hey, this transaction only takes, you know, 200 milliseconds to execute from end to end. I don't care that there's an N plus 1 query. And in fact, my N plus 1, either query or service, is only contributing 20 milliseconds to the whole thing, right? But don't let that fool you, because there's a lot more than just the amount of time, right? There's the amount of threads, there's the amount of connections, there are the bytes. There are also the unexpected changes in traffic or changes in usage that can suddenly blow that up on you. So, you know, there was a really great example, I wish I remembered it as we were talking about it, I remember hearing about it with one of our customers, where they had one of those exact situations where it was so minor, and then the next day it blew up on them for a completely unpredictable reason. But just because it existed, it was a liability. So I would just say don't ignore the N plus 1 query or the N plus 1 pattern problem.
Yeah. And I just, you know, I always like analogies. The analogy with microservices that always comes to mind for me is if you go to a supermarket,
and if you go to, you know, you have your basket of items and then you go to the checkout
and then let's say you have five items. And if the cashier, if, if he's a service and the only
thing he does well is basically putting the final price into the system and then giving it a total.
But every time he needs to get the price for a single item,
he needs to call another service to tell him the price.
So in this case, I give him my first item.
Then he needs to walk over to the person that tells him the price,
comes back, puts the price in, and then I give him the second item.
That's basically kind of the analogy for me.
So the question is how fine granular do we need to have these services?
Do we really always need to go to that other service to get a certain part of information?
And also how far away should this person be or this service be to optimize the paths?
That's the end, what we have to care about.
You know, I used to work at a supermarket.
Oh yeah.
Yeah.
Yeah.
Yeah.
And my favorite was always having to do a price check.
That's always the worst, because then suddenly everybody in line groans, but that's what your customers are going to do if you have to do this piece, right? They're going to groan.
Yeah. But the cool thing with this is, right, if you see the cashier as a microservice, you can scale it out by just adding more cashiers to it.
That's the way, that's the beauty of it.
All right.
We're at about 30 minutes.
Do we want to go ahead and tackle the last one we were going to?
Sure, why not?
Sure.
Okay, so we had one more
problem pattern we wanted to discuss today. We are not going to open up an entire Pandora's box, because we're going to discuss a little bit about memory, right? And obviously, as soon as we say memory, there are about, you know, a million things we can talk about, and very, very in-depth. Memory is very, very complex, and it's very scary. Anytime you see a memory problem,
that's when I think everybody kind of starts shaking a little bit because they can just be
so painful. But today we're going to take a look at one that's hopefully not as painful,
not too painful. We're going to take a look at garbage collection, right? And we all know
garbage collection is necessary. It has a great function. It's part of life, right? It's,
it's, it's like breathing. It's an application breathing and has to have, or I should say,
really, it's kind of like going to the bathroom for the application or that's why they call it
a dump, right? Um, we have all the fun terminology, right? But the point is garbage collection
happens, right? And no matter what, there's going to be some kind of impact to your application.
And again, that's the cost of running an application that does garbage collection.
The problem is when that has an impact on your application that is beyond,
or that's a negative impact that you can feel and, you know, or your customers can feel.
Right?
So let's get into a little bit of GC.
Yeah. Well, yeah, GC, or I think what we call it, suspension, because it's really when the garbage collector kicks in and actually suspends the JVM, or the runtime, and actually suspends your current critical transactions that are executing. And that's, I think, what you're trying to get to: garbage collection itself has to run, and it will obviously, when it has major GCs, block your JVM for a while. The question is how long these blocks are and who they are impacting, and maybe they are impacting transactions where you have SLAs, right?
And you have a certain service level agreement or you know if it goes beyond a certain threshold,
then you are going to make your users unhappy or whoever else unhappy.
So that's why what we always look at is suspension time impact on transaction response time.
So in our terminology, from a Dynatrace perspective,
we always talk about runtime suspension and the portion of time it contributes
to the overall response time of a transaction.
Interestingly enough, let me cut you off there,
because just before we were talking about response time, duration, and execution time total, right? And I believe
when we look at this, we're talking about the suspension impact on the duration.
Exactly. But I mean, it's the suspension duration on everything that happens within the execution of a transaction, that's true.
Yeah. I mean, it's not clear which one of those we're looking at when we take a look at it from that point of view.
Yeah.
Well, I think it will impact response time and duration.
Of course.
Of course.
Yeah.
And so I think the thing that I'm always looking at is when we are analyzing performance-related problems due to garbage collection,
we always look at the percentage of time that the runtime suspends the threads
and which threads are currently executing business-critical features
and how long they are actually suspended and what the percentage is
of that suspension time to the overall execution time of the transaction.
And so I typically put graphs up when I analyze load tests or systems in production,
and I look at overall response time, and then I'll look at response time without garbage collection impact, right? Because this is actually then the ratio. And what I like, again, I'll be doing a lot of plugs today with Dynatrace, but in Dynatrace you can also do a ratio with a ratio measure.
I'm just going to say, Reinhardt and I do that a lot too in our things.
Exactly, yeah. So the ratio measure is great because it calculates the ratio between the response time and the response time without garbage collection, and so you can actually see what the real percentage impact really is, right? And if you chart this over time, then you can actually see under which load you hit a certain threshold where garbage collection really becomes a problem for you.
Also, at which time of the day,
when you have certain load patterns,
you may hit a certain threshold.
Or if the code changed.
It's a code change, yeah, exactly.
So if you are deploying a new version
and then all of a sudden you see memory behavior changes
and now you have more impact.
And then obviously the remediating actions can be different ones, right?
It could be your memory behavior changes for a good reason because you changed some code and it just requires more memory.
But maybe you forgot about to adapt your GC settings, your memory settings, or maybe you really have memory issues
where you are simply consuming too much memory,
causing too many garbage collection runs in the end.
And then you obviously need to do your memory diagnostics
and figure out how to be more efficient with memory usage.
Maybe you're keeping those threads alive too long.
You're keeping the threads alive too long, yeah.
So the interesting thing about this one, this is a pretty difficult one to look at, right? Because this one requires using JVMTI, correct? Well, in order to see, yeah, I mean, we could see a GC run, we could see, you know, some of those pieces, but some of the details...
Yeah, so I think what you're getting to is that there are different approaches on how to measure this. Let's say there are different approaches on how to instrument JVMs and CLRs. And for Java, there are two options, right? You have a Java agent, which is convenient and easy. The only problem with the Java-based agent is that it runs within the JVM. So the code that the tool uses to actually analyze performance is also suspended if the garbage collector kicks in.
And therefore, this tool doesn't even know that a garbage collection just happened.
And there's the other approach that has been around for a little longer, which is using the native interfaces of the JVM, the JVM TI, the tooling interface, where you kind of sit outside of the runtime and therefore actually understand when the GC is triggered, how long it takes, and also who is impacted by it. And this is then the approach where you can really figure out
what is the real impact of a GC suspension on which threads.
And if you know the threads and what they're executing,
then you can also assign it to your business features.
And you can say, oh, this feature was impacted by 50 milliseconds,
which is 60% of our overall response time.
And it's a critical feature, and that's not acceptable.
Right.
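The arithmetic behind that percentage is straightforward; a minimal sketch, assuming your monitoring tool already gives you the suspension time per transaction:

```java
public class GcImpact {
    // Share of a transaction's response time that was spent suspended by the GC.
    static double suspensionPercent(double responseTimeMs, double suspensionMs) {
        return responseTimeMs == 0 ? 0.0 : (suspensionMs / responseTimeMs) * 100.0;
    }

    public static void main(String[] args) {
        // e.g. roughly 83 ms response time with 50 ms of suspension -> ~60% GC impact
        System.out.printf("GC impact: %.0f%%%n", suspensionPercent(83, 50));
    }
}
```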
And the interesting thing, too, is you could, with much less success,
if you're looking at some of the JMX metrics type and all,
you could see, all right, there was a garbage collection, it ran for 200 milliseconds, and it ran at 10:05 today. And at 10:05 today we also saw a spike in response time, so we'll assume that that was because of the GC, right? But that, again, is kind of, you're correlating.
Yeah, you're correlating. Yeah.
You're tying two events together that look like they're related, but they might not be.
So getting to that JVMTI tooling aspect is where you get that definitive answer.
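For reference, the JMX-level view Brian alludes to looks roughly like this, using the standard java.lang.management API. It gives aggregate GC counts and times you can correlate against response-time spikes, but not the per-transaction suspension that a JVMTI-based agent can attribute:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // JVM-wide aggregates: how often each collector ran and for how long in total.
        // Good for spotting "a GC ran around 10:05 for 200 ms", but it cannot tell you
        // which transactions were suspended by it.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, totalTime=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```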
Yeah.
So it's just – I never thought about that.
That's a good one.
That's a very good one, yeah.
If you just look at the... actually, yeah, there could be plenty of other things going on, right?
Yeah. But you also wouldn't know the impact to the business functions as well. So that JVMTI piece is really important. I know it sounds like I'm plugging, but, you know, when I learned about this piece, I think about four and a half years ago or something, or maybe five now.
It's been a while.
I just understood the criticality of it.
So I'd love to.
And by the way, I think we had, if people want to learn a little more, we have.
You did a great performance clinic on that, didn't you?
I did a performance clinic too. But what I also wanted to say, we have the book with the Java Enterprise book.
Oh, that's right.
We wrote a book a while ago.
So if you Google it on the website, it's a free book.
We'll put a link up.
We'll put a link up, yeah.
I think it was Michael Kopp back then.
He wrote several – I think he wrote all the chapters on memory management.
He explained how garbage collection works and also the impact of garbage collection. There was a phenomenal piece that he wrote. It's under, I believe, Java Enterprise Performance. I just googled for Java Enterprise Performance Dynatrace, and it's under dynatrace.com/resources/ebooks/javabook.
Yeah. So we'll put the link up, but for people who don't know the link, what did you search for? Dynatrace, Java, what did you search for?
Dynatrace Java book memory.
It comes up like eBook Dynatrace. Yeah. So exactly. Yeah.
So you'll see that there. It's definitely good.
That's where I actually started really learning about memory, from the early version of that book that was being put up there. So it's a great thing, and the nice thing is, even though it was written, you know, several years ago, memory doesn't change too much, so it holds true, right? So it's definitely still worth going into. Anything else on the memory there?
Well, I think memory, as you said in the beginning,
is a huge topic, right?
I think we should have probably other sessions where we can talk about object churning,
where we can talk about memory leaks
and different ways to analyze memory problems.
But for this particular one here,
I think it's just important to understand
that try to figure out
what the real impact of garbage collection is on your business functions
and figure out if this is load related, if it comes in through a different deployment that you had, or, you know, in case certain memory-aggressive features are executed and then impacting all the other features that are currently executing in parallel. But yeah, always try to figure out what the impact of garbage collection suspension is.
Absolutely. And for anybody listening, if there are certain aspects of memory you would like to hear us discuss, by all means, please tweet it at us. You can tweet at Pure underscore DT
or at either one of our regular Twitter handles.
So at Grabner Andy or at Emperor Wilson.
Or you can email us at pureperformance@dynatrace.com.
Any other kind of problem pattern type of things
that you may be interested in that we haven't covered yet,
please as well, let us know. And we'll, we'll, we'll dive into them.
We're always looking for finding out what is interesting to you.
I'm still sounding all public radio. I think I'll call myself silky Brian. So, you know, Andy, as always, the problem patterns are great, and I'm so glad that you see so many of them. We all see them quite often, and I'm sure a lot of our listeners run into them all the time, and hopefully we're doing a little to help you identify or be aware of some of the more common ones. Because, as you've told myself and many audiences over and over again, right, we call them common because they are common. They happen everywhere, and they happen in the best of shops and they happen in the worst of shops. There's just no getting away from them. It's kind of like mosquitoes. They're always there somewhere. So again, great stuff there. Anything that you'd like to, any final words from you, Andy?
No, not really. Just, you know, keep sending us data and keep approaching us with things that we can then take, analyze, and then bring it to a
broader audience, either through the podcast or through blog posts.
I think with that, we can all make our little contribution
in improving the engineering world.
Excellent. Well said.
Give back.
There you go. That's what it is.
And we're having our fund drive.
Yeah.
All right.
Well, with that, I'd like to thank you, Andy.
Thank you.
For doing this for me.
And, you know, again, it's about a year.
So let's wax nostalgically for half a second.
And thank you for all of our listeners who have been enabling us to do this. We do this as part of our day job.
Right.
So because we have listeners like you wonderful people, we get to continue doing this, and we really enjoy it. So just a big shout out to everyone who listens and everyone who's encouraged us to do this, especially the one, the only, Mark Tomlinson, who's the one who kept pushing the two of us to do this. So shout out to Mark. And I guess we'll give a little shout out to Steve.
Yeah, Steve is listening to us on the way into work, he told us. Yeah.
And also Brett. Yes, hi Brett.
Guest of the show.
Anyhow, and alright, well we'll talk
to everybody. We'll see you all soon
and
until next time. Ciao, ciao.
Ciao.