PurePerformance - 001 Performance 101: Key Metrics beyond Response Time and Throughput
Episode Date: May 5, 2016

If you are running load tests it is not enough to just look at response time and throughput. As a performance engineer you also have to look at key components that impact performance: CPU, memory, network and disk utilization should be obvious. Connection pools (database, web service), thread pools and message queues have to be part of monitoring as well. On top of that you want to understand how the individual components that you test (frontend server, backend services, database, middleware, ...) communicate with each other. You need to identify any communication bottlenecks caused by too-chatty components (how many calls between tiers) and too-heavyweight conversations (bandwidth requirements). Listen to this podcast and learn which metrics you should look at while running your load test. As a performance engineer you should not only report that the app is slow under a certain load but also give recommendations on which components are to blame.
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance. Thank you for joining us.
I am host number one, Brian Wilson, and with me is...
Well, host number two, Andy Grabner. Thanks, Brian, for doing all this.
And I think we can actually get this kicked off.
And shall we tell the audience what this is all about, what they actually downloaded?
Yeah, so the name of this podcast is Pure Performance, and I guess that means it is about performance, purely, in all shapes and sizes.
Exactly. Well, maybe not the performance that some people think about when they just see the word performance on its own. We're talking mainly about software performance, application performance.
Right. The two of us have been working in the performance field of software engineering for many, many years. I'm not sure, how long have you actually been in the field?
Before I joined the APM side, I was about 10 years in it, and a total of about maybe 15 now, including having gone APM.
Yeah, that's about the same for me.
I mean, actually, you mentioned APM, maybe a term that we need to explain. It stands for Application Performance Monitoring or Application Performance Management, whatever you want to call it. And we both have been colleagues for the last couple of years, working for a company called Dynatrace, but obviously all our lives we have cared about performance.
And in my previous life I did load testing, actually, for the first eight years of my career.
And I think that's also where I got a lot of input on what to do and also what not to do. One of the things we want to cover in this first episode is Performance 101: some key metrics beyond response time and throughput, which I believe many people think are the main things we are interested in when we do performance testing. And they are, but they should not be the only thing you are reporting back to your managers, saying, hey, here's the report. I just ran this LoadRunner or JMeter or SoapUI script, and based on my numbers, based on this PDF report, I can see that response time is up by 25%. I think that alone doesn't cut it anymore.
Right. It's funny you mentioned that. When I first started in performance, it was very much
that way.
In fact, one of the things, so funny background on me is I was a communication major.
I was going to be some filmmaker and everything until I gave that up and got into the IT industry.
But one of the things that appealed to me about the performance testing side of things was,
this is back in the early days, maybe around 2001, where a lot of the role of performance engineers was simply running a test and handing off a set of results for the engineers to figure out.
And without too much computer background, except having tinkered and built a computer and all that kind of stuff, I thought, oh, that's perfect.
I don't have to know too much.
I can just hand it off, right?
And I quickly learned that that was not going to cut it at all.
And that then started my desire to keep learning more and more and more.
But back in the day, we really were stuck with such a small set of metrics that we can hand off.
We, you know, you could look and say, Hey, the CPU increased, you know, and we also see some,
you know, maybe, you know, I was doing a lot of.NET testing. So we'd see a couple of items in the.NET queue.
And someone would ask, well, which happened first?
And what's causing that?
And, you know, we had no way to know.
So it was great when the APM tools came along; you started really being able to expand that. And I think part of what we're going to cover today, going into what you're saying there,
is those days of limited view,
beyond the fact that they're just not necessary anymore,
I don't think they're excusable anymore.
If you're not looking further
and you're not taking a look at the deeper metrics
to understand what's behind it
and be able to hand off actionable data,
then you're kind of shirking your responsibility as a performance tester.
And I really hope from today's podcast we can help people who have kind of been stuck, or maybe haven't been so aware of some of these other metrics that are available, so they can really up their game and start making a splash and getting themselves noticed as a great performer in whatever company they're working with.
Yeah, well put.
Yeah, so what I still see, though I think the times are changing now, is that when people look at load testing, the first thing they ask is: can I really create a report that I can then pass over to somebody else?
And I think this is actually really strange because if you are a performance engineer
and the only thing you care about is how you can record and replay some scripts and then
hand off the data, then guess what?
You won't have a job for much longer, I believe. Because honestly, we're moving towards automation, everything can be automated, and there's a lot of cheap labor out there in the world. So you are on the brink of being outsourced soon.
So that's why I believe everyone that is serious about performance management needs to level up and needs to understand what they're looking at.
But they also need to execute more tests in less time, testing more software.
But then the analysis part becomes very critical because we need to find faster answers to what the problem really is.
And that's why I also believe, and I totally agree with you, that tools in the APM space or
the application performance management space will help. And I think to the audience, I mean,
obviously we both now work for an APM vendor. We're called Dynatrace and you can look us up,
but basically the thoughts that we are trying to tell you here and the best practices or our
know-how that we have can be applied with any type of tool out there.
We really want to get you just some more additional thoughts,
food for thought on how you can improve your job in becoming a better performance engineer, right?
Yep, absolutely.
It's all about, as we keep saying, leveling up.
It's a term I think we just use over and over again.
But performance was, for a long time, a second-hand thought at most organizations, and they usually give it their first thought when they have their first performance outage in production.
And then they expect everything to work quickly.
And often, as was the case even in my own career, they just grab somebody and throw them into performance without them having too much knowledge.
And it's really great when you can get some help from people.
I've had some wonderful mentors in my time that really helped me get ahead in the game.
And hopefully some of the information we can impart here will help you all get ahead in where you're going.
But more importantly, it's about making the Internet a better place.
Because nobody likes bad performance.
Nobody likes going on a website that crashes.
So many times I go onto a mobile site on my phone and it crashes the browser, the mobile browser, still today.
So we're all working together
to make the internet a better place for everybody.
Which brings me to one thing. I mean, we didn't talk about this, but the reason why your browser probably crashes is because the JavaScript that runs on that page, or the content that is loaded, is probably too big. So one of the other metrics to look at is how many resources we are actually wasting or consuming for certain use cases. So if you are a performance engineer, you should have a good feeling about: hey, are we loading 100 kilobytes of data? Are we loading 10 megabytes of data? Are we loading very heavy JavaScript? Are we allocating more memory on the heap?
And especially, I'm not sure which mobile phone you have,
but I mean, I'm still stuck with my old iPhone 4S.
And it's not the fastest anymore.
And that means that for me, the browser crashes probably even faster than yours
because my OS here and also my CPU and hardware is just not as good anymore.
And therefore, apps that may work well on the iPhone for my girlfriend,
they don't really work well on my machine anymore.
So I think as performance testers, we also need to understand the different environments that people will use our software in.
We cannot assume that everybody has the latest and greatest hardware and technology.
We cannot assume that we all have great network connectivity and low latency.
We cannot assume we are located in a geographical region where the CDNs that we may use, the content delivery networks, are very close to us.
So that's why we also need to factor all of that in as well, I believe.
So resources, I think, is a big metric.
And I'm always looking at resources.
So when I run, and Brian, correct me if you do anything else, but if I look at load testing
results, and especially if I compare results from, let's say, the
previous build to this build, not only do I look at response time and failure rate,
which make obviously a lot of sense, but I look at CPU, memory, network, and storage
as well.
Why?
Because I want to know, hey, we still keep the same response time, but now I see CPU
increased by 20% on our web servers, by 15% on our app servers.
And even though we're still good
from a performance perspective
to the end user, let's say, that way,
we're now consuming that many more cycles,
which means if we deploy this
and if you want to scale it up,
we need to probably buy more hardware
or put more money into Amazon
to buy more virtual resources.
Right.
Or figure out why
that CPU has gone up.
Exactly. That's an engineering approach, right? We should not just accept the fact and basically update our data for capacity planning, but really say, hey, feature-wise everything stayed the same, response time stayed the same, but CPU and memory went up.
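As an illustration, here is a minimal sketch of grabbing those resource numbers programmatically from the JVM while a test runs. It uses the standard MXBeans; the process-CPU reading assumes a HotSpot-style JVM (the com.sun.management cast), and the five-second interval is arbitrary:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import com.sun.management.OperatingSystemMXBean;

public class ResourceSnapshot {
    public static void main(String[] args) throws InterruptedException {
        // HotSpot-specific subinterface that exposes process CPU load.
        OperatingSystemMXBean os = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        while (true) {
            MemoryUsage heap = mem.getHeapMemoryUsage();
            // Log alongside the load test so response time regressions can be
            // correlated with CPU and heap growth build over build.
            System.out.printf("cpu=%.1f%% heapUsed=%dMB heapMax=%dMB%n",
                    os.getProcessCpuLoad() * 100,
                    heap.getUsed() / (1024 * 1024),
                    heap.getMax() / (1024 * 1024));
            Thread.sleep(5000);
        }
    }
}
```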
You mentioned you started in the .NET space, or did a lot of testing in .NET. I recently worked with an app, and what they did was update the dependency injection library.
So dependency injection allows developers to, after the fact,
while the application is running, actually inject some code to do certain tasks.
It's very commonly used for logging.
So instead of developers putting in their own log statements,
they can use a dependency injection library to say,
I want to log a message every time these types of methods are called.
That's dependency injection.
And what they did over the course of the lifetime of the app,
they updated this dependency injection library from version to version,
which makes a lot of sense if you want to stay up to date.
Unfortunately, at one point, they were installing a library that had a bug.
And this library now actually injected many more instances of their custom code into the code base, and it also had a memory leak.
So the actual result was that they were logging more data than ever before, and a lot of duplicated data. And they were also consuming more CPU, because they consumed more memory, and more memory means the garbage collector has to kick in more often.
And the biggest problem with them was they only did load testing before major releases but never between the individual sprints. So they actually had to go back six sprints, which in their case was 12 weeks, so three months.
Yeah, three months.
They had to go back three months and retest every single sprint build to figure out where the change actually came in.
And I'm sure there was a project manager somewhere saying, isn't there a way we can get this done quicker?
Exactly.
They had to do two weeks of retesting, and that obviously delayed the release.
But what they learned is, hey, A, we need to do this continuously to come back to our
continuous delivery.
So we need to run some of these tests, maybe not the full-blown tests, but we need to run
some performance tests that look at response time, but also resource consumption on a regular,
the best would be build-to-build basis, at least sprint build-to-sprint build basis
so that we can automatically identify regressions.
And regressions are now no longer just on response time or failure rate,
but on all of these resource consumption metrics as well.
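A rough sketch of what such an automated regression gate could look like; the metric maps, the metric names, and the 10% threshold are illustrative assumptions, not a prescribed implementation:

```java
import java.util.Map;

public class RegressionCheck {
    // Flag any metric (responseTimeMs, cpuPercent, dbStatements, ...) that
    // regressed more than 10% against the previous build's baseline, so the
    // gate covers resource consumption, not just response time.
    static boolean passes(Map<String, Double> baseline, Map<String, Double> current) {
        boolean ok = true;
        for (Map.Entry<String, Double> e : baseline.entrySet()) {
            double before = e.getValue();
            double after = current.getOrDefault(e.getKey(), before);
            if (after > before * 1.10) {
                System.err.printf("REGRESSION %s: %.1f -> %.1f%n",
                        e.getKey(), before, after);
                ok = false;
            }
        }
        return ok; // fail the build on regression, per sprint or per build
    }
}
```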
Right.
And some of those resources you're mentioning there then would be items like number of items logged, right?
Obviously, you can manually comb through your log files, but what I've seen in my own
personal experience in the past is you don't really look at the log files until you have
an issue.
And then you see all these entries in the log file and then you're like, oh my gosh,
it's all here in the logs.
There's so many problems.
But then when you go back a couple of releases and more and more releases, oh, these log
entries have existed forever.
So tracking the number of items logged, not by looking at a log, but actually counting those numbers from release to release is a very important metric.
Because your numbers should either stay the same or go down.
You shouldn't be seeing them go up.
You mentioned garbage collection, right? So another metric to be keying in on is time in garbage collection, the number of GCs, and taking a look at some of your heap utilization graphs.
A lot of times visually that can help you see what your memory is doing.
Is it having healthy GC activity where you're seeing that kind of sawtooth pattern where it's collecting, but you're also not seeing a large impact
to your transaction performance during those collections?
Or is the heap getting really large before those collections happen, really impacting the performance?
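For reference, the JVM exposes exactly those numbers through the standard GarbageCollectorMXBeans; a minimal sketch of reading cumulative GC counts and times:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // One bean per collector (e.g. young and old generation); the values
        // are cumulative, so sample before and after a test run and diff them.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```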
What other metrics would you say in that situation
besides looking at maybe some of the GC activity and logs?
I want to add to the logging point, because I agree with you. You should look at the total number of log statements, but I also always like to look at the category. Log statements typically have a severity: fatal, error, warning. So it would be awesome to look at that as well.
And also what I always teach people,
if I'm a performance tester testing an app,
I run the test and at the end of the test, I actually grab the log files and I sit down with the developers and say, hey, look at these log files.
Would this actually help you in a real production environment?
And actually, how many of these – can you mark the log entries that are actually helpful for you?
And typically this turns very quickly into an exercise where, at the end, there are not many log statements left that actually make a lot of sense or that provide a lot of meaning for the engineers.
And then they actually figure out, wow, we're logging just too much crap here.
And even though we have Splunk or Elasticsearch and we can process all this, it's just useless information for the most part. We better not log it, because we're consuming network and we write it out to a shared file system and storage, obviously.
And it just takes time to write, too.
Exactly.
So logging can become a performance problem.
I'm sure, Brian, the two of us have seen logging as the big performance problem itself. I wrote several blogs on how a bad logging strategy was basically the major cause of performance problems, because everything was just logged.
Or sometimes it's stupid mistakes like,
hey, I forgot to choose the right log level.
And then I basically deployed my application
into a high-load environment with debug logging.
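A small sketch of the classic guard that keeps debug output cheap when the level is set correctly, and of why shipping with DEBUG enabled hurts; OrderService, Order, and describeInFullDetail are hypothetical names for illustration:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderService {
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    void process(Order order) {
        // Cheap when DEBUG is off: the guard skips building the expensive
        // message entirely. Deploy with the root level at DEBUG, though, and
        // every one of these fires under load, burning CPU, memory, and disk I/O.
        if (log.isDebugEnabled()) {
            log.debug("Processing order {}", order.describeInFullDetail());
        }
    }
}

// Hypothetical domain class, just to make the sketch self-contained.
class Order {
    String describeInFullDetail() { return "..."; }
}
```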
Oh, I've seen that too many times.
Yeah, I know.
Or still using older versions of logging frameworks that were not yet optimized for heavy concurrent access. I have examples of an old version of Log4j bringing down a WebSphere cluster, because it was just not thread-safe, or the threads were just blocking each other out. So things like that. And you asked what else.
So besides, I would say CPU, memory.
We talked about garbage collection.
We talked about logging.
I think the network is a big thing.
Obviously, we need to look not only at what amount of data we send back to the browser, or whoever the end consumer is.
It might be a rich client app.
But also in a world where we live, where we are heavily moving towards a service-oriented
architecture, we also need to understand how much data is sent back and forth between service
calls.
So if you are testing an app, and let's assume you're testing a login feature on your website, and the login then passes the login token to a backend service that is really doing the authentication and validation, then there's data sent back and forth. So I think you as a performance engineer also need to understand which components talk to which others and how they are triggered. Are they triggered by the login, and are we sending a megabyte of data for every login in our backend system between our services? If that's the case, and login is used by 80% of our users because they have to log in, then we need to factor this in, and this might become a performance issue and actually a resource issue.
And if you think about companies that are moving away from having everything deployed in a physical data center, towards private or public cloud or even hybrid cloud environments, it could be that these services are not sitting on the same box or in the same data center anymore, but might sit continents apart, and then the transfer of megabytes and megabytes always becomes an issue.
Right, because they'll also incur an expense for that, correct, in most cloud services?
Exactly.
Nothing is for free anymore, right?
You pay per usage, and that's going to be a tricky thing.
So that's also why I think performance engineers, again, I think we mentioned this many, many times now, we should not just give a thumbs up or a thumbs down if response time is good, but we should think more about, hey, how is this going to act and react in a real environment?
When we give this out to the wild, does this mean we can actually run this efficiently,
not only from a performance and resource perspective, but also from a monetary perspective?
Right.
And that extends, I think, to the web server. We talked about the web server a little bit in the beginning, but in a lot of the teams I've seen or been part of myself, it kind of seems like, well, there's not much we can do there. It's serving up images, or it's serving up a CSS file, or it's just passing stuff through.
But there are still metrics that have to be looked at on the web server tier, because there are things that can be done to fix those issues, obviously.
And, you know, some of those metrics we were talking before about the browser crashing
and you were just recently talking about throughput on those service layers.
You're going to want to also be looking at the throughput and tracking over time the
throughput of your pages.
You know, so if your homepage is, you know, I don't even know what an average homepage
size is, but let's say it's 800 kilobytes, you know, and then suddenly jumps up to, you
know, two megabytes, obviously that's going to have some kind of impact and you're not
going to necessarily see that unless you're running those tests.
You know, it could be that the organization has put some sort of new advertisement or promotion on the page that's just poorly configured and poorly optimized that everyone thinks, oh, this will be great.
We'll generate so much revenue and traffic from this new piece on our page.
And meanwhile, that might be what ends up destroying the performance of that system.
So looking at your throughput is important in terms of both megabytes and how many requests
you're doing per second. What else would you say is really key on the web server side?
So on the web server side, what I see a lot is modules. In terms of a web server, the request comes in and then it gets passed through different modules that do maybe content validation, compression and decompression, and all that stuff. And you can add modules to it; URL rewriting is a very common one. And you want to make sure that none of these modules is actually impacting performance because you decided to choose a free version of some module that helps you.
And we had this situation actually in our company, where in the very beginning we needed a URL rewrite module. And unfortunately, that URL rewrite module worked pretty well in the low-load environments. But under high load, it had a memory leak, and it also had a synchronization issue across threads. So this third-party module became the performance bottleneck, and we never tested for it, because we thought, well, this is a free module, some other people voted it up on the internet, so it was great, right? And then in production we actually found out it was not that good. So modules, basically everything that is processing the request when it comes in and before it goes out, are very critical.
Another thing is proper caching settings. You mentioned the overloaded websites. The web server can obviously make sure that elements are cached and not always retrieved from disk or from the app server, and the web server can also set appropriate cache headers for the client, for the browser. That's something you should look into. There's always static content, and most of it will be handled by the web server. But for the dynamic requests that go all the way back to the app server, you typically have a connector, some way of, for instance, Apache connecting to JBoss.
So there's a connector in between. And the key thing with the connector is that there's an outgoing and an incoming thread pool. That means you actually have to configure the connector to say, hey, how many concurrent requests am I allowing from Apache to go to Tomcat. And then on the Tomcat side you also need to set the incoming worker thread pool to the same size, because if you allow, let's say, a thousand concurrent requests coming in from Apache to Tomcat, but Tomcat only has 10 threads, then you have a problem.
Also, the other way around.
If you have Tomcat perfectly configured and you allow a thousand concurrent requests,
but the bottleneck is the connector on Apache because you have not configured it to actually
allow a thousand requests through, then this will become the bottleneck.
So that's why it's very important to look at the end-to-end scenario.
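As a sketch of the sizing Andy describes on the Tomcat side, assuming an embedded Tomcat; the property names match Tomcat's standard connector attributes, and the numbers are purely illustrative and would need to match the Apache-side worker limits:

```java
import org.apache.catalina.connector.Connector;
import org.apache.catalina.startup.Tomcat;

public class SizedServer {
    public static void main(String[] args) throws Exception {
        Tomcat tomcat = new Tomcat();
        Connector connector = tomcat.getConnector();
        // Size the worker thread pool to match what the Apache side is allowed
        // to send through; a 1000-request Apache pool feeding a 10-thread
        // Tomcat just moves the bottleneck here instead of removing it.
        connector.setProperty("maxThreads", "1000");
        // Requests beyond the busy threads wait in this accept queue.
        connector.setProperty("acceptCount", "100");
        tomcat.start();
        tomcat.getServer().await();
    }
}
```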
And I think I just want to do a segue now to a big topic for me, which is pools and
queues.
Right.
So I just touched base on the pools and the queues between the web server and the app
server.
So there's some queue in the middle that passes these requests on.
But there's more than that. In an application that talks to other services, you also have connection pools and thread pools when you make external web service calls. So, as I mentioned before with a service-oriented architecture, if you have one JBoss talking to another JBoss, then there's an outgoing and an incoming connection pool, and they need to be properly sized.
Right.
And also properly utilized, right?
Because if I recall correctly from a blog you wrote about the database pool even,
you are allowed to, it's not necessarily bad to be using your entire pool.
What is bad is if getting those connections is running slow,
which means you're holding them too long or they're not being released or utilized quick enough.
Obviously, you're not going to want to run at 100% all the time,
but it's more about how long is it even taking to get those connections
and really a combination of the size and the acquisition time for those.
Exactly, yeah.
So the term that we coined in our product, in Dynatrace, is the acquisition time. We basically measure how long it takes to acquire the next connection out of a pool. And as you said, if you have 100% pool utilization but the acquisition time is actually zero, which means you're perfectly using and consuming your connections but you don't need anything more, then everything is fine. But if you are at 100% and you need more threads that want to consume a connection from that pool, then this time goes up, and this is really what you want to optimize. And I think this is also relevant for performance engineers
if you do some capacity planning. What you need to do is run load against the system, whatever load you expect, and monitor every single pool and queue and figure out, okay, how many messages make it through the pool, how many threads are actually needed, how many connections are needed. Then you can correctly adjust these settings so that your pools and your queues are not becoming a bottleneck.
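A minimal sketch of measuring that acquisition time yourself around any javax.sql.DataSource-backed pool; the 50 ms threshold is an arbitrary example value:

```java
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;

public class PoolProbe {
    // Time how long getConnection() blocks. Near-zero even at 100% pool
    // utilization is healthy; a rising wait means the pool is too small,
    // or connections are held too long before being released.
    static Connection acquireTimed(DataSource pool) throws SQLException {
        long start = System.nanoTime();
        Connection conn = pool.getConnection();
        long waitedMs = (System.nanoTime() - start) / 1_000_000;
        if (waitedMs > 50) {
            System.err.println("Slow pool acquisition: " + waitedMs + " ms");
        }
        return conn;
    }
}
```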
If you think about going to the airport, airport security is always a great example. If you get there at 4 o'clock in the morning, you may only have one line open, because that's enough. But then at 5 or 6 o'clock, more people come, and they open up more lines.
Sometimes.
It depends on the airport, I guess.
But that's basically the analogy here, right? You have to figure out the demand: if there are more people coming in at the front, on your website, their requests will trickle down through the individual tiers, and you have to make sure you have enough room and resources available for the individual phases of the request processing.
Right, and a lot of these pools and threads are available in your JMX metrics. I'm not sure exactly where they are on the .NET side, but they're going to be there, and they're available to set up and monitor.
It's more for anyone listening out there.
It's about finding out which pools are utilized.
So, of course, this is where you have to have those discussions with the development teams to find out which pools are the ones to monitor.
And, in fact, I'm sure if you go to them and say, hey, I want to monitor the pools in my tests, you know, they're going to be like, oh, wow, great. You know, hopefully they'll then be very eager to help you out and
identify which are the ones that you need to look at. Because, you know, to be honest, if you pull
up a JMX browser, it could be overwhelming. You know, there's just a lot of stuff in there. And
if you're not used to doing these kinds of things, you're not going to know where to start. So
definitely start that conversation with your development teams to find out what you're going to monitor, which pools you're going to pull in. And that's also a good time to find out what they think the optimal settings are going to be. Then you can monitor and see how you're running against that. You could even set up your thresholds. If you're going to be using an APM type of tool, or any other tool that is collecting this data, you can set up your thresholds and alerts based on these numbers, so that as your test is running you'll automatically get notified when you start hitting that line. So it'll make identifying an issue there a lot easier.
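For example, a remote JMX query against a Tomcat connector thread pool might look like this sketch; the host, port, and exact ObjectName are assumptions that vary by server version and configuration:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ThreadPoolCheck {
    public static void main(String[] args) throws Exception {
        // Assumes the app server was started with remote JMX enabled on port 9010.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://appserver:9010/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            // Tomcat's HTTP connector pool; name depends on protocol and port.
            ObjectName pool = new ObjectName(
                    "Catalina:type=ThreadPool,name=\"http-nio-8080\"");
            System.out.println("busy=" + conn.getAttribute(pool, "currentThreadsBusy")
                    + " max=" + conn.getAttribute(pool, "maxThreads"));
        }
    }
}
```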
For .NET, since you were saying JMX for Java, that's the clear, obvious choice; for .NET it's typically Windows performance counters. So you have ASP.NET itself, and typically any framework that is using pools or queues will expose this data through performance counters, and you can query it through WMI or things like that.
And as you said, if you're using APM tools,
then typically, at least what we did,
I mean, we can only talk, I guess, about Dynatrace, but I'm sure the other vendors are doing similar things.
We try to automatically pick up the most common metrics for you.
So, for instance, if you install a Dynatrace agent on your Java or your .NET tier,
you automatically get the connection pools.
You automatically get garbage collection, memory network, disk utilization.
You get on the web server how many threads the web server and the app server have.
So that's all there for you.
And so if you run your test, you automatically see where your bottlenecks might be.
And if it is a queue, if it's a pool, if it's a CPU hotspot, if you are writing too many lines to the log file and stuff like that.
Right. And the other reason why it's great to monitor those things is, you know, from experience,
it's a heck of a lot easier to fix a problem caused by just not having enough threads than it is to have something in code.
So it's one of those happy finds.
If you're running a test and your issue is just not enough threads in your pool, that's something really easy to fix. And everyone will be happier
and you won't be the bearer of the dreadful bad news that there's some sort of code rewrite
required. It's just going to be a matter of, hey, we just need to increase the threads and retest, and make sure it can handle that.
I think it may sound very simple to our listeners, but I actually see a lot that nobody ever thinks about what the proper sizing should be. A lot of people actually have no idea, especially when they develop the product, because they don't know the complexity once the software is running. That's why as a performance engineer you're typically the first one to actually face the problem of trying to figure out what the correct sizing of these pools and queues is.
And then you actually need to touch these default settings.
Because most of the times, I mean, Brian, correct me if I'm wrong, but when I see people sending me data,
I typically tell them, well, you run Tomcat and JBoss and ASP.NET on the default settings, right?
That's probably why you cannot sustain a thousand concurrent users on the system because it's all default settings.
Right.
You know, in all fairness,
a lot of times nobody knows what to set them to.
There's a dance that goes on
where the developers, you know,
get the frameworks, they deploy them.
They have no idea what to set it to for a production or some other sized environment, because they don't have the data to feed back in.
So it really then comes into the performance team and maybe, I guess, operations if it gets to that
level, but hopefully it doesn't get to the operations team. But for those other teams to
feed back the data of, well, under this amount of load, these are how many threads we're using,
and these are the bottlenecks. And that then helps the development teams and others figure out what
that tuning is. Because out of the box, you're really not going to know what it is, what to put it into.
So a lot of it comes down to establishing those relationships,
not just with the development team, but also with the operations teams as well,
because they're looking at the application from a completely different point of view.
You know, I know a lot of times back in the days when I worked in the real world,
you know, I always refer to the real world as when I worked for, you know, not an APM company,
but when I actually had releases and was running the performance testing, you know, we would be in there in a meeting
talking about our concerns about the application and the development team's concerns were completely
different than the operations teams because the operations team has to make sure that
thing survives in the real world, whereas the development team has to make sure the
application works, right?
And those are two different kinds of metrics. So it's really about talking to those teams and understanding what their concerns are, what they need to be aware of. And then, being in the middle on the performance side, watching all of them, so that you can answer to operations and to development what's being used, and you can see how that impact changes. You know, if you're running a test and, I don't know, what's a good example?
I mean, an obvious operations one might be like network throughput, right?
Or I guess that would go to the networking team.
But if you notice, oh, the new release is going to add in the performance lab 3% more
network utilization, that's really important for the operations team to know
because they're going to have
to support that level of increase.
And they may already be
at that throttle point
on the network.
To add to that, I'm not sure how we are on timing here. We don't want to take too much time; not waste, hopefully this is time well spent by the people listening to us.
But we talked a lot about metrics. We talked about not only response time, but CPU, memory, network, disk. We talked about the queues and the pools. We also mentioned the database, where there's obviously a database connection pool. What I actually do as one of the first things when I analyze load testing results, when people use any load testing tool, LoadRunner, JMeter, Silk Performer, whatever, together with Dynatrace or a tool like Dynatrace, is look at response time, failure rate, and then, as one of my first metrics, the number of database statements being executed. Because believe it or not, I would say in 80% of the cases where I get involved with analyzing performance problems, it's typically the database.
And it's not that I'm pointing a finger to the database guy now.
That's not the case.
It's not the database is slow, but it's improper database access because they're using frameworks like Hibernate and Spring, and it's misconfigured so that the caching layer is working incorrectly.
The statements that are generated are inefficient. So what I'm always doing, I'm always looking at
how many database statements are actually executed
while I run my load test.
And not only the total number; this is the beauty of APM tools that do transactional tracing, they actually trace every single request end-to-end. You can say, hey, do we have a large number of SQL statements on a single transaction? Do we have outliers? So I typically look at the average number per URL, but also the max, and maybe the 90th percentile. So you can see, hey, interesting, when we crank up the load on the login page, we see that the number of database statements, for whatever reason, dramatically goes up.
And that's unusual.
And that actually causes the major performance impact.
But with this, I automatically know who to go to.
Right.
I know I don't need to go to the DBA, but I need to go to the guy who's responsible
for the data access layer because he's executing so many database queries.
And if I've done this a while, then I know that developers typically don't write their own database access code anymore, but use frameworks like Hibernate and Spring to access this data.
And then I go to these experts and say,
hey, did you change anything?
I mean, what's going on?
And can we look at this?
Because I see 500 database queries when I'm logging in this set of users under load.
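APM tools count this via bytecode instrumentation, but conceptually it is no more than a per-request counter; a hypothetical sketch of the bookkeeping:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SqlCounter {
    // One counter per worker thread, reset at request boundaries.
    private static final ThreadLocal<AtomicInteger> PER_REQUEST =
            ThreadLocal.withInitial(AtomicInteger::new);

    // Hypothetical hook, called from wherever statements are executed
    // (e.g. a wrapping JDBC layer or the data-access code).
    public static void onStatementExecuted() {
        PER_REQUEST.get().incrementAndGet();
    }

    // Called when the request completes; feed the count into per-URL
    // average, max, and percentile reporting across the load test.
    public static int finishRequest() {
        int count = PER_REQUEST.get().get();
        PER_REQUEST.remove();
        return count;
    }
}
```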
And you bring up the database, and this is something I actually wanted to really focus on a little bit today.
You know, I see a lot of your blogs and a lot of your performance clinics analysis,
and it seems one of – this is definitely one of the
biggest issues. And oftentimes what gets brought up is the N plus one query problem. So I just
think we'll probably be bringing it up. I'm sure multiple times over the course of the podcast,
because it just seems to pop up everywhere continually. Can you give a very low-level explanation? Because, you know, even for me, I have only a basic understanding of what it is.
But I think a lot of people out there have a decent kind of conceptual idea,
but just really like in the most simple terms, you know, let everyone know,
you know, what is this great N plus one query issue?
Because it just seems so prevalent.
Yeah.
So what it really is, and to give it in simple terms, and I think I will actually go with
the login example.
So let's assume you have an application and you have a login.
And if you logged in, then later you get an overview of who you are.
You are user Andy and you have purchased that many products with us, like your product catalog
or your previous shopping cart.
So now what you can do as an application developer, you can say, hey, database,
show me all the recent purchases from Andy. And then for each individual one, I ask you separately,
what was the total amount of dollars that Andy spent? So let's assume I bought only one product
in the past.
It means I have one SQL query that tells me, hey, this was the invoice. And then another SQL query
that says, and this is now the total number. If I am a loyal customer and I purchase every week
with you and two years down the line, I have 100 purchases, the same code would first go to the
database and say, hey, database, give me all the purchases that Andy made in the last two years.
You get back a list of 104, if I do the math correctly, two years, 52 weeks.
And then for every week, you make another select statement saying, now give me the total of this invoice and this and this and this.
So that's the N plus one query problem.
Actually, it should be called one plus N, because it starts out with one SQL query, a list of records comes back, and then for every single record you're going back to the database and asking for more details. The reason why this is obviously inefficient is because it's a data-driven problem: the more data you have in the database, the slower it gets. And the reason why it's really bad is because SQL, S-Q-L, the query language, allows us to say, hey, database, give me the total sum of what Andy purchased in the last two years. It's a single query and gives me the same result. The reason, Brian, why we end up with this is because as developers, we tend to use frameworks that shield the complexity of the database. We use Hibernate and Spring.
We model the data through object models.
And then what we do, we say, hey, we need the invoices.
And then we get a collection of objects back.
And then we iterate through the collection of objects in a loop.
And every time I access an object, in the background Hibernate is going off and fetching the next data element.
Now, there's also a reason why this is there, why Hibernate implements it that way or why it gives you the chance because there's a concept of lazy and eager loading.
So maybe I don't know if I need all the data.
Maybe I'll just iterate through the first five.
And then later, I go on with number six to ten.
So that would be like a paging, a pagination thing.
But in most cases, what I see, developers typically don't really know and actually don't even care what these frameworks are doing internally.
Because what they really want to do is they want to build great features on top of these frameworks.
And then they end up with something that actually works pretty well in their local environment.
Because in their local desktops, they don't have any big database.
But if you now push this into your testing environment, and now, Brian, you run your big load tests with a real, hopefully big, database with a lot of data, then you will see these problems.
And the cool thing is you can immediately find these problems by looking at these metrics.
So how many database queries are executed per single request?
Or even better, is the same database statement being called multiple times on a single transaction?
And these are metrics that we as Dynatrace, for instance, spit out automatically.
Once you are in there, in the app, we spit out these metrics.
Same statement being called how many times?
Oh, really?
500 times?
Wow.
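To make the shape concrete, here is a minimal JPA/Hibernate sketch of the pattern; Purchase, LineItem, and the field names are hypothetical stand-ins for Andy's invoice example:

```java
import javax.persistence.*;
import java.math.BigDecimal;
import java.util.List;

@Entity
class LineItem {
    @Id Long id;
    BigDecimal price;
}

@Entity
class Purchase {
    @Id Long id;
    Long customerId;
    @OneToMany // LAZY by default: items load only when first touched
    List<LineItem> items;
}

public class PurchaseTotals {
    // The 1+N shape: one query for the purchase list, then one extra SELECT
    // per purchase as the lazy collection is touched inside the loop.
    // 104 purchases means 105 round trips to the database.
    static BigDecimal totalOnePlusN(EntityManager em, long customerId) {
        List<Purchase> purchases = em.createQuery(
                "select p from Purchase p where p.customerId = :id",
                Purchase.class).setParameter("id", customerId).getResultList();
        BigDecimal sum = BigDecimal.ZERO;
        for (Purchase p : purchases) {
            for (LineItem item : p.items) {   // triggers a SELECT per purchase
                sum = sum.add(item.price);
            }
        }
        return sum;
    }

    // The single-query alternative: let SQL do the aggregation.
    static BigDecimal totalAggregated(EntityManager em, long customerId) {
        return em.createQuery(
                "select sum(i.price) from Purchase p join p.items i "
                        + "where p.customerId = :id",
                BigDecimal.class).setParameter("id", customerId).getSingleResult();
    }
}
```

The second method returns the same total with a single round trip, which is exactly the SQL aggregation Andy describes.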
Yeah. And it also segues to, I guess, a topic we're not really going to delve into, at least not today. But one of the more common pitfalls of performance testing is not testing with proper data. So if, every time you go to run a test,
you set up a whole new batch of fresh users with zero purchases in their history,
you may not see this issue becoming a problem or slowing things down during the course of that release.
But if you have a historical set of users that have been around for a long time,
that have, in your example with the login where someone has 100 purchases or 104 purchases,
you'll be able to see this. So it
really comes down to also using the right data set when you're running those tests.
I know you talk about this a lot, even the same idea with searching. Don't use the same search
query all the time, but don't also have it be 100% random all the time because that's not reality.
So it's that mimicking of reality, which is in some ways, one of the more
difficult arts of setting up a good performance test, because there are just so many variables
you have to put in there. But just wrapping it into that N plus one query issue, or maybe we should start calling it the one plus N query issue: make sure your users aren't fully clean. I know in my performance history, one of the first questions I have always gotten from a developer when I've brought up an issue like this was, oh, well, what's that user? How long has that user been in the system? Their profile might be corrupted from too many releases and changes. And that's a battle where you have to stick to your guns; you've got to look deeper into what the issue is before you just dismiss that user as maybe someone whose profile got corrupted over time.
Yeah, wow.
So I guess that was a lot of Performance 101, hopefully, huh?
Yeah.
For our listeners.
So shall we maybe close by saying, hey, these are the metrics that we want people to look at besides response time and throughput, just to round it up?
Yeah, and I'm sure wherever this is posted, there'll be a comment section. So if you have any questions about some of these metrics Andy's going to be talking about here quickly,
please feel free to put that in the comments.
Or if you have any other of your favorites that you'd like to include as well, let us know.
So my metrics, or I guess both of our metrics,
we mentioned it many, many times.
So obviously we start with response time.
We start with throughput.
We look at failure rates.
That's obviously clear.
But then look at the main or the basic resource consumption metrics
of your tier.
And with every tier, I mean your web servers
and your app servers and your database.
And I'm talking about
CPU utilization.
I talk about memory,
not only total memory,
but Brian, you mentioned it,
breaking the memory for Java
and.NET down
into the individual heap spaces,
looking at garbage collection
and not only how often
the garbage collector kicks in,
but also the impact
of the garbage collector
on the runtime of transactions.
So there's an option to actually see how long the garbage collector suspends the runtime when it's collecting the garbage.
So for major garbage collection runs, that's what happens.
Actually, it stops the world.
Look at storage requirements.
So how often do you write to disk?
We mentioned log files.
How many log statements are written to disk?
I will also propose breaking the log statements up into severity groups,
warning, info, fatal, severe.
Then also look at pools and queues.
And there's a lot of pools and queues out there.
There's thread pools.
There's connection pools.
There is message queues.
And what else do we have here? There are the queues between the incoming requests on the web server, from web server to app server, from app server to the next app server, and from app server to the database.
So these are all things you need to consider.
And there are always JMX metrics, performance counters, SNMP; there are a lot of different ways you can capture these metrics.
And the last thing we touched upon was the database activity itself.
So how many database statements are executed per single transaction?
Are the same database statements executed
per single transaction?
And actually, if the same database statement,
a single one that is returning some static data
is constantly executed,
then this might also be a good candidate for caching.
But this opens a whole other door, probably for a follow-up discussion, on how we decide which data that we normally pull from the database should be cached on different application layers.
But yeah, I think if you do all that, if you look at the page weight of your site, if you at least make yourself familiar with the basics of web performance optimization and what a good page should look like, and then read up on the architecture, the components, the types of servers you have in your environment. If you read up a little bit on Java, .NET, PHP, Node, all of these technologies, then I'm sure you will soon find a lot of people out there that have a lot of additional metrics to add.
And if you don't have the tools yet, look into APM tools,
so Application Performance Monitoring, Application Performance Management tools.
And if you don't know which tool to choose, I mean, the good thing is all the APM tool vendors nowadays
offer free trial versions, so you can register for free.
You get a free version, you can test it, and then pick the tool that makes most sense for you, I would say.
And if you want to look at the Dynatrace tool, then just search for Dynatrace free trial,
Google or Bing it, whatever you want.
I'm sure you'll find it.
Bing it.
Yeah.
You like Bing more than Google?
I can't remember the last time I used Bing.
Okay.
But I can say that I like Google.
It seems to follow me around everywhere.
Yeah.
Did I miss anything, Brian?
Well, the only one which is, I guess, another kind of a deeper topic is, again, if your APM tool allows it,
looking at the time spent in your different layers of the application, right?
So which APIs are you spending your time in and how much time you're spending in each API.
And I think that's something we'll probably wrap into a lot of other discussions in the future. But that's, you know, how much time is being spent in your JDBC code execution, or your own internal code versus Java native code, or things like that.
The famous Dynatrace layer breakdown chart, huh?
Yeah, it's one of my favorites.
Yeah, me too, actually.
All right, well, I guess that's it for this episode. Thank everyone for listening. Any final thoughts?
Well, final thoughts are, I think we are on the brink of a change in our industry. I mean, it sounds dramatic, and I think it is dramatic.
Because what we'll see is that we have to build software that is not only performing well but that scales well. And we know that if you are fortunate enough to work for a company that had a great idea and built software for it, and then it gets promoted through different channels, maybe somebody accidentally picks it up and tweets it or puts it on a television show, then the best software doesn't help you if it doesn't scale and perform. So I think that's also why, as performance engineers, you need to look at these scalability metrics, and scalability really comes down to how these different components communicate with each other.
And if you're just doing your job as you've done it the last 10, 15, 20 years where we just executed load tests and then generated a PDF report,
I think that doesn't cut it anymore, to be honest with you.
So I think we all need to level up.
And we need to, as bad as it sounds, do more with less. That's one of these terms, right?
And we need to automate a lot.
So you also have to think a lot about how you can automate the test execution into your continuous delivery cycle,
how you can automatically trigger these tests, how you can automatically capture these metrics and analyze them.
The perfect thing would be, that's also what we at Dynatrace do,
we baseline these metrics from build to build and from test to test and then automatically tell you, hey, based on the current results, something has changed significantly.
Now the number of database calls goes up like crazy, or you're running out of connections in the pool.
So you may want to stop that build because it's not going to be nice when you deploy it.
Okay.
Excellent.
All right.
Well, hopefully you'll all be back for our next podcast.
And what is the topic for the next one?
Well, that's actually also a good point: maybe you want to add comments on topics that you want to hear about. But what we have on the list is: what is a load test versus a performance test versus a stress test?
I mean there's different terminology around.
And Brian and I, we both have worked many, many years in load testing.
So I guess maybe we have a different definition even because we came up with our own definition.
So we'll see.
We want to enlighten the people on our definition of what is a load test, a performance test, a stress test.
And we'll talk about this and how you can use them and where they should be used.
Great. Excellent. And as Andy said, if you have any ideas or topics you'd like to hear us discuss, please add those into the comments as well.
I'm sure soon enough we'll have a way that you can send an email or something else as well.
But right now we don't know how that's all going to be set up.
So thank you all for listening and letting us take this approximately 50 minutes of your day.
We hope you enjoyed it.
And I guess we'll see you next time.
Thank you. Bye-bye.
Bye.