PurePerformance - 001 Performance 101: Key Metrics beyond Response Time and Throughput
Episode Date: May 5, 2016

If you are running load tests it is not enough to just look at response time and throughput. As a performance engineer you also have to look at key components that impact performance: CPU, memory, network and disk utilization should be obvious. Connection pools (database, web service), thread pools and message queues have to be part of monitoring as well. On top of that you want to understand how the individual components that you test (frontend server, backend services, database, middleware, ...) communicate with each other. You need to identify any communication bottlenecks caused by too-chatty components (how many calls between tiers) and too-heavyweight conversations (bandwidth requirements). Listen to this podcast and learn which metrics you should look at while running your load test. As a performance engineer you should not only report that the app is slow under a certain load but also give recommendations on which components are to blame.
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance. Thank you for joining us.
I am host number one, Brian Wilson, and with me is...
Well, host number two, Andy Grabner. Thanks, Brian, for doing all this.
And I think we can actually get this kicked off.
And shall we tell the audience what this is all about, what they actually downloaded?
Yeah, so the name of this podcast is Pure Performance, and I guess that means it is about performance, purely, in all shapes and sizes.
Exactly. Well, maybe not the performance that some people think about when they just see the word performance on its own. We're talking mainly about software performance, application performance.
Right. The two of us have been working in the performance field of software engineering for many, many years. I'm not sure, how long have you actually been in the field?
Before I joined the APM side, I was about 10 years in it, and a total of about maybe 15 now, including having gone APM.
Yeah, that's about the same for me.
I mean, actually, you mentioned APM, maybe a term that we need to explain. It stands for Application Performance Monitoring or Application Performance Management, whatever you want to call it. And we both have been colleagues for the last couple of years, working for a company called Dynatrace, but obviously all our lives we have cared about performance.
And in my previous life I did load testing, actually, for the first eight years of my career.
And I think that's also where I got a lot of input on what to do and also what not to do. One of the things we want to cover in this first episode is Performance 101: some key metrics beyond response time and throughput, which I believe many people think are the main things we are interested in when we do performance testing. And they are, but they should not be the only thing you are reporting back to your managers, saying, hey, here's the report. I just ran this LoadRunner or JMeter or SoapUI script, and based on my numbers, based on this PDF report, I can see that response time is up by 25%. I think that alone doesn't cut it anymore.
Right. It's funny you mentioned that. When I first started in performance, it was very much
that way.
In fact, one of the things, so funny background on me is I was a communication major.
I was going to be some filmmaker and everything until I gave that up and got into the IT industry.
But one of the things that appealed to me about the performance testing side of things was,
this is back in the early days, maybe around 2001, where a lot of the role of performance engineers was simply running a test and handing off a set of results for the engineers to figure out.
And without too much computer background, except having tinkered and built a computer and all that kind of stuff, I thought, oh, that's perfect.
I don't have to know too much.
I can just hand it off, right?
And I quickly learned that that was not going to cut it at all.
And that then started my desire to keep learning more and more and more.
But back in the day, we really were stuck with such a small set of metrics that we can hand off.
We, you know, you could look and say, Hey, the CPU increased, you know, and we also see some,
you know, maybe, you know, I was doing a lot of.NET testing. So we'd see a couple of items in the.NET queue.
And someone would ask, well, which happened first?
And what's causing that?
And, you know, we had no way to know.
So it was great when the APM tools came along; you started really being able to expand that. And I think part of what we're going to cover today, going into what you're saying there,
is those days of limited view,
beyond the fact that they're just not necessary anymore,
I don't think they're excusable anymore.
If you're not looking further
and you're not taking a look at the deeper metrics
to understand what's behind it
and be able to hand off actionable data,
then you're kind of shirking your responsibility as a performance tester.
And I really hope from today's podcast we can help people who have kind of been stuck, or maybe haven't been so aware of some of these other metrics that are available, so they can really up their game and start making a splash and getting themselves noticed as a great performer in whatever company they're working with.
Yeah, well put.
Yeah, so what I still see, though I think the times are changing now, is that when people look at load testing, the first thing they ask is: can I really create a report that I can then pass over to somebody else?
And I think this is actually really strange because if you are a performance engineer
and the only thing you care about is how you can record and replay some scripts and then
hand off the data, then guess what?
You won't have a job for much longer, I believe. Because honestly, we're moving towards automation, everything can be automated, and there's a lot of cheap labor out there in the world. So you are on the brink of being outsourced soon.
So that's why I believe everyone that is serious about performance management needs to level up and needs to understand what they're looking at.
But they also need to execute more tests in less time, testing more software.
But then the analysis part becomes very critical because we need to find faster answers to what the problem really is.
And that's why I also believe, and I totally agree with you, that tools in the APM space or
the application performance management space will help. And I think to the audience, I mean,
obviously we both now work for an APM vendor. We're called Dynatrace and you can look us up,
but basically the thoughts that we are trying to tell you here and the best practices or our
know-how that we have can be applied with any type of tool out there.
We really want to get you just some more additional thoughts,
food for thought on how you can improve your job in becoming a better performance engineer, right?
Yep, absolutely.
It's all about, as we keep saying, leveling up.
It's a term I think we just use over and over again.
But performance was, for a long time, a second-hand thought at most organizations, and they usually give it their first thought when they have their first performance outage in production.
And then they expect everything to work quickly.
And often, as was the case even in my own career, they just grab somebody and throw them into performance without them having too much knowledge.
And it's really great when you can get some help from people.
I've had some wonderful mentors in my time that really helped me get ahead in the game.
And hopefully some of the information we can impart here will help you all get ahead in where you're going.
But more importantly, it's about making the Internet a better place.
Because nobody likes bad performance.
Nobody likes going on a website that crashes.
So many times I go onto a mobile site on my phone and it crashes the browser, the mobile browser, still today.
So we're all working together
to make the internet a better place for everybody.
Which brings me to one thing. I mean, we didn't talk about this, but the reason why your browser probably crashes is because the JavaScript that runs on that page, or the content that is loaded, is probably too big. So one of the other metrics to look at is how many resources we are actually wasting or consuming for certain use cases. So if you are a performance engineer, you should have a good feeling about: hey, are we loading 100 kilobytes of data? Are we loading 10 megabytes of data? Are we loading very heavy JavaScript? Are we allocating more memory on the heap?
And especially, I'm not sure which mobile phone you have,
but I mean, I'm still stuck with my old iPhone 4S.
And it's not the fastest anymore.
And that means that for me, the browser crashes probably even faster than yours
because my OS here and also my CPU and hardware is just not as good anymore.
And therefore, apps that may work well on the iPhone for my girlfriend,
they don't really work well on my machine anymore.
So I think as performance testers, we also need to understand the different environments that people will use our software in.
We cannot assume that everybody has the latest and greatest hardware and technology.
We cannot assume that we all have great network connectivity and low latency.
We cannot assume we are located in a geographical region where the CDNs that we may use, the content delivery networks, are very close to us.
So that's why we also need to factor all of that in as well, I believe.
So resources, I think, is a big metric.
And I'm always looking at resources.
So when I run, and Brian, correct me if you do anything else, but if I look at load testing
results, and especially if I compare results from, let's say, the
previous build to this build, not only do I look at response time and failure rate,
which make obviously a lot of sense, but I look at CPU, memory, network, and storage
as well.
Why?
Because I want to know, hey, we still keep the same response time, but now I see CPU
increased by 20% on our web servers, by 15% on our app servers.
And even though we're still good
from a performance perspective
to the end user, let's say, that way,
we're now consuming that many more cycles,
which means if we deploy this
and if you want to scale it up,
we need to probably buy more hardware
or put more money into Amazon
to buy more virtual resources.
Right.
Or figure out why
that CPU has gone up.
Exactly. That's an engineering approach, right? We should not just accept the fact and basically update our data for capacity planning, but really say, hey, feature-wise everything stayed the same, response time stayed the same, but CPU and memory went up.
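As an illustration, here is a minimal sketch of grabbing those resource numbers programmatically from the JVM while a test runs. It uses the standard MXBeans; the process-CPU reading assumes a HotSpot-style JVM (the com.sun.management cast), and the five-second interval is arbitrary:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import com.sun.management.OperatingSystemMXBean;

public class ResourceSnapshot {
    public static void main(String[] args) throws InterruptedException {
        // HotSpot-specific subinterface that exposes process CPU load.
        OperatingSystemMXBean os = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        while (true) {
            MemoryUsage heap = mem.getHeapMemoryUsage();
            // Log alongside the load test so response time regressions can be
            // correlated with CPU and heap growth build over build.
            System.out.printf("cpu=%.1f%% heapUsed=%dMB heapMax=%dMB%n",
                    os.getProcessCpuLoad() * 100,
                    heap.getUsed() / (1024 * 1024),
                    heap.getMax() / (1024 * 1024));
            Thread.sleep(5000);
        }
    }
}
```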
You mentioned you started in the .NET space, or did a lot of testing in .NET. I recently worked with an app, and what they did was update the dependency injection library.
So dependency injection allows developers to, after the fact,
while the application is running, actually inject some code to do certain tasks.
It's very commonly used for logging.
So instead of developers putting in their own log statements,
they can use a dependency injection library to say,
I want to log a message every time these types of methods are called.
That's dependency injection.
And what they did over the course of the lifetime of the app,
they updated this dependency injection library from version to version,
which makes a lot of sense if you want to stay up to date.
Unfortunately, at one point, they were installing a library that had a bug.
And this library now actually injected many more instances of their custom code into the code base, and it also had a memory leak.
So the actual result was that they were logging more data than ever before, and a lot of duplicated data. And they were also consuming more CPU, because they consumed more memory, and more memory means the garbage collector has to kick in more often.
And the biggest problem with them was they only did load testing before major releases but never between the individual sprints. So they actually had to go back six sprints, which in their case was 12 weeks, so three months.
Yeah, three months.
They had to go back three months and retest every single sprint build to figure out where the change actually came in.
And I'm sure there was a project manager somewhere saying, isn't there a way we can get this done quicker?
Exactly.
They had to do two weeks of retesting, and that obviously delayed the release.
But what they learned is, hey, A, we need to do this continuously to come back to our
continuous delivery.
So we need to run some of these tests, maybe not the full-blown tests, but we need to run
some performance tests that look at response time, but also resource consumption on a regular,
the best would be build-to-build basis, at least sprint build-to-sprint build basis
so that we can automatically identify regressions.
And regressions are now no longer just on response time or failure rate,
but on all of these resource consumption metrics as well.
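A rough sketch of what such an automated regression gate could look like; the metric maps, the metric names, and the 10% threshold are illustrative assumptions, not a prescribed implementation:

```java
import java.util.Map;

public class RegressionCheck {
    // Flag any metric (responseTimeMs, cpuPercent, dbStatements, ...) that
    // regressed more than 10% against the previous build's baseline, so the
    // gate covers resource consumption, not just response time.
    static boolean passes(Map<String, Double> baseline, Map<String, Double> current) {
        boolean ok = true;
        for (Map.Entry<String, Double> e : baseline.entrySet()) {
            double before = e.getValue();
            double after = current.getOrDefault(e.getKey(), before);
            if (after > before * 1.10) {
                System.err.printf("REGRESSION %s: %.1f -> %.1f%n",
                        e.getKey(), before, after);
                ok = false;
            }
        }
        return ok; // fail the build on regression, per sprint or per build
    }
}
```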
Right.
And some of those resources you're mentioning there then would be items like number of items logged, right?
Obviously, you can manually comb through your log files, but what I've seen in my own
personal experience in the past is you don't really look at the log files until you have
an issue.
And then you see all these entries in the log file and then you're like, oh my gosh,
it's all here in the logs.
There's so many problems.
But then when you go back a couple of releases and more and more releases, oh, these log
entries have existed forever.
So tracking the number of items logged, not by looking at a log, but actually counting those numbers from release to release is a very important metric.
Because your numbers should either stay the same or go down.
You shouldn't be seeing them go up.
You mentioned garbage collection, right? So another metric to be keying in on is time in garbage collection, the number of GCs, and taking a look at some of your heap utilization graphs.
A lot of times visually that can help you see what your memory is doing.
Is it having healthy GC activity where you're seeing that kind of sawtooth pattern where it's collecting, but you're also not seeing a large impact
to your transaction performance during those collections?
Or is the heap getting really large before those collections happen, really impacting the performance?
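For reference, the JVM exposes exactly those numbers through the standard GarbageCollectorMXBeans; a minimal sketch of reading cumulative GC counts and times:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // One bean per collector (e.g. young and old generation); the values
        // are cumulative, so sample before and after a test run and diff them.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```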
What other metrics would you say in that situation
besides looking at maybe some of the GC activity and logs?
I want to add to the logging point, because I agree with you. You should look at the total number of log statements, but I also always like to look at the category. Log statements typically have a severity: fatal, error, warning. So it would be awesome to look at that as well.
And also what I always teach people,
if I'm a performance tester testing an app,
I run the test and at the end of the test, I actually grab the log files and I sit down with the developers and say, hey, look at these log files.
Would this actually help you in a real production environment?
And actually, how many of these – can you mark the log entries that are actually helpful for you?
And typically this turns very quickly into an exercise where, at the end, there are not many log statements left that actually make a lot of sense or that provide a lot of meaning for the engineers.
And then they actually figure out, wow, we're logging just too much crap here.
And even though we have Splunk or Elasticsearch and we can process all this, it's just useless information for the most part. We better not log it, because we're consuming network and we write it out to a shared file system and storage, obviously.
And it just takes time to write, too.
Exactly.
So logging can become a performance problem.
I'm sure, Brian, the two of us have seen logging as the big performance problem itself. I wrote several blogs on how a bad logging strategy was basically the major cause of performance problems, because everything was just logged.
Or sometimes it's stupid mistakes like,
hey, I forgot to choose the right log level.
And then I basically deployed my application
into a high-load environment with debug logging.
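A small sketch of the classic guard that keeps debug output cheap when the level is set correctly, and of why shipping with DEBUG enabled hurts; OrderService, Order, and describeInFullDetail are hypothetical names for illustration:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderService {
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    void process(Order order) {
        // Cheap when DEBUG is off: the guard skips building the expensive
        // message entirely. Deploy with the root level at DEBUG, though, and
        // every one of these fires under load, burning CPU, memory, and disk I/O.
        if (log.isDebugEnabled()) {
            log.debug("Processing order {}", order.describeInFullDetail());
        }
    }
}

// Hypothetical domain class, just to make the sketch self-contained.
class Order {
    String describeInFullDetail() { return "..."; }
}
```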
Oh, I've seen that too many times.
Yeah, I know.
Or still using older versions of logging frameworks that were not yet optimized for heavy concurrent access. I have examples of an old version of Log4j bringing down a WebSphere cluster, because it was just not thread-safe, or the threads were just blocking each other out. So things like that. And you asked what else.
So besides, I would say CPU, memory.
We talked about garbage collection.
We talked about logging.
I think the network is a big thing.
Obviously, we need to look not only at what amount of data we send back to the browser, or whoever the end consumer is.
It might be a rich client app.
But also in a world where we live, where we are heavily moving towards a service-oriented
architecture, we also need to understand how much data is sent back and forth between service
calls.
So if you are testing an app, and let's assume you're testing a login feature on your website, and the login then passes the login token to a backend service that is really doing the authentication and validation, then there's data sent back and forth. So I think you as a performance engineer also need to understand which components talk to which others and how they are triggered. Are they triggered by the login, and are we sending a megabyte of data for every login in our backend system between our services? If that's the case, and login is used by 80% of our users because they have to log in, then we need to factor this in, and this might become a performance issue and actually a resource issue.
And if you think about companies that are moving away from having everything deployed in a physical data center, towards private or public cloud or even hybrid cloud environments, it could be that these services are not sitting on the same box or in the same data center anymore, but might sit continents apart, and then the transfer of megabytes and megabytes always becomes an issue.
Right, because they'll also incur an expense for that, correct, in most cloud services?
Exactly.
Nothing is for free anymore, right?
You pay per usage, and that's going to be a tricky thing.
So that's also why I think performance engineers, again, I think we mentioned this many, many times now, we should not just give a thumbs up or a thumbs down if response time is good, but we should think more about, hey, how is this going to act and react in a real environment?
When we give this out to the wild, does this mean we can actually run this efficiently,
not only from a performance and resource perspective, but also from a monetary perspective?
Right.
And that extends, I think, to the web server. We talked about the web server a little bit in the beginning, but in a lot of the teams I've seen or been part of myself, it kind of seems like, well, there's not much we can do there. It's serving up images, or it's serving up a CSS file, or it's just passing stuff through.
But there are still metrics that have to be looked at on the web server tier, because there are things that can be done to fix those issues, obviously.
And, you know, some of those metrics we were talking before about the browser crashing
and you were just recently talking about throughput on those service layers.
You're going to want to also be looking at the throughput and tracking over time the
throughput of your pages.
You know, so if your homepage is, you know, I don't even know what an average homepage
size is, but let's say it's 800 kilobytes, you know, and then suddenly jumps up to, you
know, two megabytes, obviously that's going to have some kind of impact and you're not
going to necessarily see that unless you're running those tests.
You know, it could be that the organization has put some sort of new advertisement or promotion on the page that's just poorly configured and poorly optimized that everyone thinks, oh, this will be great.
We'll generate so much revenue and traffic from this new piece on our page.
And meanwhile, that might be what ends up destroying the performance of that system.
So looking at your throughput is important in terms of both megabytes and how many requests
you're doing per second. What else would you say is really key on the web server side?
So on the web server side, what I see a lot is modules. In terms of a web server, the request comes in and then it gets passed through different modules that do maybe content validation, compression and decompression, and all that stuff. And you can add modules to it; URL rewriting is a very common one. And you want to make sure that none of these modules is actually impacting performance because you decided to choose a free version of some module that helps you.
And we had this situation actually in our company, where in the very beginning we needed a URL rewrite module. And unfortunately, that URL rewrite module worked pretty well in the low-load environments. But under high load, it had a memory leak, and it also had a synchronization issue across threads. So this third-party module became the performance bottleneck, and we never tested for it, because we thought, well, this is a free module, some other people voted it up on the internet, so it was great, right? And then in production we actually found out it was not that good. So modules, basically everything that is processing the request when it comes in and before it goes out, are very critical.
Another thing is proper caching settings. You mentioned the overloaded websites. The web server can obviously make sure that elements are cached and not always retrieved from disk or from the app server, and the web server can also set appropriate cache headers for the client, for the browser. That's something you should look into. There's always static content, and most of it will be handled by the web server. But for the dynamic requests that go all the way back to the app server, you typically have a connector, some way of, for instance, Apache connecting to JBoss.
So there's a connector in between. And the key thing with the connector is that there's an outgoing and an incoming thread pool. That means you actually have to configure the connector to say, hey, how many concurrent requests am I allowing from Apache to go to Tomcat. And then on the Tomcat side you also need to set the incoming worker thread pool to the same size, because if you allow, let's say, a thousand concurrent requests coming in from Apache to Tomcat, but Tomcat only has 10 threads, then you have a problem.
Also, the other way around.
If you have Tomcat perfectly configured and you allow a thousand concurrent requests,
but the bottleneck is the connector on Apache because you have not configured it to actually
allow a thousand requests through, then this will become the bottleneck.
So that's why it's very important to look at the end-to-end scenario.
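As a sketch of the sizing Andy describes on the Tomcat side, assuming an embedded Tomcat; the property names match Tomcat's standard connector attributes, and the numbers are purely illustrative and would need to match the Apache-side worker limits:

```java
import org.apache.catalina.connector.Connector;
import org.apache.catalina.startup.Tomcat;

public class SizedServer {
    public static void main(String[] args) throws Exception {
        Tomcat tomcat = new Tomcat();
        Connector connector = tomcat.getConnector();
        // Size the worker thread pool to match what the Apache side is allowed
        // to send through; a 1000-request Apache pool feeding a 10-thread
        // Tomcat just moves the bottleneck here instead of removing it.
        connector.setProperty("maxThreads", "1000");
        // Requests beyond the busy threads wait in this accept queue.
        connector.setProperty("acceptCount", "100");
        tomcat.start();
        tomcat.getServer().await();
    }
}
```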
And I think I just want to do a segue now to a big topic for me, which is pools and
queues.
Right.
So I just touched base on the pools and the queues between the web server and the app
server.
So there's some queue in the middle that passes these requests on.
But there's more than that. In an application that talks to other services, you also have connection pools and thread pools when you make external web service calls. So, as I mentioned before with a service-oriented architecture, if you have one JBoss talking to another JBoss, then there's an outgoing and an incoming connection pool, and they need to be properly sized.
Right.
And also properly utilized, right?
Because if I recall correctly from a blog you wrote about the database pool even,
you are allowed to, it's not necessarily bad to be using your entire pool.
What is bad is if getting those connections is running slow,
which means you're holding them too long or they're not being released or utilized quick enough.
Obviously, you're not going to want to run at 100% all the time,
but it's more about how long is it even taking to get those connections
and really a combination of the size and the acquisition time for those.
Exactly, yeah.
So the term that we coined in our product, in Dynatrace, is the acquisition time. We basically measure how long it takes to acquire the next connection out of a pool. And as you said, if you have 100% pool utilization but the acquisition time is actually zero, which means you're perfectly using and consuming your connections but you don't need anything more, then everything is fine. But if you are at 100% and you need more threads that want to consume a connection from that pool, then this time goes up, and this is really what you want to optimize. And I think this is also relevant for performance engineers
if you do some capacity planning. What you need to do is run load against the system, whatever load you expect, and monitor every single pool and queue and figure out, okay, how many messages make it through the pool, how many threads are actually needed, how many connections are needed. Then you can correctly adjust these settings so that your pools and your queues are not becoming a bottleneck.
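A minimal sketch of measuring that acquisition time yourself around any javax.sql.DataSource-backed pool; the 50 ms threshold is an arbitrary example value:

```java
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;

public class PoolProbe {
    // Time how long getConnection() blocks. Near-zero even at 100% pool
    // utilization is healthy; a rising wait means the pool is too small,
    // or connections are held too long before being released.
    static Connection acquireTimed(DataSource pool) throws SQLException {
        long start = System.nanoTime();
        Connection conn = pool.getConnection();
        long waitedMs = (System.nanoTime() - start) / 1_000_000;
        if (waitedMs > 50) {
            System.err.println("Slow pool acquisition: " + waitedMs + " ms");
        }
        return conn;
    }
}
```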
If you think about going to the airport, airport security is always a great example. If you get there at 4 o'clock in the morning, you may only have one line open, because that's enough. But then at 5 or 6 o'clock, more people come, and they open up more lines.
Sometimes.
It depends on the airport, I guess.
But that's basically the analogy here, right? You have to figure out the demand: if there are more people coming in at the front, on your website, their requests will trickle down through the individual tiers, and you have to make sure you have enough room and resources available for the individual phases of the request processing.
Right, and a lot of these pools and threads are available in your JMX metrics. I'm not sure exactly where they are on the .NET side, but they're going to be there, and they're available to set up and monitor.
It's more for anyone listening out there.
It's about finding out which pools are utilized.
So, of course, this is where you have to have those discussions with the development teams to find out which pools are the ones to monitor.
And, in fact, I'm sure if you go to them and say, hey, I want to monitor the pools in my tests, you know, they're going to be like, oh, wow, great. You know, hopefully they'll then be very eager to help you out and
identify which are the ones that you need to look at. Because, you know, to be honest, if you pull
up a JMX browser, it could be overwhelming. You know, there's just a lot of stuff in there. And
if you're not used to doing these kinds of things, you're not going to know where to start. So
definitely start that conversation with your development teams to find out what you're going to monitor, which pools you're going to pull in. And that's also a good time to find out what they think the optimal settings are going to be. Then you can monitor and see how you're running against that. You could even set up your thresholds. If you're going to be using an APM type of tool, or any other tool that is collecting this data, you can set up your thresholds and alerts based on these numbers, so that as your test is running you'll automatically get notified when you start hitting that line. So it'll make identifying an issue there a lot easier.
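For example, a remote JMX query against a Tomcat connector thread pool might look like this sketch; the host, port, and exact ObjectName are assumptions that vary by server version and configuration:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ThreadPoolCheck {
    public static void main(String[] args) throws Exception {
        // Assumes the app server was started with remote JMX enabled on port 9010.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://appserver:9010/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            // Tomcat's HTTP connector pool; name depends on protocol and port.
            ObjectName pool = new ObjectName(
                    "Catalina:type=ThreadPool,name=\"http-nio-8080\"");
            System.out.println("busy=" + conn.getAttribute(pool, "currentThreadsBusy")
                    + " max=" + conn.getAttribute(pool, "maxThreads"));
        }
    }
}
```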
For .NET, since you were saying JMX for Java, that's the clear, obvious choice; for .NET it's typically Windows performance counters. So you have ASP.NET itself, and typically any framework that is using pools or queues will expose this data through performance counters, and you can query it through WMI or things like that.
And as you said, if you're using APM tools,
then typically, at least what we did,
I mean, we can only talk, I guess, about Dynatrace, but I'm sure the other vendors are doing similar things.
We try to automatically pick up the most common metrics for you.
So, for instance, if you install a Dynatrace agent on your Java or your .NET tier,
you automatically get the connection pools.
You automatically get garbage collection, memory network, disk utilization.
You get on the web server how many threads the web server and the app server have.
So that's all there for you.
And so if you run your test, you automatically see where your bottlenecks might be.
And if it is a queue, if it's a pool, if it's a CPU hotspot, if you are writing too many lines to the log file and stuff like that.
Right. And the other reason why it's great to monitor those things is, you know, from experience,
it's a heck of a lot easier to fix a problem caused by just not having enough threads than it is to have something in code.
So it's one of those happy finds.
If you're running a test and your issue is just not enough threads in your pool, that's something really easy to fix. And everyone will be happier
and you won't be the bearer of the dreadful bad news that there's some sort of code rewrite
required. It's just going to be a matter of, hey, we just need to increase the threads and retest, and make sure it can handle that.
I think it may sound very simple to our listeners, but I actually see a lot that nobody ever thinks about what the proper sizing should be. A lot of people actually have no idea, especially when they develop the product, because they don't know the complexity once the software is running. That's why as a performance engineer you're typically the first one to actually face the problem of trying to figure out what the correct sizing of these pools and queues is.
And then you actually need to touch these default settings.
Because most of the times, I mean, Brian, correct me if I'm wrong, but when I see people sending me data,
I typically tell them, well, you run Tomcat and JBoss and ASP.NET on the default settings, right?
That's probably why you cannot sustain a thousand concurrent users on the system because it's all default settings.
Right.
You know, in all fairness,
a lot of times nobody knows what to set them to.
There's a dance that goes on
where the developers, you know,
get the frameworks, they deploy them.
They have no idea what to set it to for a production or some other sized environment, because they don't have the data to feed back in.
So it really then comes into the performance team and maybe, I guess, operations if it gets to that
level, but hopefully it doesn't get to the operations team. But for those other teams to
feed back the data of, well, under this amount of load, these are how many threads we're using,
and these are the bottlenecks. And that then helps the development teams and others figure out what
that tuning is. Because out of the box, you're really not going to know what it is, what to put it into.
So a lot of it comes down to establishing those relationships,
not just with the development team, but also with the operations teams as well,
because they're looking at the application from a completely different point of view.
You know, I know a lot of times back in the days when I worked in the real world,
you know, I always refer to the real world as when I worked for, you know, not an APM company,
but when I actually had releases and was running the performance testing, you know, we would be in there in a meeting
talking about our concerns about the application and the development team's concerns were completely
different than the operations teams because the operations team has to make sure that
thing survives in the real world, whereas the development team has to make sure the
application works, right?
And those are two different kinds of metrics. So it's really about talking to those teams and understanding what their concerns are, what they need to be aware of. And then, being in the middle on the performance side, watching all of them, so that you can answer to operations and to development what's being used, and you can see how that impact changes. You know, if you're running a test and, I don't know, what's a good example?
I mean, an obvious operations one might be like network throughput, right?
Or I guess that would go to the networking team.
But if you notice, oh, the new release is going to add in the performance lab 3% more
network utilization, that's really important for the operations team to know
because they're going to have
to support that level of increase.
And they may already be
at that throttle point
on the network.
To add to that, I'm not sure how we are on timing here. We don't want to take too much time; not waste, hopefully this is time well spent by the people listening to us.
But we talked a lot about metrics. We talked about not only response time, but CPU, memory, network, disk. We talked about the queues and the pools. We also mentioned the database, where there's obviously a database connection pool. What I actually do as one of the first things when I analyze load testing results, when people use any load testing tool, LoadRunner, JMeter, Silk Performer, whatever, together with Dynatrace or a tool like Dynatrace, is look at response time, failure rate, and then, as one of my first metrics, the number of database statements being executed. Because believe it or not, I would say in 80% of the cases where I get involved with analyzing performance problems, it's typically the database.
And it's not that I'm pointing a finger to the database guy now.
That's not the case.
It's not the database is slow, but it's improper database access because they're using frameworks like Hibernate and Spring, and it's misconfigured so that the caching layer is working incorrectly.
The statements that are generated are inefficient. So what I'm always doing, I'm always looking at
how many database statements are actually executed
while I run my load test.
And not only the total number; this is the beauty of APM tools that do transactional tracing, they actually trace every single request end-to-end. You can say, hey, do we have a large number of SQL statements on a single transaction? Do we have outliers? So I typically look at the average number per URL, but also the max, and maybe the 90th percentile. So you can see, hey, interesting, when we crank up the load on the login page, we see that the number of database statements, for whatever reason, dramatically goes up.
And that's unusual.
And that actually causes the major performance impact.
But with this, I automatically know who to go to.
Right.
I know I don't need to go to the DBA, but I need to go to the guy who's responsible
for the data access layer because he's executing so many database queries.
And if I've done this a while, then I know that developers typically don't write their own database access code anymore, but use frameworks like Hibernate and Spring to access this data.
And then I go to these experts and say,
hey, did you change anything?
I mean, what's going on?
And can we look at this?
Because I see 500 database queries when I'm logging in this set of users under load.
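APM tools count this via bytecode instrumentation, but conceptually it is no more than a per-request counter; a hypothetical sketch of the bookkeeping:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SqlCounter {
    // One counter per worker thread, reset at request boundaries.
    private static final ThreadLocal<AtomicInteger> PER_REQUEST =
            ThreadLocal.withInitial(AtomicInteger::new);

    // Hypothetical hook, called from wherever statements are executed
    // (e.g. a wrapping JDBC layer or the data-access code).
    public static void onStatementExecuted() {
        PER_REQUEST.get().incrementAndGet();
    }

    // Called when the request completes; feed the count into per-URL
    // average, max, and percentile reporting across the load test.
    public static int finishRequest() {
        int count = PER_REQUEST.get().get();
        PER_REQUEST.remove();
        return count;
    }
}
```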
And you bring up the database, and this is something I actually wanted to really focus on a little bit today.
You know, I see a lot of your blogs and a lot of your performance clinics analysis,
and it seems one of – this is definitely one of the
biggest issues. And oftentimes what gets brought up is the N plus one query problem. So I just
think we'll probably be bringing it up. I'm sure multiple times over the course of the podcast,
because it just seems to pop up everywhere continually. Can you give a very low-level explanation? Because, you know, even for me, I have only a basic understanding of what it is.
But I think a lot of people out there have a decent kind of conceptual idea,
but just really like in the most simple terms, you know, let everyone know,
you know, what is this great N plus one query issue?
Because it just seems so prevalent.
Yeah.
So what it really is, and to give it in simple terms, and I think I will actually go with
the login example.
So let's assume you have an application and you have a login.
And if you logged in, then later you get an overview of who you are.
You are user Andy and you have purchased that many products with us, like your product catalog
or your previous shopping cart.
So now what you can do as an application developer, you can say, hey, database,
show me all the recent purchases from Andy. And then for each individual one, I ask you separately,
what was the total amount of dollars that Andy spent? So let's assume I bought only one product
in the past.
It means I have one SQL query that tells me, hey, this was the invoice. And then another SQL query
that says, and this is now the total number. If I am a loyal customer and I purchase every week
with you and two years down the line, I have 100 purchases, the same code would first go to the
database and say, hey, database, give me all the purchases that Andy made in the last two years.
You get back a list of 104, if I do the math correctly, two years, 52 weeks.
And then for every week, you make another select statement saying, now give me the total of this invoice and this and this and this.
So that's the N plus one query problem.
Actually, it should be called one plus N, because it starts out with one SQL query, a list of records comes back, and then for every single record you're going back to the database and asking for more details. The reason why this is obviously inefficient is because it's a data-driven problem: the more data you have in the database, the slower it gets. And the reason why it's really bad is because SQL, S-Q-L, the query language, allows us to say, hey, database, give me the total sum of what Andy purchased in the last two years. It's a single query and gives me the same result. The reason, Brian, why we end up with this is because as developers, we tend to use frameworks that shield the complexity of the database. We use Hibernate and Spring.
We model the data through object models.
And then what we do, we say, hey, we need the invoices.
And then we get a collection of objects back.
And then we iterate through the collection of objects in a loop.
And every time I access an object, in the background Hibernate is going off and fetching the next data element.
Now, there's also a reason why this is there, why Hibernate implements it that way or why it gives you the chance because there's a concept of lazy and eager loading.
So maybe I don't know if I need all the data.
Maybe I'll just iterate through the first five.
And then later, I go on with number six to ten.
So that would be like a paging, a pagination thing.
But in most cases, what I see, developers typically don't really know and actually don't even care what these frameworks are doing internally.
Because what they really want to do is they want to build great features on top of these frameworks.
And then they end up with something that actually works pretty well in their local environment.
Because in their local desktops, they don't have any big database.
But if you now push this into your testing environment, and now, Brian, you run your big load tests with a real, hopefully big, database with a lot of data, then you will see these problems.
And the cool thing is you can immediately find these problems by looking at these metrics.
So how many database queries are executed per single request?
Or even better, is the same database statement being called multiple times on a single transaction?
And these are metrics that we as Dynatrace, for instance, spit out automatically.
Once you are in there, in the app, we spit out these metrics.
Same statement being called how many times?
Oh, really?
500 times?
Wow.
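To make the shape concrete, here is a minimal JPA/Hibernate sketch of the pattern; Purchase, LineItem, and the field names are hypothetical stand-ins for Andy's invoice example:

```java
import javax.persistence.*;
import java.math.BigDecimal;
import java.util.List;

@Entity
class LineItem {
    @Id Long id;
    BigDecimal price;
}

@Entity
class Purchase {
    @Id Long id;
    Long customerId;
    @OneToMany // LAZY by default: items load only when first touched
    List<LineItem> items;
}

public class PurchaseTotals {
    // The 1+N shape: one query for the purchase list, then one extra SELECT
    // per purchase as the lazy collection is touched inside the loop.
    // 104 purchases means 105 round trips to the database.
    static BigDecimal totalOnePlusN(EntityManager em, long customerId) {
        List<Purchase> purchases = em.createQuery(
                "select p from Purchase p where p.customerId = :id",
                Purchase.class).setParameter("id", customerId).getResultList();
        BigDecimal sum = BigDecimal.ZERO;
        for (Purchase p : purchases) {
            for (LineItem item : p.items) {   // triggers a SELECT per purchase
                sum = sum.add(item.price);
            }
        }
        return sum;
    }

    // The single-query alternative: let SQL do the aggregation.
    static BigDecimal totalAggregated(EntityManager em, long customerId) {
        return em.createQuery(
                "select sum(i.price) from Purchase p join p.items i "
                        + "where p.customerId = :id",
                BigDecimal.class).setParameter("id", customerId).getSingleResult();
    }
}
```

The second method returns the same total with a single round trip, which is exactly the SQL aggregation Andy describes.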
Yeah. And it also segues to, I guess, a topic we're not really going to delve into, at least not today. But one of the more common pitfalls of performance testing is not testing with proper data. So if, every time you go to run a test,
you set up a whole new batch of fresh users with zero purchases in their history,
you may not see this issue becoming a problem or slowing things down during the course of that release.
But if you have a historical set of users that have been around for a long time,
that have, in your example with the login where someone has 100 purchases or 104 purchases,
you'll be able to see this. So it
really comes down to also using the right data set when you're running those tests.
I know you talk about this a lot, even the same idea with searching. Don't use the same search
query all the time, but don't also have it be 100% random all the time because that's not reality.
So it's that mimicking of reality, which is in some ways, one of the more
difficult arts of setting up a good performance test, because there are just so many variables
you have to put in there. But just wrapping it into that N plus one query issue, or maybe we should start calling it the one plus N query issue: make sure your users aren't fully clean. I know in my performance history, one of the first questions I have always gotten from a developer when I've brought up an issue like this was, oh, well, what's that user? How long has that user been in the system? Their profile might be corrupted from too many releases and changes. And that's a battle where you have to stick to your guns; you've got to look deeper into what the issue is before you just dismiss that user as maybe someone whose profile got corrupted over time.
Yeah, wow.
So I guess that was a lot of Performance 101, hopefully, huh?
Yeah.
For our listeners.
So shall we maybe close by saying, hey, these are the metrics that we want people to look at besides response time and throughput, just to round it up?
Yeah, and I'm sure wherever this is posted, there'll be a comment section. So if you have any questions about some of these metrics Andy's going to be talking about here quickly,
please feel free to put that in the comments.
Or if you have any other of your favorites that you'd like to include as well, let us know.
So my metrics, or I guess both of our metrics,
we mentioned it many, many times.
So obviously we start with response time.
We start with throughput.
We look at failure rates.
That's obviously clear.
But then look at the main or the basic resource consumption metrics
of your tier.
And with every tier, I mean your web servers
and your app servers and your database.
And I'm talking about
CPU utilization.
I talk about memory,
not only total memory,
but Brian, you mentioned it,
breaking the memory for Java
and.NET down
into the individual heap spaces,
looking at garbage collection
and not only how often
the garbage collector kicks in,
but also the impact
of the garbage collector
on the runtime of transactions.
So there's an option to actually see how long the garbage collector suspends the runtime when it's collecting the garbage.
So for major garbage collection runs, that's what happens.
Actually, it stops the world.
Look at storage requirements.
So how often do you write to disk?
We mentioned log files.
How many log statements are written to disk?
I will also propose breaking the log statements up into severity groups,
warning, info, fatal, severe.
Then also look at pools and queues.
And there's a lot of pools and queues out there.
There's thread pools.
There's connection pools.
There is message queues.
And what else do we have here? There are the queues between the incoming requests on the web server, from web server to app server, from app server to the next app server, and from app server to the database.
So these are all things you need to consider.
And there are always JMX metrics, performance counters, SNMP; there are a lot of different ways you can capture these metrics.
And the last thing we touched upon was the database activity itself.
So how many database statements are executed per single transaction?
Are the same database statements executed
per single transaction?
And actually, if the same database statement,
a single one that is returning some static data
is constantly executed,
then this might also be a good candidate for caching.
But this opens a whole other door, probably for a follow-up discussion, on how we decide which data that we normally pull from the database should be cached on different application layers.
But yeah, I think if you do all that, if you look at the page weight of your site, if you at least make yourself familiar with the basics of web performance optimization and what a good page should look like, and then read up on the architecture, the components, the types of servers you have in your environment. If you read up a little bit on Java, .NET, PHP, Node, all of these technologies, then I'm sure you will soon find a lot of people out there that have a lot of additional metrics to add.
And if you don't have the tools yet, look into APM tools,
so Application Performance Monitoring, Application Performance Management tools.
And if you don't know which tool to choose, I mean, the good thing is all the APM tool vendors nowadays
offer free trial versions, so you can register for free.
You get a free version, you can test it, and then pick the tool that makes most sense for you, I would say.
And if you want to look at the Dynatrace tool, then just search for Dynatrace free trial,
Google or Bing it, whatever you want.
I'm sure you'll find it.
Bing it.
Yeah.
You like Bing more than Google?
I can't remember the last time I used Bing.
Okay.
But I can say that I like Google.
It seems to follow me around everywhere.
Yeah.
Did I miss anything, Brian?
Well, the only one which is, I guess, another kind of a deeper topic is, again, if your APM tool allows it,
looking at the time spent in your different layers of the application, right?
So which APIs are you spending your time in and how much time you're spending in each API.
And I think that's something we'll probably wrap into a lot of other discussions in the future. But that's, you know, how much time is being spent in your JDBC code execution, or your own internal code versus Java native code, or things like that.
The famous Dynatrace layer breakdown chart, huh?
Yeah, it's one of my favorites.
Yeah, me too, actually.
All right, well, I guess that's it for this episode. Thank everyone for listening. Any final thoughts?
Well, final thoughts are, I think we are on the brink of a change in our industry. I mean, it sounds dramatic, and I think it is dramatic.
Because what we'll see is that we have to build software that is not only performing well but that scales well. And we know that if you are fortunate enough to work for a company that had a great idea and built software for it, and then it gets promoted through different channels, maybe somebody accidentally picks it up and tweets it or puts it on a television show, then the best software doesn't help you if it doesn't scale and perform. So I think that's also why, as performance engineers, you need to look at these scalability metrics, and scalability really comes down to how these different components communicate with each other.
And if you're just doing your job as you've done it the last 10, 15, 20 years where we just executed load tests and then generated a PDF report,
I think that doesn't cut it anymore, to be honest with you.
So I think we all need to level up.
And we need to, as bad as it sounds, do more with less. That's one of these terms, right?
And we need to automate a lot.
So you also have to think a lot about how you can automate the test execution into your continuous delivery cycle,
how you can automatically trigger these tests, how you can automatically capture these metrics and analyze them.
The perfect thing would be, that's also what we at Dynatrace do,
we baseline these metrics from build to build and from test to test and then automatically tell you, hey, based on the current results, something has changed significantly.
Now the number of database calls goes up like crazy, or you're running out of connections in the pool.
So you may want to stop that build because it's not going to be nice when you deploy it.
Okay.
Excellent.
All right.
Well, hopefully you'll all be back for our next podcast.
And what is the topic for the next one?
Well, that's actually also a good point: maybe you want to add comments on topics that you want to hear about. But what we have on the list is: what is a load test versus a performance test versus a stress test?
I mean there's different terminology around.
And Brian and I, we both have worked many, many years in load testing.
So I guess maybe we have a different definition even because we came up with our own definition.
So we'll see.
We want to enlighten the people on our definition of what is a load test, a performance test, a stress test.
And we'll talk about this and how you can use them and where they should be used.
Great. Excellent. And as Andy said, if you have any ideas or topics you'd like to hear us discuss, please add those into the comments as well.
I'm sure soon enough we'll have a way that you can send an email or something else as well.
But right now we don't know how that's all going to be set up.
So thank you all for listening and letting us take this approximately 50 minutes of your day.
We hope you enjoyed it.
And I guess we'll see you next time.
Thank you. Bye-bye.
Bye.