PurePerformance - Good Performance Engineers Look Behind the Percent Usage Metrics

Episode Date: April 29, 2019

Have you ever used USE? Have you ever wondered what differentiates a performance tester from a performance engineer? Want to know how to automate performance engineering into DevOps pipelines? Twan Koot, Performance Engineer at Sogeti, is answering all these questions. We met him at the last Neotys PAC event, where he gave an in-depth look at metrics and enlightened us all with USE (a method from Netflix's Brendan Gregg). In our conversation we explain what USE really is, how to apply it, and how a good performance engineer needs to understand more than just response time!

Links:
Twan on LinkedIn - https://www.linkedin.com/in/twan-koot-a813a8b7/
Twan's deck from Neotys PAC - https://www.neotys.com/performance-advisory-council/twan-koot
Twan's video at Neotys PAC - https://www.youtube.com/watch?v=hV8wpkDUtys
Brendan Gregg's home page - http://www.brendangregg.com/
eBPF - https://prototype-kernel.readthedocs.io/en/latest/bpf/
BCC - https://iovisor.github.io/bcc/

Transcript
Starting point is 00:00:00 It's time for Pure Performance. Get your stopwatches ready. It's time for the Pure Performance podcast. And normally the introduction is done by my lovely co-host Brian Wilson, but I don't think Brian is here today. Brian, can you hear me? Probably not, because he's in the mountains today skiing, enjoying the end of the ski season on the West Coast. He lives in Denver, for those people that have not figured that out yet, that are listening to our shows, and he decided to take a couple of days off and enjoy the slopes. But anyway, I think I should hopefully be doing a good show today, and I want to go
Starting point is 00:01:00 actually right into the topic, or before going into the topic, actually introducing my guest today. Twan, are you there with me? Yeah, I'm here. Hi. Hi. So, Twan, the two of us, we met a couple of weeks ago, actually, in another ski resort. We met in the French Alps when our friends from Neotys did their Performance Advisory Council, the PAC event, at the top of the mountains in Chamonix, and I believe it was the first time we ever met, right? Yeah, it was the first time we met. So it's a relatable subject Brian is doing now, because we did some skiing there. Exactly. I don't remember, which ski group did you participate in? Was it intermediate, advanced, beginners?
Starting point is 00:01:46 I was in the beginner ones. How did you enjoy your lessons? Yeah, it was really good. Too bad it was a short ski trip, to say it that way. But I really enjoyed it. And hopefully next year they do the same location, because I really liked it. And hopefully next year they, they had to do the same location because I really liked it. Yeah, it was fantastic.
Starting point is 00:02:09 And you're right. I think we only had like two hours for scheme because the rest of the time we actually had to talk about performance relevant topics. And this is also the reason why I brought you on the podcast, because I was really intrigued by your presentation. I think it was called beyond the Percentage Usage, an in-depth look into monitoring. And I really liked a lot of the concepts you explained,
Starting point is 00:02:33 you brought to the table. I think a lot of the folks that sat in the session, it was an eye-opener for many or for some, maybe it's just a reminder, but I think for many it was a great lesson. So, Twan, before we go into the topic topic and i want you to explain some of the things uh can you give us a little background first on on your person on your what you do right now and why performance is so important to you yeah uh i started out working in the it for five years ago a small it firm and then i switched to Sogeti,
Starting point is 00:03:05 which is my current employer, and started doing performance testing after half a year there. So now for almost two and a half years doing performance testing, and I think it's the best specialism around because it's a really broad skill set you need. So that's why I had a presentation about the in-depth looked into monitoring. But it's also on processes, DevOps, Agile. So I really enjoy that.
Starting point is 00:03:34 And currently just working at multiple clients, doing performance tests, but also helping them build DevOps pipelines, all sorts of things. Yeah, cool. So maybe we want to talk about this as well, because this is one of the topics I love to talk about a lot, DevOps and how performance fits into the pipeline.
Starting point is 00:03:55 But let's start with your presentation, beyond the percentage usage. And I believe you did, I mean, we have to give credit to a couple of people here and especially I think Brandon Gregg that you referenced quite well to where you got inspired, I guess,
Starting point is 00:04:15 for the introduction of USE. Can you explain a little bit what that is all about? Yeah, so I started doing performance testing, and of course you need to analyze data, monitoring data specifically. So I started looking online, asking colleagues, okay, can you give me some reference online or some documentation which I can follow to learn the trick?
Starting point is 00:04:41 There wasn't any, or there isn't much online. So I stumbled upon the the brennan greg site first i need to say i stumbled upon a conference uh from him which was posted on youtube so i watched it and there he talked about use so i looked it up online and found his site and yeah it was i was sold the first moment I heard and read about it. I think it helped me a lot analyzing better the results I get from my tests. And also I use it to train new performance testers, engineers on how to analyze monitoring data using use because it's really simple but in the meantime it's also in my opinion it goes really in
Starting point is 00:05:33 depth if you use it the right way. So can you can you quickly tell us what does use stand for? Yes so it-E. Yes, so it stands for utilization, saturation, and errors. The first we all know, one of the most looked after metrics is utilization of CPU, utilization of memory, maybe even network or IO. The second one is not that often I see that in reports, but it's saturation. So looking at how overloaded is the system. So how much of the work isn't being able, isn't serviced at the moment, so it's being queued. And lastly, E for errors, it's are there any hardware errors or other errors in the system
Starting point is 00:06:23 which may cause performance degradation? So that means the use is really, and I think this was the interesting moment for me, instead of looking at a single metric, as you said, like CPU usage, or I think the other example that you brought in, you had many like disk usage. The metric alone doesn't tell you anything because an 80 cpu utilization can be good or bad versus also a 10 cpu utilization could also be bad because you
Starting point is 00:06:54 always need to see it in context of the other two and i believe the example that you brought now i wrote it down here uh he said cpu usage right? That would be the U, the utilization. Then the saturation can be monitored by looking at things like the run queue or the disk queue when we talk about disk monitoring. And then the error will be something like, I think on CPU, it's like a run queue latency or on disk, it was IO latency, or any errors, basically, that are happening. So I think that was really great for me to be reminded that there's not many metrics where just looking at a single utilization matrix actually makes sense. You always have to look at it in the combination of usage of
Starting point is 00:07:46 saturation and of errors yeah yeah of course that's really for me just to say because for me a hundred percent utilization it can be a great thing because you're now utilizing all those um hardware components you have in a in a server and maybe you have a cloud solution, you use what you pay for. So that's really good. But then again, if you use 10% of your CPU and you maybe look at a saturation metric, for instance, run queue, and you see it's really high,
Starting point is 00:08:20 that means there is a serious problem because the CPU isn't doing anything, but all the instructions to the CPU are being viewed. So there's a bottleneck somewhere in the system, and you wouldn't have noticed it if you were only looking at CPU utilizations. And the same goes for disks or network or memory. So what you were saying is true. It's the combination of all those
Starting point is 00:08:47 types of metrics, so utilization, saturation and errors, which gives the big picture, which you need to find out. So purely looking at one metric isn't enough. And now in your work, you said you're working with a lot of clients on individual projects. So that means you are obviously faced with all different types. I would assume with every engagement, you have a different environment you're testing against. So with different environment, I don't just mean the application, but a different stack, I would assume. So it's very different every time. Yeah, it's really different. It varies from cloud solutions to on-premises hardware
Starting point is 00:09:27 to a private cloud solution, which range from IEX machines from IBM to the newest Intel machines. So every client has a different stack. But then again, if you use, and it's a bit of a tongue-breaker, but you can apply it on any stack, on any OS, on any metric. I was just wondering then. That was kind of the point that I wanted to make.
Starting point is 00:09:56 So you can apply it, but do you always get all these metrics? I mean, for, let's say, for CPUpu for memory uh for disk and for network does every every every stack every os every cloud vendor do they give you these metrics are they available or how do you deal with that oh yeah it also depends on the stack but for let's say for most on-premises solutions you have the metrics some of the times errors is a bit hard because it's a you're mostly looking for hardware errors and those aren't mostly aren't best true to let's say a VM or because that it's handled at the infrastructure department. For cloud solutions, it's really hard, because you have no idea how busy the host is,
Starting point is 00:10:50 how big the fight from the host to the VM is. So you do your best with the metrics you have, but then again, it still gives a good picture if you look at saturation, because if you see a big process queue being built up, you know, there's something maybe happening on the on the host part so you can Contact your cloud provider say hey, I see a lot of viewing the CPU isn't doing anything There's no memory bottleneck or IO bottleneck. What's happening? Can you tell me so it's also
Starting point is 00:11:24 gives you some handles to maybe challenge your cloud provider. And so that's a good point. So that means you can actually go back to the infrastructure team and whether this is your own team because you run on-premises or whether it's the cloud vendor. Now, I assume you've done this in the past. And are they receptive? Can you go to an Amazon or to a google or microsoft and say hey i i see very strange behavior here and i i assume it is the underlying hardware can you look into this are they willing
Starting point is 00:11:56 to do this or what is their typical answer uh for me it was always no. So, yeah, maybe the clients were too small at Amazon, but they're not willing to give up any information. It's the same if you look at, it's a bit of a side note, but if you look at CPUs and now the new generations have turbo frequencies, that's going to impact a lot on performance, but they won't say which chips they're using. So you have no idea if you have any issues on the platform itself. So that's really hard, opening up those metrics on, let's say, Azure or AWS.
Starting point is 00:12:43 Yeah. Even though I would, I mean, maybe this is a shout out that we can do to these cloud vendors to open up these metrics, because if they do, I mean, we can better utilize their resources. And I think if one starts to do it, it could be a nice competitive advantage versus the other.
Starting point is 00:13:01 And then hopefully the others will catch up and say, well, we are giving you all the metrics that you need so you can make better decisions and you can better optimize the code that runs on our infrastructure. I think so. Maybe a little shout out to the cloud vendors of the world out there. Yeah, that would be great if we have more metrics. Again, more information can lead to better analysis
Starting point is 00:13:27 and, in the end, better code, better decision-making on architectures. So it will help a lot if they open up. So Google, if you're listening. Exactly. Google, Amazon, and Microsoft, if you're listening, give us these metrics. And also listen to, if you're listening, give us these metrics. And also listen to if people ask you, like Twan,
Starting point is 00:13:49 and even though some of this project might be small, you should still not just give a no as an answer. That would always be good. The other thing that I remember from the presentation, you talked about eBPF and BCC. Can you see what I'm talking about and give us a little insight? Yeah so we talked you just talked about a saturation metric for CPU that's a run-Q.
Starting point is 00:14:17 Then a run-Q is a number of how many items are in a queue for the CPU, but it's hard to give a real value to this metric. And then comes eBPF. It's based on some really cool features in the Linux virtual machine, which allows you to get your hands on new metrics, which they were available before, but were really hard to measure. For instance, you can measure the schedule latency. So it's possible to have the metric on how big is your biggest queue, but also what often what of an impact has a queue of let's say 10 items, you can
Starting point is 00:15:13 now measure with EBPF and BCC how much milliseconds this queue was. So this is really helpful because now you can say okay i have a queue of 10 but it doesn't give a big latency because it's it's in the nanoseconds but you can also measure uh schedule the latency of let's say 45 seconds you know now i have a big problem so because the cpu instructions are waiting for 45 seconds before they even are processed on the CPU. So it opens up a really cool new pool of metrics, which can be used. And also in combination with use, it gives a good picture on how the system is performing. Very cool. Very cool.
Starting point is 00:16:02 Hey, come in. So we will also put some links into the description of the podcast. So we talked about eBPF and BCC. And in the beginning, we talked about the use method that was made popular from Brandon Gregg. And the website is Brandon Gregg, Gregg with two Gs at the end, dot com. And there's a lot of useful information there. Now, Twan, I have another question coming actually back to a scenario.
Starting point is 00:16:32 When you use use, as you said, it's a little tongue-breaker, but if you use use with your project and you figure out, hey, there's something wrong with CPU or disco memory? And what are, I mean, one potential root cause could obviously be the underlying hardware, but have you seen cases where it was actually not the underlying hardware, but it was actually maybe the stuff that ran on these machines, maybe too many processes or bad coding, bad applications. Are there problem patterns that you have seen that were causing, let's say, a low CPU, but a large run queue?
Starting point is 00:17:12 Is this something like this that can also be impacted by applications? Have you seen this before? Yes, yes. It's lots of, I have many examples, but one of the things I saw was there was a really low CPU utilization, but the system really hangs, so long response times during the performance test.
Starting point is 00:17:34 So then for me, I use views, and then I go down all these metrics. So I started the utilization and see low utilization, and I came to the saturation so I pulled the run queue metric and I saw a really big queue and I was like okay what is happening so I started to pick another group of metrics I go to IO and they are so also a big queue so So for me, then I know, okay, maybe the run queue on the CPU is caused by slow disk. So I was looking at the amount of reads, writes to the disk that weren't that high. So then I used a tool in the BCC collection, which shows all the reads and writes to the file system. And also in particular, which files are being read or write or read during the test.
Starting point is 00:18:31 So then I saw a config file being hammered every time a request came in. So I asked the developer, okay, what are you doing with this config file? And he said, yeah, every time there is a request, we pick the latest config and we make the connection to the database. So then putting all those pieces together, it was very clear to me that each time I send a request to the server, the specific config file was being picked up by the application and that caused a lot of io latency because we were hitting it with like two to three thousand requests each second so the file system couldn't get up keep up and the only way for me to really get that information
Starting point is 00:19:20 really quick was by using bcc tooling and it it was in a matter of minutes I had that metric up, and I saw the huge amount of reach and writes to that particular file. And it was a quick solution after that, but it really helped me in a case where the application was the problem during an analysis. And I think it's a problem that I've seen in the past where whether it is on purpose done by the developers
Starting point is 00:19:51 or if it is done by the frameworks that they're using. Typically, I think you mentioned that they were making database connections. Very often these things are completely hidden away because you're using frameworks that are abstracting the data access layer from you. But then these frameworks are obviously internally, you know, checking the config files. And then typically there are settings where you can say, well, check the config file
Starting point is 00:20:17 with every access or cache it and then only refresh it every minute or something like that. And it seems in your case, you had to see an area where every single request was rereading that config file. And that obviously with increasing load puts load on disk, on IO, things are queuing up and then you'll never be able to utilize the CPU. And obviously in the end, you will see huge response times on a system that seems completely underutilized from a CPU perspective because you're completely slowed down because of a disk queue problem.
Starting point is 00:20:54 Besides CPU, disk, memory and network, what other, do you have other examples on how to use use? Any other examples on how to use use? Any other examples? Yeah, I usually do it for hardware metrics. Lately, I started using it also for performance metrics, so how many hits per second a system can handle or how big the response time is then. But then again, you can also apply it to, let's say, a factory utilization.
Starting point is 00:21:40 You can check how many workers are busy doing the work. But if you have too many factory workers on a machine and the machine is clogged up, you are still saturated, but not at the utilization part because all those workers are standing still. And especially maybe the first example that you mentioned, we in the performance world, when we are testing, let's say, microservices or any type of component that is handling that request that we are executing
Starting point is 00:22:14 from a load testing tool. So we would obviously measure throughput as a way, but then also, as you said, queue length on, let's say, an application server or on a web server. So how many requests are actually queuing up? And then the failure rate. I think that's also another thing.
Starting point is 00:22:33 You can do this on an application level by looking at the HTTP responses. So you can figure out, okay, so what's the throughput? Do we see things queuing up on the application server? And what's the failure rate? What's the HTTP status codes that are coming back? And I think with this, you often probably find out that the system is currently pounded with too many requests. Now they start also failing and things are queuing up.
Starting point is 00:23:02 And I think these are classical metrics that I would look at on an application server, even web server. I also like the example that you brought with the workers, but it comes back to the same thing. As you said, you can apply use to really much everything. And I think that's the thing that I wanted to make sure that our listeners also understand that if you have a single metric try to figure out what other metrics on the use pattern so utilization saturation and error you can pull in to get a better picture because a single metric it doesn't
Starting point is 00:23:40 always give you the the full story i think that's what it is right yeah that's that's really the message uh to bring across when using uh use it's it's the same every component every line of code every piece of middleware it all together makes the the application so it's the same for metrics you can't uh do analysis based on like four metrics you need the whole picture to get uh to to to see what's really happening on the on the system so that yeah and using a methodology like use helps you to uh to get these metrics but because if you go just look at different graphs, you're not doing it consistently, but also you're running like a headless chicken, just plowing away through data. Yeah.
Starting point is 00:24:39 One thing that we started doing at Dynatrace is when we detect, let's say, a slowdown of a front-end service that you are, for instance, testing, then what we are doing, we're actually walking through the whole dependency tree. So we're looking at every metric from that service that you're testing, from the underlying process, from the underlying host, but then also all the dependent services and the dependent processes. And then we look for anomalies. But we do this on every single metric.
Starting point is 00:25:09 So we figure out which metric is behaving abnormally to, as compared with the time before the problem happened. I believe what we should start thinking of is then clustering these metrics, the three metrics into a use metric. I need to talk with our team if we are already doing this or if we should do it, because then you can make immediately better recommendations to the people that look at our monitoring data and say, hey, we see that the CPU is abnormally low, but it doesn't mean that you're underutilizing it. The problem is that you have a very high queue length and these two are obviously correlated.
Starting point is 00:25:54 And therefore you should look into what is consuming all of these, you know, why are things queuing up? So I think that's something I want to propose to our team. Yeah, it's really cool because most dynasties know for clients building great dashboards and they always have CPU utilization and if you have
Starting point is 00:26:17 maybe some alerts being you have a setting which alerts you when the CPU is really high. But you don't most of the time have an alert when the CPU is really low. But when the CPU is really low, you're going to have big problems on other parts of the system. So what you're saying is really cool, like combining metrics
Starting point is 00:26:40 into one use metric and measuring that and if there's a normally almost any part of the use metric you you can alert the right guys and give them the right feedback they need to address problems yeah exactly cool hey uh twan is there anything else uh looking back at your presentation is there anything else, looking back at your presentation, is there anything else we should cover? We have the really cool stuff we talked about, use eBPF, PCC. One of the things I also wanted to give across with my presentation is start digging into metrics and start learning more about how do these components work and how are they related to each other? Because I find it really fun, but also I think many performance testers and engineers aren't doing it that much. They're more keen on building tests rather than digging deep into metrics. But I find we need to do it more.
Starting point is 00:27:55 Yeah, and especially, as you said, I think this also differentiates you from a regular performance engineer. Let's say this differentiates people from being a performance engineer or let's say this this differentiates people from being a performance tester and a performance engineer because a performance engineer understands how the individual components that make up the application service that they're testing how they interact with each other how they're cross impacting each other and from a component perspective as you say we could talk you say, we could talk about the service, we could talk about the process, we could talk about internals of the runtime.
Starting point is 00:28:29 A classical example that just comes to mind is what's the impact of high garbage collection to CPU or high memory churn rates to the garbage collection? And then if you have high garbage collection, how does that impact CPU? So there's a lot of things that are connected obviously and you're completely right if you are if you're just creating tests and then just you know say hey something is slower and here's the reports then whoever needs to look at the reports needs to do all the heavy lifting and and i think this is not what we should do i think we should think about how can we provide better analytics on top of all the data that we have so that the people that need to look into this can much faster figure out what is the root cause and then address the root cause. we're moving towards a, or we're moving from performance testing to really performance engineering and to something that I would love to call
Starting point is 00:29:28 performance engineering as a service where eventually, Twan, right, you also want to automate as much as possible your work right now of digging through all the metrics and then maybe just give the developers or
Starting point is 00:29:45 architects at the end of your test, a link where you're, you maybe you build some analytics engine that automatically analyzed all the metrics based on your approaches. And then they can just consume that, that data and whether it's something that you build, whether you're using tools that can do this for you, whether it's Dynatrace
Starting point is 00:30:08 or Neotis or any other tools, I think this is where we need to go to. In order to get there, we need to better understand the relationships between the metrics, how systems actually are architectured, how they're talking with each other, how they can influence each
Starting point is 00:30:24 other. And I think this is separating the regular performance testers to the performance engineers. Yeah, I think that's really true because most of the time I'm now at a client, I'm not even performance testing anymore. I'm just building an infrastructure to what you were saying, a fully automated pipeline, and analyzing data is a big part of it. So you really need to know on which metrics you can measure and how you need to measure them. So applying use, for my instance, is really good
Starting point is 00:31:02 because when you are automating your analysis you need to you need to make sure you really make sure you do the right thing. Yeah and if you use Dynat isn't being used to its full potential. And that's really a loss of money and even can be dangerous when you solely rely on one metric for your performance test or your pipeline. Yeah, I think this could be another, this could be the line title of a blog post. You know, how to be dangerous.
Starting point is 00:31:51 It's dangerous to only look at a single metric. Yeah. Don't get fooled by response time as your sole metric or don't get fooled by CPU utilization. That's a cool one. Hey, let's quickly, before we stop, talk about DevOps and pipelines. Because this is a topic that is dear to my heart on how we can integrate monitoring data in the delivery pipelines. So you said when you go into customers right now, you're often building automation and pipelines.
Starting point is 00:32:24 Can you tell me a little bit more about how they look like how that looks like what you really do there yes so currently I'm building a future architecture they want to know how can we be more agile and how can we move to DevOps for performance testing. Most customers or most clients in the Netherlands, they are still a bit behind. So they still use those centralized teams who do the performance testing. But that's not a viable option when you're devops or even applying scrum so then you need to automate the performance test but also automate a big part of the analysis because you if you have like 50 scrum teams you can't have 50 people analyzing every uh pipeline one because that's that's crazy so what i do at this client i'm building a fully containerized platform where
Starting point is 00:33:29 the test is automated pull the monitoring data from an APM tool or push the performance data to the APM tool and use that to do a quick analysis on certain metrics we find important for that specific client and then give a status back to jenkins to either hold or discard the pipeline or the build so that's yeah and the tricky part is uh how you are how are you going to automate the analysis that's I'm really struggling or not struggling I'm really finding out really new things every day so that's really cool very cool and that I mean and this is something a trend that I've seen and I think this is where what I mentioned the other with
Starting point is 00:34:41 performance is a self-service because what you're building allows everyone to simply make a code change and then minutes later, hopefully, the automated Jenkins pipeline or whatever CI tool you're using will test your code or deploy and test your code and give you feedback on key metrics. And I think the biggest thing here is, has something changed to the previous build? and test your code and give you feedback on key metrics. And I think the biggest thing here is,
Starting point is 00:35:09 has something changed to the previous build? Do we have a performance or what I also always call an architectural regression to the previous code change? Like, do we make 50% more times access to that config file? That could be an interesting metric. Maybe somebody forgot. Yeah, and it doesn't always have to be a response time more times access to that config file, right? That could be an interesting metric. Maybe somebody forgot, yeah, and it doesn't always have to be a response time or memory consumption, but just the way your services and applications
Starting point is 00:35:32 are interacting with the other services and resources around them. And having this fully automated really allows you to then scale this because as you said, centralized center of excellence teams cannot cope with this anymore and therefore you need to automate as much as possible yeah and and the first thing you see is it's it's they integrate the the tests in in the pipeline
Starting point is 00:35:57 and that's uh that's really good because that enables you to run tests without ever touching a server or pressing play. But then you still need to analyze the results. And that's mostly a big chunk of the work when you do a performance test. So automating these steps, they really need to be done if you want to go have a good devops environment so it's it's i'm still looking for good examples on uh i know you did a great uh was it on perform you you and uh hendrick gave a a demo about this so yeah that was really cool. Yeah, we did some work on how we can integrate NeoLoad with Jenkins
Starting point is 00:36:50 and with Dynatrace to fully automatically execute tests that are triggered by code change and then pull back data into Jenkins. So just the same thing that you just explained.
Starting point is 00:37:04 And what we're also doing now, and maybe a little plug here to some upcoming podcasts that we have an open source framework now that is called Captain that we're building. So it's open source. And what it really does, it tries to automate all the plumbing between your CI, so your build, your deploy, your testing, and also your validation. So if you want to have a look at it, captain.sh, and captain is spelled K-E-P-T-N.sh. That's the website.
Starting point is 00:37:41 And the idea here is we want to make sure that people who are building microservices, that they don't need to worry about building large pipelines anymore that are doing the build, the deploy, the test, the validate, and then deploy it to the next stage. This is something we can also automate. And so the CAPTN, the framework is really all about automating quality gates and also on the on the production side because it does it doesn't stop there in the production side it automatically supports deployment models like blue green or canaries and then also supports things like auto remediation and uh yeah and also you know Hendrik from Neoload he is he's also crucial part to that project because he's contributing
Starting point is 00:38:31 extensions from the open source project captain to their tools so it's all good stuff yeah yeah so yeah and when you're saying it's also out of remedy, it's a cool new feature which is popping up more and more. And then again, the same applies for those rule sets which are being used for out of remedy. When are you going to do certain steps? So also looking in the right metrics to trigger a remedy. Yeah, exactly. And then the use metrics can obviously come into help because if you see response time going up
Starting point is 00:39:14 and you see still like low CPU, then what do you do? But if you then look at the other metrics, you see, oh my God, there's something going on here, the queue length. Oh, now I know where I need to what my problem is and this is what I need to address. Yeah, yeah.
Starting point is 00:39:33 There are examples when you spin up a new node which only increases the problem, so it's maybe better to sometimes decrease certain parts of the application to
Starting point is 00:39:46 dam in the amount of calls coming into a server. Again, using the right metrics is key to also automate your pipelines or
Starting point is 00:40:02 performance tests, but also in production. Cool. Hey, Twan, I think you want to wrap it up here. Thank you so much for being on the call. I typically do a quick summary at the end. Brian, who hopefully is still on the slopes and enjoying the snow in the Rockies. He's typically then saying, shall we summon the summer raider?
Starting point is 00:40:28 And that's what I'm doing right now. I summon myself. So I think what I've learned is that every seasoned performance engineer differentiates itself from a regular performance tester by understanding all the metrics that they need to look at, how the metrics interact with each other. And therefore, you really set yourself apart from the classical performance testing we used to do and what performance engineering has to be. And the next step will be to automate a lot of the analysis, bake it into pipelines, so that large teams, large organizations with many, many
Starting point is 00:41:06 development and scrum teams can also leverage it fully automated like you do with the work you're doing right now with this customer where you build the DevOps pipelines. There's more information to be found on use. So use stands for usage, saturation and error online. So you can check out the work from Brendan Gregg. He is the guy who inspired also Twan here to look into use. And, you know, check it out. Also check out the link to Twan's presentation at Neotis Pack.
Starting point is 00:41:42 A big shout out also to Neotis for constantly and repetitively getting us together as the Performance Advisory Council. Now, I also know that they are hosting WAPR this year. So in case people, I think it's going to be in Marseille in the end of May. And it's another great workshop around performance engineering. And I think they still have their call for papers open so WOPR, W-O-P-R, check it out in case you want to go to the south of France in May. All right, Twan, any final words on your end? I'd really like to thank Neolis for having me at back it opened it introduced me to to you and i asked
Starting point is 00:42:27 and we have this talk today i really enjoyed it and again to everybody who's listening just start start digging up more metrics i won't spoil my presentation if you haven't seen it but I have a cool Star Wars reference based on monitoring yeah cool alright well then van hope to see you soon
Starting point is 00:42:52 in person and if there's anything new coming up any new lessons learned just reach out to us and we'll get you on air again on here again.
