PurePerformance - Extreme load testing with 2Mio Virtual Users: Lessons learned with Joerek van Gaalen

Episode Date: May 25, 2020

How do you prepare for a 2Mio concurrent user load that lasts for 7 seconds? What does the load infrastructure look like? How do you optimize your scripts? How do you deal with DNS or CDNs? In this episode we hear from Joerek van Gaalen, who has done these types of tests. He shares his experiences and approaches to running these "special event extreme load tests". If you want to learn more, make sure to check out his presentation and read his blog post from Neotys PAC 2020.
https://www.linkedin.com/in/joerekvangaalen/
https://www.neotys.com/performance-advisory-council/joerek_van_gaalen

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's another episode of Pure Performance. My name is Brian Wilson, and as always I have with me my co-host Andy Grabner. Andy, how are you doing today? I'm pretty good. How often has it happened that you introduce me, saying I'm here, and I wasn't actually here? Has it ever happened? You always say: hey, and with me my co-host, and then you don't even know I'm here. Of course I know you're there. Why would... I mean, sure, you know, this is the kind of argument my daughter makes with me: anything can happen, you could
Starting point is 00:01:05 suddenly disperse or disintegrate, even though I was just talking to you. The next second you might no longer exist; there might have been a blink or a Thanos event or something. Yes, that could happen. But if it did happen, I think I would have a little bit more to be worried about than the fact that I went to introduce you and you suddenly weren't there, you know? So I take it for granted. You're right. You know, I should never take anything for granted. I'm just so happy that you're always there for me, Andy.
Starting point is 00:01:33 You're always there supporting me. I'm happy, too, because without you, this would not be recorded and wouldn't make it out to the World Wide Web so people can listen to it. Well, you would record it on regular Zoom worldwide web so people can listen to it. Well, you would record it on regular Zoom and it would sound terrible, but yes. Hey, speaking of which, things that people can listen to. So we have a new guest, a first timer. Yay. Yay.
Starting point is 00:02:04 And I want to introduce him. Well, I think the best is for him to introduce himself, but I just want to give some background on how we know him. I was lucky enough to be invited for a new test pack earlier this year in Santorini where a couple of performance engineers got together and talked about performance engineering stuff and uh yurik he was there and he talked about some really cool stories from his large-scale load testing projects that he has done and it was so fascinating all the lessons learned about his major major projects that we said,
Starting point is 00:02:45 let's get him on the podcast and the whole world, all of our listeners should hear what to watch out for in case you run those really large scale load tests. I think it was actually extreme load tests, right? Extreme load tests, exactly. So, Eric, do us a favor. First of all, thanks for joining. Introduce yourself, who you are, what you're doing, what your background is, and then let's jump into the topic.
Starting point is 00:03:20 Yeah, thank you. And thanks for the invite, by the way, for this podcast with you. So, it's been a pleasure um well i am uh i'm a self-employed uh performance tester and i've always been a performance engineer for the last 15 years so uh yeah always been around in the performance part of the world actually and yeah actually since I started with performance testing I've always done short term projects most of the time so I've seen a lot of different smaller projects so I think I've done more than a hundred different projects which is cool and the last of them
Starting point is 00:04:12 I've done a lot of large scale performance so yeah that was also the talk of me in Santorini yeah and actually I remember now looking at the page on the Neotis website, the Performance Advisory Council.
Starting point is 00:04:30 The both of us, we were not only in Santorini, we were also at the inauguration event back in 2017 in Edinburgh, correct? In the castle. Yeah, that was cool, in the castle. Castle, yeah. I got to admit, can you quickly remind me the topic that you talked about back then um cool yeah that was more or less the topic back then in uh in the castle where we are in uh kind of guarantee guarantee basically yeah yeah but um the topic i have to remember what it was yeah the topic was about
Starting point is 00:05:11 uh the changing role of the performance engineer so usually the the role was always like, hey, you're a performance tester, so you do your old-fashioned way of performance test. And I was talking about not only the shift left, but also the shift right. So there was a lot of things as a performance tester, performance engineer that you can do. So not only testing, but also being part of production, monitoring, monitoring in production looking into that doing the capacity management also helping the developers so on the on the left part of the pipeline helping the developers with their performance challenges etc so a lot of more stuff that you can do as a performance engineer so i want to shine a light on everything that
Starting point is 00:06:15 you can do and being more like a performance specialist rather than an engineer now coming back to the the by the way, the presentations are all online. So in case you want to look at it, check it out, go to the NIRTES website, go to the Performance Advisory Council, find Jurek's picture, click on it, and then you get all the presentations. Now, coming to the large-scale extreme performance tests that you've been running recently, I remember that the one example where you were running a 2 million virtual user test
Starting point is 00:06:55 or concurrent user test for a TV show, right? Can you give us a little bit more details? What? Yeah, yeah, yeah. That's correct. Yeah, yeah. what yeah yeah that's correct yeah yeah i've done a two million concurrent virtual user performance test and it was for a tv show and actually the project management asked me to talk about some performance tests that they wanted for their for their project
Starting point is 00:07:23 and they started yeah we would like you to do we would like to do a performance test that they wanted for their project and they started yeah we would like you to do a performance test that you do it and we expect 2 million users so we should be able to see we should be able to verify that the platform is able
Starting point is 00:07:40 to handle that load so the first thing you think of 2 million users, sure. You're probably talking about registered users or total visits on a day because that's what you see often, right? Talk about the expected loads. They tend to give too high numbers.
Starting point is 00:08:04 But then they were talking about the project and say okay you know it's for a tv show um it's for a live tv show where people should start voting we expect two million votes within seven seconds and that's what the application should be able to handle so yeah it was truly they wanted to see two million concurrent users um voting for people because contestants actually that get on stage and at the second they get on stage people have to vote on if they should pass or not pass so um yeah that was actually a test. It was cool because then we did the test with Neolons and asked them for a license for 2 million users.
Starting point is 00:09:02 And they actually said, hey hey we never have done this we can't even generate the license because the maximum number of users we can select is a million so they actually had to change their back-end systems to be able to generate a license of 2 million users so that was also a fun fact actually yeah you were stretching the boundaries of your load testing vendor that's yeah but what's interesting is they didn't say no they they probably said oh this sounds fun let's do this you know yeah because they they also know that it was it's it's a nice challenge yeah to it. It's a good, yeah. It's cool to say that your tool supported,
Starting point is 00:09:49 but also I thought it's cool to do it. So it was like a challenging project. It's a free load test of the load tool. That really, really was the case because we had a lot of feedback for them to improve. Of course, if you find stuff that you never found, you will never find out. So what's the trickiest thing for such a project? Because obviously this is a testing for a one-time event and it is yeah i mean it's infrastructure for that test just for
Starting point is 00:10:28 you know the time of the project um probably for i don't know how long how long did it take you to kind of set it up to make sure that the test infrastructure is there that they actually figure out what what is the test infrastructure i think that's probably a challenge too. How much test infrastructure load generators do you need and things like that. Yeah, of course, the biggest challenge for such a test is just to be able to generate the loads, right? To have the capacity to generate the loads
Starting point is 00:11:01 and not being the bottleneck yourself. Because that would be kind of ironic, right? If you do a performance test on an application and you are the bottlenecks. So yeah, that was the biggest challenge to have sufficient power. And also what I wanted to do to make the challenge even more challenging is to run the test from a single controller. So I didn't want to duplicate, let's say, eight or ten different controllers and hit the start button all at the same time. And then gather results later on.
Starting point is 00:11:39 I also wanted to be the challenge that was possible from a single controller. And yeah, the test setup was like we had, I used 800 load generators. The load generators were able to simulate up to 3000 concurrent virtual users on the application so we were i was fine with well 2500 on a single load generator okay so um and for that to happen because you need to know first what can a single load generator um handle in the test so you then you know how many load generators you actually need so i first did the stress test uh against the application with just a single load generator and see how far how far i was able to come and that was about three thousand and it's three thousand and things starts to get itchy it starts to slow down response time to get the agent become unresponsive where things happen so I said a safe limit 2500 okay I remember just to jump in here I remember before aynatrace I used to work for Segway
Starting point is 00:13:08 software which later became Boilend we did Silperformer right I was a performer yeah yeah and I remember we had a feature in Silperformer this was like at least 15 20 years ago now, where the controller monitored the health state of the agents and also gave you recommendations. So you ran a script. And then, and I'm sure they have advanced this over the years now. It's been really 15 years since I've been there. And the controller gave an overview of all the agents
Starting point is 00:13:42 and gave recommendations on how many scripts you can safely run, how many scripts of that particular type or how many virtual users for that script you can safely run before the agent gets into a mode where the agent itself is impacting the test results. So I think that's obviously something you have to consider if you're running these large-scale load tests. You don't want your test infrastructure to be the problem. And that could either be the load generating agents, but I guess also the network bandwidth that you have from your load generators to your application that you're testing on. Yeah.
Starting point is 00:14:20 Well, yeah, New Load has some kind of features to look into the current system resources of the load generator, but it doesn't actually give you a lot of advice up front on how it's behaving and how many you would need for the test. But yeah, it would be a great feature, by the way. So what else? So that means if you started with figuring out what's the capacity of a single load agent, then you could do the math. And you said you came up with about 800 load generating agents to cover your 2 million number.
Starting point is 00:15:03 What else is there? I mean, this is obviously some important preparation step. Yeah, there are a lot of preparations actually to do and also a bit of the approach. Because there are some levels actually on the test that you have to kind of tune. Because you can tune your scripts, you can tune the agents,... om te tunen. Je kunt je scripten tunen, je kunt de agenten tunen, je kunt de scenario en de controller tunen. Ik had bijvoorbeeld ook de controller en de scenario settings moeten tunen, want 2 miljoen gebruikers...
Starting point is 00:15:40 ... werken concurrently. And Nilo does a feature to live show the current behavior of the test, the average response time, the transactions per second, et cetera. So it's a lot of live data that's going on. Actually, when I first ran the test, the controller was just so slowed down at nearly a million users. So I need to tune the controller first. Because there was so much network traffic going back live at one time from all the agents towards the controller.
Starting point is 00:16:18 So one thing that I need to tune was to kind of throttle down the network push back from the ages to the controller there less like a feature most tools have such a feature that you can follow the users life so you can see whether it's transactions they are which steps their step they are doing at the moment, you can disable that. And by disabling that, it was a tremendous save on live network traffic. So that was already a big win. And I remember when we had the discussion in Santorini about it, that we said it would be smart for tools to say, hey,
Starting point is 00:17:06 you are running a 2 million user test. You probably don't want to see every single user live. That means you may want to disable this feature by default or at least give recommendations on settings you have right now that are, you know, like give advice on better defaults or settings based on your workload configuration. Yeah. Yeah, so something that was also really important to do, because if something occurs on the platform you're testing,
Starting point is 00:17:38 let's say a service is restarting or whatever, and all the users get a small part of errors at that moment. If you're running with 2 million users or even with 100,000, all the users can get the same error at the same time. There are tons of errors per second you're generating. Every time when that happens, it costs a lot of agents to crash at that moment, and the controller to be just unresponsive. So tuning was necessary on that part too.
Starting point is 00:18:12 Just storing a limited amount of errors per second, and that also helped me to be able to cope with just sudden spikes of errors. Something you will just encounter when you have a heavy test. Yeah. So besides that, besides the load agents, figuring out the capacity, the number of agents you need, optimizing the whole controller and your runtime settings. Was there anything else or any other lessons learned from a testing tool perspective that you can tell us about or always studied?
Starting point is 00:18:55 Well, yeah, I haven't talked about how you can, what you can do on the scripting part because, yeah, you have to tune your scripts too because the scripts are actually the core of your test you know the scripts are run by the agents and there are some stuff that you can do on your scripts to limit the load actually on the on the We kunnen de lading eigenlijk beperken op de agenten. Zoals wat voor een voorbeeld? Ja, zeker. Als je veel scripten hebt, eigenlijk alle performance testing scripten, hebben ze extracties.
Starting point is 00:19:41 Je moet een variabele waarde krijgen en gebruik dat in de volgende bezoek. De meeste tools gebruiken regle expresies, of je kunt regle expresies gebruiken voor dat. Je moet ze gebruiken met zin, want ze kunnen echt CPU consumeren. Dus gebruik niet veel wild cards in them. They can really cripple the performance. Consider proper assertions. Don't use them too many. I'm usually a really big fan of having a lot of assertions all over the place, all over your script and requests
Starting point is 00:20:22 so you can verify a lot so you know when there's an error occurring. But for a heavy test, you might be able to just get a little bit less of the assertions to save some resources too, because they can be really consuming. Right. So other things that you consider is decreasing the think time in your test. I'm never a fan of this, also not for a heavy user test, but decreasing the think time means you will have a smaller session time of a third user. That means you have less concurrent users running. And that also means less memory usage at one time.
Starting point is 00:21:16 You can do that too. It's funny you mentioned that one because I was just thinking myself, aside from that, that there are a lot of considerations that you have to take when running a load test um such as you know there was a question that came up once a long time ago like can you run a heavy load with one user in jmeter the answer is yes no think time right but it's zero concurrency you know and and understanding what kind of test you need to run is a huge part of that,
Starting point is 00:21:46 right? You can say the product managers could come to you and say, we want to have 2 million votes, right? In an hour or in seven minutes. All right. But what's the concurrent, the questions you as a performance engineer, performance tester need to ask are, you know, what's the concurrency we're expecting to have in this? And then when you get to the scale that you're running at, you do then run into the issues of can you actually test that concurrency? Because what's that going to do to the amount of generators that you're going to need? And is the controller going to be able to, you know, if you need 2 million concurrent, that's going to be taking the challenge you had already to a whole different level. Exactly, yeah. And there may still have to be some real world shortcuts you have to take just in order to get some sort of resemblance test done.
Starting point is 00:22:40 But also things you'd have to consider, which you always have to consider in any test, but also just kind of come to a whole different scale at the scale of testing you're thinking about. You know, there's conceptually, do I run a test from an internal generator or external generator? How many locations am I running from? Am I running it from different cloud vendors so that I can make sure that as this goes on? Because in reality, your users are going to be coming in through different angles. It's not all going to be coming in through an AWS DNS or pipeline, I should say. So you really have to start thinking much bigger scale. How do we distribute that load?
Starting point is 00:23:21 Internal, external, you know, what different geolocations is this going to be, you know, from, if this is, you know, if this is like, obviously, you know, a single country show, then you don't have to worry about too much of geolocation. You might want to consider some, but it just takes it, all these decisions become a lot more relevant and a lot more important because when you're, so what were some of the, you know, and you, and I,
Starting point is 00:23:47 sorry, cut off on that one, one piece though, but I think that was a really, really important part that you mentioned about the concurrency. What of those things that would normally be, yes, we need to consider that we need to think about that,
Starting point is 00:24:00 but which, which one of those kind of became really, really strongly focused or really, really high priority, I should say? Yeah, well, first of all, I agree 100% with you, what you mean, what you're saying. And that's actually always the case. I mean, being really as realistic as possible, that's always a good approach for your performance test, I would say. As you said, you can do a lot of transactions and you say, okay, we need 2 million boats in an hour,
Starting point is 00:24:37 and you just generate it with a low amount of virtual users. Yeah, you can probably do that, but you are not doing the right concurrency. And that's what I want to achieve in this test too, because we were doing the web application was about web sockets. So web sockets were involved. And that means that every user sets up a connection and every user has just one simultaneous connection at one time
Starting point is 00:25:08 to the application. And as we all probably know, connections is really maximum connections. It's usually a very common performance issue, some rough threads, some rough connections or whatever, things like that. So when you want to tackle that, you just need to simulate it. So that's why I also always try to have the right number of concurrent users as you can expect in a peak moment in production. Does it answer the question a bit?
Starting point is 00:25:44 Because I thought the question was bit because i thought the question was a lot of different along the way yeah no no it is it really really was more of a discussion or just me thinking aloud i guess about some of the things you would normally consider during a low test yeah which seems like okay we, we consider this, but now it's like, this is even more important to consider. Cause a lot of times people like in the past, let's just take the internal versus external load generation, right? There are use cases for that, right? You can have, maybe your users are internal, or maybe you don't care about testing from the internet. You're just testing the traffic rate on the servers, right? Everything that you set up in a performance test is based on what your goal of the test
Starting point is 00:26:29 is to test, right? You have a theory, you have a, you know, it's kind of a scientific approach. So what do we need to test for that? But when you're testing at something at this scale, all these things that are sort of optional or sort of maybe talking points become critical is really, I guess, what my thought process was there. I actually have a very good example for that. Usually you can take a shortcut when you do a test application or website.
Starting point is 00:27:00 In 99% of the time you can just avoid doing all your CDN objects. Like they're coming from the CDN, not hosted by you. Someone else is hosting it and they're good at it. So it will be fine. But I've run into very large scale performance tests, which rely on CDN.
Starting point is 00:27:23 And actually ran into issues on the CDN objects. Yeah. So for large scale performance tests, I would always say just include all the objects there are, even if it's CDN, because you can run into issues on them as well. Yeah. Like I remember the discussion we had in Santeria. I remember the discussion we had in Santorini and I remember
Starting point is 00:27:48 a couple of years back when we on Black Friday we always analyzed how websites were doing. We found a lot of websites that were rate limited by their CDN
Starting point is 00:28:04 provider because either they were rate limited because they thought it was a denial of service attack or it was a rate limit because they just didn't pay enough money it was not part of the contract and all of a sudden they were rate limited and the content wasn't delivered anymore and is that what you're talking about? Yeah exactly that because yeah I ran into those issues as well when I thought, first of all, the first time, I think, maybe it's not necessary to add all these static objects
Starting point is 00:28:35 because it can be a real saver on the network, on the load generators, et cetera. But when I included them, we actually ran into issues on those third parties or on the CD generators, et cetera. But when I included them, we actually ran into issues on those third parties or on the CDNs. They actually have quotas and limits. Even CloudFront, for example, there's a limit of, I think,
Starting point is 00:28:56 it was 1,000 requests per second or something per region. You can just hit that easily with high-volume users. So if you just don't tune that on your CDN provider, you only will run into it at your peak. And then you're screwed. And then people come to you and say, didn't you test this? We told you exactly what was expected. But that also brings up two ideas.
Starting point is 00:29:27 We talked about CDNs, and you mentioned third-party, and I think third-party is really important to just make sure we explicitly state, because you might have a Google font, right? You might have a tracker, right? This isn't necessarily a CDN object. It's a third-party object, and that can come to bite you as well. But also, if you start talking about all these CDN object, it's a third-party object, and that can come to bite you as well. But also, if you start talking about all these CD elements, all these CDN elements, all these static elements, I would really want to go back to the product team and say, hey, if
Starting point is 00:29:55 you're expecting 2 million people to vote in the next 10 minutes, we need to get rid of everything we don't absolutely need on this page. If you think back to, I'm not sure how familiar you are with what goes on in the United States during the Superbowl, right? So they have these, you know,
Starting point is 00:30:12 near the Superbowl, the big football event, or sorry, American football event. And it's, you know, advertisements run and websites crash, right?
Starting point is 00:30:21 But one of the things, a lot of successive successful companies have done, some have tried the approach of let's just really increase our hardware, right? And that's somewhat successful, but the more successful ones, and we see a lot of the car companies do this, is they strip all the junk off of their page. If they're trying to drive you to their homepage for the new, let's say Volkswagen Passat or something, whatever it is, you know, they get rid of the build your own feature to get rid of all the features on the page. And they just have the main content they're trying to drive you to. They get rid of all the fluff.
Starting point is 00:30:52 They get rid of all the extra fonts. They get rid of all of all these things. So I guess this brings me back to the question is when you're talking about CDN in context of this test, was there a lot of fluff on this? Were they, were they trying to cram there a lot of fluff on this? Were they trying to cram in a bunch of things that were unnecessary for the voting app because they wanted it to be really flashy and cool looking or did they keep it simple? Yeah, well, it was actually, to be honest, very, very simple.
Starting point is 00:31:18 Oh, good. Because it was an app, a native app. Oh, okay. It was an app, a native app. When you open the app, it just starts a WebSocket connection. It does some stuff. And yes, it retrieves some static objects as well, which are placed on the CDN, which we had issues on basically in the first place because there were too many objects retrieved at the same
Starting point is 00:31:45 time. So, we're using CloudFront and CloudFront ran into rate limiting. So, it was really, really helpful that we tested that and tuned that. So, yeah, for the latest apps, there are not a lot of waste on it anyway, because most of them are very simple static objects or API calls or in this case, web services. But there was another example basically on a project that was actually with 500,000 virtual users. And that was also actually for artificial, but for a totally different one. They were relying on a static object on a CDN to retrieve, to actually do the pulling.
Starting point is 00:32:40 So they were not using AppSockets or whatever, but they were just pulling on a JSON file on the CDN. Every two seconds, you get the live state. But imagine 500,000 users doing that. Yeah.
Starting point is 00:32:57 There were 250,000 requests per second on CDN from different regions, of course, but still, because it was an app in the Netherlands, they were not really relying on different regions. It was all the same region. It was not really much geographical spread.
Starting point is 00:33:21 So what happened is that we not even just got into rate limit, but we even were stressing the edge nodes that were located in the islands. So the edge nodes, they fell apart, actually. So you stressed
Starting point is 00:33:40 the capacity in your country wasn't even capable of the CDN provider to handle the load. So that's interesting. Wow. That happened, that happened, yeah. So eventually they changed the pulsing, the pulling part, so let's not retrieve it two times a second because it's quite a lot, but the developers expect it. Yeah, it's a CDN. It's cached. Even if it's cached for one second, they should be fine.
Starting point is 00:34:13 But yeah. It's just like cloud resources. They're infinite and free. It's a CDN. Yeah, I think, Brian, that's actually a good point. While maybe these cloud services are scaling and they can handle the load, but it's not free. Because eventually somebody has to pay for it. That's why you always have to factor that in as well.
Starting point is 00:34:41 One other question. This one hasn't come up yet, I don't think. But I'm just wondering how you might tackle it. I haven't seen evidence of this happening yet, but one new scenario I'm expecting to see in terms of load and performance degradation
Starting point is 00:34:57 is with everybody working from home right now. A lot of people, or quite a large amount of people are connected on their company's VPN right and we all know corporate VPNs are not built for high bandwidth because they have some remote workers a lot are in the office and they don't need the VPN but even still you might be going to a VPN you know at least here in the States I might be going you know half the country away to a vpn
Starting point is 00:35:27 adding all that latency then i'm going through the corporate vpn any sort of network security policies being applied to what's going on when you think about if this test were going to be run in today's you know quarantine world, where you have these considerations of possible slower bandwidth and VPN, what could you do to accommodate that? Would it just be setting some bandwidth throttling on the load generator, or would it even be a consideration, I guess?
Starting point is 00:36:01 We'll start there, and what might be done. So what you actually say is that you add extra network root so you add a vpn to the root to the application or you're talking about testing the vpn service well we couldn't you couldn't necessarily test the vpn service because it can come from anybody's and anywhere we don't you couldn't necessarily test the vpn service because it can come from anybody's in anywhere we don't know the settings but considering considering the impact of performance that might occur if people are using their vpn like let's say again i'm making blake and statements here and i think this is part of the things that would have to be determined and
Starting point is 00:36:40 found out but if let's say the voting was during the day let's say it was a world cup something going on with the world cup and there was a thing vote during the day so everyone's working and then they oh my gosh they go ahead and vote um while they're on their corporate vpn right now the application the response times of the application are going to slow down not necessarily because of the back end because although that could happen but because they have a throttled network connection, everything is going to slow down some more, which means number one, their performance is going to degrade for the end user. But that could also cause some either beneficial or detrimental bottlenecks, because if everyone's slowing down, things aren't getting quite through quite as fast and i just wonder like if we're talking about and it might might just come down to the same the plain simple theory of if this vpn in general slow down your network bandwidth and if so is that another thing to
Starting point is 00:37:37 consider when running these kinds of tests, if you're thinking about something in today's world, right now, where we're at? Yeah, well, I think basically you're talking about network emulation. Yeah, I guess. When you test your application, you actually have to consider the network types of your users. For that, most of the tools, a lot of tools, have some kind of network emulation included, in which you can throttle the network bandwidth
Starting point is 00:38:11 or add latency, or even add some packet loss if possible. And I would say, if a lot of users are on a more, like, crippled network, that could eventually harm your application as well. Yeah. In the sense that response times increase, and that also means that on the server end there is an increase too,
Starting point is 00:38:47 because the connections are open longer. Right, right. At any one time, there is a lot more concurrency in terms of concurrent connections. Yeah. And you also need to keep more data in the buffers before it can be sent out. Exactly.
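The server-side effect described here follows directly from Little's law: average concurrency equals arrival rate times the time each request stays in the system. A quick sketch (the numbers are illustrative, not from the episode):

```python
def concurrent_connections(arrival_rate_per_s: float, avg_response_s: float) -> float:
    """Little's law: average number of in-flight connections =
    arrival rate (requests/s) x average time each request stays open (s)."""
    return arrival_rate_per_s * avg_response_s

# Same request rate, but slower clients keep connections open ten times longer:
fast_link = concurrent_connections(1000, 0.2)  # 200.0 open connections
slow_link = concurrent_connections(1000, 2.0)  # 2000.0 open connections
```

Same throughput, ten times the open connections: exactly the extra pressure on buffers and memory that throttled or high-latency clients put on every network component in the path.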
Starting point is 00:39:03 The whole pipe is open. So you have more concurrent connections, and the buffers, the memory on all the network components, they're getting heavier. So, yeah, it makes sense to simulate that. But it's not even only for VPN, it's also for your mobile users.
Starting point is 00:39:27 Right, right. Many mobile users are on slower networks, maybe even with packet loss involved, and that can have a significant impact. I haven't done this actually for my 2 million user test, because I found it a bit too heavy to also include network emulation. It would be cool, though, to do that.
Starting point is 00:39:52 But, yeah, the network traffic was really low, so I didn't consider that a real risk, and I didn't get into the trouble of adding it. But if you are testing a web application, it is really worthwhile to give your test that extra edge and add network emulation. I agree. I think that's one of the toughest things.
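For readers who want to try this outside a commercial tool: on a Linux load generator, this kind of degradation can also be emulated at the OS level with `tc netem`. A minimal sketch that only builds the command; the interface name and the values are illustrative assumptions, and actually applying the qdisc requires root:

```python
def netem_cmd(dev: str, delay_ms: int, loss_pct: float, rate_kbit: int) -> list[str]:
    """Build a `tc` command that degrades an interface with added latency,
    random packet loss, and a bandwidth cap (netem's delay/loss/rate options)."""
    return [
        "tc", "qdisc", "add", "dev", dev, "root", "netem",
        "delay", f"{delay_ms}ms",
        "loss", f"{loss_pct}%",
        "rate", f"{rate_kbit}kbit",
    ]

# Roughly approximate a congested VPN or mobile link:
cmd = netem_cmd("eth0", delay_ms=80, loss_pct=1.0, rate_kbit=2048)
print(" ".join(cmd))
# -> tc qdisc add dev eth0 root netem delay 80ms loss 1.0% rate 2048kbit
```

Running the command (e.g. via `subprocess.run(["sudo", *cmd])`) would shape all traffic on that interface, which is why tool-level emulation per virtual user is usually the more flexible option.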
Starting point is 00:40:21 You mentioned before being realistic with your test, and I think that's one of the biggest challenges in designing a test: the more you think about being realistic, the larger the scope becomes, right? Because you suddenly start thinking of, oh, this aspect, that's like so many variables. And that's always just been a challenge, at least back when I used to do performance testing. It's always such a challenge to take that realistic approach while actually trying to get a test done, because the more you think about it, the more the proverbial can of worms opens up. You could sit there for two months setting up a test as you think more and more about it, and meanwhile the test needs to be executed, you know? It's always the balance. Right, right. Being as realistic as possible, trying to mimic the users
Starting point is 00:41:16 as well as possible, because, yeah, that gives you the most honest, real results, right? If you take shortcuts, take them wisely, I would say. Every shortcut should be thought about, considered, and evaluated. Yeah. Hey, Joerek, I remember, and I don't remember the details, but I remember this topic started some discussions back in Santorini, something about DNS, that you had some lessons learned on DNS. Can you remind me of that and fill us in? Yeah, yeah. That's actually a bit of a technical part, but if you're doing a heavy performance test,
Starting point is 00:42:08 you're probably testing an application that runs on the cloud or relies on cloud services like CDNs or other cloud platforms. And as far as I know, practically all of them rely on the DNS system to load-balance the users. And if you're doing a performance test, it can be really hard to actually tackle this. So it's really important that you know how the platform you're testing is dealing with this DNS load balancing, because for sure it is dealing with that. And if you just don't do anything and take the test tool out of the box, be it NeoLoad, JMeter, LoadRunner, doesn't matter, the DNS simulation, I would say, is actually pretty bad. What will happen is your load generator will cache the DNS response, and most of your users, or all of the users,
Starting point is 00:43:26 are just connecting to the same IP address all over again. And you're testing just a small part of the application. It's not realistic. So, yeah, some things that I do for these heavy tests: what I do is I just use a self-hosted local DNS server on the load generators. So I've installed a local DNS server, and then I actually override the TTL. Usually the TTL for these cloud services is 60 seconds.
Starting point is 00:44:05 That means that for 60 seconds, all users are on the same IP address. That's not realistic, because all the users would be doing that. But by overriding the TTL, you actually force the load generator to constantly fetch the new IP address list and connect to all of those. Because if the application or the cloud service redirects you to another IP address, because it scaled up or it just wants to avoid a heavy edge node or whatever, you want to follow that behavior like real users would do. Because every time a new user is opening the browser and going to the website, it's doing a DNS request for that user.
Starting point is 00:45:08 So that's really tricky. So, that I understand this correctly: you had your 800 agents, load-generating machines, and on all of those you installed a local DNS server. And this one had a TTL, a time to live, of one second instead of 60. That means every second you requested it, you got the new DNS entries, and this actually forced the load generator itself to refresh the DNS.
Starting point is 00:45:39 That means the testing tool needs to understand this as well. So the testing tool then actually gets served addresses from your local DNS server with a time to live of one second instead of 60 seconds. I get it. Yeah. Yeah. And, of course, you have to tune your JVM so that it's not caching DNS responses.
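The exact server setup isn't spelled out in the episode, but the effect of a one-second TTL can be sketched as a small resolver cache: once the cached answer expires, the next virtual-user iteration does a fresh lookup and can land on a different edge IP. A hedged sketch, with the resolver and clock injectable so the logic is testable without a network:

```python
import socket
import time

class TTLResolver:
    """Cache DNS answers only as long as a (short) TTL, mimicking what a local
    DNS server with an overridden 1-second TTL forces the load generator to do."""

    def __init__(self, ttl=1.0, resolve=None, clock=time.monotonic):
        self.ttl = ttl
        # Default resolver asks the OS; tests can inject a fake instead.
        self._resolve = resolve or (lambda host: socket.gethostbyname_ex(host)[2])
        self._clock = clock
        self._cache = {}  # host -> (expiry_time, ip_list)

    def lookup(self, host):
        expiry, ips = self._cache.get(host, (0.0, None))
        if ips is None or self._clock() >= expiry:
            ips = self._resolve(host)  # fresh query once the TTL has elapsed
            self._cache[host] = (self._clock() + self.ttl, ips)
        return ips
```

On the JVM specifically, the tool side also matters: by default the runtime caches successful lookups, so a JMeter-style test would additionally need something like `-Dsun.net.inetaddr.ttl=0` (or the `networkaddress.cache.ttl` security property), which is the JVM tuning alluded to above.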
Starting point is 00:46:02 But there's another trick that you can do, actually. Because usually your load generators are located in maybe one or just a few ISP locations. So if you use AWS or Google Cloud or Azure or whatever, it's still not really geographically spread. What you can also do, and this is also what I did in the test: if you want to simulate geographically spread users, you can use a DNS server on that machine with a different geographical location.
Starting point is 00:46:43 For example, if I have a load generator which is located in Amsterdam, but I actually want to simulate it from Berlin, I can use a DNS server which is actually in Berlin, and then the cloud service will think the users are located in Berlin and respond with IP addresses of edge nodes which are in Berlin.
Starting point is 00:47:10 So technically, and I'm not the expert on DNS, would you say that when you install your local DNS server, you basically give it a location, a geolocation? You override it, or you define the geolocation as being, let's say, Berlin, and therefore the cloud services believe you are from Berlin? Or are you talking about some DNS servers out there in Berlin that you're using? The last one. The last one, okay. So if you're using a self-hosted DNS server, the CDN would think you're actually in the location where it actually is.
Starting point is 00:47:49 Most probably. If you use a third-party DNS service, and there are lists on the internet of open DNS servers with their location, even with their city, and you use that DNS server on the machine where your load generator is installed, then that machine will do the DNS request to the DNS server in a completely different location. Then that DNS server will do the actual request to the CDN. And the CDN would think, ah, the user is using that DNS server, so it's probably located there. It would respond with an IP address list, or an IP address, that's close to the user,
Starting point is 00:48:35 so the user would have the lowest latency. Yeah. And by doing this, you can kind of trick the application into thinking you're in a lot of locations. So, yeah, that's a bit of a trick, actually, to spread load generators across geographical locations. But it's not really that you need to have a load generator in that location, or that you want to test from that location; you want to hit the different edge nodes. You don't want the edge nodes you're hitting to be a bottleneck. Yeah.
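A sketch of this resolver-pinning trick: map each load generator to a public DNS server hosted in the region you want the CDN to "see", then issue lookups through that resolver. The generator names and IPs below are placeholders (documentation-range addresses), not real open resolvers:

```python
# Hypothetical mapping: load generator -> open DNS resolver in the target region.
# 203.0.113.x is a reserved documentation range, used here as a placeholder.
GEO_RESOLVERS = {
    "lg-amsterdam-01": "203.0.113.10",  # e.g. a resolver hosted in Berlin
    "lg-amsterdam-02": "203.0.113.20",  # e.g. a resolver hosted in Paris
}

def dig_cmd(generator: str, hostname: str) -> list[str]:
    """Build a `dig` query pinned to that generator's remote resolver; the CDN's
    geo-aware DNS then answers with edge nodes near the resolver, not the LG."""
    resolver = GEO_RESOLVERS[generator]
    return ["dig", f"@{resolver}", hostname, "+short"]

print(" ".join(dig_cmd("lg-amsterdam-01", "www.example.com")))
# -> dig @203.0.113.10 www.example.com +short
```

In a real test the same resolver would be configured system-wide (or in the local DNS server's upstream list) on each load generator, so every lookup the tool makes flows through it.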
Starting point is 00:49:16 That's very cool. Hey, Joerek, I know that there's also a blog post that you wrote as one of the proceedings for Neotys PAC. We want to make sure we link that. And I assume in the blog post you're going through some more details, so we may not want to steal all of the thunder. We want to make sure people are reading it as well. Yeah, the blog post still has to be published, and in the
Starting point is 00:49:46 blog post there's more explanation on how to trick the DNS, actually, and how to set it up in your performance test. Yeah, that's amazing, extra ways of being more realistic. Yeah.
Starting point is 00:50:15 I've got one last question for you, because, as we both said earlier, we want to make sure that our better halves are not waiting too long for us for dinner; it's late here in the evening. So, what you've just told us, these are very special types of load tests. You are testing for a big event. It's a one-time thing. It's not possible to test by doing, let's say, shadow traffic in production, because you don't have production traffic right now, things like that. So these are very specific, very special tests. Now, Brian and I, in the last couple of months and years,
Starting point is 00:50:46 have been talking a lot about shifting left: testing earlier, testing more frequently, smaller tests, integrating into a CI/CD pipeline. Is there anything we can learn from these tests to shift left in order to avoid these tests? Or are they just two completely different disciplines, where you say, yes, we have to shift left to find things earlier, but you still need these large load tests in preparation for a big event? Yeah, I really would love to say that you can go left with this type of test, and that this is another approach to tackle this.
Starting point is 00:51:31 But I would say you really need this kind of full acceptance test to be really sure that the application is capable, because, as I said, with these high loads, you can never test that easily in your pipeline. Never. And it's not easy to do the CDN part, which can have troubles; it's probably not part of your pipeline, or you cannot just tackle it in a small-scale test. The concurrency problems you might have throughout the whole application are very complicated to test automated in your pipeline. So alongside these heavy tests, I would say, yeah, you probably can
Starting point is 00:52:31 test the small services that you're creating already. So along the pipeline, along the releases, you can tackle differences. That's always good. But in the end, just before go-live, I would always say, yeah, you just have to do it to be sure. Yeah, that makes sense. Yeah. I was thinking, too, this isn't a case where you can slowly trickle traffic to the new feature, because this is almost an all-at-once, you know? So it's, yeah, it's interesting. Yeah. And actually, I still want to encourage everyone to test their application as it is, in the traditional way, as a full acceptance test. Because when everything is put together
Starting point is 00:53:27 and you test with the real load, there are so many other problems you can encounter. Right. And I think even those tests become a decision point, right? Like in your specific case, that 2 million user test, right? Let's take a step back. A lot of times people will say, we'll run performance tests against my services, we'll do a lot of shifting left and run for capacity and everything on the specific services, so that we know if we test every single service,
Starting point is 00:54:05 it should all work together, right? Which we know from history is not always the case. But I think a lot of the, let's call it maybe more agile-focused performance testing, is testing the changes that are being introduced, then observing in production and then fixing, right? This whole kind of feedback cycle. And so when you have a situation where it's normal traffic, a normal scenario, you can take calculated risks, understanding everything that's going on and everything that you're willing to risk to push something to production without having a full end-to-end test. Because maybe it's your steady, standard load.
Starting point is 00:54:45 Maybe you're going to do a blue-green deployment, or a canary release where you trickle in your users, and you can do slow observations. But again, considering what that traffic is going to be, and what the real-life scenario is going to be in the situation you have specifically,
Starting point is 00:55:03 I don't think there's any way you can argue that, oh yeah, we're just going to test everything before production and then assume 2 million users end-to-end are going to work. So I think a lot of times that consideration has to be made, and in the performance world there is a lot of give and take, and decision points being made about
Starting point is 00:55:22 how frequently we still run the end-to-end tests, right? They're still obviously very much important, but we're not in the waterfall days where you can hold up a build to do an end-to-end test every single time. So it becomes part of that decision point. But I think, yeah, in that scenario you're talking about, it's obvious: you have to. Yeah, yeah. There are still so many applications that are, like, event-based, where it's maybe a one-time occasion where they have a lot of load, a lot of users. It's not like an ordinary day where you can learn and grow and tune, back and forth. Yeah.
Starting point is 00:56:07 So that's the difference. Yeah. Awesome. Yeah. Joerek, in the beginning, before we started recording, you said, I don't even know if we can talk for an hour. Well, it's an hour later, right? We're still here. And I believe, though, Joerek, I would love to have you back, because I know you're constantly out there, you're constantly
Starting point is 00:56:32 doing projects, you learn a lot on the job that you want to share. Hopefully it will be great. So maybe we'll get you back in a couple of months for another episode of extreme load testing. Yeah. Or a totally, completely different topic. Maybe. Yeah. Or maybe something performance-related, though. Yeah.
Starting point is 00:56:52 Yeah. Of course. Yeah. Sounds good. Well, thanks for having me. Thank you for taking the time and giving me the opportunity to talk about this topic.
Starting point is 00:57:02 Yeah. It's a pleasure to have you. And do you do any of the social media you would like to put out there for people to follow you, whether it's LinkedIn or Twitter or anything?
Starting point is 00:57:18 Yeah, you can follow me on LinkedIn. It's probably the social media channel I use most. Okay, we'll put the link. Connect with me. Yep.
Starting point is 00:57:36 And if anybody has any ideas, topics, or anything else, or if they want to come tell us some stories as well, you can reach out to us at pure underscore DT on Twitter, or you can send us an email, pureperformance at dynatrace.com. And thank you, Joerek, for being on the show. This was great fun today. Great. Well. Thanks, everyone. Go back to your quarantine part.
Starting point is 00:57:57 Yeah. Good luck with that. Yeah. Good luck, everybody. All right. Thank you. Bye-bye.
