PurePerformance - Extreme load testing with 2Mio Virtual Users: Lessons learned with Joerek van Gaalen
Episode Date: May 25, 2020. How do you prepare for a 2Mio concurrent user load that lasts for 7 seconds? What does the load infrastructure look like? How do you optimize your scripts? How do you deal with DNS or CDNs? In this episode we hear from Joerek van Gaalen who has done these types of tests. He shares his experiences and approaches to running these "special event extreme load tests". If you want to learn more, make sure to check out his presentation and read his blog post from Neotys PAC 2020.
https://www.linkedin.com/in/joerekvangaalen/
https://www.neotys.com/performance-advisory-council/joerek_van_gaalen
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's another episode of Pure Performance.
My name is Brian Wilson and as always I have with me my co-host, Andy Grabner.
Andy, how are you doing today?
I'm pretty good.
How often has it happened that you introduce me and kind of say I'm here, and I wasn't here? Has it ever happened? You always say, hey, and with me my co-host, and then you don't even know if I'm here.
Of course I know you're there. Why would I... I mean...
Sure, you know, this is the kind of argument my daughter makes with me. Like, anything can happen, you know. You could suddenly disperse or disintegrate, even though I was just talking to you. The next second you might no longer exist. There might have been a blink or a Thanos event or something. Yes, that could happen. But if it did happen, I think I would have a little bit more to be worried about than the fact that I went to introduce you and you suddenly weren't there, you know?
So I take it for granted.
You're right.
You know, I should never take anything for granted.
I'm just so happy that you're always there for me, Andy.
You're always there supporting me.
I'm happy, too, because without you, this would not be recorded and wouldn't make it out to the World Wide Web so people can listen to it.
Well, you would record it on regular Zoom and it would sound terrible, but yes.
Hey, speaking of which, things that people can listen to.
So we have a new guest, a first timer.
Yay.
Yay.
And I want to introduce him. Well, I think it's best for him to introduce himself,
but I just want to give some background on how we know him.
I was lucky enough to be invited to the Neotys PAC earlier this year in Santorini
where a couple of performance engineers got together
and talked about performance engineering stuff, and Joerek
he was there and he talked about some really cool stories from his large-scale load testing
projects that he has done and it was so fascinating all the lessons learned about his
major major projects that we said,
let's get him on the podcast and the whole world,
all of our listeners should hear what to watch out for in case you run those
really large scale load tests.
I think it was actually extreme load tests, right?
Extreme load tests, exactly.
So, Joerek, do us a favor.
First of all, thanks for joining.
Introduce yourself, who you are, what you're doing, what your background is, and then let's jump into the topic.
Yeah, thank you. And thanks for the invite, by the way, for this podcast with you.
So it's been a pleasure. Well, I am a self-employed performance tester, and I've been a performance engineer for the last 15 years. So I've always been around in the performance part of the world, actually. And since I started with performance testing, I've mostly done short-term projects, so I've seen a lot of different smaller projects. I think I've done more than a hundred different projects, which is cool. And lately I've done a lot of large-scale performance tests. That was also the topic of my talk in Santorini.
Yeah, and actually I remember now, looking at the page on the Neotys website,
the Performance Advisory Council.
The both of us, we were not only in Santorini,
we were also at the inauguration event back in 2017 in Edinburgh, correct?
In the castle.
Yeah, that was cool, in the castle.
Castle, yeah.
I've got to admit, can you quickly remind me of the topic that you talked about back then in the castle?
Yeah, I have to remember what it was. The topic was about the changing role of the performance engineer. So usually the role was always like, hey, you're a performance tester, so you do your old-fashioned kind of performance test.
And I was talking about not only the shift left, but also the shift right.
So there was a lot of things as a performance tester,
performance engineer that you can do.
So not only testing, but also being part of production: monitoring in production, looking into that, doing the capacity management, and also, on the left part of the pipeline, helping the developers with their performance challenges, et cetera. So there is a lot more stuff that you can do as a performance engineer. I wanted to shine a light on everything that you can do, and on being more like a performance specialist rather than an engineer.
Now, coming back to the... by the way, the presentations are all online.
So in case you want to look at it, check it out: go to the Neotys website, go to the Performance Advisory Council, find Joerek's picture, click on it,
and then you get all the presentations.
Now, coming to the large-scale extreme performance tests that you've been running recently,
I remember the one example
where you were running a 2 million virtual user test
or concurrent user test for a TV show, right?
Can you give us a little bit more details?
Yeah, yeah, yeah. That's correct. I've done a two million concurrent virtual user
performance test, and it was for a TV show. Actually, the project management asked me to talk about some performance tests that they wanted for their project. And they started: yeah, we would like you to do a performance test, and we expect 2 million users, so we should be able to verify that the platform is able to handle that load.
So the first thing you think of: 2 million users, sure. You're probably talking about registered users or total visits in a day, because that's what you see often, right? When people talk about the expected loads, they tend to give too high numbers.
But then they were talking about the project and said, okay, you know, it's for a live TV show where people start voting. We expect two million votes within seven seconds, and that's what the application should be able to handle. So yeah, they truly wanted to see two million concurrent users voting, because contestants actually get on stage, and the second they get on stage, people have to vote on whether they should pass or not pass. So yeah, that was actually the test.
It was cool because we did the test with NeoLoad and asked them for a license for 2 million users. And they actually said, hey, we have never done this, we can't even generate the license, because the maximum number of users we can select is a million. So they actually had to change their back-end systems to be able to generate a license for 2 million users. So that was also a fun fact, actually.
Yeah, you were stretching the boundaries of your load testing vendor. But what's interesting is they didn't say no. They probably said, oh, this sounds fun, let's do this, you know?
Yeah, because they also knew that it's a nice challenge.
It's cool to say that your tool supports it, but I also thought it was cool to do it. So it was a challenging project.
It's a free load test of the load testing tool.
That really, really was the case, because we had a lot of feedback for them to improve. Of course, at this scale you find stuff that you would otherwise never find out about.
So what's the trickiest thing for such a project? Because obviously this is testing for a one-time event, and the infrastructure for that test exists just for the time of the project. I don't know how long it took you to set it up, to make sure that the test infrastructure is there, and to actually figure out what the test infrastructure is. I think that's probably a challenge too: how much test infrastructure, how many load generators do you need, and things like that.
Yeah, of course, the biggest challenge for such a test
is just to be able to generate the loads, right?
To have the capacity to generate the loads
and not being the bottleneck yourself.
Because that would be kind of ironic, right?
If you do a performance test on an application and you are the bottlenecks.
So yeah, that was the biggest challenge to have sufficient power.
And also what I wanted to do to make the challenge even more challenging
is to run the test from a single controller.
So I didn't want to duplicate, let's say, eight or ten different controllers and hit the start button all at the same time.
And then gather results later on.
I also wanted the challenge of making it possible from a single controller.
And yeah, the test setup was like we had, I used 800 load generators.
The load generators were able to simulate up to 3,000 concurrent virtual users against the application, so I was fine with, well, 2,500 on a single load generator. And for that to happen, you first need to know what a single load generator can handle in the test, so you then know how many load generators you actually need. So I first did a stress test against the application with just a single load generator to see how far I was able to go, and that was about 3,000. At 3,000, things start to get itchy: response times start to slow down, the agent becomes unresponsive, things happen. So I set a safe limit of 2,500.
Okay. Just to jump in here, I remember before Dynatrace I used to work for Segue Software, which later became Borland. We did SilkPerformer, right? Yeah, yeah.
And I remember we had a feature in SilkPerformer, this was at least 15 or 20 years ago now, where the controller monitored the health
state of the agents and also gave you recommendations.
So you ran a script.
And then, and I'm sure they have advanced this over the years
now.
It's been really 15 years since I've been there.
And the controller gave an overview of all the agents
and gave recommendations on how many scripts you can safely run, how many scripts of that particular type or how many virtual users for that script
you can safely run before the agent gets into a mode where the agent itself is impacting the test
results. So I think that's obviously something you have to consider if you're running these
large-scale load tests. You don't want your test infrastructure to be the problem.
And that could either be the load generating agents,
but I guess also the network bandwidth that you have
from your load generators to your application that you're testing on.
Yeah.
Well, yeah, NeoLoad has some kind of features
to look into the current system resources of the load generator,
but it doesn't actually give you a lot of advice up front on how it's behaving and how many you would need for the test.
But yeah, it would be a great feature, by the way.
So what else?
So that means if you started with figuring out what's the capacity of a single load agent,
then you could do the math.
And you said you came up with about 800 load generating agents to cover your 2 million number.
What else is there?
I mean, this is obviously some important preparation step.
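As an aside, here is a minimal sketch of the sizing math described in this part of the conversation, using the numbers from the episode (a single generator struggling around 3,000 users, 2,500 used as the safe limit, 2 million as the target); the function names and the explicit safety factor are illustrative only and not part of any tool.

```python
# Rough load-generator sizing using the numbers from the episode.
# Function names and the safety factor are illustrative only.
import math

def safe_users_per_generator(observed_ceiling: int, safety_factor: float = 0.85) -> int:
    """Back off from the ceiling found in a single-generator stress test."""
    return int(observed_ceiling * safety_factor)

def generators_needed(target_concurrent_users: int, safe_per_generator: int) -> int:
    """How many generators are needed to reach the target concurrency."""
    return math.ceil(target_concurrent_users / safe_per_generator)

if __name__ == "__main__":
    ceiling = 3_000     # a single generator started to struggle around here
    safe = 2_500        # the safe per-generator limit used in the episode
    print(safe_users_per_generator(ceiling))     # ~2550, same ballpark as 2,500
    print(generators_needed(2_000_000, safe))    # -> 800 load generators
```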
Yeah, there are a lot of preparations actually to do
and also a bit of the approach.
Because there are some levels actually on the test
that you have to kind of tune.
Because you can tune your scripts, you can tune the agents, you can tune the scenario and the controller. For example, I also had to tune the controller and the scenario settings, because 2 million users are working concurrently. And NeoLoad has a feature to live show the current behavior of the test,
the average response time, the transactions per second, et cetera.
So it's a lot of live data that's going on.
Actually, when I first ran the test, the controller just slowed down so much at nearly a million users. So I needed to tune the controller first.
Because there was so much network traffic going back live at one time
from all the agents towards the controller.
So one thing that I needed to tune was to kind of throttle down the network pushback from the agents to the controller. Most tools have a feature where you can follow the users live, so you can see which transactions they are in and which step they are doing at the moment. You can disable that.
And by disabling that, it was a tremendous save on live network traffic.
So that was already a big win.
And I remember when we had the discussion in Santorini about it,
that we said it would be smart for tools to say, hey,
you are running a 2 million user test.
You probably don't want to see every single user live.
That means you may want to disable this feature by default or at least give recommendations on settings you have right now
that are, you know, like give advice on better defaults
or settings based on your workload configuration.
Yeah.
Yeah, so something that was also really important to do,
because if something occurs on the platform you're testing,
let's say a service is restarting or whatever,
and all the users get a small part of errors at that moment.
If you're running with 2 million users or even with 100,000,
all the users can get the same error at the same time.
There are tons of errors per second you're generating.
Every time that happened, it caused a lot of agents to crash at that moment, and the controller to be just unresponsive.
So tuning was necessary on that part too.
Just storing a limited amount of errors per second,
and that also helped me to be able to cope with just sudden spikes of errors.
Something you will just encounter when you have a heavy test.
Yeah.
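A minimal sketch of the idea of capping how many errors per second get stored in detail, so an error storm cannot flood the controller; this is a generic token-bucket limiter to illustrate the concept, not the actual NeoLoad setting.

```python
# Generic token-bucket error sampler: store at most N error details per second,
# only count the rest. Illustrates the idea of limiting stored errors; it is
# not the actual tool setting.
import time

class ErrorSampler:
    def __init__(self, max_stored_per_second: int = 100):
        self.capacity = max_stored_per_second
        self.tokens = float(max_stored_per_second)
        self.last = time.monotonic()
        self.dropped = 0                 # errors counted but not stored in detail

    def record(self, error: str, store) -> None:
        now = time.monotonic()
        # refill tokens proportionally to the elapsed time
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.capacity)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            store(error)                 # keep the full error detail
        else:
            self.dropped += 1            # during a spike, just bump a counter

# usage sketch: sampler = ErrorSampler(50); sampler.record("HTTP 503 on /vote", print)
```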
So besides that, besides the load agents, figuring out the capacity, the number of agents you need, optimizing the whole controller
and your runtime settings.
Was there anything else, or any other lessons learned from a testing tool perspective, that you can tell us about?
Well, yeah, I haven't talked about what you can do on the scripting part. Because, yeah, you have to tune your scripts too, because the scripts are actually the core of your test, you know. The scripts are run by the agents, and there is some stuff that you can do in your scripts to actually limit the load on the agents.
Like what, for example?
Yeah, sure.
If you look at your scripts, actually all performance testing scripts have extractions. You need to capture a variable value and use that in the next request. Most tools use regular expressions, or you can use regular expressions for that. You have to use them wisely, because they can really consume CPU. So don't use a lot of wildcards in them. They can really cripple the performance.
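To illustrate the wildcard point: a greedy `.*` extraction scans and backtracks over the whole response body, while a pattern bounded by the closing quote stops immediately. The response body below is made up for the illustration; the effect is general regex-engine behavior.

```python
# Illustration of the wildcard warning: both patterns extract the same value,
# but the greedy .* scans to the end of the body and backtracks, burning CPU.
import re
import timeit

# A large response body with the value we want near the start.
body = '<input name="csrf" value="abc123"/>' + ('<div>filler</div>' * 50_000)

greedy  = re.compile(r'name="csrf" value="(.*)"')     # runs to the end, then backtracks
bounded = re.compile(r'name="csrf" value="([^"]*)"')  # stops at the closing quote

print(greedy.search(body).group(1), bounded.search(body).group(1))  # both: abc123
print("greedy :", timeit.timeit(lambda: greedy.search(body), number=200))
print("bounded:", timeit.timeit(lambda: bounded.search(body), number=200))
```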
Consider proper assertions. Don't use too many of them. I'm usually a really big fan of having a lot of assertions all over the place, all over your script and requests, so you can verify a lot and you know when there's an error occurring. But for a heavy test, you might want to use a few less assertions to save some resources too, because they can be really consuming. Right.
Another thing to consider is decreasing the think time in your test. I'm never a fan of this, also not for a heavy user test, but decreasing the think time means you will have a shorter session time per virtual user. That means you have fewer concurrent users running.
And that also means less memory usage at one time.
You can do that too.
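The relationship behind that remark can be written down as Little's Law: concurrent users ≈ arrival rate × session duration. It is also why "2 million votes in an hour" and "2 million concurrent users" are very different tests; the session lengths below are just example values.

```python
# Little's Law view of the think-time remark: concurrency = arrival rate x session time.
def concurrent_users(arrivals_per_second: float, session_seconds: float) -> float:
    return arrivals_per_second * session_seconds

# 2 million votes spread over an hour, with an assumed 30-second session:
print(concurrent_users(2_000_000 / 3600, 30))   # ~16,667 concurrent users
# 2 million votes inside a 7-second window, every session spanning that window:
print(concurrent_users(2_000_000 / 7, 7))       # 2,000,000 concurrent users
```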
It's funny you mentioned that one because I was just thinking myself,
aside from that,
that there are a lot of considerations
that you have to take when running
a load test. Such as, you know, there was a question that came up once a long time ago: can you run a heavy load with one user in JMeter? The answer is yes, with no think time, right? But it's zero concurrency, you know. And understanding what kind of test you need to run is a huge part of that, right? The product managers could come to you and say, we want to have 2 million votes, right? In an hour, or in seven minutes. All right. But the questions you as a performance engineer, a performance tester, need to ask are: what's the concurrency we're expecting to have in this?
And then when you get to the scale that you're running at, you do then run into the issues of can you actually test that concurrency?
Because what's that going to do to the amount of generators that you're going to need?
And is the controller going to be able to, you know, if you need 2 million concurrent, that's going to be taking the challenge you had already to a whole different level.
Exactly, yeah.
And there may still have to be some real world shortcuts you have to take just in order to get some sort of resemblance test done.
But also things you'd have to consider, which you always have to consider in any test, but also just kind of come to a whole different scale at the scale of testing you're thinking about.
You know, there's conceptually, do I run a test from an internal generator or external generator?
How many locations am I running from?
Am I running it from different cloud vendors so that I can make sure that as this goes on?
Because in reality, your users are going to be coming in through different angles.
It's not all going to be coming in through an AWS DNS or pipeline, I should say.
So you really have to start thinking much bigger scale.
How do we distribute that load?
Internal, external, you know, what different geolocations is this going to come from? If this is, obviously, a single-country show, then you don't have to worry too much about geolocation. You might want to consider some, but all these decisions become a lot
more relevant and a lot more important. So, what were some of the... and sorry, I cut you off on that one piece, but I think that was a really, really important part that you mentioned about the concurrency. Which of those things that would normally be, yes, we need to consider that, we need to think about that, which of those kind of became really strongly focused, or really high priority, I should say?
Yeah, well, first of all, I agree 100% with you, what you mean, what you're saying.
And that's actually always the case.
I mean, being really as realistic as possible, that's always a good approach for your performance test, I would say.
As you said, you can do a lot of transactions and you say,
okay, we need 2 million votes in an hour,
and you just generate it with a low amount of virtual users.
Yeah, you can probably do that, but you are not doing the right concurrency.
And that's what I want to achieve in this test too,
because the web application we were testing was based on WebSockets. So WebSockets were involved. And that means that every user sets up a connection, and every user has just one simultaneous connection at one time to the application. And as we all probably know, maximum connections is usually a very common performance issue: running out of threads, running out of connections, or whatever, things like that.
or whatever, things like that.
So when you want to tackle that, you just need to simulate it.
So that's why I also always try to have the right number of concurrent users
as you can expect in a peak moment in production.
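A minimal asyncio sketch of what that concurrency means for a WebSocket test: each virtual user holds its own open connection for the duration of the peak, so the server sees the real number of simultaneous connections. It assumes the third-party `websockets` package; the URL, payload, and hold time are placeholders, not the actual test.

```python
# Sketch: one held-open WebSocket per virtual user, so server-side connection
# counts match the simulated concurrency. Assumes the third-party `websockets`
# package (pip install websockets); URL, payload and hold time are placeholders.
import asyncio
import json
import websockets

URL = "wss://example.com/vote"          # placeholder endpoint

async def virtual_user(user_id: int) -> None:
    async with websockets.connect(URL) as ws:        # each user = one connection
        await ws.send(json.dumps({"user": user_id, "vote": "pass"}))
        await ws.recv()                               # wait for the ack
        await asyncio.sleep(7)                        # hold the connection over the peak

async def main(concurrent_users: int = 1_000) -> None:
    await asyncio.gather(*(virtual_user(i) for i in range(concurrent_users)))

if __name__ == "__main__":
    asyncio.run(main())
```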
Does that answer the question a bit? Because I thought the question went in a lot of different directions along the way.
Yeah, no, it really was more of a discussion, or just me thinking aloud, I guess, about some of the things you would normally consider during a load test. Which seems like, okay, we consider this, but now it's like, this is even
more important to consider. Because a lot of times, like in the past, let's just take the
internal versus external load generation, right? There are use cases for that, right? You can have,
maybe your users are internal, or maybe you don't care about testing from the internet. You're just
testing the traffic rate on the servers, right? Everything that you set up in a performance test is based on what the goal of the test is, right?
You have a theory, you have a, you know, it's kind of a scientific approach.
So what do we need to test for that?
But when you're testing something at this scale, all these things that are sort of optional, or sort of maybe talking points, become critical. That's really, I guess, what my thought process was there.
I actually have a very good example for that.
Usually you can take a shortcut when you test an application or website. 99% of the time you can just avoid requesting all your CDN objects.
Like they're coming from the CDN,
not hosted by you.
Someone else is hosting it and they're good at it.
So it will be fine.
But I've run into very large scale performance tests,
which rely on CDN.
And actually ran into issues on the CDN objects.
Yeah.
So for large scale performance tests,
I would always say just include all the objects there are,
even if it's CDN, because you can run into issues on them as well.
Yeah.
Like, I remember the discussion we had in Santorini, and I remember
a couple of years
back when we
on Black Friday we always
analyzed how websites
were doing. We found a lot of
websites that were
rate limited by their
CDN
provider because either they were rate
limited because they thought it was a denial of service attack or it was a
rate limit because they just didn't pay enough money it was not part of the
contract and all of a sudden they were rate limited and the content wasn't
delivered anymore and is that what you're talking about? Yeah exactly that
Because, yeah, I ran into those issues as well. The first time, I thought, maybe it's not necessary to add all these static objects, because it can be a real saver on the network, on the load generators, et cetera. But when I included them, we actually ran into issues on those third parties or on the CDNs.
They actually have quotas and limits.
Even CloudFront, for example,
there's a limit of, I think,
it was 1,000 requests per second or something per region.
You can just hit that easily with high-volume users.
So if you just don't tune that on your CDN provider,
you only will run into it at your peak.
And then you're screwed.
And then people come to you and say,
didn't you test this? We told you exactly what was expected.
But that also brings up two ideas.
We talked about CDNs, and you mentioned third-party,
and I think third-party is really important to just make sure we explicitly state,
because you might have a Google font, right?
You might have a tracker, right?
This isn't necessarily a CDN object.
It's a third-party object, and that can come to bite you as well. But also, if you start talking about all these CDN elements, all these
static elements, I would really want to go back to the product team and say, hey, if
you're expecting 2 million people to vote in the next 10 minutes, we need to get rid
of everything we don't absolutely need on this page.
If you think back to,
I'm not sure how familiar you are with what goes on in the United States during
the Superbowl,
right?
So they have these,
you know,
near the Superbowl,
the big football event,
or sorry,
American football event.
And it's,
you know,
advertisements run and websites crash,
right?
But one of the things,
a lot of successful companies have done,
some have tried the approach of let's just really increase our hardware, right? And that's somewhat
successful, but the more successful ones, and we see a lot of the car companies do this,
is they strip all the junk off of their page. If they're trying to drive you to their homepage for
the new, let's say Volkswagen Passat or something, whatever it is, you know, they get rid of the build your own feature to get rid of all the features on the page.
And they just have the main content they're trying to drive you to.
They get rid of all the fluff.
They get rid of all the extra fonts.
They get rid of all of these things.
So I guess this brings me back to the question: when you're talking about CDN in the context of this test, was there a lot of fluff on this? Were they trying to cram in a bunch of things that were unnecessary for the voting app because they
wanted it to be really flashy and cool looking or did they keep it simple?
Yeah, well, it was actually, to be honest, very, very simple.
Oh, good.
Because it was an app, a native app.
Oh, okay. When you open the app, it just starts a WebSocket connection.
It does some stuff.
And yes, it retrieves some static objects as well,
which are placed on the CDN,
which we had issues on basically in the first place
because there were too many objects retrieved at the same
time. So we were using CloudFront, and CloudFront ran into rate limiting. So it was really, really helpful that we tested that and tuned that. So, yeah, for these native apps, there is not a lot of waste on them anyway, because most of it is very simple static objects or API calls, or in this case, WebSockets.
But there was another example, basically a project that was with 500,000 virtual users, and that was actually for a totally different one. They were relying on a static object on a CDN to actually do the polling. So they were not using WebSockets or whatever; they were just polling a JSON file on the CDN. Every two seconds, you get the live state. But imagine 500,000 users doing that. Yeah.
There were 250,000 requests per second on the CDN, from different regions, of course. But still, because it was an app in the Netherlands, they were not really relying on different regions. It was all the same region. There was not really much geographical spread.
So what happened is that we not only got into rate limiting, but we were even stressing the edge nodes that were located in the Netherlands. So the edge nodes, they actually fell apart.
So you stressed the CDN provider so much that the capacity in your country wasn't even capable of handling the load. That's interesting. Wow.
That happened, yeah. So eventually they changed the polling part: let's not retrieve it two times a second, because it's quite a lot. But the developers expected it to hold.
Yeah, it's a CDN.
It's cached.
Even if it's cached for one second,
they should be fine.
But yeah.
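The back-of-the-envelope math behind that story: with client-side polling, every active user hits the CDN once per interval no matter how well the object is cached at the edge, so the poll interval directly sets the aggregate request rate. The numbers are the ones from the story; the alternative intervals are just for comparison.

```python
# Aggregate request rate of client-side polling: every active user polls once
# per interval, regardless of how well the object is cached on the edge.
def polling_rps(active_users: int, poll_interval_seconds: float) -> float:
    return active_users / poll_interval_seconds

users = 500_000
print(polling_rps(users, 2.0))    # 250,000 req/s -- the case in the story
print(polling_rps(users, 5.0))    # 100,000 req/s -- same users, longer interval
print(polling_rps(users, 10.0))   #  50,000 req/s
```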
It's just like cloud resources.
They're infinite and free.
It's a CDN.
Yeah, I think, Brian, that's actually a good point.
While maybe these cloud services are scaling and they can handle the load, it's not free.
Because eventually somebody has to pay for it.
That's why you always have to factor that in as well.
One other question.
This one hasn't come up yet, I don't think.
But I'm just wondering how you might tackle it.
I haven't
seen evidence of this happening
yet, but one new scenario
I'm expecting to see in terms
of load and performance degradation
is with everybody working
from home right now.
A lot of people,
or quite a large amount of people
are connected to their company's VPN, right? And we all know corporate VPNs are not built for high bandwidth, because normally they have some remote workers and a lot of people are in the office and don't need the VPN. But even still, you might be going to a VPN... you know, at least here in the States, I might be going half the country away to a VPN, adding all that latency. Then I'm going through the corporate VPN, with any sort of network security policies being applied to what's going on. When you think about it, if this test were going to be run in today's, you know, quarantine world, where you have these considerations
of possible slower bandwidth and VPN,
what could you do to accommodate that?
Would it just be setting some bandwidth throttling
on the load generator,
or would it even be a consideration, I guess?
We'll start there, and what might be done.
So what you're actually saying is that you add an extra network route, so you add a VPN to the route to the application? Or are you talking about testing the VPN service?
Well, you couldn't necessarily test the VPN service, because it can come from anybody and anywhere, and we don't know the settings. But considering the impact on performance that might occur if people are using their VPN... like, let's say, and again, I'm making blanket statements here, and I think this is part of the things that would have to be determined and found out, but let's say the voting was during the day. Let's say it was a
World Cup, something going on with the World Cup, and there was a vote during the day. So everyone's working, and then, oh my gosh, they go ahead and vote while they're on their corporate VPN. Right, now the response times of the application are going to slow down, not necessarily because of the back end, although that could happen, but because they have a throttled network connection. Everything is going to slow down some more, which means, number one, performance is going to degrade for the end user. But that could also cause some either beneficial or detrimental bottlenecks, because if everyone's slowing down, things aren't getting through quite as fast. And I just wonder, and it might just come down to the plain simple theory of: does this VPN in general slow down your network bandwidth, and if so, is that another thing to consider when running these kinds of tests, if you're thinking about something in today's
world right now, where we're at?
Yeah, yeah. Well, I think basically you're talking about network emulation.
Yeah, I guess.
Yeah, when you test your application, you have to consider the network types of your users. For that, most of the tools, a lot of tools, have some kind of network emulation included, in which you can throttle the network bandwidth, or add latency, or even add some packet loss if possible.
And I would say, if a lot of users are on a more crippled network, that could eventually harm your application as well, in the sense that response times increase, and that also means that on the server end there is an increase too,
because the connections are open longer.
Right, right.
At one time, there is a lot more concurrency
in terms of concurrent connections.
Yeah.
And also you need to keep more data in the buffers
before they can be sent out.
Exactly.
The whole pipe.
It's open.
So you have more concurrent connections, the buffers,
the memory on all the network components.
They're getting heavier.
So, yeah, it makes sense to simulate that.
But it's not even for only VPN,
it's also for your mobile users.
Right, right.
Many mobile users are on slower networks, maybe even with packet loss involved, and that can have a significant impact.
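Load testing tools expose this as a scenario setting; as a standalone illustration of what those knobs do, here is a tiny latency-plus-bandwidth shaper wrapped around a blocking socket read. The latency and bandwidth figures are arbitrary example values, not measurements from this test.

```python
# Tiny illustration of network emulation: add fixed latency and cap effective
# throughput around a blocking socket read. Real tools do this as a setting;
# the latency and bandwidth values below are arbitrary examples.
import socket
import time

def shaped_recv(sock: socket.socket, nbytes: int,
                latency_s: float = 0.080,            # e.g. ~80 ms added delay on a VPN/mobile link
                bandwidth_bps: float = 1_000_000     # ~1 Mbit/s link
                ) -> bytes:
    time.sleep(latency_s)                            # propagation delay before data "arrives"
    data = sock.recv(nbytes)
    time.sleep(len(data) * 8 / bandwidth_bps)        # hold data back to match the capped link
    return data
```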
I haven't done this actually for my 2 million user test
because I found that a bit too heavy, to also include network emulation.
It would be cool, though, to do that.
But, yeah, the network traffic was really low.
So I didn't consider that as a real risk.
So I didn't get into the troubles of adding that.
But if you are testing a web application,
it is really worthwhile to just give your test the extra edge
and add network emulation.
I agree.
I think that's one of the toughest things. You mentioned before being realistic with your test. And I think that's one of the biggest challenges in designing a test: the more you think about being realistic, the larger the scope becomes, right?
Because you suddenly start thinking of, oh, this aspect... there are so many variables. And that's always just been a challenge, at least back when I used to do performance testing: it's always such a challenge to try to take that realistic approach while actually trying to get a test done. Because the more you think about it, the more the proverbial can of worms opens up, and you could sit there for two months setting up a test as you think more and more about it, and meanwhile the test needs to be executed, you know.
It's always the balance, right? Being as realistic as possible, trying to mimic the users as well as possible, because, yeah, that gives you the most honest, real results. If you take shortcuts, take them wisely, I would say. Every shortcut should be thought about and considered and evaluated. Yeah.
Hey, Joerek, I remember, and I don't remember the details, but I remember this topic, you know, started some discussions back in Santorini. Something about DNS, some lessons learned on DNS, I remember that. Can you remind me of that and fill us in?
Yeah, yeah. That's actually a bit of a technical part, but if you're doing a heavy performance test, you're probably testing an application that runs on the cloud or relies on cloud services like CDNs or other cloud platforms. And as far as I know, practically all of them rely on the DNS system to load balance the users. And if you do a performance test, it can be really hard to actually tackle this. So it's really important that you know how the platform you're testing is dealing with this DNS load balancing, because for sure it is dealing with that. And if you just don't do anything and take the test tool out of the box, whether it's NeoLoad, JMeter, LoadRunner, doesn't matter, the DNS simulation, I would say, is actually pretty bad.
So what will happen is your load generator will cache the DNS response, and most of your users, or all of the users, just keep connecting to the same IP address over and over again.
And you're testing just a small part of the application.
It's not realistic.
So, yeah, some things that I do for these heavy tests,
what I do is I just use a self-hosted local DNS server on the load generators.
So I've installed a local DNS server,
and then I override actually the TTL.
So usually the TTL for these cloud services is 60 seconds. That means that for 60 seconds, all users go to the same IP address. That is not realistic, because real users don't all do that. But by overriding the TTL, you actually force the load generator to constantly retrieve the new IP address list and connect to all of those. Because if the application or the cloud service routes you to a different IP address, because it scaled up or it just wants to avoid a heavy edge node or whatever,
you want to follow that behavior like real users would do.
Because every time a new user is opening the browser and going to the website, it's doing
a DNS request for that user.
So that's really tricky.
So that I understand this correctly,
you had your 800 agents, load-generating machines,
and then on all of those you installed a local DNS.
And this one had a TTL,
a time to live of one second versus 60.
That means every second you requested, you got the new DNS entries.
And this actually forced the load generator itself to actually refresh the DNS.
That means the testing tool needs to understand this as well.
So the testing tool actually then gets served the addresses from your local DNS server with a time to live of one second versus 60 seconds.
I get it.
Yeah.
Yeah.
And, of course, you have to tune your JVM that it's not caching
DNS responses, of course.
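Joerek's fix lives at the resolver level (a local DNS server on every generator serving a TTL of about one second). A rough client-side sketch of the effect is below: re-resolve the hostname for every virtual-user session and spread connections over whatever A records come back. The hostname is a placeholder, and `socket.getaddrinfo` still goes through the OS resolver, which is exactly why the override described here happens in the DNS server rather than in the script.

```python
# Client-side view of the idea: do a fresh DNS lookup per virtual-user session
# and rotate over all returned A records, instead of reusing one cached IP.
# In the setup described here this is enforced by a local DNS server on each
# load generator that serves a ~1 second TTL; the hostname is a placeholder.
import socket

HOST, PORT = "www.example.com", 443

def resolve_all(host: str, port: int) -> list:
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})    # unique IPs from this lookup

def ip_for_session(session_no: int) -> str:
    ips = resolve_all(HOST, PORT)                    # fresh lookup for every session
    return ips[session_no % len(ips)]                # spread users across the edge nodes

for session in range(5):
    print(session, ip_for_session(session))
```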
But there's another trick that you can do, actually.
Because usually your load generators are located
on maybe one or just a few ISPs locations.
So if you use AWS or the Google Cloud or Azure or whatever,
it's still not really geographically spread.
What you can also do is also ook wat ik in de test deed. Als je wilt bevorderen geografisch verspreide gebruikers.
Je kunt een DNS-server gebruiken op die machine.
Met een verschillende geografische locatie.
Bijvoorbeeld als ik een loodgenerator heb, die in Amsterdam is geplaatst. with different geographical locations. For example, if I have a load generator,
which is located in Amsterdam,
but actually I want to have it simulated from Berlin,
I can use a DNS server, which actually is in Berlin,
and then the cloud service would think
the users are located in Berlin
and respond with IP addresses of
edge nodes which are in Berlin.
So technically, I'm not the expert on DNS, but would you say that when you install your local DNS, you basically give it a location, a geolocation, you define the geolocation as being, let's say, Berlin, and therefore the cloud services believe you are from Berlin? Or are you talking about some DNS servers out there in Berlin
that you're using?
The last one.
The last one, okay.
So if you're using a self-hosted DNS server,
the CDN would think you're actually in the location where it actually is.
Most probably.
If you use a third-party DNS service, and there are lists on the internet of open DNS servers with their location, even with their city, and if you use that DNS server on the machine where your load generator is installed,
then that machine would do the DNS request
to the DNS server in a completely different location.
Then that DNS server will do the actual request to the CDN.
And then that CDN would think, ah, the user is using that DNS server,
so it's probably located there.
It would respond with an IP address list or an IP address that's close to the user,
so the user would have the lowest latency.
Yeah.
And by doing this, you can kind of trick the application into thinking that you're in a lot of locations. So yeah, that's a bit of a trick, actually, to spread load generators across geographical locations. But it's not really that important to have a load generator in that location, or that you want to test from that location; you want to hit the different edge nodes. You don't want the edge nodes you're hitting to be a bottleneck.
That's very cool.
Hey, Jurek,
I know that there's also a blog post that you wrote as one of the proceedings for Neotest Pack.
We want to make sure we link that.
And I assume in the blog post, you're going through some more details.
So we may not want to steal all of the thunder.
We want to make sure people are reading this as well.
Yeah, the blog post still has to be published, and in the blog post there's more explanation on how to trick the DNS, actually, and how to set it up in your performance test.
Yeah, that's amazing, an extra way of being more realistic. Yeah. I got one last question for you.
And because, as we both said earlier, we want to make sure that our better halves are not waiting too long for us for dinner, because it's late here in the evening.
But so what you've just told us,
these are very special
types of load tests.
You are testing for a big event, it's a one-time thing. It's not possible to test, let's say, by doing shadow traffic in production, because you don't have production traffic right now, things like that. So these are very specific, very special tests. Now, Brian and I, in the last couple of months and years, we've been talking a lot about shifting left, testing earlier, testing more frequently, smaller tests, integrating into a CI/CD pipeline. Is there anything we can learn from these tests to shift left in order to avoid these tests? Or are they just two completely different disciplines, where you say yes, we have to shift left to find things earlier, but you still need these large load tests in preparation for a big event?
Yeah, I really would love to say that, yeah, you can go left with this type of test, and that there is another approach to tackle this. But I would say you really need this kind of full acceptance test to be really sure that it's capable, because, as I said, there is the high load; you can never easily test that in your pipeline. Never. And it's not easy to do the CDN part, which can have troubles; it's probably not part of your pipeline, or you cannot just tackle that in a small-scale test. The concurrency problems you might have throughout the whole application are very complicated to test automated in your pipeline. So alongside these heavy tests, I would say, yeah, you can probably already test the small services that you're creating. So along the pipeline, along the releases, you can tackle differences. That's always good, but in the end, just before go-live, I would always say, yeah, you just have to do it to be sure.
Yeah, that makes sense. Yeah.
Yeah, I was thinking too, this isn't the case where you can slowly trickle traffic to the new feature, because this is almost an all-at-once thing, you know. So yeah, it's interesting.
Yeah, but actually, I still want to encourage everyone to test their application as it is, in the traditional way, as a full acceptance test. Because when everything is put together
and you test with the real load,
there are so many other problems you can encounter.
Right.
And I think even those tests, they become a decision point, right?
Like in your specific case, that 2 million user, right?
That's where we'll take a step back, right?
A lot of times people will say, we'll run performance tests against my services.
We'll run, we'll do a lot of shifting left and run for capacity and everything on the specific services so that we know if we test every single service,
it should all work together, right? Which we know from history is not always the case,
but I think a lot of the, let's call it maybe more agile focused performance testing is
testing the changes that are being introduced and then observing in production and then fixing,
right? This whole kind of feedback cycle. And so when you have a situation where it's normal traffic, a normal scenario,
you can take calculated risks, you know, understanding everything that's going on
and everything that you're willing to risk to push something to production
without having a full end-to-end test.
Because maybe it's your steady standard load.
Maybe you're going to do a blue green deployment or something where,
or a canary release where you trickle in your users,
you could do slow observations.
But,
but again,
in considering what that traffic is going to be and what the real life
scenario is going to be in,
in the situation you have specifically,
I don't think there's any way you can argue that,
oh yeah, we're just going to test everything
before production and then assume
2 million users end-to-end are going to work.
So I think a lot of times that consideration
has to be made and in the performance world,
there is a lot of give and take
and decision points being made about
how frequently do we still run
the end-to-end tests
Right, they're still obviously very much important, but you know, we're not in the waterfall days where you can hold up a build to do an end-to-end test every single time. So it becomes part of that decision point. But I think, yeah, in that scenario that you're talking about, it's obvious, like, you have to.
Yeah, yeah, there are still so many applications that are, yeah, event-based, where it's just maybe a one-time occasion where they have a lot of load, a lot of users. It's not like a real ordinary day where you can learn and grow and tune, back and forth.
Yeah.
So that's the difference.
Yeah.
Awesome.
Yeah.
Joerek, in the beginning, before we started recording, you said, I don't even know if we can talk for an hour. Well, here we are an hour later, right? We're still here. And, Joerek, I would love to have you back, because I know you're constantly out there doing projects, and you learn a lot on the job that you want to share. So hopefully we can get you back in a couple of months for another episode of extreme load testing.
Yeah.
Or totally completely different topic.
Maybe.
Yeah.
Or maybe something performance-related, though.
Yeah.
Yeah.
Of course.
Yeah.
Sounds good.
Well,
thanks for having me.
Thank you for taking the time.
Give me the opportunity to talk about this topic.
Yeah.
It's a pleasure to have you.
And do you do any social media that you would like to put out there for people to follow you, whether it's LinkedIn or Twitter or anything?
Yeah, you can follow
me on LinkedIn.
It's probably my
best social media channel I use.
Okay.
We'll put the link.
Connect with me.
Yep.
And if anybody has any ideas, topics, or anything else,
or if they want to come tell us some stories as well,
you can reach out to us at pure underscore DT on Twitter,
or you can send us an email, pureperformance at dynatrace.com.
And thank you, Joerek, for being on the show.
This was great fun today.
Great.
Well.
Thanks, everyone.
Go back to your quarantine.
Yeah.
Good luck with that.
Yeah.
Good luck, everybody.
All right.
Thank you.
Bye-bye.