PurePerformance - Why Performance Engineering in 2020 is still failing with James Pulley
Episode Date: September 7, 2020
Why do some organizations still see performance testing as a waste of time? Why are we not demanding the same level of performance criteria for SaaS-based solutions as we do for in-house hosted services? Why are many organizations just validating performance to be "within specification" vs "holistically optimized"? In this episode we have invited James Pulley (@perfpulley), Performance Veteran and PerfBytes News of the Damned host, to discuss how organizations can level up from performance testing to true performance engineering. He also shares his approaches to analyzing performance issues and gives everyone advice on what to do to start a performance practice in your organization.
https://www.linkedin.com/in/jameslpulley3/
https://www.perfbytes.com/p/news-of-damned.html
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
It's great to be back. We've had a little bit of a vacation there, shall we say.
But we're back and by we, I mean of course myself, Brian Wilson, and my lovely co-host Andy Grabner.
Andy, how are you doing today? It's been a while.
I know it's been a while. Yeah, it's been good. It's one week exactly since I came back from vacation. Fortunate enough in these days to actually take a vacation and travel a little
bit. Even with traveling, you know, it means staying within the same country and just touring around. But yeah, no, it's all good. And I'm also happy to be back.
Yeah, what else would I say?
Did I learn anything about performance when I was on vacation?
Maybe.
But I think we'll keep this for the episode itself.
Maybe it comes up.
How about you?
Did your car have any performance issues?
I would say just, well, not performance issues, but let's say some design issues.
So my car lost, and I'm not sure if it lost it because my car is 14 years old.
And after 14 years, certain things just get loose.
And the logo of my car in the front on the hood, from one day to the other was no longer there.
And it's a Beemer.
So I'm not sure if it is a souvenir for somebody
that walked by and just ripped it off
or it really just came off because after 14 years,
the glue was just no good enough anymore
and just fell off.
I don't know.
That's the only thing that changed.
If you can't have the logo, your Beemer logo, on your car, it's completely worthless, so you might as well just trade it in.
Um, we thought about putting something else on there, maybe like a unicorn, or, I don't know, put the Dynatrace logo on it.
We could do that too. Yeah, just don't get caught speeding or running anybody over with it.
Uh, yeah, no, I've just been, um, my wife took my daughter on like a desert trip, so I was home, did a bunch of recording, and, um, just been trying patiently, my patience is running out though, but patiently waiting for people to, uh, help bring the curve down. But, you know, freedom. Um, anyway, we're doing what we call a cross-pollinator show, right? We need to keep, uh, apples and fruits growing, so we are cross-pollinating with more friends of the show, and I think this is the first time we've had anybody else from the PerfBytes family on the show
aside from me but but that's,
well,
that's a,
that's a more complicated issue these days,
but yeah,
this is,
um,
it's not Mark Tomlinson today for the first time we're introducing a friend
from PerfBytes and it's not Mark.
Who would that be?
That would be somebody named Jamethin.
Is that,
I think that's the correct long form.
Jamethin Pulley.
Is that correct,
James?
Hi, y'all. Yeah, this is James Pulley from the PerfBytes Southern Command in South Carolina, joining you today.
Jamethin can be right, right?
Is that?
No.
How do you pronounce it?
How do you pronounce your name? That may be the formal English Middle Ages long form of my name, but I just go by James.
And I'm just curious, James, why do you go by the plural version? Why not just go by Jame?
Because we really don't want to clone me. I think that would be really bad.
All right.
I'm foolery.
So, James, you're on News of the Damned, right?
Yes.
You've been doing that for how long?
Many, many years now, right?
Well, we started as a segment on the Core Perf Byte show as a way to bring in examples from every day,
particularly to address those issues of product managers who want to
discount performance and say, what could possibly go wrong? And so we found every day examples of
what could possibly go wrong with the added benefit that we could take a look at what was
going on with these items in production and do a little bit of diagnostics
and provide some insight that people could apply to their own work in progress and hopefully
get some extended value out of their performance testing and moving it more into a performance
engineering context that everybody wants to move towards these days.
Hey, James, just because you brought it up, and I don't want to go into a religious discussion now,
but you said performance testing, performance engineering. What's your definition of the
difference between these two practices? Just throw it out there. So performance engineering is an
all-encompassing item that really deals with system efficiency, response time, use of resources,
and it really flows through the entire lifecycle of an application. It begins with how you write
a contract for a SaaS application, how you build your requirements for an in-house app.
It moves to development and how you decompose the application into performance budgets and you ask questions at every tier to see if you are within budget from a resource and a response time perspective. It moves into traditional quality assurance where
we ask for one user, is it performant? Because if it's not performant for one user, asking that
multi-user question, which is the traditional performance testing role, is not going to give
you a better answer. We're not going to get more scalable as we add users in that case. And then it moves on to platform engineering, production items, it moves into production
monitoring.
Then we have the long-term view, which is capacity planning.
And then we have as well simulation and modeling, which has a whole bunch of different forms
from running mathematics models to building out high performance stubs for third party interfaces.
Or if you are in a safe model or an agile model and you have multiple sprints running parallel,
you might have to build a stub first so your colleagues can build their software, which interfaces with your software, which is not quite done yet.
So performance engineering has a lot of encompassing items and disciplines.
But at the root of them all, there's this concern with how fast is it?
How scalable is it?
And the inversion of that how scalable is it, is how efficient is it?
How are we using resources for single users, multiple users?
When are we allocating them? When are we releasing them? How often do we hit them? How large of a
resource do we acquire? And that is often missed for people that are, say, socially promoted into a performance testing role or pushed into it for whatever
reason, in many cases, they're only answering the question of a symptom.
How fast is it or how slow is it as it degrades?
And they're not ever capturing evidence on that resource pool side to answer the question
of why.
There's a kind of a traditional progression when you talk about, say, performance tester to performance engineer to performance architect.
If you come up through the traditional performance testing path, performance testers can answer
a question on performance using a tool.
Performance engineers can often answer a question based upon data. And the data can come
from production. It can come from a test because all a test does is it generates timing records
and resource measurements. So those same analytical measurements can then be applied to production
using the same or different tools. And then when you get to a performance architect, now they're
looking at patterns related to performance at the design stage. They can look at how an application is
constructed, the types of data flows that are going back and forth and saying, hey, based upon
my experience, I foresee that you're going to have an issue here in the application scalability and
be able to point to a piece of the architecture and define this why.
So, Andy, does that address your question on the whole performance test, performance engineering, and religious issues aside?
Yeah, definitely.
And again, the reason why I mentioned let's not go into a religious discussion is because I know there's a lot of people out there that probably have their own definition.
And you can debate on just like the traditional Windows
versus Linux debate or something like that.
And correct me if I'm wrong in what I'm saying now; one thing I always try to do is take notes that are typically later used for my summary.
But for me, I think you said one very cool thing.
You said "how fast is it" would be something that a performance tester can tell you, versus a performance engineer would tell you how efficient the system is holistically and why it is behaving that way. Versus just, well, it is an observation of how fast is it, versus
performance engineering is really looking at the system holistically,
understands why certain things are that way,
has the knowledge and the experience to then not only do diagnostics
and help solve a particular scalability performance problem,
but then also take the experience and kind of become a mentor
for the next project, for the next iteration of the software, starting early in the lifecycle.
So I think that's what I hear.
You're correct.
And in fact, usually performance engineers are not isolated to the testing discipline.
You see them flowing into development.
You see them flowing into development. You see them flowing into operations.
There is the SRE, which is kind of the Google-ist definition of someone who is associated with performance engineering.
But in their context, it's got 50% of a foot rooted in development and the other 50% in production. And I find that those
definitions are not as clean cut once you get outside of Google. In fact, there was an SRE
survey recently that showed that while that is the ideal, what is actually happening in the wild is substantially different than that.
So in other words, James, you never gave this question any thought or consideration at all.
No, no. And while I love all of my colleagues who have a different answer for performance
engineering, I just want all of you to know, I love you, but you're wrong.
Now, James, I'm just looking at your LinkedIn
profile because I think it's always good to know where people come from, their history.
And you obviously have a long list of experiences on LinkedIn as being part of Mercury, like a senior consultant for Mercury Interactive. I also see that you've been a moderator for SQA Forums for a long, long time.
And I believe I've also been on these forums
when I used to work for Segue Software back in the days.
So you obviously picked up a lot of cool things over the years,
a lot of experience by not only actively helping people as part of your job,
but I think moderating groups.
For me, I remember back in the days when I was on these discussion forums, it was amazing
to have this source of information and hearing what problems people have and then trying,
seeing how they solve it, but also, you know, helping a little bit. But then I think
the thing that I even liked more, the side benefit of the gratification of helping somebody with a problem is
also learning about all these other problems,
which in the end builds up your own experience.
So you can,
you can get better in your job and in your next role,
maybe that you have.
So it's, it's vast.
And I'm going to say,
you know,
you got a lot of experience.
What I would be interested in.
And now we are in 2020.
And you mentioned earlier the word SaaS,
performance engineering on SaaS,
which I think we never discussed that much about,
but can you give us a little insight on,
do people come to you or do you see projects
where a company picks a SaaS solution
for whatever system they want to SaaSify, how is performance engineering
happening there? Or how is the process of picking a SaaS solution and ensuring the performance is
done? I mean, can you have any insights there? Yeah. And I have to be very careful, not because
I want to pick on particular SaaS vendors, but so you don't have to hit the bleep button in this case.
SaaS and performance is a very difficult conversation.
And the reason why it is so darn difficult is because the SaaS vendors have made it so.
And I'm not going to pick on any individual SaaS vendor.
I'm going to say, if you have a SaaS application,
go and get a copy of your contract.
Don't take my word for it.
Just go get a copy of your contract and read through it.
And particularly get to the point where they have committed to certain items
as part of what they consider performance.
I have yet to read a SaaS contract where the vendor will commit to a response time.
They'll commit to an uptime.
All the SaaS contracts out there will commit to an uptime.
And in fact, many of them will tell you if you want to run a performance test, you have to coordinate with their in-house performance testing organization, which in some cases, your organization may be more mature than theirs.
I run into that a lot.
And so you're asking them questions that they can't answer or have never considered.
But in a multi-tenant environment, their concern is keeping the software up and alive.
And even if it takes 120 seconds to respond, which is the traditional timeout for the HTTP protocol, it's still responding.
That may be totally unreasonable for your business
and not support your needs,
but you're within their SLA.
Now, what the SaaS vendors seem to be counting on
is that there are edicts from management
that we're going to turn off applications in-house
and we're just going to move to a SaaS solution. So you have contract organizations that are being
pushed by their own management to get these SaaS solutions in place, get the contracts done,
and they're not negotiating on behalf of the organizations to manage risk. They are not insisting that an addendum be put in place,
which allows for the integration of real user monitoring within the application, as opposed
to, say, a browser add-in or something, at which point, when you call for support, they're going
to say, hey, we don't support the browser add-in and you're at your limitation of support, go away.
They don't support deep diagnostics, which is okay because, you know, unless there's a special vendor agreement in place,
that is one of the things that you are likely going to give up with a SaaS application because someone else is managing the infrastructure.
In some cases, but not all, they'll give you logs and they'll give you logs that might have meaningful data in it. I know some vendors will give you logs that have
zero time elements in them. So, you know, time taken, zero milliseconds.
That usually indicates an error, but they don't cite it as an error.
They cite it as zero milliseconds. Others will give you logs from, say, an application tier,
which is not multi-tenant and dedicated just to you. But you can't get logs from the front end,
so you're not able to see what maybe your front end experience is lacking the thick client
components. So there's a lot of difficulty in performance engineering on the SaaS front.
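To make that concrete, here is a minimal sketch of what scanning a vendor-supplied log for those zero-millisecond entries could look like, assuming a hypothetical CSV export with time_taken_ms and url columns rather than any particular vendor's format:

```python
# Minimal sketch: flag "zero milliseconds" entries in a SaaS vendor's log export,
# which, as discussed above, usually hide errors rather than genuinely instant responses.
# The file layout and column names are assumptions for illustration.
import csv
from collections import Counter

def suspicious_zero_timings(log_path: str) -> Counter:
    """Count zero-millisecond responses per URL; these deserve a closer look."""
    zero_hits = Counter()
    with open(log_path, newline="") as handle:
        for row in csv.DictReader(handle):
            try:
                elapsed_ms = float(row["time_taken_ms"])  # assumed column name
            except (KeyError, ValueError):
                continue  # missing or malformed timing field
            if elapsed_ms == 0:
                zero_hits[row.get("url", "<unknown>")] += 1
    return zero_hits

if __name__ == "__main__":
    for url, count in suspicious_zero_timings("vendor_access.csv").most_common(10):
        print(f"{count:6d} zero-ms responses  {url}")
```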
The key on SaaS is this contract piece, and it's got to be fixed. There's got to be some
organization out there that is willing to push the envelope and say, we will not
buy your solution unless you agree to our terms because we're managing our risk. If we were to
build an equivalent application or deploy it in-house, we would require these response times.
And if we're going to go to software as a service, we should insist that we
have the same level of response from your software, or at least, or at least, or at least
matching some vendor standard, like Google RAIL, you know, five seconds to the end user, a hundred milliseconds for a REST or SOAP call. But give me something other than just ignore
it altogether. And I'm going to shut up now. Otherwise, I'm just going to rant for another 45
minutes. Yes, I had a couple of thoughts. I mean, three things I wanted to bring up on the last
point. While I agree with you, I think it gets very difficult because at least my experience
in dealing with some SaaS components is that oftentimes the SaaS solution can be customized, the customer can be changing the images and all other kinds of components, so then how does the SaaS vendor ensure that what the end user, their customer, did to their setup meets an SLA?
They can control it if you use the plain vanilla and maybe on that side they can do it.
But then it just even becomes difficult because then you have to do a proving game, a finger pointing game.
You did this, you did that, blah, blah, blah.
So again, we don't need to go into that, but that's just, I think,
one of the complications. And that's
entirely fair. What
stops a customer from putting a
45 megabyte GIF on the front
page like J.Crew did
last Black Friday?
It's not even images.
Sometimes they can modify the way the codex is.
That's a really good way to distort response time right there.
Right, right.
But I mean, this could potentially be solved if, I mean, I guess it always depends on the SaaS company if they provide this, say some kind of sandbox version where the customer of a SaaS solution can try new things out and then they have to run
through a set of tests to make sure that they are following best practices of that SaaS platform.
And then they kind of charge services for that. And I mean, yeah, it can be done. I'm just saying
there's a conflict. There's a, there's an extreme level of complication, which suddenly removes the
benefit of using SaaS. But as soon as you, as soon as you add multiple people using the same software and multi-tenancy,
it becomes horribly complicated because if I run a test and I take out someone else,
then I've now impacted that vendor service level agreement with that other organization,
and there could be financial implications involved.
So I'm not saying that it's easy to fix.
No.
But what I'm saying is that people should at least try.
They should negotiate and come to an agreement.
And maybe that agreement is we can only hold you accountable for soap and rest calls because there's very little customization on that.
And we can monitor that.
And here's the way we monitor it, and things of that nature.
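As a rough illustration of that "monitor the API calls you negotiated" idea, a minimal synthetic check might look like the sketch below; the endpoints and the 100 ms ceiling are placeholders standing in for whatever a real contract addendum would specify:

```python
# Minimal sketch of monitoring negotiated REST/SOAP response times against a contract
# threshold. The endpoint URLs and the 100 ms ceiling are illustrative assumptions.
import time
import urllib.request

CONTRACT_THRESHOLD_MS = 100  # assumed negotiated response-time ceiling per API call
ENDPOINTS = [
    "https://saas.example.com/api/v1/orders",     # hypothetical endpoints
    "https://saas.example.com/api/v1/customers",
]

def check_endpoint(url: str) -> tuple[float, bool]:
    """Time one call end to end and report whether it stayed within the threshold."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()  # include transfer time, not just time to first byte
    elapsed_ms = (time.perf_counter() - start) * 1000
    return elapsed_ms, elapsed_ms <= CONTRACT_THRESHOLD_MS

if __name__ == "__main__":
    for url in ENDPOINTS:
        elapsed, ok = check_endpoint(url)
        print(f"{'PASS' if ok else 'BREACH'}  {elapsed:7.1f} ms  {url}")
```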
That's a good point.
The other two things I wanted to bring were just two examples
of what you were talking about in terms of what the vendors supply.
So, again, I'm not going to mention any vendor names
since we don't get legal involved, but I'm familiar with two SaaS vendors,
and one of them not only uses Dynatrace in their entire SaaS setup,
but they give access to their
customers.
So the customers can see all this performance data.
I'm sure they wall parts of it off because they don't want to see multi-tenancy and all
that kind of stuff.
But the customer has access to a lot of this performance data.
So in that regard, I know one example where they are, it might not be in the contract.
I have no idea if any of that stuff's in the contract,
but it's all very transparent to the customer, which really is awesome.
On the flip side, there was one that I was dealing with
where a lot of their customers were trying to buy our product
to monitor the end user.
So they would just use it for the browser ROM piece of it
because they couldn't put agents on the back end or whatever.
So we had a lot of customers trying to use that. So finally we returned to them and said, hey, a lot of your customers are trying to use our product for this, do you have any interest in using it on the back end? And I remember in a meeting with them, their infrastructure team had an infrastructure tool, the development team had one APM tool for production, they used different tools in pre-production.
The web front-end team was using a whole separate set of tools. There was about five or six different
sets of tools, and it was just an absolute mess. And just going to your point of sometimes your
organization does things better. They were all open source and free, so why would they buy it?
No, they weren't all open source and free, but they all operated in complete silos.
The infrastructure team didn't give a damn
about the process and code performance, right?
And we're like, you know, there's a better way to do this.
And not even just from selling our product,
but you know, you should all be working together.
And they're like, yeah, that's how they like to work.
We're like, okay, we can't do much there.
But yeah, but you do see the gamut.
So not all SaaS is equal.
But yeah, to your point of your organization might have been doing it better, that could definitely be true.
Anyhow, that was just my little thought box on that.
I've gotten to the point where if I have a customer solution that has a PaaS option versus a SaaS option, I'm advising our customers to say, okay, think about what you want
from a performance perspective, and then look at the SaaS contract very, very critically.
If you can't get it with SaaS, if you put it on your own infrastructure, platform as a service,
either in the cloud or in-house or some sort of hybrid arrangement, at least you have greater visibility and you have a greater ability to hit your goals in a way that you manage versus someone else.
True. So, James, again, instead of doing my summary at the end across everything, I want to just kind of recap the key takeaways of this topic. When I asked about, you know, SaaS performance engineering, I think the key takeaway is that, as you said, most people don't look closely into the contract. So I think the change of thinking should be: at least look at the contract and start negotiating performance into the contract.
Because right now it's probably just a classical availability.
Is it up and running? And that's it.
But you should try to get as close as possible to the standard that you would enforce on the same product
if you would run it on-premise and it's delivered or developed by your own company.
So you should try to set
the same or similar standards to any SaaS product that you're going to acquire. Is this right?
Correct. And that holds for the project inception for any project, whether it's in-house
or in the cloud as a platform as a service or software as a service, you want to define very,
very clearly those items up front, which are critical measures for success for your own
organization and performance. And you want to make sure that that is clearly captured
and articulated to all downstream partners. So in the case of an in-house app, that would be to
architecture, development, platform engineering, quality assurance. In the case of a platform or
in the case of a software as a service application, certainly to your SaaS provider and likely your
in-house quality assurance and platform engineering that are going to have to hold that SaaS application in check,
especially for any customizations, as you know, Brian,
and monitoring for proactive actions in case performance is not up to par.
And, you know, for people listening who think this sounds familiar, these to me sound exactly like how we were defining SLOs and SLAs with the Google team earlier.
You know, the more true-to-original definition of an SLO, based on the SRE handbook and some of the Google people, was the idea of uptime, response time, what your deliverables to your end user are, right? And everything else is just, how do you make sure that's happening? And the SLA part comes in when you add it to the contract. So in some of our recent shows, James, we were talking about the difference between SLOs, SLIs, and SLAs, so I just wanted to kind of tie that in for your context.
Yep, you got it.
Yeah. So next topic. Um, James, I know in your News of the Damned over the years you have brought up a lot of examples and stories about, uh, you know, website launches that crashed, or, you know, the famous "let's hear some damned stories."
Actually, I know we can't hear a lot of stories here, but I would rather have people, uh, go over to the PerfBytes News of the Damned podcast and listen to that.
But here's a thought.
So, you know, a lot of patterns, you know, a lot of things that crash systems.
You just mentioned the 46 megabyte image that somebody uploaded. Yesterday, as the time of the recording yesterday was August 25th, I did one of my performance clinics on chaos engineering.
And the way we talked about chaos engineering there is, you know,
enforcing, running experiments to see how the system behaves.
But it was mainly focused on, you know, what happens if you have a CPU hog?
What happens if you have a network issue? What happens if pods crash all the time? What happens if there's a DNS issue? And then basically inflicting chaos, and you run an experiment and you have a hypothesis of what will happen with the system, and hopefully the system is able to handle it gracefully. But here's my thought, and maybe
you're doing this already. You must have a long catalog of things that people have done wrong that caused sites to crash, like allowing people to upload a 46 megabyte large file. Do you see people actually taking this type of catalog and then running through it with a new software release and saying, now let's actually figure out what happens if we do upload a 46 megabyte image, or if we're using the system on a very slow bandwidth? Do you see this as well? Have you seen this in your practice with your
customers? With my own customers, it's easy to control. So we can ask those questions earlier. We can look at those patterns in production and in test to try to mitigate against those things going all the way to prod. crew earlier with the 45 megabyte GIF on the front page on Black Friday.
That was a marketing event.
So someone in marketing put a page up that had that image on it.
There's almost nothing you can do to stop that other than to put some penalty in place against marketing if they do something like that.
And it's going to be a financial penalty for something of that nature.
But wouldn't that be, sorry to interrupt you here, wouldn't that problem be solvable if
the system that they use, because I assume they don't write HTML pages and upload the
images on the web server.
I'm sure they're using some type of content management system to correct kind of these
things into the platform that they're using. And therefore, when I don't know, they roll out a new
version of WordPress, they roll out a new version of the CMS to just add these, let's say,
vulnerability tests or chaos tests to the mix and then validate that this cannot happen.
It would be nice if you could do that.
There is something that does come to mind that showed up just within the last couple of days.
There is a YouTube personality
who is previewing a new version of WordPress,
WordPress 5.5.
And in that release,
they have heavily addressed some of the patterns
that made WordPress a low performer related to how they were constructing style sheets, including all sorts of old artifacts that were not required, how they were handling images, how they were handling cache, how they were handling compression, and taking a look at how that score appears inside of Google Lighthouse
and how that score appears on GTmetrix.
And it is substantially improved in WordPress 5.5.
So in that sense, we see some vendors trying to use that as a marketing differentiator
when there is an opportunity to do so.
But what are we seeing inside of most organizations?
I hate to say it, but performance testing organizations for the vast majority of organizations
have delivered such low value over the years.
Basically, the only thing they're turning in
is a report of response times.
And that's it, because that's all they know to turn in.
And in fact, some of the tools in the marketplace,
and most notably the open source tools,
JMeter, The Grinder, things of that nature,
are geared around just turning in response times.
That's it.
Leaving to other tools to answer other questions.
And in those organizations, performance is being, you know,
how should we say, left in place to deliver its low value as a checkbox
and not really being asked to consider these questions.
That consideration is being moved in many cases
to tool expertise, AI, pattern detection,
things of that nature, and more to operational tools
to find these issues,
either finding them in a quality assurance pre-deployment type of model
or finding them in production.
Now, what we don't see a whole lot of,
particularly whether it's in the prod side with AI,
pattern detection, or pre-production, is this issue of bot load
inside of your application, particularly if it's public facing. And the reason why we don't see
that is when you have security tools that are looking for patterns, they're looking for patterns
to try to exploit the system. They're not necessarily looking for patterns of load on the system,
but it's a legitimate request.
So this underlying bot load on a system for a public-facing site,
which could be anywhere from 30% to 80% of the load on a public site,
goes unnoticed. And that's a real opportunity in the market, either for expertise
from an individual or an organization perspective, or from a tool vendor perspective,
to be able to say, hey, we notice your top 200 users are all coming from offshore cloud data centers and they've never
bought anything, but they're using 50% of your site's resources. Would you like to block them?
Because I can tell you blocking those is a whole lot easier than re-engineering your software
to actually overcome the scalability of all the resources they're using.
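A rough sketch of that kind of analysis, run over an ordinary combined-format access log, might look like the following; the log layout, the "purchase" URL markers, and the idea of flagging the top non-buying clients are illustrative assumptions, not a product feature:

```python
# Rough sketch: find clients that generate a large share of requests but never hit a
# purchase URL, as a first pass at spotting the bot load described above.
from collections import defaultdict

PURCHASE_MARKERS = ("/checkout", "/cart/confirm")  # hypothetical conversion URLs

def heavy_non_buyers(log_path: str, top_n: int = 20) -> None:
    requests_by_ip = defaultdict(int)
    buyers = set()
    total = 0
    with open(log_path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) < 7:
                continue  # skip malformed lines
            ip, url = parts[0], parts[6]  # combined log format: IP first, URL seventh token
            requests_by_ip[ip] += 1
            total += 1
            if any(marker in url for marker in PURCHASE_MARKERS):
                buyers.add(ip)
    suspects = [(ip, n) for ip, n in requests_by_ip.items() if ip not in buyers]
    suspects.sort(key=lambda item: item[1], reverse=True)
    for ip, n in suspects[:top_n]:
        print(f"{ip:15s} {n:8d} requests  {100 * n / total:5.1f}% of load, zero purchases")
```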
And that, I guess, goes again into the discussion we had in the beginning where you said,
what's the value of performance testing?
They just run some load and give you response time versus performance engineering.
They have a holistic view, move into production,
understand everything that's going on,
understand things like this, like, hey, we have 80% bot traffic.
And I mean, some bots might be useful for an e-commerce site
that are crawling the data and then producing, you know,
but if you do that, if you're really then analyzing the traffic,
you can, I mean, I have examples.
We have examples.
I think we have some blogs on our blog site where we talk about the rise of the bots and how we can do performance optimization.
Exactly what you said, but by blocking traffic that is really completely unnecessary and not beneficial to the business.
There was even a cross-shopping bot several years ago that was taking out websites.
It was taking out e-com websites because the rate of interrogation on a website to find the price of an object was so aggressive, so high,
that it voided the do no harm,
and it was simply taking down websites,
that it was cross-shopping.
I had a...
Marketing would say, don't turn off that.
Don't turn off that.
We might get a sale.
But now none of your people can buy anything.
I had one of my Performance Clinics a couple of weeks ago with the CTO of Ant Clothing, and they had, so it's also an e-commerce store, and they did a hackathon internally. And one of the projects was to do sophisticated bot detection, because, uh, you know, not every bot is just hitting the same, just their home page and then, um, doing nothing with it. But they actually have, they have a lot of, uh, like special gifts, or you can sign up for a prize. And real users were always complaining that nobody ever wins the prize. And they were then kind of thinking maybe there are smart bots that have been written to actually find these special offers and these games where they can get free clothing.
And they actually then, in the hackathon, really tried to analyze real user traffic using Dynatrace data and then really detecting bots that were on...
By first looking at the RUM data, it was not apparent that it's really a bot.
But I thought it was an interesting, cool project for a hackathon that they did.
Yeah. In fact, and there's been a lot of public data on this, so I'm not disclosing anything.
Nike has severe problems with bots. In fact, there's a commercial aftermarket where you can
buy a Nike bot to put yourself in good position to be able to buy at 5 a.m. Pacific time on Saturday mornings
for whatever their newest shoe release is for the week. There is nothing that stops Nike or
their performance engineering group from surreptitiously buying the bot and monitoring the traffic on the bot
on one of these events and getting a pattern associated with the bot and then making sure
that when these events occur that such patterns are not allowed or they're only allowed say one
percent of the time, or one half of one percent. Because as most of us are reformed developers in this profession,
and as we all know, the thing that frustrates us the most is when something works sometimes.
You know, that's called corralling a bot. You don't block it 100%. So it gets through sometimes
and the rest of the time it gets the 503, which is an
expected event when it's under load. So if you can keep the bot manufacturer from evolving the bot
or moving the bot by giving them a set of responses that look legitimate, kind of the psychology of managing bots here, you can actually improve your chances
of getting them to a minimal level and keeping them there without disrupting your users.
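A minimal sketch of that corralling pattern, written as a tiny WSGI filter, could look like the following; the bot classifier is a deliberate placeholder, since real detection would rely on fingerprints, ASN lists, and behavioral patterns rather than a User-Agent string:

```python
# Minimal sketch of "corral, don't block": once a request is classified as bot traffic,
# let roughly 1% through so the responses still look plausible to the bot's author,
# and answer the rest with 503, a normal-looking response for a site under load.
import random

BOT_PASS_RATE = 0.01  # let ~1% of identified bot requests through

def looks_like_known_bot(environ) -> bool:
    # Placeholder classification for illustration only; a real implementation would not
    # rely on a simple User-Agent substring check.
    return "suspect-bot" in environ.get("HTTP_USER_AGENT", "").lower()

def application(environ, start_response):
    if looks_like_known_bot(environ) and random.random() > BOT_PASS_RATE:
        start_response("503 Service Unavailable", [("Retry-After", "120")])
        return [b"Service temporarily unavailable"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Normal response"]
```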
You know what would be an awesome way to treat that? Send the bot, I guess there wouldn't be a
transaction involved, right? So send the bot to the transaction page, but a special one just for the bots, where on the last page it says in small print somewhere, you are not actually purchasing these sneakers, but any money given will be donated to XYZ. Or how about this, Brian? Have one PC in the cluster, then once you identify that they're a bot, all their load is routed to that one server.
So if it gets slower and slower and slower for the bots, well, you know.
And besides these e-commerce examples, and I know, James, you've had several examples on your News of the Damned: the ticket sales, right, for events. I think that's the same thing, where bots are buying the tickets and then probably put them on the black market for a big concert that comes online.
Yeah, and bots for tickets are particularly troublesome because these usually involve large bot networks all over the planet to do these purchases.
And who knows if someone from South Africa is going to come to a concert for Beyonce in Atlanta?
You don't know.
So unless you have some marker in the data, such as a serialized ID which does not change, or a timestamp embedded in some sort of masked form which is too far off of the actual time right now, something that allows you to say, this is a bot, you have to treat it as legitimate load. And so you get a ticket site that goes on sale and then boom, you have hundreds of
thousands of people from all over the planet come in and attempt to purchase a ticket.
And in some cases, I hate to say it, this type of gross failure of a system is being taken as a mark of success by the artist.
Last year, the Jonas Brothers were regularly crashing their website for ticket sales,
and their marketing department was ecstatic about how this shows how popular the Jonas Brothers are. That also happened with a presidential candidate in the United States earlier in the election cycle.
This was for Pete Buttigieg, whose website kept going down all the time after an event related to taking in donations.
So similar type of item, but probably not bot related in that case.
But using these outages as a measure of success and a way to generate buzz from the event.
There's an intro that we use on News of the Damned, which comes from a step-cousin of mine, the rap artist
Dessa. She was doing an event with the Minneapolis Symphony Orchestra, and that was selling out.
Probably not a bot event. She just happens to be very popular. That's her hometown.
And she thanks everybody at the end for taking out the website.
And if I had any hair on my head, I would have reached up and pulled it out by its roots.
I'm like, come on.
If you're going to celebrate these outages as success,
it's going to be much more difficult to get money to fix them.
That's actually, I think, the biggest challenge, right?
Because you can bet you, coming back to what you said earlier,
a lot of organizations may not see the value of performance testing
because they just, you know, some organizations just either don't take it serious
or, you know, you just, you know, take the open source tools,
nothing against open source tools, but just run some load against it
and tell me the response time.
And then if the marketing team says, hey, we actually get more social media impressions for our "we went down" message than, I don't know, for everything else we did before, then, hey, it's actually good that performance is not good. It's good that it's not scaling, because we get more attention.
Yeah, then obviously it's counter... Especially, especially for the ticket vendors, right? Their tickets are selling regardless, and most of them have a stake in the resale market, so they're making double money on those anyway. So they're happy. They're a monopoly provider in that case, so it's not like the owners of the secondary market... And I blame Christine Todd Whitman for all the ticket stuff, because I remember, uh, I don't know if she started it, but when she became governor of New Jersey, one of the first things she did was legalize reselling tickets in New Jersey, because it used to be illegal. And that just opened the floodgates, and then a lot more states started following suit after that, and all these legal resale markets made it now impossible to get tickets for anybody.
And I really don't understand
what makes a baseball cap
worth $10,000 when
paired with a ticket. I really don't.
It's all in
what the person wants. I mean, you can make the same
argument about what makes an action figure worth money.
Desire for somebody
to have it and it's worth as much as it is for that person.
But to you, to me, I could care less about a baseball hat, sure, but, you know, or a pair of sneakers. You know, it's my interest though: would I pay extra money for an autographed copy of the butcher cover of the Beatles' Yesterday and Today? Probably. Well, if I had the spare money. But, you know, it all comes down to our likes. But yes, I got really specific there, didn't I? If anybody wants to donate a butcher cover copy to me, I'd be happy to. I'm picturing those two days that that cover was actually on sale at Sears before they pulled it.
Yeah, thinking... Yeah, it's kind of a narrow, uh, narrowly scoped item there, Brian.
Yeah, well, they're still out there. You can get them for like 500 or so bucks, but not autographed. I don't know if you can get an autograph. Anyhow, the one thing I want to point out before we go on, though, just because, like, all this stuff is really, really amazing, and what I need to point out for people is that when James goes through and looks at all this stuff, um, and finds all these patterns, and when they discuss the stuff that they do at News of the Damned, keep in mind the only things they have to go off of is any front-end tools that they can look at. So if something's going down on a site, they can pull up a browser and try to take a look at what's going on, and then anything released by the company after this happened. You know, James doesn't have, um, tools or anything inside the company, inside the data centers, to understand what's going on.
So all the analysis he's doing, he's doing in this miraculous way by looking, I guess, probably because the fact that you've been doing this for such a long time, James, right?
You can look at things on the spot and understand what's going on and infer and then later on have that confirmed, usually when the press release comes out about what happened. I just think that on its own is amazing. I wanted to give you a big kudos because
that's a whole different level of performance analysis when you're doing it from
500 feet away with limited insight.
Well, I hope that anyone who listens to the show can take that away and say, you know what, I don't necessarily need the perfect tool to do this job.
What I really need to do is be able to have a critical eye towards the use of resources under load and be able to get some measurement of time and resource,
even if it's through the front end, just individually.
And tools like GTmetrix are great for that.
Pingdom is great for that.
The Lighthouse tools that are built into Google Chrome are great for that.
But being able to look at that and say,
oh, I see this one image is, you know, way out of
scale. You know, it's an eight megabyte image. And is it really adding a whole lot? Could it
be just as valuable at 200K? Yeah, probably. And being able to say, okay, I would make these recommendations.
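For the outside-in image check described here, a minimal sketch might look like the following; the 200 KB figure is just the episode's example of a budget, and a real audit would lean on Lighthouse or GTmetrix rather than hand-rolled parsing:

```python
# Small sketch of an outside-in image audit: fetch a page, collect its <img> sources,
# and flag anything wildly over a per-image size budget. The budget is illustrative.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

IMAGE_BUDGET_BYTES = 200 * 1024  # per-image budget used for illustration

class ImageCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def audit_images(page_url: str) -> None:
    with urlopen(page_url, timeout=10) as resp:
        parser = ImageCollector()
        parser.feed(resp.read().decode("utf-8", errors="replace"))
    for src in parser.sources:
        image_url = urljoin(page_url, src)
        with urlopen(image_url, timeout=10) as img:
            size = len(img.read())
        if size > IMAGE_BUDGET_BYTES:
            print(f"OVER BUDGET ({size / 1024:.0f} KB): {image_url}")
```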
One area where I do have to say that performance testing has failed in many respects
is we're usually testing to see if it falls within spec. And within spec is usually a response time,
and it might include some resource measurements.
But it never includes, is it optimal?
So I might have a highly compliant application that still is resource dirty in how it's managing resources.
And I'm not going to know that until it actually gets to production.
And now it's a public-facing website for whatever reason.
And 40% of my resource pool is now taken up by bots.
And we hit that magical tipping point, say, during benefits sign up in October.
And it goes down. We should really be looking at what is the optimized
deployments in performance testing versus what is the compliant deployment.
I have to write this down. That's true. Hey, and I think coming just quickly back,
before you moved into this section,
coming back to the example,
what can you actually do from the outside, right?
Hey, this image looks off.
To add to the discussion,
performance testing versus engineering,
for me then the difference from these two disciplines
would be performance testing.
Yeah, maybe I'll look into, you know, occasionally,
what are these pages?
How can we optimize them or what's wrong?
And then report it back.
Performance engineering, the way I see it,
is to automate these checks into the pipeline
so that it doesn't take me to look at it all the time,
but it's done automatically every time you push a new build through the pipeline.
Because these things, especially front-end web performance optimization checks,
we've been talking about this, since I've been at Dynatrace,
we've been talking about it back then with the Ajax edition,
and we automated these checks into continuous delivery.
So it still strikes me, me actually that now in 2020, we still have pages that are 46 megabyte large
or 45 megabytes large,
because I would hope that by now,
these things are just automatically detected
because it's been integrated into the delivery pipeline
because people have stepped up
from just being performance testers
to truly enable an organization to do performance engineering.
So, Andy, I want to issue you a challenge.
Pick any five developers at random.
Not inside of Dynatrace, of course, because you're a very performance-aware organization.
But pick five developers at random.
Take them to lunch and just say, what defines poor performance?
What in your code defines poor performance?
This is a conversation I'm actually having with the university department head where I graduated from.
What defines poor performance?
It's something that's not taught.
We don't talk about what defines poor performance.
How you use resources defines poor performance.
When you allocate, when you collect, how large of a resource you allocate,
how often you hit it.
I think of a sub-query in that case of database performance.
And that resource lock,
I like to use a visualization
of four Tetris games running side by side,
you know, CPU, disk, memory, and network
with different block sizes in them.
And it could be that three of those games
are in a winning state.
That fourth one, whatever it is,
you know, pick the resource randomly,
locks up. Well, your system is in vapor lock or resource lock at that point. You can't go any
further. And that is really not emphasized. But going to your pipeline issue, if we start at project inception and we say that an application shall
respond in five seconds to DOM complete, that's pretty specific. And we could add some load
measurements onto that. So we have a measurement we can collect in RUM, DOM complete. And now we
have 18 architectural components that have to be touched in order to hit that five seconds, including the thick JavaScript front end.
If we have this at Project Inception, that means architecture has to go through for every 18 of those components and assign what is the maximum that can occur in order to hit that five seconds to DOM complete.
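A toy sketch of that decomposition step, with invented component names and numbers, might look like this; the point is simply that the per-component budgets have to add up to less than the five-second DOM-complete target:

```python
# Toy sketch of performance-budget decomposition: check that the per-component budgets
# architecture assigned actually fit inside the end-to-end DOM-complete target.
# The component names and numbers below are invented for illustration.
END_TO_END_BUDGET_MS = 5000  # DOM-complete target set at project inception

component_budgets_ms = {
    "edge/CDN": 150,
    "web tier": 400,
    "api gateway": 250,
    "order service": 600,
    "inventory service": 500,
    "database": 700,
    "javascript front end": 2000,
    # ... the remaining components of the 18 would be listed the same way
}

def validate_budgets(budgets: dict[str, int], total_ms: int) -> None:
    allocated = sum(budgets.values())
    if allocated > total_ms:
        raise ValueError(f"Budgets total {allocated} ms, exceeding the {total_ms} ms target")
    print(f"{allocated} ms allocated, {total_ms - allocated} ms of headroom remaining")

validate_budgets(component_budgets_ms, END_TO_END_BUDGET_MS)
```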
Once they've done that, holy cow, inside of the CI pipeline, we can now ask the question at unit
testing, are we compliant? We can ask at a component assembly, are we compliant? Ideally, we can do this in a passive way because, as we all
know, developers are not really good about collecting time. They like to collect, does it
work? Yay, it works, but not necessarily is it slow. So if we can collect that through logs,
if we can collect it through deep diagnostic solutions and actually provide that feedback as part of the pipeline,
make a SOAP or REST call out to the data store,
which has that log processing or that deep diagnostics
and get that measurement of time as part of a complex ask of,
does this pass and can it be promoted?
That has enormous power.
We can collect those performance defects earlier and often. And as we see from everything from ISO and ANSI and Gartner, the earlier we can collect a performance defect, the cheaper it is to fix because there are fewer architecture
issues that we have to unwind to get to the core of that issue. And then a shout out to people that
think this is a great idea and want to get started, we give you a head start by using Keptn.
Because Keptn provides that framework to actually pull in data from different data sources,
comparing it against your thresholds, and then giving you a very simple to interpret number between zero and 100, like a score.
So just the listeners that are interested in figuring out how you can integrate this
into the pipeline.
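As a bare-bones illustration of that kind of gate, not Keptn's actual API, a pipeline step might fetch a measurement, score it against the budget, and fail the build on a miss; the metrics endpoint and JSON field below are assumptions:

```python
# Bare-bones quality-gate sketch: pull a timing measurement from whatever data store
# holds it, map it to a 0-100 score against the budget, and exit non-zero to block
# promotion. The endpoint URL and the "dom_complete_ms" field are assumptions.
import json
import sys
from urllib.request import urlopen

BUDGET_MS = 5000          # agreed DOM-complete budget
WARN_FRACTION = 0.8       # start degrading the score at 80% of budget

def fetch_measurement(url: str) -> float:
    with urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return float(payload["dom_complete_ms"])  # assumed field name

def score(measured_ms: float) -> int:
    """100 when comfortably under budget, 0 at or over budget, linear in between."""
    if measured_ms >= BUDGET_MS:
        return 0
    if measured_ms <= BUDGET_MS * WARN_FRACTION:
        return 100
    over = measured_ms - BUDGET_MS * WARN_FRACTION
    return int(100 * (1 - over / (BUDGET_MS * (1 - WARN_FRACTION))))

if __name__ == "__main__":
    measured = fetch_measurement("https://metrics.example.com/api/latest-run")
    result = score(measured)
    print(f"DOM complete {measured:.0f} ms -> score {result}/100")
    sys.exit(0 if result >= 90 else 1)  # non-zero exit blocks promotion in the pipeline
```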
But you're right.
I mean, the challenge, though, and I think the initial challenge, is people need to sit down and say, okay, first of all, we do have a performance requirement, five seconds. And what does this mean? What's the performance budget that every individual component has along the chain? And I often hear that this might be also a blocker, because people just don't often have an end-to-end,
they don't feel the urge of an end-to-end responsibility, or they don't feel like,
or there might not be a performance engineering team or an SRE team or whatever team that is,
that is actually taking this action item of saying, we need to sit down, we need to define
our SLOs, and then we need to break the SLOs on the front
and down to the individual microservices
and then enforce quality gates with every build
that is pushed through by every service
so we get an early warning.
I'm not sure how you see this,
but I see a lot of organizations struggle.
I see this quite often where all of these components are being developed by decoupled teams, and they all say when response time is bad and something has crashed, well, our stuff works.
Nobody has taken responsibility for a uniform view of what that maximum amount of tolerance is. Now, something that really is helpful,
particularly on the deep diagnostic side, is the ability to profile across all of these tiers
and see what is the most expensive request or what are the non-compliant requests, either passively
or actively, and collect that data. So at least you know, from a SWAT effort, where to allocate your resources: what is the most offensive component that has been profiled, the one with the highest cost, and where to send your engineering team to do the diagnostics and work.
And it could be something as simple, and this comes from an actual live case, where the log level was set too high and it was writing too much to the disk.
And that, in fact, slowed an entire architectural tier down.
Yeah.
We've seen this too often, unfortunately, these things, yeah.
Or the classical, you know, whether it's logging or whether it's the famous,
infamous OR mapper that is killing the database with too many queries and things like that.
Right.
Yeah.
Or a missing index, or a non-partitioned index where someone has to start at A when most of the data is clustered in S and R.
And so they have to trace through the entire alphabet in order to get there.
Yeah.
Little, little things like that can make
substantial differences in performance,
particularly when you're shaving
100 milliseconds off of every query.
Maybe you have an embedded query that is called a thousand times, and it's like, boom, that's a lot, it adds up.
Very cool. James, typically, as we're getting close to the hour here, we wrap it up. But I think we wrapped up the individual sections. I think that this is something that I did differently this time, Brian, right? There's no Summaryator, but I just did a mini summary. I don't think it was... Yeah, the Summaryator bots.
They were jumping in.
Yeah.
But I mean, for me, again, I took a couple of notes that I will also put into the summary of the podcast.
Who?
James.
I know, Andy, you're going to interrupt.
I got to interrupt.
Instead of doing a waterfall summary, you did an agile summary. Look at that.
Iterated.
In sprints. There you go. Iterative. Yeah, yeah.
All right.
Yes.
Okay.
We'll get you to scaled agile in the next podcast, Andy.
And Andy, we're going to have to stand up for the next podcast.
Yeah, of course.
Well, I was actually standing.
I'm on my standing desk here, which actually means I'm at the bar in my kitchen.
And I'm typically standing.
And it's been five months now.
And, uh, from time to time, I wish I had a proper chair again.
Um, but, uh, James, I'm pretty sure, well, obviously our paths cross anyway, because
of the, um, the collaboration we've been doing over the years, uh, whether it's, um, you
know, doing, doing podcasts with each other on each other
show, or the, uh, you know, Perform, which obviously next year won't be in Vegas, but it will be virtual. But I'm pretty sure we'll figure out something there, how we can all contribute, uh, also to the podcasting scene. Um, is there any, uh, not final words, not famous last words, but any resources that we can... Before you die, Peter.
What do people need to know?
If people just get started with performance engineering,
what are the best places to get started?
Any resources on the web?
Any people to follow besides you?
This is a tough question
because there are very few resources
on the web on performance engineering.
I will give a couple of rules of thumb here.
And this goes back to our PerfBytes,
you know, give five takeaways.
One, find a way to ask a performance question
every time you ask a functional question.
It doesn't have to be the same performance question.
It could be a measurement of time.
It could be a measurement of a resource.
It could be a measurement of throughput.
But find a way to ask that question as early as possible.
Two, wherever you can, collect performance data passively.
People are naturally resistant to change.
And as part of collecting data as early as possible, you're probably going to have to collect it from logs.
You may have to change your logging model. You may have to collect it using deep diagnostic tools. But unless someone
sees a personal benefit and they've experienced that benefit, their ability to resist change is
incredibly high. And for a developer who may be trying to hit a deadline, adding an extra quality gate that
they personally have to ask and control, well, you can see how this is not likely to go into effect.
When it comes to functional testing, no matter what type of application that you're doing, whether it's a thick client app, a thick JavaScript application, kind of a pure HTTP application, web services, whatever it is, if your application does not scale for one user, it's not going to scale for more than one user. So if you've got a 30-second response time on your web services call,
fix it for one user rather than sending it over to performance testing and waiting for them to
build the test, design, well, design the test, build the test. Hopefully they design first before
they build. Sometimes it goes in reverse there. We don't like that. But, you know, design, build, implement, and then finally get a measurement of test result.
That's going to take a long time.
And if you can answer that question with a single user, it's going to be a lot cheaper to fix.
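A minimal sketch of asking the performance question inside the functional test, with a placeholder endpoint and threshold, might look like this:

```python
# Minimal sketch of "ask a performance question every time you ask a functional
# question": a single-user functional test that also asserts a response-time ceiling,
# so a 30-second web service call fails here instead of waiting for a load test.
# The endpoint, expected body, and 2-second ceiling are placeholders.
import time
import urllib.request

SINGLE_USER_CEILING_S = 2.0  # assumed acceptable single-user response time

def test_order_lookup_is_correct_and_fast():
    start = time.perf_counter()
    with urllib.request.urlopen("https://app.example.com/api/orders/42", timeout=35) as resp:
        body = resp.read()
        elapsed = time.perf_counter() - start
        # the functional question
        assert resp.status == 200
        assert b'"order_id": 42' in body
    # the performance question, asked in the same test
    assert elapsed < SINGLE_USER_CEILING_S, f"single-user response took {elapsed:.1f}s"
```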
And the last one is don't feel siloed.
Notice I'm saying ask these questions earlier.
Ask these questions outside of a traditional multi-user performance test.
If you can ask this question and you can discover an issue, you're adding value earlier.
This is not, oh, my God, I'm outside of my box and I'm going to get in trouble.
Well, hopefully that's not the case.
But generally speaking, if you find a defect earlier that would have to wait until far
later where it's more difficult to fix, you become more valuable to the organization.
So just kind of keep those five things in mind.
And anybody can act in a performance engineering capacity.
You just simply have to ask, how fast is it?
And what is the resource that's taken up?
That's all.
Thank you.
That's all I can say about this.
Thank you so much.
Brian, anything else from your end?
Let me see.
There's a million things, but related to this podcast, uh, not particularly. The one thing I would add to, to James, and I think, James, those are amazing five points, um, 5.1 maybe, we'll go for some surround sound, would be, yeah, would be to talk to people in development,
talk to the product owners, talk to marketing, talk to operations,
establish relationships with these people because those are going to be
critical. You know, take them out to lunch or whatever, maybe not,
but just establish some sort of relationship because when you do start to need
answers, even if you're going to, you know, somewhat not directly get them, you're going to
get the answers from them. And the more they have a connection to you, the more they'll be able to,
they'll be willing to help you with some of these things. Like if you're talking to marketing and
then they're going to say, oh yeah, we're running a big campaign next month. Oh really? What sort
of traffic are you expecting to drive to that?
Now you know you have something
that might not have been on the radar
for performance that you need to know about.
So there's a lot of things that come in
from those soft skills.
That's a whole different set of things,
which I just wanted to introduce there.
But yeah, otherwise, James,
it's a pleasure to have you on. I can't believe it took us to, I believe this might be episode 114, uh, I can't believe it took us this long to have you on.
Wow. I don't know. I think the next episode we do together, we should all come together at DeBordieu Colony Club on the coast of South Carolina, Andy, where you got engaged, um, just a few hours from me, and, uh, we should have a nice, um, alcoholic fruity drink at the colony club.
But we got, we got, uh, we officially got engaged in Seattle, and I asked, yeah, underneath the Space Needle. But then we did a trip to South Carolina
and actually did
kind of celebrate the whole thing, that's true.
See?
Yeah, I
remember your picture from
I want you to reach inside James' pocket.
He has something for you.
It's the ring.
You made
James part of the
engagement ceremony.
Yeah.
And yeah, James, that's an awesome idea.
I think the only thing that stands between now and that event is we need to get rid of this strange COVID virus so we can travel again.
But once that is done, once that is over, we'll definitely be there.
Yeah, it's really awkward. You know, you go into the
airport and you have those people in the Middle Ages
plague masks out front, particularly if they're carrying
a scythe or something. It makes me really uncomfortable.
I imagine when this is all over, the terrible
end scenes from the Star Wars celebration scenes
where everyone's partying in the streets and bad music is playing.
That's kind of what I picture the whole end of this.
And George Lucas is going to be there filming it all for some horrible...
He's just going to make an entire celebration movie of all the worst bits of those pieces of Star Wars.
Anyway, I don't know how the heck I got into that. Um, and yeah, anything else? Otherwise, um, I think that's it. It's good to be back. Yeah, thank you for bearing with the repeats.
Um, James, we'd definitely love to have you on again now that we've proven that you're a responsible guest and don't fill the airwaves with curses. You know, that's something, that's something that your colleague Mark Tomlinson... I occasionally have to bleep
or cut out some foul language from him.
So kudos to you, James, for not being like Mark.
We don't want to be like Mark, right, James?
It's never a good thing.
You know, Mark is a spectacular individual.
Like me, we should never clone him.
Bringing it back to the clones.
All right.
Thanks again, everybody, for listening.
If you have any questions or topics you would like us to discuss
or maybe even want to come on and discuss,
you can reach out to us at pure underscore DT on Twitter
or you can email us
at pureperformance at dynatrace.com
Unlike PerfBytes, we do not
have a phone number. Sorry for that.
And
we'll talk to you all soon.
Thank you, everyone.
I'm questioning thanking everybody.
I just didn't realize it was the end of the
show, so yes. But let's keep
this going for another half hour.
Exactly.
Bye-bye.