PurePerformance - Why Performance Engineering in 2020 is still failing with James Pulley

Episode Date: September 7, 2020

Why do some organizations still see performance testing as a waste of time? Why are we not demanding the same level of performance criteria for SaaS-based solutions as we do for in-house hosted services? Why are many organizations just validating performance to be “within specification” vs “holistically optimized”? In this episode we have invited James Pulley (@perfpulley), Performance Veteran and PerfBytes News of the Damned host, to discuss how organizations can level up from performance testing to true performance engineering. He also shares his approaches to analyzing performance issues and gives everyone advice on what to do to start a performance practice in your organization.
https://www.linkedin.com/in/jameslpulley3/
https://www.perfbytes.com/p/news-of-damned.html

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. It's great to be back. We've had a little bit of a vacation there, shall we say. But we're back and by we, I mean of course myself, Brian Wilson, and my lovely co-host Andy Grabner. Andy, how are you doing today? It's been a while. I know it's been a while. Yeah, it's been good. It's one week exactly since I came back from vacation. Fortunate enough in these days to actually take a vacation and travel a little bit. Even with traveling, you know, it means staying within the same country and just touring around. But yeah, no, it's all good. And I'm also happy to be back.
Starting point is 00:01:07 Yeah, what else would I say? Did I learn anything about performance when I was on vacation? Maybe. But I think we'll keep this for the episode itself. Maybe it comes up. How about you? Did your car have any performance issues? I would say just, well, not performance issues, but let's say some design issues.
Starting point is 00:01:30 So my car lost, and I'm not sure if it lost it because my car is 14 years old. And after 14 years, certain things just get loose. And the logo of my car in the front on the hood, from one day to the other was no longer there. And it's a Beemer. So I'm not sure if it is a souvenir for somebody that walked by and just ripped it off or it really just came off because after 14 years, the glue was just no good enough anymore
Starting point is 00:01:59 and just fell off. I don't know. That's the only thing that changed. If you can't have the logo your beamer logo on your carts it's completely worthless so you might as well just trade it in um we thought about putting something else on there maybe like a unicorn or i don't know put the dynatrace logo on it we could do that too yeah just don't get caught speeding or running anybody over with it uh yeah no i've just been um my wife took my daughter on
Starting point is 00:02:26 like a desert trip, so I was home, did a bunch of recording, and, um, just been trying patiently, my patience is running out though, but patiently waiting for people to, uh, help bring the curve down. But, you know, freedom. Um, anyway, we're doing what we call a cross-pollinator show, right? We need to keep, uh, apples and fruits growing, so we are cross-pollinating with more friends of the show. And I think this is the first time we've had anybody else from the PerfBytes family on the show aside from me. But, but that's, well, that's a,
Starting point is 00:03:05 that's a more complicated issue these days, but yeah, this is, um, it's not Mark Tomlinson today for the first time we're introducing a friend from PerfBytes and it's not Mark. Who would that be? That would be somebody named Jamethin.
Starting point is 00:03:19 Is that, I think that's the correct long form. Jamethin Pulley. Is that correct, James? Hi y'all. Yeah, this isin Pulley. Is that correct, Jame? Hi, y'all. Yeah, this is James Pulley from the PerfBytes Southern Command in South Carolina, joining you today. Hi, y'all.
Starting point is 00:03:38 Jamethin can be right, right? Is that? No. How do you pronounce it? How do you pronounce your name? That may be the formal English Middle Ages long form of my name, but I just go by James. And I'm just curious, James, why do you go by the plural version? Why not just go by James? Because we really don't want to clone me. I think that would be really bad. All right.
Starting point is 00:04:06 I'm foolery. So, James, you're on News of the Damned, right? Yes. You've been doing that for how long? Many, many years now, right? Well, we started as a segment on the Core Perf Byte show as a way to bring in examples from every day, particularly to address those issues of product managers who want to discount performance and say, what could possibly go wrong? And so we found every day examples of
Starting point is 00:04:35 what could possibly go wrong with the added benefit that we could take a look at what was going on with these items in production and do a little bit of diagnostics and provide some insight that people could apply to their own work in progress and hopefully get some extended value out of their performance testing and moving it more into a performance engineering context that everybody wants to move towards these days. Hey, James, just because you brought it up, and I don't want to go into a religious discussion now, but you said performance testing, performance engineering. What's your definition of the difference between these two practices? Just throw it out there. So performance engineering is an
Starting point is 00:05:25 all-encompassing item that really deals with system efficiency, response time, use of resources, and it really flows through the entire lifecycle of an application. It begins with how you write a contract for a SaaS application, how you build your requirements for an in-house app. It moves to development and how you decompose the application into performance budgets and you ask questions at every tier to see if you are within budget from a resource and a response time perspective. It moves into traditional quality assurance where we ask for one user, is it performant? Because if it's not performant for one user, asking that multi-user question, which is the traditional performance testing role, is not going to give you a better answer. We're not going to get more scalable as we add users in that case. And then it moves on to platform engineering, production items, it moves into production monitoring.
Starting point is 00:06:30 Then we have the long-term view, which is capacity planning. And then we have as well simulation and modeling, which has a whole bunch of different forms from running mathematics models to building out high performance stubs for third party interfaces. Or if you are in a safe model or an agile model and you have multiple sprints running parallel, you might have to build a stub first so your colleagues can build their software, which interfaces with your software, which is not quite done yet. So performance engineering has a lot of encompassing items and disciplines. But at the root of them all, there's this concern with how fast is it? How scalable is it?
Starting point is 00:07:16 And the inversion of that how scalable is it, is how efficient is it? How are we using resources for single users, multiple users? When are we allocating them? When are we releasing them? How often do we hit them? How large of a resource do we acquire? And that is often missed for people that are, say, socially promoted into a performance testing role or pushed into it for whatever reason, in many cases, they're only answering the question of a symptom. How fast is it or how slow is it as it degrades? And they're not ever capturing evidence on that resource pool side to answer the question of why.
Starting point is 00:08:10 There's a kind of a traditional progression when you talk about, say, performance tester to performance engineer to performance architect. If you come up through the traditional performance testing path, performance testers can answer a question on performance using a tool. Performance engineers can often answer a question based upon data. And the data can come from production. It can come from a test because all a test does is it generates timing records and resource measurements. So those same analytical measurements can then be applied to production using the same or different tools. And then when you get to a performance architect, now they're looking at patterns related to performance at the design stage. They can look at how an application is
Starting point is 00:08:49 constructed, the types of data flows that are going back and forth and saying, hey, based upon my experience, I foresee that you're going to have an issue here in the application scalability and be able to point to a piece of the architecture and define this why. So, Andy, does that address your question on the whole performance test, performance engineering, and religious issues aside? Yeah, definitely. And again, the reason why I mentioned let's not go into a religious discussion is because I know there's a lot of people out there that probably have their own definition. And you can debate on just like the traditional Windows versus Linux debate or something like that.
Starting point is 00:09:33 And correct me if I'm wrong, what I'm saying now, one thing that I always try to take notes that are typically later used for my summary. But for me, I think you said one very cool thing. We said it is how fast would be something that a performance tester can tell you versus a performance engineer would tell you how efficient is the system holistically and why is it behaving that way versus just, well, it is an observation of how fast is it versus performance engineering is really looking at the system holistically,
Starting point is 00:10:07 understands why certain things are that way, has the knowledge and the experience to then not only do diagnostics and help solve a particular scalability performance problem, but then also take the experience and kind of become a mentor for the next project, for the next iteration of the software, starting early in the lifecycle. So I think that's what I hear. You're correct. And in fact, usually performance engineers are not isolated to the testing discipline.
Starting point is 00:10:39 You see them flowing into development. You see them flowing into development. You see them flowing into operations. There is the SRE, which is kind of the Google-ist definition of someone who is associated with performance engineering. But in their context, it's got 50% of a foot rooted in development and the other 50% in production. And I find that those definitions are not as clean cut once you get outside of Google. In fact, there was an SRE survey recently that showed that while that is the ideal, what is actually happening in the wild is substantially different than that. So in other words, James, you never gave this question any thought or consideration at all. No, no. And while I love all of my colleagues who have a different answer for performance
Starting point is 00:11:38 engineering, I just want all of you to know, I love you, but you're wrong. Now, James, I'm just looking at your LinkedIn profile because I think it's always good to know, I love you, but you're wrong. Now, James, I'm just looking at your LinkedIn profile because I think it's always good to know where people come from, their history. And you obviously have a long list of experiences on LinkedIn as being part of Merkur, like a senior consultant for Merkur Interactive. I also see that you've been a moderator for SQA forums for a long, long time. And I believe I've also been on these forums when I used to work for Segway Software back in the days. So you obviously picked up a lot of cool things over the years,
Starting point is 00:12:18 a lot of experience by not only actively helping people as part of your job, but I think moderating groups. For me, I remember back in the days when I was on these discussion forums, it was amazing to have this source of information and hearing what problems people have and then trying, seeing how they solve it, but also, you know, helping a little bit. But then I think the thing that I even liked more, the side benefit of the gratification of helping somebody with a problem is also learning about all these other problems, which in the end builds up your own experience.
Starting point is 00:12:50 So you can, you can get better in your job and in your next role, maybe that you have. So it's, it's fast. And I'm going to say, you know, you got a lot of experience.
Starting point is 00:13:01 What I would be interested in. And now we are in 2020. And you mentioned earlier the word SaaS, performance engineering on SaaS, which I think we never discussed that much about, but can you give us a little insight on, do people come to you or do you see projects where a company picks a SaaS solution
Starting point is 00:13:22 for whatever system they want to SaaSify, how is performance engineering happening there? Or how is the process of picking a SaaS solution and ensuring the performance is done? I mean, can you have any insights there? Yeah. And I have to be very careful, not because I want to pick on particular SaaS vendors, but so you don't have to hit the bleep button in this case. SaaS and performance is a very difficult conversation. And the reason why it is so darn difficult is because the SaaS vendors have made it so. And I'm not going to pick on any individual SaaS vendor. I'm going to say, if you have a SaaS application,
Starting point is 00:14:09 go and get a copy of your contract. Don't take my word for it. Just go get a copy of your contract and read through it. And particularly get to the point where they have committed to certain items as part of what they consider performance. I have yet to read a SaaS contract where the vendor will commit to a response time. They'll commit to an uptime. All the SaaS contracts out there will commit to an uptime.
Starting point is 00:14:44 And in fact, many of them will tell you if you want to run a performance test, you have to coordinate with their in-house performance testing organization, which in some cases, your organization may be more mature than theirs. I run into that a lot. And so you're asking them questions that they can't answer or have never considered. But in a multi-tenant environment, their concern is keeping the software up and alive. And even if it takes 120 seconds to respond, which is the traditional timeout for HTTP protocol. It's still responding. That may be totally unreasonable for your business and not support your needs, but you're within their SLA.
Starting point is 00:15:35 Now, what the SaaS vendors seem to be counting on is that there are edicts from management that we're going to turn off applications in-house and we're just going to move to a SaaS solution. So you have contract organizations that are being pushed by their own management to get these SaaS solutions in place, get the contracts done, and they're not negotiating on behalf of the organizations to manage risk. They are not insisting that an addendum be put in place, which allows for the integration of real user monitoring within the application, as opposed to, say, a browser add-in or something, at which point, when you call for support, they're going
Starting point is 00:16:20 to say, hey, we don't support the browser add and you're at your limitation of support, go away. They don't support deep diagnostics, which is okay because, you know, unless there's a special vendor agreement in place, that is one of the things that you are likely going to give up with a SaaS application because someone else is managing the infrastructure. In some cases, but not all, they'll give you logs and they'll give you logs that might have meaningful data in it. I know some vendors will give you logs that have zero time elements in them. So, you know, time taken, zero milliseconds. That usually indicates an error, but they don't cite it as an error. They cite it as zero milliseconds. Others will give you logs from, say, an application tier, which is not multi-tenant and dedicated just to you. But you can't get logs from the front end,
Starting point is 00:17:19 so you're not able to see what maybe your front end experience is lacking the thick client components. So there's a lot of difficulty in performance engineering on the SaaS front. The key on SaaS is this contract piece, and it's got to be fixed. There's got to be some organization out there that is willing to push the envelope and say, we will not buy your solution unless you agree to our terms because we're managing our risk. If we were to build an equivalent application or deploy it in-house, we would require these response times. And if we're going to go to software as a service, we should insist that we have the same level of response from your software, or at least, or at least, or at least
Starting point is 00:18:15 matching some vendor standard, like a Google rail, you know, five seconds to the end user, a hundred milliseconds for a rest or soap call. But give me something other than just ignore it altogether. And I'm going to shut up now. Otherwise, I'm just going to rant for another 45 minutes. Yes, I had a couple of thoughts. I mean, three things I wanted to bring up on the last point. While I agree with you, I think it gets very difficult because at least my experience in dealing with some SaaS components is that oftentimes the SaaS solution can be can be customizing the images, all other kind of components, then how does the SaaS vendor ensure what the user, end user did, their customer did to their setup meets an SLA? They can control it if you use the plain vanilla and maybe on that side they can do it. But then it just even becomes difficult because then you have to do a proving game, a finger pointing game.
Starting point is 00:19:24 You did this, you did that, blah, blah, blah. So again, we don't need to go into that, but that's just, I think, one of the complications. And that's entirely fair. What stops a customer from putting a 45 megabyte GIF on the front page like J.Crew did last Black Friday?
Starting point is 00:19:40 It's not even images. Sometimes they can modify the way the codex is. That's a really good way to distort response time right there. Right, right. But I mean, this could potentially be solved if, I mean, I guess it always depends on the SaaS company if they provide this. version where the customer of a SaaS solution can try new things out and then they have to run through a set of tests to make sure that they are following best practices of that SaaS platform. And then they kind of charge services for that. And I mean, yeah, it can be done. I'm just saying there's a conflict. There's a, there's a extreme level of complication, which suddenly removes the
Starting point is 00:20:21 benefit of using SaaS. But as soon as you, as soon as you add multiple people using the same software and multi-tenancy, it becomes horribly complicated because if I run a test and I take out someone else, then I've now impacted that vendor service level agreement with that other organization, and there could be financial implications involved. So I'm not saying that it's easy to fix. No. But what I'm saying is that people should at least try. They should negotiate and come to an agreement.
Starting point is 00:20:51 And maybe that agreement is we can only hold you accountable for soap and rest calls because there's very little customization on that. And we can monitor that. And here's the way we monitor it, and things of that nature. That's a good point. The other two things I wanted to bring were just two examples of what you were talking about in terms of what the vendors supply. So, again, I'm not going to mention any vendor names since we don't get legal involved, but I'm familiar with two SaaS vendors,
Starting point is 00:21:19 and one of them not only uses Dynatrace in their entire SaaS setup, but they give access to their customers. So the customers can see all this performance data. I'm sure they wall parts of it off because they don't want to see multi-tenancy and all that kind of stuff. But the customer has access to a lot of this performance data. So in that regard, I know one example where they are, it might not be in the contract.
Starting point is 00:21:44 I have no idea if any of that stuff's in the contract, but it's all very transparent to the customer, which really is awesome. On the flip side, there was one that I was dealing with where a lot of their customers were trying to buy our product to monitor the end user. So they would just use it for the browser ROM piece of it because they couldn't put agents on the back end or whatever. So we had a lot of customers trying to use that. So so finally we returned to them and said hey a lot of your
Starting point is 00:22:07 customers are trying to use our product for this do you have any interest in using it on the back end and i remember in a meeting with them their infrastructure team had an infrastructure tool the development team for production had one apm tool for production they used different tools in pre-production. The web front-end team was using a whole separate set of tools. There was about five or six different sets of tools, and it was just an absolute mess. And just going to your point of sometimes your organization does things better. They were all open source and free, so why would they buy it? No, they weren't all open source and free, but they all operated in complete silos.
Starting point is 00:22:45 The infrastructure team didn't give a damn about the process and code performance, right? And we're like, you know, there's a better way to do this. And not even just from selling our product, but you know, you should all be working together. And they're like, yeah, that's how they like to work. We're like, okay, we can't do much there. But yeah, but you do see the gamut.
Starting point is 00:23:02 So not all SaaS is equal. But yeah, to your point of your organization might have been doing it better, that could definitely be true. Anyhow, that was just my little thought box on that. I've gotten to the point where if I have a customer solution that has a PaaS option versus a SaaS option, I'm advising our customers to say, okay, think about what you want from a performance perspective, and then look at the SaaS contract very, very critically. If you can't get it with SaaS, if you put it on your own infrastructure, platform as a service, either in the cloud or in-house or some sort of hybrid arrangement, at least you have greater visibility and you have a greater ability to hit your goals in a way that you manage versus someone else. True. so james again instead of doing my summary at the end across everything i want to just kind of the
Starting point is 00:24:07 key takeaways of this topic when i ask about you know sass performance engineering i think the key takeaway is that most what you said most people don't look closely into the contract so i think the change of thinking should be at least look at the contract and start negotiating performance into the contract. Because right now it's probably just a classical availability. Is it up and running? And that's it. But you should try to get as close as possible to the standard that you would enforce on the same product if you would run it on-premise and it's delivered or developed by your own company. So you should try to set
Starting point is 00:24:46 the same or similar standards to any SaaS product that you're going to acquire. Is this right? Correct. And that holds for the project inception for any project, whether it's in-house or in the cloud as a platform as a service or software as a service, you want to define very, very clearly those items up front, which are critical measures for success for your own organization and performance. And you want to make sure that that is clearly captured and articulated to all downstream partners. So in the case of an in-house app, that would be to architecture, development, platform engineering, quality assurance. In the case of a platform or in the case of a software as a service application, certainly to your SaaS provider and likely your
Starting point is 00:25:40 in-house quality assurance and platform engineering that are going to have to hold that SaaS application in check, especially for any customizations, as you know, Brian, and monitoring for proactive actions in case performance is not up to par. And, you know, for people listening who thinks this sounds familiar, these to me sound exactly how we were defining SLOs and SLAs with the Google team earlier. You know, the more true to original definition of an SLO based on the SRE handbook and some of the Google people was the idea of uptime response time, what the end user, what your deliverables to your end user are right and everything else is just how do you make sure that's happening and the sla come part comes in when you add it to the contract so just some of our recent shows we were james we were talking about the difference between slos slis and sla so just wanted to kind of tie that in with that for
Starting point is 00:26:41 for your context yep you got it yeah so next topic um james i know in your news of the damn over the years you have brought up a lot of examples and stories about uh you know website launches that crashed or you know the famous let's hear some damn stories actually i know we can't hear a lot of stories but i would rather have people uh go over to the podcast today the proof byfPyte News of the Dam podcast and listen to that. But here's a thought. So, you know, a lot of patterns, you know, a lot of things that crash systems. You just mentioned the 46 megabyte image that somebody uploaded. Yesterday, as the time of the recording yesterday was August 25th, I did one of my performance clinics on chaos engineering.
Starting point is 00:27:30 And the way we talked about chaos engineering there is, you know, enforcing, running experiments to see how the system behaves. But it was mainly focused on, you know, what happens if you have a CPU hog? What happens if you have a network issue hawk what happens if you have a network issue what happens if pods crash all the time what happens if there's a dns issue and then basically inflicting chaos and you run an experiment and you have a hypothesis what will happen with the system and hopefully the system is able to handle it gracefully but here's my thought and maybe you're doing this already you have a it must be a long catalog of
Starting point is 00:28:08 things that that people have done wrong that caused sites to crash like allowing people to uploading a 46 megabyte large file do you see people actually taking this type of catalog and then running through it with a new software release and say, now let's actually figure out what happens if we do upload a 46 megabyte image or if we're using the system on a very slow bandwidth. Do you see this as well? Have you seen this in your practice with your customers? With my own customers, it's easy to control. So we can ask those questions earlier. We can look at those patterns in production and in test to try to mitigate against those things going all the way to prod. crew earlier with the 45 megabyte GIF on the front page on Black Friday. That was a marketing event. So someone in marketing put a page up that had that image on it.
Starting point is 00:29:21 There's almost nothing you can do to stop that other than to put some penalty in place against marketing if they do something like that. And it's going to be a financial penalty for something of that nature. But wouldn't that be, sorry to interrupt you here, wouldn't that problem be solvable if the system that they use, because I assume they don't write HTML pages and upload the images on the web server. I'm sure they're using some type of content management system to correct kind of these things into the platform that they're using. And therefore, when I don't know, they roll out a new version of WordPress, they roll out a new version of the CMS to just add these, let's say,
Starting point is 00:29:56 vulnerability tests or chaos tests to the mix and then validate that this cannot happen. It would be nice if you could do that. There is something that does come to mind that showed up just within the last couple of days. There is a YouTube personality who is previewing a new version of WordPress, WordPress 5.5. And in that release, they have heavily addressed some of the patterns
Starting point is 00:30:24 that made WordPress a low performer related to how they were constructing style sheets, including all sorts of old artifacts that were not required, how they were handling images, how they were handling cash, how they were handling compression, and taking a look at how that score appears inside of Google Lighthouse and how that score appears on GTmetrix. And it is substantially improved in WordPress 5.5. So in that sense, we see some vendors trying to use that as a marketing differentiator when there is an opportunity to do so. But what are we seeing inside of most organizations? I hate to say it, but performance testing organizations for the vast majority of organizations have delivered such low value over the years.
Starting point is 00:31:24 Basically, the only thing they're turning in is a report of response times. And that's it, because that's all they know to turn in. And in fact, some of the tools in the marketplace, and most notably the open source tools, the JMeter, Grindr, things of that nature, are geared around just turning in response times. That's it.
Starting point is 00:31:46 Leaving to other tools to answer other questions. And in those organizations, performance is being, you know, how should we say, left in place to deliver its low value as a checkbox and not really being asked to consider these questions. That consideration is being moved in many cases to tool expertise, AI, pattern detection, things of that nature, and more to operational tools to find these issues,
Starting point is 00:32:25 either finding them in a quality assurance pre-deployment type of model or finding them in production. Now, what we don't see a whole lot of, particularly whether it's in the prod side with AI, pattern detection, or pre-production, is this issue of bot load inside of your application, particularly if it's public facing. And the reason why we don't see that is when you have security tools that are looking for patterns, they're looking for patterns to try to exploit the system. They're not necessarily looking for patterns of load on the system,
Starting point is 00:33:08 but it's a legitimate request. So this underlying bot load on a system for a public-facing site, which could be anywhere from 30% to 80% of the load on a public site, goes unnoticed. And that's a real opportunity in the market, either for expertise from an individual or an organization perspective, or from a tool vendor perspective, to be able to say, hey, we notice your top 200 users are all coming from offshore cloud data centers and they've never bought anything, but they're using 50% of your site's resources. Would you like to block them? Because I can tell you blocking those is a whole lot easier than re-engineering your software
Starting point is 00:34:01 to actually overcome the scalability of all the resources they're using. And that, I guess, goes again into the discussion we had in the beginning where you said, what's the value of performance testing? They just run some load and give you response time versus performance engineering. They have a holistic view, move into production, understand everything that's going on, understand things like this, like, hey, we have 80% bots traffic. And I mean, some bots might be useful for an e-commerce site
Starting point is 00:34:37 that are crawling the data and then producing, you know, but if you do that, if you're really then analyzing the traffic, you can, I mean, I have examples. We have examples. I think we have some blogs on our blog site where we talk about the rise of the bots and how we can do performance optimization. Exactly what you said, but by blocking traffic that is really completely unnecessary and not beneficial to the business. There was even a cross-shopping bot several years ago that was taking out websites. It was taking out e-com websites because the rate of interrogation on a website to find the price of an object was so aggressive, so high,
Starting point is 00:35:25 that it voided the do no harm, and it was simply taking down websites, that it was cross-shopping. I had a... Marketing would say, don't turn off that. Don't turn off that. We might get a sale. But now none of your people can buy anything.
Starting point is 00:35:43 I had one of my performance clinics a couple of weeks ago with the cto of ant clothing and they had so it's also e-commerce store and they did a hackathon internally and one of the projects was to do sophisticated bots detection because uh you know not every bot is just hitting the the same just their home page and then um doing nothing with it but they actually have they have a lot of uh like special gifts or you can you can sign up for a price and people were real users were always complaining that nobody ever wins the price and they were then kind of thinking maybe there are smart bots that have been written to actually find these these special offers and these games where they can get them free clothing.
Starting point is 00:36:28 And they actually then, in the hackathon, really tried to analyze real user traffic using Dynatrace data and then really detecting bots that were on... By first looking at the run data, it was not apparent that it's really a bot. But I thought it was an interesting, cool project for a hackathon that they did. Yeah. In fact, and there's been a lot of public data on this, so I'm not disclosing anything. Nike has severe problems with bots. In fact, there's a commercial aftermarket where you can buy a Nike bot to put yourself in good position to be able to buy at 5 a.m. Pacific time on Saturday mornings for whatever their newest shoe release is for the week. There is nothing that stops Nike or
Starting point is 00:37:19 their performance engineering group from surreptitiously buying the bot and monitoring the traffic on the bot on one of these events and getting a pattern associated with the bot and then making sure that when these events occur that such patterns are not allowed or they're only allowed say one percent of the time or one half of one% because as most of us are reformed developers in this profession, and as we all know, the thing that frustrates us the most is when something works sometimes. You know, that's called corralling a bot. You don't block it 100%. So it gets through sometimes and the rest of the time it gets the 503, which is an expected event when it's under load. So if you can keep the bot manufacturer from evolving the bot
Starting point is 00:38:14 or moving the bot by giving them a set of responses that look legitimate, kind of the psychology of managing bots here, you can actually improve your chances of getting them to a minimal level and keeping them there without disrupting your users. You know what would be an awesome way to treat that? Send the bot, I guess there wouldn't be a transaction involved, right? So send the bot to the transaction page but a special one just for the bots where on the last page saying small print somewhere you are not actually purchasing these sneakers but any money given will be donated to xyz or how about this b Brian? Have one PC in the cluster, then once you identify that they're a bot, all their load is routed to that one server. So if it gets slower and slower and slower for the bots, well, you know. And besides these e-commerce examples, and I know, James, you've had several examples on your News of the Dam day. The ticket sales, right, for events. I think that's the same thing where bots are buying the tickets and then probably put them on the black market for a big concert that comes online.
Starting point is 00:39:31 Yeah, and bots for tickets are particularly troublesome because these usually involve large bot networks all over the planet to do these purchases. And who knows if someone from South Africa is going to come to a concert for Beyonce in Atlanta? You don't know. So unless you have some marker in the data, such as a serialized ID, which does not change, a timestamp embedded in some sort of masked form, which is too far off of the actual time right now. Something that allows you to say, this is a bot. You have to treat it as legitimate load. And so you get a ticket site that goes on sale and then boom, you have hundreds of thousands of people from all over the planet come in and attempt to purchase a ticket. And in some cases, I hate to say it, this type of gross failure of a system is being taken as a mark of success by the artist.
Starting point is 00:40:49 Last year, the Jonas Brothers were regularly crashing their website for ticket sales, and their marketing department was ecstatic about how this shows how popular the Jonas Brothers are. That also happened with a presidential candidate in the United States earlier in the election cycle. This was for Pete Buttigieg, whose website kept going down all the time after an event related to taking in donations. So similar type of item, but probably not bot related in that case. But using these outages as a measure of success and a way to generate buzz from the event. There's an intro that we use on News of the Damned, which comes from a step-cousin of mine, the rap artist Dessa. She was doing an event with the Minneapolis Symphony Orchestra, and that was selling out. Probably not a bot event. She just happens to be very popular. That's her hometown.
Starting point is 00:42:00 And she thanks everybody at the end for taking out the website. And if I had any hair on my head, I would have reached up and pulled it out by its roots. I'm like, come on. If you're going to celebrate these outages as success, it's going to be much more difficult to get money to fix them. That's actually, I think, the biggest challenge, right? Because you can bet you, coming back to what you said earlier, a lot of organizations may not see the value of performance testing
Starting point is 00:42:30 because they just, you know, some organizations just either don't take it serious or, you know, you just, you know, take the open source tools, nothing against open source tools, but just run some load against it and tell me the response time. And then if the marketing team says hey we actually get more social media impressions for our we went down uh message then i don't know for everything else we did before then hey it's actually good that performance is not good it's good that it's not scaling because we get more attention yeah then obviously it's counter especially especially
Starting point is 00:43:02 for especially for the ticket vendors right there they're going to sell their tickets are selling regardless and most of them have stake in the resale market so they're making double money on those anyway so yeah so they're happy they're a monopoly provider in that case so it's not like the owners of the secondary market and buy and i blame christine todd whitman for all the ticket stuff because i remember uh i don't know if she started it but when she became governor of new jersey one of the first things she did was legalized reselling tickets in new jersey because it used to be illegal and that just opened the floodgates and then a lot more states started following suit after that and all these legal
Starting point is 00:43:42 resale markets made it now impossible to get tickets for anybody. And I really don't understand what makes a baseball cap worth $10,000 when paired with a ticket. I really don't. It's all in what the person wants. I mean, you can make the same argument about what makes an action figure money.
Starting point is 00:44:00 Desire for somebody to have it and it's worth as much as it is for that person. But to you, to me, I could care less about a baseball hat sure but you know or a pair of sneakers you know it's my interest though but would i pay extra money for an autographed copy of the butcher cover of beatles uh yesterday and today probably well if i had the spare money but you know all comes down to our likes but yes i i i got really specific there didn't i if anybody wants to donate a butcher cover copy to me i'd be happy to i'm picturing those two days that that cover was actually on sale at sears before they pulled it yeah thinking yeah it's a kind of a narrow uh narrowly scoped item there brian yeah
Starting point is 00:44:45 well they're still out there you can get them for like 500 or so bucks but not autographed i don't know if you can get an autograph anyhow the one thing i want to point out before we go on though just because like all this stuff is really really amazing and what i need to point out for people that when james goes through and looks at all this stuff um and finds all these patterns and when they discuss the stuff that they do at news of the damned keep in mind the only the only things they have to go off of is any front-end tools that they can look at so if something's going down on a on the site they can pull up a browser and try to take a look at what's going on and then anything released by the company after this happened you know james doesn't have um tools or anything inside the company
Starting point is 00:45:24 inside the data centers to understand what's going on. So all the analysis he's doing, he's doing in this miraculous way by looking, I guess, probably because the fact that you've been doing this for such a long time, James, right? You can look at things in spot and understand what's going on and infer and then later on have that confirmed, usually when the press release comes out about what happened. I just think that on its own is amazing. I wanted to give you a big kudos because that's a whole different level of performance analysis when you're doing it from 500 feet away with limited insight. Well, I hope that anyone who listens to the show can take that away and say, you know what, I don't necessarily need the perfect tool to do this job. What I really need to do is be able to have a critical eye towards the use of resources under load and be able to get some measurement of time and resource, even if it's through the front end, just individually.
Starting point is 00:46:28 And tools like GTmetrix are great for that. Pingdom is great for that. The Lighthouse tools that are built into Google Browser, Google Chrome is great for that. But being able to look at that and say, oh, I see this one image is way out of that. But being able to look at that and say, oh, I see this one image is, you know, way out of scale. You know, it's an eight megabyte image. And is it really adding a whole lot? Could it be just as valuable at 200K? Yeah, probably. And being able to say, okay, I would make these recommendations.
Starting point is 00:47:06 One area where I do have to say that performance testing has failed in many respects is we're usually testing to see if it falls within spec. And within spec is usually a response time, and it might include some resource measurements. But it never includes, is it optimal? So I might have a highly compliant application that still is resource dirty in how it's managing resources. And I'm not going to know that until it actually gets to production. And now it's a public-facing website for whatever reason. And 40% of my resource pool is now taken up by bots.
Starting point is 00:47:54 And we hit that magical tipping point, say, during benefits sign up in October. And it goes down. We should really be looking at what is the optimized deployments in performance testing versus what is the compliant deployment. I have to write this down. That's true. Hey, and I think coming just quickly back, before you moved into this section, coming back to the example, what can you actually do from the outside, right? Hey, this image looks off.
Starting point is 00:48:35 To add to the discussion, performance testing versus engineering, for me then the difference from these two disciplines would be performance testing. Yeah, maybe I'll look into, you know, occasionally, what are these pages? How can we optimize them or what's wrong? And then report it back.
Starting point is 00:48:51 Performance engineering, the way I see it, is to automate these checks into the pipeline so that it doesn't take me to look at it all the time, but it's done automatically every time you push a new build through the pipeline. Because these things, especially front-end web performance optimization checks, we've been talking about this, since I've been at Dynatrace, we've been talking about it back then with the Ajax edition, and we automated these checks into continuous delivery.
Starting point is 00:49:22 So it still strikes me, me actually that now in 2020, we still have pages that are 46 megabyte large or 45 megabytes large, because I would hope that by now, these things are just automatically detected because it's been integrated into the delivery pipeline because people have stepped up from just being performance testers to truly enable an organization to do performance engineering.
Starting point is 00:49:49 So, Andy, I want to issue you a challenge. Pick any five developers at random. Not inside of Dynatrace, of course, because you're a very performance-aware organization. But pick five developers at random. Take them to lunch and just say, what defines poor performance? What in your code defines poor performance? This is a conversation I'm actually having with the university department head where I graduated from. What defines poor performance?
Starting point is 00:50:20 It's something that's not taught. It's something that's not taught. It's don't talk about what defines poor performance. How you use resources defines poor performance. When you allocate, when you collect, how large of a resource you allocate, how often you hit it. I think of a sub-query in that case of database performance. And that resource lock, I like to use a visualization
Starting point is 00:51:09 of four Tetris games running side by side, you know, CPU, disk, memory, and network with different block sizes in them. And it could be that three of those games are in a winning state. That fourth one, whatever it is, you know, pick the resource randomly, locks up. Well, your system is in vapor lock or resource lock at that point. You can't go any
Starting point is 00:51:33 further. And that is really not emphasized. But if going to your pipeline issue, if we start at project inception and we say that an application shall respond in five seconds to DOM complete, that's pretty specific. And we could add some load measurements onto that. So we have a measurement we can collect in RUM, DOM complete. And now we have 18 architectural components that have to be touched in order to hit that five seconds, including the thick JavaScript front end. If we have this at Project Inception, that means architecture has to go through for every 18 of those components and assign what is the maximum that can occur in order to hit that five seconds to DOM complete. Once they've done that, holy cow, inside of the CI pipeline, we can now ask the question at unit testing, are we compliant? We can ask at a component assembly, are we compliant? Ideally, we can do this in a passive way because, as we all know, developers are not really good about collecting time. They like to collect, does it
Starting point is 00:52:52 work? Yay, it works, but not necessarily is it slow. So if we can collect that through logs, if we can collect it through deep diagnostic solutions and actually provide that feedback as part of the pipeline, make a SOPR REST call out to the data store, which has that log processing or that deep diagnostics and get that measurement of time as part of a complex ask of, does this pass and can it be promoted? That has enormous power. We can collect those performance defects earlier and often. And as we see from everything from ISO and ANSI and Gartner, the earlier we can collect a performance defect, the cheaper it is to fix because there are fewer architecture
Starting point is 00:53:46 issues that we have to unwind to get to the core of that issue. And then a shout out to people that think this is a great idea and want to get started, we give you a head start by using Keptn. Because Keptn provides that framework to actually pull in data from different data sources, comparing it against your thresholds, and then giving you a very simple to interpret number between zero and 100, like a score. So just the listeners that are interested in figuring out how you can integrate this into the pipeline. But you're right. I mean, the challenge, though, and I think the initial challenge is people need to sit down and say okay first of all we do have a performance requirement five seconds
Starting point is 00:54:30 and what does this mean what's the performance budget that every individual component has along the chain and i often hear that this might be also a blocker because people just don't often have an end-to-end, they don't feel the urge of an end-to-end responsibility, or they don't feel like, or there might not be a performance engineering team or an SRE team or whatever team that is, that is actually taking this action item of saying, we need to sit down, we need to define our SLOs, and then we need to break the SLOs on the front and down to the individual microservices and then enforce quality gates with every build
Starting point is 00:55:11 that is pushed through by every service so we get an early warning. I'm not sure how you see this, but I see a lot of organizations struggle. I see this quite often where all of these components are being developed by decoupled teams, and they all say when response time is bad and something has crashed, well, our stuff works. Nobody has taken responsibility for a uniform view of what is that maximum amount of tolerance state. Now, something that really is helpful, particularly on the deep diagnostic side, is the ability to profile across all of these tiers and see what is the most expensive request or what are the non-compliant requests, either passively
Starting point is 00:55:57 or actively, collect that data. So at least you know from a SWAT effort where to allocate your resources for what is the most offensive component that has been profiled, highest cost and know where to send your engineering team to do the diagnostics and work. And it could be something as simple, and this comes from an actual live case, where the log level was set too high and it was writing too much to the disk. And that, in fact, slowed an entire architectural tier down. Yeah. We've seen this too often, unfortunately, these things, yeah. Or the classical, you know, whether it's logging or whether it's the famous, infamous OR mapper that is killing the database with too many queries and things like that. Right.
Starting point is 00:57:07 Yeah. Or, or, or a missing index or a non partitioned index where someone has to start at a, where most of the data is clustered in S and R. And so they have to trace through the entire alphabet in order to get there. Yeah. Little, little things like that can make substantial differences in performance, particularly when you're shaving 100 milliseconds off of every query.
Starting point is 00:57:33 Maybe you have an embedded query that has caused something a thousand times, it's like, boom, that's a lot, it adds up. Very cool. James, typically, at of getting close to the hour here we we wrap it up but i think we wrapped up the individual sections i think that this is something that i did differently this time brian right there's no summer raider but i just did a mini summary i don't think it was yeah the, the Summaryator bots. They were jumping in.
Starting point is 00:58:07 Yeah. But I mean, for me, again, I took a couple of notes that I will also put into the summary of the podcast. Who? James. I know, Andy, you're going to interrupt. I got to interrupt. Instead of doing a waterfall summary, you did an agile summary. Look at that. Iterated.
Starting point is 00:58:24 In sprints. There you go. All right. Look at that. Iterative. Yeah, yeah. There you go. There you go. All right. Yes. Okay. We'll get you to scaled agile in the next podcast, Andy. And Andy, we're going to have to stand up for the next podcast.
Starting point is 00:58:36 Yeah, of course. Well, I was actually standing. I'm on my standing desk here, which actually means I'm at the bar in my kitchen. And I'm typically standing. And it's been five months now. And, uh, from time to time, I wish I had a proper chair again. Um, but, uh, James, I'm pretty sure, well, obviously our paths cross anyway, because of the, um, the collaboration we've been doing over the years, uh, whether it's, um, you
Starting point is 00:59:03 know, doing, doing podcasts with each other on each other show or the uh you know perform obviously next year won't be in vegas but it will be virtual but i'm pretty sure we'll figure out something there how we can all contribute uh also to the podcasting scene um is there any any uh not final words words, not famous last words, but any resources that we can... Before you die, Peter. What do people need to know? If people just get started with performance engineering, what are the best places to get started?
Starting point is 00:59:38 Any resources on the web? Any people to follow besides you? This is a tough question because there are very few resources on the web on performance engineering. I will give a couple of rules of thumb here. And this goes back to our PerfBytes, you know, give five takeaways.
Starting point is 00:59:58 One, find a way to ask a performance question every time you ask a functional question. It doesn't have to be the same performance question. It could be a measurement of time. It could be a measurement of a resource. It could be a measurement of throughput. But find a way to ask that question as early as possible. Two, wherever you can, collect performance data passively.
Starting point is 01:00:37 People are naturally resistant to change. And as part of collecting data as early as possible, you're probably going to have to collect it from logs. You may have to change your logging model. You may have to collect it using deep diagnostic tools. But unless someone sees a personal benefit and they've experienced that benefit, their ability to resist change is incredibly high. And for a developer who may be trying to hit a deadline, adding an extra quality gate that they personally have to ask and control, well, you can see how this is not likely to go into effect. When it comes to functional testing, no matter what type of application that you're doing, whether it's a thick client app, a thick JavaScript application, kind of a pure HTTP application, web services, whatever it is, if your application does not scale for one user, it's not going to scale for more than one user. So if you've got a 30-second response time on your web services call, fix it for one user rather than sending it over to performance testing and waiting for them to
Starting point is 01:01:54 build the test, design, well, design the test, build the test. Hopefully they design first before they build. Sometimes it goes in reverse there. We don't like that. But, you know, design, build, implement, and then finally get a measurement of test result. That's going to take a long time. And if you can answer that question with a single user, it's going to be a lot cheaper to fix. And the last one is don't feel siloed. Notice I'm saying ask these questions earlier. Ask these questions outside of a traditional multi-user performance test. If you can ask this question and you can discover an issue, you're adding value earlier.
Starting point is 01:02:39 This is not, oh, my God, I'm outside of my box and I'm going to get in trouble. Well, hopefully that's not the case. But generally speaking, if you find a defect earlier that would have to wait until far later where it's more difficult to fix, you become more valuable to the organization. So just kind of keep those five things in mind. And anybody can act in a performance engineering capacity. You just simply have to ask, how fast is it? And what is the resource that's taken up?
Starting point is 01:03:14 That's all. Thank you. That's all I can say about this. Thank you so much. Brian, anything else from your end? Let me see. There's a million things, but related to this podcast, uh, um, not particularly the one thing I would add to, to James, I think James, those are amazing five points. Um, uh, 5.1, maybe we'll go for some surround sound, um, would be, yeah, would be to talk to people in development,
Starting point is 01:03:50 talk to the product owners, talk to marketing, talk to operations, establish relationships with these people because those are going to be critical. You know, take them out to lunch or whatever, maybe not, but just establish some sort of relationship because when you do start to need answers, even if you're going to, you know, somewhat not directly get them, you're going to get the answers from them. And the more they have a connection to you, the more they'll be able to, they'll be willing to help you with some of these things. Like if you're talking to marketing and then they're going to say, oh yeah, we're running a big campaign next month. Oh really? What sort
Starting point is 01:04:24 of traffic are you expecting to drive to that? Now you know you have something that might not have been on the radar for performance that you need to know about. So there's a lot of things that come in from those soft skills. That's a whole different set of things, which I just wanted to introduce there.
Starting point is 01:04:40 But yeah, otherwise, James, it's a pleasure to have you on i can't believe it took us to i believe this might be episode 114 uh i can't believe it took us this long to have you on wow i don't know i i think the next episode we do together we should all come together at debonu colony club on the coast of south carolina andy where you got engaged um just a few hours from me and uh we should have a nice um alcoholic fruity drink and comedy club but we got we got uh we we officially got engaged in seattle and i asked yeah underneath the space needle but then we did a trip to South Carolina
Starting point is 01:05:25 and actually did kind of celebrate the whole thing, that's true. See? Yeah, I remember your picture from I want you to reach inside James' pocket. He has something for you. It's the ring.
Starting point is 01:05:41 You made James part of the engagement ceremony. Yeah. And yeah, James, that's an awesome idea. I think the only thing that stands between now and that event is we need to get rid of this strange COVID virus so we can travel again. But once that is done, once that is over, we'll definitely be there. Yeah, it's really awkward. You know, you go into the
Starting point is 01:06:07 airport and you have those people in the Middle Ages plague masks out front, particularly if they're carrying a scythe or something. It makes me really uncomfortable. I imagine when this is all over, the terrible end scenes from the Star Wars celebration scenes where everyone's partying in the streets and bad music is playing. That's kind of what I picture the whole end of this. And George Lucas is going to be there filming it all for some horrible...
Starting point is 01:06:37 He's just going to make an entire celebration movie of all the worst bits of those pieces of Star Wars. Anyway, I don't know how the heck i got into that um and yeah anything else otherwise um i think that's it it's good to be back yeah thank you for bearing with the repeats um james will have we'd definitely love to have you on again now that we've proven that you're a responsible guest and don't fill the airways with curses you know that's something, that's something that your colleague Mark Tomlinson, I occasionally have to bleep or cut out some foul language from him. So kudos to you, James, for not being like Mark.
Starting point is 01:07:16 We don't want to be like Mark, right, James? It's never a good thing. You know, Mark is a spectacular individual. Like me, we should never clone him. Bringing it back to the clones. All right. Thanks again, everybody, for listening. If you have any questions or topics you would like us to discuss
Starting point is 01:07:38 or maybe even want to come on and discuss, you can reach out to us at pure underscore DT on Twitter or you can email us at pureperformance at dynatrace.com Unlike PerfBytes, we do not have a phone number. Sorry for that. And we'll talk to you all soon.
Starting point is 01:07:56 Thank you, everyone. I'm questioning thanking everybody. I just didn't realize it was the end of the show, so yes. But let's keep this going for another half hour. Bye-bye. I just didn't realize it was the end of the show, so yes. But let's keep this going for another half hour. Exactly. Bye-bye.
