PurePerformance - 009 Proactive Performance Engineering in ASP.NET with Scott Stocker

Episode Date: August 1, 2016

Scott Stocker (@sestocker), Solution Architect at Perficient, tells us the background of a recent load testing engagement on an ASP.NET app running on Sitecore. Turns out that even these apps on the popular Microsoft platform suffer from the same architectural and implementation patterns as we see everywhere else. Bypassing the caching layer through FastQuery resulted in excessive SQL, which caused the system to not scale, but crumble. Scott tells us how they identified this issue and what his approach as an architect is to proactively identify the most common performance and scalability problems.

Transcript
Starting point is 00:00:00 It's time for Pure Performance. Get your stopwatches ready. It's time for Pure Performance with Andy Grabner and Brian Wilson. Welcome. Welcome, Pure Performance worldwide audience. I hope you're still with us. Yeah, the audio is still very interesting. It's still there. All right.
Starting point is 00:00:38 Hey, this is Andy Grabner. Today, I'm actually recording live from Linz, Austria. I actually made it back to my hometown for a short visit. And with me today is always my co-host, Brian Wilson. Brian, are you there? Yes, I am, Andy. How are you doing today? Not too bad. I was a little confused in the beginning about the background noise, and I wasn't really sure if I was actually talking loud enough to actually put my voice on top of our awesome intro music, obviously. So I'm not sure if my performance was good enough, actually, to be honest with you.
Starting point is 00:01:10 But hopefully it will be. Yeah, so the background noise is people drilling into the walls on a Monday morning. Well, I guess it's not morning. It's still morning for you, right? Technically, here it is afternoon. I mean, eight hours ahead of you. Oh, you're still overseas. Where, where are you now? I, I thought I said that. Don't you pay attention to what I have to say? I am. I usually tune you, I usually tune you out. No, I was just making an audio tweak to try to
Starting point is 00:01:37 see if I can get rid of that weird noise I was hearing, and I think I did. So I took the time while you were talking. So you're back in Linz. Are you, are you wearing your lederhosen? Maybe I do, but maybe I don't. Maybe not yet. Maybe later. Maybe on, uh, one of these occasions. I'm actually traveling to other parts of Europe today, actually. People asked me last week, I was traveling to Amsterdam, and they were asking about the lederhosen because they saw me previous years at Perform. And they were saying, you look different today. I actually was wearing a suit-like-ish thing. What?
Starting point is 00:02:10 And they were totally confused. I know. I apologized. And so anyway. I have a question before we move on, though. So you might have heard there's an old myth that newscasters don't wear pants underneath the desk because they can't. They don't have to because they're sitting right. I don't think it's true.
Starting point is 00:02:28 I've used to work in the news, and I've never seen that. But I guess with the whole idea of lederhosen being handed down from generation to generation, and maybe they're tight, maybe they're not, do you wear – or even if we go to the Scots – no pun intended, Scott. But if we go to the Scots where the whole idea of do you wear underwear under your kilt, do you wear underwear under lederhosen? Well, the thing is if I would tell you now if I do, it will no longer be a myth or a legend, right? Well, I don't know if I was making that up or not. So, okay, we can create that myth or legend whether or not you do. Yeah, and you're right. Lederhosen is supposed to be
Starting point is 00:03:05 basically passed on to the next generation. You know where I learned that? I assume at Perform. No, it was at Perform over at the STP Con when you were out in Denver a year or so ago. Yeah, see? There you go. Hey, I think we have an awesome guest today.
Starting point is 00:03:22 Yes, we do. Who is it? Scott Stocker from Perficient. I had the pleasure of working with him on an engagement several months ago now. So, hello, Scott. How are you doing? Doing well. Thanks for the intro, guys.
Starting point is 00:03:38 Well, it was barely an intro. It was just a Scott from Perficient. Anyway, Scott's a very, very, very bright mind working over Perficient, a.NET expert working. You're concentrating mostly on what, Sitecore over there, Perficient, but you're a.NET guru, you can say? Yeah, yeah, yeah. So I'm a solution architect for Perficient. My specialty right now is Sitecore, where I mostly serve as kind of an architect. But, yeah, I consider myself a.NET architect first, and Sitecore, of course, is on the.NET platform. And I've been doing Sitecore stuff for about eight years, so I've seen a lot of different implementations of Sitecore.
Starting point is 00:04:19 And we're going to dive into at least a little nugget of some best practices with Sitecore. But a lot of this is going to apply more broadly to a.NET audience or really to anybody. So I guess for those listeners who aren't familiar with Sitecore, don't tune out yet. I think a lot of the lessons that we're going to learn through this conversation are going to apply pretty broadly. Right. And just for people who might not be familiar with Sitecore, what kind of technology stack is that? Because they might not be using Sitecore, but they might be using something similar in that realm. And again, as Scott mentioned, even if it's not where he's going, a lot of the things we're going to be talking today are still best practices, whether you're talking about Java.net or anything. But yeah, so Sitecore, explain what that technology does there.
Starting point is 00:05:14 Sure. So I guess first what Sitecore is, right? So Sitecore is kind of an enterprise-grade web content management system. And it's built on the.NET stack using SQL Server and kind of a lot of the well-known, broadly used kind of.NET technology. So it's not some strange application. It has its own kind of caching layer and some things like that that we're going to go into. But very broadly, I mean, it is just a.NET application. Right, so mostly content delivery, kind of, if I wanted to boil it down, very basic.
Starting point is 00:05:50 Yeah, and just out of my curiosity, so is it software that I deploy on-premise that you offer or deploy typically SaaS? So I get a SaaS offering, or how does that work? Typically, it's on-prem or kind of in the cloud, but it's not really software as a service, right? So the Sitecore's licensing model is basically per server. You know, how many servers do you need for the amount of traffic, you know, that you're going to serve and things like that. So it's more of a traditional kind of approach to software licensing where we see, you know, we're licensing for a set of servers, essentially. And I would assume competition-wise, it might be something where people use Confluence or other CMS systems. And I am wrong with the main competition here.
Starting point is 00:06:42 You know, Sitecore's main competition is kind of Adobe's product, which I'm not, I'm not quite that familiar with it, to be honest with you. I think it's called like Q5 or something like that. Somebody can, someone in your audience can probably correct me. I'm not, I'm not on the Java stack and I don't really know much about that content management system, but, but that's, that seems to be the main competition. And of course there's a lot of other.net based content management systems out there as well. Cool. And, and just before we go any further, I know you write a lot about, um, awesome nuggets that you find, uh, in Sitecore or other kinds of problems in general. What,
Starting point is 00:07:20 would you like to share your Twitter handle? Yeah, my Twitter handle is S E stalker. That's spelled S E S T O C K E R. All right. So yes, he's very active, posts a lot of great articles and including I noticed the one that struck my struck my eye was the whole CLI interface to site core. Just knowing that that was there was kind of cool. Because that's sort of the heading heading back towards um the old-fashioned command line as a power tool which i always think is fun yeah absolutely absolutely well i think and this is also i mean i know it may not be related to the topic but as you brought it up brian i think it's a trend that we we keep seeing a lot now especially because it makes it easy to automate certain steps, right?
Starting point is 00:08:07 Everything you can do in the command line, you can easily automate with your automation tools, with the scripting tools. And by the way, because you mentioned the blog, if I do a little Googling or binging or whatever you want to use, and I go into Google search for proficient diagnostic side core, I actually find a blog post you did a couple of months ago on diagnosing side core performance problems. And I would assume that you probably also hear some of these stories today. Is that right? Yeah, that's correct.
Starting point is 00:08:40 That's actually what I kind of wanted to dive into today and kind of tell the story. And, you know, I had the opportunity to work with Brian, actually, on that particular engagement that he had mentioned earlier. That's what that blog post is actually based on. So I kind of wanted to go today into kind of some of the lessons we learned, some of the best practices, and, you know, some of the ways that going forward, you know, we could have done things better. You know, how could we have prevented the performance problems that we saw happen? How could we have stopped that from happening? Do you want to dive into the story or Brian?
Starting point is 00:09:21 I was going to say just to kind of give a brief um synopsis not a synopsis per se but we're going to kind of go in depth on what what happened there but to really make a small summary about it it's it's kind of the age-old problem of look at the settings and the technologies that you're using and understand what you're putting into your system. Uh, because I think a lot of the things that we found in this engagement was those were kind of ignored or not elected to be used. And it was, those were the issues then, right? So it's, you know, going back to back to you know the hibernate examples or anything else like you have to understand what you're putting into your architecture and you know read deeper and understand you know look at the manuals and look at the the guides
Starting point is 00:10:15 because they're there for a reason and um you know what we discovered on that was those were not looked at pretty much or they were ignored. So let me take a wild guess. And I have to be honest with the readership, with the listenership, I have not dug into the details of what you're running on or the frameworks are basically doing a great job, but not looking deeper and then not optimizing it for your particular use case, for your particular usage patterns and load patterns. Is this where we're getting to? Pretty much. Pretty much. Yep. Awesome. Well, I say awesome because the thing is, if people keep repeating the same mistakes, then we just need to educate more and more people about these mistakes. And then more and more people will make less of these mistakes.
Starting point is 00:11:14 And that's why Scott is great that we have him here because he can talk firsthand on what happened. So, Scott, do you want to get started? Yeah, yeah, absolutely. So we were engaged with a client starting, I think, quarter four of last year. So it's been a little while since we actually dug in here and actually solved their problem. But they had a requirement to serve roughly 1.3 million page views over the course of two hours. They're using a tool called JMeter to kind of simulate their traffic. So this particular website, there's a chance that it could get a very, very large spike
Starting point is 00:12:01 way outside of the norms of their kind of typical traffic patterns. Like an event driven spike, right? I'm sorry, Brian, say that again. I was going to say like an event driven site, but I think, you know, without giving away, you know, who this was or anything, we'll say an event driven spike, but it's not like a black Friday kind of a thing. This is more of a something occurs in the real world and suddenly everyone
Starting point is 00:12:22 floods to the site. Correct. Exactly. And I mean, you can think of it as like the classic, and maybe I'm dating myself because maybe Slashdot isn't really a big thing anymore, but it's kind of the Slashdot effect, right? So Slashdot, for those that aren't familiar, kind of a tech news site, one of the first ones actually, where, you know, maybe you wrote a great blog post and suddenly it appeared on Slashdot. And as soon as that happened, you know, thousands and thousands of people went to your blog posts and took your site down. So, you know, kind of these real unusual traffic spikes.
Starting point is 00:12:59 And this client is aware that those spikes are a possibility. And that's where they were planning for it with this kind of using JMeter to simulate this large amount of traffic. So under normal load, the website performed fine, actually performed very, very well. It was only under this huge load that things really started to deteriorate pretty badly. So as I mentioned, we were using JMeter to simulate the traffic, simulating over the course of two hours, actually. But around the 10-minute mark, maybe between 10 and 15 minutes, is when the solution really started to fall over. Now, it didn't fall over completely to like a complete, you know, a complete air state. But what was happening as the traffic ramped up and as things went along, the page response time would just exponentially get worse and worse.
Starting point is 00:14:03 And at some point it was taking, you know, 60 seconds, 90 seconds for a single page to load. Obviously, something was very wrong. So that's when we kind of, where Brian came in to help, and we used Dynatrace as a tool to kind of try to figure out exactly what was going on. Can I ask a quick question? So you were saying you were using JMeter as a load testing tool, and the response time that you were now referring to, is this the response time that JMeter gave you?
Starting point is 00:14:37 Yeah, that's correct. And, I mean, actually, I mean, obviously, if you hit the site just in a browser, right, during when the load test was happening, you could see what the user experience would be, where your browser just spins and spins and spins. And I will note to those that aren't familiar with JMeter and how it works, the time that is measured by JMeter is actually just the time it takes for the HTML to download. So that doesn't include any things like, you know, JavaScript loading or media resources, you know, images and what have you. That load time isn't even included as what was being measured by JMeter. So that 60 seconds that the pages were taking the load, that was just to serve the HTML to the user. Obviously, that's completely unacceptable to a public-facing website where performance is always key. It doesn't matter what industry you're in, what your website is doing.
Starting point is 00:15:36 It has to perform. Now, I have another question. If you run Gmeter, and I've done webinars on Gmeter. I have a long history with load testing, so I know JMeter pretty well, what they are doing. So can I ask you another question? Did you just look at the response time of JMeter, or did the load testing folks that actually ran the JMeter scripts also look at things like server resource utilization on IIS, I would assume, and on the app pool, or did they simply just ran the load and then basically handed it over to you? Or were these load testing folks also looking into more than just what JMeter
Starting point is 00:16:13 gave them as response? No, actually, they were just looking at the JMeter kind of response. And that's where Perficient was engaged. And that's where we engaged Brian Wilson and kind of Dynatrace to help us. But the client basically, they ran the low test and they saw very obviously that it was broke. And they did some instrumentation on the server, you know, looking at CPU and things like that. And they saw that, you know, everything was kind of spiked across the board. And, you know, in that way, they really came to proficient and we came to Dynatrace for that expertise. Like, what exactly is going on here?
Starting point is 00:16:55 You know, it performed fine. It performs fine under normal load. Why did it completely fall over under this heavy load? What was the tipping point? What was basically the thing that broke the party? I mean, the reason why I was asking this particular question, because I still see a lot of load testing teams that are really purely just focusing on generating load without having the expertise of actually diving into the real problems of what could actually cause a problem in case you detect a problem. I think there's a shift happening now in performance engineering teams
Starting point is 00:17:30 thanks to a lot of material that is out there in the world. I think also a shout-out to our friends from PerfBytes who do a phenomenal job in educating the performance engineering community with not only how to run a lot of load, but how to also combine it with tools like Dynatrace to make sure that they cannot only say, oh, there is a problem, and now we have to bring in the experts, which I'm sure, don't worry, you still have a lot of work to do later on, use proficient. But I think we will see more and more, we will see a level up from traditional load testers in also diving into the app monitoring space to really be able to say, wow, when we hit that particular threshold, we are going to see an increase in memory usage, which tips over the CPU.
Starting point is 00:18:17 And this is what causes everything to break. And so I think we'll see some of these things. I think what's interesting, though, is, Scott, correct me if I'm wrong, in the case of this instance, there was the tipping point hadn't been discovered, right? Because it was low load and high load and nothing in between that was run. Or do I remember that incorrectly? No, that's absolutely correct. And I mean, that's one of the first things when we looked at the load results when we put Dynatrace on, we saw a thousand problems. There was a million things going wrong, but it was, well, wait, we're running at 100% CPU. So, of course, a million things are breaking now. Exactly.
Starting point is 00:19:19 And last question that I have before, I'll let you kind of explain what you did to find out. Now, the environment that the application was deployed in, was it a scalable and autoscalable environment? Meaning, did it try to spawn more ESP.NET worker processes under more load? Meaning, was IaaS configured so that it was actually using web gardening, some load balancing across multiple servers, or was it really just a quote-unquote static environment, and we were just basically hitting it with too much load, and it didn't even try to scale or wasn't even possible to scale automatically? Right, right. Actually, I mean, yeah, it's the latter. It was more of a static-type environment.
Starting point is 00:20:03 Once again, that actually goes back to kind of how Sitecore is licensed, where it's licensed per server. So what we did in our load testing environment is we replicated the production environment, right, which in this case was two content delivery servers. And we threw the traffic at it that they knew they had to serve in a given amount of time. Once again, that's 1.3 million page views over the course of two hours. So we simulated that production environment. We threw the traffic at it. And as Brian said, we were way, way over the tipping point. By the time we were looking at the Dynatrace results, say after 10 minutes, after 10 or 15 minutes, as Brian said, it looked like everything was wrong.
Starting point is 00:20:51 It's kind of trying to diagnose the Titanic at the bottom of the ocean. specs from Sitecore figuring out if Sitecore itself actually says this is how much travel traffic we can support on that particular hardware environment? Yeah. So, I mean, Sitecore itself doesn't actually tell you like a hard number of this is how much traffic you can serve, you know, based on this particular hardware. Because Sitecore kind of, Sitecore out of the box gives you the admin interface. In other words, kind of the content management piece where authors will, you know, use the software application to create their content, right? But on the delivery side, so the actual HTML that's delivered to an end user
Starting point is 00:21:43 when they browse to your website, that's completely written by an implementation partner or written by a project team. So Sitecore can't say, hey, we can support this much load because a lot of that is dependent on your particular solution and what you've done as part of what you've done as part of that of that solution yeah how many widgets you've put on the page and all that kind of fun stuff and how heavy they are your image size is there's a million variables uh in that side that i think it would be although i guess it would be interesting if psych or had like as a as a bare minimum um but yeah i guess don't know how useful that would be but yeah and i I mean, we actually did engage Sitecore to kind of give us their perspective. And from their perspective, taking a brief look at the code and kind of what the site did, Sitecore came back and said, this shouldn't really be a problem. You shouldn't be seeing the solution completely fall over under that load. You know, it was then that we really needed to figure out, okay,
Starting point is 00:22:48 how do we find the tipping point then? You know, this isn't a Sitecore problem. You know, according to Sitecore, this isn't a Sitecore problem. So where do we go next? And I think that, Brian, it was your suggestion, actually, that we actually need to scale back that traffic in order to find the tipping point right that was where i think we um went ahead and started you know that i think it was like maybe like 25 percent load and then increased uh basically a you know a step test although we ran
Starting point is 00:23:19 them separately but we um basic idea of just what did we call that one andy with this the step testing do we have a special name for that back in episode two? It's an increasing load test, basically. I think that's what we call it. You start with a certain load and you increase it. And you increase and hold and increase and hold. And we're really just looking for what's starting to break apart. You know, what are the first signs of degradation?
Starting point is 00:23:43 Because that's where you really need to concentrate because usually um not i'd say not 100 every case but usually the thing that you start seeing increase first is going to be your breaking point and you know i would like to do a shameless plug for our tool dynatrace because in this case, the layer breakdown dashlet was just so blatantly obvious. You know, we pulled that up and it just, you know, without even having to dive or think or anything, it just revealed, you know, as long as it creates... I love that dashlet, yeah. Yeah, and basically for anyone who's not using the tool, it's basically going to show you which API layer you're spending time in over time.
Starting point is 00:24:30 So as you increase your load, you just see a nice graph, a color-coded graph, each API is in a different color, and it will just start increasing. And I think it was the ADO.net, correct, Scott, that we saw just kind of running away? Yeah, that's correct. And yeah, like you said, the Dynatrace kind of dashboard there that showed you the breakdown, uh, it was very obvious, very, very obvious that that is where we're spending a majority of our, of our time.
Starting point is 00:24:59 Um, and from a Sitecore perspective, that was truly shocking. Um, and I guess I'm going to go a little bit into the Sitecore weeds for just a couple minutes. So bear with me, those that don't know Sitecore. But so all the content within the content management system is stored within SQL Server. So seeing ADO.NET isn't a shock, but seeing that much ADO.NET was. And the reason for that is because Sitecore caches everything that it retrieves from the database into its own kind of caching layer. So all that stuff ends up being in memory, especially things. So this load test that we performed was across only a handful of pages. I think maybe it was maybe eight, maybe
Starting point is 00:25:46 10, maybe a dozen, something like that. So the expected behavior for Sitecore would be once those pages have been kind of served, that all of that data is going to be cached and we shouldn't ever have to hit the SQL server. So seeing that much ADO.NET really, I mean, it was really obvious that something was going on here that was a major problem. That we were using Sitecore's APIs in ways that they weren't designed for. We weren't using the cache properly.
Starting point is 00:26:17 And that's actually what we discovered. What we discovered when we dove into that kind of ADO.NET breakdown and the actual methods that were using ADO.NET, we discovered that there was a particular method that was being used often by all the navigation components on the page. So each page had three or four different navigation components. They were all calling into this method that was using something that's called a Sitecore Fast Query. And what we didn't know until we did some research was that Sitecore Fast Query isn't cached at all. So we were basically bypassing Sitecore's built-in caching layer and hitting the database every single time.
Starting point is 00:27:07 And as you can imagine, too, with a content, kind of a dynamic content management system, these queries that we were hitting SQL Server with were not pretty. You know, these aren't, you know, one-line queries, you know, select star from whatever. These are queries that have four or five inner joins. They're very complex queries, and they're very taxing queries to the SQL Server. So that's where our bottleneck ended up being was actually the SQL Server. So when we were seeing those page load times taking 60 seconds, it was because the page was waiting on SQL Server to actually respond with the results that it needed.
Starting point is 00:27:49 Ah, and so let me take a wild guess here. That means your incoming requests were basically waiting on SQL Server. That means ASP.NET was basically blocking these threads because they were all waiting. So the more traffic you were basically putting on the front-end side from Gmeter, at some point in time, IIS and ESP.NET is running out of worker threads, and therefore it's basically blocking them until they go into a timeout, and I assume the timeout is probably 60 seconds. So that means once you hit that tipping point where you're all basically blocking all of our ESP.NET threads, they're all waiting for the database, then it has the effect that IIS keeps the new requests that are coming in waiting until it hits a timeout, a native timeout in IIS. And that basically then shows up as probably like 60 seconds bad response time and then probably bad HTTP requests or responses come back from IIS to JMeter.
Starting point is 00:28:48 Is that kind of a little bit going in that direction? Yeah, that's definitely what was happening. There just weren't enough.NET threads IIS threads to serve all these requests and the requests just weren't being rendered in a proper amount of time. So we actually had a goal for page load time as well, between five and seven seconds under this, you know, ginormous load. And very obviously, we weren't meeting the five to seven second load time. But yeah, once you start getting into the realm of 15 seconds, 30 seconds, up to 60 seconds, that's where the whole solution just kind of falls over under that heavy traffic.
Starting point is 00:29:38 Yeah, and I assume it's probably on the IIS layer, on the native layer already, where IIS says, well, I cannot pass these requests over to ASP.NET anymore because it's busy. All the threads are blocking, and then IIS or if you are behind a load balancer, and then it would scale up more instances of that app pool. But on the other side, because of the real problem here is the fast query that you said. So actually scaling up the site vertically, meaning adding more app pools, would actually have killed the SQL server even more. And actually, it's not a scalable solution what you had. Now that's quite something.
Starting point is 00:30:32 Yeah, that's correct. So let me ask you one last thing. So you said these were three navigation components. And were these, maybe you said this, but I maybe don't remember, were these custom navigation components? So it's basically the engineers that customized Sitecore were putting the fast queries in or was this coming from Sitecore itself? Oh, correct. So in the instance of content delivery, Sitecore is essentially just an API. So every, all the HTML that's generated for a page is done by components that are custom wrote. So these navigation components were custom wrote. And like I said,
Starting point is 00:31:12 you know, we were using the Sitecore fast query for those components instead of something that would be cached in Sitecore's data layer. So more of a normal type of Sitecore best practices approach would be to retrieve data not using FastQuery, avoiding FastQuery wherever possible. And in this particular instance, what made sense was actually using what in Sitecore we refer to as a data source for these particular rendering components.
Starting point is 00:31:45 So instead of doing a query that basically says, go find this content that I need kind of in the controller layer, instead of doing it that way, we re-architected it to allow a content author to select what item they needed for those navigation components. And by doing that and re-architecting it using data source, all of those calls are then cached in Sitecore's cache layer, which means as soon as those navigation components are hit one time, we no longer need to go to the SQL server to retrieve that data anymore. It's all in memory. Very obviously, you know, as soon as we did that, we started seeing very positive results under the load test. And I think a lot of these were things like, you know, header, footer, right, on the page, you know, the idea of these, you know, caching
Starting point is 00:32:42 or maybe the content, I don't want to say becoming stale necessarily, but this is not dynamic content that was involved in these components. Obviously, the main content of the page, especially during the events that would drive people to the site needed to be dynamic, needed to be able to get updated. You don't want to be serving any stale content, but there's, there's that idea of when you have the ability to do things that Scott is talking about, take a look at what on your site doesn't need to be updated very often, your header and your footer and some of those other components, they're not going to change. Right. So they were perfect candidates for, and I think Sitecore even kind of recommended, you know, not making those dynamic and updated all the time. Correct? Yeah, that's correct. And I mean, even Sitecore has all these problems solved anyways, to be honest with you. So, you know, for that particular content item, let's say that the header navigation needs, for example, if you make a change to that in the authoring environment and then you go ahead and publish it out to your content delivery,
Starting point is 00:33:51 Sitecore is smart enough to know to kind of purge that cache immediately from the publish event. So as soon as a content author would publish that content, any cache related to that piece of content would be deleted by Sitecore, meaning that it's going to be refreshed the next time somebody hits the site. You know, that next person would see the correct data and that correct data would then be cached in Sitecore's cache layer. So, you know, from an architectural standpoint, Sitecore already has all this kind of figured out, and we were just kind of – those particular navigation components were just avoiding actually using the built-in caching for Sitecore.
Starting point is 00:34:34 And that was one of the lessons that we learned is that we learned that actually the Sitecore fast query wasn't being cached. You know, that was something that wasn't necessarily well documented on Sitecore's side. But obviously, once we saw what was going on in Dynatrace, it was very obvious that, you know, those fast queries, A, they weren't very fast. And then B, they weren't being cached properly as we thought that they would be. Now, let me ask you a question. Did you tell, I mean, obviously, then work with the development teams, right? So you gave that feedback to the engineers that actually built these components. So now, from now on, hopefully, they will build new components for Sitecore in the correct way or in the way Sitecore itself wants you to build it using the data source and no longer the fast queries.
Starting point is 00:35:26 That's correct. And, I mean, Sitecore's recommendation is to use that data source wherever possible to allow things like personalization. Now, for navigation components, maybe that's perhaps a little silly. Maybe not, though. Maybe not. So maybe depending on, you know, a user logs in, maybe they see different things in the navigation. If you use the Sitecore data source properly, you can kind of use the built-in Sitecore has this concept called output caching. And what output caching is is actually caching the HTML result of a component that makes up our page. So in Sitecore, you build things very modularly, as you can probably guess by now, right?
Starting point is 00:36:20 So these different navigation components are different, what we call renderings on the page. And by default, output caching is actually on as long as you enable it. You just have to check a box. Check a box on the rendering definition. And essentially, when that rendering, when it actually outputs the HTML, it will throw that output into a separate cache. So that the next time that component has to serve something, rather than going to CyCore's data cache or potentially having to go to SQL Server, it's actually just fetching the HTML result from this other output cache and renders it directly.
Starting point is 00:37:08 And that's one of the things that we actually discovered really at the onset of this project was that output caching had actually been turned off by the client. Because there were some particular renderings that they were seeing cached that were a big problem for them. So I guess rather than going down the route of figuring out those particular renderings or trying to disable caching for those particular problem areas, instead, the output cache for the entire site was turned off. And in hindsight, that sounds kind of silly, right? Especially after we saw what the actual problem was. But I mean, in reality, they turned it off and under normal load, everything was fine. It performed no different than it did before they turned the output caching off. So, you know, it seemed reasonable. But once again, going back to kind of site core best practices,
Starting point is 00:38:13 you've got to have that output caching on if you can because it's going to make your site so much faster. Not necessarily, you know, we're not necessarily just talking about preventing your site from falling down under load. We're also talking about making your site very fast. And that's also very important. This sounds kind of like, you know, a lot of sites and a lot of technologies have a very similar kind of idea of, you know, the caching or memcache or other things going on in there. I'm curious, in that situation, was that something that they were able to tell you that they did, that they knew about this? Or was it kind of one of those situations where, and this might not be the case in this particular case, but I think quite often what we see is that somebody makes a change like that.
Starting point is 00:38:57 They have a problem. They're like, well, let me turn this caching off and see if it fixes it, right, so that they know where the issue is coming from they turn it off it fixes it and then they never revisit and they even maybe either forgot they turned it off didn't document it or something and it's not until someone starts digging and digging that we you know we discover oh yeah we that's right we turned that off like months ago and we forgot about it and it was an undocumented change that then gets pushed to production maybe or something in in the case that you're talking about here was this something that they were they knew and they were like oh well we have this turned off or did it have to get discovered um the hard way uh you know it had honestly it was surprisingly a little bit of both so uh it took some discovery for me to kind of figure out that, hey, this output caching is off at the site level. Why is that?
Starting point is 00:39:49 And they were actually able to tell me why. And it was documented. It's just they weren't kind of up front with that. They didn't identify turning that setting off as a major problem and as potentially a major reason as to why they weren't meeting their performance testing goals, right? Like, so when I dug in, I said, hey, why is this turned off? They were able to tell me why. So I guess that goes back to kind of knowledge on the technologies that you use. You know, and Sitecore is definitely somewhat of a niche technology.
Starting point is 00:40:25 So it's difficult to expect kind of just a.NET architect or a.NET senior developer to really know and understand what that setting does and what it means. Right. It's there in the documentation, though. And I mean, that's where knowing your tools is really, really important, whether it's Sitecore, whether it's something like SharePoint, or whether it's a third-party library you're using, you know, if you're using N-Hibernate, or if you're using Entity Framework, you can't just use these things and not expect to have to know anything about them, right? Like, there are times that you've got to dig in a little deeper. The classic performance problem, right, with something like an N-hibernate or an entity framework
Starting point is 00:41:09 is kind of the N plus one problem, right? But you've got to, in order to prevent those types of problems as a developer, right, you have to know how the tool works in order to prevent, you know, firing off a thousand different queries to the database.
Starting point is 00:41:24 And once again, in scenarios like that, that may be okay if you're developing locally, right? And you've got a local SQL server, it's going to be super fast. You're not going to notice anything. It's only when that kind of bad implementation gets put under load that you really start to notice those problems. And that's where I... Go ahead. No, I was just saying, I mean, I'm really start to notice those problems. And that's where I – go ahead.
Starting point is 00:41:50 No, I was just saying, I mean, I'm just nodding all the time. If you would see me, it's like, oh, my God. He's just saying the same things that we've been seeing. And I'm always so – I mean, I was a developer for a long, long time, and I know how it is. You download something, you start with the Hello World example to make yourself familiar with the technology, with the framework. And then you never take the time nor the interest in actually learning what the framework is doing underneath the hood. But what we've been trying to promote with Dynatrace now, with our free trial and the personal license especially, so developers can use Dynatrace on their local machine forever to actually identify and figure out what are these frameworks internally doing.
Starting point is 00:42:28 And what you just said, you said on their local machine, it will always work because they're the only one on the system and the local SQL server will always respond fast. But they can already identify the M plus one query problem pattern because if they hook it up with Dynatrace and they run the local test and if they see the same SQL statement being executed more than once, then I already know as a smart developer and architect, this will never be able to scale because I already see on my local environment that I have this one query more than once, which is the classical n plus one query problem. So I can just reiterate what you just said. Please, developers out there, don't just blindly take a framework and take a Hello World example and then think it works on your machine and everything is fine. Take the time and learn what the framework is internally doing.
Starting point is 00:43:19 Take the time to check the documentation, the blogs from these companies that built the frameworks, because what this will actually allow you to do, it allows you to become a better developer and actually being able to spend more time in developing new cool stuff on top of these frameworks instead of being called in later on by Scott or by some other people and saying, you messed up here and now it's time to fix it. And you're taking time away from them on the next project that they're working on. So I can just totally reiterate what you said. You know, research. Do your research. Do your homework.
Starting point is 00:43:56 And these problems seem to be very similar. They might have some different names or some different characteristics between Java and.NET, but they're pretty universal across the two, right? It's, there's a, you know, Java and.NET are kind of different skins, but underneath people are still... It's architecture, right? Yeah, we're talking about architecture.
Starting point is 00:44:17 The M plus one query problem is a classical architecture problem, whether it's between the application and the database, whether it's between the front-end and the back-end server, whether it's between the browser and the web, whether it's between the front-end and the back-end server, whether it's between the browser and the web server. If you are making too many round trips to get information that you can do in fewer round trips, then this is an architectural issue. Or the microservices, which I think we should discuss on another show sometime. Yeah, exactly.
Starting point is 00:44:39 Same thing. The M plus one query problem is not bound to the database. So I think, Scott, this is a great – thanks for actually reassuring us that what we've been talking about and trying to educate people on, especially on the environments we've seen so far, that this is also true for Sitecore and other applications that you've particularly seen in the.NET space. So that's good to hear. And is that a drill in the background? That's the sound, the guy who is drilling the holes. That just reminds me, I think it might have been Halloween 2 or one of those horror movies way back. The guy was like drilling into somebody's head.
Starting point is 00:45:19 So just watch out there. I think it's on my head. So, Scott, any other points that you wanted to get across here or any kind of summary or wrap up or anything else? Yeah, you know, we and expect to be successful. It's just, I mean, unless, unless you, unless you build a perfect solution, you can't wait until the last minute to do load testing. Like this load testing, isn't the last thing you do before you go to production. It has to be more part of your QA process. And, you know, local developers need
Starting point is 00:46:12 to do this as well. Performance needs to be a goal and needs to be something at kind of at the top of everyone's mind throughout the whole process. If performance of your application is important to you, which it should be, then it needs to be part of your kind of normal. If performance of your application is important to you, which it should be, then it needs to be part of your kind of normal process, part of your code review even. You know, let's start reviewing code, not just for correctness or not just for how pretty it is or how understandable it is,
Starting point is 00:46:38 but how performant it is as well. Yeah, performance testing is not just load testing. Exactly. You know? i think i like i mean you said you i think you used the term goal i think it's more culture it's a performance has to be in your culture when you develop software because nowadays the software that you develop is potentially used by millions of users out there. And if you make mistakes in the architecture, it means you will not be able to perform, you will not be able to scale. And the other option, either you fix it
Starting point is 00:47:15 or the other bad thing that could happen, you have an automatically scalable environment that maybe spawns up 100 Docker instances or or vms but then somebody has to pay the price for it at the end of the month um so kind of um i mean it's obviously not not not the best thing to do and i agree with you performance engineering i think brian we had our second or third episode on performance engineering in cicd so all in automated, right? And I think we touched based on the fact that we can actually find a lot of these problems in CICD, in continuous integration,
Starting point is 00:47:51 continuous delivery, without always having to run a large-scale load test. You can find a lot of these problems. Like in your case, Scott, this problem, you could have found it just running a Selenium script or your chain meter script. Instead of simulating 1,000 users, simulating two users and run it for two minutes, you would have found the same problem pattern. Exactly.
Starting point is 00:48:16 But also provided that you're monitoring the right metrics. So in cases like that, monitoring your database queries, the times on those, but also knowing to look at those caches, right? If you're using something like Sitecore or something else that employs caching, look at your cache metrics. Are you using it? You know, and the point you made before, you know, you very specifically ran a test with eight pages, right? That's something, you know, if you're doing a general large scale load test, you probably
Starting point is 00:48:42 wouldn't do that. But for the case that you were running to, you know, there was, there was a specific use case for that test. Right. But in that case specifically to like, Hey, if we're going to, if we're going to run a test with eight or 10 pages, cause we want to see that this performs well after the initial hit, well then absolutely. Like that should be the trigger in everyone's brain to be like, well, we should probably look at the cash metrics too. You know, it's hard sometimes because there are so many things to consider, but, you know, just taking that step back to say, what are we looking at or what are we trying to achieve here? And are we capturing the metrics to tell us that story to see how we're doing it? And also
Starting point is 00:49:25 going back to the vendor's documentation, what do they say? What are their best practices? Do they have best practices? Or if they go onto your blog, which is what S E stalker, uh, S S E S T O C K E R, uh, on Twitter, uh, you have a lot of best practices for, for a site core. There's a, there's a lot of people writing about these things. And I just think, you know, people, people that serve,
Starting point is 00:49:49 it serves them and their projects and everything else better to, to be aware of all these things so that they can catch them early. They can see what they're looking at. Cool. All right. Well, sorry, was there.
Starting point is 00:50:02 Yeah. Oh, I was just saying, wow, I love to, I love to eat this inside. Yes. And I just wanted to kind of start wrapping up here, you know, just putting out the general invite as well. If anybody has any topics or fun stories or any ideas they would like to discuss, we love guests.
Starting point is 00:50:18 Scott, we loved having you on. And again, I'll plug you as a stalker, S-E-S-T-O-C-K-E-R on Twitter. I am Emperor Wilson on Twitter, and then we have Grabner Andy. Grabner Andy. G-R-A-B-N-E-R A-N-D-I. And if you have any comments or thoughts or any ideas, you can either tweet us at
Starting point is 00:50:39 hashtag PeerPerformance at Dynatrace, or you can send us an email at PeerPerformance at Dynatrace, or you can send us an email, uh, at pure performance at Dynatrace.com. Um, we'd love to hear any ideas, thoughts, feedback, good, bad, ugly, um, the good, the bad, the ugly. That's a great film by the way. Um, anyway, Scott, it was a real great pleasure having you on today. Um, I'm not exactly sure when the air date, let me just take a quick hop here.
Starting point is 00:51:06 Andy, would you like to give any kind of summary while I look at the... Yeah, definitely, definitely. I mean, as I said before, I think it's just reassuring us what we've been trying to do is educating people about the common mistakes that developers put into their code
Starting point is 00:51:21 or into their architectural decisions, into their configuration. So the M plus one query problem is definitely a bypassing caches, bad mistake. Make sure you read the documentation. Also, a shout out again, if you want to give Dynatrace a try, because Dynatrace is the tool that can help you identify these problems, then go to our website, sign up for the Dynatrace free trial. It is 30 days to start with where you can use it on the distributed systems.
Starting point is 00:51:46 And then after the 30 days, it converts to a personal license, which means you can keep using the full-featured Dynatrace software that allows you to diagnose exactly these problem patterns that spits out all of these metrics for you automatically. If you are a developer and a tester, and if you're doing some local development and testing, just use Dynatrace. Use the free trial and then the personal license. And I can also do a shout-out on the YouTube channel that we have. If you go to YouTube, there's a lot of tutorials out there that I put out there where I basically show how to diagnose performance problems on the database in memory. So have a look at that as well if you want to get started. And I think that's it. I think I'm very happy that Scott joins us in educating the global community of developers,
Starting point is 00:52:37 testers, anybody that is really interested in building better apps, scaling and performing apps. Great. And you can also read more about this blog if you do a search for diagnosing Sitecore performance problems on blogs.perficient.com. And Scott, this episode is going to air around August 2nd. Scott, do you ever do any kind of speaking engagements or go to any – I think there's a – isn't there a Sitecore symposium coming up or something? Well, I will be at Sitecore symposium. And when is that again?
Starting point is 00:53:17 I'm not – you know what? I'm going to have to look that up. I think that might be September-ish. That sounds right. Right, but anybody listening, if you're a Sitecore user, definitely go find Scott. He's got a wealth of knowledge about it. Do you have any other kind of engagements that you take part in or meetups or anything else? Well, yeah.
Starting point is 00:53:40 I'm actually going to be speaking at the Cleveland Sitecore user group, but that will have been in the past by the time this podcast airs. So I probably will be having some other speaking engagements at Sitecore User Groups across the country later this year, but nothing definitive that I can announce right now. Okay. And Andy, do you have anything on your dance card? My dance card? Yeah. By the time this episode airs, I just finished my Salsa Congress in Slovenia, which I'm sure I will be enjoying. And after that, I will most likely be speaking, hopefully, or at least be there at DevOps Days in Boston and the DevOps Days in Chicago. That's my hope that they accept me. Excellent.
Starting point is 00:54:27 Let me speak. All right. Well, thank you for everyone listening today. I hope you enjoyed our switch up in the beginning where Andy said hello first. We did not have a trivia question today, so I don't think we prepared well for that one. Didn't even think about it. It's hard to think of a trivia question that you can't Google. So, you know, we only have one episode out there at the time of recording with trivia, and we've gotten zero responses to it back anyway, so I'm not too concerned yet.
Starting point is 00:54:56 Anyway, thanks, everyone, for listening. This is Brian saying goodbye. Andy and Scott, please say goodbye, and we'll see you next time. Goodbye, and build better see you next time. Goodbye and build better software. See ya. Bye. Thanks for having me. Thank you. Music
Starting point is 00:55:18 Music Music
