PurePerformance - 026 Love your Data and Tear Down Walls between Ops and Test

Episode Date: January 17, 2017

How often have you deployed an application that was supposed to be well load tested but then crashed in production? One of the reasons might be that you never took the time to really analyze real-life load patterns and distributions. Brian Chandler (@Channer531) (https://www.linkedin.com/in/brian-chandler-8366663b) – Performance Engineer at Raymond James – has worked with their Operations team to not only start loving application-specific performance data captured in production, but also to break down the DevOps walls from right to left: sharing this data with testers to create more realistic load tests, and educating developers to learn from real-life production issues.

We hope you enjoy this one, as we learn a lot of cool techniques, metrics and dashboards that Brian uses at Raymond James. If you want to see it live, check out our webinar where he presented their approach as well: https://info.dynatrace.com/apm_wc_getting_started_with_devops_na_registration.html

You can view the screenshots we refer to at:
https://assets.dynatrace.com/en/images/general/Chandler_01.jpg
https://assets.dynatrace.com/en/images/general/Chandler_02.jpg

Transcript
Starting point is 00:00:00 It's time for Pure Performance. Get your stopwatches ready. It's time for Pure Performance, our second episode of 2017. Hello, Andy. Hey, Brian. Hey, so before... What's up? Nothing, but you're calling me Brian right now and our guest's name is Brian.
Starting point is 00:00:41 So I'm going to attempt to be respectful and allow our guest... Hey guest, Brian, no speaking until we address you. I'll defer to him as Brian, and I will give you permission instead to reference me as Emperor during the show, because that is... You're so modest. Yes, exactly. So you can refer to me as Emperor, and our guest Brian can be Brian. So speaking of our guest Brian, Andy, would you like to give a brief introduction before he introduces himself? Sure. Well, so here's Andy again. I'm a little jet-lagged, I just got off the plane, and I think I've been up for 36 hours now, and coffee keeps me going. And I'm really happy today for this webinar.
Starting point is 00:01:27 Well, actually, not a webinar. It's a podcast. See, my mind is a little strange, but it's going to be fun. So here's the thing. In the Brian, the real Brian that we call Brian today, we have known him for quite a while. And the last time I bumped into him was in Tampa, Florida, one of our Dynatrace user groups meeting. And at the end of the meeting, I believe, Brian came over to me and said, hey, you know, there's some cool stuff I want to show you. And he showed me some dashboards that Brian has been using in production to actually make some sense of the data to better understand the behavior of the app and actually some traffic patterns. And then he actually, I believe, came up with a term that we use for today's episode, which is why you have to love your data and tear down walls between ops and test.
Starting point is 00:02:12 And actually, I think it's great that we actually call it From Ops to Test because I think this is some stuff that we can learn today for operations on how they can monitor applications better, how to monitor them, and then feed data back to test. And before I really let the Brian introduce himself, is I also want to say, this is going to air, I believe, Mr. Emperor Wilson,
Starting point is 00:02:35 mid of January, just a couple of days after I've been doing a webinar called DevOps for Ops, kind of the first steps. And thanks to the Brian that I'm now introducing, I can actually also use some of his screenshots. And so people will actually see some of the stuff that we're talking about today. Without further ado, Brian, welcome to the show.
Starting point is 00:02:57 I'll do further ado. I just want to tie in. It almost sounds like what you talked about from shifting from ops to test ties in a little bit to the last episode with shifting load right in that kind of special way. So anyhow, go on now, now back to Brian. Brian, special guest, Brian Chandler. How are you doing? Good, good. Well, you know, first, first I want to say I'm honored to be carrying the title of Brian today. Thank you for that, dear emperor. It's good. Yeah, so yeah, really excited to be here. I'm just, I guess, a little bit about me. I'm a systems engineer over
Starting point is 00:03:31 at Raymond James right now on the application performance management team. We're a relatively new team there. It's the third organization I've worked with in the APM area. I've been working with a pretty wide range of app types, internal phasing, external phasing, customer marketing apps. You know, APM is just one of those things that kind of made sense and didn't have a problem getting indoctrinated into the whole philosophy. I mean, something amazing happens when somebody clicks or checks a box when you hit an app. It sprays across 40 servers, however many data centers, all those functions. So it's just a really exciting world to be in, and, yeah, happy to be here. And it sounds like you have a – it seems he sounds like he has a radio voice.
Starting point is 00:04:13 It sounds like he's been doing this before. Well, you know, I went out and got one of those fancy mics, of course, off Amazon. Yeah. That's cool. See, he went up to you. He's not using his headset. Yeah, yeah, I know. I'm usingupped you. He's not using his headset. Yeah. Yeah, I know.
Starting point is 00:04:26 I'm using my headset now. So, hey, Brian. Now let's dig into what you showed me in Tampa the other day. And because I believe it's a fundamental lesson that hopefully – well, I think – I mean people should know about this, but I believe what you showed in your dashboards and the way you visualize it, it's a great way for operations to start thinking about the app not as one big thing and every endpoint is equal, but every endpoint has its own weight and its own priority. Please explain. I'm looking at the dashboard right now, and it's labeled client-centered daily traffic pattern. Can you explain a little bit more about that and what you found out? Right. So that's – what you're looking at there is just the traffic pattern of our largest FA.
Starting point is 00:05:17 And when I say FA, financial advisor app, it supports about 6,500 financial advisors or so right now. So a pretty wide range of users that manage tens of thousands of investors. And the, what I'm trying to paint with that picture there is to really fundamentally how important it is to understand how humans kind of organically use the app, you know, and it's, it's way different than, you know,, you know, what a QA team might intuitively think for a way to test something, right? You know, if you think of an app that has, you know, like this one, for example, has a 58 or so API calls or high level functions that the browser will go out and hit, you know, when you check a box or, you know, hit that tab. And, you know, a QA tester
Starting point is 00:06:02 somewhere might think, okay, well, I have this list and it makes sense if I can just write a script that that exercises all of these evenly. Right. That might be the kind of the intuitive answer, but that's totally not not the case for really any of the apps. I mean, it turns out for us, three out of those 58 API calls makes up about 60 percent of our traffic. So quickly you realize that you cannot treat all of them as equal citizens, right? If you're going to hang some type of service off of one of these API calls, it's going to have a fundamental impact on your app potentially. So what we do over at Raymond James, we have this really, really cool feedback loop with the QA team. We show them basically that graph. Every day they can pull that up and look at today's behavior from our users. And we say, okay, your tests, when you write them in Lode Runner or whatever tool you use, it's got to look like that picture.
Starting point is 00:06:57 And you know you've exercised the app correctly when it looks like that picture. So let me ask you the question now. That means you kind of tore down the walls between ops and test because these two teams are now really actively sharing data on a day-to-day basis, huh? Right, yeah. You got to share the data. It's got to be – it's like free love. It's got to be free data for everyone, right? Go ahead.
Starting point is 00:07:23 Yeah, sorry. And was this initiated by ops or was this initiated by test or what initiated the whole thing? You know, it was actually our team. Like I said, we were kind of a relatively new team in this last year. And we own, you know,
Starting point is 00:07:38 the DT tools in all environments. And, you know, we're working with these QA testers and, you know, we're kind of exploring the things that we can do for them. And it's, you know, we just say, hey, well, you know, you know, your traffic patterns are looking like this, you know, and you could say, you could look at the load runner scripts, or they're there, they were originally kind of putting together, and the distribution wasn't exactly what it is in prod. And that just has huge implications there, if you let something kind of go through with that kind of distribution.
Starting point is 00:08:09 And, yeah, it was just sort of a communications thing, right? We sat down and kind of showed them, okay, well, this is what prod looks like. And this is what we're looking like we're going to test here in Lode Runner. And we've got to adjust this a little bit. And, you know, I'm looking at this and I'm thinking yes this is basic in a way but this is such a leap forward if i go back to my old days of load testing um trying to get this kind of information out of ops was was near impossible they'd maybe have a weblog you know which yeah try making sense of that this is not only visual but very very easy to read and you know i mean heck in the old days, we just kind of used to make an educated guess on what the traffic might be.
Starting point is 00:08:50 This is kind of game changing. And it's not even like what we're looking at right here is something new that you can do in the APM tools. You know, this has been around for a while, but I think just the concept of sharing that data, breaking down that wall, as you're saying, and sending the data back to the testers is quite incredible. Yeah, go ahead. The visualization, I mean, I almost want to say you created kind of like these flame charts, even though you chose obviously total different colors, not really flames. But I think you have different layers. You have different colors for the individual APIs. And if you have 58 different ones and the top three, as you said,
Starting point is 00:09:34 consume or sum up to 60%, then you chose to visualize. Sorry, Emperor Wilson, can we also post the images later on on uh yeah let's just make sure we uh we have them scrubbed nice uh if we need to and yeah i think we can uh at least well we'll have to see uh maybe at the very least we can provide links to them for somewhere but yeah we'll we'll do something we'll get it out there somewhere i think i think it's i think it's a good point it's it's important to be able to see i know we're talking about a graph and a chart and if you can imagine it uh i want you to imagine yeah a graph um but i think i think the visualization really helps with the with the concept again it's nothing it's not groundbreaking when you look at
Starting point is 00:10:16 it but it's it's the groundbreaking side is well yeah let's do this you know and making that leap to share that data is where the groundbreaking comes into it. It's really just how you augment the data, right? I mean, you think of these tools and, you know, you think of the pure path and the transaction tracing. But it's also, to me, they're just as much, you know, data analysis tools and behavior analysis tools for your users as much as performance, right? And it's basically a stacked bar chart, right? It's a stacked bar chart, and every measure is basically the volume of load that comes in on a certain API and different colors,
Starting point is 00:10:59 and you immediately see, hey, here are your top three, consume 60% because of the overall sum. It's just visually very appealing, interesting, really cool stuff. Yeah, it's a really good dashlet too. I mean, it's something when you pull up the table view, you have the half dashlet, half table view. You can actually go through and just kind of hit space bar, turn certain ones on and off.
Starting point is 00:11:24 You can see, okay, what if I turn off this api call how does that change the shape of the behavior of the app and you can see you know if i turned off one of those top three there you know it would change the whole shape of that graph so you can kind of it's kind of just fun to mess around with that in that regard kind of too and i think what's also really nice uh because you know you talked about this you're using in this case datrace, and you're using it in production but also in testing. That means you can easily create the same dashboard in both environments, which then makes it so easy to compare them. And this is what you also show in your second chart that you sent me, where you see the difference between prod and what you simulate in load. And that was, I guess, when you realized that you're load testing something totally different that does not at all apply or does not at all reflect what's actually happening in production.
Starting point is 00:12:16 Right, yeah. It's interesting because it kind of tells you, you can almost kind of point out, you can tell the story of how the scripts were written. You can see instead of the nice, organic, sort of steady hump you see in prod of all this different functions kind of coming in organically from various visits, loadrunner tests, you can see obviously, okay, they started 500 users right now, and they exercised all the API calls, and then the load runner test stopped, and then it geared back up again, and you have this sort of spiky sort of weird that you can see the whole story of the load test right there and kind of point it out. It's just kind of all these – it's a series of peaks instead of kind of like this kind of steady stream of usage of all these API calls that you see in prod. So yeah, just being able, like I said earlier, like the QA team, you know, they, it's great. They can pull both of these graphs up in real time and they can look at them both and say, do our tests, the things that we are exercising right now today, is it matching what we're seeing in prod? So yeah, it's been
Starting point is 00:13:19 super useful. And basically it's, it's actually, you can see two things here you see in uh unequal load distribution and unequal distribution between the api calls so the basically what i mean if you run these tests you're testing something that is i mean in the testing something totally different and therefore even if your load test succeeds uh it's almost like a failed test because you're obviously never going to have this pattern yeah yeah because you're not hitting the right distribution right i mean so i mean again we can yeah we can follow up i'll even we can do a companion blog or send out the the images or something after this but um you can see that yeah their their tests actually hit quadruple the total volume of the app,
Starting point is 00:14:05 but it only hit about a third of what it was supposed to hit volume-wise on the key API calls. So yeah, it was a failed test from the point of you didn't even exercise the right muscles of the app. Yeah, and this also kind of opens up a Pandora's box of modeling that production test, because besides, and this, Andy, i'm harking back to a little bit to the previous episode with the concept of shifting right um for for the load where we talked about um capturing the right metrics in production so that you can make sure you're modeling the same way in your in-production environments. So on the surface, we're talking about the different API calls, but you can even take this further then and say, okay,
Starting point is 00:14:52 if we got the same model of API calls, is our test environment also then giving us the same number of threads, the same number of queries that are being executed? And you can keep extending that comparison along different metric lines of metrics to make sure that you're, you know, getting a much closer production model or discover that there might be something wrong in your test environment setup where maybe you're getting the right API model, but you're not exercising the same amount of database queries or threads, which might mean that you're possibly, you know, maybe you're using the same search terms over and over or something else like that. Right. The danger is, though, you can go, you know, where do you where do you kind of stop in that in that model?
Starting point is 00:15:33 Because you can go deep and deep and deep and deep and deep. You know, it sure is a rabbit hole for sure. And yeah, once you once you start. Yeah. And yeah, like you said, the API layer, just the usage of that, that is the tip of the iceberg. I mean, you get into, you know, an app that, you know, it sprays over 40 different services, you know, hundreds of different URLs and functions that it calls underneath that. And it's like, okay, yeah, well, now you got to start getting the right traffic trend on, you know, everything, you know, going underneath the iceberg there, so to say. But it's also very, very important to do that in a way because if you do have that model set up well, and even if it's the first layer below,
Starting point is 00:16:12 even if you're just looking at the numbers of threads maybe and queries and connections and things like that, if you do get that model correct and then a new code push comes into the pre-production and something gets thrown off, then that's when you'll be able to say, hey, this is going to make a change in this way in production. Yep. Yeah. And it turns out, yeah, I mean, you can, it's great. And you can almost kind of start
Starting point is 00:16:32 turning the knob on, okay, if we introduce this service to this API call or introduce the X number of calls to that, it's going to hang underneath this function. You kind of just got to do, it's a little mathy, but yeah, you can definitely start turning the dial on those things and figuring out exactly how it's going to impact your app as a whole. That's because you have all these architectural metrics, as we call them, right? You have to view path end-to-end, and then you know this transaction is making these many calls, and if you're now shifting things around, you can immediately see the increase in calls to the backend, calls to the database,
Starting point is 00:17:12 when you start changing the way the services call each other or the way if you change the implementation of the service. That's pretty cool. Yeah, so one of the things we're starting to quickly realize, and it's like, okay, how deep you want to go? Do you want to go here? And then you get it going to have on the system over here sort of thing. It's kind of like, you know, that when we're talking about this stuff, my team at work, it's kind of like the Internet meme, you know, ancient aliens where the guy's like, it's aliens. It's all aliens. It's always aliens.
Starting point is 00:17:57 It's kind of like it's all anomalies. That's all it is. You see, you know, you can start, you know, so one of the things that we're starting to do is querying all these backend services, like you see in the PurePath and all that, and starting to, you know, dump it into, you know, hardcore data analysis tools. I mean, there's a lot of ways you can approach this. I mean, the way we do it is you just,'s a free splunk version how we we um how we just sort of experiment with it uh but you can obviously do it with log stash i know dynatrace you know you guys have the the bt feed that can go into log stash elastic search cabana and that all works
Starting point is 00:18:36 just as well too um but we're starting to say okay you know if you really want a good anomaly detection system it's not good enough to do sort of a rolling minute by minute or hour by hour baseline. You've got to be able to go back, say, four or five weeks for any given downstream service call or database call on any given minute, like 10.03 a.m. on a Tuesday, compared to the last 10.03 a.m. on a Tuesday, last five or so, right? I mean, to really zone in on, okay, what is not normal and what kind of changes are you making with the system when you introduce new things? So that means what you just mentioned may be interesting for the listeners. You are streaming out the data from your APM solution,
Starting point is 00:19:26 which in this case is Dynatrace, to the free version of Elasticsearch because I guess the free version allows you a certain limit, a certain number of data per day. Yeah, so it's actually Splunk. I think Elasticsearch is free forever. I mean, it's an open source solution. But, yeah, we're using – I'm personally just kind of experimenting around with the free Splunk that lets you do 500 meg a day or so.
Starting point is 00:19:47 But right now I'm dumping in because what I think DCROM really excels at is really measuring all of those dependent downstream services, not like an end-to-end kind of transaction from a pure path perspective, but actually measuring all the little endpoints that it calls below that. Like, DCROM is very good and excels at measuring things from a bottom-up perspective in an enterprise. That's where it's really good. And there's a SOAP action plugin that you can dump data into Splunk with. And actually, there's a really good DCROM extension out there in the community that I think it was one of the guardians. Shoot, I hope I don't get this name wrong.
Starting point is 00:20:27 I believe it was Brett Barrett. He actually developed it. And you can go in and create a SOAP REST call URL. It's a really nice GUI that he put together. They say, okay, I want this data to come out of Splunk for these operations and these dimensions. And all you got to do is just kind of dump it into your Splunk rest grabber and you can get all that data, you know, every minute. And it's especially good with the DCROM, you know, one minute time intervals that they just came out with in the EAP program. And that's just been awesome. So, yeah, for the past five, six weeks, we've been collecting every operation in the enterprise every minute and just kind of dumping it in there.
Starting point is 00:21:08 And it's only taken up, you know, 200, 300 meg a day. And I'm running it off of, you know, just a little two core VPC. And, you know, we're experimenting around with saying, OK, you know, this endpoint, how is he supposed to behave at 10.03 a.m. on a Tuesday? You know, from from kind of a whole enterprise perspective. I got two quick things to say here. Sorry, Emperor Wilson, I didn't want to interrupt you. Oh, no, no, no. Go on, Andy. Two quick things for the listeners that are not familiar with some of the terminology.
Starting point is 00:21:39 DC RAM is our network-centric APM product. And you said, you know, it's kind of sniffing from the bottom up and great for the enterprise to monitor network traffic. But Brian, the other thing, it seems what you are,
Starting point is 00:21:54 you're trying to solve a very interesting problem that a lot of APM vendors, including Dynatrace, also try to solve out of the box, which is, you know, applying machine learning, applying artificial intelligence actually on top of the data that we collect and then alert you in case something
Starting point is 00:22:13 is out of normal behavior by looking back at historical data, by looking back at particular endpoints and how they behaved a week ago if you have weekly cycles, a month ago if you have monthly cycles or whatever cycles you have. Wouldn't it be amazing if there's a tool that could do this, Andy? Say that again. Wouldn't our listeners love to know if there's a tool that can do this? It would be like a Christmas present when it's four days before Christmas. That's almost impossible.
Starting point is 00:22:41 Or do you think something like that exists? I don't know. Tell me more about this. Yeah, you guys. But I want to look like a total hero, right? I want to beat my head up against a rock. And yeah, no, it's the same thing. I'm totally I think I totally agree.
Starting point is 00:22:56 That's kind of where everyone's going. It turns into this kind of machine learning something. You need something that can be able to detect these things. And and yeah, so I do realize it's like, oh, this might just be few, you know, foolhardy and, you know, just gotta, I should just wait for the machines to take over. The beauty is you don't have to wait and I hate to be a commercial now, but you know, that's what we do. That's true.
Starting point is 00:23:23 That's true. Yeah. now but you know that's what we do that's true that's true yeah so basically for the for the listeners maybe go on dynachase.com and check out davis and check out artificial intelligence so basically we we we saw the trend what you guys have been building that's great right you can you can obviously do this with uh with uh our epmon data our, our DC RAM data but we also try we saw the trend and that's why we came up with our out of the box artificial intelligence
Starting point is 00:23:51 engine that we put into our product but that's I think enough with the commercials if people want to learn more, Meet Davis I think is a great way to start if you Google for Bing or search for Dynatrace, Meet Davis then you will find more. I wanted to ask Brian, before we go on,
Starting point is 00:24:08 I wanted to ask Brian just briefly in case some other users out there might want to do something like this. You mentioned the free Splunk version. And as far as the 500, was it 500 meg daily data cap? Is that, are you coming in below that with both Dynatrace Appmon uh data and dc rum data
Starting point is 00:24:27 or is that just the dc rum data on its own that's just the dc rum data so we're not quite feeding in the the business transaction feed into it quite yet because we do suspect that we are in qa and we're just kind of putting together models of how we're going to slice and dice the data and deal with it um but but yeah that's just DCROM today. Right. And the general concept there is, right, is Dynatrace collects all this data and we have a lot of ways of presenting it. But if you want to get extremely creative in how you want to slice and dice and do complex multi-dimensional analysis of the data, you can put it into something like Splunk or Elasticsearch and run any kind of
Starting point is 00:25:07 queries and correlations against that data that you can imagine. So it's interesting that Splunk has that free cap so that you don't have to necessarily set up. I think I looked into doing a Kibana Elasticsearch setup and it required, just for the base setup, a significant amount of horsepower just to get that running. So I'm just looking to play with the idea of seeing what you can do once you export the data. It's nice to know that Splunk has that free option there. Anyhow, moving on. Hey, Brian, still staying on the topic.
Starting point is 00:25:45 Besides response time and failure rate, any other metrics, measures that you're pushing into Splunk and then doing your anomaly detection on? So right now, it really is just response time, load, and failure rate. Well, actually, it's really just response time and load and software services and operations times. Once we kind of put this model together, we'll obviously put in – we'll dump in more metrics like that. But that's totally important too, right? of load and, you know, a certain X response time for some endpoint, but you also suspect, you know, X amount of 401 response codes or 200 response codes, right? You know, because you're going to have a normal amount of 401 response codes when a service, you know, hits some other service and challenges it for authentication, you know, things like
Starting point is 00:26:38 that. So that's definitely, you know, that's definitely another good piece of it. And yeah, and one other thing, too. One of the things that kind of stuck with me when I started here where I'm at is one of the management said, what's important here is we try to cut down on the amount of smart guy correlation we need to solve performance issues. That goes right into the whole Davis thing too, right? It's, you know, once you set up this system that can kind of start telling you how to find these things, you can have your engineers not have to sit there and compare graphs and cross-check things
Starting point is 00:27:22 and, you know, kind of look like I use a GIF image on one of my slides that kind of, you know, Zach Galifianakis in The Hangover where he's got all the graphs and stats going over his face. You know, you kind of have your engineers doing that and wasting time doing that when, you know, obviously, you know, they could be doing a lot of other productive things like, you know, just the message of APM, right? You spend more time more time you know one of the messages spend more time innovating you know no less time and in fighting fires yeah hey and um i mean i like that maybe some additional metrics maybe you have them already on the list but i can think about something like the number of bytes sent and received per endpoint because that immediately allows you to see if maybe an API change or maybe a new deployment all of a sudden is causing some issues on the amount of data you send
Starting point is 00:28:11 back and forth. Maybe somebody forgot to turn compression on on a certain layer and then you send so much more data over the wire because Dynatrace and DCRAM obviously capture package sizes and network and then request and response sizes. Also, I believe the, and what we talk all the time about number of database queries being executed, number of web service calls, arrest calls, microservice calls being executed, number of threads being involved. With the PurePath, you see how many threads are involved.
Starting point is 00:28:42 And I recently just, I think I blogged about this. It happened to ourselves within Dynatrace. I mean, our DCRAM team was actually using AppMon to analyze DCRAM and they made some quote-unquote optimizations, but then what actually happened their optimization was actually
Starting point is 00:29:00 spawning a lot of background threads to do work parallel because they thought it's faster which for a short term it was faster but they were soon running into the boundaries of the number of threads they have in their worker pools so basically they were just filling up all the threads doing a lot of stuff in parallel slowing down the overall system and that was also very interesting and so the number of threads per request, and they did also do this by endpoint.
Starting point is 00:29:28 These are great metrics that you can then, from an operations side, obviously give back and say, hey, since the last deployment, we saw a change in behavior because now we are consuming twice the amount of threads. We are logging five times as many log messages. So log messages per entry point and all these things.
Starting point is 00:29:51 Great, great values. Yeah, go ahead. I was going to ask Brian, I remember when we were preparing for this, you kind of had a similar story with hanging some other services off of an API call. I don't know if you wanted to kind of tie that into what Andy was talking about with monitoring those other components and maybe how that all ties in. Yeah, so this is kind of, you know, you always have those epiphany moments. And this was one of those that were like, okay, this is something we need to pay attention to, and this is important, right? We had this API call that it was one of those big hitters, about 18% of the traffic. And it kind of gets into how you architect different services.
Starting point is 00:30:36 And you have to be smart about, well, okay, we might want to invoke this service when a user does some, you know, action on the GUI, but you got to be able to understand the implications of that. So in testing, we realized that, hey, you know, if we tie this service call to this API call, that means that API call is also tied to this checkbox on the GUI. And you get some fidgety guy that just likes to check boxes on and off, you know, in a GUI. That's, you know, I would say that's part of the reason why it's high volume, but it's high volume because it's just, it's used a lot. It's a function on the GUI that's used a lot that it's enough to, you know, make 18%. So those are the types of things that, okay, this is important, right? You know, we have humans using these very popular things on the GUI of the app, which, you know, makes these underlying API calls heavily
Starting point is 00:31:30 used. And then you have this exponential, you know, increase in service calls due to that. So, I mean, this particular one was sort of a document sharing system between an FA and a client. And we noticed in testing, okay, you know, this call is getting made a lot, and we're going to basically see this exponential increase in the back end of this service call, you know, syncing documents between the FA and the client. So that was something that we could say, whoa, okay, we better not run off that cliff, you know, before we run it into production. So that worked out well. And in that same vein, one of the things that we're doing today is we kind of slowly onboard functionality, sort of like, you know, A-B testing or, you know, some companies, they'll convert certain users
Starting point is 00:32:22 over to some piece of functionality in an app. And right now, we have a new CRM system we're kind of bringing out. And today, we have about 1% of our users converted over. And we're able to look at, okay, how is this being exercised? How is the performance? Is it deterministic, meaning is it 300 milliseconds solid, just like our old legacy version of this service, or is it all over the place? Right. And if it's all over the place, we catch it now while it's 1% of our users using this, then in feedback to the dev and QA teams before, you know, it's 50% or 100% of our users using this. And besides just the response time and all that, if you're looking at all those other metrics
Starting point is 00:33:07 that Andy's talking about, the service calls, the database queries, the threads, all these other components, when it's at 1%, you have a chance to figure out what's that model going to look like when we push everybody else over and is this model going to survive?
Starting point is 00:33:20 So going back to Andy's points, that's why monitoring all these things in those production environments is so, so key and important. Yeah, it's human behavior. It's really, it's impossible to replicate. It's very hard to replicate. So, I mean, this production type data is just gold to, you know, or teams to the left, right? Yeah, and I think it's just hopefully with all this data,
Starting point is 00:33:46 and if you really feed it back from operations into test, but also into dev, especially you should dev the resource utilization of their features, and also if we can take some kind of a cost factor, because in the end, and I think this is something I brought up in one of the latest in the previous episodes, trying to educate all the engineering teams that we not only need to build software that is fast from a response time perspective and super nice and user experience friendly, but also that is efficient.
Starting point is 00:34:16 Because in the end, I can write super fast code, but spawning 500 parallel threads and do something very strange and write millions of log files that nobody needs, but I need to feed it into Splunk, but then nobody cares about it. So I think these are the things we need to feedback as well to say, you know, great feature that you built, but it's too costly because of this, this, and this metric. Yeah. Yep. Hey, and for your AP testing, I'm interested in what you just said. So in your case, are you selectively onboarding individual users by changing their user profile and then you know they're getting redirected to that particular server? Or is it every server has the same code base and you just turn it on depending on the user?
Starting point is 00:35:00 Yeah, so it's a function of – it's somewhere in the – I don't know the technical specifics of it, but essentially they can grab a list of users, what they're doing today, and they say, okay, these users are going to be converted over to the new service. So if you hit F12 in their browsers, they're calling a different API call than 99% of everyone else. So just the behavior in the F12 tab is different compared to them. But GUI-wise, they're seeing it's the same thing to them sort of thing. But is the code of, let's say, the B version of it, is the code running on different separate hosts? So they're totally separated or they run on the same host as the other code runs too? And just the flag then defines, hey, this code now executes this year, and the code on the same host is executing now in another path.
Starting point is 00:35:49 Yeah, that's essentially it. So like on the API layer, sort of the front door of the data center, if they hit, there's a flag that says, okay, these users are going to take this route, and these users are going to hit this other route. And then, yeah, the users that are converted over are spraying across the whole other system. And then all the other users are there. This API call basically points and sprays over some other system over there sort of thing.
Starting point is 00:36:10 So that's kind of how we're doing with it right now. Cool. Are you leveraging user experience monitoring too? Yes, actually. We had an interesting, so one of the new you know you always atmon is is is funny you always think that you know all the little nooks and crannies and you always think of find some other cool cool thing to use it uh use it by and uh so there's this one app that we're just rolling out and it's another kind of slow rollout where you were we're
Starting point is 00:36:43 bringing out i think it was 20 or 30 people that kind of volunteered to be an alpha tester for it. And, and, you know, we're, we're immediately finding ways that the QA testers couldn't or didn't think of ways to exercise the app or exercise the functionality. Right. And so one of the things is they re they found some, they found some user action coming supposedly from a completely linked from a different app in our enterprise. And they're trying to figure out, well, how is this happening? Or we didn't expect them to kind of take this route in this app. And it's those interesting things you can find. So this sort of cool new functionality I learned from a user experience analysis perspective in Atman is that you go to the
Starting point is 00:37:25 visits tab, you type in a username, and then if you don't filter any apps, it'll show you, okay, well, here's this user. And then they might have multiple visits based on them visiting multiple apps in your enterprise. So what we actually ended up doing is we control clicked all three of their visits and drilled down into the user action pure paths from there and then you get a complete sequential list of their entire journey throughout the entire day overlaying all these apps so what was cool is we were able to see oh you know they they were in you know they were in app a and then they hit a link in app a which then
Starting point is 00:38:02 brought them to app b and then the in the dynatrace GUI, you see a switch apps within the user action PurePath stashlet. So I think that's really cool because we could watch this user bounce between different apps from an enterprise perspective, not just kind of like an application perspective. So that was something that was kind of cool and new that we kind of stumbled upon. So that helped a lot too right there. And so we always learn something new about our uses how would they do the strange ways the strange paths they're taking yeah that's pretty good humans are so unpredictable you know if we could all be machines exactly yeah yeah that'd be good anybody is anybody watching uh westworld by any chance
Starting point is 00:38:42 on hbo oh yeah i watched the first episode, but – It's very creepy. It's very – have you – Brian, have you watched the whole season? Oh, absolutely. Yeah. And I'm one of those freaks that has to immediately go on all the internet forums and read up all the theories. And, yeah, it was actually – the internet did so well at figuring out all of the major spoilers before the finale that it was like oh okay i know all this so maybe season two i won't you know i'll just kind of enjoy and
Starting point is 00:39:11 watch the the show but it's a very good show yeah it is and mr wilson emperor wilson you should check it out you know i'm dying to see the original west world movie the one with yul brenner um it just hasn't been on streaming yet i haven't been able to see it it's uh probably uh late 60s early 70s you know style sci-fi starring yul brenner king and i the guy with that shaved head you know and i think there was a a quote from him if i if i didn't get it wrong uh about him saying that him playing that machine was one of the most comfortable natural roles for him, which is pretty awesome. But I really like to see the campy version.
Starting point is 00:39:50 You know, I watched the first episode. I don't have too much time for TV, and it just didn't hook me in. But maybe at the recommendation of the two of you, especially another fellow, Brian, I'll have to go back and at least give it a few more episode shots, you know? That's right. you're on a major binge watch now it's all streamable yeah hey uh brian i got one more question coming kind of back to how we started you know we started with a operations gave especially testing the insight on which apis are hit how often so you can actually model better loads and distribution of loads for your load tests. Now, how do you deal with test data?
Starting point is 00:40:29 Because I assume the only way this really makes sense is also having good test data. Do you also replicate and does operations provide test data to pre-prod, kind of tearing down this wall as well? Yeah, they do to an extent um i'm not i don't know the specifics on it but i do know that they do do that to an extent that that is that is important um to be able to do you know just to be able to exercise and replicate uh replicate the functionality you know as it should be right yeah because there's you're right there could be so many different uh they well they at least emulate the diversity of relationships that you
Starting point is 00:41:08 might have. Um, they, you would say, okay, well on average, you know, we have a mix of, you know, a 20 clients to one FA kind of relationship. So they might replicate, okay, okay. This is the kind of mix of relationships that we need to be able to exercise. Because one service call could be very heavy to another service call, even though they're the same endpoint, because you might have an FAA that has 40 clients compared to 100, and they're going to pull how much data back. So, yeah, they definitely mix the different types and combinations that of data that you can have there cool and um yeah that's interesting i just wanted to make sure that you know i was interested in what you guys are doing and are you um have you ever played around
Starting point is 00:42:00 with load testing in production have you ever thought about something like crazy like this no not not quite although uh so at a different organization though we have thought about it uh because it was before i was i was with raymond james uh we were having we didn't quite have our prod environment scale or qa environment scale to our prod and And it wasn't, and the way the app behaved, it was kind of these image creation, this image creation server, right? And if you had one of them, basically, that the speed at which it can create these images was a function of how fast the disk could read and write. And it was not a linear scale. So if you had one of these image servers and then four of
Starting point is 00:42:47 them it wasn't like okay you can push you just need to push one fourth of the prod data to it it was like it would actually behave a little bit differently if you had four servers um like it wouldn't just be okay you can push four more through like the the disk would actually lock up at different rates so we thought about that you know waking up at like three in the morning or something like that we didn't end up doing that but um but yeah it's not not not something we're doing we're doing today i would say cool well gentlemen uh i think i mean i i thought this was an amazing discussion especially around the kind of like coming back to the title.
Starting point is 00:43:28 We have to love our data because there's so much we can learn. And I'm really looking for when this show airs and we can then show the charts we all here looked at, kind of showing the audience on what you can do with uh these flame charts that are not flame-ish but i have been other colors but really we look great but you do actually you do have flame charts right the third dashboard that we have in our email here includes frame chart flame charts are you talking about the uh i'll have to pull it up i don't know if you're talking about the red wave of death where you get the yellow, green, red. Oh, yeah, yeah, yeah. So that's good.
Starting point is 00:44:07 Yeah, you can use flame charts, red wave of death. Either works. We actually have a fourth color in there. You don't see. But the fourth color is the purple wave of death. So that's just performance right there. You know, green, you know, well, from an API layer, I'm sorry. So in UEM, red will count as a frustrated visit as if somebody throws a 500 or 400 or something like that.
Starting point is 00:44:28 But from an API layer perspective, we have a fourth color. Okay, was it slow, fast, or okay? Or did it throw a 500 or 400 error? And that's purple. So if we have an account lockout or something, you'll see a giant purple wave there because the API calls at that point are very fast, but they're failing very fast, right? They're at 20 milliseconds. So if we didn't have that fourth purple color in there, it would just be a nice – it would say, hey, everything's great when it's not. So, yeah, that's what we have for all our apps there.
Starting point is 00:45:03 That's why one must always connect response time and failure rate because that's the only – Exactly. Yes. That's only thing. Yes. That's very important. Yeah. That was awesome. Yes. Thank you so much for being on. Andy, did you want to do any other sort of summary?
Starting point is 00:45:16 You haven't done one in a while. I know. I know. I mean, my summary is really I believe modern operation teams and coming back to what I will be talking about at the webinar, which happened the week before this one airs, but I believe modern operation teams need to break down their walls from the right to the left and providing better meaningful data about the applications, about the patterns that people, that users use the applications, also about dependencies when we talk about all these metrics like the number of web service calls, the bytes sent and received and all that.
Starting point is 00:45:54 Because in operations, you really have a view of your real users and how they use the app. And you can level up your testers by allowing them to model better load tests, like in your case. And you also can give direct feedback to your development teams and saying, hey, whatever you just did, first of all, people love your feature, but the feature is doing something very weird and you're consuming too many resources, right? I mean, this is perfect, I believe. So ops, I believe, can step up. They don't have to wait for DevOps to happen from the dev side, but I believe ops can start themselves and make a step towards the testers and the development teams. So that's what I believe. Perfect.
Starting point is 00:46:50 Very good. I just want to remind everybody that Perform is coming up and Andy and I will both be running some hot sessions on Perform and I actually just noticed my session is not on the page which probably explains why we don't have many people signed up for it.
Starting point is 00:47:07 I'll be doing an e-commerce monitoring hot session the second half of the day on February 6th. Andy, which ones are you going to be doing Perform in Las Vegas this year? I'm doing the shift left continuous integration session. I'm doing one in the morning and one in the afternoon. So the idea is we're building a Jenkins pipeline using the latest
Starting point is 00:47:27 Jenkins pipeline feature, having a Spring Boot app with two microservices. So setting up the pipeline, pushing the app through the pipeline and then simulating some bad code changes
Starting point is 00:47:40 and then seeing what happens if you don't have Dynatrace in there and then another one where we can use Dynatrace to stop the bad code change before the change goes into production and kills everything. Excellent. And Mr. Chandler, are you
Starting point is 00:47:52 going to perform this year? I am. In fact, I'll be in a breakout session I think the first day. I think it was creating performance tuning heroes or something along those lines. I believe we're going to be with Mr. Thorsten Roth, my manager and I, going through much of the same, not exactly everything we talked about today,
Starting point is 00:48:13 but I'm sure we're going to touch on it a fair bit. But, yeah, come on over to our breakout session. Excellent. Are you wearing a cape because you're a superhero? You know, I've been trying to pick out which Avenger, you know, our team is. And we've got to, yeah, we'll have to, we're still out on that one. The jury's still out. We've got to figure that one out.
Starting point is 00:48:32 That also means tights. Yes. So please. Will that make more people come to our room? Yes. We kind of post that. Yeah. All right.
Starting point is 00:48:40 We'll be sure. No matter what, it will. Yes. Wear tights. Absolutely. Please. Andy will wear his leader hose in if you wear his tights. Andy, I just committed you.
Starting point is 00:48:47 Bring it home with you. Anyhow, thanks a lot, everybody. Any final words from you, Brian? I know we're kind of rambling here, so any quick last words there for you? No, that was great. And just like Andy said, it's a great opportunity, these types of tools, being able to analyze this data in production to, yeah, get more involved, you know, on the lifecycle as a whole, right? I mean, you become more than just, you know, watchers of green check marks and red Xs, right?
Starting point is 00:49:14 You can actually feed back this rich data and be a really, really good part of the process. Absolutely. And then shameless plug for my Twitter. You can follow me at channer531, C-H-A-N-N-E-R-5-3-1. And yeah, that's about it. All right. And you can follow us at Pure underscore DT. I am at Emperor Wilson.
Starting point is 00:49:35 We also have at Crabner Andy for all the Twitters. And don't forget, you can also, if you're a YouTuber, we're now publishing these to YouTube. So if for some reason you'd like to have a video playing with a static image while you listen to the audio, you can do that as well. That's all for me. I'd like to thank everybody for listening. And thank you, Brian, for being such a gracious guest today. And Andy, thank you as always. Thank you, guys.
Starting point is 00:50:00 Thank you. Bye. Andy, go get some sleep. Go get some sleep, Andy. I'll get some sleep bye bye bye
