PurePerformance - 026 Love your Data and Tear Down Walls between Ops and Test
Episode Date: January 17, 2017

How often have you deployed an application that was supposed to be load tested well but then crashed in production? One of the reasons might be that you never took the time to really analyze real-life load patterns and distributions. Brian Chandler (@Channer531) (https://www.linkedin.com/in/brian-chandler-8366663b) – Performance Engineer at Raymond James – has worked with their Operations Team to not only start loving application-specific performance data captured in production. They started breaking down the DevOps walls from right to left by sharing this data with testers to create more realistic load tests, and they also started educating developers to learn from real-life production issues. We hope you enjoy this one, as we learn a lot of cool techniques, metrics and dashboards that Brian uses at Raymond James. If you want to see it live, check out our webinar where he presented their approach as well: https://info.dynatrace.com/apm_wc_getting_started_with_devops_na_registration.html

You can view the screenshots we refer to at:
https://assets.dynatrace.com/en/images/general/Chandler_01.jpg
https://assets.dynatrace.com/en/images/general/Chandler_02.jpg
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance, our second episode of 2017.
Hello, Andy.
Hey, Brian.
Hey, so before...
What's up?
Nothing, but you're calling me Brian right now and our guest's name is Brian.
So I'm going to attempt to be respectful and allow our guest...
Hey, guest Brian, no speaking until we address you. I'll defer to him as Brian, and I will give
you permission instead to reference me as Emperor during the show, because that is...
You're so modest.
Yes, exactly. So you can refer to me as Emperor, and our guest Brian can
be Brian. So speaking of our guest Brian, Andy, would you like to give a brief introduction
before he introduces himself?
Uh, sure. Well, so here's Andy again. I'm a little jet-lagged.
I just got off the plane, and I think I've been up for 36 hours now, and thanks to the coffee
that keeps me going. And I'm really happy today for this webinar.
Well, actually, not a webinar.
It's a podcast.
See, my mind is a little strange, but it's going to be fun.
So here's the thing.
The real Brian, the Brian that we call Brian today, we have known him for quite a while.
And the last time I bumped into him was in Tampa, Florida, at one of our Dynatrace user group meetings.
And at the end of the meeting, I believe, Brian came over to me and said, hey, you know, there's some cool stuff I want to show you. And he showed me some dashboards that Brian has been using in production to actually make some sense of the data to better understand the behavior of the app and actually some traffic patterns. And then he actually, I believe, came up with a term that we use for today's episode,
which is why you have to love your data and tear down walls between ops and test.
And actually, I think it's great that we actually call it From Ops to Test
because I think this is some stuff that we can learn today for operations
on how they can monitor applications better, how to monitor them,
and then feed data back to test.
And before I really let the Brian introduce himself,
I also want to say,
this is going to air, I believe,
Mr. Emperor Wilson,
mid of January,
just a couple of days
after I've been doing a webinar called
DevOps for Ops,
kind of the first steps.
And thanks to the Brian that I'm now introducing, I can actually also use some of his screenshots.
And so people will actually see some of the stuff that we're talking about today.
Without further ado, Brian, welcome to the show.
I'll do further ado.
I just want to tie in.
It almost sounds like what you talked about from shifting from ops to test ties in a little bit to the last episode with
shifting load right in that kind of special way. So anyhow, go on now, now back to Brian.
Brian, special guest, Brian Chandler. How are you doing?
Good, good. Well, you know, first, first I want to say I'm honored to be carrying the title of
Brian today. Thank you for that, dear emperor. It's good. Yeah, so yeah,
really excited to be here. I'm just, I guess, a little bit about me. I'm a systems engineer over
at Raymond James right now on the application performance management team. We're a relatively
new team there. It's the third organization I've worked with in the APM area. I've been working
with a pretty wide range of app types, internal facing, external facing, customer marketing apps.
You know, APM is just one of those things that kind of made sense and didn't have a problem getting indoctrinated into the whole philosophy.
I mean, something amazing happens when somebody clicks or checks a box when you hit an app.
It sprays across 40 servers, however many data centers, all those functions.
So it's just a really exciting world to be in, and, yeah, happy to be here.
And it sounds like... it seems he has a radio voice.
It sounds like he's been doing this before.
Well, you know, I went out and got one of those fancy mics, of course, off Amazon.
Yeah.
That's cool.
See, he one-upped you.
He's not using his headset.
Yeah, yeah, I know.
I'm using my headset now.
So, hey, Brian.
Now let's dig into what you showed me in Tampa the other day.
And because I believe it's a fundamental lesson that hopefully – well, I think – I mean people should know about this, but I believe what you showed in your dashboards and the way you visualize it, it's a great way for operations to start thinking about the app not as one big thing and every endpoint is equal, but every endpoint has its own weight and its own priority.
Please explain.
I'm looking at the dashboard right now, and it's labeled client-centered daily traffic pattern.
Can you explain a little bit more about that and what you found out?
Right. So that's – what you're looking at there is just the traffic pattern of our largest FA app.
And when I say FA, I mean financial advisor. The app supports about 6,500 financial advisors or so right now.
So a pretty wide range of users that manage
tens of thousands of investors. And the, what I'm trying to paint with that picture there is to
really fundamentally how important it is to understand how humans kind of organically use
the app, you know, and it's way different than what a QA team might intuitively think
is the way to test something, right? If you think of an app that, like this
one, for example, has 58 or so API calls or high-level functions that the browser will go out
and hit when you check a box or hit that tab, a QA tester
somewhere might think, okay, well, I have this list and it makes sense if I can just write a script that exercises all of these evenly.
Right. That might be kind of the intuitive answer, but that's totally not the case for really any of the apps.
I mean, it turns out for us, three out of those 58 API calls make up about 60 percent of our traffic.
So quickly you realize that you cannot treat all of them as equal citizens, right?
If you're going to hang some type of service off of one of these API calls, it's going to have a fundamental impact on your app potentially.
So what we do over at Raymond James, we have this really, really cool feedback loop with the QA team. We show them basically that graph.
Every day they can pull that up and look at today's behavior from our users.
And we say, okay, your tests, when you write them in LoadRunner or whatever tool you use, it's got to look like that picture.
And you know you've exercised the app correctly when it looks like that picture.
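That feedback loop is easy to approximate outside of any particular tool. Here is a minimal sketch, assuming you can export per-API request counts from both prod and the load-test environment as CSVs; the file names and the api_call/count columns are placeholders, not an actual Dynatrace export format:

```python
import pandas as pd

def traffic_mix(csv_path):
    """Return each API call's share of total traffic as a percentage."""
    df = pd.read_csv(csv_path)                      # assumed columns: api_call, count
    totals = df.groupby("api_call")["count"].sum()
    return (totals / totals.sum() * 100).round(1)

prod = traffic_mix("prod_api_counts.csv")
test = traffic_mix("loadtest_api_counts.csv")

# Side by side: a load test "looks like prod" when these two columns match.
mix = pd.DataFrame({"prod_pct": prod, "test_pct": test}).fillna(0.0)
mix["delta"] = (mix["test_pct"] - mix["prod_pct"]).round(1)
print(mix.sort_values("prod_pct", ascending=False).head(10))
```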
So let me ask you the question now. That means you kind of tore down the walls between ops and test because these two teams
are now really actively sharing data on a day-to-day basis, huh?
Right, yeah.
You got to share the data.
It's got to be – it's like free love.
It's got to be free data for everyone, right?
Go ahead.
Yeah, sorry.
And was this initiated by ops
or was this initiated by test
or what initiated the whole thing?
You know, it was actually our team.
Like I said, we were kind of
a relatively new team in this last year.
And we own, you know,
the DT tools in all environments.
And, you know, we're working
with these QA testers
and, you know, we're kind of exploring the things that we can do for them. And it's, you know,
we just say, hey, well, you know, your traffic patterns are looking like this,
and you could look at the LoadRunner scripts they were
originally kind of putting together, and the distribution wasn't exactly what it is in prod.
And that just has huge implications there, if you let something kind of go through with that kind of distribution.
And, yeah, it was just sort of a communications thing, right?
We sat down and kind of showed them, okay, well, this is what prod looks like.
And this is what it looks like we're going to test here in LoadRunner.
And we've got to adjust this a little bit.
And, you know, I'm looking at this and I'm thinking, yes, this is basic in a way, but this is
such a leap forward. If I go back to my old days of load testing, trying to get this kind of
information out of ops was near impossible. They'd maybe have a web log, and, you know,
try making sense of that. This is not only visual but very, very easy to read. And, I mean, heck, in the old days, we just kind of used to make an educated guess on what the traffic might be.
This is kind of game changing.
And it's not even like what we're looking at right here is something new that you can do in the APM tools.
You know, this has been around for a while, but I think just the concept of sharing that data, breaking down that wall, as you're saying, and sending the data back to the testers is quite incredible.
Yeah, go ahead.
The visualization, I mean, I almost want to say you created kind of like these flame charts, even though you chose obviously total different colors, not really flames.
But I think you have different layers.
You have different colors for the individual APIs.
And if you have 58 different ones and the top three, as you said,
consume or sum up to 60%, then you chose to visualize.
Sorry, Emperor Wilson, can we also post the images later on?
Yeah, let's just make sure we have them scrubbed nicely if we need to, and I think we can. At least,
well, we'll have to see. Maybe at the very least we can provide links to them somewhere, but
we'll do something, we'll get it out there somewhere. I think it's
a good point. It's important to be able to see. I know we're talking about a graph and a chart, and if
you can imagine it, I want you to imagine a graph. But I think the visualization
really helps with the concept. Again, it's not groundbreaking when you look at
it, but the groundbreaking side is, well, yeah, let's do this, you know, and making that leap
to share that data is where the groundbreaking comes into it.
It's really just how you augment the data, right?
I mean, you think of these tools and, you know, you think of the pure path and the transaction tracing.
But it's also, to me, they're just as much, you know, data analysis tools and behavior analysis tools for your users as much as performance, right?
And it's basically a stacked bar chart, right?
It's a stacked bar chart, and every measure is basically the volume of load
that comes in on a certain API, in different colors,
and you immediately see, hey, here are your top three,
they consume 60% of the overall sum.
It's just visually very appealing, interesting, really cool stuff.
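For anyone who wants to rebuild that kind of view outside the APM console, a stacked chart is straightforward to produce from the same per-minute counts; this is only an illustrative sketch, and the CSV layout is an assumption:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed columns: timestamp (one row per minute per API), api_call, count
df = pd.read_csv("prod_api_counts_per_minute.csv", parse_dates=["timestamp"])

# One column per API call, one row per time bucket, volumes stacked on top of each other.
pivot = (df.pivot_table(index="timestamp", columns="api_call",
                        values="count", aggfunc="sum")
           .fillna(0)
           .resample("15min").sum())

pivot.plot(kind="bar", stacked=True, figsize=(14, 5), width=1.0)
plt.ylabel("requests per 15 minutes")
plt.title("Daily traffic pattern by API call")
plt.legend(ncol=4, fontsize="small")
plt.tight_layout()
plt.show()
```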
Yeah, it's a really good dashlet too.
I mean, it's something when you pull up the table view,
you have the half dashlet, half table view.
You can actually go through and just kind of hit space bar,
turn certain ones on and off.
You can see, okay, what if I turn off this API call, how does that change the shape of the
behavior of the app. And you can see, if I turned off one of those top three there,
it would change the whole shape of that graph. So it's kind of just fun to mess
around with that in that regard, too. And I think what's also really nice, because
you talked about this, you're using in this case Dynatrace, and you're using it in production but also in testing.
That means you can easily create the same dashboard in both environments, which then makes it so easy to compare them.
And this is what you also show in your second chart that you sent me, where you see the difference between prod and what you simulate in load. And that was, I guess, when you realized that you're load testing something totally different
that does not at all apply or does not at all reflect what's actually happening in production.
Right, yeah.
It's interesting because it kind of tells you, you can almost kind of point out,
you can tell the story of how the scripts were written.
You can see, instead of the nice, organic, sort of steady hump you see in prod of all these different functions kind of coming in organically from various visits,
in the LoadRunner tests you can see, obviously, okay, they started 500 users right now, and they exercised all the API calls, and then the LoadRunner test stopped, and then it geared back up again, and you have this sort of spiky, weird shape, so you can see the whole story of the load test right there and kind of point it out.
It's just kind of all these – it's a series of peaks instead of kind of like this kind of steady stream of usage of all these API calls that you see in prod. So yeah, just being able,
like I said earlier, like the QA team, you know, they, it's great. They can pull both of these graphs up in real time and they can look at them both and say, do our tests, the things that we
are exercising right now today, is it matching what we're seeing in prod? So yeah, it's been
super useful.
And basically, you can see two things here: you see an unequal load
distribution and an unequal distribution between the API calls. So basically, what I mean is, if you run
these tests, you're testing something totally different,
and therefore even if your load test succeeds, it's almost like a failed test, because you're obviously never
going to have this pattern.
Yeah, yeah, because you're not hitting the right distribution, right?
I mean, again, we can follow up, we can even do a companion blog or send
out the images or something after this. But you can see that their tests actually
hit quadruple the total volume of the app,
but it only hit about a third of what it was supposed to hit volume-wise on the key API calls.
So yeah, it was a failed test from the point of you didn't even exercise the right muscles of the app.
Yeah, and this also kind of opens up a Pandora's box of modeling that production test,
because besides, and this, Andy, I'm harking back a
little bit to the previous episode with the concept of shifting right for the load,
where we talked about capturing the right metrics in production so that you can make sure
you're modeling the same way in your non-production environments. So on the surface,
we're talking about the different API calls, but you can even take this further then and say, okay,
if we got the same model of API calls, is our test environment also then giving us the same
number of threads, the same number of queries that are being executed? And you can keep extending
that comparison along different metric lines of
metrics to make sure that you're, you know, getting a much closer production model or discover that
there might be something wrong in your test environment setup where maybe you're getting
the right API model, but you're not exercising the same amount of database queries or threads,
which might mean that you're possibly, you know, maybe you're using the same search terms over and over or something else like that.
Right. The danger is, though, you know, where do you kind of stop in that model?
Because you can go deeper and deeper and deeper.
You know, it sure is a rabbit hole for sure.
And yeah, once you start. Yeah.
And yeah, like you said, the API layer, just the usage of that, that is the tip of the iceberg.
I mean, you get into, you know, an app that, you know, it sprays over 40 different services, you know, hundreds of different URLs and functions that it calls underneath that.
And it's like, okay, yeah, well, now you got to start getting the right traffic trend on, you know, everything, you know, going underneath the iceberg there, so to say. But it's also very, very important to do that in a way
because if you do have that model set up well,
and even if it's the first layer below,
even if you're just looking at the numbers of threads maybe
and queries and connections and things like that,
if you do get that model correct
and then a new code push comes into the pre-production
and something gets thrown off,
then that's when you'll be able to say,
hey, this is going to make a change in this way in production.
Yep. Yeah. And it turns out, yeah, I mean, you can, it's great. And you can almost kind of start
turning the knob on, okay, if we introduce this service to this API call or introduce the X number
of calls to that, it's going to hang underneath this function. You kind of just got to do, it's
a little mathy, but yeah, you can definitely start turning the dial on those things
and figuring out exactly how it's going to impact your app as a whole.
That's because you have all these architectural metrics, as we call them, right?
You have to view path end-to-end, and then you know this transaction is making these many calls, and if you're now shifting things around,
you can immediately see the increase in calls to the backend,
calls to the database,
when you start changing the way the services call each other
or the way if you change the implementation of the service.
That's pretty cool.
Yeah, so one of the things we're starting to quickly realize,
and it's like, okay, how deep do you want to go? Do you want to go here? And then what impact is that going to have on the system over here, sort of thing.
It's kind of like, you know, when we're talking about this stuff with my team at work, it's kind of like the Internet meme, you know, Ancient Aliens, where the guy's like, it's aliens.
It's all aliens.
It's always aliens.
It's kind of like it's all anomalies.
That's all it is.
You see, you know, you can start, you know, so one of the things that we're starting to do is querying all these backend services,
like you see in the PurePath and all that, and starting to, you know, dump it into, you know, hardcore data analysis tools.
I mean, there's a lot of ways you can approach this.
I mean, the way we do it is, there's a free Splunk version that we just
sort of experiment with, but you can obviously do it with Logstash. I know Dynatrace, you guys
have the BT feed that can go into Logstash, Elasticsearch, Kibana, and that all works
just as well too. But we're starting to say, okay, if you really want a good anomaly
detection system, it's not good enough to do sort of a rolling minute-by-minute or hour-by-hour baseline.
You've got to be able to go back, say, four or five weeks for any given downstream service call
or database call on any given minute, like 10:03 a.m. on a Tuesday,
compared to the last 10:03 a.m. on a Tuesday, the last five or so, right?
I mean, to really zone in on, okay, what is not normal and what kind of changes are you making with the system when you introduce new things?
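A rough sketch of that "same minute, same weekday" baseline idea, assuming you have a flat table of one-minute samples per service; the schema and the threshold are made up for illustration:

```python
import pandas as pd

# Assumed columns: timestamp, service, response_time_ms
df = pd.read_csv("service_minute_metrics.csv", parse_dates=["timestamp"])
df["weekday"] = df["timestamp"].dt.dayofweek
df["minute_of_day"] = df["timestamp"].dt.hour * 60 + df["timestamp"].dt.minute

def is_anomalous(service, ts, value, history=df, weeks=5, tolerance=3.0):
    """Compare a value against the same minute on the same weekday over the last few weeks."""
    baseline = history[
        (history["service"] == service)
        & (history["weekday"] == ts.dayofweek)
        & (history["minute_of_day"] == ts.hour * 60 + ts.minute)
        & (history["timestamp"] < ts)
    ].tail(weeks)["response_time_ms"]
    if len(baseline) < 3:
        return False                                   # not enough history yet
    mean, std = baseline.mean(), baseline.std()
    return value > mean + tolerance * max(std, 1.0)    # guard against zero-variance baselines

# "Was 10:03 on this Tuesday unusual for this service?"
print(is_anomalous("document-sync", pd.Timestamp("2017-01-10 10:03"), 850.0))
```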
So that means what you just mentioned may be interesting for the listeners.
You are streaming out the data from your APM solution,
which in this case is Dynatrace, to the free version of Elasticsearch
because I guess the free version allows you a certain limit,
a certain amount of data per day.
Yeah, so it's actually Splunk.
I think Elasticsearch is free forever.
I mean, it's an open source solution.
But, yeah, we're using – I'm personally just kind of experimenting around
with the free Splunk that lets you do 500 meg a day or so.
But right now I'm dumping in DC RUM data, because what I think DC RUM really excels at is measuring all of those dependent downstream services,
not like an end-to-end kind of transaction from a PurePath perspective, but actually measuring all the little endpoints that it calls below that. Like, DC RUM is very good and excels at measuring things from a bottom-up perspective in an
enterprise.
That's where it's really good.
And there's a SOAP action plugin that you can dump data into Splunk with.
And actually, there's a really good DC RUM extension out there in the community that
I think it was one of the guardians.
Shoot, I hope I don't get this name wrong.
I believe it was Brett Barrett.
He actually developed it.
And you can go in and create a SOAP REST call URL.
It's a really nice GUI that he put together.
You say, okay, I want this data to come out of DC RUM for these operations and these dimensions. And all you've got to do is just kind of dump it into your Splunk REST grabber and you can get all that data, you know, every minute.
And it's especially good with the DC RUM, you know, one-minute time intervals that they just came out with in the EAP program.
And that's just been awesome.
So, yeah, for the past five, six weeks, we've been collecting every operation in the enterprise every minute and just kind of dumping it in there.
And it's only taken up, you know, 200, 300 meg a day.
And I'm running it off of, you know, just a little two core VPC.
And, you know, we're experimenting around with saying, okay, you know, this endpoint, how is it supposed to behave at 10:03 a.m. on a Tuesday?
You know, from from kind of a whole enterprise perspective.
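The Splunk side of that pipeline can be as simple as posting each minute's records to an HTTP Event Collector endpoint. This is only a sketch: the HEC URL and token are placeholders, and the sample record merely stands in for whatever the DC RUM / AppMon export actually produces:

```python
import time
import requests

SPLUNK_HEC_URL = "https://splunk.example.internal:8088/services/collector/event"
SPLUNK_TOKEN = "00000000-0000-0000-0000-000000000000"   # placeholder HEC token

def send_to_splunk(metrics: dict):
    """Forward one minute's worth of per-operation metrics to Splunk via HEC."""
    event = {
        "time": time.time(),
        "sourcetype": "apm:operation_metrics",
        "event": metrics,
    }
    resp = requests.post(
        SPLUNK_HEC_URL,
        json=event,
        headers={"Authorization": f"Splunk {SPLUNK_TOKEN}"},
        verify=False,        # lab shortcut; use real certificates outside of experiments
        timeout=10,
    )
    resp.raise_for_status()

# Example record; in practice this comes from the DC RUM / AppMon export you schedule every minute.
send_to_splunk({"operation": "GetClientSummary", "resp_ms": 212, "load": 431, "failures": 2})
```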
I got two quick things to say here.
Sorry, Emperor Wilson, I didn't want to interrupt you.
Oh, no, no, no. Go on, Andy.
Two quick things for the listeners that are not familiar with some of the terminology.
DC RUM is our network-centric APM product.
And you said, you know, it's kind of sniffing
from the bottom up
and great for the enterprise
to monitor network traffic.
But Brian,
the other thing,
it seems what you are,
you're trying to solve
a very interesting problem
that a lot of APM vendors,
including Dynatrace,
also try to solve out of the box,
which is, you know,
applying machine learning, applying artificial
intelligence actually on top of the data that we collect and then alert you in case something
is out of normal behavior by looking back at historical data, by looking back at particular
endpoints and how they behaved a week ago if you have weekly cycles, a month ago if
you have monthly cycles or whatever cycles you have.
Wouldn't it be amazing if there's a tool that could do this, Andy?
Say that again.
Wouldn't our listeners love to know if there's a tool that can do this?
It would be like a Christmas present when it's four days before Christmas.
That's almost impossible.
Or do you think something like that exists?
I don't know.
Tell me more about this.
Yeah, you guys.
But I want to look like a total hero, right?
I want to beat my head up against a rock.
And yeah, no, it's the same thing.
I'm totally I think I totally agree.
That's kind of where everyone's going.
It turns into this kind of machine learning something.
You need something that can be able to detect these things.
And yeah, so I do realize it's like, oh, this might just be, you know, foolhardy,
and I should just wait for the machines to take over.
The beauty is you don't have to wait and I hate to be a commercial now, but you know,
that's what we do.
That's true.
That's true.
Yeah. So basically, for the
listeners, maybe go on dynatrace.com and check out Davis and check out artificial intelligence. So
basically, we saw the trend, what you guys have been building, that's great, right? You can
obviously do this with our AppMon data, our DC RUM data, but we also
saw the trend, and that's why we
came up with our
out-of-the-box artificial intelligence
engine that we put into our product.
But that's, I think, enough with the commercials.
If people want to learn more, Meet Davis
I think is a great way to start.
If you Google or Bing or search for
Dynatrace, Meet Davis,
then you will find more.
I wanted to ask Brian, before we go on,
I wanted to ask Brian just briefly
in case some other users out there
might want to do something like this.
You mentioned the free Splunk version.
And as far as the 500,
was it 500 meg daily data cap?
Is that, are you coming in below that
with both Dynatrace AppMon data and DC RUM data,
or is that just the DC RUM data on its own?
That's just the DC RUM data. So we're not quite feeding
the business transaction feed into it quite yet, because we do suspect that. We are in QA, and
we're just kind of putting together models of how we're going to slice and dice the data and deal
with it. But yeah, that's just DC RUM today.
Right.
And the general concept there is, right, is Dynatrace collects all this data and we have a lot of ways of presenting it.
But if you want to get extremely creative in how you want to slice and dice and do complex
multi-dimensional analysis of the data, you can put it into something like Splunk or Elasticsearch and run any kind of
queries and correlations against that data that you can imagine. So it's interesting that Splunk
has that free cap so that you don't have to necessarily set up. I think I looked into doing
a Kibana Elasticsearch setup and it required, just for the base setup,
a significant amount of horsepower just to get that running. So I'm just looking to play with the idea of seeing what you can do
once you export the data.
It's nice to know that Splunk has that free option there.
Anyhow, moving on.
Hey, Brian, still staying on the topic.
Besides response time and failure rate, any other metrics, measures that you're pushing into Splunk and then doing your anomaly detection on?
So right now, it really is just response time, load, and failure rate.
Well, actually, it's really just response time and load and software services and operations times.
Once we kind of put this model together, we'll obviously put in – we'll dump in more metrics like that.
But that's totally important too, right? You expect a certain amount of load and, you know, a certain X response time for some endpoint, but you also expect,
you know, X amount of 401 response codes or 200 response codes, right?
You know, because you're going to have a normal amount of 401 response codes when a service,
you know, hits some other service and challenges it for authentication, you know, things like
that.
So that's definitely, you know, that's definitely another good piece of it.
And yeah, and one other thing, too.
One of the things that kind of stuck with me when I started here where I'm at is one of the managers said,
what's important here is we try to cut down on the amount of smart guy correlation we need to solve performance issues.
That goes right into the whole Davis thing too, right?
It's, you know, once you set up this system that can kind of start telling you how to find these things,
you can have your engineers not have to sit there and compare graphs and cross-check things
and, you know, kind of look like, I use a GIF image on one of my slides,
you know, Zach Galifianakis in The Hangover where he's got all the graphs and stats
going over his face. You kind of have your engineers doing that and wasting time doing
that when, obviously, they could be doing a lot of other productive things.
That's just the message of APM, right? One of the messages is: spend more time innovating, less time fighting fires.
Yeah. Hey, and I mean, I like that. Maybe some additional
metrics, maybe you have them already on the list, but I can think about something like the number
of bytes sent and received per endpoint because that immediately allows you to see if maybe an API change or maybe
a new deployment all of a sudden is causing some issues on the amount of data you send
back and forth.
Maybe somebody forgot to turn compression on on a certain layer and then you send so
much more data over the wire, because Dynatrace and DC RUM obviously capture packet sizes
and network and then request and response sizes.
Also, I believe, and what we talk about all the time, the number of database queries being executed,
the number of web service calls, REST calls, microservice calls being executed,
number of threads being involved.
With the PurePath, you see how many threads are involved.
And I recently just, I think I blogged about this.
It happened to ourselves within
Dynatrace. I mean, our
DC RUM team was actually using
AppMon to analyze DC RUM,
and they made some quote-unquote
optimizations, but then what actually happened was,
their optimization was actually
spawning a lot of background
threads to do work in parallel,
because they thought it's faster, which,
for the short term, it was faster, but they were soon running into the boundaries of the number of
threads they have in their worker pools. So basically they were just filling up all the threads,
doing a lot of stuff in parallel, slowing down the overall system. And that was also very interesting.
And so the number of threads per request,
and they did also do this by endpoint.
These are great metrics that you can then,
from an operations side,
obviously give back and say,
hey, since the last deployment,
we saw a change in behavior
because now we are consuming twice the amount of threads.
We are logging five times as many log messages.
So log messages per entry point and all these things.
Great, great values.
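One way to act on those architectural metrics is a simple build-over-build comparison per endpoint, normalized to per-request values; the CSV layout and the 2x threshold here are assumptions for illustration:

```python
import pandas as pd

METRICS = ["db_calls", "service_calls", "threads", "bytes_sent", "log_lines"]

def per_endpoint(csv_path):
    # Assumed columns: endpoint, request_count, plus one column per metric above.
    df = pd.read_csv(csv_path).set_index("endpoint")
    # Normalize to "per request" so different load levels stay comparable.
    return df[METRICS].div(df["request_count"], axis=0)

before = per_endpoint("build_41_endpoint_metrics.csv")
after = per_endpoint("build_42_endpoint_metrics.csv")

ratio = (after / before).round(2)
# Flag endpoints where any metric roughly doubled per request since the last deployment.
print(ratio[(ratio > 2.0).any(axis=1)])
```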
Yeah, go ahead.
I was going to ask Brian, I remember when we were preparing for this,
you kind of had a similar story with hanging some other services off of an API call.
I don't know if you wanted to kind of tie that into what Andy was talking about with monitoring those other components and maybe how that all ties in.
Yeah, so this is kind of, you know, you always have those epiphany moments.
And this was one of those that were like, okay, this is something we need to pay attention to, and this is important, right? We had this API call that it was one of those big hitters, about 18% of the traffic.
And it kind of gets into how you architect different services.
And you have to be smart about, well, okay, we might want to invoke this service when a user does some, you know, action on the GUI, but you got to
be able to understand the implications of that. So in testing, we realized that, hey, you know,
if we tie this service call to this API call, that means that API call is also tied to this checkbox
on the GUI. And you get some fidgety guy that just likes to check boxes
on and off, you know, in a GUI. That's, you know, I would say that's part of the reason why it's
high volume, but it's high volume because it's just, it's used a lot. It's a function on the
GUI that's used a lot that it's enough to, you know, make 18%. So those are the types of things
that, okay, this is important, right? You know, we have humans using these very popular things on the GUI of the app, which, you know, makes these underlying API calls heavily
used. And then you have this exponential, you know, increase in service calls due to that.
So, I mean, this particular one was sort of a document sharing system between an FA and a client.
And we noticed in testing, okay, you know, this call is getting made a lot, and we're going to basically see this exponential increase in the back end of this service call,
you know, syncing documents between the FA and the client.
So that was something that we could say, whoa, okay, we better not run off that
cliff, you know, before we run it into production. So that worked out well. And in that same vein,
one of the things that we're doing today is we kind of slowly onboard functionality, sort of like,
you know, A-B testing or, you know, some companies, they'll convert certain users
over to some piece of functionality in an app.
And right now, we have a new CRM system we're kind of bringing out.
And today, we have about 1% of our users converted over.
And we're able to look at, okay, how is this being exercised?
How is the performance?
Is it deterministic, meaning is it 300 milliseconds solid, just like our old legacy version of this service, or is it all over the place?
Right.
And if it's all over the place, we catch it now, while it's 1% of our users using this, and feed back to the dev and QA teams before, you know, it's 50% or 100% of our users using this. And besides just the response time and all that, if you're looking at all those other metrics
that Andy's talking about,
the service calls, the database queries,
the threads, all these other components,
when it's at 1%,
you have a chance to figure out
what's that model going to look like
when we push everybody else over
and is this model going to survive?
So going back to Andy's points,
that's why monitoring all these things
in those production environments is so, so key and important.
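Putting a number on "is it 300 milliseconds solid or all over the place" for the 1% cohort can be as simple as comparing relative spread between the two routes; the sample values below are invented purely for illustration:

```python
import statistics

def spread_report(name, samples_ms):
    """Report median and relative spread; a high ratio means 'all over the place'."""
    median = statistics.median(samples_ms)
    stdev = statistics.pstdev(samples_ms)
    print(f"{name:>8}: median={median:6.0f} ms  stdev={stdev:6.0f} ms  "
          f"stdev/median={stdev / median:4.2f}")

legacy_ms = [295, 301, 310, 298, 305, 299, 303]     # placeholder samples
canary_ms = [180, 950, 240, 1300, 210, 700, 320]    # placeholder samples

spread_report("legacy", legacy_ms)
spread_report("canary", canary_ms)
# A canary whose spread is far above the legacy path's is worth sending back to
# dev and QA while it is still only 1% of the user base.
```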
Yeah, it's human behavior.
It's really, it's impossible to replicate.
It's very hard to replicate.
So, I mean, this production-type data is just gold to, you know, our teams to the left, right?
Yeah, and I think it's just hopefully with all this data,
and if you really feed it back from operations into test,
but also into dev,
especially showing dev the resource utilization of their features,
and also if we can take some kind of a cost factor,
because in the end,
and I think this is something I brought up in one of
the previous episodes,
trying to educate all the engineering teams that we not only need to build software that is fast from a response time perspective and super nice and user experience friendly, but also that is efficient.
Because in the end, I can write super fast code, but spawn 500 parallel threads, do something very strange, and write millions of log files that
nobody needs but that I need to feed into Splunk, and then nobody cares about it.
So I think these are the things we need to feedback as well to say, you know, great feature
that you built, but it's too costly because of this, this, and this metric.
Yeah.
Yep.
Hey, and for your A/B testing, I'm interested in what you just said. So in your case, are you selectively onboarding individual users by changing their user profile, and then you know they're getting redirected to that particular server?
Or is it every server has the same code base and you just turn it on depending on the user?
Yeah, so it's a function of – it's somewhere in the – I don't know the technical specifics of it, but essentially they can grab a list of users, what they're doing today, and they say, okay, these users are going to be converted over to the new service.
So if you hit F12 in their browsers, they're calling a different API call than 99% of everyone else.
So just the behavior in the F12 tab is different compared to them.
But GUI-wise, they're seeing it's the same thing to them sort of thing.
But is the code of, let's say, the B version of it, is the code running on different separate hosts?
So they're totally separated or they run on the same host as the other code runs too?
And just the flag then defines, hey, this code now executes here,
and the code on the same host is executing now in another path.
Yeah, that's essentially it.
So like on the API layer, sort of the front door of the data center, if they hit,
there's a flag that says, okay, these users are going to take this route,
and these users are going to hit this other route.
And then, yeah, the users that are converted over are spraying across the whole other system.
And then all the other users are there.
This API call basically points and sprays over some other system over there sort of
thing.
So that's kind of how we're doing with it right now.
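The routing described here can be as small as a lookup at the front door of the data center; this is a hypothetical gateway-style sketch, not Raymond James' actual implementation:

```python
# The ~1% cohort, in practice loaded from configuration rather than hard-coded.
CONVERTED_USERS = {"advisor_1042", "advisor_2311"}

LEGACY_BACKEND = "https://crm-legacy.internal/api"   # placeholder hosts
NEW_BACKEND = "https://crm-next.internal/api"

def route_request(user_id: str, path: str) -> str:
    """Same GUI and URL for the user; the flag decides which system serves the call."""
    backend = NEW_BACKEND if user_id in CONVERTED_USERS else LEGACY_BACKEND
    return f"{backend}{path}"

print(route_request("advisor_1042", "/clients/summary"))   # routed to the new CRM
print(route_request("advisor_9999", "/clients/summary"))   # routed to the legacy path
```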
Cool.
Are you leveraging user experience monitoring too?
Yes, actually.
We had an interesting one. So, you know, you always,
AppMon is funny, you always think that you know all the little nooks and crannies,
and then you always find some other cool thing to use it for. And so there's
this one app that we're just rolling out, and it's another kind of slow rollout, where we're
bringing out, I think it was 20 or 30 people, that kind of volunteered to be alpha testers for it.
And, you know, we're immediately finding ways that the QA testers couldn't
or didn't think of to exercise the app or exercise the functionality, right? And so one of
the things is, they found some user action coming, supposedly, linked from a completely
different app in our enterprise. And they're trying to figure out, well, how is this happening? Or,
we didn't expect them to kind of take this route in this app. And it's those interesting things
you can find. So this sort of cool new functionality I learned from a user experience analysis
perspective in AppMon is that you go to the
visits tab, you type in a username, and then if you don't filter any apps, it'll show you,
okay, well, here's this user.
And then they might have multiple visits based on them visiting multiple apps in your enterprise.
So what we actually ended up doing is we control-clicked all three of their visits and drilled
down into the user
action PurePaths from there, and then you get a complete sequential list of their entire journey
throughout the entire day, overlaying all these apps. So what was cool is we were able to see, oh,
you know, they were in app A, and then they hit a link in app A, which then
brought them to app B. And then in the Dynatrace GUI, you see a switch of apps within the user action PurePath dashlet.
So I think that's really cool because we could watch this user bounce between different apps
from an enterprise perspective, not just kind of like an application perspective.
So that was something that was kind of cool and new that we kind of stumbled upon.
So that helped a lot too right there.
And so we always learn something new about our users, the strange ways, the strange paths they're
taking.
Yeah, that's pretty good. Humans are so unpredictable. You know, if we could all be machines...
Exactly. Yeah, that'd be good. Is anybody watching Westworld by any chance,
on HBO?
Oh yeah, I watched the first episode, but –
It's very creepy.
It's very – have you – Brian, have you watched the whole season?
Oh, absolutely.
Yeah.
And I'm one of those freaks that has to immediately go on all the internet forums and read up all the theories.
And, yeah, it was actually – the internet did so well at figuring out all of the major spoilers before the finale that it
was like, oh, okay, I know all this. So maybe season two I won't, you know, I'll just kind of enjoy and
watch the show. But it's a very good show.
Yeah, it is. And Mr. Wilson, Emperor Wilson, you should
check it out. You know, I'm dying to see the original Westworld movie, the one with Yul Brynner. It
just hasn't been on streaming yet, I haven't been able to see it. It's probably late 60s, early 70s style sci-fi, starring Yul Brynner, the King and I guy,
with the shaved head, you know. And I think there was a quote from him, if I
didn't get it wrong, about him saying that playing that machine was one of the most comfortable,
natural roles for him,
which is pretty awesome.
But I really like to see the campy version.
You know, I watched the first episode.
I don't have too much time for TV, and it just didn't hook me in. But maybe at the recommendation of the two of you, especially another fellow, Brian,
I'll have to go back and at least give it a few more episode shots, you know?
That's right. You're on a major
binge watch now, it's all streamable.
Yeah. Hey, Brian, I got one more question, coming kind of back
to how we started. You know, we started with operations giving especially testing the insight
on which APIs are hit how often, so you can actually model better loads and distributions of load for your load tests.
Now, how do you deal with test data?
Because I assume the only way this really makes sense is also having good test data.
Do you also replicate and does operations provide test data to pre-prod, kind of tearing
down this wall as well?
Yeah, they do, to an extent. I don't know
the specifics on it, but I do know that they do that to an extent. That is important,
to be able to do, you know, just to be able to exercise and replicate the functionality
as it should be, right? Yeah, because, you're right, there could be so many
different... well, they at least emulate the diversity of relationships that you
might have. You would say, okay, well, on average, you know, we have a mix of, you know,
a 20-clients-to-one-FA kind of relationship. So they might replicate, okay, this is the
kind of mix of relationships that we need to be able to exercise.
Because one service call could be very heavy compared to another service call, even though they're the same endpoint,
because you might have an FA that has 40 clients compared to 100, and they're going to pull that much more data back.
So, yeah, they definitely mix the different types and combinations of data
that you can have there.
Cool. And, yeah, that's interesting. I was interested in what you guys are doing, and have you ever played around
with load testing in production? Have you ever thought about something crazy like this?
No, not quite. Although, at a different organization, before I was with Raymond James, we have thought about it,
because we didn't quite have our QA environment scaled to our prod. And the way the app behaved, it was kind of these image creation, this
image creation server, right?
And if you had one of them, basically, that the speed at which it can create these images
was a function of how fast the disk could read and write.
And it was not a linear scale.
So if you had one of these image servers versus four of
them, it wasn't like, okay, you just need to push one fourth of the prod data to it.
It would actually behave a little bit differently if you had four servers. Like, it
wouldn't just be, okay, you can push four times more through; the disk would actually lock up
at different rates. So we thought about
that, you know, waking up at like three in the morning or something like that. We didn't end
up doing that, but yeah, it's not something we're doing today, I would
say.
Cool. Well, gentlemen, I thought this was an amazing discussion, especially
around the kind of like coming back to the title.
We have to love our data because there's so much we can learn.
And I'm really looking forward to when this show airs and we can then show the charts we all looked at here,
kind of showing the audience what you can do with these flame charts that
are not flame-ish, but in other colors, but really look great.
But you actually do have flame charts, right? The third dashboard that we have in our email here includes flame charts.
Are you talking about the, I'll have to pull it up, I don't know if you're
talking about the red wave of death, where you get the yellow, green, red?
Oh, yeah, yeah, yeah.
So that's good.
Yeah, you can use flame charts, red wave of death.
Either works.
We actually have a fourth color in there.
You don't see.
But the fourth color is the purple wave of death.
So that's just performance right there.
You know, green, you know, well, from an API layer, I'm sorry.
So in UEM, red will count as a frustrated visit, if somebody throws a 500 or 400 or something like that.
But from an API layer perspective, we have a fourth color.
Okay, was it slow, fast, or okay?
Or did it throw a 500 or 400 error?
And that's purple.
So if we have an account lockout or something, you'll see a giant purple wave there because the API calls at that point are very fast, but they're failing very fast, right?
They're at 20 milliseconds.
So if we didn't have that fourth purple color in there, it would just be a nice – it would say, hey, everything's great when it's not.
So, yeah, that's what we have for all our apps there.
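The point about the fourth color boils down to never letting a fast failure count as a good response. A tiny hypothetical classifier makes the rule explicit; the thresholds are arbitrary examples:

```python
def classify(response_time_ms: float, status_code: int,
             fast_ms: float = 500, slow_ms: float = 2000) -> str:
    """Failures win over speed: a 20 ms storm of 401s must not look green."""
    if status_code >= 400:
        return "failed"              # the 'purple wave' bucket
    if response_time_ms <= fast_ms:
        return "fast"
    if response_time_ms <= slow_ms:
        return "ok"
    return "slow"

print(classify(20, 401))     # failed, even though it is fast
print(classify(180, 200))    # fast
print(classify(3200, 200))   # slow
```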
That's why one must always connect response time and failure rate because that's the only – Exactly.
Yes. That's the only thing. Yes.
That's very important.
Yeah.
That was awesome.
Yes.
Thank you so much for being on.
Andy, did you want to do any other sort of summary?
You haven't done one in a while.
I know.
I know.
I mean, my summary is really, I believe modern operations teams, and coming back to what I will be talking about at the webinar, which happened the week before this one airs,
I believe modern operations teams need to break down their walls from the right to the left and provide better, meaningful data about the applications,
about the patterns in which users use the applications, also
about dependencies when we talk about all these metrics like the number of web service
calls, the bytes sent and received and all that.
Because in operations, you really have a view of your real users and how they use the app.
And you can level up your testers by allowing them to model better load tests, like in your case.
And you also can give direct feedback to your development teams and saying, hey, whatever you just did, first of all, people love your feature, but the feature is doing something very weird and you're consuming too many resources, right?
I mean, this is perfect, I believe.
So ops, I believe, can step up.
They don't have to wait for DevOps to happen from the dev side, but I believe ops can start themselves and make a step towards the testers and the
development teams.
So that's what I believe. Perfect.
Very good.
I just want to remind everybody that Perform
is coming up and Andy and I will
both be
running some hot sessions
on Perform and I actually just noticed my session
is not on the page
which probably explains why we don't have many people signed up for it.
I'll be doing an e-commerce monitoring hot session
the second half of the day on February 6th.
Andy, which ones are you going to be doing at Perform in Las Vegas this year?
I'm doing the shift left continuous integration session.
I'm doing one in the morning and one in the afternoon.
So the idea is we're building
a Jenkins pipeline
using the latest
Jenkins pipeline feature,
having a Spring Boot app
with two microservices.
So setting up the pipeline,
pushing the app
through the pipeline
and then simulating
some bad code changes
and then seeing what happens
if you don't have
Dynatrace in there
and then another one
where we can use Dynatrace to stop
the bad code change before the change goes
into production and kills everything.
Excellent. And Mr. Chandler, are you
going to Perform this year?
I am. In fact, I'll be in a
breakout session I think the first day.
I think it was creating
performance tuning heroes
or something along those lines.
I believe we're going to be with Mr. Thorsten Roth, my manager and I,
going through much of the same, not exactly everything we talked about today,
but I'm sure we're going to touch on it a fair bit.
But, yeah, come on over to our breakout session.
Excellent.
Are you wearing a cape because you're a superhero?
You know, I've been trying to pick out which Avenger, you know, our team is.
And we've got to, yeah, we'll have to, we're still out on that one.
The jury's still out.
We've got to figure that one out.
That also means tights.
Yes.
So please.
Will that make more people come to our room?
Yes.
We kind of post that.
Yeah.
All right.
We'll be sure.
No matter what, it will.
Yes.
Wear tights.
Absolutely.
Please.
Andy will wear his lederhosen if you wear your tights.
Andy, I just committed you.
Bring it home with you.
Anyhow, thanks a lot, everybody.
Any final words from you, Brian?
I know we're kind of rambling here, so any quick last words there for you?
No, that was great.
And just like Andy said, it's a great opportunity, these types of tools,
being able to analyze this data in production to, yeah, get more involved, you know, on the lifecycle as a whole, right?
I mean, you become more than just, you know, watchers of green check marks and red Xs, right?
You can actually feed back this rich data and be a really, really good part of the process.
Absolutely.
And then shameless plug for my Twitter.
You can follow me at channer531, C-H-A-N-N-E-R-5-3-1.
And yeah, that's about it.
All right.
And you can follow us at Pure underscore DT.
I am at Emperor Wilson.
We also have at Grabner Andy for all the Twitters.
And don't forget, you can also, if you're a YouTuber, we're now publishing these to YouTube.
So if for some reason you'd like to have a video playing with a static image while you listen to the audio, you can do that as well.
That's all for me.
I'd like to thank everybody for listening.
And thank you, Brian, for being such a gracious guest today.
And Andy, thank you as always.
Thank you, guys.
Thank you.
Bye.
Andy, go get some sleep.
Go get some sleep, Andy.
I'll get some sleep.
Bye-bye.
Bye.