Storage Developer Conference - #113: Latency is more than just a number
Episode Date: November 5, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, Episode 113.
Okay, so the basics. This is about the data.
It's not about the workloads.
It's not about SSD, HDD.
It's not about system-level QoS.
I don't want to talk about any of that stuff.
Success criteria, not important.
Just the data.
How do we handle the data?
What do we do with the data?
And proper techniques in looking at the data.
So I don't know who left this here.
I guess it was me.
Come on.
Dry crowd.
Here we go.
Last session.
Come on.
Stay with me.
So, QoS.
QoS is something that I like to put in terms of a pizza delivery.
Everybody can relate to that.
30 minutes or less, it'll be there.
You get that one time out of 100 where
it doesn't make it in that 30 minutes or less.
The customer does not care if it arrived at 32 minutes or two days later.
The customer is not happy.
It doesn't fall within their QoS envelope.
So their QoS envelope is below 30 minutes.
And again, like I said, this is very basic.
So that's where I start.
Anyone have questions about the basics?
Oh, please don't take a picture of the pizza delivery guy.
Please, please, no.
Let's get a better graph up there.
Okay, QoS traditional view.
This is a better way to look at it.
So again, in the title
that I didn't complete, it's more than just a number. So classically, we get numbers. We get
table of numbers. We get a particular number. And then sometimes when we have a customer that's a
little more savvy, they want to look at a histogram view.
And they say, okay, we bin it up and we look at these histograms and then we say, okay,
where do these QoS numbers lie in this?
And you look at that and you kind of go, yeah, but what does it all mean?
I mean, are we talking about a number?
Are we talking about a five nines number at 153 way out here?
Sure.
What does that mean?
It doesn't really have a lot of meaning in the end of the day.
So what I'm trying to do, or what I tried to do very early on,
was try to wrap meaning around what we're doing.
So not just saying, let's look at a number.
And a lot of times my clients inside the company would come to me and say,
we're not hitting the number, why?
Well, I'd look at this histogram and I'd say, something's going on out there.
I don't know what that is, but there's something going on that's pushing our events way out.
Well, can you help us?
Can you help the firmware guys fix it?
Can you help someone analyze this?
So I decided to take this to the next level. So instead of saying just like,
we have a certain number of IOs that didn't make it in the count, a certain number of IOs that
made it in the count, we say, okay, so how do we look at these IOs? What do these IOs mean when
we're actually getting a QoS number out for our particular customers? I'm not going to say names,
but there's one notorious one that wants nine nines QoS
with a very low sample count.
And I said, can't be done.
They said, no, they're generating numbers.
So part of this discussion is to clarify
what they actually did and what they're doing wrong.
And it's hard to go into it,
especially a big customer like that, and say, you're wrong.
But I got to do that.
It was fun.
Anyhow, so any questions on the traditional view?
You guys are all familiar with this, I'm assuming?
Yeah, okay, good.
So the attempt was to make the measure better, something that we can talk about,
something that we can say, okay, rather than improving our 5-9s number, what do we want to look for?
So we went to trying to say, okay, well, why is QoS not consistent?
Well, first of all, the IOPS measurement.
They said, well, what does the IOPS measurement have to do with our QoS? It's like, well, if our IOPS aren't stable,
we haven't clarified that, then our QoS isn't stable. So if you run this in this first sample
window, which this sample window, I believe, is a 10-minute window per sample. So this is a long
running IOPS stream. You'll get a different number.
And if you amortize that into, let's say, two sections, you'll get yet a different number again.
So the QoS numbers are variable.
So, again, it's not a good measurement.
We need to have the system at steady state for the workload.
And we have to look at the IOPS.
We have to care about the IOPS in this case.
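A rough sketch of that steady-state check, my own illustration rather than anything from the talk, with the window length and tolerance as assumptions: compute IOPS per sample window and only trust the QoS numbers if the windows agree.

```python
import numpy as np

def iops_per_window(completion_times_s, window_s=600):
    """IOPS for each fixed window (e.g. 10-minute windows, as in the example above)."""
    t = np.asarray(completion_times_s, dtype=float)
    t -= t.min()
    bins = np.arange(0.0, t.max() + window_s, window_s)
    counts, _ = np.histogram(t, bins=bins)
    return counts / window_s

def is_steady_state(completion_times_s, window_s=600, tolerance=0.05):
    """Steady if every window's IOPS is within +/- tolerance of the median window."""
    iops = iops_per_window(completion_times_s, window_s)
    mid = np.median(iops)
    return bool(np.all(np.abs(iops - mid) <= tolerance * mid))
```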
So percentiles, you've got to have good data.
So this is part of the IOPS discussion, right?
So we bring in that IOPS.
You have that first section of fast IOPS.
They're not going to give you a good steady state number.
They're going to give you your QoS for that interval.
But that's it.
It doesn't really help you understand what's going on when you reach steady state.
Most of our customers who are running QoS are interested in steady state performance.
They're not interested in what our burst performance is at the beginning.
Not so much.
So we have to have reliable data
up to the percentile of interest.
What does that mean?
So when you think about it,
let's say we have six nines of numbers, right?
Six nines means we have a million samples
to get to a six nines level.
So each one of the nines represents a zero, essentially, right? So that would give me one data
point that I could base my six nines number
on. That's not really a great way to do it
unless your data is really consistent. And as you guys know in the storage
industry, our latencies are never really consistent. I mean,
they can be over here, they can be over there,
but it's not the way it's done
in the real
world. And understanding
that you've got a workload. So when
you're looking at a blended workload,
you're looking at a
read QD1 workload,
the profile's going to be completely
different. What you get out of it's going to be
completely different.
As you change the workload,
so one of the things that some of our customers do is they change the workload midstream.
They'll say, okay, let's do this.
Okay, now we're going to do this, and now we're going to do that.
Each section in there needs to be quantified.
In other words, you don't just look at one particular piece of the puzzle.
So by understanding the workload,
and again I come back to the one data point.
One data point can't drive your numbers.
It's crazy.
And actually there is some empirical data that we worked out
as well as theoretical data that we've worked out
to be able to say how much data you actually need
to make that measure.
Any questions so far?
Super simple. You guys are quiet.
I'm going to bring some alcohol in the room or something.
Get you guys going.
Okay, so...
My first step into this was to say,
we need mathematical significance.
We can't just do this.
We can't just say, hey, there's my nine nines number
because I took a million samples.
Here you go.
No, it doesn't work like that.
This is not that kind of data.
The central limit theorem, one of my favorite theorems.
There's college courses on it
if you guys want to go that route
and find out more about it.
But there's a little thing called root n,
which comes into play,
that says that if you're looking at a sample
and it's not normal data,
which this is not normal data,
the closest approximation I have
is a log normal set of data,
but it's not even that,
if you want to start talking about normalization.
People want to do averaging.
People that do averaging in QoS: wrong, you're doing it wrong.
You need to be using the 50th percentile, the median.
That is consistent.
Anyone want to know why?
Yes, Dave, we want to know why.
Okay, so if you think about it this way,
if you have a pool of data,
and think about that histogram I had up there in the very beginning.
We have one outlier that's way, way, way out there.
That person whose pizza shows up two days after it was supposed to be delivered. That's going to
skew your average significantly. If you looked at
50th percentile, that wouldn't move. It'd be right
there, in the same place, no matter where that last latency came
in. Again, you don't want your data to depend on a
single data point.
So averages are bad.
Standard deviation is bad.
And I've got customers that are asking for both.
So I have no choice but to give them what they're asking for.
But then I also give them the right thing, too.
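A toy illustration of that point, with my own numbers rather than the talk's: one two-day pizza drags the mean way out, while the median, the 50th percentile, barely notices.

```python
import numpy as np

latencies_us = np.array([100, 102, 98, 101, 99, 103, 100, 97, 101, 50_000])  # one extreme outlier

print(np.mean(latencies_us))            # ~5090 us: the average is dominated by one sample
print(np.median(latencies_us))          # 100.5 us: the median stays put
print(np.percentile(latencies_us, 50))  # same value, the 50th percentile
```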
So, again, this is part of saying this is more than just a number.
You have to actually measure it correctly.
And you have to measure the data correctly, not the workload.
I don't care about what you're doing there.
Make sense?
Okay, cool.
For the latencies that I looked at,
we need 100 times more samples beyond the bare minimum.
So what that means, in that million samples that we require for a 6-9 number,
we actually have to measure 100 million samples to be able to make that number reliable.
It doesn't quite follow the root n.
Root n would say that we would need 1,000 times more samples,
because the root of a million is going to get you to 1,000.
So you'd need 1,000 times the bare minimum,
and that will just increase your accuracy.
But generally speaking, the good news is
the storage is generally consistent,
so you don't have to follow the root N completely
to still get good data out.
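A minimal sketch of that sample-count arithmetic, my own helper rather than the talk's code: the bare minimum for an N-nines number is 10^N samples, the 100x rule multiplies that by 100, and a strict root-n argument would push a six-nines measurement to 1,000x.

```python
def samples_needed(nines, oversample=100):
    """Samples for an N-nines percentile: 10**N bare minimum times an oversample factor."""
    bare_minimum = 10 ** nines
    return bare_minimum * oversample

for n in range(3, 7):
    print(f"{n} nines: bare minimum {10**n:>9,}, with the 100x rule {samples_needed(n):>13,}")
```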
And like I said, when I got customers
that are throwing me all kinds of numbers,
all kinds of workloads, all kinds of thoughts on it.
You kind of go, what is it that we really need to do to standardize this?
And again, my
adamant opposition to using numbers
that have less than that 100x has got me into a lot of trouble. But I am very
adamant that you cannot have a good repeatable number in that window.
And I've seen numbers vary as much as 50%. It's
pretty bad. So again, the sample size
matters. Questions? Central limit
theorem? Anything?
Okay, good.
Go to the next one.
So this
is one of my early studies. I've done a few of these studies
where we took a
single-SKU drive.
We took three, there's three
versions of it.
All same assembly line,
same day of manufacture, sequential
serial numbers.
Pretty much the same drive.
And I ran one of them three times.
And then I have the other
two up there.
And what I looked at is
and again, standard deviation, bad.
I was bad, bad, bad. I shouldn't have used standard deviation.
But it is a common denominator.
People can understand, at least at a certain level,
that basically if you look at those five runs
that I did on this drive,
and I did 3.2 million samples,
so they ran for a period of time
to give me 3.2 million samples,
and at three nines, it's pretty tight.
It's well less than 1%. Four nines, we're starting to cusp on that 100x over sample.
You're starting to deviate, but not a huge amount.
We're still under that 1% margin,
which I like to see from an error standpoint.
Yes?
So when you say I run them so many times,
you're talking about drives, right?
Are you just issuing IOs to the drives? Yes.
Yeah, we're using FIO to... We're monitoring the completion latencies.
Yes.
So the time from the submission comes in
to the time the data's ready.
Pretty straightforward.
FIO, standard tool.
Linux-based,
good stuff.
It's a good product.
Again, nothing against...
Yes?
I understand when you say
the various QoS levels
at the bottom,
and I'm assuming you're still
talking about the 3.2 million samples.
Correct.
And you've somehow broken those samples
into an independent number of points, or... Correct.
It's only changing across that particular QoS number.
So that's how much the QoS number is changing. So in other words, each one of the five runs
is going to generate a QoS number for 3 9s, 4 9s, 5 9s, 6 9s.
And so that's the deviation of those 9s levels.
That's what I'm showing here.
No problem.
Any other questions?
Okay.
So again, as we get past that level of 100x oversample,
things start to go apart.
They start to deviate.
So this now becomes a repeatability issue.
This is, we can't get a repeatable number out of the same drive,
or the drive that's the same SKU on the same test system in the same test environment.
It's starting to deviate.
And then again, look at how many outlier samples we have:
basically that 6-9s number there is being driven by three samples.
So there's three samples that are outside of that 6-9's number.
Does that make sense?
And that was a little confusing.
So if you think about it, there's only three samples that are driving that number.
And that's one of the reasons why there's that huge variability.
Because these drives, you can have a long event, you can have a short event,
but if those three events are somehow coupled, which sometimes they are,
that can move that number around.
And then let's say one of those events didn't happen.
Then you move
where that
6 9s number is,
where those three samples are determining that 6 9s number.
Does that make sense?
I'm getting yeses.
Okay, cool.
You guys are sticking with me.
Okay.
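The arithmetic behind that, my own back-of-the-envelope consistent with the 3.2 million samples mentioned earlier: the number of samples sitting beyond an N-nines level is just the total count times 10^-N.

```python
total_samples = 3_200_000
for nines in (3, 4, 5, 6):
    beyond = total_samples * 10 ** -nines
    print(f"{nines} nines: ~{beyond:g} samples beyond the threshold")
# six nines: ~3.2 samples, so roughly three IOs are deciding that number
```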
So,
in this look, I mean, I've done this a number of times.
This is a real simple view that illustrates how that deviation starts to occur.
So the more samples we can rely on, the better data we're going to get. So in this case,
this is a single drive.
I don't remember what the workload was.
7.8 million samples.
You can, by tuning how you do the percentile calculation,
you can change things.
This is where that customer who wanted nine nines from a small sample size
actually came up with a nine-nines number.
This is how it works.
So basically, there's a percentile function in a Python library called NumPy.
I don't know if you guys are familiar with NumPy or not.
It does the default percentile calculation, which is a linear fit.
So it linearly fits the percentiles to the data set.
What that means is that it doesn't necessarily have a latency when it's creating that linear fit.
It just says, my nines are somewhere over there, somewhere right here.
Here's where it's supposed to be.
Whether there's latency there or not, it doesn't care.
It's going to generate a number.
And that's what happened in this bottom run right here.
So the default linear fit on this set of data, it's the same set of data, all three, all the same set of data.
It says, okay, my percentile is right here.
Here's my 5 nines, here's my 6 nines.
And oh, and by the way, I'm going to generate a 7 nines, 8 nines, and 9 nines number,
even though we've only got 7.8 million samples.
So it's going to lay somewhere over there.
So what it does is it says, okay, so if we look here on the latency map,
we've got a max latency at 3346.
The next lowest one, you can't really see the bars there,
it's around, oh wait, I can tell you what it would be. It would be about here. No,
it's going to be higher than that. So approximately 2700 is the next lower latency.
So what happens is it's saying, okay, between that last latency in here,
the percentiles say it should land about here.
And the next one should land about here.
So it fills in the blanks.
So that's the way default handles it.
So now you have to say, don't do that.
Associate my nines level with actual latencies that are occurring.
So when you change that to the next higher,
and this is, again, a switch that you can set in NumPy.
It's a very simple one.
Say, give me the next higher one.
The behavior is quite different.
So starting even at five nines,
you can see that there was no latency at 453, but there
is one at 455. So now our 5 nines has moved a little bit. Again, not hugely significant. And
again, if you look at my rule of 100x over, 100x is going to lie right around in here somewhere.
So yeah, you're starting to get right at that cusp where things are going to deviate.
Six nines?
A little more.
I mean, three.
Okay, what's three between friends?
We're measuring microseconds.
It's okay.
But now here's where it falls apart.
So the next higher, at seven nines, it's going to take this number and say,
what's the next latency?
And that's going to be our max latency. So now it's going to return max
latencies in the buckets that it doesn't have enough data for.
And I have set my code to ignore. If you have something that's equal to max latency,
don't print it, because you know it's wrong.
There's only one condition if you actually have exactly the right number of IOs, which
rarely occurs, but it does occur sometimes.
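A minimal sketch of the NumPy behavior being described, using synthetic stand-in data rather than the drive data from the slide: the default linear interpolation will invent a value between or beyond the latencies you actually observed, while asking for the next higher observed value ties each nines level to a latency that really happened. On older NumPy the keyword is interpolation=; newer releases call it method=.

```python
import numpy as np

rng = np.random.default_rng(0)
latencies_us = rng.lognormal(mean=4.6, sigma=0.3, size=7_800_000)  # stand-in for ~7.8M samples

for nines in (5, 6, 7):  # 7 nines is already beyond what 7.8M samples can support
    pct = 100 * (1 - 10 ** -nines)
    linear = np.percentile(latencies_us, pct)                          # default: linear fit
    higher = np.percentile(latencies_us, pct, interpolation='higher')  # next observed latency
    print(f"{nines} nines: linear={linear:.1f} us, higher={higher:.1f} us")
```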
Any questions on this?
Wow, okay.
You're going to make this time go fast.
You guys must have airplanes to catch or something.
Maybe?
Okay.
So again, going back to the customer and say you're calculating it wrong is not super popular,
but yet we do have the right way to do it,
and I don't know what their code is when they do their calculation,
so I can't tell them how to change their code.
I can just give an example and say this is NumPy, and here's how NumPy works.
There's a fair possibility they're using NumPy
to do their calculations
when they're calculating their percentiles.
Okay.
So now we go on to a better view of QoS.
And again, better is always subjective.
Again, it's my opinion.
It's something that we look at
in a different way. So what we do
is we create what's called
a 1-CDF plot.
This is a cumulative
distribution function.
You basically take the entire pool of data,
you bucketize it like we're doing the
histograms, then you take those
buckets and you do
a 1-CDF plot of that data.
And what that gives you
is a really neat
line.
Again, depending on the workload
it can be more interesting or not.
But essentially it allows
you to see what the driving factors are
in the levels of 9.
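A minimal sketch of how that 1-CDF view can be built, assuming matplotlib and a flat array of completion latencies; this is my own illustration, not the talk's plotting code. Sort the latencies and plot, for each one, the fraction of IOs at or above it on a log y-axis, so each decade down the axis is another nine.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_one_minus_cdf(latencies_us, label=None):
    lat = np.sort(np.asarray(latencies_us))
    n = lat.size
    tail = 1.0 - np.arange(n) / n        # fraction of IOs at or above each latency
    plt.step(lat, tail, where='post', label=label)
    plt.xlabel('latency (us)')
    plt.ylabel('1 - CDF (1e-5 is five nines, 1e-6 is six nines, ...)')
    plt.yscale('log')

# plot_one_minus_cdf(run1, 'run 1'); plot_one_minus_cdf(run2, 'run 2'); plt.legend(); plt.show()
```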
In this case, when you're looking at this plot,
and it took me a while to teach people how to read it,
basically you take the scientific notation over here,
and that is your level of 9s.
So at this point on the graph, we are at, again, this is log scale,
we are at approximately 300, 289,
where this crosses the five nines barrier.
And as you come down the graph, you see that we start to get jagged down here, and then it ends
here. This is our max latency for this particular run. But there's not a lot of data in here,
so this isn't super clear but you do have these transitions
you have these layers of horizontal movement
which that tells you that there's not a lot going on
at the nines level from an IO perspective
so from a histogram perspective you would see that
there would be very sparse data in that portion
and so there's nothing really driving change
or there's not much
data driving change in this point.
But some trouble can
happen with this. And this is where we
start to see some value.
Also, on the other end of the chart, near the
top, the other
interesting thing is you can see distinct
stair-step levels.
Anybody want to venture
a guess what those stair-steps are about? Anybody?
Bonus points, Friday afternoon. No? It's TLC. You're actually looking at reads from each of
the different layers of TLC. It's a TLC drive. So basically we have three major stair-steps,
and then a stair-step here, which is a retry level.
So basically you're seeing the behavior of the NAND in the device.
Cool stuff.
So vertical slope means not really an updated NAND resolution?
No, what vertical means, things are going actually quite well.
So verticals are in your favor.
So if you look at your 2-9s and your 4-9s in this graph,
they're real tightly grouped.
That's a good thing.
That's a good thing, not a bad thing.
So here's how it looks when mapped on top of the histogram.
No, that's not the same graph, it's a similar graph.
So basically we take the histogram.
In the histogram, you can see those stair steps now.
They're clearly apparent.
And then you can also see some little sub-stair steps,
which are those first-level retries that are happening very quickly.
So things aren't adding up.
Is that the right number?
It takes a certain number of microseconds to get that.
And again, you can get a lot
of G2 on what the
drive's doing by just looking at this view.
And again, you look
at the levels of retries that are occurring,
which are kind of out here in time.
But then, what
happens in the histogram is your data
breaks down. You don't have a lot you can see.
But in this case, you can see from the CDF that you actually have some level of function.
You have some deterioration.
You have a strong horizontal line right here showing that there's just nothing going on in this section. And you've got this little modality out at the end, which you can see represented
by that wonderful level of nines in there,
saying that, yeah, something really is happening at that point. We don't have enough
data to make it really clear. We need more data. Again, that's always
the desire from a data science perspective, to be
able to have good, solid data for this.
Any questions on this view?
Okay.
I did get some questions
when I showed the slide set before
about what's going on over here.
This is cache, so we're getting
cache hits.
We're getting some
system anomalies.
Again, you have to believe the data that you look at.
Because again, FIO is only so
accurate. CentOS is only so accurate.
So we have to deal with that.
So that's not causing...
It's still horizontal there
just because there's so little...
Yeah, there's a very small amount of data.
And again, very few hits.
So you're getting some level, I mean, 10 microseconds.
So between 10 and 20 microseconds, you've got some modality going on here.
So that's probably going out to the drive, hitting the cache, and then coming back.
If you could see the view where there's one more down,
you'd see the OS having a few cache hits too.
So you're hitting at the different levels of cache.
But again, it's kind of cool because you can see what's going on.
It's kind of neat.
Alright, so this is a little more interesting.
This is a different drive that I ran consecutively 10 times, but I only
reported on the first five runs on the drive. And looking at what happens in the CDF plot and how
that CDF plot changes. So you see in one run, we have a displacement. And again, displacements, when you have good solid data
representing an event, something happens.
So the assumption at my level,
and again, I'm just one of the guys,
is we hit a read scan.
So we have some read scan going on,
and it's taking up some small amount of bandwidth
from our low level of nines,
and it creates a small displacement.
Again, good data, more data, better.
And it tends to be more and more repeatable.
But then, as you go, as you look at these runs,
and you look at this horizontal event over here,
something funky happens,
and this horizontal line crosses our seven nines level.
So one of the runs, we actually, the drive is quicker.
So this is good.
But the line has dropped below the seven nines.
What this does, in the classic scenario where we have a table,
it creates huge variability.
So a customer would say,
I tested it once, I got 757.
Tested it the next time, I got 2308.
What's wrong with your drive?
There's nothing wrong with your drive.
It's just that there's an event that goes on in this section
which drives the typical behavior which didn't occur over here.
So now it looks better artificially.
But the point is that the customer says, how do we improve that?
Well, the way to improve that is to bring this line down just a little bit.
That's all you've got to do.
You've got to figure out what you can do to push off whatever it is, the event that's driving
that long horizontal, and you'll get your nines back. You'll get your seven nines back.
So, you know, again, this drive has the capability of doing even better than the 757 because
it doesn't cross until this point, but we're over here when we kind of end our
deterioration. So we should be able to get that number down below that point there. By the way,
these are not Micron drives, necessarily. There might be a couple thrown in the mix.
So some people said, oh, well, we see your Micron drive does this, and I say, ha,
it's not a Micron drive. I work very closely with a competitive analysis group and they feed me data and then I just analyze
it.
And then
the last part is showing the variability at the end
because again the lack of data
it doesn't allow us to have a really
clear picture of what's going on at the end there
and again it shows us that we really don't have
a repeatable number when we get to those high
level of nines for the data set that we've
allocated.
Well, my boss likes me to say Micron.
No, no, no, no.
And again, this is not about the drives.
This is about the data.
So it's just about the data. You should say that it's multiple variations of different drives.
Of competitor X drive.
Or competitor S,
or competitor I, or competitor H,
or whoever it would be.
Yes?
Yeah?
Is there any expectation that the runs are identical
from a workload perspective,
or they're just...
I'm defining a run as basically saying
these are all the data points that were collected.
I don't know what was necessarily the workload
that was there for the run.
So the runs are identical.
So they're done on the same machine
with the same test script, same day.
So, you know, a lot of environmental variables.
Again, you do have multiple runs on this drive,
so the first pass might create something,
again, like we saw in that first displacement.
Maybe some history being built up based on the past.
Yeah, yeah.
And they're generally close to fresh out of box.
They're generally new drives.
Is there any kind of a reset happening
for the drives between each pass?
No.
We are just running them.
So the idea...
How do you distinguish between the first pass
and the second pass?
It's just starting the workload again.
It restarts the workload,
and the workload has actually been occurring previous.
So in other words, there's a preconditioning that happens,
and so basically steady
state, again, part of my requirement to say that
those IOs are at a steady stream.
And then we can,
I'll selectively collect
the consecutive runs.
And when you're defining a run,
the run themselves are not defined
as having
randomization inside of the run itself.
No, no.
There's no re-randomization.
We don't re-seed or anything like that.
So we basically let FIO use its default.
We're doing no map.
I'm sorry.
What do they call it?
I forgot.
I'm not an FIO expert.
What I'm trying to think about here is that
if you assume that you had 100 files,
I'm just trying to define a run, right?
You had 100 files, you had 100 threads,
and each of those threads was responsible for reading some location on the file
for each of their read activities that was going to be executed.
So we could use this view for each one of those runs.
Yeah, yeah.
So this one, there's no file system,
where we're just doing direct block access.
So we're not resetting any file system structure.
We're trying to eliminate everything.
We're trying to go direct.
So you're doing direct device level.
Correct.
You're doing direct device.
Yeah.
Yeah, we're using direct.
So again, talking about workloads,
but yeah, that's it.
Okay.
I understand.
But you have to have some idea of what...
Sure.
Where all in the system you could have
an introduction of...
Variability.
Sure.
And this is, like I said, this is one of the powers of this view.
You can see how easily you've got five runs lined up on one view.
You could probably put 30, 40, 50 runs on there and see where your variability occurs.
So if you wanted to do something like that,
this view would still work for you.
In terms of the time, if you will, that the event occurred,
if we were to take just the last, I don't know, 100 events,
the highest 100 events, and plot them on when in the run
those events occurred.
Would they be occurring at the same exact time
or would they be scattered
across the
duration of the run?
Have I got a view for you?
I was there
three years ago asking that same question, what's going on in time? Because
again, this doesn't really have a factor of time in it. This is just latency percentages, right?
We're not looking at time. So we have a thing called runtime QoS. So runtime QoS, again,
thank you for being my intro on that
I appreciate that
runtime QoS what we do is we say okay let's take
a million data points
some of these drives can hit a million IOs
we can generate a whole lot of data really quickly
and now we can say okay
so at that point in time, for those million IOs,
let's plot our nines of interest, whatever they may be.
In this case, it's three nines and four nines.
And then whatever our max latency is.
So that one longest, again, you don't want to depend on that number,
but it does show a trend when it's cumulative.
So if you look at these
events, and again, this
scale is not time in this case.
It could be. We could do it in time.
But I did it in
samples. So every million
sample, we get a drop of data.
And you can look at, you can see
how our 3-9s and 4-9s
are climbing here, and we get this long latency
event, which is
probably maybe a block race assignment?
Maybe?
So the x-axis on this
is samples. So
you can basically get each one of these data points is
in this case, each one of the data
points is 100,000 samples.
So its max latency is that
100,000th sample.
I'm assuming time is also there.
Time is very similar.
Yeah, so every 100,000 points,
because you have taken the 99...
I've calculated all the percentiles.
You've calculated the percentiles for each 100,000 points
and applied them for all 100,000 points. The points of interest.
So the one I took off here, my boss was bugging me about it
because I had the 50th percentile, again, the typical latency,
which is the median, not the mean.
I have to drive that home.
Flat, very low, not a lot of bearing on the scale.
So what happens is, I mean, you've got the areas of interest.
So whatever levels of nines you're interested in,
you can add to this view.
So if you take a million samples, you can have 2 nines, 3 nines, 4 nines, 5 nines,
typical, whatever you want to do.
And you can add it to this view to see, does it change throughout the run?
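A minimal sketch of that runtime QoS view, my own code rather than the talk's tooling: walk the latency stream in fixed sample windows, and for each window record the nines of interest plus that window's max latency, then plot each column against the window index.

```python
import numpy as np

def runtime_qos(latencies_us, window=100_000, nines=(3, 4)):
    lat = np.asarray(latencies_us)
    rows = []
    for start in range(0, lat.size - window + 1, window):
        w = lat[start:start + window]
        row = {'window_end_sample': start + window, 'max_us': float(w.max())}
        for n in nines:
            pct = 100 * (1 - 10 ** -n)
            row[f'{n}_nines_us'] = float(np.percentile(w, pct, interpolation='higher'))
        rows.append(row)
    return rows  # one point per window: plot 3-nines, 4-nines, and max against sample count
```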
So this is basically, this is running as long as the run
runs. So if we have a continuous run,
let's say like a database. So we have
Cassandra or RocksDB or something,
and it's running, and we want to measure
that QoS,
this view will just go as long as we want to
look at it for. Look at over a 24-hour
window. You can see what's happening.
And again, as you zoom in and out, you can lose
the modalities. And this is where I was told not to talk about this,
but this is where my work in the AI part is going.
So I'm actually looking for stuff.
It's in the noise here.
It's really interesting.
What is being held constant on the x-axis is 100,000 points.
Sample count.
Observation points. Rather than, say, n number of seconds that have elapsed. Correct.
And it's just as simple, and the view's not that different either,
because again, you're in steady state, so it doesn't, yeah.
But it's one of the reasons I stripped off the x-axis,
because I didn't want to say, it's time, because it's not.
It's samples.
Yes?
Yeah. Yeah.
Okay.
So essentially, say I'm taking 1,000 samples,
and if you're pushing 100,000 IOPS,
then your sample rate and your time are very tight.
Yeah, exactly.
As long as the pace of work is linear.
Well, and that's part of, again, looking at the IOPS level across the run being linear.
If it's not linear, then you really can't calculate a good, solid QoS across that.
Again, and you can tighten the window up.
Like in that first graph where I showed that first section being at a different performance level,
you could look just at that level, that window in this view,
and it would help you understand what's going on during that level.
I mean, again, my assumption is it's data alignment issues or something like that.
But, again, part of it is part of what this is all about is a guide to understand what the device is doing.
And it applies to pretty much any kind of QoS workload.
As long as you have the QoS data set, you can use this view to do some analysis.
Okay.
How am I doing on time?
I'm going to let you guys out early.
All right.
Now, this is four different drives.
Same workload. same test system,
showing across the test run, test duration is approximately the same.
Again, some are going to be a little faster than others.
The bottom right-hand corner drive is much slower than the rest,
so it took it quite a bit longer to do its run.
But what it shows is that there are artifacts of a
periodic nature that are starting to come out. And we're seeing
what's driving that periodic artifact.
You can see the drive over there. Very clearly something periodic
is going on. Off and on.
A clean block, a dirty block,
whatever that is, whatever's driving that.
Again, we can use the data from the time and the duration of the test or the number of IOs to go backwards and calculate
what actually we think's going on.
This drive here, very well-behaved, very flat performance profile.
You have a couple of long latency events.
But again, when you look at the histograms of these, or not the histogram, but the CDF version of this,
it doesn't show you, it doesn't make this clear.
I mean, it doesn't say, yeah, you've got a thing that happens on a cadence
or something that happens over on a long latency event for something like that.
And again, I did leave the 50th percentile on this
just to show that there is a variation
in the 50th percentile in this drive over here.
It's a client drive that's tested
in an enterprise environment.
And as we know, we have customers
that want to use cheap client drives
in an enterprise environment.
This is why you don't do that.
You don't do that.
If you want your consistency to look like that,
then okay, go ahead, use that cheap drive.
But that's part of what this slide has been used for
in the past is that, no, no, no, no,
you don't want to use that drive.
You need to use one of the other three.
Even the slightly wonky performance of drive number two
is not that big of a deal.
I mean, does this all make sense?
I mean, this view?
Because again, it answers that question.
What's happening in real time?
Or real I.O. in this case.
Okay.
Oops.
Too far.
That's the last slide.
Open your Christmas present early.
So this view
was created from
a device
that was running a 70/30
workload. We split
the workload. We split the IOs,
the read IOs and the write IOs
to say what's going on this way,
what's going on that way.
And what it shows
is that
from the new
views perspective,
here's all your numbers
from the traditionalists, you guys who want standard
deviation and that. You can calculate all
that.
But the real important one is down on the end, which is your
sample count. How many samples
did you collect in that run?
And then look at,
again, you have a big
modality here in the right, and
that's all going to cache. So most of your
write data is going to the cache on that device.
And there's a nice deterioration.
But you do have some long latency events that you may be interested in looking at.
And again, in this view, you can actually see those three long latency events, where they occurred in time.
So they're not clustered together.
They are spaced out.
They don't look modal.
They could be.
Again, we need more data to know that.
But on the other side of the fence, when we look at reads,
well,
we have a clear cadence going on over here.
There's something
in that device
from the read perspective
that is causing a long
latency event on a cadence.
There's some guesses as to what it is.
I don't want to venture a guess on it,
because again, it's not our drive,
it's somebody else's drive.
So I don't know what their firmware does,
but you guys can probably make some ideas
about what would happen on a cadence
during the read portion
that you wouldn't see in the write portion.
And again, you still see the modalities
for the TLC levels in this drive,
because again, it's a TLC drive.
You see, okay, now you can see the caching events,
the system level.
We have one event on that one,
which is faster than what the bus can transfer.
That's part of the clue.
But again, you still have these little modalities here,
which are again represented by the shift
and again this downward trend right here
to be able to see what's going on there.
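A minimal sketch of the read/write split for a mixed run like that 70/30 workload, assuming an FIO completion-latency log with the usual time, latency, direction, block size, offset columns, where direction 0 is a read and 1 is a write; latency units depend on the FIO version. Separate the two pools first, then run the same CDF or runtime views on each.

```python
import csv

def split_fio_clat_log(path):
    """Split an FIO completion-latency log into read and write latency lists."""
    reads, writes = [], []
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if len(row) < 3:
                continue
            latency = int(row[1])     # completion latency for this IO
            direction = int(row[2])   # 0 = read, 1 = write, 2 = trim
            if direction == 0:
                reads.append(latency)
            elif direction == 1:
                writes.append(latency)
    return reads, writes

# reads, writes = split_fio_clat_log('run1_clat.1.log')  # hypothetical file name
```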
So this view of QoS tells you so much more
about what's going on rather than saying,
I have a five nines of 289. And what do you
learn from that? Nothing. It's more than just a number. So this is where I come back to
my title and why it isn't on the title because I start off with the data science aspect.
Okay, questions. This is the last slide, so I can link this one up if you want to talk to it
or anybody wants to know what I'm thinking when I look at this stuff
or why I created this view.
How many samples are you reducing down into a point on the time baseline?
You're talking about this guy here?
So this guy here, if you look at these numbers,
this tells me that I did it on a 100,000 sample.
So basically that max represents your 100,000 sample.
So convert that into a nines number, that tells you how you are.
So basically three nines, four nines, and then that's a five nines,
but it's really the max latency in that particular sample window.
Yes? So the bottom chart, you can calculate real time,
at least as soon as you have 100,000 samples.
You can do the thing, the blue one,
it's the actual IOs themselves,
which you could block by.
The CDF is determined from the blue dots.
But it assumes that you have all of the blue dots before you calculate it.
Oh, yeah.
You can't calculate it without...
Yeah.
You can't.
You can't.
You go to a chart like this at the end of an analysis, once you have everything. You could generate this at runtime as well,
but you wouldn't have much data.
In other words, it wouldn't tell you much.
Yeah, it would stop like up here.
Yeah.
And you can do it.
And again, you can overlay the CDFs very clearly.
You can't really do that with the histograms.
It just looks like a big mess of dots when you do that.
But if you overlay them, you can kind of see how it changes in time.
And that's certainly a way that I have analyzed it before.
It just doesn't tell you that much.
The one that really tells you is this view. And again, taking this runtime view and doing further analysis on the runtime itself,
what's happening in those windows,
I think it's going to tell you a huge amount more about what's going on in the system.
And again, you've got to understand it is a system at the end of the day,
because we still have CentOS, we still have FIO.
We try to eliminate it all, but the fact is there's a suspicion
that the horizontal at the seven nines that I was showing before
might be a systemic anomaly, not a drive anomaly,
which then represents how are you testing?
You're going to have a big shift in your seven nines and above
if you test with, if that's the case,
if it's a systemic anomaly.
Better minds than I are working on that.
In the back.
How do you think this analysis would work or look on a larger scale,
if I tried it on a drive array,
or especially on a network storage device?
Oh, I think it would be kind of interesting,
because again, I was listening to some of the earlier talks,
and I go back to the CDF view.
The CDF view is the great equalizer.
It doesn't matter how many samples you have,
that CDF view is still the CDF view.
It's just where it's going to end at the bottom.
You can continue that down and keep going. So a runtime analysis might get really muddy,
so you see a lot of stuff going on,
and you might want to expand your view window or something like that.
But in the CDF plot, it's a great equalizer.
It's kind of cool.
CDF's a great view.
And again, the same thing happens with the histograms,
is it gets kind of muddy because everything starts to climb upwards.
So an ideal CDF view would be a straight line, either across or down.
Straight down, yeah.
It would just drop.
Yeah, and basically your QoS level would be at a single point,
or pretty close to a single point in time.
Follow the curve.
300 microseconds, and everything comes in at 300 microseconds straight down to the floor.
We've looked at it. It actually tells you some things
because you see the transitions paralleling each other.
So you can see things like that in the data.
Sure.
I would think you could, yeah.
Again, a lot of it is what you can pull out of the data.
So again, by going to this view of QoS,
where I give you all your stats,
I give you all
your nines, I give you your sample count and fries and everything you want with it. And then I go,
okay, so let's look and see what this actually is doing. So it gives you, and I think your
customers will benefit from seeing these different views to say, you know, my five nines might not
look that good, but it's being driven by this periodic behavior,
which we have traced back to journaling or something like that. We journal, and it's slow because of that. So we'll be able to have an answer for them.
And then they start to get okay with it
if they understand it's protecting their data or something like that, right?
And the nice thing is, again, from the CDF view,
you can see them all on the same page.
So you could actually see the array view
and then the individual drives, and then you could
see those contributing factors.
Alright.
Is that it? Question way
in the back.
Maybe not.
Really, is there
any way that you can
or
if I were to run this,
if I want to have this data on, say, a normal running system,
I can't really have millions and millions of data points.
Is there some good theory behind condensing this data to something manageable that is still useful?
Well, if you go back to the histogram view,
that might help you to go into that.
So you basically have buckets,
and you just keep adding to the bucket.
So it's not quite as...
There's nothing beyond just predefined buckets
that I can just go up and...
Or you can do it on an interval.
So in other words,
you could be sampling every million IOs, right?
And then you're saving it off.
A million IOs, save it off.
I mean, I don't know if that's practical in your environment.
I mean, it could be like a billion IOs or something.
I don't know.
But again, any time that you close the sampling window,
you also affect the test.
So you've got to be kind of careful about it.
So I generally like to take all my data at once
and then just break it up if I need to.
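A minimal sketch of that bucket idea, with edges that are illustrative rather than prescriptive: keep a fixed set of latency buckets and just keep adding to them as IOs complete, so memory stays bounded however long the system runs, at the cost of only knowing each latency to within its bucket's width.

```python
import bisect

BUCKET_EDGES_US = [50, 100, 200, 500, 1_000, 2_000, 5_000, 10_000, 50_000]  # last slot catches overflow
counts = [0] * (len(BUCKET_EDGES_US) + 1)

def record(latency_us):
    counts[bisect.bisect_left(BUCKET_EDGES_US, latency_us)] += 1

# Percentiles are then approximated from the cumulative bucket counts,
# just like reading them off the histogram view.
```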
Okay.
Thank you, everybody.
I think that puts us in time.
If you want to talk to me more, come on up.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to
developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.