Storage Developer Conference - #113: Latency is more than just a number

Episode Date: November 5, 2019

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 113. Okay, so the basics. This is about the data. It's not about the workloads. It's not about SSD, HDD.
Starting point is 00:00:51 It's not about system-level QoS. I don't want to talk about any of that stuff. Success criteria, not important. Just the data. How do we handle the data? What do we do with the data? And proper techniques in looking at the data. So I don't know who left this here.
Starting point is 00:01:08 I guess it was me. Come on. Dry crowd. Here we go. Last session. Come on. Stay with me. So QoS.
Starting point is 00:01:18 So QoS is something that I like to put in terms of a pizza delivery. Everybody can relate to that. 30 minutes or less, it'll be there. Then you get that one time out of 100 where it doesn't make it in that 30 minutes or less. The customer does not care if it arrived at 32 minutes or two days later. The customer is not happy. It doesn't fall within their QoS envelope.
Starting point is 00:01:45 So their QoS envelope is below 30 minutes. And again, like I said, this is very basic. So that's where I start. Anyone have questions about the basics? Oh, please don't take a picture of the pizza delivery guy. Please, please, no. Let's get a better graph up there. Okay, QoS traditional view.
Starting point is 00:02:02 This is a better way to look at it. So again, in the title that I didn't complete, it's more than just a number. So classically, we get numbers. We get a table of numbers. We get a particular number. And then sometimes when we have a customer that's a little more savvy, they want to look at a histogram view. And they say, okay, we bin it up and we look at these histograms and then we say, okay, where do these QoS numbers lie in this? And you look at that and you kind of go, yeah, but what does it all mean?
Starting point is 00:02:37 I mean, are we talking about a number? Are we talking about a five nines number at 153 way out here? Sure. What does that mean? It doesn't really have a lot of meaning at the end of the day. So what I'm trying to do, or what I tried to do very early on, was try to wrap meaning around what we're doing. So not just saying, let's look at a number.
Starting point is 00:03:00 And a lot of times my clients inside the company would come to me and say, we're not hitting the number, why? Well, I'd look at this histogram and I'd say, something's going on out there. I don't know what that is, but there's something going on that's pushing our events way out. Well, can you help us? Can you help the firmware guys fix it? Can you help someone analyze this? So I decided to take this to the next level. So instead of saying just like,
Starting point is 00:03:28 we have a certain number of IOs that didn't make it in the count, a certain number of IOs that made it in the count, we say, okay, so how do we look at these IOs? What do these IOs mean when we're actually getting a QoS number out from our particular customers. I'm not going to say names, but there's one notorious one that wants nine nines QoS with a very low sample count. And I said, can't be done. They said, no, they're generating numbers. So part of this discussion is to clarify
Starting point is 00:03:58 what they actually did and what they're doing wrong. And it's hard to go into it, especially a big customer like that, and say, you're wrong. But I got to do that. It was fun. Anyhow, so any questions on the traditional view? You guys are all familiar with this, I'm assuming?
Starting point is 00:04:17 Yeah, okay, good. So the attempt was to make the measure better, something that we can talk about, something that we can say, okay, rather than improving our 5-9s number, what do we want to look for? So we went to trying to say, okay, well, why is QoS not consistent? Well, first of all, the IOPS measurement. They said, well, what does the IOPS measurement have to do with our QoS? It's like, well, if our IOPS aren't stable, and we haven't clarified that, then our QoS isn't stable. So if you run this in this first sample window, which this sample window, I believe, is a 10-minute window per sample. So this is a long
Starting point is 00:05:01 running IOPS stream. You'll get a different number. And if you amortize that into, let's say, two sections, you'll get yet a different number again. So the QoS numbers are variable. So, again, it's not a good measurement. We need to have the system at steady state for the workload. And we have to look at the IOPS. We have to care about the IOPS in this case. So percentiles, you've got to have good data.
Starting point is 00:05:36 So this is part of the IOPS discussion, right? So we bring in that IOPS. You have that first section of fast IOPS. They're not going to give you a good steady state number. They're going to give you your QoS for that interval. But that's it. It doesn't really help you understand what's going on when you reach steady state. Most of our customers who are running QoS are interested in steady state performance.
Starting point is 00:05:58 They're not interested in what our burst performance is at the beginning. Not so much. So we have to have reliable data up to the percentile of interest. What does that mean? So when you think about it, let's say we have six nines of numbers, right? Six nines means we have a million samples
Starting point is 00:06:20 to get to a six nines level. So each one of the nines represents a zero, essentially, right? So that would give me one data point that I could base my six nines number on. That's not really a great way to do it unless your data is really consistent. And as you guys know in the storage industry, our latencies are never really consistent. I mean, they can be over here, they can be over there, but it's not the way it's done
Starting point is 00:06:48 in the real world. And understanding that, you've got a workload. So when you're looking at a blended workload versus a read QD1 workload, the profile's going to be completely different. What you get out of it's going to be
Starting point is 00:07:04 completely different as you change the workload. So one of the things that some of our customers do is they change the workload midstream. They'll say, okay, let's do this. Okay, now we're going to do this, and now we're going to do that. Each section in there needs to be quantified. In other words, you don't just look at one particular piece of the puzzle. So by understanding the workload,
Starting point is 00:07:26 and again I come back to the one data point. One data point can't drive your numbers. It's crazy. And actually there is some empirical data that we worked out as well as theoretical data that we've worked out to be able to say how much data you actually need to make that measure. Any questions so far?
Starting point is 00:07:51 Super simple. You guys are quiet. I'm going to bring some alcohol in the room or something. Get you guys going. Okay, so... My first step into this was to say, we need mathematical significance. We can't just do this. We can't just say, hey, there's my nine nines number
Starting point is 00:08:12 because I took a million samples. Here you go. No, it doesn't work like that. This is not that kind of data. The central limit theorem, one of my favorite theorems. There's college courses on it if you guys want to go that route and find out more about it.
Starting point is 00:08:26 But there's a little thing called root n, which comes into play, that says that if you're looking at a sample and it's not normal data, which this is not normal data, the closest approximation I have is a log normal set of data, but it's not even that
Starting point is 00:08:45 If you want to start talking about normalization, people want to do averaging. People that do averaging in QoS: wrong, you're doing it wrong. You need to be using the 50th percentile, the median. That is consistent. Anyone want to know why?
Starting point is 00:09:06 Yes, Dave, we want to know why. Okay, so if you think about it this way, if you have a pool of data, and think about that histogram I had up there in the very beginning. We have one outlier that's way, way, way out there. That person whose pizza shows up two days after it was supposed to be delivered. That's going to skew your average significantly. If you looked at 50th percentile, that wouldn't move. It'd be right
Starting point is 00:09:36 there, in the same place, no matter where that last latency came in. Again, you don't want your data to depend on a single data point. So averages are bad. Standard deviation is bad. And I've got customers that are asking for both. So I have no choice but to give them what they're asking for. But then I also give them the right thing, too.
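To make that concrete, here's a minimal sketch, not from the talk, with made-up latencies: one two-days-late delivery drags the mean and the standard deviation, while the median stays put.

```python
import numpy as np

# Made-up completion latencies in microseconds: 99 well-behaved IOs
# plus one "pizza shows up two days later" outlier.
latencies = np.array([100.0] * 99 + [250_000.0])

print(np.mean(latencies))            # ~2599 us, dragged way up by one sample
print(np.std(latencies))             # huge, also driven by that single event
print(np.median(latencies))          # 100 us, unmoved by the outlier
print(np.percentile(latencies, 50))  # same thing as the median
```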
Starting point is 00:10:01 So, again, this is part of saying this is more than just a number. You have to actually measure it correctly. And you have to measure the data correctly, not the workload. I don't care about what you're doing there. Make sense? Okay, cool. For the latencies that I looked at, we need 100 times more samples beyond the bare minimum.
Starting point is 00:10:23 So what that means is, in that million samples that we require for a 6-9 number, we actually have to measure 100 million samples to be able to make that number reliable. It doesn't quite follow the root n. Root n would say that we would need 1,000 times more samples, because the root of a million is going to get you to 1,000. And that will just increase your accuracy. But generally speaking, the good news is
Starting point is 00:10:50 the storage is generally consistent, so you don't have to follow the root N completely to still get good data out. And like I said, when I got customers that are throwing me all kinds of numbers, all kinds of workloads, all kinds of thoughts on it. You kind of go, what is it that we really need to do to standardize this? And again, my
Starting point is 00:11:13 adamant opposition to using numbers that have less than that 100x has got me into a lot of trouble. But I am very adamant that you cannot have a good repeatable number in that window. And I've seen numbers vary as much as 50%. It's pretty bad. So again, the sample size matters. Questions? Central limit theorem? Anything? Okay, good.
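As a rough sketch of the arithmetic behind that rule of thumb (the function and the 100x default are just an illustration of what's described here, not the speaker's tooling):

```python
def samples_needed(nines: int, oversample: int = 100) -> dict:
    """Rule-of-thumb sample counts for a given nines level.

    At n nines, only about 1 in 10**n IOs lands beyond that percentile,
    so 10**n samples is the bare minimum that puts even one IO out there.
    The rule of thumb described in the talk is ~100x that minimum for a
    repeatable number; a strict root-n argument would ask for even more.
    """
    minimum = 10 ** nines
    return {
        "percentile": 100.0 * (1 - 10 ** (-nines)),
        "bare_minimum_samples": minimum,
        "recommended_samples": oversample * minimum,
    }

print(samples_needed(6))  # six nines: 1e6 bare minimum, 1e8 recommended
print(samples_needed(9))  # nine nines: 1e9 bare minimum, 1e11 recommended
```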
Starting point is 00:11:48 Go to the next one. So this is one of my early studies. I've done a few of these studies where we took a single-SKU drive. We took three, there's three of them. All same assembly line,
Starting point is 00:12:04 same day of manufacture, sequential serial numbers. Pretty much the same drive. And I ran one of them three times. And then I have the other two up there. And what I looked at is, again, standard deviation. Bad.
Starting point is 00:12:19 I was bad, bad, bad. I shouldn't have used standard deviation. But it is a common denominator. People can understand, at least at a certain level, that basically if you look at those five runs that I did on this drive, and I did 3.2 million samples, so they ran for a period of time to give me 3.2 million samples,
Starting point is 00:12:39 and at three nines, it's pretty tight. It's well less than 1%. Four nines, we're starting to cusp on that 100x over sample. You're starting to deviate, but not a huge amount. We're still under that 1% margin, which I like to see from an error standpoint. Yes? So when you say I run them so many times, you're talking about drives, right?
Starting point is 00:13:04 Are you just issuing IOs to the drives? Yes. Yes. Yeah, we're using FIO to... We're monitoring the completion latencies. Yes. So the time from the submission comes in to the time the data's ready. Pretty straightforward. FIO, standard tool. Linux-based,
Starting point is 00:13:25 good stuff. It's a good product. Again, nothing against... Yes? I understand when you say the various QoS levels at the bottom, and I'm assuming you're still
Starting point is 00:13:38 talking about the 3.2 million samples. Correct. And you've somehow broken those samples into an independent number of points, or... Correct. It's only changing across that particular QoS number. So that's how much the QoS number is changing. So in other words, each one of the five runs is going to generate a QoS number for 3 9s, 4 9s, 5 9s, 6 9s. And so that's the deviation of those 9s levels.
Starting point is 00:14:15 That's what I'm showing here. No problem. Any other questions? Okay. So again, as we get past that level of 100x oversample, things start to go apart. They start to deviate. So this now becomes a repeatability issue.
Starting point is 00:14:35 This is, we can't get a repeatable number out of the same drive, or a drive that's the same SKU, on the same test system, in the same test environment. It's starting to deviate. And then again, when we have three samples, which are the outliers we have, so basically that 6-9s number there is being driven by three samples (3.2 million samples times one in a million is about three). So there's three samples that are outside of that 6-9s number. Does that make sense? And that was a little confusing.
Starting point is 00:15:03 So if you think about it, there's only three samples that are driving that number. And that's one of the reasons why there's that huge variability. Because these drives, you can have a long event, you can have a short event, but if those three events are somehow coupled, which sometimes they are, that can move that number around. And then let's say one of those events didn't happen. Then you move where that
Starting point is 00:15:29 6 9s number is, where those three samples are determining that 6 9s number. Does that make sense? I'm getting yeses. Okay, cool. You guys are sticking with me. Okay. So,
Starting point is 00:15:44 in this look, I mean, I've done this a number of times. This is a real simple view that illustrates how that deviation starts to occur. So the more samples we can rely on, the better data we're going to get. So in this case, this is a single drive. I don't remember what the workload was. 7.8 million samples. By tuning how you do the percentile calculation, you can change things.
Starting point is 00:16:23 This is where that customer who wanted nine nines from a small sample size actually came up with a nine nines number. This is how it works. So basically, there's a library in Python called NumPy. I don't know if you guys are familiar with NumPy or not. It does the default percentile calculation, which is a linear fit. So it linearly fits the percentiles to the data set. What that means is that it doesn't necessarily have a latency when it's creating that linear fit.
Starting point is 00:16:56 It just says, my nines are somewhere over there, somewhere right here. Here's where it's supposed to be. Whether there's latency there or not, it doesn't care. It's going to generate a number. And that's what happened in this bottom run right here. So the default linear fit on this set of data, it's the same set of data, all three, all the same set of data. It says, okay, my percentile is right here.
Starting point is 00:17:24 Here's my 5 nines, here's my 6 nines. And oh, and by the way, I'm going to generate a 7 nines, 8 nines, and 9 nines number, even though we've only got 7.8 million samples. So it's going to lay somewhere over there. So what it does is it says, okay, so if we look here on the latency map, we've got a max latency at 3346. The next lowest one, you can't really see the bars there, it's around, oh wait, I can tell you what it would be. It would be about here. No,
Starting point is 00:17:55 it's going to be higher than that. So approximately 2700 is the next lower latency. So what happens is it's saying, okay, between that last latency in here, the percentiles say it should land about here. And the next one should land about here. So it fills in the blanks. So that's the way default handles it. So now you have to say, don't do that. Associate my nines level with actual latencies that are occurring.
Starting point is 00:18:29 So when you change that to the next higher, and this is, again, a switch that you can set in NumPy. It's a very simple one. Say, give me the next higher one. The behavior is quite different. So starting even at five nines, you can see that there was no latency at 453, but there is one at 455. So now our 5 nines has moved a little bit. Again, not hugely significant. And
Starting point is 00:18:55 again, if you look at my rule of 100x over, 100x is going to lie right around in here somewhere. So yeah, you're starting to get right at that cusp where things are going to deviate. Six nines? A little more. I mean, three. Okay, what's three between friends? We're measuring microseconds. It's okay.
Starting point is 00:19:16 But now here's where it falls apart. So the next higher, at seven nines, it's going to take this number and say, what's the next latency? And that's going to be our max latency. So now it's going to return max latencies in the buckets that it doesn't have enough data for. And I have set my code to ignore that. If you have something that's equal to max latency, don't print it, because you know it's wrong. There's only one condition where it's right: if you actually have exactly the right number of IOs, which
Starting point is 00:19:44 rarely occurs, but it does occur sometimes. Any questions on this? Wow, okay. You're going to make this time go fast. You guys must have airplanes to catch or something. Maybe? Okay. So again, going back to the customer and saying you're calculating it wrong is not super popular,
Starting point is 00:20:10 but yet we do have the right way to do it, and I don't know what their code is when they do their calculation, so I can't tell them how to change their code. I can just give an example and say this is NumPy, and here's how NumPy works. There's a fair possibility they're using NumPy to do their calculations when they're calculating their percentiles. Okay.
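Here's a minimal sketch of the NumPy behavior being described, on synthetic data; recent NumPy spells the switch method=, older versions call it interpolation=, and the 3346 max is just borrowed from the slide for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic latency pool: 7.8M roughly lognormal samples plus one long outlier
# standing in for the max latency on the slide.
lat = rng.lognormal(mean=4.6, sigma=0.3, size=7_800_000)
lat[-1] = 3346.0

q = [99.999, 99.9999, 99.99999, 99.999999]  # 5, 6, 7, 8 nines

# Default linear interpolation: happily returns values in between the last
# observed latencies, whether or not any IO ever completed at that latency.
print(np.percentile(lat, q, method="linear"))

# 'higher' snaps to the next actual observed latency instead; once you run
# past the data it just keeps returning the max latency.
print(np.percentile(lat, q, method="higher"))
```

With the 'higher' method, anything that comes back equal to the max latency is exactly the signal described above: the data ran out before the nines did.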
Starting point is 00:20:35 So now we go on to a better view of QoS. And again, better is always subjective. Again, it's my opinion. It's something that we look at in a different way. So what we do is we create what's called a 1-CDF plot. This is one minus the cumulative
Starting point is 00:20:54 distribution function. You basically take the entire pool of data, you bucketize it like we're doing the histograms, then you take those buckets and you do a 1-CDF plot of that data. And what that gives you is a really neat
Starting point is 00:21:10 line. Again, depending on the workload it can be more interesting or not. But essentially it allows you to see what the driving factors are in the levels of 9. In this case, when you're looking at this plot, and it took me a while to teach people how to read it,
Starting point is 00:21:29 basically you take the scientific notation over here, and that is your level of 9s. So at this point on the graph, we are at, again, this is log scale, we are at approximately 300, 289, where this crosses the five nines barrier. And as you come down the graph, you see that we start to get jagged down here, and then it ends here. This is our max latency for this particular run. But there's not a lot of data in here, so this isn't super clear but you do have these transitions
Starting point is 00:22:06 You have these layers of horizontal movement, which tells you that there's not a lot going on at that nines level from an IO perspective. So from a histogram perspective, you would see that there would be very sparse data in that portion, and so there's nothing really driving change, or there's not much data driving change, at this point.
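A minimal sketch of how a 1-CDF view like this can be built from a pool of completion latencies (synthetic data and my own variable names, just to show the construction):

```python
import numpy as np
import matplotlib.pyplot as plt

def one_minus_cdf(latencies_us):
    """Return (sorted latency, fraction of IOs at or above that latency)."""
    lat = np.sort(np.asarray(latencies_us))
    n = lat.size
    tail = 1.0 - np.arange(n) / n   # 1.0 down to 1/n, so the log axis never hits zero
    return lat, tail

# Synthetic pool standing in for a real completion-latency capture.
rng = np.random.default_rng(1)
lat = rng.lognormal(mean=4.6, sigma=0.35, size=1_000_000)

x, y = one_minus_cdf(lat)
plt.semilogy(x, y)   # 1e-3 on the y-axis is three nines, 1e-5 is five nines, etc.
plt.xlabel("completion latency (us)")
plt.ylabel("1 - CDF (fraction of IOs at or above)")
plt.show()
```

Horizontal runs in the curve are the sparse regions just described; near-vertical drops are latencies that are tightly grouped.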
Starting point is 00:22:28 But some trouble can happen with this. And this is where we start to see some value. Also, on the other end of the chart, near the top, the other interesting thing is you can see distinct stair-step levels. Anybody want to venture
Starting point is 00:22:44 a guess what those stair-steps are about? Anybody? Bonus points, Friday afternoon. No? No, it's TLC. You're actually looking at reads from each of the different layers of TLC. It's a TLC drive. So basically we have three major stair-steps, and then a stair-step here, which is a retry level. So basically you're seeing the behavior of the NAND in the device. Cool stuff. So vertical slope means not really an updated NAND resolution? No, what vertical means, things are going actually quite well.
Starting point is 00:23:24 So verticals are in your favor. So if you look at your 2-9s and your 4-9s in this graph, they're real tightly grouped. That's a good thing. That's a good thing, not a bad thing. So here's how it looks, no, that's not the same graph, it's a similar graph, here's how it looks when mapped on top of the histogram.
Starting point is 00:23:45 So basically we take the histogram. In the histogram, you can see those stair steps now. They're clearly apparent. And then you can also see some little sub-stair steps, which are those first-level retries that are happening very quickly. So things aren't adding up. Is that the right number? It takes a certain number of microseconds to get that.
Starting point is 00:24:06 And again, you can get a lot of G2 on what the drive's doing by just looking at this view. And again, you look at the levels of retries that are occurring, which are kind of out here in time. But then, what happens in the histogram is your data
Starting point is 00:24:23 breaks down. You don't have a lot you can see. But in this case, you can see from the CDF that you actually have some level of function. You have some deterioration. You have a strong horizontal line right here showing that there's just nothing going on in this section. And you've got this little modality out at the end, which you can see represented by that wonderful level of nines in there, saying that, yeah, something really is happening at that point. We don't have enough data to make it really clear. We need more data. Again, that's always the desire from a data science perspective, to be
Starting point is 00:25:04 able to have good, solid data for this. Any questions on this view? Okay. I did get some questions when I showed the slide set before about what's going on over here. This is cache, so we're getting cache hits.
Starting point is 00:25:24 We're getting some system anomalies. Again, you have to believe the data that you look at. Because again, FIO is only so accurate. CentOS is only so accurate. So we have to deal with that. So that's not causing... It's still horizontal there
Starting point is 00:25:39 just because there's so little... Yeah, there's a very small amount of data. And again, very few hits. So you're getting some level, I mean, 10 microseconds. So between 10 and 20 microseconds, you've got some modality going on here. So that's probably going out to the drive, hitting the cache, and then coming back. If you could see the view where there's one more down, you'd see the OS having a few cache hits too.
Starting point is 00:26:04 So you're hitting at the different levels of cache. But again, it's kind of cool because you can see what's going on. It's kind of neat. Alright, so this is a little more interesting. This is a different drive that I ran consecutively 10 times, but I only reported on the first five runs on the drive. And looking at what happens in the CDF plot and how that CDF plot changes. So you see in one run, we have a displacement. And again, displacements, when you have good solid data representing an event, something happens.
Starting point is 00:26:49 So the assumption at my level, and again, I'm just one of the guys, is we hit a read scan. So we have some read scan going on, and it's taking up some small amount of bandwidth from our low level of nines, and it creates a small displacement. Again, good data, more data, better.
Starting point is 00:27:09 And it tends to be more and more repeatable. But then, as you go, as you look at these runs, and you look at this horizontal event over here, something funky happens, and this horizontal line crosses our seven nines level. So one of the runs, we actually, the drive is quicker. So this is good. But the line has dropped below the seven nines.
Starting point is 00:27:40 What this does, in the classic scenario where we have a table, it creates huge variability. So a customer would say, I tested it once, I got 757. Tested it the next time, I got 2308. What's wrong with your drive? There's nothing wrong with your drive. It's just that there's an event that goes on in this section
Starting point is 00:28:07 which drives the typical behavior, which didn't occur over here. So now it looks better artificially. But the point is that the customer says, how do we improve that? Well, the way to improve that is to bring this line down just a little bit. That's all you've got to do. You've got to figure out what you can do to push off whatever it is, the event that's driving that long horizontal, and you'll get your nines back. You'll get your seven nines back. So, you know, again, this drive has the capability of doing even better than the 757 because
Starting point is 00:28:41 it doesn't cross until this point, but we're over here when we kind of end our deterioration. So we should be able to get that number down below that point there. By the way, these are not Micron drives, necessarily. There might be a couple thrown in the mix. So some people said, oh, well, we see your Micron drive does this, and I say, ha, it's not a Micron drive. I work very closely with a competitive analysis group and they feed me data and then I just analyze it. And then the last part is showing the variability at the end
Starting point is 00:29:12 because, again, the lack of data doesn't allow us to have a really clear picture of what's going on at the end there, and again, it shows us that we really don't have a repeatable number when we get to those high levels of nines for the data set that we've allocated.
Starting point is 00:29:31 Well, my boss likes me to say Micron. No, no, no, no. And again, this is not about the drives. This is about the data. So it's just about the data. You should say it's multiple variations of different drives. Of competitor X drive. Or competitor S, or competitor I, or competitor H,
Starting point is 00:29:52 or whoever it would be. Yes? Yeah? Is there any expectation that the runs are identical from a workload perspective, or they're just... Defining a run is basically saying these are all the data points that were collected.
Starting point is 00:30:09 I don't know what was necessarily the workload that was there for the run. So the runs are identical. So they're done on the same machine with the same test script, same day. So, you know, a lot of environmental variables. Again, you do have multiple runs on this drive, so the first pass might create something,
Starting point is 00:30:29 again, like we saw in that first displacement. Maybe some history being built up based on the past. Yeah, yeah. And they're generally close to fresh out of box. They're generally new drives. Is there any kind of a reset happening for the drives between each pass? No.
Starting point is 00:30:48 We are just running them. So the idea... How do you distinguish between the first pass and the second pass? It's just starting the workload again. It restarts the workload, and the workload has actually been occurring previously. So in other words, there's a preconditioning that happens,
Starting point is 00:31:04 and so basically, steady state, again, is part of my requirement to say that those IOs are at a steady stream. And then we can, I'll selectively collect the consecutive runs. And when you're defining a run, the runs themselves are not defined
Starting point is 00:31:20 as having randomization inside of the run itself. No, no. There's no re-randomization. We don't re-seed or anything like that. So we basically let FIO use its default. We're doing no map. I'm sorry.
Starting point is 00:31:37 What do they call it? I forgot. I'm not an FIO expert. What I'm trying to think about here is that if you assume that you had 100 files, I'm just trying to define a run, right? You had 100 files, you had 100 threads, and each of those threads was responsible for reading some location on the file
Starting point is 00:31:56 for each of their read activities that was going to be executed. So we could use this view for each one of those runs. Yeah, yeah. So this one, there's no file system, we're just doing direct block access. So we're not resetting any file system structure. We're trying to eliminate everything. We're trying to go direct. Correct. Yeah, yeah.
Starting point is 00:32:25 So you're doing direct device level. Correct. You're doing direct device. Yeah. Yeah, we're using direct. So again, talking about workloads, but yeah, that's it. Okay.
Starting point is 00:32:33 I understand. But you have to have some idea of what... Sure. Where all in the system you could have an introduction of... Variability. Sure. Variability.
Starting point is 00:32:47 And this is, like I said, this is one of the powers of this view. You can see how easily you've got five runs lined up on one view. You could probably put 30, 40, 50 runs on there and see where your variability occurs. So if you wanted to do something like that, this view would still work for you. In terms of the time, if you will, that the event occurred, if we were to take just the last, I don't know, 100 events, the highest 100 events, and plot them on when in the run
Starting point is 00:33:26 those events occurred. Would they be occurring at the same exact time or would they be scattered across the duration of the run? Have I got a view for you? Have I got a view for you? I was there
Starting point is 00:33:43 three years ago asking that same question: what's going on in time? Because again, this doesn't really have a factor of time in it. This is just latency percentages, right? We're not looking at time. So we have a thing called runtime QoS. So runtime QoS, again, thank you for being my intro on that, I appreciate that. Runtime QoS, what we do is we say, okay, let's take a million data points. Some of these drives can hit a million IOs,
Starting point is 00:34:15 we can generate a whole lot of data really quickly, and now we can say, okay, so at that point in time, for those million IOs, let's plot our nines of interest, whatever they may be. In this case, it's three nines and four nines. And then whatever our max latency is. So that one longest latency, again, you don't want to depend on that number, but it does show a trend when it's cumulative.
Starting point is 00:34:43 So if you look at these events, and again, this scale is not time in this case. It could be. We could do it in time. But I did it in samples. So every million samples, we get a drop of data. And you can look at, you can see
Starting point is 00:35:00 how our 3-9s and 4-9s are climbing here, and we get this long latency event, which is probably maybe a block race assignment? Maybe? So the x-axis on this is samples. So basically, each one of these data points is,
Starting point is 00:35:15 in this case, 100,000 samples. So its max latency is the max of that 100,000-sample window. I'm assuming time is also there. Time is very similar. Yeah, so every 100,000 points, because you have taken the 99...
Starting point is 00:35:36 I've calculated all the percentiles. You've calculated the percentiles for each 100,000 points and applied them for all 100,000 points. The points of interest. So the one I took off here, my boss was bugging me about it because I had the 50th percentile, again, the typical latency, which is the median, not the mean. I have to drive that home. Flat, very low, not a lot of bearing on the scale.
Starting point is 00:36:03 So what happens is, I mean, you've got the areas of interest. So whatever levels of nines you're interested in, you can add to this view. So if you take a million samples, you can have 2 nines, 3 nines, 4 nines, 5 nines, typical, whatever you want to do. And you can add it to this view to see, does it change throughout the run? So this is basically, this is running as long as the run runs. So if we have a continuous run,
Starting point is 00:36:27 let's say like a database. So we have Cassandra or RocksDB or something, and it's running, and we want to measure that QoS, this view will just go as long as we want to look at it for. Look at it over a 24-hour window. You can see what's happening. And again, as you zoom in and out, you can lose
Starting point is 00:36:43 the modalities. And this is where I was told not to talk about this, but this is where my work in the AI part is going. So I'm actually looking for stuff. It's in the noise here. It's really interesting. What is being held constant on the x-axis is 100,000 points. Sample count. Observation points. Rather than, say, n number of seconds that have elapsed. Correct.
Starting point is 00:37:10 And it's just as simple, and the view's not that different either, because again, you're in steady state, so it doesn't, yeah. But it's one of the reasons I stripped off the x-axis, because I didn't want to say, it's time, because it's not. It's samples. Yes? Yeah. Yeah. Okay.
Starting point is 00:37:27 So essentially, say I'm taking 1,000 samples, and if you're facing 100,000 IOPS, then your sample rate and your time are very tight. Yeah, exactly. As long as the pace of the work is linear. Well, and that's part of, again, looking at the IOPS level across the run being linear. If it's not linear, then you really can't calculate a good, solid QoS across that. Again, and you can tighten the window up.
Starting point is 00:37:57 Like in that first graph where I showed that first section being at a different performance level, you could look just at that level, that window in this view, and it would help you understand what's going on during that level. I mean, again, my assumption is it's data alignment issues or something like that. But, again, part of it is part of what this is all about is a guide to understand what the device is doing. And it applies to pretty much any kind of QoS workload. As long as you have the QoS data set, you can use this view to do some analysis. Okay.
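A minimal sketch of that runtime QoS idea: chop the completion stream into fixed sample-count windows and report a few nines plus the max per window. The window size, nines levels, and synthetic stream here are just examples, not the actual tooling.

```python
import numpy as np

def runtime_qos(latencies_us, window=100_000, nines=(99.9, 99.99)):
    """Per-window percentiles and max, indexed by sample count rather than wall time."""
    lat = np.asarray(latencies_us)
    rows = []
    for start in range(0, lat.size - window + 1, window):
        chunk = lat[start:start + window]
        row = [start + window]                                        # window end, in samples
        row += [np.percentile(chunk, q, method="higher") for q in nines]
        row.append(chunk.max())                                       # longest IO in the window
        rows.append(row)
    return np.array(rows)

# Synthetic steady-state stream with a few sprinkled long-latency events.
rng = np.random.default_rng(2)
stream = rng.lognormal(mean=4.6, sigma=0.3, size=2_000_000)
stream[::450_000] += 3_000.0

for end, p3n, p4n, mx in runtime_qos(stream):
    print(f"samples={int(end):>9,}  3-nines={p3n:7.1f}us  4-nines={p4n:7.1f}us  max={mx:8.1f}us")
```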
Starting point is 00:38:34 How am I doing on time? I'm going to let you guys out early. All right. Now, this is four different drives. Same workload, same test system, showing across the test run; test duration is approximately the same. Again, some are going to be a little faster than others. The bottom right-hand corner drive is much slower than the rest,
Starting point is 00:38:58 so it took it quite a bit longer to do its run. But what it shows is that there are artifacts of a periodic nature that are starting to come out. And we're seeing what's driving that periodic artifact. You can see the drive over there. Very clearly something periodic is going on. Off and on. A clean block, a dirty block, whatever that is, whatever's driving that.
Starting point is 00:39:34 Again, we can use the data from the time and the duration of the test or the number of IOs to go backwards and calculate what actually we think's going on. This drive here, very well-behaved, very flat performance profile. You have a couple of long latency events. But again, when you look at the histograms of these, or not the histogram, but the CDF version of this, it doesn't show you, it doesn't make this clear. I mean, it doesn't say, yeah, you've got a thing that happens on a cadence or something that happens over on a long latency event for something like that.
Starting point is 00:40:04 And again, I did leave the 50th percentile on this just to show that there is a variation in the 50th percentile in this drive over here. It's a client drive that's tested in an enterprise environment. And as we know, we have customers that want to use cheap client drives in an enterprise environment.
Starting point is 00:40:25 This is why you don't do that. You don't do that. If you want your consistency to look like that, then okay, go ahead, use that cheap drive. But that's part of what this slide has been used for in the past is that, no, no, no, no, you don't want to use that drive. You need to use one of the other three.
Starting point is 00:40:45 Even the slightly wonky performance of drive number two is not that big of a deal. I mean, does this all make sense? I mean, this view? Because again, it answers that question. What's happening in real time? Or real I.O. in this case. Okay.
Starting point is 00:41:06 Oops. Too far. That's the last slide. Open your Christmas present early. So this view was created from a device that was running a 70/30
Starting point is 00:41:20 workload. We split the workload. We split the IOs, the read IOs and the write IOs to say what's going on this way, what's going on that way. And what it shows is that from the new
Starting point is 00:41:37 views perspective, here's all your numbers from the traditionalists, you guys who want standard deviation and that. You can calculate all that, but the real important one is down on the guys who want standard deviation and that. You can calculate all that. But the real important one is down on the end, which is your sample count. How many samples did you collect in that run?
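A minimal sketch of that read/write split, assuming each completion is tagged with its direction; the names and the synthetic 70/30 data are mine, not the tool behind the slide.

```python
import numpy as np

def qos_by_direction(latencies_us, directions, nines=(99.9, 99.99, 99.999)):
    """Per-direction percentile summary for a mixed capture.

    `directions` holds one 'r' or 'w' tag per completion. The sample count is
    reported alongside, since it bounds which nines levels are even meaningful.
    """
    lat = np.asarray(latencies_us)
    ddir = np.asarray(directions)
    summary = {}
    for d in ("r", "w"):
        sel = lat[ddir == d]
        summary[d] = {
            "samples": int(sel.size),
            **{f"{q}%": float(np.percentile(sel, q, method="higher")) for q in nines},
            "max": float(sel.max()),
        }
    return summary

# Synthetic 70/30 read/write mix: reads from media, writes mostly hitting cache.
rng = np.random.default_rng(3)
ddir = rng.choice(["r", "w"], size=1_000_000, p=[0.7, 0.3])
lat = np.where(ddir == "r",
               rng.lognormal(4.6, 0.3, size=ddir.size),
               rng.lognormal(3.0, 0.5, size=ddir.size))
print(qos_by_direction(lat, ddir))
```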
Starting point is 00:41:52 And then look at, again, you have a big modality here in the right, and that's all going to cache. So most of your write data is going to the cache on that device. And there's a nice deterioration. But you do have some long latency events that you may be interested in looking at. And again, in this view, you can actually see those three long latency events, where they occurred in time.
Starting point is 00:42:16 So they're not clustered together. They are spaced out. They don't look modal. They could be. Again, we need more data to know that. But on the other side of the fence, when we look at reads, well, we have a clear cadence going on over here.
Starting point is 00:42:35 There's something in that device from the read perspective that is causing a long latency event on a cadence. There's some guesses as to what it is. I don't want to venture a guess on it, because again, it's not our drive,
Starting point is 00:42:51 it's somebody else's drive. So I don't know what their firmware does, but you guys can probably make some ideas about what would happen on a cadence during the read portion that you wouldn't see in the write portion. And again, you still see the modalities for the TLC levels in this drive,
Starting point is 00:43:10 because again, it's a TLC drive. You see, okay, now you can see the caching events, the system level. We have one event on that one, which is faster than what the bus can transfer. That's part of the clue. But again, you still have these little modalities here, which are again represented by the shift
Starting point is 00:43:31 and again this downward trend right here to be able to see what's going on there. So this view of QoS tells you so much more about what's going on rather than saying, I have a five nines of 289. And what do you learn from that? Nothing. It's more than just a number. So this is where I come back to my title and why it isn't on the title because I start off with the data science aspect. Okay, questions. This is the last slide, so I can link this one up if you want to talk to it
Starting point is 00:44:06 or anybody wants to know what I'm thinking when I look at this stuff or why I created this view. How many samples are you reducing down into a point on the time baseline? You're talking about this guy here? So this guy here, if you look at these numbers, this tells me that I did it on a 100,000-sample window. So basically that max represents that 100,000-sample window. So convert that into a nines number, and that tells you how you're doing.
Starting point is 00:44:38 So basically three nines, four nines, and then that's a five nines, but it's really the max latency in that particular sample window. Yes? So the bottom chart, you can calculate real time, at least as soon as you have 100,000 samples. You can do the thing, the blue one, it's the actual IOPs themselves, which you could block by. The CDF is determined from the blue dots.
Starting point is 00:45:12 But it assumes that you have all of the blue dots before you calculate it. Oh, yeah. You can't calculate it without... Yeah. You can't. You can't. You can go to a chart like this at the end of an analysis once you have everything. You can't use it... You could generate this runtime as well, but you wouldn't have much data.
Starting point is 00:45:36 In other words, it wouldn't tell you much. Yeah, it would stop like up here. Yeah. And you can do it. And again, you can overlay the CDFs very clearly. You can't really do that with the histograms. It just looks like a big mess of dots when you do that. But if you overlay them, you can kind of see how it changes in time.
Starting point is 00:45:55 And that's certainly a way that I have analyzed it before. It just doesn't tell you that much. The one that really tells you is this view. And again, taking this runtime view and doing further analysis on the runtime itself, what's happening in those windows, I think it's going to tell you a huge amount more about what's going on in the system. And again, you've got to understand it is a system at the end of the day, because we still have CentOS, we still have FIO. We try to eliminate it all, but the fact is there's a suspicion
Starting point is 00:46:29 that the horizontal at the seven nines that I was showing before might be a systemic anomaly, not a drive anomaly, which then reflects how you're testing. You're going to have a big shift in your seven nines and above if you test with, if that's the case, if it's a systemic anomaly. Better minds than I are working on that. In the back.
Starting point is 00:46:55 Did you think this analysis would work, or how it would look, on a larger scale, if you tried it on a drive array, or especially on a network storage device? Oh, I think it would be kind of interesting, because again, I was listening to some of the earlier talks, and I go back to the CDF view. The CDF view is the great equalizer. It doesn't matter how many samples you have,
Starting point is 00:47:18 that CDF view is still the CDF view. It's just where it's going to end at the bottom. You can continue that down and keep going. So a runtime analysis might get really muddy, so you see a lot of stuff going on, and you might want to expand your view window or something like that. But in the CDF plot, it's a great equalizer. It's kind of cool. CDF's a great view.
Starting point is 00:47:42 And again, the same thing happens with the histograms, is it gets kind of muddy because everything starts to climb upwards. So an ideal CDF view would be a straight line, either across or down. Straight down, yeah. It would just drop. Yeah, and basically your QoS level would be at a single point, or pretty close to a single point in time. Follow the curve.
Starting point is 00:48:01 300 microseconds, and everything comes in at 300 microseconds straight down to the floor. We've looked at it. It actually tells you some things because you see the transitions paralleling each other. So you can see things like that in the data. Sure. I would think you could, yeah. Again, a lot of it is what you can pull out of the data. So again, by going to this view of QoS,
Starting point is 00:48:43 where I give you all your stats, I give you all your nines, I give you your sample count and fries and everything you want with it. And then I go, okay, so let's look and see what this actually is doing. So it gives you, and I think your customers will benefit from seeing these different views to say, you know, my five nines might not look that good, but it's being driven by this periodic behavior, which we have traced back to journaling or something like that. And we journal and it's slow, so we'll be able to have an answer for them. And then they start to get okay with it
Starting point is 00:49:15 if they understand it's protecting their data or something like that, right? And the nice thing is, again, from the CDF view, you can see them all on the same page. So you could actually see the array view and then the individual drives, and then you could see those contributing factors. Alright. Is that it? Question way
Starting point is 00:49:36 Really, is there any way that you can... or if I were to run this, if I want to have this data, and it's based in a normal running system, I can't really have millions and millions of data points.
Starting point is 00:49:54 Is there some good theory behind condensing this data to something manageable that is still useful? Well, if you go back to the histogram view, that might help you to go into that. So you basically have buckets, and you just keep adding to the bucket. So it's not quite as... There's nothing beyond just predefined buckets that I can just go up and...
Starting point is 00:50:17 Or you can do it on an interval. So in other words, you could be sampling every million IOs, right? And then you're saving it off. A million IOs, save it off. I mean, I don't know if that's practical in your environment. I mean, it could be like a billion IOs or something. I don't know.
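A minimal sketch of the predefined-bucket idea: fixed, log-spaced latency bins that a running system keeps incrementing, with the nines estimated from the counts afterward. The bin edges and helper names are arbitrary examples.

```python
import numpy as np

# Fixed, log-spaced latency buckets (microseconds) decided up front, so a
# running system only keeps ~200 counters instead of millions of raw samples.
EDGES = np.logspace(0, 6, num=201)   # 1 us .. 1 s
counts = np.zeros(EDGES.size - 1, dtype=np.int64)

def record(latency_us):
    """Add one completion to its bucket (clamped into range)."""
    i = np.searchsorted(EDGES, latency_us, side="right") - 1
    counts[min(max(i, 0), counts.size - 1)] += 1

def percentile_from_buckets(q):
    """Upper edge of the bucket containing the q-th percentile."""
    total = counts.sum()
    target = np.ceil(total * q / 100.0)
    cum = np.cumsum(counts)
    return EDGES[np.searchsorted(cum, target) + 1]

# Feed it a synthetic stream, then read back a few levels of nines.
rng = np.random.default_rng(4)
for v in rng.lognormal(4.6, 0.3, size=100_000):
    record(v)
for q in (50, 99.9, 99.99):
    print(q, "->", percentile_from_buckets(q), "us (bucket upper edge)")
```

The trade-off is resolution: you only ever get a bucket edge back, not an exact latency, which is the same compromise the histogram view makes.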
Starting point is 00:50:29 But again, any time that you close the sampling window, you also affect the test. So you've got to be kind of careful about it. So I generally like to take all my data at once and then just break it up if I need to. Okay. Thank you, everybody. I think that puts us in time.
Starting point is 00:50:47 If you want to talk to me more, come on up. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
