Storage Developer Conference - #43: SNIA Tutorial: Your Cache is Overdue a Revolution: MRCs for Cache Performance and Isolation
Episode Date: May 2, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 43.
Today we hear from Irfan Ahmad, CTO of CloudPhysics, as he presents a SNIA tutorial,
"Your Cache Is Overdue a Revolution: MRCs for Cache Performance
and Isolation," from the 2016 Storage Developer Conference.
Hi everybody, my name is Irfan Ahmad.
Thank you for taking the time to attend this SNIA tutorial today.
I'm going to be talking about some interesting new developments in
the world of caching. So caches are obviously main components in any kind of distributed
system, even inside the distributed system of a single microprocessor today. The only
way you can make distributed systems work is by doing caching and all different
forms of it.
It turns out that some work done about 40 years ago in the area of cache analysis has
found a resurgence in the last five years, in both the academic community and
industrial research.
And there's some interesting lessons from that that we can draw.
And SNIA was gracious enough to accept my proposal
to have me talk to you guys today about some of those developments.
Okay, so pass through this stuff.
So the context for caching for me today,
just to exemplify and to make things concrete,
will be something like what I'm showing up here.
So some sort of a database system, perhaps, or a key-value store or container,
some applications running inside maybe a virtual machine or some other distributed platform.
So that's a system context.
Obviously running on servers with different layers of hardware,
different tiers of hardware.
And perhaps in some cases,
even an elastic cache out in the network.
Okay?
There's different places these caches exist.
So that's the first context for me.
I'm calling that cloud-scale data caching.
But of course, that's not the only place.
The same techniques can be applied in a single system.
An example of that might be an app running on your phone. Interesting things are happening these days
with, for example, recent iPhones shipping with NVMe solid-state storage, right? So,
in a way, the iPhone was almost the first adopter of NVMe in the industry.
So interesting things are happening in terms of trying to look at caching.
So I'm going to set up
kind of an aggressive goal.
I'm going to say, look,
we're going to use the new research
and the new results in the last few years
and we're going to try to provide
self-service on sizing, tuning,
and troubleshooting to end customers.
And then actually improve the hardware efficiency of caching, okay?
So, how much hardware you need of the fast medium, DRAM, flash, whatever it is, to be able to achieve
your performance, right? We're going to try to improve that by 50 percent. That's the sort
of stuff, that's the level of performance improvement that, you know, gets attention.
And then the next goal we're going to set up is that, well, using the latest research, we don't want to just improve performance, but we want to use that to create knobs.
We want to give people guarantees.
If there's a client somewhere that specifies using an API or in a UI saying, I want five milliseconds of latency, right, we want to be able to use this new research to actually provide that. So some of this will start to become clear. But these are
aggressive goals that I'm setting out and I think the most recent work is
proving that this might be possible. So let's first start out.
Today's cache systems provide some sort of a dashboard like this. We're all used to it, right?
Hit ratio. That's all they give you. Of course, the current size, but
usually that's pretty static. So, hit ratio. Okay, so I'm getting a hit ratio of
43%. Is that good or bad? I don't know. Maybe it's good, maybe it's bad. No? Okay, well,
usually when somebody gives you a number, all of a sudden you're like, okay,
I'm motivated. I've got a number now. I'm going to improve it.
Well, can this be improved?
Is this the best possible?
Or if I tune in or if I change something, can it be improved?
Like this dumb dashboard is not going to tell me.
And there's nothing in the system today that is going to tell me.
So we're ill-equipped to be able to answer the most basic questions you can ask as soon as you put a metric up, right? So imagine CPU utilization. If somebody says, my CPU utilization is a hundred percent, and then you come in and say,
can it be improved? It's like, okay, you're at 100%, we can't get any more out of that CPU. Because you
know what to do with that number. Can it be optimized? Now you have a number, you can go
and tweak, you can change all kinds of parameters in the workload, and maybe you can
improve it. But this hit ratio is a number nobody knows what to do with. Okay, so let me get a little bit more interesting here. So let's say that is the
number you put up. So we're talking about, for those of you who just joined, a
cache dashboard, okay, and the numbers it provides. Today, a cache, some sort of a
performance monitoring dashboard, only gives you a hit ratio. So let me ask a second question. What
happens if I add cache? Okay, so I've got 128 gigabytes and I want to go double that, because
my performance is not good, my application-level performance is not good.
Well, will that improve the hit ratio, or will it stay at 43? Maybe it'll drop. I don't know. Do we know?
We don't know, right? What if I remove cache, because I want to give it to somebody else, and I want to go to
64 gigabytes? Will the hit ratio go to half?
Or will it go to zero?
Or will it stay the same?
No clue, right?
What if I add workloads?
I'm going to keep everything else the same.
I'm going to add more VMs or add more apps or add more containers or add more databases.
Don't know what's going to happen.
Simple question.
Is thrashing actually happening in this cache right now?
No clue.
Don't know.
Right?
Well, I mean, forget this last question. What if I change parameters? What if I change the
read-ahead length? What if I change some other parameter? Will it improve
this performance or decrease it? No clue. So this is the state of the art in
almost every implementation of a cache today. So, in terms of modeling it, you
know, because that's where you start, you always think, okay, can I model this, right?
And maybe I can get better if I know how to model this.
So in terms of modeling, that would be like having all of this real estate in this graph and only being able to plot one point.
What is that point?
That point is equivalent to that 43% in the other chart.
So that's a state of the art.
Okay?
We don't know what happens if we change cache size
in terms of the performance.
Now, the performance I'm plotting
is miss ratio here,
which is kind of customary in literature.
So lower is better.
I'm just going to keep reminding you
that whenever you see a plot like this,
lower is better.
Kind of like latency.
Lower is better.
Okay.
So that's a state of the art today.
But what if, what if we could
draw out a curve, okay? And the curve is varying the cache size, and as you vary the cache size,
an oracle magically tells you, ah, here's the curve, here's what the response
of the system will be,
of that workload, to more or less cache.
So in this particular example, this is a funny example,
it's actually a real workload
from a customer live production deployment.
So let's say that you're operating
at, say, 10 gigabytes of cache, right?
So now we go back to the other questions
we asked at the beginning.
What if I double the cache size?
Nothing happens.
What if I double again? Nothing happens. What if I double again? So you would have gone
through three doublings now. Most people give up at that point if you were doing this in
real production, because you're doubling, right? It's like, okay, three doublings, I give up. But right
when you give up is when you were about to see performance improvements. Remember,
lower is better, right? Fewer misses. Why? Because there's some working set,
and I don't know this workload enough to know what exactly,
but I can tell you the working set is about 80 gigabytes
for the first working set,
and the second working set goes out to be about 110 maybe,
and the third one at about 140,
and then everything fits after that.
So this is what we want.
If we could have this curve for every workload in the world,
we'd be able to answer almost all the questions that we asked before.
So this is known as a miss ratio curve.
So when I say MRC, please translate that to be a miss ratio curve.
And sometimes you'll see people draw this the opposite way,
which is a hit ratio.
Higher is better, but we're not going to do that.
So the miss ratio curve.
So this is performance as a function of the cache size,
and you get to see the working set needs.
This is what we want because we can use this to inform allocation policies.
Now, in previous tutorials that I've given,
happy to send you references to those.
Even at SNIA, I've talked about how this curve is computed.
Today I'm not going to go through the details.
Suffice it to say, now there are multiple systems out there, including some
open source implementations of how to do this
with various levels of efficiencies.
For example, the company that I work for has
one that we license out on how to
compute this curve. So I'm not going to go through that. There's plenty
of material
out there. I will just say that the way
that you compute that curve is by creating
something called a reuse distance.
So when you see a block A, you measure how many unique blocks were seen between the first reference
of A to the next reference of A. Okay, so if there's 100 blocks that are unique in that
time period between the first reference of A and the second reference of A, A being the
block number, that's called the reuse distance. What you do is you measure the reuse distance for every block in the system, and from that you can draw this curve.
That's all I'll say about that; it's not hard, there's nothing too complicated about it.
Doing it at scale and efficiently is hard, but there are multiple ways
of doing even that today. A minimal sketch of the basic idea follows.
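To make that construction concrete, here is a minimal, unoptimized sketch in Python. This is purely my own illustration, not one of the efficient production implementations the talk refers to: it keeps a full LRU stack, so it runs in the classic O(N*M) time that the newer work avoids.

from collections import defaultdict

def mrc_from_trace(trace):
    # Naive miss ratio curve for an LRU cache, built from reuse distances.
    # trace: iterable of block keys (LBAs, rows, any opaque IDs).
    # Returns {cache_size_in_blocks: miss_ratio}.
    stack = []                       # LRU stack, most recently used at the end
    seen = set()
    reuse_hist = defaultdict(int)    # reuse distance -> count
    cold_misses = 0
    total = 0

    for key in trace:
        total += 1
        if key in seen:
            # Reuse distance = number of distinct blocks referenced since
            # this block's previous reference (its depth in the LRU stack).
            depth = len(stack) - 1 - stack.index(key)
            reuse_hist[depth] += 1
            stack.remove(key)
        else:
            cold_misses += 1         # a first reference can never hit
            seen.add(key)
        stack.append(key)

    # An access with reuse distance d hits iff the cache holds more than d blocks.
    total_reuses = total - cold_misses
    mrc, within = {}, 0
    for size in range(1, len(seen) + 2):
        within += reuse_hist.get(size - 1, 0)
        mrc[size] = (cold_misses + (total_reuses - within)) / total
    return mrc

# Tiny example: a cyclic pattern over 3 blocks thrashes until all 3 fit.
print(mrc_from_trace(["A", "B", "C", "A", "B", "C", "A"]))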
Okay, so this goes back, like I said, about 40 years. Before the invention of the first algorithm to try to do this, you needed to do a separate
cache simulation per
cache size. That's very expensive.
And in fact, in some
of our experiments to do the original
algorithm for this, to run a
day's worth of trace could take a day's worth of
computation.
And that's running on a fast processor.
It's simply too much. It could take hundreds
of gigabytes just for one single workload.
Now, I'm going to pass through this very quickly,
but there has been research sporadically in this area,
a lot of times for CPUs and processors,
but not a lot of improvement.
So you went from order M space and order N-times-M time,
out to a little bit of improvement here, and then you get
into the last five years, and all of a sudden there's a tremendous amount of
work that's gone on. So I provide these as references that you guys can
take a look at, or I'm happy to talk about them another time. But we're now at a
time where we can do this in constant space. That means we can actually compute
those miss ratio curves going from
linear space
to constant space, which is absolutely phenomenal.
And then this order N is the time, but note that N
is the number of blocks
that were accessed during that time. So on a
per-block basis, it's the first
algorithm that does it in constant time.
The N divides out here because
N is the number of blocks, so per IO, it's
constant time. So the second time
this was done, you know, a couple of years after
this original work, there's a different technique now
also available called AET. Very interesting
algorithm. Only works for LRU. This technique
works for LRU and ARC and other
algorithms. And so now we're
at a state where we, as an industry,
have just got to start doing this, because there are two
or three ways of computing these miss ratio curves that are available to you in different forms, right?
So today I'm going to focus more on what you do with these curves if you can get them, because
I think it's pretty well known in academia now that you can get them pretty cheaply. So again,
miss ratio curves, lower is better. So these are just a bunch of different production
SCSI workloads that I had access to. I ran those traces, collected over one or
two weeks, ran them through this analysis to see. Now the reason why I'm showing
this slide is look at the variety. So for example this T08 trace goes out to be
about 700 gigabytes, I think one or two weeks' worth of a workload. And as you can
see, the majority of it, I mean, if you had a 600-gigabyte cache it still wouldn't fit
this working set, and you add one more gig and all of a sudden everything fits,
right? So now you have that type of a curve, and you have a complete opposite curve
where all the benefit is seen in the first 10 gigabytes, and after that you're
just getting very little additional marginal return, so why would you give it more? So now I'm trying to motivate the first use case for
this. If you had this for every workload in your storage array or your software-defined storage or
whatever, wherever there was a cache, you could take these curves and do some simple math and say,
well, how much cache should I give to each client to drive global optimization? Okay, first trivial
use case that comes out of this.
And all we did is use traditional cache partitioning
to be able to simulate that.
Traditional cache partitioning. Give each
workload some defined amount
but you vary that amount based on this curve.
So you give that guy, let's say you have a terabyte
of cache, you give that guy 600 gigs
and that guy only 10 gigs.
And they'll both achieve their best.
If you had a terabyte and you gave this guy 500 and that guy 500,
guess what you're doing?
This guy ain't going to get any benefit, and he's wasting space.
And so this would be, at any point outside of this range, he's thrashing.
Each one of these boxes is, you know, for a certain cache? A single workload?
Single cache, single workload.
So what you're doing is you're saying, let's say this is a VM, right?
Let's say that's a database VM and that's a virtual desktop or something like that.
You don't know that.
That storage array doesn't know that.
But it knows this curve if it were to integrate it into its cache controller.
And so without knowing the workload or anything about the workload from a subjective point of view,
it would just empirically know that, look, I know how to improve this workload,
and I know how to improve that workload.
Okay.
Now, there's lots of variety, right?
There are some, for example, there's always a few in these plots, that keep improving.
For every gigabyte of cache you give it, it continues to improve, right?
So now you can trade those off.
And the trade-off is very easy. Like, that's easy, right?
You just kind of say, I have one more gigabyte of cache to give, who do I give it to?
That's a naive algorithm; a sketch of it follows.
You can get better than that with simulated annealing,
trivial stuff.
An undergrad could write that code to do the comparison.
First thing you need is the MRCs.
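Here is what that naive marginal-benefit allocator might look like, as a sketch only; the function name and the toy curves are mine, not from the talk's slides. Note that a plain greedy pass like this stalls at flat, non-convex regions of the curve, which is exactly the gap the thrash-remediation trick later in the talk addresses.

def greedy_partition(mrcs, total_blocks, step=1):
    # Greedy cache partitioning driven by per-workload miss ratio curves.
    # mrcs[w][s] = predicted miss ratio of workload w with s blocks of cache.
    # Returns {workload: allocated blocks}.
    alloc = {w: 0 for w in mrcs}

    def gain(w):
        cur, nxt, curve = alloc[w], alloc[w] + step, mrcs[w]
        if nxt >= len(curve):
            return 0.0
        return curve[cur] - curve[nxt]   # misses avoided by one more chunk

    remaining = total_blocks
    while remaining >= step:
        best = max(mrcs, key=gain)
        if gain(best) <= 0:
            break                        # nobody benefits from more cache
        alloc[best] += step
        remaining -= step
    return alloc

# Toy curves: "db" keeps improving, "vdi" flattens out after two blocks.
print(greedy_partition({"db":  [1.0, 0.8, 0.6, 0.4, 0.2, 0.1],
                        "vdi": [1.0, 0.5, 0.4, 0.4, 0.4, 0.4]},
                       total_blocks=4))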
Yes?
I don't know if you're going to talk about it, but time
variation of the key sets. So for example, it may look like one piece in the morning and something else in the afternoon.
We find that if you are operating at about a day's granularity, they're very stable.
Different parts of the day change.
So you're assuming like long term?
So I'm not making that assumption.
I'm saying for this workload, you could have computed an MRC every hour. They're so cheap that you could have captured it every hour
and varied your caching every hour. I'll show you some results where we actually
did that work, and we showed that about an hour is kind of like a good sweet spot
where you can get a tremendous performance improvement or hardware
efficiency improvement by buying less DRAM or flash to get the same
performance benefit.
But it's very important. Time-varying behavior is actually
one area, I would say, that is still being researched.
I'm working with a couple of universities
on really narrowing down on the theory
about how time varying behavior can be modeled.
That's probably the only area here that's open.
And I'll give you some results about that as well.
So lots of different MRCs and just motivates us to think,
well, interesting, if I could have this
for all of the workloads running on my system, cool stuff I can do.
So now let's take a look.
What are some cool stuff we can do?
Okay.
So I should have said at the beginning that to really get the most out of this talk, you've
got to put in a little bit of kind of mental math effort here.
But the first mental math effort is, remember, you were seeing miss ratio curves before,
where lower is better.
And now what I just did is I pulled a trick on you and called this an average latency curve.
So I did some arithmetic to do that.
So let me tell you what arithmetic is. It's trivial.
It looks complicated, but it's trivial.
So I start with the miss ratio curve.
What does that mean?
A miss means that a certain percentage of IOs went to the back end
because they didn't hit in the cache, right?
So let's just say that the miss ratio is 50%.
That means 50% hits, 50% misses.
And let's say that the latency of a hit is 100 microseconds.
And latency of a miss is 1 millisecond, right?
If 50% of your IOs are hitting and 50% of your
IOs are missing, you just average the two numbers and you have the average latency.
Makes sense, because half of them are hitting and half of them are missing, yes? So
that's all I did here, because for every cache size I know what percentage
will hit or miss, and when they miss, I know what the latency
is. I do the trivial miss ratio times miss cost plus hit ratio times hit cost, and I get this number.
And then I get a number for every one of the spots that I have on this curve, and I draw it out.
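In code, the arithmetic is one line per point. A sketch, using the example numbers from the talk (100 microseconds for a hit, 1 millisecond for a miss); those defaults are just the talk's illustration, not fixed constants.

def latency_curve(mrc, hit_us=100.0, miss_us=1000.0):
    # Convert a miss ratio curve into a predicted average-latency curve.
    # mrc: {cache_size: miss_ratio in [0, 1]}; returns {cache_size: microseconds}.
    # average = miss_ratio * miss_cost + (1 - miss_ratio) * hit_cost
    return {size: m * miss_us + (1.0 - m) * hit_us for size, m in mrc.items()}

# 50% misses at 1 ms and 50% hits at 100 us -> 550 us, as in the example.
print(latency_curve({10: 0.5})[10])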
So for the very first time, at least to my knowledge, in real deployable systems,
you can get a prediction of what the latency of the VM would be
or the app or the database would be
as you vary the cache size.
I spent my entire career looking for this.
We just figured this out two or three years ago.
But my entire career,
almost in 10 plus years of storage,
has been in figuring out
how to give people the guaranteed latency.
I hadn't been able to figure out how to do that.
And this is the first time we figured it out.
So, first thing you do is you take your miss ratio curve and you convert it into an average latency curve,
varied over cache size.
And you get something that shape-wise looks very similar to the miss ratio curve.
And then what you do is you say, dear client, what is it that your target is?
And then you kind of just read that target down and across and say, oh, 16 gigabytes.
Now, I'm going to probably give them 18 or 20 just to kind of,
if there's any variations that occur shifting during the course of the day or the week,
I'm going to give them a little more.
But fairly simple computation, arithmetic really, once you have the original curve.
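Reading the target "down and across" is then just a lookup plus a bit of headroom. A sketch: the 15% pad is my stand-in for the "give them 18 or 20 instead of 16" padding he describes, and the 95th-percentile variant discussed a bit later is the same lookup run against a curve built with the back end's p95 miss latency instead of its average.

def size_for_target(lat_curve, target_us, headroom=1.15):
    # Smallest cache size whose predicted latency meets the client's target,
    # padded a little for drift over the course of a day or week.
    sizes = [s for s, lat in sorted(lat_curve.items()) if lat <= target_us]
    return sizes[0] * headroom if sizes else None   # None: target not achievable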
And now we can solve something, again, very, very challenging, never been solved before.
So I'll take a pause here, make sure everybody's
absorbing this.
Yes?
So I've got a question, because this looks
nice, but what about a situation where you've got
like simultaneous IOs, and for example, one IO is a hit,
another IO is missed, and we still need to wait for the
completion of both of them.
Do you somehow include that in your computation?
No, because I'm just operating at the average.
I'm saying, like, basically, think of it this way.
Let's say you had all, let's say, disk subsystems,
spinning rust, okay?
And then you replaced the same application, nothing changes,
and you replace it with a faster thing.
But I'm not going to tell you what the faster thing is.
It's just average latency, let's say, half or a quarter
of what it used to be.
The performance of the application will improve
because on average
things are taking a quarter as much
as long as before.
And as much as the bottleneck is the storage,
the performance will improve. Yes,
some IOs will be
very quick, some of them will not be.
Even in a single storage, single
hard drive, if you run, let's say
something like Iometer,
there are some IOs that
even for a single drive, will
come back in 100 microseconds.
Single drive, like a
7200 or whatever,
5400 RPM drive.
Because even the hard drive has a little buffer, and
occasionally you get a hit in that little tiny buffer.
And if you plot every single latency, like let's say you have a million IOs that you do, and you plot every single one of them,
I encourage you to do that.
It's very enlightening to see what the variance is.
Once you see every single IO on a plot, you realize a single drive can also do very fast, and it could also be one second latency.
All of that in a minute, and you don't even realize it because you only see the average.
Same thing is going to happen here. It's just that the average will drop as you increase the cache size.
Yes? So the relationship between a cache hit and a miss is a constant, right? One minus the miss ratio is the hit ratio. So isn't the shape of the curve going to be exactly the same?
It is, for the average, by a constant. It is for an average. But some people don't want to provision for the average.
Let's say you've got a client who comes up and says, look, average is nothing, I want the 95th percentile. So then what you do is you
measure the back end's latency for the misses, but you don't just measure the average; in addition to the
average, you measure the 95th percentile latency of the back end. Then you plug that number in, and
what will happen is this curve will expand. It'll keep the same shape, but it will stretch out. And then
all of a sudden, to get the 95th percentile, you're not paying 16 gigabytes, you're paying like 100 gigabytes.
Okay, but the fact that you could even do that is amazing. When's the last time we could tell
somebody, if you want the 95th percentile, I can give it to you, just pay me this much? Now we can do
that, right? So now, again, this is an amazing use case. Imagine a network elastic cache
in like a cloud service provider.
If you can have this curve,
you can predictively allocate
DRAM, right, in a
Redis somewhere.
Forget like SCSI disk.
Just think Redis. If you know
this curve for your database, you
can allocate a Redis or Memcached
of the right size.
We don't need SCSI for MRCs; MRCs are just about references. They don't care if it's
SCSI or a memcache underneath. Okay, so that's the first use case we did.
Extremely powerful, and within reach. Okay, within reach of anybody in the industry. Anyone who wants to
have this feature today, you can do it,
as long as you compute the miss ratio curves.
And there are three different ways of doing those today.
Okay.
One out of a university,
and two of them commercial systems
that are supported and you can license.
Okay.
So now, okay,
we're going to switch topics a little bit.
Now we're going to go into tiering,
because a lot of systems do some type of tiering.
Even if you're all flash, there's a little bit of tiering going on
because you've got some really fast flash
that can handle writes better, let's say,
and then you've got the slower flash for capacity,
and that's where dedupe is happening, for example, right?
Now, this time I'm back to a miss ratio curve, right?
So this is not latency.
It's miss ratio, so lower is better.
Now what we're saying is, well, we've got tiers, right? We've got local, and we've got kind of remote, and really far remote, maybe the
furthest remote is out in the cloud somewhere. But if we develop our cache system appropriately, now
we know exactly how much to size every tier for every workload to be able to achieve the
performance requirements of that workload. Okay, the way to do that is, every tier has a different miss latency,
or miss penalty, yeah? So what you do is you iteratively figure out how much to
give to each tier, and then the latency curve that you derive will be a
piecewise curve, because during the middle of this curve, you're changing the miss penalty.
So the miss penalty here is very low, here is a little higher, and here might be very high.
So now piecewise, that curve, and when you convert it into latency, will shift, and then now you can figure out how to draw this curve iteratively.
So without providing a lot of detail,
I think the higher level point is
you can do tiering and now predictively
you can figure out how much to give to each tier
either for overall efficiency
for all workloads,
driving overall throughput,
or individual SLAs and SLOs.
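One way to sketch the piecewise arithmetic: assume the tiers together behave like a single cache of the cumulative size (that is an assumption about the design, not a given), so the fraction of requests absorbed by each tier is the drop in the miss ratio between cumulative sizes, and each fraction gets that tier's latency. The iterative search over tier splits would sit on top of a function like this.

def tiered_latency(mrc, tier_sizes, tier_lat_us, backend_lat_us):
    # Predicted average latency for a stack of tiers, fastest tier first.
    # mrc(s): miss ratio of the workload with s blocks of total fast space.
    avg, falling_through, cumulative = 0.0, 1.0, 0
    for size, lat in zip(tier_sizes, tier_lat_us):
        cumulative += size
        served_here = falling_through - mrc(cumulative)
        avg += served_here * lat
        falling_through = mrc(cumulative)
    return avg + falling_through * backend_lat_us    # misses hit the back end

# Toy MRC: miss ratio falls linearly until everything fits at 100 blocks.
toy_mrc = lambda s: max(0.0, 1.0 - s / 100.0)
print(tiered_latency(toy_mrc, [20, 40], [100.0, 500.0], 5000.0))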
Yes?
I mean, to basically do the calculation on that, we already have a cache there.
Okay.
Correct me if I'm wrong.
Maybe I'm saying it wrong.
No, you don't need a cache.
You just need a...
No, no.
You're saying that, okay, I got X misses and Y hits.
Okay?
Yes.
And based on that, you're drawing your chart.
No, I'm not.
See, this chart can be drawn without actually having a cache
because all you do,
and this was a part of the previous year's tutorial,
so I didn't spend a lot of time on it,
but what you do is you say an I.O. is coming in.
At some point in time, before any cache,
you just call a little function that says,
hey, please add to your miss ratio curve analysis data
structures the fact that this IO was accessed. Okay, so one meg buffer of
various types of data structures, and that's all you need, one meg. Okay, and
then the fact that this IO was handled or requested, or this row in a table or a
Redis key was accessed, is enough for these data structures
and this miss ratio curve algorithm,
the history of which goes back 40 years,
but it was not practical.
Very recently, it's been made practical in the last five years.
So now that you
tell that algorithm, hey,
somebody requested this block or this
ID or this opaque key,
it comes back in less than
60 nanoseconds and says, yeah, don't
worry, keep going, I got it.
That's enough.
You can draw this curve in a system where there is no cache at all, and you still get
the full curve.
The point is, if you were to, this dot is an example of something where it's a cache
stat.
Now, you cannot, in a production system, vary the actual cache size to get the full
curve. You can't do it. I mean, it's just nobody will let you,
right? Who's going to do that?
Who's going to say,
halve my cache, and just see what happens?
You get fired. So, instead, what you
do is you just install something
in a system, and like I said, it's
easily implementable, where without
actually even having a cache, or having the cache operate
here, you get the full curve.
And then you decide to reduce or increase because you can predictively know what the
performance implication of that is going to be.
Yes?
So you're binning it.
You're putting it in a bin and then statistically sampling it across the space.
Is that the way you say it?
The technique?
Is that the concept?
There are three different techniques.
They all do different types of sampling.
One of them does a velocity sampling, one of them does a spatial sampling, and the other one does a reuse distance sampling.
They're all different techniques. I don't want to go into them right now, just
because there actually is math involved there. But I will say that
once you read the papers, you're like, huh, why didn't I think of that?
Okay. But it is a sort of sampling.
The point is, it's very accurate.
And if I was to
just to emphasize
there's actually two curves
on each of these
multiple plots.
You cannot see two curves.
The reason why you cannot see two curves
one is the actual MRC computed the 1970 way
takes hours and days
to run, and the other one is
seconds of CPU time for these very, very long
traces. The reason why you cannot see two curves
is that's how accurate the systems are.
They're sitting right on top of each
other. So the sampled
approximations are very accurate.
So I'll talk about two use cases. Let's move on.
The third one I think is actually pretty important, and most
people kind of don't think of it at first,
is self-service monitoring.
Like imagine putting this in your storage array,
and for your customers to be able to actually see what's going on,
like, you know, they'll buy more cache
once they realize how much cache pressure they actually have.
So, you know, just imagine this is a mock-up that I did,
and I'm not a UI designer, so it probably doesn't look that good,
but the idea is, again, cache size, hit ratio.
Okay, so this is the only plot that I'm doing with hit ratio
because I've tried saying miss ratio to customers,
and it's like, oh, they don't think in terms of misses.
Well, most of them are not engineers,
so they think higher is better.
So if it's up and to the right, it's better.
If it's down and to the right, it's bad.
So just to show customers, I've drawn this as a hit ratio.
So if you ever put this in a real UI in your storage array
or in your software-defined storage
or in your whatever memory-based storage,
please do it this way.
I've learned the hard way.
Okay.
But imagine giving this to all of your clients.
Like, imagine having this
and your cloud service provider.
When you log into your dashboard
for a cloud service provider,
he gives you this curve.
That'd be pretty awesome
because now you can actually size.
So that's the idea here.
Let's not forget the simple stuff, right?
Simple and easy.
Just measure and plot.
Don't even do anything with it, right?
Very valuable.
Okay.
So,
this one again,
we're going to have to go
put our thinking hats on. This is
the tutorial part where I've got to teach you something. So something very interesting
has happened recently. So remember I showed you a bunch of curves that had
these flat regions, and I said, ah, that's when thrashing is going on? Remember that?
So the reason why it's thrashing, obviously, is you're adding more cache and the
latency is not improving.
But, interestingly, what's happening underneath is that the workload doesn't reuse the blocks that it asks for.
It asks for A and then it doesn't ask for A for a really, really, really long time.
By the time it asks for A, A is no longer in the cache. It's been evicted from the cache.
So that's why you get this flat region. Okay, and then where you stop getting the flat region is when those blocks that were asked
for back here start getting reused. So that's why it's a reuse distance, right? Okay, so now,
the workloads that have flatness are the worst workloads in the world
for a cache, right? It's just awful, because no matter what the cache does, it
just cannot fix those workloads because
it's not the cache's fault.
It's not the algorithm's fault. The workload
is not cooperating.
Okay.
The interesting thing is, recently we figured
out,
some work out of MIT has figured out,
that if, big if,
if you can draw this curve a
priori for a workload,
okay, you can tweak
your cache algorithm online,
on the fly, to actually draw
the, what's called the convex
hull
against this miss ratio curve, right?
So now, this is thrash
remediation like we've never
seen it before, because now you take the worst workloads and you make them your slave.
Give me your worst one, I can fix it.
Because the worst ones, always remember, have this flat region.
And the convex hull means that you take the lowest non-convex parts
and you connect them, and that's a convex hull.
So now there's an algorithm, but it requires having the miss ratio curve. Again, this is something that
everybody, all of us can do
in our production systems
fairly easily if we just add
a misratio curve.
A misratio curve computation
capability. If you have this,
this becomes within reach. Now,
I'm trying to debate whether I tell you how this is
done. I'll do it. It's a tutorial.
We'll do it.
Okay.
So here's how it's done.
Thinking caps on.
Okay.
So imagine that I have this miss ratio curve.
Okay.
And what I do is I say, okay, randomly I'm going to take half the blocks.
Randomly.
It has to be random.
Okay.
And I'm going to put them into partition one.
I'm going to call this workload partition number one
because it's just random.
Take half the LBAs,
and I assign them to workload partition one
and half of them to workload partition two.
Randomly, half and half, okay?
Then I come in, and I say I have my cache,
and I'm going to allocate half the cache
to workload partition one and half the cache to workload partition two. Easy so far, right? Half and half. Will there be a
performance change? No, because I randomly picked the blocks, and both workload partition one and
workload partition two will have the same miss ratio curve, because I picked them randomly. Yes?
Okay, but now I make this observation. I say, okay, suppose that I'm currently
operating at a 10-gig cache. So now what I'm going to do is, instead of making
workload partition 1 and workload partition 2 equal size, I'm going to
make workload partition 1 be 10% of the logical block addresses and workload
partition 2 be 90%.
Now, I did something very unfair,
because workload partition 1 is getting half of the cache,
but it's only 10% of the blocks.
And workload partition 2 is getting half of the cache,
but it's 90% of the blocks.
Everybody with me so far?
So what did I just do?
Workload partition 1 moved this way.
No, actually, workload partition 1 was the smaller one, right? Okay, so workload partition 1 moved this way, okay? 10% of the blocks, 50% of the cache. So
it's going to move this way. Now, workload partition 2 moves this way. But what just
happened is that workload partition 2, moving
this way, experienced no loss of performance, because it was in a flat region to start with.
Workload partition 1, though, got over this hump. Okay. Now,
to truly get over the hump, I also then go and stop doing half and half in the cache size.
I make workload partition 1's cache occupancy
not be 50%, I make it 90%.
Okay, so now,
again, workload partition
one here goes all the way across
and ends up getting a huge
performance win, whereas workload
partition two never climbs
up there. So if I can do the algorithm right,
just using the basic intuition, and it's
pretty easy to do once you sit down and try
to do it, what happens is that
my average performance
between those two partitions
ends up somewhere along this diagonal.
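For reference, here is the arithmetic behind that walkthrough, as a sketch only; this is my restatement of the trick he attributes to the MIT work, not a drop-in implementation. The key fact is that a random p-fraction of the blocks has the same curve scaled down by p, so a partition holding fraction p of the blocks with p times s1 of the cache behaves like the whole workload at size s1; at total size s between two hull points, the blended miss ratio lands on the straight line between them.

import bisect

def lower_hull(mrc):
    # Lower convex hull of (size, miss_ratio) points; mrc[s] = miss ratio at s.
    hull = []
    for s, m in enumerate(mrc):
        hull.append((s, m))
        while len(hull) >= 3:
            (x1, y1), (x2, y2), (x3, y3) = hull[-3:]
            # Drop the middle point if it sits on or above the chord.
            if (y2 - y1) * (x3 - x1) >= (y3 - y1) * (x2 - x1):
                hull.pop(-2)
            else:
                break
    return hull

def remediation_split(mrc, s):
    # For total cache size s, return (fraction of blocks in partition 1,
    # cache for partition 1, cache for partition 2, expected miss ratio).
    s = min(s, len(mrc) - 1)
    hull = lower_hull(mrc)
    xs = [x for x, _ in hull]
    i = bisect.bisect_right(xs, s) - 1
    s1, m1 = hull[i]
    if s1 == s or i + 1 == len(hull):
        return 1.0, s, 0.0, mrc[s]          # already on the hull: no split needed
    s2, m2 = hull[i + 1]
    p = (s2 - s) / (s2 - s1)                # weight on the smaller hull point
    # Partition 1: p of the blocks, cache p*s1  -> behaves like the workload at s1.
    # Partition 2: 1-p of the blocks, cache (1-p)*s2 -> gets over the cliff at s2.
    return p, p * s1, (1 - p) * s2, p * m1 + (1 - p) * m2

# A cliff at size 3: the raw miss ratio at size 2 is 1.0, the split gets ~0.47.
print(remediation_split([1.0, 1.0, 1.0, 0.2, 0.1], 2))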
So this is like, I mean, you remember
the 90s and 2000s
and the first half of the 2010s,
probably more than 100 papers have been
published on better cache algorithms.
This thing beats all of them
with the most naive
cache algorithm because you're just cheating.
You're just cheating. You're like, huh.
Now, of course, it requires cache partitioning
support. Not every storage array has that, so you've got to
go and implement it. It requires MRC
computation. No one has that
in a production system
outside of some systems that I'm working on
with some vendors.
But this stuff is not hard.
It requires a little bit of testing,
but it's not that hard.
So again, this is a fundamentally game-changing
use case for miss ratio curves.
Now, basic stuff like I mentioned
you know, this is a trivial example of two workloads, single cache. But I'm
partitioning the cache, and I say, well, this workload, look, it benefits
immediately from a cache, but then it doesn't benefit from extra cache
that much, and that's why it has that shape. Well, this other one always keeps
benefiting. So I'm going to give partition zero to client zero, but a very
small partition,
just enough to fit his workload.
And then partition 1 is bigger
because he's going to keep benefiting.
Simple idea, right?
So then we actually tested it.
And we got an average of a 40%
improvement in cache efficiency,
or hardware efficiency.
What that means is
that for the same performance improvement
the performance
improvement you got from this algorithm of
partitioning was equivalent to buying 40%
extra hardware on average.
And the best performance
out of these workloads was 146%. So these are mixed workloads, the kind of workloads that a storage
array would see; they're not trivial single-VM workloads.
They're anywhere from
8 to 20 different workloads mixed
in.
Here's the distribution of that hardware efficiency
improvement. You can see most of the workloads
in our test of 27,
I think it was, different workloads
had 0 to 10%, but the
majority, the weight, is
20 to 50%, and there are some
that benefit more. Again, what did we do?
Look at it.
What did we do?
We just computed MRCs for each workload,
partitioned the cache,
and allocated dynamically.
This was done once every hour for those workloads.
40% improvement.
How often do you get that?
It's very hard to get that.
And I didn't even add the thrash remediation in that test.
We came across that later.
Cumulatively, once you combine those two,
I think it'll be even higher, definitely over my goal of 50% that I presented at the beginning of the presentation.
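Put together, the hourly control loop they describe is conceptually small. A sketch only: get_mrc and apply_partition are hypothetical hooks standing in for however your system produces per-workload curves and enforces partition sizes, and greedy_partition is the earlier sketch.

import time

def repartition_loop(workloads, total_blocks, get_mrc, apply_partition,
                     interval_s=3600):
    # Once an hour: refresh each workload's MRC and re-split the cache.
    while True:
        mrcs = {w: get_mrc(w) for w in workloads}       # hypothetical hook
        apply_partition(greedy_partition(mrcs, total_blocks))
        time.sleep(interval_s)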
One more thing on this.
I keep saying,
look, order one, order one, order one
for space, and then
order one per I-O, so constant
time per I-O for CPU
for these algorithms to compute the
miss ratio curve, right? Well, that got us thinking, oh, it's so cheap. And on average, every one
of those function calls to update the data structure is 16 nanoseconds on a pretty slow
processor. I've measured it like 39 on a faster processor. But those are even desktop class,
but I've not even measured it on a server class yet. So it might be in a 35-nanosecond
range per I-O
overhead. So that got us thinking, well,
maybe we should do more than one of these.
So I'm like, whoa, interesting. Okay, so
hold on. So what
you see here is multiple
miss ratio curves for a single workload.
Okay? So why are there multiple
miss ratio curves for a single workload?
Well, in this particular case,
we assumed a
different cache block size and said, suppose this workload was running at 4k
or 8k or 16k or 32k or 1 meg. I'm going to compute simultaneously, it's so cheap,
you can simultaneously compute multiple miss ratio curves for the same workload
and then you just, again, undergrad could write this and come in and say, hey,
for this workload, is a 4K block size better or a 16K cache block size better? Okay. We
have seen 50% performance improvement if you perfectly pick the right cache block size
for the given cache that you want to give that workload. In some cases, we've seen 100%,
but 50% is very
common to see. You can see that, right?
Different cache block sizes, you can
imagine, right? If your cache block size is too big, you're
pulling in too much data. It's going to slow you down.
If it's too small, if it's not
quite right, then
you're not getting other benefits
in terms of metadata.
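Since the per-reference cost is tens of nanoseconds, you can afford to run one tracker per candidate block size and simply compare the predictions at the capacity you actually have. A sketch of that comparison step, assuming each candidate's curve has already been converted to predicted latency, as in the earlier sketch:

def best_block_size(latency_by_block, cache_bytes):
    # latency_by_block: {block_size_bytes: {cache_size_in_blocks: avg_latency_us}}
    # Returns (block_size, predicted_latency_us) with the lowest prediction,
    # or None if no curve has a measured point that fits the cache.
    best = None
    for block, curve in latency_by_block.items():
        fits = [s for s in curve if s <= cache_bytes // block]
        if not fits:
            continue
        predicted = curve[max(fits)]          # largest point that still fits
        if best is None or predicted < best[1]:
            best = (block, predicted)
    return best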
This is different workloads.
So you see it's per workload, right?
So depending upon the workload that you have,
the ratio between the various cache block size miss ratio curves will change.
And so you can make this decision now
completely online, right?
So that's just one example,
cache block size.
There's other ones.
Like for, you know,
most systems have both a read and a write cache.
And it's always a question,
who should I give more to?
Read or write?
Now you can answer that,
because what do you do?
You compute a read miss ratio curve
and a write miss ratio curve,
compare them.
Replacement policies,
like if you have any other tuning parameters,
like a sub-block size,
or write-through versus write-back,
all of those now just mean
every time you have a hard decision,
and you think, huh, I'm
going to put a knob in my storage array because some
SE is going to figure out in the field
how to tune
it, good luck. It never happens.
Every time you make one of those decisions
and you put a knob in the system, don't put a knob, just
compute the miss ratio curve and do it automatically.
We have the capability as an industry
now. We should be able to do this.
So we think that this could add maybe in some cases 100% hardware efficiency gain,
possibly on top of the previous two that we set up.
So is it clear that all you need is just a bunch of miss ratio curves and you can just do the comparison?
I think it's fairly clear, right?
Okay.
So I mentioned earlier that LRU is not a requirement.
You can actually do the same sort of thing with ARC and ClockPro.
Those are the two other ones that we studied. We're studying some other ones now;
hopefully we'll publish some results for that in the next year.
But as a sample, right, both of these are possible.
The way we do the non-LRUs is using something called a scaled-down simulation,
which our paper talks about,
so I encourage you to take a look at that in case
your implementation in your storage
system is not an LRU. We can
handle that. It's just a slightly different
way of actually running those computations.
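My reading of "scaled-down simulation", sketched below as an assumption rather than the referenced paper's actual algorithm: spatially sample a small fraction R of the keys by hash, feed only those references to a miniature simulated cache (ARC, ClockPro, whatever the real policy is) sized at R times each full size you care about, and take the mini's miss ratio as the estimate.

import hashlib

def keep(key, rate):
    # Spatial sampling: keep a key iff its hash falls under the sampling rate.
    digest = int(hashlib.sha1(str(key).encode()).hexdigest(), 16)
    return (digest % 1_000_000) < rate * 1_000_000

def scaled_down_mrc(trace, full_sizes, simulate, rate=0.01):
    # simulate(keys, cache_size) -> miss ratio, e.g. a small ARC simulator
    # you already have; each full size s is simulated at size round(rate * s).
    mini = [k for k in trace if keep(k, rate)]
    return {s: simulate(mini, max(1, round(rate * s))) for s in full_sizes}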
So interestingly, by the way, I don't know if you've ever seen a miss ratio curve for ARC.
As far as we know, we're the first ones to demonstrate
because we can run so many different x-axis points,
because we can run them really, really efficiently,
that ARC has these points
where it's not monotonically decreasing.
So there are times where you add cache
and the performance gets worse for a little while.
We had never seen that graph before.
And because it's a very adaptive cache,
it makes mistakes sometimes.
On average, it does well,
but it does make mistakes.
The other thing we're seeing is, remember the thrash remediation
convex hull thing I talked about? In our
initial results, what we're seeing is that LRU
plus that thrash remediation
is often better than ARC.
And that ARC plus
that thrash remediation is not that much better
than LRU plus thrash remediation. Which is,
again, LRU; it's been, like, 1960
was the time that LRU was invented.
Still, we haven't managed to totally beat it.
Every time we think a new flashy algorithm is going to beat it, some adjustment to LRU
catches up, and I think in this case it'll be LRU plus thrash remediation.
Okay, so systems implementation, an example of that is a cloud physics implementation
which we license, only needs two function calls.
You don't have to necessarily change your cache algorithm.
Before the I.O. goes into the cache, just call process ref for every block or every
database row that's being accessed or whatever it is that you're trying to model in terms
of cache.
So you give it a data structure that's been initialized, about a megabyte, and then whatever
the key is, we don't care what it is.
It could be logical block address or some opaque ID.
We don't care.
It could be hash of the block ID.
We don't care.
It just has to be unique for it.
And then we don't treat it in any special way.
So you do that for every I.O.
On average, it takes about 50 nanoseconds.
We've seen 38 to 60 nanoseconds in our experiments.
So on average, it'll be about 50 nanoseconds.
Every time you call that, okay,
and then whenever you need to draw the miss ratio curve
or have a higher level user level program
that's going to go and do the repartitioning
and all that sort of stuff,
you call this function, get histo.
That's it. That's all you need.
The rest of it is all in here.
And what that is is in the paper, yeah?
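Shaped as code, the integration he is describing is roughly the following. The call names process_ref and get_histo come straight from the talk, but the handle object, the signatures, and the wrapper are my guesses for illustration, not the licensed library's actual API.

class MrcTracker:
    # Stand-in for the roughly one-megabyte MRC data structure from the talk.
    def process_ref(self, key):
        # Called once per IO with any unique opaque key (LBA, row ID, a hash
        # of the block...). Internally it samples; cost is tens of nanoseconds.
        ...
    def get_histo(self):
        # Called occasionally (minutes or hours apart) to pull the histogram
        # that the miss ratio curve is drawn from.
        ...

def cached_read(cache, backend, tracker, lba):
    tracker.process_ref(lba)      # 1) record the reference before the lookup,
                                  #    so the curve reflects the workload, not
                                  #    the cache you happen to have today
    data = cache.get(lba)         # 2) the normal cache path is unchanged
    if data is None:
        data = backend.read(lba)
        cache.put(lba, data)
    return data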
And you don't have to call that for every transaction.
You call it for every transaction.
The system underneath does nothing for most transactions because it's sampling.
But the way that it's sampling and adaptively sampling is what makes the magic work.
Okay?
So you should call it for everything and let us adaptively decide which of the transactions or
keys or blocks or rows that you're accessing that we're going to do a lot of work for and which ones
we're just going to sample out. Okay? Every time. Yes? Yes. You know, I mean, 50 nanoseconds is like
about one memory request, and it pipelines really well.
So it should not be...
Rob, you're the first one I've seen
complain about 50 nanoseconds as being an issue.
Come on, this is a storage stack. The storage stack
usually is like 10 microseconds of software.
I'm saying 50 nanos.
Come on.
Okay, so
for your reference,
when you look at the slides later, I talk
about a whole bunch of different use cases that I have not covered, that once
you have a miss ratio curve become enlightened, and where you can actually get
either better performance or better service-level assurances. So there are
some multi-tenant environments in the world that are very popular, like cloud
service providers.
I'm not going to go through this in detail, I only have seven minutes. But a lot of people
in the world are using some sort of global LRU. A global LRU is when you have multiple
clients and they hit the same LRU. Now what happens there is like pigs at a trough. The
bigger one just gets bigger. So if you have a
global LRU and you've got a workload that's really intense, what's it going to do?
It's going to push everybody else's workload out. But it's priority-naive, and
using a naive system it's very hard to even know where the pressure is
and what to do about it, how much to mitigate the pressure. It's just very
hard to know that, right?
So this is like getting an x-ray of your trough or the pigs at the trough and trying to figure out, well, how are you going to manage this, right?
What are you going to do about it?
So anytime you have multi-tenancy, right, this just accentuates the need
for being able to have a data-driven way of online managing the ability of your clients
to get access
to the performance resources that you've invested in.
So I won't read through it, but I think by now, hopefully, that's all in context.
So what I want to do is I want to do a step-by-step recap.
So to be able to use this technology, the first step is to calculate the miss ratio curve.
So pictorially, I'm drawing that out with some steps
that once you see the previous year's tutorial,
you'll know what they are.
But again, the point is that they're trivial, really,
once you see them.
It took us years to build it,
but now once you look back,
you're like, wow, why didn't I think of that in the last 20 years?
So first step, miss ratio curve.
So trivial integration with your current cache code base.
A couple of functions: one per IO, very fast, and whenever you need to put that in a UI or
run a higher-level process, you just pull the memory out, compare everybody, make a
decision, and then come back, you know, minutes or hours later to finish it
off. Second thing you can do is display the cache utility curve for DevOps
people, for people
trying to do optimizations in your system, for SEs, for customers to be able to self-manage.
The third thing is auto-tune cache parameters.
Let's get over those days where we ship 50 different parameters and nobody ever touches
them.
So as engineers, we feel real good.
It's like, oh, look, we got so many knobs.
And in our labs, we can show 50%, 100% performance improvements.
Well, enough is enough.
Let's just not do those knobs or do them, but have them automatically set.
Because now we can actually accurately predict on a per workload or per lawn or per whatever basis
what the performance difference will be for every one of our knobs in that cache.
Step number four, thrash remediation.
We talked about this in a lot of detail.
Once you have the MRC, this starts to become easy,
and you can tame the worst workloads.
In fact, the worse the workload, the more awesome this looks
because that's exactly what it's designed to address.
So a very cache-friendly workload will get no benefit from this.
An unfriendly workload could get 100% performance improvement from this.
Next is latency guarantees. So now we've got this really nice curve, okay? But that really
nice curve hides in it the ability to say, I can control how much cache I give this workload.
If I can control that, I can predict the performance improvement they're going to get, and I can
predict the latency. Now, the same thing can be done for IOPS. I care less about IOPS as
a human being than I care about latency, but that's just me. But if you wanted to do IOPS, it's fine.
I mean, it's just a slightly different curve.
You know how I showed you how to compute using the miss and the hit ratio and the hit and miss latency?
The same thing can be done for IOPS, and you can get the curve for IOPS,
and then it'll just be the opposite curve.
It'll be higher is better.
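One crude way to sketch that IOPS curve, and this is my shorthand rather than anything on the slides: assume a roughly fixed number of IOs in flight and apply Little's law to the predicted average latency.

def iops_curve(lat_curve, outstanding_ios=32):
    # Predicted IOPS per cache size: throughput ~= concurrency / latency.
    return {size: outstanding_ios / (lat_us * 1e-6)
            for size, lat_us in lat_curve.items()}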
Okay.
So now, for the first time in the industry, we can use the cache as a control mechanism, as an SLO guarantee sort of mechanism.
I think that's a game-changing opportunity.
Next, accurate tiering decisions are a close cousin to the last couple of things we talked about.
But instead of just having two regions, the cached region and the uncached region, and you change the boundary, now we have multiple.
Maybe the last one is a cloud with maybe a 20, 30 millisecond latency, 50 millisecond latency, right? Even in those scenarios, the iterative
algorithm that can sit on top of this and change multiple barriers can find sub-millisecond
latency even when your worst case latency is 50 milliseconds or 100 milliseconds because
we can accurately size all the intermediate caches, right, and do that iteratively, okay?
Next step, multi-tenant partitioning.
So we saw a case where we had average of 40%,
in some cases up to 150% performance improvement
by just repartitioning the caches once an hour,
not even minutes or anything like that, once an hour.
All we needed was the miss ratio curves for each client,
and we would adapt and redo it.
And in this example, I'm just showing two,
but in the testing we did,
and we were going up to 20 different workloads running
at the same time to get that. Now, the cloud tier we already talked about; that's
kind of a special case of step number six. And then the final
piece that I'm going to leave you with, and then we have five minutes for questions.
So the final piece here, this is an open area; we're working with universities on this
right now. Someone had alluded to this as a question
earlier, which is, hey these
miss ratio curves, how stable are they?
Are they shifting and kind of undulating
and moving throughout the course of a day or a week?
Well, they are; they're
reasonably stationary when you're at a day scale.
Now, the nice thing is, as the
miss ratio curves change, the algorithms adapt and give you
the new miss ratio curve,
but we don't yet have a theory of the behavior of these workloads and the behavior of the MRCs.
We don't have a theory about that yet.
So we're very actively working to try to figure that out.
And I think the interesting thing is, here's the amazing thing, right?
Imagine if instead of just knowing how much to size the cache or how much to not size
the cache on a daily or hourly basis,
you could come in and say, I'm going to turn the cache off.
Because it's better for me to turn it off now than to turn it on later.
And so there's some heat map type of stuff that we're working on where we're trying to figure out,
are there some kind of theoretical results?
Just like the previous work that we did.
It started off with ad hoc stuff and ended up with a theory. We're trying to develop a theory for being
able to do time series analysis of miss ratio curves and then to put them in
a control loop. It'll probably take us another couple of years to have good results, but
we have some initial results. So the idea is to take the basic computation of the
miss ratio curve and then apply various types of data analytics on it, so that we can get even time-varying prediction,
not just kind of a stationarity assumption.
Okay.
So that's all I have.
So I hope that I've managed to interest some of you
in following up on this work
and taking a look a little bit deeper.
I think we're at a stage now. You know when the technology is first invented, the first
kind of work in this was about three years ago, where there's a barrier to entry because
even the inventors haven't figured out how to explain the technology yet, right?
Haven't really built the use case, haven't figured out how to explain it.
I think we're at the point of marginal return, so waiting any longer than today, as vendors or as systems
or cache designers, I think is not going to give you much benefit,
because I think most of the work has been figured out here.
Most of the challenges have been addressed.
I'm not saying there are no challenges, and some of these things will still take you some time to implement.
But we're at that point of marginal return in terms of your return for waiting.
So I don't think there's any point in waiting anymore.
I think we should be investing in this as an industry.
Questions? Comments?
Yes?
Most of the talk that you gave here was always about block addresses. So how does it cater to whole-system memory that's byte-addressed?
Does it just shift over?
I mean, like, is it exactly the same for those types of designs?
Yeah, it's exactly the same.
So the way to think about it is that any place you might want to insert a cache, once you
put data in the cache, you've got to get it out.
So you key every element in the cache with some unique key.
The miss ratio curve computation only uses that key, nothing else.
It doesn't care about your cache implementation; it reasons about it, but it's not tied to it. And it doesn't really care about
the cache size that you currently have or might have, right? It just says,
look, just give me the list of keys. That might be logical block addresses, it
might be Redis, you know, in-memory, like random keys, or something like that.
Whatever it is, it doesn't matter. So that's for the miss ratio computation.
So as soon as you do the miss ratio computation,
everything else is built on top of it.
So it should be safe.
Yeah.
Good question.
Yes?
One question.
For caching, you have certain underlying assumptions about
LRU and ARC. To support tiering,
are you making other assumptions about how the tiering is done?
Or is this dependent on the tiering algorithm
underneath the data?
So when we looked at tiering, the assumption being made
was that, well, this particular example is LRU.
The same example could have been done with ARC.
So I think the more important question is,
what if part of your system is LRU and part of it is ARC and part of it is some other sort of mechanism
that's not even like a cache?
It could be, for instance, you know,
some other like machine learning system running underneath
that's saying, hey, these blocks are accessed
every like four days and a half.
So I'm going to pull it in or something like that, right?
So that is open.
I don't know the answer to that.
But as long as it's a traditional
cache algorithm, we can deal with it in this
way. But I
do know, for instance, some EMC arrays, they don't
use a cache algorithm for their tiering.
On the other hand, what this tells you is
if you were to cache, what would
the benefit be? Now maybe you can do better than that.
But I actually kind of doubt that most systems,
once you pull in this type of technique,
can do much better than just a pure cache.
Because, I mean, again, we have 57 years of history
of people trying to beat LRU and not being able to.
Good question.
So I would really like it
if I got a chance to say hi to all of you.
There might be a talk in here.
I'm not sure.
We can step outside,
but it would be nice.
I always like to spend an hour with you.
We'd love to kind of know who you are
and exchange LinkedIn maybe
and keep in touch.
Thank you for taking the time.
I appreciate it.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further
with your peers in the developer community.
For additional information about the Storage Developer Conference,
visit storagedeveloper.org.