Storage Developer Conference - #71: Self-Optimizing Caches
Episode Date: June 13, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNEA Technical Council from the presentations at our annual
Storage Developer Conference.
The link to the slides is available in the show notes
at snia.org/podcasts.
You are listening to SDC Podcast
Episode 71.
Thanks everyone for coming early this morning
to listen to me talk about
caches and data path optimizations.
It's always nice to get a bunch of cache lovers together and geek out a little bit, and we're definitely going to try to do that.
So my name is Irfan Ahmed.
I work at Cache Physics. It's a new startup trying to commercialize some technology in the area
of real-time predictive modeling for data access paths. And that's what we're going
to be talking about today. The result of this type of modeling is the ability to increase
performance very systematically or to lower cost of data path software or storage systems.
And a big part of that is caches. So let's look at this at a high level.
So in this very generic looking plot that I have, I'm plotting latency as increasing
on the y-axis and cost sensitivity dropping on the x-axis as you go to the right.
In other words, close to the center where the darkest colors are, you're very latency sensitive,
and you're not really cost sensitive. So that's where I put high frequency trading. If you can
squeeze a few extra microseconds, people tend not to care about the cost of the infrastructure. And then as you get further and further and further out along the diagonal,
the workloads that sit there are less sensitive about latency
and much more sensitive about cost.
So now we can debate about what workloads run where,
and of course I'm being very generic here.
It's possible that for some people the e-commerce workload is as important in terms of the latency trade-offs as HFT. So it's, I think,
less important exactly which application in my choice of drawing it out sits where in this orbit.
It's much more important to think about the hierarchy of storage
that is employed to achieve those latencies and those cost trade-offs.
So anywhere from DRAM to persistent memory to flash to disk drives,
and actually many more layers when you start to think about remote disk
and then cloud disk and so on, so I didn't bother putting all those up here.
So the result of this is that once you have a sense
of what your applications need in terms of latency,
you have a lot of trade-offs to do.
And so in a lot of these trade-offs, right,
there's two basic hardware infrastructures
or software infrastructures available.
One is persistent storage.
So you can try to make your petabytes of persistent storage
really, really, really, really fast.
Some people will do that.
Most people will choose to implement sophisticated caches
so that they don't have to take their petabytes of data
and put it into DRAM,
but rather they will take the portions of that data
that they think that they need at any particular time
and put it in the caches.
So there's a few claims that I'm going to be making, and then the rest of the talk is trying to justify those claims, right?
So first claim is caches are critical to every application.
So whether you're dealing with a database or a key value store or a container or a VM or an app that represents application in this way of thinking,
there are resources available to you, both in terms of persistent as well as ephemeral storage.
So that might be server-resident CPUs with L1, L2, last-level L3, and other caches.
It might be DRAM that the operating system is managing
or the database system is managing, right?
It could be 3D XPoint, flash, et cetera.
And then there's a network, and then across the network,
there's resources in persistent storage as well.
So the use of caching to accelerate data access
is pervasive in modern storage systems.
The challenge often is that monitoring systems
typically provide only a very little bit of information about the cache.
So, for example, in this case, I've drawn a very simple dashboard.
Actually, this is the only dashboard you can have today for caches,
which is, well, for that cache, the hit ratio is 65%,
and it's currently running at 128 gigabytes, right?
So that's not a lot of information
that you can use to operate your environment.
So here's my claims.
So even though caches are critical to every application,
cache management is nonexistent.
Now, it's deliberately a little bit strong,
but I'm going to try to prove to you that it is, in fact, largely correct.
So let's start with some questions you might have about your cache. And cache management means you
should be able to answer some of these questions. I think some of them are fairly straightforward.
Is this performance good?
So that's 65%. Is that good?
Is that the best that could be done for this workload?
Can we answer that question today?
Because if we could, then maybe my claim is hollow.
But I think today it's very hard to answer the question
of is this performance that's coming from that cache good?
Could it be improved?
That's a corollary of the previous one.
What if I gave you 129 gigabytes of cache?
What will you do in terms of hit ratio?
Will it be better?
Will it be the same?
Could it get worse?
Could I give you more cache and things get worse?
Can we say any of those things with certainty?
Not really today.
So how much cache for application A versus B versus C, etc.?
Right?
If I have some limited budget, how do I allocate that budget?
What happens if I add, remove DRAM?
What happens if I replace some of my DRAM in my cache with Flash?
What will happen?
Very difficult to answer that question.
How to achieve 99th percentile latency of 10 microseconds or
whatever number of microseconds it is that you desire. Okay, here's a twister. What if I keep
everything else the same, but I just add one more workload to this cache's input workload? Will that
change anything? Make it worse? Make it better? How do you know? And then is there cache pollution
going on? How do I figure out if there's cache pollution?
What can I do about it?
What if I change cache parameters?
Most of the cache systems,
actually almost every single cache system that I've ever worked with in my career,
has had parameters.
Now, how do you figure out what parameters to set
for that particular workload or mix of workloads?
So if you look at these questions,
and, at least in my opinion, our not being able to answer any one of them satisfactorily,
that leads to my conclusion that, look, caches are very important.
Everybody cares about them.
But we simply do not know how to manage them as an industry, as a field, as a discipline today.
Okay, so you guys, I'm going to try to see if we can get better than that.
But you've got to hold me honest.
If I'm way off course, just say so.
Okay, so part of the reason for that, I think,
is that what we have today are very static caches.
And there's an opportunity, I think, to start to think about caches
that are much more dynamic and self-optimizing, self-learning.
So just conceptually, what I did was I took three parameters,
just parameter one, two, three.
They're not labeled right now.
We'll see later on in the talk
what some of these labels could be as parameters.
So an example of that could be cache block size.
Another example could be read ahead,
whatever the parameters are of that cache.
So there's a point cloud here,
and every point in that point cloud
represents just one application.
But it represents the optimal parameter setting
for those three parameters for that application.
Okay?
So you'll see hundreds of applications
represented one dot per application.
And that dot just...
That application has different performance profiles
throughout that space, but I've only
picked one spot, which is the best one
for those three parameters. Make sense?
Okay.
Now, so you can already see
that in this parameter space, every one of those
hundreds of applications has a different optimal
operating point.
So in this case,
it could be cache size, it could be parameter settings.
So, how do we ship systems today?
We do some benchmarking.
We pick some spot that maybe works for one workload
or some mix of workloads.
Actually, it's usually not even one workload
because you're trying to trade off, okay,
these parameters are good for this application workload,
for this benchmark, these are for another one.
So let me just kind of set it halfway somewhere in the middle.
The result of that is that the shipping systems
are not optimal for any single workload, most of the time,
and often are very far away from optimal points
given the actual production systems that are running in the field.
So this is this concept of, okay,
this type of way of doing caching, what I call static caching, leads to thrashing, leads to scan pollution, results in gross unfairness or interference, because the cache system is not actually paying attention to the type of workload that it's receiving, and therefore not adapting itself to it.
It's just operating at whatever point that the developer or the performance engineer
set as the default.
And of course, leading to unpredictability,
over-provisioning, lack of control.
These are very serious problems, right?
I've talked to people who are running production caches
that are hundreds of terabytes big
and they think they're 5x over-provisioned.
And I ask them, well, why don't you go to like 4x over-provisioned?
Because they're like, well, I don't have the confidence.
I don't know the answer to the questions
that we asked previously. Okay, so if we want to get better than this, we want to
go from static caches to maybe self-optimizing caches to caches that can adapt themselves to
those workloads and can self-manage or at least provide good management hooks for the operators.
What we need to do is we need to shift a little bit our thinking, and I claim that to do that,
we have to try to do some modeling of the cache with the workload and to try to do so
in real time, not every night or something like that, try to do it in real time.
So on the left-hand side, you'll see the same little dashboard which caches provide today,
basically one number, which is the hit ratio. And what I did on the right-hand side is I actually plotted that as that blue dot. So what the blue dot is showing
is the 35% miss ratio, in other words, 100 minus 65, 35%. But in this particular case, I've translated
that into latency, because if you know what your miss or hit ratio is, you can easily just come up with the latency number as long as you know the back-end's miss penalties.
So in this talk, for the remainder of the talk,
we'll use latency on the y-axis
and miss ratio on the y-axis interchangeably,
because for both of them, lower is better, right?
And there's a simple arithmetic conversion between them.
In this case, the cache size here on the x-axis, right,
again, the 128 gigabytes
are resulting in that blue dot, okay? So the entire exercise associated with what we're going
to try to do in modeling is we're going to say, okay, that dot is not good enough because that
dot has all the problems and all the questions that are unanswered from a couple of slides ago,
which is, is this good? Can I get better?
What if I do this? What if I do that?
So how do we do that?
Well, we're going to have to do some modeling, right?
We're going to have to put a little tiny little brain in there somewhere.
And we're going to learn the performance model
of the application with that particular type of cache.
And what we're really trying to do is,
well, there's a fancy way of saying it.
Sometimes academics are in my audience, so I have to put a function up there to satisfy them. Otherwise,
I don't get good ratings on my talk. But really, what you're trying to do is you're trying to build
a model. Performance is just f, a function of the cache size and the other parameters.
So what does that mean? Well, all that means, folks, is we need to plot that curve.
That's it. If we can plot this curve for a particular given workload as you vary a parameter,
how does performance vary? If we can draw this curve for the workloads that we're interested in
in production in real time, we can answer all of the questions that we started off with.
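To make "plot that curve" concrete, here is a minimal sketch, in Python, of the classic Mattson-style stack simulation for an LRU cache: one pass over a trace of block IDs yields the miss ratio at every cache size at once. The trace format and sizes here are illustrative, and a production implementation would use sampling tricks to keep the cost tiny, as we'll get to.

    def lru_miss_ratio_curve(trace, max_blocks):
        # Mattson-style stack simulation: one pass over the trace gives the
        # miss ratio for every LRU cache size up to max_blocks.
        stack = []                            # LRU stack, MRU block at index 0
        reuse_hist = [0] * max_blocks         # reuse_hist[d] = accesses at stack distance d
        total = 0
        for block in trace:                   # assumes a non-empty trace of block IDs
            total += 1
            if block in stack:
                d = stack.index(block)        # stack (reuse) distance of this access
                if d < max_blocks:
                    reuse_hist[d] += 1        # would hit in any cache bigger than d blocks
                stack.remove(block)
            # first-time accesses, and reuses deeper than max_blocks, miss at every modeled size
            stack.insert(0, block)            # promote to MRU
        curve, hits = [], 0
        for size in range(max_blocks + 1):
            curve.append(1.0 - hits / total)  # miss ratio at this cache size
            if size < max_blocks:
                hits += reuse_hist[size]
        return curve                          # curve[size] is f(cache size) for this workload

Feed a day of production block I/O through something like that and you get the whole curve, not just the single operating point on the dashboard.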
Okay, so let's look at this curve a little bit.
So again, here cache size is the x-axis,
but we know there are more parameters than cache size.
But then it gets hard to visualize in multiple dimensions,
so I've tried to just, as much as possible,
trying to just keep one variable just to illustrate it,
but in production in real time,
we would be modeling more than one parameter.
So now you see a lot...
First of all, you see that sometimes you add cache
and nothing happens.
Okay, so for example, if you're at 10 gigabytes
and you add cache, nothing happens for a while.
So let's look at that a little bit more closely.
So we're at 10 gigabytes.
We decide we want to improve things.
And let's say we double the cache size.
So now let's say we went from $10,000 a day
to $20,000 a day.
Performance did not improve.
Now, how do we know performance didn't improve?
Because remember, it's flat there.
So latency did not drop.
So then we decided to go 4x.
Then we decided to go 8x.
Now we went from $10,000 a day to $80,000 a day.
I'm just making up those numbers.
But whatever that is, 100 to 800,
whatever the cost of the system is
in terms of its cache footprint,
we've gone up 8x with no performance improvements.
So maybe if you don't have this curve,
maybe you'll give up then.
Maybe you'll say,
look, this workload cannot do better.
Performance cannot get better.
I've gone up 8x.
What do you expect me to do?
I'm not going to go up 100x.
Little do you know that for every
single megabyte of additional cache
that you're going to provide to this workload after that,
you're going to start to see improvements, and that's
where the curve... So the nice curves, you know,
improve with cache size
or other parameter changes, right?
Okay.
So this is nice because we actually
had a curve, and we could just argue about it. But
remember, those curves don't exist in systems today, and so you are just feeling it out and
trying different parameters, which is actually dangerous in production because you can actually
create negative effects as well. Here's another way of looking at the same curve. And by the way, this is
a production virtual machine virtual disk workload, and it's not a diagram that I just created randomly.
So here, it's the same plot,
but we're going to try to,
instead of thinking about it from a performance point of view,
we're going to start to think about it
from an efficiency point of view.
So you can take the same curve,
and you can annotate it a little bit differently,
and you can say, well,
it looks like most operating points here
are not efficient.
Why are they not efficient?
Well, because you could have gotten away with a smaller cache for the same performance.
So the particular blue dot, the reason why I picked that spot for this presentation,
is because it's a super inefficient operating point.
Why?
Because you could have had one-eighth the cache size
for the same approximate performance.
So you do not want to operate there.
So what you can do is,
you can label the space,
and just operate only on the efficient operating points.
And so you're looking for some sort of a knee,
and depending upon what your threshold is
for defining efficient,
there will be more or less points here.
But in perhaps a strict definition,
you can see that you want to operate
at the bottom of a cliff,
because you've gotten a huge improvement,
and you're not going to get
too much more improvement later on.
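As a rough sketch of what "label the space and keep only the efficient points" could look like in code (the threshold below is a made-up number, not anything from a real system): keep the cache sizes that sit at the bottom of a cliff, where the previous increment bought a lot and the next increment buys very little.

    def efficient_points(curve, gain_per_gb=0.005, gb_per_step=1):
        # curve[i] = modeled miss ratio at cache size i * gb_per_step gigabytes
        picks = []
        for i in range(1, len(curve) - 1):
            recent_gain = curve[i - 1] - curve[i]    # what the last step bought us
            next_gain = curve[i] - curve[i + 1]      # what one more step would buy
            if recent_gain / gb_per_step >= gain_per_gb and next_gain / gb_per_step < gain_per_gb:
                picks.append(i * gb_per_step)        # a "bottom of the cliff" size
        return picks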
So can anybody just hazard a guess
as to why there are these flat spaces
in these workloads?
Why is it that we see these flat regions?
Working set sizes, exactly.
And why are there more than one?
Right?
Sorry?
Well, actually, this is not exactly that;
this is the pattern of the workload's reuse distance.
So we have not...
Caching comes in as the algorithm that can suck locality out of it.
But there's a hierarchy of working sets is another way of thinking about it.
So maybe there's this fast working set that's whatever data is being used for that hour or two hours or whatever. And then over the course of a diurnal cycle, there's some reuse of blocks, right?
And then over the weekly cycle, there's some reuse of blocks.
And depending upon how big your cache is, it will capture a different portion of it.
So that's why there's these flat regions, right?
Exactly right.
It's successive working set sizes.
Okay.
Okay.
So now we're starting to build an intuition
for what this modeling is all about.
Now let's see how we can go to that concept
of self-optimizing, self-learning, self-adapting.
Okay, so I just went from one curve
into one, two, three, four, five curves.
In this particular case,
this is a single workload.
It's the same actual workload.
Remember I said that I'm going to try
to keep it to the same axis
on the plot rather than going five-dimensional
on us.
But what's happened here is that
it's the same exact workload
in all of those five plots.
So why are the plots looking different?
Why aren't they all the same?
Well, the only thing to change,
not the cache algorithm,
not the workload.
What changes is the cache's implementation parameters.
In this case, the cache block size.
Okay, so now imagine this.
You're running a production.
You've got some workload
that really likes two kilobyte blocks.
It tries to get data like a DB2 or something like that.
It tries to get data in 2K chunks.
And you're operating the cache at 64 kilobyte block size.
So there's going to be some inefficiency there.
So what we're doing here is we're just saying,
okay, vary the cache block size,
and those are what each of those curves are.
And we're going to see what the resulting performance is.
The remarkable thing here,
which actually completely took me by surprise
when I first discovered this,
is that depending upon how much cache you're giving,
different block sizes are more efficient or less efficient.
Okay?
Remarkable.
I mean, it's shocking.
So that's what those arrows are.
So there's points where the difference
between two different cache block sizes flips over
depending upon how much resources you're providing.
So if you have this sort of a thing,
and this is what real world looks like,
we have a pretty hard problem,
but if you know all of these curves,
the problem gets pretty easy, and how?
So what you do
is you just say, okay, I'm going to trace the lowest profile, and that's going to be my operating
surface. So if for one reason or another I need to allocate more resources for this workload,
beyond this horizon here I'm going to switch from that block size to this one.
And if I need to allocate more beyond this horizon, I'm going to change the block size again,
and so on and so on. So if you know the models for the different settings of your
cache implementation, you can just pick the right one online. So that's what we're trying
to say here. If you can predict the performance under different policies,
you can just pick the right one.
That's it.
Just pick.
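A minimal sketch of "trace the lowest profile," assuming you already have one modeled curve per candidate block size (the dictionary shape below is hypothetical): at any target cache size, just take whichever setting the models say is lowest.

    def best_setting(curves_by_block_size, cache_size):
        # curves_by_block_size: {block_size_kb: curve}, where curve[cache_size] is the
        # modeled miss ratio for this workload at that size with that block size.
        # Picking the argmin at the current size walks along the lower envelope,
        # switching block size exactly at the horizons where the curves cross.
        return min(curves_by_block_size,
                   key=lambda bs: curves_by_block_size[bs][cache_size])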
Now, of course, your cache implementation
has to be able to switch block sizes on the fly.
Some of them may, some of them may not.
And if you, for example, are owner of a cache subsystem
that cannot today change cache block sizes on the fly,
you've never been motivated to do it
because you've never had a model that tells you exactly
when to switch those on the fly. Other parameters fall in this
sort of situation as well where, until you know it, you may not be
motivated to actually implement these knobs, these online, real-time
change knobs inside your cache algorithm. So far, so good.
I'll take questions as I go.
Yeah?
How much overhead do you think that's going to add
if you start switching that parameter
when you have a cache that's almost working at its limit?
I guess size works for this, right?
So, overheads is the question.
Overheads.
Yes.
So, there's multiple different overheads
that we need to think about.
One of them is how much memory do we need to do this performance model,
and will that eat away memory that we could be using to do caching itself?
So that's one question.
Another question is to do these models, oftentimes when you think modeling, you think, oh my god, I'm going to burn a lot of CPU.
I'm going to add latency just to do the model, and so therefore is it worth doing, right?
And I think, and let me try to address these and see if there's other types of overheads you can think of.
So in terms of memory overhead, each one of these curves costs about half a megabyte to a megabyte in memory.
So let's say you wanted to do parallel simulation of a dozen different combinations of parameters.
In that case, it would be 6 megabytes to 12 megabytes depending upon the situation.
Does that make sense? And then in terms of CPU overhead, each one of these curves that you're trying to do
will cost you about 40 nanoseconds per IO on average, amortized.
So you definitely don't want to do thousands of these,
but you can easily do a dozen of these
and still fit it under your budget.
And the nice thing about the CPU impact,
which is, like I said, about 40 nanoseconds per curve, per IO,
is that it's not in the critical path.
You just need to preserve the identity of the block that was going through the data path.
You just put it in a circular buffer and drain it off.
So it doesn't have to add to the latency of the real path.
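Here's a rough sketch of the structure I'm describing, with illustrative names and sizes: the I/O path only records the block identity into a bounded ring buffer, and a background thread drains it into the model, so the per-IO modeling work never sits on the latency path.

    import collections, threading, time

    class OffPathModeler:
        def __init__(self, update_model, capacity=1 << 16):
            self.buf = collections.deque(maxlen=capacity)  # bounded ring buffer of block IDs
            self.lock = threading.Lock()
            self.update_model = update_model               # e.g. the reuse-distance histogram update
            threading.Thread(target=self._drain, daemon=True).start()

        def record(self, block_id):
            # Called on the I/O path: O(1), no model work here.
            with self.lock:
                self.buf.append(block_id)                  # oldest samples drop if we fall behind

        def _drain(self):
            # Background thread: pull batches out and update the model off-path.
            while True:
                with self.lock:
                    batch = list(self.buf)
                    self.buf.clear()
                for block_id in batch:
                    self.update_model(block_id)
                time.sleep(0.01)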
Does that address your question?
Okay. Yes. So does this assume an LRU replacement policy?
So the remarkable thing is that when we first started out this research,
we only figured out how to do this efficiently for LRU.
In the last few months, we figured out how to do it for arbitrary cache policies.
So we can now do it actually remarkably for OPT.
So I'm going to show you some results for OPT. And OPT is the unachievable best-case cache algorithm.
So if you had an oracle for your cache replacement policy
that could spend as much energy as it could
under the constraints of on-demand replacement,
it provides the frontier for what the best caching policy in the world
can do, right?
Nobody can actually achieve it in practice.
It's too expensive.
And we can actually simulate that also in real time using the same overhead.
So now we've gotten to a stage where you throw your implementation.
It doesn't have to be standard LRU.
It could be LRU-K.
It could be LRU with read-ahead.
Or it could be ARC.
It could be LIRS.
It could be 2Q.
It could be whatever complexity it is.
And now we can just wrap around it and actually do the modeling.
And so that's a huge, huge breakthrough, and this is the first time, at SDC, that we're talking about it.
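One published way to do this for arbitrary policies, and roughly what I'm describing, is miniature simulation: spatially sample the reference stream by hashing block addresses, and run your real policy code on a proportionally scaled-down cache. A simplified sketch, where the lookup and insert calls are a hypothetical stand-in for whatever ARC, LIRS, or 2Q implementation you already have:

    import hashlib

    def sampled(block_id, rate=0.01):
        # Spatial hash sampling: deterministically keep a fixed ~1% subset of all blocks.
        h = int(hashlib.md5(str(block_id).encode()).hexdigest(), 16)
        return (h % 10000) < rate * 10000

    def model_policy(trace, make_policy, full_cache_blocks, rate=0.01):
        # Run the real policy implementation on a sampled stream and a
        # proportionally scaled-down cache; its miss ratio approximates what
        # the full-size cache would see under that policy.
        mini = make_policy(capacity=max(1, int(full_cache_blocks * rate)))
        hits = refs = 0
        for block in trace:
            if not sampled(block, rate):
                continue
            refs += 1
            if mini.lookup(block):            # hypothetical policy interface
                hits += 1
            else:
                mini.insert(block)
        return 1.0 - hits / refs if refs else None   # estimated miss ratio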
Okay, so let's make some progress.
But I'm happy to take questions as we go.
There's a lot of complex stuff here.
So this is, now we get to see more of these, right?
So we looked at a couple of different ones,
and we've gotten some familiarity
with what these models look like,
and they're not that complicated
when you simplify them down into a curve, right?
And so hopefully they're less scary
than in the abstract
you read for this conference.
So here's a bunch of different workloads.
Some of these are from the Cloud Physics trace set,
again, virtual disks from production,
customer workloads.
Some of these are OLTP, database,
web workloads, et cetera.
There's also a couple of, actually three in this case,
Microsoft Research traces here as well,
and that's just for completeness to say, you know,
over 120 workloads that we've looked at so far in our publications
that we've been allowed to release in our academic papers and white papers,
and so that's all available in public because we have permission from those customers to use it.
So the accuracy is there.
So here, okay, so let's look at some of these curves. Let's look at T05 first. In this case,
what we see is the blue is that OPT that I was mentioning, this blue curve, right? No real cache
algorithm can ever get there. Now, there's two other cache algorithms here, ARC and LIRS. And so in this particular T05 case,
they kind of compete with each other. As you can see, again, lower is better, and as you
go across the x-axis here, we're increasing the cache size. In this case, for this workload,
only small caches were needed to be able to get to less than about a 5% miss rate. So only
from around 20 gigs.
But you can see that they're competing with each other.
So these two algorithms in that workload
are not really that far from each other.
Again, how do you tell that in production today?
You don't. You have to have these models.
But if we look at T19,
they kind of fight with each other
for up until about 200 gigabytes.
But if you're willing to go more than 200 gigabytes,
there's a massive performance difference
between those two algorithms.
And in this case, remember, lower is better, so LIRS wins.
So actually, one of the things we found is that,
as a side effect of our work,
is that LIRS actually is better than most people thought it was
in real production environments.
And it's very competitive with ARC, just as a side note.
And so here's another one where LIRS hugs OPT very, very closely.
It gets really lucky in that workload,
and all the way out to about 600 gigabytes of working set,
LIRS is hugging OPT pretty closely,
whereas ARC has this degenerate pattern
where it is not able to benefit from that workload
all the way until about 600 gigabytes.
So in this case, you're seeing OPT, ARC, LIRS,
online modeling, real-time modeling of them
for the cache size case.
So hopefully now we get to get more practice
with what these models actually look like in practice.
Okay, so now let's try to use these for cool things.
So remember I mentioned in the cache policy tuning example
how you can just hug the lower profile,
the lower envelope?
Well, this is what we did, again,
for a couple of these workloads.
In our more detailed paper,
we've done it for all of our workloads.
And so what you see here is you see a top and lower profile.
So the minimum performance, in other words, the best performance that we saw using static settings.
So we ran a whole bunch, like dozens of different parameter settings for LIRS,
but static settings,
and we picked the lowest profile and the highest profile.
And then we said, hey, go and automatically find those
while you're running in real time, okay?
So the red curve here is, again,
trying to find the best, trying to find online
without knowing these curves ahead of time.
So there's no faking here.
It's all real.
The algorithm does not know the curves.
It has to figure out the curves,
and it has to pick the best,
and it has to change constantly online
as the workload keeps shifting
because these are very, very long workloads,
and over the course of time,
they change what the best parameters are.
So the best case would have been the left-hand side,
which is hug the lower profile, right?
And that's the best we could do for LIRS for that workload,
and our algorithm was able to achieve it.
The worst we did across hundreds of our traces
is the one on the right, okay,
because there was some things about the workload,
some workload shifts that our real-time adaptation algorithm
was not able to take into account.
So you'll see here that we're able to hug the lowest profile for a while,
and then we just kind of get away from it, okay,
and just we couldn't figure out how to make the real-time algorithm
better than that, right, with all the constraints.
So, you know, there are often fake signals inside workloads.
Shifts happen.
But even then, we are well below the middle
for most of that, in terms of the
best-static-to-worst-static distance.
Does that make sense?
So this is the opportunity, right?
Now, again, if we were shipping this in a production system,
we would have picked some line
that was either the higher profile
or the lower envelope
or some other single curve in between.
But what the system is able to do
once it can do this modeling and do it real time
is to just pick the best possible as you go.
Yes?
So when you are training those models online,
do you see any I/O stalls?
Any stalls in what's coming from a host
to the cache and out of the cache?
So the most sensitive parameter
for LIRS is something called
the stack distance.
So there's a metadata stack
deep inside the bowels of LIRS,
and that's the most sensitive parameter for LIRS,
and that's the one we were working on.
So as you change that, what happens is
that if you have a long stack,
you're just going to prune that stack.
And just by pruning that stack,
you're going to get better performance.
That's an example.
But workload shifts again
and actually needs a longer stack.
So you add a bunch of entries
to make the stack longer
and performance improves, right?
So as you can see,
it's not like a cache flush
where you just dump all the...
It's not like that, right?
It's just inside the algorithm,
something needs to change
because this workload is now operating in a way where
the cache is getting confused sitting at that parameter setting, right?
Does that make sense?
Okay.
So this is the first kind of real-world use case
we've seen for adapting an algorithm, in this case LIRS.
And then there's another well-known algorithm called 2Q.
We have results for this in our more detailed analysis for ARC,
which is popular as well as for LRU.
But here, for 2Q, the same sort of thing, right?
There is a top profile, which is, you know,
could be a setting that somebody ships with, but it's static.
And then the lower profile, which is also static.
So the best and worst for these workloads for that algorithm.
And then the real-time algorithm kind of tries to find its way
without any of the additional information,
without knowledge of these curves ahead of time.
Real-time is trying to find it.
And as you can see, it's always in the middle,
and most of the time it's able to achieve very close to optimality,
even with no prior knowledge.
Okay, so now we've seen a couple examples
of adaptation of policies or parameters of cache algorithms.
Now let's look at a totally new class of acceleration.
Now, you know, a lot of times in our industry,
we say performance acceleration,
and usually we're talking about either a new piece of hardware
or, you know, introducing a cache.
And we say, okay, well, if you introduce a cache, you're doing acceleration.
But what I'm going to talk to you about is not just about introducing a cache.
We're saying there's already a cache, and the cache is handicapped.
It's sitting where it's doing the best that it can do.
So it's already doing acceleration, right?
And so I couldn't come up with a name to identify this acceleration of the acceleration.
So, like, what's that, acceleration squared, right?
So I just said, okay, new class of acceleration.
So, but this is really, really interesting and exciting.
So, again, we see here a curve,
and you can see those same pattern of flat regions in this curve, right?
That's those working sets, and that's those, you know,
marginal return is really horrible in those areas,
right? And so remember, this is a particular cache algorithm, and it's trying to cache as best as it
can. But when you look at that, at those flat regions, aren't you motivated to say,
hey, what if we could just somehow skip these? Because these flat regions are horrible; they don't
work well, and they're very inefficient operating points. Well, could we just
somehow come up with a way to just cut across all this nastiness and just go straight to the next
cliff, and then straight to the next cliff? So this has been really, really interesting,
and really what this is saying is, can I tweak the curve?
Like, can I change the rules somehow?
What can I do?
Is there something simple that I can do
that post my acceleration,
I can get another boost of acceleration?
Okay.
So it turns out that we can steer the curve.
It turns out that you can interpolate that convex hull,
and you can take your cache algorithm
and tweak it a little bit,
and all of a sudden, you end up operating on the convex hull, which in this case is basically the
diagonal.
But however, to be able to do it, there's one prerequisite, which is you have to know
the model.
So you have to do online, real-time modeling to find that red curve.
But if you know the red curve, you can always find the dotted line.
Because there's an algorithm for that.
It's published in HPCA 2015.
So let's look at an example.
So we have our blue dot again.
And this blue dot is sitting somewhere in that flat region.
Turns out that if you tweak your cache implementation to just have shadow partitions, and shadow partitions means that they're not exposed to the user, but they are partitions. They're
inside the algorithm, right? So we're just going to call them alpha and beta. So let's
say you had a one gig cache, just for example. In this case, let's say a 50 gig cache, right?
So part of that cache would be partition alpha. Part of it would be beta. Nobody has to know
about it. We're just going to hide it underneath the surface. And we're going to tweak the sizes of the alpha and beta
and steer hits and misses to them.
And that's how we're going to be able to achieve
sitting at that convex hull.
So the way that works is
that the partition of the 50-gig cache,
in this case, that is called alpha,
this hidden partition,
we're going to operate that partition
at an efficient operating point.
And the rest of the cache, which is the beta partition,
we're going to operate that
at the next efficient operating point.
And we're going to pick efficient operating points such that we skip these middle ones,
and that's where the convex optimization comes in, the convex hull identification.
So we're going to study this curve, find the cliff that is along the convex hull, and we're going to take the portion
of our cache, which we decided to call beta, the hidden partition called beta, and we're
going to steer hits and misses to it so that it operates here. So it turns out that if
a portion of your workload is operating here and a portion of it is operating here, on
average you're operating in the middle. And then that's true for every single cache size.
Now, for every single cache size,
the bottom of the cliff,
the talus of the cliff that you're going to pick
is going to be different.
But as soon as you're able to pick it,
you just operate there.
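For the curious, the sizing arithmetic behind the shadow partitions is roughly this; it's a sketch of the HPCA 2015 style construction, where sizes are in the same units the curve is indexed by, and the actual steering is done by hashing block addresses into the two partitions.

    def shadow_partition_sizes(curve, hull_sizes, total_size):
        # curve[s] = modeled miss ratio at cache size s;
        # hull_sizes = the efficient sizes on the convex hull of that curve,
        # assumed to include a size at or below total_size.
        lo = max(s for s in hull_sizes if s <= total_size)
        hi = min(s for s in hull_sizes if s >= total_size)
        if lo == hi:
            return {"alpha": (total_size, 1.0)}, curve[total_size]   # already on the hull
        # Send fraction p of the (hashed) reference stream to hidden partition alpha,
        # sized p*lo so it behaves like an efficient cache of size lo; the rest goes
        # to beta, sized (1-p)*hi. The two sizes add up to total_size.
        p = (hi - total_size) / (hi - lo)
        partitions = {"alpha": (p * lo, p), "beta": ((1 - p) * hi, 1 - p)}
        expected_miss = p * curve[lo] + (1 - p) * curve[hi]          # a point on the hull
        return partitions, expected_miss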
Okay, so the more details,
you can either talk to me about it
or there's some papers you can read,
including the HPCA 2015 one.
So the result of this is this acceleration squared, because the first acceleration you
got was by having caching in the first place, by having a cache. And the second acceleration
you got was by tweaking the cache algorithm to have these shadow partitions and then steer
hits and misses into those shadow partitions to be able to sit on the convex hull, okay? So not possible without modeling. So
you have to model first. You've got to pay that cost of the model, but then you can get these
super effects coming out of it. Okay, so this is theoretical. Let's look at some real examples.
Here we've got three different workloads and two different cache algorithms in those workloads.
So the same setup as always.
Missed ratio is on the y-axis.
Lower is better.
And the x-axis is increasing cache size.
So for these three workloads,
you'll see a large diversity of different cache sizes
because the working sets are different sizes.
One other thing I want to mention real quick here. So you see ARC here. Remember I said, what if I increase my cache size?
If I increase my cache size in production, of course I'm going to get either
equal performance or hopefully better. No, no, no, no, not always the case. Look at ARC here:
in this region, we added more cache and the performance got worse. Remember, lower is better.
Performance got worse.
So for these complicated cache algorithms,
the complexity is so
phenomenal that you can't even rely
on performance improving if you add cache.
I've shown you
that that's not always the case. And there's
lots and lots of examples of this.
And that's yet another reason why modeling
is important. You don't want to just mess around
with production caches
because you could actually make it worse
when you think you're making it better.
Okay, so here,
the setup is similar to before.
So black is the original miss ratio curve.
So the first acceleration
when the cache is added,
you can predict the performance
of the workload with that cache in terms of hits and misses by just looking at the black curve.
The dotted line is the convex hull of that entire curve.
Now, of course, you don't know that when you start off, but the algorithm has to learn both the black curve and the dotted line.
If it can learn, it can actually operate in the red zone.
Now, we think that over time when we get better at figuring out how to do online modeling,
we can get these red curves to be closer and closer to the convex hull.
But again, any improvement against the black today is a net win just by a slightly better algorithm and by online modeling.
So it's a huge, huge win.
So you'll see all the way on the left, that was an example,
I think one of our best examples
of hugging the optimal,
sorry, the convex hull.
And then the other workloads
have varying levels of error
against the hull.
No knowledge, real-time modeling.
And you see, there are a couple cases
where we haven't figured out quite how to bridge the gap.
There are some cases where we are just a tiny little bit worse.
We have some ideas on how to improve that in the future.
Okay.
So in this particular case,
the leftmost one is 69% of the potential gain
that the algorithm was able to achieve,
and then a 40% and a 30%.
Make sense?
Okay.
So, next, we said, okay, once you have a model,
what else can you do with it?
And it turns out, like, we haven't run out.
I think we still haven't run out of cool things
that you can do once you have these models.
So one of the things I've been looking for
most of my career is the ability to have
a storage system provide latency guarantees.
Like, why can't this XYZ vendor storage system just give me a knob that says,
hey, I just want 7 milliseconds, man.
If you give me 6, great.
Don't give me 8.
Why can't I have that, right?
And being on the vendor side, I was always looking for it.
I think we finally cracked how to do it.
And here's how.
So on the y-axis is 95th
percentile latency. Remember, it's just an
arithmetic conversion between
miss ratios and latency. So
lower is better over there as well. But in this case,
it's not just any latency. It's 95th percentile
latency. And then cache size is on the x-axis.
So let's say that a customer comes to us
and says, look, I want a 7 millisecond latency target.
Okay.
Well, today we can't really give it to them,
but if you had a system like this
that was doing online modeling, you could.
And how would you do it?
You'd just go over to the 7-millisecond line,
and you'd walk across
until you found the intersection,
and then you'd say, oh, I need 16.5 gigabytes.
I just want to be safe,
because workloads do shift and change
over the course of a day or a week,
so I'm going to give maybe 18 gigabytes.
And as long as I do that,
and this workload goes through the patterns
that it has gone through in the past,
and there's some stationarity in those patterns,
I'm guaranteed to get seven milliseconds.
It's just like walking across a curve
and you've got latency guarantees.
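In code, "walk across the 7-millisecond line" could be as simple as this sketch; the penalty, target, and headroom numbers are just examples.

    def size_for_latency_target(curve, gb_per_step, miss_penalty_ms,
                                target_ms, headroom=1.1):
        # curve[i] = modeled miss ratio at cache size i * gb_per_step gigabytes.
        # Latency is approximated as miss_ratio * backend miss penalty for the
        # percentile you care about (hit cost treated as ~0).
        for i, miss_ratio in enumerate(curve):
            if miss_ratio * miss_penalty_ms <= target_ms:
                return i * gb_per_step * headroom    # e.g. 16.5 GB -> give ~18 GB
        return None                                  # target unreachable with this backend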
Now, of course, the model has to
be accurate, right? So you have to have a good way
of determining the model. You have to be doing it on
time scale where the workload varies enough
that your patch is predictive of the
future.
Thoughts, questions on this?
Yes?
Yeah, you know, the conversion from
ratio to latency is not so simple.
Okay, so...
Especially in a cache hierarchy,
improving the hit rate of the upper cache
can degrade the hit rate of the lower cache.
So let me repeat the question, or the comment.
The comment is that, Irfan, you constantly have been saying
it's a simple arithmetic conversion between miss ratios
and the average or the 95th percent latency,
and it's actually more complicated than you keep saying it is.
And I think you're right.
So here, at the first level, it is arithmetic.
So let's just do the arithmetic, right?
So I'm going to put miss ratio and hit ratio,
and they're just one minus the other, right?
So for the miss ratio, I'm going to multiply that
by the back end or the remainder of my cache hierarchy's miss penalty.
So let's say it's one millisecond median, three milliseconds 75th percentile, and nine milliseconds
99th percentile. So let's get the full distribution. So for each of those distribution points,
I'm going to take my miss ratio, and I'm going to multiply that by the latency. And I'm going to add the hit ratio times the hit penalty,
or hit cost, in, say, 100 microseconds. But that's so small, I'm going to just make it zero just to
make it easy for now. So therefore, to the first approximation, what I put on the y-axis here as
latency is nothing but miss ratio that my
model predicts times whichever percentile of latency that I've measured for the remainder of
my backend. So that's very arithmetic.
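Spelled out, that first-approximation arithmetic is nothing more than this, using the made-up penalty numbers from above:

    def latency_from_miss_ratio(miss_ratio, miss_penalty_ms, hit_cost_ms=0.1):
        # Each miss pays the backend penalty for that percentile,
        # each hit pays the (small) cache hit cost.
        return miss_ratio * miss_penalty_ms + (1.0 - miss_ratio) * hit_cost_ms

    # With a 35% miss ratio and the example backend penalties:
    #   median:          latency_from_miss_ratio(0.35, 1.0)  -> ~0.4 ms
    #   75th percentile: latency_from_miss_ratio(0.35, 3.0)  -> ~1.1 ms
    #   99th percentile: latency_from_miss_ratio(0.35, 9.0)  -> ~3.2 ms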
So now, the next comment was: okay, you're right, let's say that your arithmetic is correct, and you're able to actually determine it statically for this current configuration,
but how do you predict it
after having changed the top-level cache,
let's say, for the sake of argument, right?
But the nice thing is,
all of our caches should have these models.
So when we go and tweak the top-level cache,
let's say make it smaller, right?
We know the operating point of the next cache
because now we've made that a little different.
We moved it to the left
and the one down from it to the left and so on.
So we should theoretically be able to
simply predict all the way through that network of caches
as long as each one of them is doing these models
and then know for each of them
what the resulting is going to be.
But your model is derived from looking at the workload
presented at the operating point.
Yes.
No, no, no. So again, the beauty
of these models is that they're
independent of the operating point, right?
The model, remember the blue dot that was
here? We predict less than
and greater than in the same pass.
Yeah, yeah, for that level of cache.
For that level, sure, sure.
And you're doing that observing the workload that's coming into that level.
Yes.
The workload coming into the next level of cache
is now different.
But we know exactly how it's going to be different is the point.
Well, but you haven't measured
the curve for that.
So there's two cases.
The one case is inclusive caches,
and the one case where it's exclusive caches.
For exclusive caches, it's trivial.
The reason why it's trivial is that that entire curve
actually applies to every level of the cache.
And inasmuch as you have an exclusive cache in your system,
it is trivial.
For inclusive caches, you have to do a little bit more modeling,
and I confess.
And we can talk offline about why it's different for inclusive versus exclusive.
So if I was to design a system around the capability of doing this sort of thing,
I would make an exclusive cache, because now we can actually make that case for the first time.
We've never been able to argue about the utility of an inclusive versus exclusive cache in the past.
Once you have a model, you can argue about it.
And I think once people start to incorporate
this sort of thinking in their early stage design
for the next generation caches,
now I think all of a sudden we have a reason
why we should be thinking harder about exclusive
than we have in the past.
Okay, so one quick thing I wanted to mention here
is multi-tier sizing.
This is very close to where you were going,
which is in the case of exclusive systems,
it's actually that trivial,
where you just run a very simple linear program on top of this,
which determines, given my SLA and given the costs of each tier,
I can draw this curve and actually decide how much DRAM I'm going to occupy,
how much tier 1, 3D XPoint, persistent memory I'm going to occupy,
and local flash, remote flash, network flash, network misses, cloud storage, whatever it is. And, you know, the number of tiers no longer matters,
because at least for the exclusive cases you can just simply read it off. Now, the dollar signs are
not in front of it, but you can weight each of those regions by dollar signs and then just run
a linear program that gives you your SLA at the lowest cost possible.
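As a toy version of that optimization, assuming exclusive tiers so the one modeled curve applies end to end: the cost and latency numbers below are made up, and a brute-force search over a DRAM/flash split stands in for the linear program, just to show the shape of the problem.

    def cheapest_two_tier_split(curve, gb_per_step, sla_ms,
                                dram_cost_gb=5.0, flash_cost_gb=0.5,
                                dram_ms=0.001, flash_ms=0.1, disk_ms=5.0):
        # Exclusive tiers: the first d steps of cache behave like DRAM, the next f
        # steps like flash, and everything else misses to disk, so the single
        # modeled curve tells us the hit fraction captured by each tier.
        best = None
        steps = len(curve)
        for d in range(steps):
            for f in range(steps - d):
                hit_dram = 1.0 - curve[d]             # fraction served by DRAM
                hit_flash = curve[d] - curve[d + f]   # fraction served by flash
                miss = curve[d + f]                   # fraction going to disk
                latency = hit_dram * dram_ms + hit_flash * flash_ms + miss * disk_ms
                if latency <= sla_ms:
                    cost = (d * dram_cost_gb + f * flash_cost_gb) * gb_per_step
                    if best is None or cost < best[0]:
                        best = (cost, d * gb_per_step, f * gb_per_step)
        return best   # (dollars, GB of DRAM, GB of flash) meeting the SLA, or None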
Okay, so this self-optimizing data path, right?
So I said self-optimizing caches as the title of the talk,
but really it's about more than just the cache, right?
It's about thinking about the application all the way down to the persistent storage,
modeling it, and then chopping it up
into the right set of tiers.
And usually it's just going to be one tier in the middle,
but the point is that the system should be more than just one cache at a time.
We now have the opportunity to go and think about an end-to-end data path
and provide end-to-end data path guarantees.
So the first thing we can do with it is monitoring, right?
Just be able to have a dashboard that says,
hey, what happens if I go higher or lower or change my cache parameters, right?
And then we should be able to predict what the changes of those cache parameters should be
such that we can even make a dynamic and self-adapting.
And we saw some examples of that
with significant performance wins
by hugging the lower profile.
Well, we should then be able to insert into this model
after picking the lower profile,
so you'll see that my curve in this case is nice and smooth.
And the reason it's nice and smooth
is that it's gone through some of those lower-profile and self-adaptation steps already.
And once our curves are nice and smooth, we should be able to set our latency targets.
And by the way, throughput and latency use the same sort of mechanism here
by simply operating at the right point on these curves.
We can also do tiering, which is the last thing I talked about as a full slide,
on how that works in these systems.
We can also talk about latency reduction,
which is this cliff remediation algorithm
which uses shadow partitions to steer misses and hits
to these shadow partitions,
and is able to actually steer the actual curve
for the first time,
which is really quite an acceleration squared sort of phenomenon.
There's another thing I didn't talk about, which is multi-tenant isolation. If you have multiple
tenants hitting your storage array and you have a limited amount of cache, how do you make sure that
they get what they deserve? If you have a last level cache inside a processor and you have some
ability to control tenants, well, how much should each tenant get such that your utility function,
which might be overall throughput, or it might be isolation for a tenant, can be satisfied?
As soon as you have multiple tenants, right, if you don't have models, it's just a crapshoot after
that. You just don't know. And that's another thing, very, very critical, important. I just
didn't have time to talk about today. So the results of all of this work is now we can safely
and efficiently quantify the impact of changes. And we're
often seeing a 50 to 150 percent efficiency improvement in terms of the cost of running the infrastructure,
the cache infrastructure. You can do some of these tradeoffs, maybe pay a little bit higher
cost, but instead of aiming for just lower cost, you can set the system to adapt its parameters
to always hit a certain latency SLA.
And then, of course, higher consolidation ratios
and looking forward and being able to accurately predict
and do capacity planning.
So that's the quick summary, which is,
if you can model, you can do all these things plus more,
and there's some really serious performance benefits to be had.
So I have maybe two minutes, I think, three minutes for any remaining questions,
and I'll be around the rest of the conference as well.
Yes?
I'm a little bit perplexed why we're talking about milliseconds as opposed to microseconds
when we have network storage now operating at 10 microseconds.
Just out of interest,
what's the sort of scenario of the workloads? Is it heavyweight stacks or languages like Java?
So there's no reason why...
It's just all of the latency numbers
are using arithmetic associated with some assumptions of the back-end model, right?
So I happen to pick models where the working set, if it doesn't fit in DRAM, is going to disk, spinning disk. And that's why you see those high latency numbers, you know,
and the left end of the curve.
But then you'll see you're oftentimes below one millisecond on the right end.
Yeah, but again, the modeling is not in latency, right?
The modeling is done in terms of misses and hits.
And you just impose a model on, you impose a
back-end
latency model on top of it to get
the results. Yeah.
So, I wouldn't worry too much about
my...
In this case, it was disk all the way.
And so, if much of the workload
was going into disk, you'd get those penalties.
But there's a lot of disk in the world. That's not going away.
So, there's nothing about this modeling that,
like I said, it applies to last-level caches
inside a processor.
And we, I mean, Cache Physics is my company.
You know, we're looking to do deals
all the way up and down the stack.
Okay, so it looks like we're going to get kicked out.
But I'm going to see if I can push my luck
and take one question back there.
Yes?
So once you predict the model...
Yes.
...and you provision all the resources,
how accurate is it?
Does it vary with different workloads
and with different periods in time?
So we've seen accuracy...
We've seen error almost all the time less than 1%.
In the model itself, yeah.
And then again, the controller on the model,
on top of the model,
should not hug it down to the megabyte, right?
Try to operate at tens or hundreds of megabytes
in terms of accuracy so you have some leeway.
Yeah? Okay.
So thank you very much.
I'll be here for the rest of the conference.
I appreciate your time
and would love to talk about how we may be able to partner
and bring this sort of technology into your storage system.
Thank you so much.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.