Storage Developer Conference - #71: Self-Optimizing Caches
Episode Date: June 13, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNEA Technical Council from the presentations at our annual
Storage Developer Conference.
The link to the slides is available in the show notes
at snia.org/podcasts.
You are listening to SDC Podcast
Episode 71.
Thanks everyone for coming early this morning
to listen to me talk about
caches and data path optimizations.
It's always nice to get a bunch of cache lovers together and geek out a little bit, and we're definitely going to try to do that.
So my name is Irfan Ahmed.
I work at Cache Physics. It's a new startup trying to commercialize some technology in the area
of real-time predictive modeling for data access paths. And that's what we're going
to be talking about today. The result of this type of modeling is the ability to increase
performance very systematically or to lower cost of data path software or storage systems.
And a big part of that is caches. So let's look at this at a high level.
So in this very generic looking plot that I have, I'm plotting latency as increasing
on the y-axis and cost sensitivity dropping on the x-axis as you go to the right.
In other words, close to the center where the darkest colors are, you're very latency sensitive,
and you're not really cost sensitive. So that's where I put high frequency trading. If you can
squeeze a few extra microseconds, people tend not to care about the cost of the infrastructure. And then as you get further and further and further out along the diagonal,
the workloads that sit there are less sensitive about latency
and much more sensitive about cost.
So now we can debate about what workloads run where,
and of course I'm being very generic here.
It's possible that for some people the e-commerce workload is as important in terms of the latency trade-offs as HFT. So it's, I think,
less important exactly which application in my choice of drawing it out sits where in this orbit.
It's much more important to think about the hierarchy of storage
that is employed to achieve those latencies and those cost trade-offs.
So anywhere from DRAM to persistent memory to flash to disk drives,
and actually many more layers when you start to think about remote disk
and then cloud disk and so on, so I didn't bother putting all those up here.
So the result of this is that once you have a sense
of what your applications need in terms of latency,
you have a lot of trade-offs to do.
And so in a lot of these trade-offs, right,
there's two basic hardware infrastructures
or software infrastructures available.
One is persistent storage.
So you can try to make your petabytes of persistent storage
really, really, really, really fast.
Some people will do that.
Most people will choose to implement sophisticated caches
so that they don't have to take their petabytes of data
and put it into DRAM,
but rather they will take the portions of that data
that they think that they need at any particular time
and put it in the caches.
So there's a few claims that I'm going to be making, and then the rest of the talk is trying to justify those claims, right?
So first claim is caches are critical to every application.
So whether you're dealing with a database or a key value store or a container or a VM or an app that represents application in this way of thinking,
there are resources available to you, both in terms of persistent as well as ephemeral storage.
So that might be server-resident CPUs with L1, L2, last-level L3, and other caches.
It might be DRAM that the operating system is managing
or the database system is managing, right?
It could be 3D XPoint, flash, et cetera.
And then there's a network, and then across the network,
there's resources in persistent storage as well.
So the use of caching to accelerate data access
is pervasive in modern storage systems.
The challenge often is that monitoring systems
typically provide only a very little bit of information about the cache.
So, for example, in this case, I've drawn a very simple dashboard.
Actually, this is the only dashboard you can have today for caches,
which is, well, for that cache, the hit ratio is 65%,
and it's currently running at 128 gigabytes, right?
So that's not a lot of information
that you can use to operate your environment.
So here's my claims.
So even though caches are critical to every application,
cache management is nonexistent.
Now, it's deliberately a little bit strong,
but I'm going to try to prove to you that it is, in fact, largely correct.
So let's start with some questions you might have about your cache. And cache management means you
should be able to answer some of these questions. I think some of them are fairly straightforward.
Is this performance good?
So that's 65%. Is that good?
Is that the best that could be done for this workload?
Can we answer that question today?
Because if we could, then maybe my claim is hollow.
But I think today it's very hard to answer the question
of is this performance that's coming from that cache good?
Could it be improved?
That's a corollary of the previous one.
What if I gave you 129 gigabytes of cache?
What will you do in terms of hit ratio?
Will it be better?
Will it be the same?
Could it get worse?
Could I give you more cache and things get worse?
Can we say any of those things with certainty?
Not really today.
So how much cache for application A versus B versus C, etc.?
Right?
If I have some limited budget, how do I allocate that budget?
What happens if I add, remove DRAM?
What happens if I replace some of my DRAM in my cache with Flash?
What will happen?
Very difficult to answer that question.
How to achieve 99th percentile latency of 10 microseconds or
whatever number of microseconds it is that you desire. Okay, here's a twister. What if I keep
everything else the same, but I just add one more workload to this cache's input workload? Will that
change anything? Make it worse? Make it better? How do you know? And then is there cache pollution
going on? How do I figure out if there's cache pollution?
What can I do about it?
What if I change cache parameters?
Most of the cache systems,
actually almost every single cache system that I've ever worked with in my career,
has had parameters.
Now, how do you figure out what parameters to set
for that particular workload or mix of workloads?
So if you look at these questions,
and, at least in my opinion, our not being able to answer any one of them satisfactorily,
that leads to my conclusion that, look, caches are very important.
Everybody cares about them.
But we simply do not know how to manage them as an industry, as a field, as a discipline today.
Okay, so you guys, I'm going to try to see if we can get better than that.
But you've got to hold me honest.
If I'm way off course, just say so.
Okay, so part of the reason for that, I think,
is that what we have today are very static caches.
And there's an opportunity, I think, to start to think about caches
that are much more dynamic and self-optimizing, self-learning.
So just conceptually, what I did was I took three parameters,
just parameter one, two, three.
They're not labeled right now.
We'll see later on in the talk
what some of these labels could be as parameters.
So an example of that could be cache block size.
Another example could be read ahead,
whatever the parameters are of that cache.
So there's a point cloud here,
and every point in that point cloud
represents just one application.
But it represents the optimal parameter setting
for those three parameters for that application.
Okay?
So you'll see hundreds of applications
represented one dot per application.
And that dot just...
That application has different performance profiles
throughout that space, but I've only
picked one spot, which is the best one
for those three parameters. Make sense?
Okay.
Now, so you can already see
that in this parameter space, every one of those
hundreds of applications has a different optimal
operating point.
So in this case,
it could be cache size, it could be parameter settings.
So, how do we ship systems today?
We do some benchmarking.
We pick some spot that maybe works for one workload
or some mix of workloads.
Actually, it's usually not even one workload
because you're trying to trade off, okay,
these parameters are good for this application workload,
for this benchmark, these are for another one.
So let me just kind of set it halfway somewhere in the middle.
The result of that is that the shipping systems
are not optimal for any single workload, most of the time,
and often are very far away from optimal points
given the actual production systems that are running in the field.
So this is this concept of, okay,
this type of way of doing caching, what I call static caching, leads to thrashing, leads to scan pollution, results in gross unfairness or interference, because the cache system is not actually paying attention to the type of workload that it's receiving, and therefore not adapting itself to it.
It's just operating at whatever point that the developer or the performance engineer
set as the default.
And of course, leading to unpredictability,
over-provisioning, lack of control.
These are very serious problems, right?
I've talked to people who are running production caches
that are hundreds of terabytes big
and they think they're 5x over-provisioned.
And I ask them, well, why don't you go to like 4x over-provisioned?
Because they're like, well, I don't have the confidence.
I don't know the answer to the questions
that we asked previously. Okay, so if we want to get better than this, we want to
go from static caches to maybe self-optimizing caches to caches that can adapt themselves to
those workloads and can self-manage or at least provide good management hooks for the operators.
What we need to do is we need to shift a little bit our thinking, and I claim that to do that,
we have to try to do some modeling of the cache with the workload and to try to do so
in real time, not every night or something like that, try to do it in real time.
So on the left-hand side, you'll see the same little dashboard which caches provide today,
basically one number, which is the hit ratio. And what I did on the right-hand side is I actually plotted that as that blue dot. So what the blue dot is showing
is the 35% miss ratio, in other words, 100 minus 65, 35%. But in this particular case, I've translated
that into latency, because if you know what your miss or hit ratio is, you can easily just come up with the latency number as long as you know the back-end's miss penalties.
So in this talk, for the remainder of the talk,
we'll use latency on the y-axis
and miss ratio on the y-axis interchangeably,
because for both of them, lower is better, right?
And there's a simple arithmetic conversion between them.
In this case, the cache size here on the x-axis, right,
again, the 128 gigabytes
are resulting in that blue dot, okay? So the entire exercise associated with what we're going
to try to do in modeling is we're going to say, okay, that dot is not good enough because that
dot has all the problems and all the questions that are unanswered from a couple of slides ago,
which is, is this good? Can I get better?
What if I do this? What if I do that?
So how do we do that?
Well, we're going to have to do some modeling, right?
We're going to have to put a little tiny little brain in there somewhere.
And we're going to learn the performance model
of the application with that particular type of cache.
And what we're really trying to do is,
well, there's a fancy way of saying it.
Sometimes academics are in my audience, so I have to put a function up there to satisfy them. Otherwise,
I don't get good ratings on my talk. But really, what you're trying to do is you're trying to build
a model. Performance is just f, a function of the cache size and the other parameters.
So what does that mean? Well, all that means, folks, is we need to plot that curve.
That's it. If we can plot this curve for a particular given workload as you vary a parameter,
how does performance vary? If we can draw this curve for the workloads that we're interested in
in production in real time, we can answer all of the questions that we started off with.
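To make "plot that curve" concrete, here is a minimal sketch, in Python, of the classic Mattson-style stack simulation for an LRU cache: one pass over a trace of block IDs yields the miss ratio at every cache size at once. The trace format and sizes here are illustrative, and a production implementation would use sampling tricks to keep the cost tiny, as we'll get to.

    def lru_miss_ratio_curve(trace, max_blocks):
        # Mattson-style stack simulation: one pass over the trace gives the
        # miss ratio for every LRU cache size up to max_blocks.
        stack = []                            # LRU stack, MRU block at index 0
        reuse_hist = [0] * max_blocks         # reuse_hist[d] = accesses at stack distance d
        total = 0
        for block in trace:                   # assumes a non-empty trace of block IDs
            total += 1
            if block in stack:
                d = stack.index(block)        # stack (reuse) distance of this access
                if d < max_blocks:
                    reuse_hist[d] += 1        # would hit in any cache bigger than d blocks
                stack.remove(block)
            # first-time accesses, and reuses deeper than max_blocks, miss at every modeled size
            stack.insert(0, block)            # promote to MRU
        curve, hits = [], 0
        for size in range(max_blocks + 1):
            curve.append(1.0 - hits / total)  # miss ratio at this cache size
            if size < max_blocks:
                hits += reuse_hist[size]
        return curve                          # curve[size] is f(cache size) for this workload

Feed a day of production block I/O through something like that and you get the whole curve, not just the single operating point on the dashboard.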
Okay, so let's look at this curve a little bit.
So again, here cache size is the x-axis,
but we know there are more parameters than cache size.
But then it gets hard to visualize in multiple dimensions,
so I've tried to just, as much as possible,
trying to just keep one variable just to illustrate it,
but in production in real time,
we would be modeling more than one parameter.
So now you see a lot...
First of all, you see that sometimes you add cache
and nothing happens.
Okay, so for example, if you're at 10 gigabytes
and you add cache, nothing happens for a while.
So let's look at that a little bit more closely.
So we're at 10 gigabytes.
We decide we want to improve things.
And let's say we double the cache size.
So now let's say we went from $10,000 a day
to $20,000 a day.
Performance did not improve.
Now, how do we know performance didn't improve?
Because remember, it's flat there.
So latency did not drop.
So then we decided to go 4x.
Then we decided to go 8x.
Now we went from $10,000 a day to $80,000 a day.
I'm just making up those numbers.
But whatever that is, 100 to 800,
whatever the cost of the system is
in terms of its cache footprint,
we've gone up 8x with no performance improvements.
So maybe if you don't have this curve,
maybe you'll give up then.
Maybe you'll say,
look, this workload cannot do better.
Performance cannot get better.
I've gone up 8x.
What do you expect me to do?
I'm not going to go up 100x.
Little do you know that for every
single megabyte of additional cache
that you're going to provide to this workload after that,
you're going to start to see improvements, and that's
where the curve... So the nice curves, you know,
improve with cache size
or other parameter changes, right?
Okay.
So this is nice because we actually
had a curve, and we could just argue about it. But
remember, those curves don't exist in systems today, and so you are just feeling it out and
trying different parameters, which is actually dangerous in production because you can actually
create negative effects as well. Here's another way of looking at the same curve. And by the way, this is
a production virtual machine virtual disk workload, and it's not a diagram that I just created randomly.
So here, it's the same plot,
but we're going to try to,
instead of thinking about it from a performance point of view,
we're going to start to think about it
from an efficiency point of view.
So you can take the same curve,
and you can annotate it a little bit differently,
and you can say, well,
it looks like most operating points here
are not efficient.
Why are they not efficient?
Well, because you could have gotten away with a smaller cache for the same performance.
So the particular blue dot, the reason why I picked that spot for this presentation,
is because it's a super inefficient operating point.
Why?
Because you could have had one-eighth the cache size
for the same approximate performance.
So you do not want to operate there.
So what you can do is,
you can label the space,
and just operate only on the efficient operating points.
And so you're looking for some sort of a knee,
and depending upon what your threshold is
for defining efficient,
there will be more or less points here.
But in perhaps a strict definition,
you can see that you want to operate
at the bottom of a cliff,
because you've gotten a huge improvement,
and you're not going to get
too much more improvement later on.
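As a rough sketch of what "label the space and keep only the efficient points" could look like in code (the threshold below is a made-up number, not anything from a real system): keep the cache sizes that sit at the bottom of a cliff, where the previous increment bought a lot and the next increment buys very little.

    def efficient_points(curve, gain_per_gb=0.005, gb_per_step=1):
        # curve[i] = modeled miss ratio at cache size i * gb_per_step gigabytes
        picks = []
        for i in range(1, len(curve) - 1):
            recent_gain = curve[i - 1] - curve[i]    # what the last step bought us
            next_gain = curve[i] - curve[i + 1]      # what one more step would buy
            if recent_gain / gb_per_step >= gain_per_gb and next_gain / gb_per_step < gain_per_gb:
                picks.append(i * gb_per_step)        # a "bottom of the cliff" size
        return picks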
So can anybody just hazard a guess
as to why there are these flat spaces
in these workloads?
Why is it that we see these flat regions?
Working set sizes, exactly.
And why are there more than one?
Right?
Sorry?
Well, actually, this is not exactly that;
this is the pattern of the workload's reuse distance.
So we have not...
Caching comes in as the algorithm that can suck locality out of it.
But there's a hierarchy of working sets is another way of thinking about it.
So maybe there's this fast working set that's whatever data is being used for that hour or two hours or whatever. And then over the course of a diurnal cycle, there's some reuse of blocks, right?
And then over the weekly cycle, there's some reuse of blocks.
And depending upon how big your cache is, it will capture a different portion of it.
So that's why there's these flat regions, right?
Exactly right.
It's successive working set sizes.
Okay.
Okay.
So now we're starting to build an intuition
for what this modeling is all about.
Now let's see how we can go to that concept
of self-optimizing, self-learning, self-adapting.
Okay, so I just went from one curve
into one, two, three, four, five curves.
In this particular case,
this is a single workload.
It's the same actual workload.
Remember I said that I'm going to try
to keep it to the same axis
on the plot rather than going five-dimensional
on us.
But what's happened here is that
it's the same exact workload
in all of those five plots.
So why are the plots looking different?
Why aren't they all the same?
Well, the only thing to change,
not the cache algorithm,
not the workload.
What changes is the cache's implementation parameters.
In this case, the cache block size.
Okay, so now imagine this.
You're running a production.
You've got some workload
that really likes two kilobyte blocks.
It tries to get data like a DB2 or something like that.
It tries to get data in 2K chunks.
And you're operating the cache at 64 kilobyte block size.
So there's going to be some inefficiency there.
So what we're doing here is we're just saying,
okay, vary the cache block size,
and those are what each of those curves are.
And we're going to see what the resulting performance is.
The remarkable thing here,
which actually completely took me by surprise
when I first discovered this,
is that depending upon how much cache you're giving,
different block sizes are more efficient or less efficient.
Okay?
Remarkable.
I mean, it's shocking.
So that's what those arrows are.
So there's points where the difference
between two different cache block sizes flips over
depending upon how much resources you're providing.
So if you have this sort of a thing,
and this is what real world looks like,
we have a pretty hard problem,
but if you know all of these curves,
the problem gets pretty easy, and how?
So what you do
is you just say, okay, I'm going to trace the lowest profile, and that's going to be my operating
surface. So if for one reason or another I need to allocate more resources for this workload,
beyond this horizon here I'm going to switch from that block size to this one.
And if I need to allocate more beyond this horizon, I'm going to change the block size again,
and so on and so on. So if you know the models for the different settings of your
cache implementation, you can just pick the right one online. So that's what we're trying
to say here. If you can predict the performance under different policies,
you can just pick the right one.
That's it.
Just pick.
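A minimal sketch of "trace the lowest profile," assuming you already have one modeled curve per candidate block size (the dictionary shape below is hypothetical): at any target cache size, just take whichever setting the models say is lowest.

    def best_setting(curves_by_block_size, cache_size):
        # curves_by_block_size: {block_size_kb: curve}, where curve[cache_size] is the
        # modeled miss ratio for this workload at that size with that block size.
        # Picking the argmin at the current size walks along the lower envelope,
        # switching block size exactly at the horizons where the curves cross.
        return min(curves_by_block_size,
                   key=lambda bs: curves_by_block_size[bs][cache_size])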
Now, of course, your cache implementation
has to be able to switch block sizes on the fly.
Some of them may, some of them may not.
And if you, for example, are owner of a cache subsystem
that cannot today change cache block sizes on the fly,
you've never been motivated to do it
because you've never had a model that tells you exactly
when to switch those on the fly. Other parameters fall in this
sort of situation as well where, until you know it, you may not be
motivated to actually implement these knobs, these online, real-time
change knobs inside your cache algorithm. So far, so good.
I'll take questions as I go.
Yeah?
How much overhead do you think that's going to add
if you start switching that parameter
when you have a cache that's almost working at its limit?
I guess size works for this, right?
So, overheads is the question.
Overheads.
Yes.
So, there's multiple different overheads
that we need to think about.
One of them is how much memory do we need to do this performance model,
and will that eat away memory that we could be using to do caching itself?
So that's one question.
Another question is to do these models, oftentimes when you think modeling, you think, oh my god, I'm going to burn a lot of CPU.
I'm going to add latency just to do the model, and so therefore is it worth doing, right?
And I think, and let me try to address these and see if there's other types of overheads you can think of.
So in terms of memory overhead, each one of these curves costs about half a megabyte to a megabyte in memory.
So let's say you wanted to do parallel simulation of a dozen different combinations of parameters.
In that case, it would be 6 megabytes to 12 megabytes depending upon the situation.
Does that make sense? And then in terms of CPU overhead, each one of these curves that you're trying to do
will cost you about 40 nanoseconds per IO on average, amortized.
So you definitely don't want to do thousands of these,
but you can easily do a dozen of these
and still fit it under your budget.
And the nice thing about the CPU impact,
which is, like I said, about 40 nanoseconds per curve, per IO,
is that it's not in the critical path.
You just need to preserve the identity of the block that was going through the data path.
You just put it in a circular buffer and drain it off.
So it doesn't have to add to the latency of the real path.
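Here's a rough sketch of the structure I'm describing, with illustrative names and sizes: the I/O path only records the block identity into a bounded ring buffer, and a background thread drains it into the model, so the per-IO modeling work never sits on the latency path.

    import collections, threading, time

    class OffPathModeler:
        def __init__(self, update_model, capacity=1 << 16):
            self.buf = collections.deque(maxlen=capacity)  # bounded ring buffer of block IDs
            self.lock = threading.Lock()
            self.update_model = update_model               # e.g. the reuse-distance histogram update
            threading.Thread(target=self._drain, daemon=True).start()

        def record(self, block_id):
            # Called on the I/O path: O(1), no model work here.
            with self.lock:
                self.buf.append(block_id)                  # oldest samples drop if we fall behind

        def _drain(self):
            # Background thread: pull batches out and update the model off-path.
            while True:
                with self.lock:
                    batch = list(self.buf)
                    self.buf.clear()
                for block_id in batch:
                    self.update_model(block_id)
                time.sleep(0.01)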
Does that address your question?
Okay. Yes. So does this assume an LRU replacement policy?
So the remarkable thing is that when we first started out this research,
we only figured out how to do this efficiently for LRU.
In the last few months, we figured out how to do it for arbitrary cache policies.
So we can now do it actually remarkably for OPT.
So I'm going to show you some results for OPT. And OPT is the unachievable best-case cache algorithm.
So if you had an oracle for your cache replacement policy
that could spend as much energy as it could
under the constraints of on-demand replacement,
it provides the frontier for what the best caching policy in the world
can do, right?
Nobody can actually achieve it in practice.
It's too expensive.
And we can actually simulate that also in real time using the same overhead.
So now we've gotten to a stage where you throw your implementation.
It doesn't have to be standard LRU.
It could be LRU-K.
It could be LRU with read-ahead.
Or it could be ARC.
It could be LIRS.
It could be 2Q.
It could be whatever complexity it is.
And now we can just wrap around it and actually do the modeling.
And so that's a huge, huge breakthrough, and this is the first time, at SDC, that we're talking about it.
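One published way to do this for arbitrary policies, and roughly what I'm describing, is miniature simulation: spatially sample the reference stream by hashing block addresses, and run your real policy code on a proportionally scaled-down cache. A simplified sketch, where the lookup and insert calls are a hypothetical stand-in for whatever ARC, LIRS, or 2Q implementation you already have:

    import hashlib

    def sampled(block_id, rate=0.01):
        # Spatial hash sampling: deterministically keep a fixed ~1% subset of all blocks.
        h = int(hashlib.md5(str(block_id).encode()).hexdigest(), 16)
        return (h % 10000) < rate * 10000

    def model_policy(trace, make_policy, full_cache_blocks, rate=0.01):
        # Run the real policy implementation on a sampled stream and a
        # proportionally scaled-down cache; its miss ratio approximates what
        # the full-size cache would see under that policy.
        mini = make_policy(capacity=max(1, int(full_cache_blocks * rate)))
        hits = refs = 0
        for block in trace:
            if not sampled(block, rate):
                continue
            refs += 1
            if mini.lookup(block):            # hypothetical policy interface
                hits += 1
            else:
                mini.insert(block)
        return 1.0 - hits / refs if refs else None   # estimated miss ratio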
Okay, so let's make some progress.
But I'm happy to take questions as we go.
There's a lot of complex stuff here.
So this is, now we get to see more of these, right?
So we looked at a couple of different ones,
and we've gotten some familiarity
with what these models look like,
and they're not that complicated
when you simplify them down into a curve, right?
And so hopefully they're less scary
than in the abstract
you read for this conference.
So here's a bunch of different workloads.
Some of these are from the Cloud Physics trace set,
again, virtual disks from production,
customer workloads.
Some of these are OLTP, database,
web workloads, et cetera.
There's also a couple of, actually three in this case,
Microsoft Research traces here as well,
and that's just for completeness to say, you know,
over 120 workloads that we've looked at so far in our publications
that we've been allowed to release in our academic papers and white papers,
and so that's all available in public because we have permission from those customers to use it.
So the accuracy is there.
So here, okay, so let's look at some of these curves. Let's look at T05 first. In this case,
what we see is the blue is that OPT that I was mentioning, this blue curve, right? No real cache
algorithm can ever get there. Now, there's two other cache algorithms here, ARC and LIRS. And so in this particular T05 case,
they kind of compete with each other. As you can see, again, lower is better, and as you
go across the x-axis here, we're increasing the cache size. In this case, for this workload,
only small caches were needed to be able to get to less than about a 5% miss rate. So only
from around 20 gigs.
But you can see that they're competing with each other.
So these two algorithms in that workload
are not really that far from each other.
Again, how do you tell that in production today?
You don't. You have to have these models.
But if we look at T19,
they kind of fight with each other
for up until about 200 gigabytes.
But if you're willing to go more than 200 gigabytes,
there's a massive performance difference
between those two algorithms.
And in this case, remember, lower is better, so LIRS wins.
So actually, one of the things we found is that,
as a side effect of our work,
is that LIRS actually is better than most people thought it was
in real production environments.
And it's very competitive with ARC, just as a side note.
And so here's another one where LIRS hugs OPT very, very closely.
It gets really lucky in that workload,
and all the way out to about 600 gigabytes of working set,
LIRS is hugging OPT pretty closely,
whereas ARC has this degenerate pattern
where it is not able to benefit from that workload
all the way until about 600 gigabytes.
So in this case, you're seeing OPT, ARC, LIRS,
online modeling, real-time modeling of them
for the cache size case.
So hopefully now we get to get more practice
with what these models actually look like in practice.
Okay, so now let's try to use these for cool things.
So remember I mentioned in the cache policy tuning example
how you can just hug the lower profile,
the lower envelope?
Well, this is what we did, again,
for a couple of these workloads.
In our more detailed paper,
we've done it for all of our workloads.
And so what you see here is you see a top and lower profile.
So the minimum performance, in other words, the best performance that we saw using static settings.
So we ran a whole bunch, like dozens of different parameter settings for LIRS,
but static settings,
and we picked the lowest profile and the highest profile.
And then we said, hey, go and automatically find those
while you're running in real time, okay?
So the red curve here is, again,
trying to find the best, trying to find online
without knowing these curves ahead of time.
So there's no faking here.
It's all real.
The algorithm does not know the curves.
It has to figure out the curves,
and it has to pick the best,
and it has to change constantly online
as the workload keeps shifting
because these are very, very long workloads,
and over the course of time,
they change what the best parameters are.
So the best case would have been the left-hand side,
which is hug the lower profile, right?
And that's the best we could do for LIRS for that workload,
and our algorithm was able to achieve it.
The worst we did across hundreds of our traces
is the one on the right, okay,
because there was some things about the workload,
some workload shifts that our real-time adaptation algorithm
was not able to take into account.
So you'll see here that we're able to hug the lowest profile for a while,
and then we just kind of get away from it, okay,
and just we couldn't figure out how to make the real-time algorithm
better than that, right, with all the constraints.
So, you know, there are often fake signals inside workloads.
Shifts happen.
But even then, we are well below the middle
for most of that, in terms of the
best-static-to-worst-static distance.
Does that make sense?
So this is the opportunity, right?
Now, again, if we were shipping this in a production system,
we would have picked some line
that was either the higher profile
or the lower envelope
or some other single curve in between.
But what the system is able to do
once it can do this modeling and do it real time
is to just pick the best possible as you go.
Yes?
So when you are training those models online,
do you see any I/O stalls?
Any stalls in what's coming from a host
to the cache and out of the cache?
So the most sensitive parameter
for LIRS is something called
the stack distance.
So there's a metadata stack
deep inside the bowels of LIRS,
and that's the most sensitive parameter for LIRS,
and that's the one we were working on.
So as you change that, what happens is
that if you have a long stack,
you're just going to prune that stack.
And just by pruning that stack,
you're going to get better performance.
That's an example.
But workload shifts again
and actually needs a longer stack.
So you add a bunch of entries
to make the stack longer
and performance improves, right?
So as you can see,
it's not like a cache flush
where you just dump all the...
It's not like that, right?
It's just inside the algorithm,
something needs to change
because this workload is now operating in a way where
the cache is getting confused sitting at that parameter setting, right?
Does that make sense?
Okay.
So this is the first kind of real-world use case
we've seen for adapting an algorithm, in this case LIRS.
And then there's another well-known algorithm called 2Q.
We have results for this in our more detailed analysis for ARC,
which is popular as well as for LRU.
But here, for 2Q, the same sort of thing, right?
There is a top profile, which is, you know,
could be a setting that somebody ships with, but it's static.
And then the lower profile, which is also static.
So the best and worst for these workloads for that algorithm.
And then the real-time algorithm kind of tries to find its way
without any of the additional information,
without knowledge of these curves ahead of time.
Real-time is trying to find it.
And as you can see, it's always in the middle,
and most of the time it's able to achieve very close to optimality,
even with no prior knowledge.
Okay, so now we've seen a couple examples
of adaptation of policies or parameters of cache algorithms.
Now let's look at a totally new class of acceleration.
Now, you know, a lot of times in our industry,
we say performance acceleration,
and usually we're talking about either a new piece of hardware
or, you know, introducing a cache.
And we say, okay, well, if you introduce a cache, you're doing acceleration.
But what I'm going to talk to you about is not just about introducing a cache.
We're saying there's already a cache, and the cache is handicapped.
It's sitting where it's doing the best that it can do.
So it's already doing acceleration, right?
And so I couldn't come up with a name to identify this acceleration of the acceleration.
So, like, what's that, acceleration squared, right?
So I just said, okay, new class of acceleration.
So, but this is really, really interesting and exciting.
So, again, we see here a curve,
and you can see those same pattern of flat regions in this curve, right?
That's those working sets, and that's those, you know,
marginal return is really horrible in those areas,
right? And so remember, this is a particular cache algorithm, and it's trying to cache as best as it
can. But when you look at that, at those flat regions, aren't you motivated to say,
hey, what if we could just somehow skip these? Because these flat regions are horrible; they don't
work well, and they're very inefficient operating points. Well, could we just
somehow come up with a way to just cut across all this nastiness and just go straight to the next
cliff, and then straight to the next cliff? So this has been really, really interesting,
and really what this is saying is, can I tweak the curve?
Like, can I change the rules somehow?
What can I do?
Is there something simple that I can do
that post my acceleration,
I can get another boost of acceleration?
Okay.
So it turns out that we can steer the curve.
It turns out that you can interpolate that convex hull,
and you can take your cache algorithm
and tweak it a little bit,
and all of a sudden, you end up operating on the convex hull, which in this case is basically the
diagonal.
But however, to be able to do it, there's one prerequisite, which is you have to know
the model.
So you have to do online, real-time modeling to find that red curve.
But if you know the red curve, you can always find the dotted line.
Because there's an algorithm for that.
It's published in HPCA 2015.
So let's look at an example.
So we have our blue dot again.
And this blue dot is sitting somewhere in that flat region.
Turns out that if you tweak your cache implementation to just have shadow partitions, and shadow partitions means that they're not exposed to the user, but they are partitions. They're
inside the algorithm, right? So we're just going to call them alpha and beta. So let's
say you had a one gig cache, just for example. In this case, let's say a 50 gig cache, right?
So part of that cache would be partition alpha. Part of it would be beta. Nobody has to know
about it. We're just going to hide it underneath the surface. And we're going to tweak the sizes of the alpha and beta
and steer hits and misses to them.
And that's how we're going to be able to achieve
sitting at that convex hull.
So the way that works is
that the partition of the 50-gig cache,
in this case, that is called alpha,
this hidden partition,
we're going to operate that partition
at an efficient operating point.
And the rest of the cache, which is the beta partition,
we're going to operate that
at the next efficient operating point.
And we're going to pick efficient operating points such that we skip these middle ones,
and that's where the convex optimization comes in, the convex hull identification.
So we're going to study this curve, find the cliff that is along the convex hull, and we're going to take the portion
of our cache, which we decided to call beta, the hidden partition called beta, and we're
going to steer hits and misses to it so that it operates here. So it turns out that if
a portion of your workload is operating here and a portion of it is operating here, on
average you're operating in the middle. And then that's true for every single cache size.
Now, for every single cache size,
the bottom of the cliff,
the talus of the cliff that you're going to pick
is going to be different.
But as soon as you're able to pick it,
you just operate there.
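For the curious, the sizing arithmetic behind the shadow partitions is roughly this; it's a sketch of the HPCA 2015 style construction, where sizes are in the same units the curve is indexed by, and the actual steering is done by hashing block addresses into the two partitions.

    def shadow_partition_sizes(curve, hull_sizes, total_size):
        # curve[s] = modeled miss ratio at cache size s;
        # hull_sizes = the efficient sizes on the convex hull of that curve,
        # assumed to include a size at or below total_size.
        lo = max(s for s in hull_sizes if s <= total_size)
        hi = min(s for s in hull_sizes if s >= total_size)
        if lo == hi:
            return {"alpha": (total_size, 1.0)}, curve[total_size]   # already on the hull
        # Send fraction p of the (hashed) reference stream to hidden partition alpha,
        # sized p*lo so it behaves like an efficient cache of size lo; the rest goes
        # to beta, sized (1-p)*hi. The two sizes add up to total_size.
        p = (hi - total_size) / (hi - lo)
        partitions = {"alpha": (p * lo, p), "beta": ((1 - p) * hi, 1 - p)}
        expected_miss = p * curve[lo] + (1 - p) * curve[hi]          # a point on the hull
        return partitions, expected_miss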
Okay, so the more details,
you can either talk to me about it
or there's some papers you can read,
including the HPCA 2015 one.
So the result of this is this acceleration squared, because the first acceleration you
got was by having caching in the first place, by having a cache. And the second acceleration
you got was by tweaking the cache algorithm to have these shadow partitions and then steer
hits and misses into those shadow partitions to be able to sit on the convex hull, okay? So not possible without modeling. So
you have to model first. You've got to pay that cost of the model, but then you can get these
super effects coming out of it. Okay, so this is theoretical. Let's look at some real examples.
Here we've got three different workloads and two different cache algorithms in those workloads.
So the same setup as always.
Missed ratio is on the y-axis.
Lower is better.
And the x-axis is increasing cache size.
So for these three workloads,
you'll see a large diversity of different cache sizes
because the working sets are different sizes.
One other thing I want to mention real quick here. So you see ARC here. Remember I said, what if I increase my cache size?
If I increase my cache size in production, of course I'm going to get either
equal performance or hopefully better. No, no, no, no, not always the case. Look at ARC here:
in this region, we added more cache and the performance got worse. Remember, lower is better.
Performance got worse.
So for these complicated cache algorithms,
the complexity is so
phenomenal that you can't even rely
on performance improving if you add cache.
I've shown you
that that's not always the case. And there's
lots and lots of examples of this.
And that's yet another reason why modeling
is important. You don't want to just mess around
with production caches
because you could actually make it worse
when you think you're making it better.
Okay, so here,
the setup is similar to before.
So black is the original miss ratio curve.
So the first acceleration
when the cache is added,
you can predict the performance
of the workload with that cache in terms of hits and misses by just looking at the black curve.
The dotted line is the convex hull of that entire curve.
Now, of course, you don't know that when you start off, but the algorithm has to learn both the black curve and the dotted line.
If it can learn, it can actually operate in the red zone.
Now, we think that over time when we get better at figuring out how to do online modeling,
we can get these red curves to be closer and closer to the convex hull.
But again, any improvement against the black today is a net win just by a slightly better algorithm and by online modeling.
So it's a huge, huge win.
So you'll see all the way on the left, that was an example,
I think one of our best examples
of hugging the optimal,
sorry, the convex hull.
And then the other workloads
have varying levels of error
against the hull.
No knowledge, real-time modeling.
And you see, there are a couple cases
where we haven't figured out quite how to bridge the gap.
There are some cases where we are just a tiny little bit worse.
We have some ideas on how to improve that in the future.
Okay.
So in this particular case,
the leftmost one is 69% of the potential gain
that the algorithm was able to achieve,
and then a 40% and a 30%.
Make sense?
Okay.
So, next, we said, okay, once you have a model,
what else can you do with it?
And it turns out, like, we haven't run out.
I think we still haven't run out of cool things
that you can do once you have these models.
So one of the things I've been looking for
most of my career is the ability to have
a storage system provide latency guarantees.
Like, why can't this XYZ vendor storage system just give me a knob that says,
hey, I just want 7 milliseconds, man.
If you give me 6, great.
Don't give me 8.
Why can't I have that, right?
And being on the vendor side, I was always looking for it.
I think we finally cracked how to do it.
And here's how.
So on the y-axis is 95th
percentile latency. Remember, it's just an
arithmetic conversion between
miss ratios and latency. So
lower is better over there as well. But in this case,
it's not just any latency. It's 95th percentile
latency. And then cache size is on the x-axis.
So let's say that a customer comes to us
and says, look, I want a 7 millisecond latency target.
Okay.
Well, today we can't really give it to them,
but if you had a system like this
that was doing online modeling, you could.
And how would you do it?
You'd just go over to the 7-millisecond line,
and you'd walk across
until you found the intersection,
and then you'd say, oh, I need 16.5 gigabytes.
I just want to be safe,
because workloads do shift and change
over the course of a day or a week,
so I'm going to give maybe 18 gigabytes.
And as long as I do that,
and this workload goes through the patterns
that it has gone through in the past,
and there's some stationarity in those patterns,
I'm guaranteed to get seven milliseconds.
It's just like walking across a curve
and you've got latency guarantees.
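In code, "walk across the 7-millisecond line" could be as simple as this sketch; the penalty, target, and headroom numbers are just examples.

    def size_for_latency_target(curve, gb_per_step, miss_penalty_ms,
                                target_ms, headroom=1.1):
        # curve[i] = modeled miss ratio at cache size i * gb_per_step gigabytes.
        # Latency is approximated as miss_ratio * backend miss penalty for the
        # percentile you care about (hit cost treated as ~0).
        for i, miss_ratio in enumerate(curve):
            if miss_ratio * miss_penalty_ms <= target_ms:
                return i * gb_per_step * headroom    # e.g. 16.5 GB -> give ~18 GB
        return None                                  # target unreachable with this backend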
Now, of course, the model has to
be accurate, right? So you have to have a good way
of determining the model. You have to be doing it on
time scale where the workload varies enough
that your patch is predictive of the
future.
Thoughts, questions on this?
Yes?
Yeah, you know, the conversion from
ratio to latency is not so simple.
Okay, so...
Especially in a cache hierarchy,
improving the hit rate of the upper cache
can degrade the hit rate of the lower cache.
So let me repeat the question, or the comment.
The comment is that, Irfan, you constantly have been saying
it's a simple arithmetic conversion between miss ratios
and the average or the 95th percent latency,
and it's actually more complicated than you keep saying it is.
And I think you're right.
So here, at the first level, it is arithmetic.
So let's just do the arithmetic, right?
So I'm going to put miss ratio and hit ratio,
and they're just one minus the other, right?
So for the miss ratio, I'm going to multiply that
by the back end or the remainder of my cache hierarchy's miss penalty.
So let's say it's one millisecond median, three milliseconds 75th percentile, and nine milliseconds
99th percentile. So let's get the full distribution. So for each of those distribution points,
I'm going to take my miss ratio, and I'm going to multiply that by the latency. And I'm going to add the hit ratio times the hit penalty,
or hit cost, in, say, 100 microseconds. But that's so small, I'm going to just make it zero just to
make it easy for now. So therefore, to the first approximation, what I put on the y-axis here as
latency is nothing but miss ratio that my
model predicts times whichever percentile of latency that I've measured for the remainder of
my backend. So that's very arithmetic.
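Spelled out, that first-approximation arithmetic is nothing more than this, using the made-up penalty numbers from above:

    def latency_from_miss_ratio(miss_ratio, miss_penalty_ms, hit_cost_ms=0.1):
        # Each miss pays the backend penalty for that percentile,
        # each hit pays the (small) cache hit cost.
        return miss_ratio * miss_penalty_ms + (1.0 - miss_ratio) * hit_cost_ms

    # With a 35% miss ratio and the example backend penalties:
    #   median:          latency_from_miss_ratio(0.35, 1.0)  -> ~0.4 ms
    #   75th percentile: latency_from_miss_ratio(0.35, 3.0)  -> ~1.1 ms
    #   99th percentile: latency_from_miss_ratio(0.35, 9.0)  -> ~3.2 ms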
So now, the next comment was: okay, you're right, let's say that your arithmetic is correct, and you're able to actually determine it statically for this current configuration,
but how do you predict it
after having changed the top-level cache,
let's say, for the sake of argument, right?
But the nice thing is,
all of our caches should have these models.
So when we go and tweak the top-level cache,
let's say make it smaller, right?
We know the operating point of the next cache
because now we've made that a little different.
We moved it to the left
and the one down from it to the left and so on.
So we should theoretically be able to
simply predict all the way through that network of caches
as long as each one of them is doing these models
and then know for each of them
what the resulting is going to be.
But your model is derived from looking at the workload
presented at the operating point.
Yes.
No, no, no. So again, the beauty
of these models is that they're
independent of the operating point, right?
The model, remember the blue dot that was
here? We predict less than
and greater than in the same pass.
Yeah, yeah, for that level of cache.
For that level, sure, sure.
And you're doing that observing the workload that's coming into that level.
Yes.
The workload coming into the next level of cache
is now different.
But we know exactly how it's going to be different is the point.
Well, but you haven't measured
the curve for that.
So there's two cases.
The one case is inclusive caches,
and the one case where it's exclusive caches.
For exclusive caches, it's trivial.
The reason why it's trivial is that that entire curve
actually applies to every level of the cache.
And inasmuch as you have an exclusive cache in your system,
it is trivial.
For inclusive caches, you have to do a little bit more modeling,
and I confess.
And we can talk offline about why it's different for inclusive versus exclusive.
So if I was to design a system around the capability of doing this sort of thing,
I would make an exclusive cache, because now we can actually make that case for the first time.
We've never been able to argue about the utility of an inclusive versus exclusive cache in the past.
Once you have a model, you can argue about it.
And I think once people start to incorporate
this sort of thinking in their early stage design
for the next generation caches,
now I think all of a sudden we have a reason
why we should be thinking harder about exclusive
than we have in the past.
Okay, so one quick thing I wanted to mention here
is multi-tier sizing.
This is very close to where you were going,
which is in the case of exclusive systems,
it's actually that trivial,
where you just run a very simple linear program on top of this,
which determines, given my SLA and given the costs of each tier,
I can draw this curve and actually decide how much DRAM I'm going to occupy,
how much tier 1, 3D XPoint, persistent memory I'm going to occupy,
and local flash, remote flash, network flash, network misses, cloud storage, whatever it is. And, you know, the number of tiers no longer matters,
because at least for the exclusive cases you can just simply read it off. Now, the dollar signs are
not in front of it, but you can weight each of those regions by dollar signs and then just run
a linear program that gives you your SLA at the lowest cost possible.
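As a toy version of that optimization, assuming exclusive tiers so the one modeled curve applies end to end: the cost and latency numbers below are made up, and a brute-force search over a DRAM/flash split stands in for the linear program, just to show the shape of the problem.

    def cheapest_two_tier_split(curve, gb_per_step, sla_ms,
                                dram_cost_gb=5.0, flash_cost_gb=0.5,
                                dram_ms=0.001, flash_ms=0.1, disk_ms=5.0):
        # Exclusive tiers: the first d steps of cache behave like DRAM, the next f
        # steps like flash, and everything else misses to disk, so the single
        # modeled curve tells us the hit fraction captured by each tier.
        best = None
        steps = len(curve)
        for d in range(steps):
            for f in range(steps - d):
                hit_dram = 1.0 - curve[d]             # fraction served by DRAM
                hit_flash = curve[d] - curve[d + f]   # fraction served by flash
                miss = curve[d + f]                   # fraction going to disk
                latency = hit_dram * dram_ms + hit_flash * flash_ms + miss * disk_ms
                if latency <= sla_ms:
                    cost = (d * dram_cost_gb + f * flash_cost_gb) * gb_per_step
                    if best is None or cost < best[0]:
                        best = (cost, d * gb_per_step, f * gb_per_step)
        return best   # (dollars, GB of DRAM, GB of flash) meeting the SLA, or None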
Okay, so this self-optimizing data path, right?
So I said self-optimizing caches as the title of the talk,
but really it's about more than just the cache, right?
It's about thinking about the application all the way down to the persistent storage,
modeling it, and then chopping it up
into the right set of tiers.
And usually it's just going to be one tier in the middle,
but the point is that the system should be more than just one cache at a time.
We now have the opportunity to go and think about an end-to-end data path
and provide end-to-end data path guarantees.
So the first thing we can do with it is monitoring, right?
Just be able to have a dashboard that says,
hey, what happens if I go higher or lower or change my cache parameters, right?
And then we should be able to predict what the changes of those cache parameters should be
such that we can even make a dynamic and self-adapting.
And we saw some examples of that
with significant performance wins
by hugging the lower profile.
Well, we should then be able to insert into this model
after picking the lower profile,
so you'll see that my curve in this case is nice and smooth.
And the reason it's nice and smooth
is that it's gone through some of those lower-profile and self-adaptation steps already.
And once our curves are nice and smooth, we should be able to set our latency targets.
And by the way, throughput and latency use the same sort of mechanism here
by simply operating at the right point on these curves.
We can also do tiering, which is the last thing I talked about as a full slide,
on how that works in these systems.
We can also talk about latency reduction,
which is this cliff remediation algorithm
which uses shadow partitions to steer misses and hits
to these shadow partitions,
and is able to actually steer the actual curve
for the first time,
which is really quite an acceleration squared sort of phenomenon.
There's another thing I didn't talk about, which is multi-tenant isolation. If you have multiple
tenants hitting your storage array and you have a limited amount of cache, how do you make sure that
they get what they deserve? If you have a last level cache inside a processor and you have some
ability to control tenants, well, how much should each tenant get such that your utility function,
which might be overall throughput, or it might be isolation for a tenant, can be satisfied?
As soon as you have multiple tenants, right, if you don't have models, it's just a crapshoot after
that. You just don't know. And that's another thing, very, very critical, important. I just
didn't have time to talk about today. So the results of all of this work is now we can safely
and efficiently quantify the impact of changes. And we're
often seeing a 50 to 150 percent efficiency improvement in terms of the cost of running the infrastructure,
the cache infrastructure. You can do some of these tradeoffs, maybe pay a little bit higher
cost, but instead of aiming for just lower cost, you can set the system to adapt its parameters
to always hit a certain latency SLA.
And then, of course, higher consolidation ratios
and looking forward and being able to accurately predict
and do capacity planning.
So that's the quick summary, which is,
if you can model, you can do all these things plus more,
and there's some really serious performance benefits to be had.
So I have maybe two minutes, I think, three minutes for any remaining questions,
and I'll be around the rest of the conference as well.
Yes?
I'm a little bit perplexed why we're talking about milliseconds as opposed to microseconds
when we have network storage now operating at 10 microseconds.
Just out of interest,
what's the sort of scenario of the workloads? Is it heavyweight stacks or languages like Java?
So there's no reason why...
It's just all of the latency numbers
are using arithmetic associated with some assumptions of the back-end model, right?
So I happen to pick models where the working set, if it doesn't fit in DRAM, is going to disk, spinning disk. And that's why you see those high latency numbers, you know,
and the left end of the curve.
But then you'll see you're oftentimes below one millisecond on the right end.
Yeah, but again, the modeling is not in latency, right?
The modeling is done in terms of misses and hits.
And you just impose a model on, you impose a
back-end
latency model on top of it to get
the results. Yeah.
So, I wouldn't worry too much about
my...
In this case, it was disk all the way.
And so, if much of the workload
was going into disk, you'd get those penalties.
But there's a lot of disk in the world. That's not going away.
So, there's nothing about this modeling that,
like I said, it applies to last-level caches
inside a processor.
And we, I mean, Cache Physics is my company.
You know, we're looking to do deals
all the way up and down the stack.
Okay, so it looks like we're going to get kicked out.
But I'm going to see if I can push my luck
and take one question back there.
Yes?
So once you predict the model...
Yes.
...and you provision all the resources,
how accurate is it?
Does it vary with different workloads
and with different periods in time?
So we've seen accuracy...
We've seen error almost all the time less than 1%.
In the model itself, yeah.
And then again, the controller on the model,
on top of the model,
should not hug it down to the megabyte, right?
Try to operate at tens or hundreds of megabytes
in terms of accuracy so you have some leeway.
Yeah? Okay.
So thank you very much.
I'll be here for the rest of the conference.
I appreciate your time
and would love to talk about how we may be able to partner
and bring this sort of technology into your storage system.
Thank you so much.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.