Storage Developer Conference - #43: SNIA Tutorial: Your Cache is Overdue a Revolution: MRCs for Cache Performance and Isolation
Episode Date: May 2, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 43.
Today we hear from Irfan Ahmad, CTO of CloudPhysics, as he presents a SNIA tutorial,
"Your Cache Is Overdue a Revolution: MRCs for Cache Performance
and Isolation," from the 2016 Storage Developer Conference.
Hi everybody, my name is Irfan Ahmad.
Thank you for taking the time to attend this SNIA tutorial today.
I'm going to be talking about some interesting new developments in
the world of caching. So caches are obviously main components in any kind of distributed
system, even inside the distributed system of a single microprocessor today. The only
way you can make distributed systems work is by doing caching and all different
forms of it.
It turns out that some work done about 40 years ago in the area of cache analysis has
found a resurgence in the last five years, in both the academic community and
industrial research.
And there's some interesting lessons from that that we can draw.
And SNIA was gracious enough to accept my proposal
to have me talk to you guys today about some of those developments.
Okay, so pass through this stuff.
So the context for caching for me today,
just to exemplify and to make things concrete,
will be something like what I'm showing up here.
So some sort of a database system, perhaps, or a key-value store or container,
some applications running inside maybe a virtual machine or some other distributed platform.
So that's a system context.
Obviously running on servers with different layers of hardware,
different tiers of hardware.
And perhaps in some cases,
even an elastic cache out in the network.
Okay?
There's different places these caches exist.
So that's the first context for me.
I'm calling that cloud-scale data caching.
But of course, that's not the only place.
The same techniques can be applied in a single system.
An example of that might be an app running on your phone. Interesting things are happening these days
with, for example, recent iPhones shipping with NVMe solid-state storage, right? So,
in a way, the iPhone was almost the first adopter of NVMe in the industry.
So interesting things are happening in terms of trying to look at caching.
So I'm going to set up
kind of an aggressive goal.
I'm going to say, look,
we're going to use the new research
and the new results in the last few years
and we're going to try to provide
self-service on sizing, tuning,
and troubleshooting to end customers.
And then actually improve the hardware efficiency of caching, okay?
So, how much hardware you need of the fast medium, DRAM, flash, whatever it is, to be able to achieve
your performance, right? We're going to try to improve that by 50 percent. That's the sort
of stuff, that's the level of performance improvement that, you know, gets attention.
And then the next goal we're going to set up is that, well, using the latest research, we don't want to just improve performance, but we want to use that to create knobs.
We want to give people guarantees.
If there's a client somewhere that specifies using an API or in a UI saying, I want five milliseconds of latency, right, we want to be able to use this new research to actually provide that. So some of this will start to become clear. But these are
aggressive goals that I'm setting out and I think the most recent work is
proving that this might be possible. So let's first start out.
Today's cache systems provide some sort of a dashboard like this. We're all used to it, right?
Hit ratio. That's all they give you. Of course, the current size, but
usually that's pretty static. So, hit ratio. Okay, so I'm getting a hit ratio of
43%. Is that good or bad? I don't know. Maybe it's good, maybe it's bad. No? Okay, well,
usually when somebody gives you a number, all of a sudden you're like, okay,
I'm motivated. I've got a number now. I'm going to improve it.
Well, can this be improved?
Is this the best possible?
Or if I tune in or if I change something, can it be improved?
Like this dumb dashboard is not going to tell me.
And there's nothing in the system today that is going to tell me.
So we're ill-equipped to be able to answer the most basic questions you can ask as soon as you put a metric up, right? So imagine CPU utilization. If somebody says, my CPU utilization is a hundred percent, and then you come in and say,
can it be improved? It's like, okay, you're at 100%, we can't get any more out of that CPU. Because you
know what to do with that number. Can it be optimized? Now you have a number, you can go
and tweak, you can change all kinds of parameters in the workload, and maybe you can
improve it. But this hit ratio is a number nobody knows what to do with. Okay, so let me get a little bit more interesting here. So let's say that is the
number you put up. So we're talking about, for those of you who just joined, a
cache dashboard, okay, and the numbers it provides. Today, a cache, some sort of a
performance monitoring dashboard, only gives you a hit ratio. So let me ask a second question. What
happens if I add cache? Okay, so I've got 128 gigabytes and I want to go double that, because
my performance is not good, my application-level performance is not good.
Well, will that improve the hit ratio, or will it stay at 43? Maybe it'll drop. I don't know. Do we know?
We don't know, right? What if I remove cache, because I want to give it to somebody else, and I want to go to
64 gigabytes? Will the hit ratio go to half?
Or will it go to zero?
Or will it stay the same?
No clue, right?
What if I add workloads?
I'm going to keep everything else the same.
I'm going to add more VMs or add more apps or add more containers or add more databases.
Don't know what's going to happen.
Simple question.
Is thrashing actually happening in this cache right now?
No clue.
Don't know.
Right?
Well, I mean, forget this last question. What if I change parameters? What if I change the
read-ahead length? What if I change some other parameter? Will it improve
this performance or decrease it? No clue. So this is the state of the art in
almost every implementation of a cache today. So, in terms of modeling it, you
know, because that's where you start, you always think, okay, can I model this, right?
And maybe I can get better if I know how to model this.
So in terms of modeling, that would be like having all of this real estate in this graph and only being able to plot one point.
What is that point?
That point is equivalent to that 43% in the other chart.
So that's a state of the art.
Okay?
We don't know what happens if we change cache size
in terms of the performance.
Now, the performance I'm plotting
is miss ratio here,
which is kind of customary in literature.
So lower is better.
I'm just going to keep reminding you
that whenever you see a plot like this,
lower is better.
Kind of like latency.
Lower is better.
Okay.
So that's a state of the art today.
But what if, what if we could
draw out a curve, okay? And the curve is varying the cache size, and as you vary the cache size,
an oracle magically tells you, ah, here's the curve, here's what the response
of the system will be,
of that workload, to more or less cache.
So in this particular example, this is a funny example,
it's actually a real workload
from a customer live production deployment.
So let's say that you're operating
at, say, 10 gigabytes of cache, right?
So now we go back to the other questions
we asked at the beginning.
What if I double the cache size?
Nothing happens.
What if I double again? Nothing happens. What if I double again? So you would have gone
through three doublings now. Most people give up at that point if you were doing this in
real production, because you're doubling, right? It's like, okay, three doublings, I give up. But right
when you give up is when you were about to see performance improvements. Remember,
lower is better, right? Fewer misses. Why? Because there's some working set,
and I don't know this workload enough to know what exactly,
but I can tell you the working set is about 80 gigabytes
for the first working set,
and the second working set goes out to be about 110 maybe,
and the third one at about 140,
and then everything fits after that.
So this is what we want.
If we could have this curve for every workload in the world,
we'd be able to answer almost all the questions that we asked before.
So this is known as a miss ratio curve.
So when I say MRC, please translate that to be a miss ratio curve.
And sometimes you'll see people draw this the opposite way,
which is a hit ratio.
Higher is better, but we're not going to do that.
So the miss ratio curve.
So this is performance as a function of the cache size,
and you get to see the working set needs.
This is what we want because we can use this to inform allocation policies.
Now, in previous tutorials that I've given,
happy to send you references to those.
Even at SNIA, I've talked about how this curve is computed.
Today I'm not going to go through the details.
Suffice it to say, now there are multiple systems out there, including some
open source implementations of how to do this
with various levels of efficiencies.
For example, the company that I work for has
one that we license out on how to
compute this curve. So I'm not going to go through that. There's plenty
of material
out there. I will just say that the way
that you compute that curve is by creating
something called a reuse distance.
So when you see a block A, you measure how many unique blocks were seen between the first reference
of A to the next reference of A. Okay, so if there's 100 blocks that are unique in that
time period between the first reference of A and the second reference of A, A being the
block number, that's called the reuse distance. What you do is you measure the reuse distance for every block in the system, and from that you can draw this curve.
That's all I'll say about that; it's not hard, there's nothing too complicated about it.
Doing it at scale and efficiently is hard, but there are multiple ways
of doing even that today. A minimal sketch of the basic idea follows.
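To make that construction concrete, here is a minimal, unoptimized sketch in Python. This is purely my own illustration, not one of the efficient production implementations the talk refers to: it keeps a full LRU stack, so it runs in the classic O(N*M) time that the newer work avoids.

from collections import defaultdict

def mrc_from_trace(trace):
    # Naive miss ratio curve for an LRU cache, built from reuse distances.
    # trace: iterable of block keys (LBAs, rows, any opaque IDs).
    # Returns {cache_size_in_blocks: miss_ratio}.
    stack = []                       # LRU stack, most recently used at the end
    seen = set()
    reuse_hist = defaultdict(int)    # reuse distance -> count
    cold_misses = 0
    total = 0

    for key in trace:
        total += 1
        if key in seen:
            # Reuse distance = number of distinct blocks referenced since
            # this block's previous reference (its depth in the LRU stack).
            depth = len(stack) - 1 - stack.index(key)
            reuse_hist[depth] += 1
            stack.remove(key)
        else:
            cold_misses += 1         # a first reference can never hit
            seen.add(key)
        stack.append(key)

    # An access with reuse distance d hits iff the cache holds more than d blocks.
    total_reuses = total - cold_misses
    mrc, within = {}, 0
    for size in range(1, len(seen) + 2):
        within += reuse_hist.get(size - 1, 0)
        mrc[size] = (cold_misses + (total_reuses - within)) / total
    return mrc

# Tiny example: a cyclic pattern over 3 blocks thrashes until all 3 fit.
print(mrc_from_trace(["A", "B", "C", "A", "B", "C", "A"]))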
Okay, so this goes back, like I said, about 40 years. Before the invention of the first algorithm to try to do this, you needed to do a separate
cache simulation per
cache size. That's very expensive.
And in fact, in some
of our experiments to do the original
algorithm for this, to run a
day's worth of trace could take a day's worth of
computation.
And that's running on a fast processor.
It's simply too much. It could take hundreds
of gigabytes just for one single workload.
Now, I'm going to pass through this very quickly,
but there has been research sporadically in this area,
a lot of times for CPUs and processors,
but not a lot of improvement.
So you went from order M space and order N-times-M time,
out to a little bit of improvement here, and then you get
into the last five years, and all of a sudden there's a tremendous amount of
work that's gone on. So I provide these as references that you guys can
take a look at, or I'm happy to talk about them another time. But we're now at a
time where we can do this in constant space. That means we can actually compute
those miss ratio curves going from
linear space
to constant space, which is absolutely phenomenal.
And then this order N is the time, but note that N
is the number of blocks
that were accessed during that time. So on a
per-block basis, it's the first
algorithm that does it in constant time.
The N divides out here because
N is the number of blocks, so per IO, it's
constant time. So the second time
this was done, you know, a couple of years after
this original work, there's a different technique now
also available called AET. Very interesting
algorithm. Only works for LRU. This technique
works for LRU and ARC and other
algorithms. And so now we're
at a state where we, as an industry,
have just got to start doing this, because there are two
or three ways of computing these miss ratio curves that are available to you in different forms, right?
So today I'm going to focus more on what you do with these curves if you can get them, because
I think it's pretty well known in academia now that you can get them pretty cheaply. So again,
miss ratio curves, lower is better. So these are just a bunch of different production
SCSI workloads that I had access to. I ran those traces, collected over one or
two weeks, ran them through this analysis to see. Now the reason why I'm showing
this slide is look at the variety. So for example this T08 trace goes out to be
about 700 gigabytes, I think one or two weeks' worth of a workload. And as you can
see, the majority of it, I mean, if you had a 600-gigabyte cache it still wouldn't fit
this working set, and you add one more gig and all of a sudden everything fits,
right? So now you have that type of a curve, and you have a complete opposite curve
where all the benefit is seen in the first 10 gigabytes, and after that you're
just getting very little additional marginal return, so why would you give it more? So now I'm trying to motivate the first use case for
this. If you had this for every workload in your storage array or your software-defined storage or
whatever, wherever there was a cache, you could take these curves and do some simple math and say,
well, how much cache should I give to each client to drive global optimization? Okay, first trivial
use case that comes out of this.
And all we did is use traditional cache partitioning
to be able to simulate that.
Traditional cache partitioning. Give each
workload some defined amount
but you vary that amount based on this curve.
So you give that guy, let's say you have a terabyte
of cache, you give that guy 600 gigs
and that guy only 10 gigs.
And they'll both achieve their best.
If you had a terabyte and you gave this guy 500 and that guy 500,
guess what you're doing?
This guy ain't going to get any benefit, and he's wasting space.
And so this would be, at any point outside of this range, he's thrashing.
Each one of these boxes is, you know, for a certain cache? A single workload?
Single cache, single workload.
So what you're doing is you're saying, let's say this is a VM, right?
Let's say that's a database VM and that's a virtual desktop or something like that.
You don't know that.
That storage array doesn't know that.
But it knows this curve if it were to integrate it into its cache controller.
And so without knowing the workload or anything about the workload from a subjective point of view,
it would just empirically know that, look, I know how to improve this workload,
and I know how to improve that workload.
Okay.
Now, there's lots of variety, right?
There are some, for example, there's always a few in these plots, that keep improving.
For every gigabyte of cache you give it, it continues to improve, right?
So now you can trade those off.
And the trade-off is very easy. Like, that's easy, right?
You just kind of say, I have one more gigabyte of cache to give, who do I give it to?
That's a naive algorithm; a sketch of it follows.
You can get better than that with simulated annealing,
trivial stuff.
An undergrad could write that code to do the comparison.
First thing you need is the MRCs.
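Here is what that naive marginal-benefit allocator might look like, as a sketch only; the function name and the toy curves are mine, not from the talk's slides. Note that a plain greedy pass like this stalls at flat, non-convex regions of the curve, which is exactly the gap the thrash-remediation trick later in the talk addresses.

def greedy_partition(mrcs, total_blocks, step=1):
    # Greedy cache partitioning driven by per-workload miss ratio curves.
    # mrcs[w][s] = predicted miss ratio of workload w with s blocks of cache.
    # Returns {workload: allocated blocks}.
    alloc = {w: 0 for w in mrcs}

    def gain(w):
        cur, nxt, curve = alloc[w], alloc[w] + step, mrcs[w]
        if nxt >= len(curve):
            return 0.0
        return curve[cur] - curve[nxt]   # misses avoided by one more chunk

    remaining = total_blocks
    while remaining >= step:
        best = max(mrcs, key=gain)
        if gain(best) <= 0:
            break                        # nobody benefits from more cache
        alloc[best] += step
        remaining -= step
    return alloc

# Toy curves: "db" keeps improving, "vdi" flattens out after two blocks.
print(greedy_partition({"db":  [1.0, 0.8, 0.6, 0.4, 0.2, 0.1],
                        "vdi": [1.0, 0.5, 0.4, 0.4, 0.4, 0.4]},
                       total_blocks=4))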
Yes?
I don't know if you're going to talk about it, but time
variation of the key sets. So for example, it may look like one piece in the morning and something else in the afternoon.
We find that if you are operating at about a day's granularity, they're very stable.
Different parts of the day change.
So you're assuming like long term?
So I'm not making that assumption.
I'm saying for this workload, you could have computed an MRC every hour. They're so cheap that you could have captured it every hour
and varied your caching every hour. I'll show you some results where we actually
did that work, and we showed that about an hour is kind of like a good sweet spot
where you can get a tremendous performance improvement or hardware
efficiency improvement by buying less DRAM or flash to get the same
performance benefit.
But it's very important. Time-varying behavior is actually
one area, I would say, that is still being researched.
I'm working with a couple of universities
on really narrowing down on the theory
about how time varying behavior can be modeled.
That's probably the only area here that's open.
And I'll give you some results about that as well.
So lots of different MRCs and just motivates us to think,
well, interesting, if I could have this
for all of the workloads running on my system, cool stuff I can do.
So now let's take a look.
What are some cool stuff we can do?
Okay.
So I should have said at the beginning that to really get the most out of this talk, you've
got to put in a little bit of kind of mental math effort here.
But the first mental math effort is, remember, you were seeing miss ratio curves before,
where lower is better.
And now what I just did is I pulled a trick on you and called this an average latency curve.
So I did some arithmetic to do that.
So let me tell you what arithmetic is. It's trivial.
It looks complicated, but it's trivial.
So I start with the miss ratio curve.
What does that mean?
A miss means that a certain percentage of IOs went to the back end
because they didn't hit in the cache, right?
So let's just say that the miss ratio is 50%.
That means 50% hits, 50% misses.
And let's say that the latency of a hit is 100 microseconds.
And latency of a miss is 1 millisecond, right?
If 50% of your IOs are hitting and 50% of your
IOs are missing, you just average the two numbers and you have the average latency.
Makes sense, because half of them are hitting and half of them are missing, yes? So
that's all I did here, because for every cache size I know what percentage
will hit or miss, and when they miss, I know what the latency
is. I do the trivial miss ratio times miss cost plus hit ratio times hit cost, and I get this number.
And then I get a number for every one of the spots that I have on this curve, and I draw it out.
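In code, the arithmetic is one line per point. A sketch, using the example numbers from the talk (100 microseconds for a hit, 1 millisecond for a miss); those defaults are just the talk's illustration, not fixed constants.

def latency_curve(mrc, hit_us=100.0, miss_us=1000.0):
    # Convert a miss ratio curve into a predicted average-latency curve.
    # mrc: {cache_size: miss_ratio in [0, 1]}; returns {cache_size: microseconds}.
    # average = miss_ratio * miss_cost + (1 - miss_ratio) * hit_cost
    return {size: m * miss_us + (1.0 - m) * hit_us for size, m in mrc.items()}

# 50% misses at 1 ms and 50% hits at 100 us -> 550 us, as in the example.
print(latency_curve({10: 0.5})[10])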
So for the very first time, at least to my knowledge, in real deployable systems,
you can get a prediction of what the latency of the VM would be
or the app or the database would be
as you vary the cache size.
I spent my entire career looking for this.
We just figured this out two or three years ago.
But my entire career,
almost in 10 plus years of storage,
has been in figuring out
how to give people the guaranteed latency.
I hadn't been able to figure out how to do that.
And this is the first time we figured it out.
So, first thing you do is you take your miss ratio curve and you convert it into an average latency curve,
varied over cache size.
And you get something that shape-wise looks very similar to the miss ratio curve.
And then what you do is you say, dear client, what is it that your target is?
And then you kind of just read that target down and across and say, oh, 16 gigabytes.
Now, I'm going to probably give them 18 or 20 just to kind of,
if there's any variations that occur shifting during the course of the day or the week,
I'm going to give them a little more.
But fairly simple computation, arithmetic really, once you have the original curve.
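Reading the target "down and across" is then just a lookup plus a bit of headroom. A sketch: the 15% pad is my stand-in for the "give them 18 or 20 instead of 16" padding he describes, and the 95th-percentile variant discussed a bit later is the same lookup run against a curve built with the back end's p95 miss latency instead of its average.

def size_for_target(lat_curve, target_us, headroom=1.15):
    # Smallest cache size whose predicted latency meets the client's target,
    # padded a little for drift over the course of a day or week.
    sizes = [s for s, lat in sorted(lat_curve.items()) if lat <= target_us]
    return sizes[0] * headroom if sizes else None   # None: target not achievable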
And now we can solve something, again, very, very challenging, never been solved before.
So I'll take a pause here, make sure everybody's
absorbing this.
Yes?
So I've got a question, because this looks
nice, but what about a situation where you've got
like simultaneous IOs, and for example, one IO is a hit,
another IO is missed, and we still need to wait for the
completion of both of them.
Do you somehow include that in your computation?
No, because I'm just operating at the average.
I'm saying, like, basically, think of it this way.
Let's say you had all, let's say, disk subsystems,
spinning rust, okay?
And then you replaced the same application, nothing changes,
and you replace it with a faster thing.
But I'm not going to tell you what the faster thing is.
It's just average latency, let's say, half or a quarter
of what it used to be.
The performance of the application will improve
because on average
things are taking a quarter as much
as long as before.
And as much as the bottleneck is the storage,
the performance will improve. Yes,
some IOs will be
very quick, some of them will not be.
Even in a single storage, single
hard drive, if you run, let's say
something like Iometer,
there are some IOs that
even for a single drive, will
come back in 100 microseconds.
Single drive, like a
7200 or whatever,
5400 RPM drive.
Because even the hard drive has a little buffer, and
occasionally you get a hit in that little tiny buffer.
And if you plot every single latency, like let's say you have a million IOs that you do, and you plot every single one of them,
I encourage you to do that.
It's very enlightening to see what the variance is.
Once you see every single IO on a plot, you realize a single drive can also do very fast, and it could also be one second latency.
All of that in a minute, and you don't even realize it because you only see the average.
Same thing is going to happen here. It's just that the average will drop as you increase the cache size.
Yes? So the relationship between a cache hit and a miss is a constant, right? One minus the miss ratio is the hit ratio. So isn't the shape of the curve going to be exactly the same?
It is, for the average, by a constant. It is for an average. But some people don't want to provision for the average.
Let's say you've got a client who comes up and says, look, average is nothing, I want the 95th percentile. So then what you do is you
measure the back end's latency for the misses, but you don't just measure the average; in addition to the
average, you measure the 95th percentile latency of the back end. Then you plug that number in, and
what will happen is this curve will expand. It'll keep the same shape, but it will stretch out. And then
all of a sudden, to get the 95th percentile, you're not paying 16 gigabytes, you're paying like 100 gigabytes.
Okay, but the fact that you could even do that is amazing. When's the last time we could tell
somebody, if you want the 95th percentile, I can give it to you, just pay me this much? Now we can do
that, right? So now, again, this is an amazing use case. Imagine a network elastic cache
in like a cloud service provider.
If you can have this curve,
you can predictively allocate
DRAM, right, in a
Redis somewhere.
Forget like SCSI disk.
Just think Redis. If you know
this curve for your database, you
can allocate a Redis or Memcached
of the right size.
We don't need SCSI for MRCs; MRCs are just about references. They don't care if it's
SCSI or a memcache underneath. Okay, so that's the first use case we did.
Extremely powerful, and within reach. Okay, within reach of anybody in the industry. Anyone who wants to
have this feature today, you can do it,
as long as you compute the miss ratio curves.
And there are three different ways of doing those today.
Okay.
One out of a university,
and two of them commercial systems
that are supported and you can license.
Okay.
So now, okay,
we're going to switch topics a little bit.
Now we're going to go into tiering,
because a lot of systems do some type of tiering.
Even if you're all flash, there's a little bit of tiering going on
because you've got some really fast flash
that can handle writes better, let's say,
and then you've got the slower flash for capacity,
and that's where dedupe is happening, for example, right?
Now, this time I'm back to a miss ratio curve, right?
So this is not latency.
It's miss ratio, so lower is better.
Now what we're saying is, well, we've got tiers, right? We've got local, and we've got kind of remote, and really far remote, maybe the
furthest remote is out in the cloud somewhere. But if we develop our cache system appropriately, now
we know exactly how much to size every tier for every workload to be able to achieve the
performance requirements of that workload. Okay, the way to do that is, every tier has a different miss latency,
or miss penalty, yeah? So what you do is you iteratively figure out how much to
give to each tier, and then the latency curve that you derive will be a
piecewise curve, because during the middle of this curve, you're changing the miss penalty.
So the miss penalty here is very low, here is a little higher, and here might be very high.
So now piecewise, that curve, and when you convert it into latency, will shift, and then now you can figure out how to draw this curve iteratively.
So without providing a lot of detail,
I think the higher level point is
you can do tiering and now predictively
you can figure out how much to give to each tier
either for overall efficiency
for all workloads,
driving overall throughput,
or individual SLAs and SLOs.
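One way to sketch the piecewise arithmetic: assume the tiers together behave like a single cache of the cumulative size (that is an assumption about the design, not a given), so the fraction of requests absorbed by each tier is the drop in the miss ratio between cumulative sizes, and each fraction gets that tier's latency. The iterative search over tier splits would sit on top of a function like this.

def tiered_latency(mrc, tier_sizes, tier_lat_us, backend_lat_us):
    # Predicted average latency for a stack of tiers, fastest tier first.
    # mrc(s): miss ratio of the workload with s blocks of total fast space.
    avg, falling_through, cumulative = 0.0, 1.0, 0
    for size, lat in zip(tier_sizes, tier_lat_us):
        cumulative += size
        served_here = falling_through - mrc(cumulative)
        avg += served_here * lat
        falling_through = mrc(cumulative)
    return avg + falling_through * backend_lat_us    # misses hit the back end

# Toy MRC: miss ratio falls linearly until everything fits at 100 blocks.
toy_mrc = lambda s: max(0.0, 1.0 - s / 100.0)
print(tiered_latency(toy_mrc, [20, 40], [100.0, 500.0], 5000.0))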
Yes?
I mean, to basically do the calculation on that, we already have a cache there.
Okay.
Correct me if I'm wrong.
Maybe I'm saying it wrong.
No, you don't need a cache.
You just need a...
No, no.
You're saying that, okay, I got X misses and Y hits.
Okay?
Yes.
And based on that, you're drawing your chart.
No, I'm not.
See, this chart can be drawn without actually having a cache
because all you do,
and this was a part of the previous year's tutorial,
so I didn't spend a lot of time on it,
but what you do is you say an I.O. is coming in.
At some point in time, before any cache,
you just call a little function that says,
hey, please add to your miss ratio curve analysis data
structures the fact that this IO was accessed. Okay, so one meg buffer of
various types of data structures, and that's all you need, one meg. Okay, and
then the fact that this IO was handled or requested, or this row in a table or a
Redis key was accessed, is enough for these data structures
and this miss ratio curve algorithm,
the history of which goes back 40 years,
but it was not practical.
Very recently, it's been made practical in the last five years.
So now that you
tell that algorithm, hey,
somebody requested this block or this
ID or this opaque key,
it comes back in less than
60 nanoseconds and says, yeah, don't
worry, keep going, I got it.
That's enough.
You can draw this curve in a system where there is no cache at all, and you still get
the full curve.
The point is, if you were to, this dot is an example of something where it's a cache
stat.
Now, you cannot, in a production system, vary the actual cache size to get the full
curve. You can't do it. I mean, it's just nobody will let you,
right? Who's going to do that?
Who's going to say,
halve my cache, and just see what happens?
You get fired. So, instead, what you
do is you just install something
in a system, and like I said, it's
easily implementable, where without
actually even having a cache, or having the cache operate
here, you get the full curve.
And then you decide to reduce or increase because you can predictively know what the
performance implication of that is going to be.
Yes?
So you're binning it.
You're putting it in a bin and then statistically sampling it across the space.
Is that the way you say it?
The technique?
Is that the concept?
There are three different techniques.
They all do different types of sampling.
One of them does a velocity sampling, one of them does a spatial sampling, and the other one does a reuse distance sampling.
They're all different techniques. I don't want to go into them right now, just
because there actually is math involved there. But I will say that
once you read the papers, you're like, huh, why didn't I think of that?
Okay. But it is a sort of sampling.
The point is, it's very accurate.
And if I was to
just to emphasize
there's actually two curves
on each of these
multiple plots.
You cannot see two curves.
The reason why you cannot see two curves
one is the actual MRC computed the 1970 way
takes hours and days
to run, and the other one is
seconds of CPU time for these very, very long
traces. The reason why you cannot see two curves
is that's how accurate the systems are.
They're sitting right on top of each
other. So the sampled
approximations are very accurate.
So I'll talk about two use cases. Let's move on.
The third one I think is actually pretty important, and most
people kind of don't think of it at first,
is self-service monitoring.
Like imagine putting this in your storage array,
and for your customers to be able to actually see what's going on,
like, you know, they'll buy more cache
once they realize how much cache pressure they actually have.
So, you know, just imagine this is a mock-up that I did,
and I'm not a UI designer, so it probably doesn't look that good,
but the idea is, again, cache size, hit ratio.
Okay, so this is the only plot that I'm doing with hit ratio
because I've tried saying miss ratio to customers,
and it's like, oh, they don't think in terms of misses.
Well, most of them are not engineers,
so they think higher is better.
So if it's up and to the right, it's better.
If it's down and to the right, it's bad.
So just to show customers, I've drawn this as a hit ratio.
So if you ever put this in a real UI in your storage array
or in your software-defined storage
or in your whatever memory-based storage,
please do it this way.
I've learned the hard way.
Okay.
But imagine giving this to all of your clients.
Like, imagine having this
and your cloud service provider.
When you log into your dashboard
for a cloud service provider,
he gives you this curve.
That'd be pretty awesome
because now you can actually size.
So that's the idea here.
Let's not forget the simple stuff, right?
Simple and easy.
Just measure and plot.
Don't even do anything with it, right?
Very valuable.
Okay.
So,
this one again,
we're going to have to go
put our thinking hats on. This is
the tutorial part where I've got to teach you something. So something very interesting
has happened recently. So remember I showed you a bunch of curves that had
these flat regions, and I said, ah, that's when thrashing is going on? Remember that?
So the reason why it's thrashing, obviously, is you're adding more cache and the
latency is not improving.
But, interestingly, what's happening underneath is that the workload doesn't reuse the blocks that it asks for.
It asks for A and then it doesn't ask for A for a really, really, really long time.
By the time it asks for A, A is no longer in the cache. It's been evicted from the cache.
So that's why you get this flat region. Okay, and then where you stop getting the flat region is when those blocks that were asked
for back here start getting reused. So that's why it's a reuse distance, right? Okay, so now,
the workloads that have flatness are the worst workloads in the world
for a cache, right? It's just awful, because no matter what the cache does, it
just cannot fix those workloads because
it's not the cache's fault.
It's not the algorithm's fault. The workload
is not cooperating.
Okay.
The interesting thing is, recently we figured
out,
some work out of MIT has figured out,
that if, big if,
if you can draw this curve a
priori for a workload,
okay, you can tweak
your cache algorithm online,
on the fly, to actually draw
the, what's called the convex
hull
against this miss ratio curve, right?
So now, this is thrash
remediation like we've never
seen it before, because now you take the worst workloads and you make them your slave.
Give me your worst one, I can fix it.
Because the worst ones, always remember, have this flat region.
And the convex hull means that you take the lowest non-convex parts
and you connect them, and that's a convex hull.
So now there's an algorithm, but it requires having the miss ratio curve. Again, this is something that
everybody, all of us can do
in our production systems
fairly easily if we just add
a misratio curve.
A misratio curve computation
capability. If you have this,
this becomes within reach. Now,
I'm trying to debate whether I tell you how this is
done. I'll do it. It's a tutorial.
We'll do it.
Okay.
So here's how it's done.
Thinking caps on.
Okay.
So imagine that I have this miss ratio curve.
Okay.
And what I do is I say, okay, randomly I'm going to take half the blocks.
Randomly.
It has to be random.
Okay.
And I'm going to put them into partition one.
I'm going to call this workload partition number one
because it's just random.
Take half the LBAs,
and I assign them to workload partition one
and half of them to workload partition two.
Randomly, half and half, okay?
Then I come in, and I say I have my cache,
and I'm going to allocate half the cache
to workload partition one and half the cache to workload partition two. Easy so far, right? Half and half. Will there be a
performance change? No, because I randomly picked the blocks, and both workload partition one and
workload partition two will have the same miss ratio curve, because I picked them randomly. Yes?
Okay, but now I make this observation. I say, okay, suppose that I'm currently
operating at a 10-gig cache. So now what I'm going to do is, instead of making
workload partition 1 and workload partition 2 equal size, I'm going to
make workload partition 1 be 10% of the logical block addresses and workload
partition 2 be 90%.
Now, I did something very unfair,
because workload partition 1 is getting half of the cache,
but it's only 10% of the blocks.
And workload partition 2 is getting half of the cache,
but it's 90% of the blocks.
Everybody with me so far?
So what did I just do?
Workload partition 1 moved this way.
No, actually, workload partition 1 was the smaller one, right? Okay, so workload partition 1 moved this way, okay? 10% of the blocks, 50% of the cache. So
it's going to move this way. Now, workload partition 2 moves this way. But what just
happened is that workload partition 2, moving
this way, experienced no loss of performance, because it was in a flat region to start with.
Workload partition 1, though, got over this hump. Okay. Now,
to truly get over the hump, I also then go and stop doing half and half in the cache size.
I make workload partition 1's cache occupancy
not be 50%, I make it 90%.
Okay, so now,
again, workload partition
one here goes all the way across
and ends up getting a huge
performance win, whereas workload
partition two never climbs
up there. So if I can do the algorithm right,
just using the basic intuition, and it's
pretty easy to do once you sit down and try
to do it, what happens is that
my average performance
between those two partitions
ends up somewhere along this diagonal.
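For reference, here is the arithmetic behind that walkthrough, as a sketch only; this is my restatement of the trick he attributes to the MIT work, not a drop-in implementation. The key fact is that a random p-fraction of the blocks has the same curve scaled down by p, so a partition holding fraction p of the blocks with p times s1 of the cache behaves like the whole workload at size s1; at total size s between two hull points, the blended miss ratio lands on the straight line between them.

import bisect

def lower_hull(mrc):
    # Lower convex hull of (size, miss_ratio) points; mrc[s] = miss ratio at s.
    hull = []
    for s, m in enumerate(mrc):
        hull.append((s, m))
        while len(hull) >= 3:
            (x1, y1), (x2, y2), (x3, y3) = hull[-3:]
            # Drop the middle point if it sits on or above the chord.
            if (y2 - y1) * (x3 - x1) >= (y3 - y1) * (x2 - x1):
                hull.pop(-2)
            else:
                break
    return hull

def remediation_split(mrc, s):
    # For total cache size s, return (fraction of blocks in partition 1,
    # cache for partition 1, cache for partition 2, expected miss ratio).
    s = min(s, len(mrc) - 1)
    hull = lower_hull(mrc)
    xs = [x for x, _ in hull]
    i = bisect.bisect_right(xs, s) - 1
    s1, m1 = hull[i]
    if s1 == s or i + 1 == len(hull):
        return 1.0, s, 0.0, mrc[s]          # already on the hull: no split needed
    s2, m2 = hull[i + 1]
    p = (s2 - s) / (s2 - s1)                # weight on the smaller hull point
    # Partition 1: p of the blocks, cache p*s1  -> behaves like the workload at s1.
    # Partition 2: 1-p of the blocks, cache (1-p)*s2 -> gets over the cliff at s2.
    return p, p * s1, (1 - p) * s2, p * m1 + (1 - p) * m2

# A cliff at size 3: the raw miss ratio at size 2 is 1.0, the split gets ~0.47.
print(remediation_split([1.0, 1.0, 1.0, 0.2, 0.1], 2))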
So this is like, I mean, you remember
the 90s and 2000s
and the first half of the 2010s,
probably more than 100 papers have been
published on better cache algorithms.
This thing beats all of them
with the most naive
cache algorithm because you're just cheating.
You're just cheating. You're like, huh.
Now, of course, it requires cache partitioning
support. Not every storage array has that, so you've got to
go and implement it. It requires MRC
computation. No one has that
in a production system
outside of some systems that I'm working on
with some vendors.
But this stuff is not hard.
It requires a little bit of testing,
but it's not that hard.
So again, this is a fundamentally game-changing
use case for miss ratio curves.
Now, basic stuff like I mentioned
you know, this is a trivial example of two workloads, single cache. But I'm
partitioning the cache, and I say, well, this workload, look, it benefits
immediately from a cache, but then it doesn't benefit from extra cache
that much, and that's why it has that shape. Well, this other one always keeps
benefiting. So I'm going to give partition zero to client zero, but a very
small partition,
just enough to fit his workload.
And then partition 1 is bigger
because he's going to keep benefiting.
Simple idea, right?
So then we actually tested it.
And we got an average of a 40%
improvement in cache efficiency,
or hardware efficiency.
What that means is
that for the same performance improvement
the performance
improvement you got from this algorithm of
partitioning was equivalent to buying 40%
extra hardware on average.
And the best performance
out of these workloads was 146%. So these are mixed workloads, the kind of workloads that a storage
array would see; they're not trivial single-VM workloads.
They're anywhere from
8 to 20 different workloads mixed
in.
Here's the distribution of that hardware efficiency
improvement. You can see most of the workloads
in our test of 27,
I think it was, different workloads
had 0 to 10%, but the
majority, the weight, is
20 to 50%, and there are some
that benefit more. Again, what did we do?
Look at it.
What did we do?
We just computed MRCs for each workload,
partitioned the cache,
and allocated dynamically.
This was done once every hour for those workloads.
40% improvement.
How often do you get that?
It's very hard to get that.
And I didn't even add the thrash remediation in that test.
We came across that later.
Cumulatively, once you combine those two,
I think it'll be even higher, definitely over my goal of 50% that I presented at the beginning of the presentation.
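Put together, the hourly control loop they describe is conceptually small. A sketch only: get_mrc and apply_partition are hypothetical hooks standing in for however your system produces per-workload curves and enforces partition sizes, and greedy_partition is the earlier sketch.

import time

def repartition_loop(workloads, total_blocks, get_mrc, apply_partition,
                     interval_s=3600):
    # Once an hour: refresh each workload's MRC and re-split the cache.
    while True:
        mrcs = {w: get_mrc(w) for w in workloads}       # hypothetical hook
        apply_partition(greedy_partition(mrcs, total_blocks))
        time.sleep(interval_s)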
One more thing on this.
I keep saying,
look, order one, order one, order one
for space, and then
order one per I-O, so constant
time per I-O for CPU
for these algorithms to compute the
miss ratio curve, right? Well, that got us thinking, oh, it's so cheap. And on average, every one
of those function calls to update the data structure is 16 nanoseconds on a pretty slow
processor. I've measured it like 39 on a faster processor. But those are even desktop class,
but I've not even measured it on a server class yet. So it might be in a 35-nanosecond
range per I-O
overhead. So that got us thinking, well,
maybe we should do more than one of these.
So I'm like, whoa, interesting. Okay, so
hold on. So what
you see here is multiple
miss ratio curves for a single workload.
Okay? So why are there multiple
miss ratio curves for a single workload?
Well, in this particular case,
we assumed a
different cache block size and said, suppose this workload was running at 4k
or 8k or 16k or 32k or 1 meg. I'm going to compute simultaneously, it's so cheap,
you can simultaneously compute multiple miss ratio curves for the same workload
and then you just, again, undergrad could write this and come in and say, hey,
for this workload, is a 4K block size better or a 16K cache block size better? Okay. We
have seen 50% performance improvement if you perfectly pick the right cache block size
for the given cache that you want to give that workload. In some cases, we've seen 100%,
but 50% is very
common to see. You can see that, right?
Different cache block sizes, you can
imagine, right? If your cache block size is too big, you're
pulling in too much data. It's going to slow you down.
If it's too small, if it's not
quite right, then
you're not getting other benefits
in terms of metadata.
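Since the per-reference cost is tens of nanoseconds, you can afford to run one tracker per candidate block size and simply compare the predictions at the capacity you actually have. A sketch of that comparison step, assuming each candidate's curve has already been converted to predicted latency, as in the earlier sketch:

def best_block_size(latency_by_block, cache_bytes):
    # latency_by_block: {block_size_bytes: {cache_size_in_blocks: avg_latency_us}}
    # Returns (block_size, predicted_latency_us) with the lowest prediction,
    # or None if no curve has a measured point that fits the cache.
    best = None
    for block, curve in latency_by_block.items():
        fits = [s for s in curve if s <= cache_bytes // block]
        if not fits:
            continue
        predicted = curve[max(fits)]          # largest point that still fits
        if best is None or predicted < best[1]:
            best = (block, predicted)
    return best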
This is different workloads.
So you see it's per workload, right?
So depending upon the workload that you have,
the ratio between the various cache block size miss ratio curves will change.
And so you can make this decision now
completely online, right?
So that's just one example,
cache block size.
There's other ones.
Like for, you know,
most systems have both a read and a write cache.
And it's always a question,
who should I give more to?
Read or write?
Now you can answer that,
because what do you do?
You compute a read miss ratio curve
and a write miss ratio curve,
compare them.
Replacement policies,
like if you have any other tuning parameters,
like a sub-block size,
or write-through versus write-back,
all of those now just mean
every time you have a hard decision,
and you think, huh, I'm
going to put a knob in my storage array because some
SE is going to figure out in the field
how to tune
it, good luck. It never happens.
Every time you make one of those decisions
and you put a knob in the system, don't put a knob, just
compute the miss ratio curve and do it automatically.
We have the capability as an industry
now. We should be able to do this.
So we think that this could add maybe in some cases 100% hardware efficiency gain,
possibly on top of the previous two that we set up.
So is it clear that all you need is just a bunch of miss ratio curves and you can just do the comparison?
I think it's fairly clear, right?
Okay.
So I mentioned earlier that LRU is not a requirement.
You can actually do the same sort of thing with ARC and ClockPro.
Those are the two other ones that we studied. We're studying some other ones now;
hopefully we'll publish some results for that in the next year.
But as a sample, right, both of these are possible.
The way we do the non-LRUs is using something called a scaled-down simulation,
which our paper talks about,
so I encourage you to take a look at that in case
your implementation in your storage
system is not an LRU. We can
handle that. It's just a slightly different
way of actually running those computations.
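My reading of "scaled-down simulation", sketched below as an assumption rather than the referenced paper's actual algorithm: spatially sample a small fraction R of the keys by hash, feed only those references to a miniature simulated cache (ARC, ClockPro, whatever the real policy is) sized at R times each full size you care about, and take the mini's miss ratio as the estimate.

import hashlib

def keep(key, rate):
    # Spatial sampling: keep a key iff its hash falls under the sampling rate.
    digest = int(hashlib.sha1(str(key).encode()).hexdigest(), 16)
    return (digest % 1_000_000) < rate * 1_000_000

def scaled_down_mrc(trace, full_sizes, simulate, rate=0.01):
    # simulate(keys, cache_size) -> miss ratio, e.g. a small ARC simulator
    # you already have; each full size s is simulated at size round(rate * s).
    mini = [k for k in trace if keep(k, rate)]
    return {s: simulate(mini, max(1, round(rate * s))) for s in full_sizes}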
So interestingly, by the way, I don't know if you've ever seen a miss ratio curve for ARC.
As far as we know, we're the first ones to demonstrate
because we can run so many different x-axis points,
because we can run them really, really efficiently,
that ARC has these points
where it's not monotonically decreasing.
So there are times where you add cache
and the performance gets worse for a little while.
We had never seen that graph before.
And because it's a very adaptive cache,
it makes mistakes sometimes.
On average, it does well,
but it does make mistakes.
The other thing we're seeing is, remember the thrash remediation
convex hull thing I talked about? In our
initial results, what we're seeing is that LRU
plus that thrash remediation
is often better than ARC.
And that ARC plus
that thrash remediation is not that much better
than LRU plus thrash remediation. Which is,
again, LRU; it's been, like, 1960
was the time that LRU was invented.
Still, we haven't managed to totally beat it.
Every time we think a new flashy algorithm is going to beat it, some adjustment to LRU
catches up, and I think in this case it'll be LRU plus thrash remediation.
Okay, so systems implementation, an example of that is a cloud physics implementation
which we license, only needs two function calls.
You don't have to necessarily change your cache algorithm.
Before the I.O. goes into the cache, just call process ref for every block or every
database row that's being accessed or whatever it is that you're trying to model in terms
of cache.
So you give it a data structure that's been initialized, about a megabyte, and then whatever
the key is, we don't care what it is.
It could be logical block address or some opaque ID.
We don't care.
It could be hash of the block ID.
We don't care.
It just has to be unique for it.
And then we don't treat it in any special way.
So you do that for every I.O.
On average, it takes about 50 nanoseconds.
We've seen 38 to 60 nanoseconds in our experiments.
So on average, it'll be about 50 nanoseconds.
Every time you call that, okay,
and then whenever you need to draw the miss ratio curve
or have a higher level user level program
that's going to go and do the repartitioning
and all that sort of stuff,
you call this function, get histo.
That's it. That's all you need.
The rest of it is all in here.
And what that is is in the paper, yeah?
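Shaped as code, the integration he is describing is roughly the following. The call names process_ref and get_histo come straight from the talk, but the handle object, the signatures, and the wrapper are my guesses for illustration, not the licensed library's actual API.

class MrcTracker:
    # Stand-in for the roughly one-megabyte MRC data structure from the talk.
    def process_ref(self, key):
        # Called once per IO with any unique opaque key (LBA, row ID, a hash
        # of the block...). Internally it samples; cost is tens of nanoseconds.
        ...
    def get_histo(self):
        # Called occasionally (minutes or hours apart) to pull the histogram
        # that the miss ratio curve is drawn from.
        ...

def cached_read(cache, backend, tracker, lba):
    tracker.process_ref(lba)      # 1) record the reference before the lookup,
                                  #    so the curve reflects the workload, not
                                  #    the cache you happen to have today
    data = cache.get(lba)         # 2) the normal cache path is unchanged
    if data is None:
        data = backend.read(lba)
        cache.put(lba, data)
    return data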
And you don't have to call that for every transaction.
You call it for every transaction.
The system underneath does nothing for most transactions because it's sampling.
But the way that it's sampling and adaptively sampling is what makes the magic work.
Okay?
So you should call it for everything and let us adaptively decide which of the transactions or
keys or blocks or rows that you're accessing that we're going to do a lot of work for and which ones
we're just going to sample out. Okay? Every time. Yes? Yes. You know, I mean, 50 nanoseconds is like
about one memory request, and it pipelines really well.
So it should not be...
Rob, you're the first one I've seen
complain about 50 nanoseconds as being an issue.
Come on, this is a storage stack. The storage stack
usually is like 10 microseconds of software.
I'm saying 50 nanos.
Come on.
Okay, so
for your reference,
when you look at the slides later, I talk
about a whole bunch of different use cases that I have not covered, that once
you have a miss ratio curve become enlightened, and where you can actually get
either better performance or better service-level assurances. So there are
some multi-tenant environments in the world that are very popular, like cloud
service providers.
I'm not going to go through this in detail, I only have seven minutes. But a lot of people
in the world are using some sort of global LRU. A global LRU is when you have multiple
clients and they hit the same LRU. Now what happens there is like pigs at a trough. The
bigger one just gets bigger. So if you have a
global LRU and you've got a workload that's really intense, what's it going to do?
It's going to push everybody else's workload out. But it's priority-naive, and
using a naive system it's very hard to even know where the pressure is
and what to do about it, how much to mitigate the pressure. It's just very
hard to know that, right?
So this is like getting an x-ray of your trough or the pigs at the trough and trying to figure out, well, how are you going to manage this, right?
What are you going to do about it?
So anytime you have multi-tenancy, right, this just accentuates the need
for being able to have a data-driven way of online managing the ability of your clients
to get access
to the performance resources that you've invested in.
So I won't read through it, but I think by now, hopefully, that's all in context.
So what I want to do is I want to do a step-by-step recap.
So to be able to use this technology, the first step is to calculate the miss ratio curve.
So pictorially, I'm drawing that out with some steps
that once you see the previous year's tutorial,
you'll know what they are.
But again, the point is that they're trivial, really,
once you see them.
It took us years to build it,
but now once you look back,
you're like, wow, why didn't I think of that in the last 20 years?
So first step, miss ratio curve.
So trivial integration with your current cache code base.
A couple of functions: one per IO, very fast, and whenever you need to put that in a UI or
run a higher-level process, you just pull the memory out, compare everybody, make a
decision, and then come back, you know, minutes or hours later to finish it
off. Second thing you can do is display the cache utility curve for DevOps
people, for people
trying to do optimizations in your system, for SEs, for customers to be able to self-manage.
The third thing is auto-tune cache parameters.
Let's get over those days where we ship 50 different parameters and nobody ever touches
them.
So as engineers, we feel real good.
It's like, oh, look, we got so many knobs.
And in our labs, we can show 50%, 100% performance improvements.
Well, enough is enough.
Let's just not do those knobs or do them, but have them automatically set.
Because now we can actually accurately predict on a per workload or per lawn or per whatever basis
what the performance difference will be for every one of our knobs in that cache.
Step number four, thrash remediation.
We talked about this in a lot of detail.
Once you have the MRC, this starts to become easy,
and you can tame the worst workloads.
In fact, the worse the workload, the more awesome this looks
because that's exactly what it's designed to address.
So a very cache-friendly workload will get no benefit from this.
An unfriendly workload could get 100% performance improvement from this.
Next is latency guarantees. So now we've got this really nice curve, okay? But that really
nice curve hides in it the ability to say, I can control how much cache I give this workload.
If I can control that, I can predict the performance improvement they're going to get, and I can
predict the latency. Now, the same thing can be done for IOPS. I care less about IOPS as
a human being than I care about latency, but that's just me. But if you wanted to do IOPS, it's fine.
I mean, it's just a slightly different curve.
You know how I showed you how to compute using the miss and the hit ratio and the hit and miss latency?
The same thing can be done for IOPS, and you can get the curve for IOPS,
and then it'll just be the opposite curve.
It'll be higher is better.
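One crude way to sketch that IOPS curve, and this is my shorthand rather than anything on the slides: assume a roughly fixed number of IOs in flight and apply Little's law to the predicted average latency.

def iops_curve(lat_curve, outstanding_ios=32):
    # Predicted IOPS per cache size: throughput ~= concurrency / latency.
    return {size: outstanding_ios / (lat_us * 1e-6)
            for size, lat_us in lat_curve.items()}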
Okay.
So now, for the first time in the industry, we can use the cache as a control mechanism, as an SLO guarantee sort of mechanism.
I think that's a game-changing opportunity.
Next, accurate tiering decisions are a close cousin to the last couple of things we talked about.
But instead of just having two regions, the cached region and the uncached region, and you change the boundary, now we have multiple.
Maybe the last one is a cloud with maybe a 20, 30 millisecond latency, 50 millisecond latency, right? Even in those scenarios, the iterative
algorithm that can sit on top of this and change multiple barriers can find sub-millisecond
latency even when your worst case latency is 50 milliseconds or 100 milliseconds because
we can accurately size all the intermediate caches, right, and do that iteratively, okay?
Next step, multi-tenant partitioning.
So we saw a case where we had average of 40%,
in some cases up to 150% performance improvement
by just repartitioning the caches once an hour,
not even minutes or anything like that, once an hour.
All we needed was the miss ratio curves for each client,
and we would adapt and redo it.
And in this example, I'm just showing two,
but in the testing we did,
and we were going up to 20 different workloads running
at the same time to get that. Now, the cloud tier we already talked about; that's
kind of a special case of step number six. And then the final
piece that I'm going to leave you with, and then we have five minutes for questions.
So the final piece here, this is an open area; we're working with universities on this
right now. Someone had alluded to this as a question
earlier, which is, hey these
miss ratio curves, how stable are they?
Are they shifting and kind of undulating
and moving throughout the course of a day or a week?
Well, they are; they're
reasonably stationary when you're at a day scale.
Now, the nice thing is, as the
miss ratio curves change, the algorithms adapt and give you
the new miss ratio curve,
but we don't yet have a theory of the behavior of these workloads and the behavior of the MRCs.
We don't have a theory about that yet.
So we're very actively working to try to figure that out.
And I think the interesting thing is, here's the amazing thing, right?
Imagine if instead of just knowing how much to size the cache or how much to not size
the cache on a daily or hourly basis,
you could come in and say, I'm going to turn the cache off.
Because it's better for me to turn it off now than to turn it on later.
And so there's some heat map type of stuff that we're working on where we're trying to figure out,
are there some kind of theoretical results?
Just like the previous work that we did.
It started off with ad hoc stuff and ended up with a theory. We're trying to develop a theory for being
able to do time series analysis of miss ratio curves and then to put them in
a control loop. It'll probably take us another couple of years to have good results, but
we have some initial results. So the idea is to take the basic computation of the
miss ratio curve and then apply various types of data analytics on it, so that we can get even time-varying prediction,
not just kind of a stationarity assumption.
Okay.
So that's all I have.
So I hope that I've managed to interest some of you
in following up on this work
and taking a look a little bit deeper.
I think we're at a stage now. You know when the technology is first invented, the first
kind of work in this was about three years ago, where there's a barrier to entry because
even the inventors haven't figured out how to explain the technology yet, right?
Haven't really built the use case, haven't figured out how to explain it.
I think we're at the point of marginal return, so waiting any longer than today, as vendors or as systems
or cache designers, I think is not going to give you much benefit,
because I think most of the work has been figured out here.
Most of the challenges have been addressed.
I'm not saying there are no challenges, and some of these things will still take you some time to implement.
But we're at that point of marginal return in terms of your return for waiting.
So I don't think there's any point in waiting anymore.
I think we should be investing in this as an industry.
Questions? Comments?
Yes?
Most of the talk that you gave here was always about block addresses. So how does it cater to whole-system memory that's byte-addressed?
Does it just shift over?
I mean, like, is it exactly the same for those types of designs?
Yeah, it's exactly the same.
So the way to think about it is that any place you might want to insert a cache, once you
put data in the cache, you've got to get it out.
So you key every element in the cache with some unique key.
The miss ratio curve computation only uses that key, nothing else.
It doesn't care about your cache implementation; it reasons about it, but it's not tied to it. And it doesn't really care about
the cache size that you currently have or might have, right? It just says,
look, just give me the list of keys. That might be logical block addresses, it
might be Redis, you know, in-memory, like random keys, or something like that.
Whatever it is, it doesn't matter. So that's for the miss ratio computation.
So as soon as you do the miss ratio computation,
everything else is built on top of it.
So it should be safe.
Yeah.
Good question.
Yes?
One question.
For caching, you have certain underlying assumptions about
LRU and ARC. To support tiering,
are you making other assumptions about how the tiering is done?
Or is this dependent on the tiering algorithm
underneath the data?
So when we looked at tiering, the assumption being made
was that, well, this particular example is LRU.
The same example could have been done with ARC.
So I think the more important question is,
what if part of your system is LRU and part of it is ARC and part of it is some other sort of mechanism
that's not even like a cache?
It could be, for instance, you know,
some other like machine learning system running underneath
that's saying, hey, these blocks are accessed
every like four days and a half.
So I'm going to pull it in or something like that, right?
So that is open.
I don't know the answer to that.
But as long as it's a traditional
cache algorithm, we can deal with it in this
way. But I
do know, for instance, some EMC arrays, they don't
use a cache algorithm for their tiering.
On the other hand, what this tells you is
if you were to cache, what would
the benefit be? Now maybe you can do better than that.
But I actually kind of doubt that most systems,
once you pull in this type of technique,
can do much better than just a pure cache.
Because, I mean, again, we have 57 years of history
of people trying to beat LRU and not being able to.
Good question.
So I would really like it
if I got a chance to say hi to all of you.
There might be a talk in here.
I'm not sure.
We can step outside,
but it would be nice.
I always like to spend an hour with you.
We'd love to kind of know who you are
and exchange LinkedIn maybe
and keep in touch.
Thank you for taking the time.
I appreciate it.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further
with your peers in the developer community.
For additional information about the Storage Developer Conference,
visit storagedeveloper.org.