Storage Developer Conference - #134: Best Practices for OpenZFS L2ARC in the Era of NVMe
Episode Date: October 7, 2020...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual
Storage Developer Conference.
The link to the slides is available in the show notes
at snia.org slash podcasts.
You are listening to SDC Podcast
Episode 134.
My name is Ryan McKenzie. I am a performance engineer at iXsystems.
And if you don't know what iXsystems does, it's actually a pretty cool place to work,
because they allow us to come to conferences like this and they support us to come and network with everyone.
And then they also allow us to divert ourselves and some of our effort
to do interesting investigations like this and share the results.
We are the maintainers of the FreeNAS project,
and we also have an enterprise storage line called TrueNAS,
which is based on that software with some extra features and stuff.
But just thanks to iXsystems for sending me.
So that's my current life.
In a previous career, I was a lecturer in the Department of Computer Science at University of Kentucky.
So if you need me to slow down or repeat something, that's fine.
I know not everyone speaks banjo.
Banjo is slow.
We're okay with it.
Slow banjo.
All right, cool. But also, you know,
I've been up in front of rooms of 700 college students
But this is a little bit of a different environment
Where there's probably people in the room
That know more than me about ZFS
Anyone from Oracle? So I'll give you guys $20 if you don't grill me too hard on ZFS. It is
OpenZFS we're talking about here today, not Oracle ZFS. But anyway, I'll just go ahead and get
started. That's a little bit about me and where I work. So what are we doing today? Just a brief
overview of how OpenZFS works. So really something that's not on the slides, and I like to talk a
little bit off the slides from time to time.
It's a good thing that when I was a teacher, I kept my students a little more engaged.
The two big pictures for this agenda, if you will: there's a technical big picture. L2ARC, the Level 2 Adaptive Replacement Cache, has for a long time now carried
the recommendation of don't use it with certain workloads, but maybe it'll work with future technologies.
And now that we have NVMe SSDs becoming very available, we are going to reevaluate that.
The other big picture of this talk is sort of a meta point, if you will:
knowing how to tune your system means you must know your solution,
you must know your architecture and your implementation,
and you also must know your workloads and the applications
you're going to deploy them into. So I don't know what's making that noise. Is it the projector?
Maybe my laptop is doing that. I don't know. Yeah, it sure was. Oh, yeah, audio goes over HDMI.
How cool.
Told you I wasn't a developer.
Okay, so let's start with a brief overview of ARC and L2ARC.
So ARC stands for Adaptive Replacement Cache.
It resides in main memory, so it's global.
On the server, the filer, it's shared by all the storage pools. Basically all the incoming
or written data goes through the ARC in one way or another. And the really cool thing
about ARC is it balances between most frequently used, MFU, and most recently used, MRU. And that provides some protection
against things like cache scanning,
where if you have a backup workload or something,
you still have your most frequently used lists
in your ARC that aren't getting really perturbed
by the backup workload and things like that.
So ARC is really neat.
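If you want to see that MRU/MFU balancing idea in code, here's a toy Python sketch; it is not the real arc.c logic, and all the names in it are made up, but it shows why a one-pass scan can't push your hot blocks out.

    from collections import OrderedDict

    class ArcSketch:
        # Toy two-list cache: most recently used (MRU) plus most frequently used (MFU).
        def __init__(self, capacity):
            self.capacity = capacity           # total blocks we can hold
            self.mru = OrderedDict()           # blocks seen once
            self.mfu = OrderedDict()           # blocks hit more than once

        def access(self, block):
            if block in self.mfu:              # repeat hit: stays near the head of MFU
                self.mfu.move_to_end(block)
            elif block in self.mru:            # second touch: promote MRU -> MFU
                del self.mru[block]
                self.mfu[block] = True
            else:                              # first touch: lands on MRU only
                self.mru[block] = True
            while len(self.mru) + len(self.mfu) > self.capacity:
                victim = self.mru if self.mru else self.mfu
                victim.popitem(last=False)     # evict from the cold tail

    cache = ArcSketch(capacity=8)
    for block in ("vm1", "vm2", "vm1", "vm2"):     # hot VM blocks get promoted to MFU
        cache.access(block)
    for block in range(100):                       # a backup-style scan only churns MRU
        cache.access(("backup", block))
    print(sorted(cache.mfu))                       # ['vm1', 'vm2'] survive the scan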
This is the focus of our talk, L2ARC,
the Level 2 Adaptive Replacement Cache.
It's actually configured and added on a per-pool basis,
so some of your pools can have L2ARC
and some of them can go without,
and L2ARC can even be configured differently
on different pools based on what those pools are used for,
which is good.
There's a concept of warm data,
things that are about to be evicted from ARC.
And we'll learn more about that in just a minute,
but the L2 ARC tries to feed itself with blocks
that are about to be evicted from the main memory ARC.
So we'll talk about what that means
and how it determines what is about to be evicted from the main memory ARC.
The other interesting thing to note is that all of the blocks that
are in your L2 arc have a header component that's actually stored in the main memory.
So there's a couple caveats to remember there. I'm going to move really quickly through these introductory slides, but feel free to ask
questions.
When ZFS writes, you've got a write request that comes in.
If it's an async write, as soon as that block gets copied to main memory, we're going to
acknowledge to the requester
that the write is complete. A sync write takes a little bit longer. On
this slide, I want to give a plug for Nick Principe's talk last year at SDC 2018. He
gave a very similar talk to this last year, but his talk was how you could improve this portion of your OpenZFS system using NVDIMMs.
So I'm giving a talk about maybe using NVMe SSDs here.
He gave a talk about using NVDIMMs here.
And the ZIL is the ZFS Intent Log,
and a SLOG is a separate log device for it.
So the ZFS Intent Log, by default, is sort of striped across all of the data VDEVs,
which can be slow if you have a very heavy synchronous write environment.
A lot of NFS environments are pretty synchronous heavy.
So one way to speed up the acknowledgement process is to put a fast device as a separate log, and that can be an SSD, but it can also be an NVDIMM in certain cases,
or an NVMe SSD, either one.
So at some point later in time,
the block that has been written gets copied to stable storage.
It's no longer quote-unquote dirty and all is well with
the world and now we have this neat block in ARC. So we've got a block that sits in ARC and if that
block is going to be re-hit, that block is going to be hot, it will stay there. It will be promoted
to the heads of the most recently slash most frequently used lists in the ARC, just like any other
block.
And if it's just something that gets written once and never gets written again, it will
fall to the cold part, the tail of the ARC, and it will get evicted.
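Just to make that acknowledgement timing concrete, here's a toy sketch; the latencies in it are invented numbers, not measurements, and it's conceptual rather than how the ZIL code is actually structured.

    import time

    # Purely conceptual sketch of when ZFS acknowledges a write. The latency numbers
    # are invented for illustration; they are not measurements of anything.
    RAM_COPY_S  = 0.00001    # copying the block into ARC as dirty data
    ZIL_FLUSH_S = 0.0001     # persisting the intent log record (faster if the ZIL is on a SLOG)

    def async_write(block):
        time.sleep(RAM_COPY_S)        # block lands in main memory
        return "acknowledged"         # caller hears "done" right away;
                                      # the transaction group sync to the data vdevs happens later

    def sync_write(block):
        time.sleep(RAM_COPY_S)        # still lands in main memory...
        time.sleep(ZIL_FLUSH_S)       # ...but we also have to persist it to the ZIL / SLOG
        return "acknowledged"         # only now can we acknowledge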
Reads.
Reads are a little different.
So if a read request comes in and the block we're looking for is in ARC we have an Arc hit, which is awesome. We're hitting main memory. That's what we
want. That's why certain ZFS-based servers have many, many terabytes of main memory,
because Arc will grow to consume all that memory and use it and it'll give it back if your system needs
it, but we want arc hits with reads. Arc hits are really, really low latency, very, very
fast. If it's in L2ARC, we're going to see from our header over here in the ARC, oh, that's a block that I've got stored in my L2 device.
We're going to go get that block from L2, which is another tier that's sort of between your main memory and your slower data VDEVs here.
And then we're going to return that.
In the case of a miss, we have to actually copy data up from our slower data VDEV.
So we're going to copy this block that's been missed on up.
And then we're going to acknowledge the read.
Yes?
All right, real quick.
What is the process by which it moves from ARC to L2ARC?
Is it just a timing issue?
How long has it been in?
Yeah, that'll be one of the next slides.
But yeah, it's a pretty important concept
of how things, once they get warm or cold in ARC,
how they get into L2ARC.
Yeah, that's one of the next slides.
Thank you for the good segue.
So we're going to, this is called a demand read.
So we're demanding data from our VDEVs in our pool up into ARC.
We're going to acknowledge it. Then there's the ARC prefetcher:
there's a part of an ARC read system call
that tries to determine if something is a streaming workload.
And if it thinks it's a streaming workload,
it may not actually be a streaming workload.
It just, if it thinks it is,
it's going to actually copy some more blocks too
to maybe try and avoid another miss.
So we have concepts in Arc.
If you look at the Arc statistics,
we have concepts of Arc miss,
Arc data miss,
Arc prefetch reads,
prefetch misses,
metadata reads,
metadata misses.
There's lots of statistics around it,
which is awesome for a performance person
because if you capture all that stuff,
you can really do some damage.
And then once that stuff is demand read into ARC,
it again stays there.
So this is really, think of this as there's two ways
something gets in ARC.
It either gets in ARC because it was demanded
or it gets in ARC because it was written. Those are the two ways things get in ARC. Now, how do things get in L2
ARC? Oh, wait a minute. Yes, this is how things get in L2 ARC. So as I said before, ARC will
actually grow to consume almost all your system's memory. It's kind of a feature. If you read the forums, a lot
of people don't think it's a feature, but if it's well tuned, it's a feature. So what's
happened is you have to have a headroom of available main memory for incoming writes.
Because think about what would happen if your memory was full and every incoming write has
to hit that memory. If there was no free memory
and you had to immediately sync your dirty data transaction group
down to your slow spinning-rust drives,
your write latency would just go through the roof, right?
So we have to sort of do some housekeeping
and maintain some room in ARC for incoming writes.
So that's what happens. That's also why the L2 arc feed is
completely asynchronous of the arc reclaim or the arc things falling off the end of arc because
imagine if you've got a block that's cold, and it's in your main memory,
and there's a write coming in.
And if that write has to wait for a cold block
to first be copied out to L2ARC, well, this L2ARC is faster
than your data VDEVs, but it's also slower than memory,
so this is actually going to increase your write latency too.
So this whole process being asynchronous,
loading L2 arc, loading L2ARC,
feeding L2ARC asynchronously of ARC evictions and cleaning up ARC, it really is optimized to
keep our write latencies very low. So periodically, and these things in yellow are tunable,
so how often it is scanned, how much it scans, how many blocks it copies into L2ARC is all tunable. So how often it is scanned, how much it scans, how many blocks it copies into
L2 arcs, all tunable. So that's really all three of those things combined are called
the feed rate. The feed rate of L2 arc is tunable. So we're going to go through and
we're going to look at close to the tails. And just imagine if there's a dotted line
here. You're going to look at either more of the tail or less of the tail.
That's called headroom.
That's tunable.
So we're going to look in here and go, okay, these are getting ready to fall out of arc.
We don't really know anything about them, but we're just going to copy them into L2 arc in the background for fun.
So that's really how L2 arc feeds.
It's completely asynchronous.
It doesn't really know anything about those blocks.
It's just saying, oh, these are getting ready to fall out of arc,
so we're going to take those into the next tier down
instead of letting them fall all the way out to slower stable storage.
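Roughly, one wake-up of the feed thread looks like the sketch below; the constants are named after the real OpenZFS tunables, l2arc_write_max and l2arc_headroom, but the values and the loop itself are a simplification for illustration, not the actual arc.c code.

    from collections import namedtuple

    # One wake-up of the L2ARC feed thread, sketched. The constants echo the real
    # OpenZFS tunables (l2arc_write_max, l2arc_headroom), but the values and the
    # loop are an illustration, not the actual arc.c implementation.
    L2ARC_WRITE_MAX = 8 * 1024 * 1024       # max bytes fed per wake-up (the feed rate)
    L2ARC_HEADROOM  = 2                     # scan this many multiples of write_max up the tail

    Block = namedtuple("Block", "key size")

    def l2arc_feed_cycle(arc_tail_blocks, l2arc_writes, l2arc_index):
        # Scan near the cold tails of the ARC lists, copy eligible blocks to L2ARC.
        scanned = written = 0
        for block in arc_tail_blocks:                        # coldest blocks first
            if scanned >= L2ARC_HEADROOM * L2ARC_WRITE_MAX:
                break                                        # only look so far up the tail
            scanned += block.size
            if block.key in l2arc_index:
                continue                                     # already in L2ARC, don't re-copy
            if written + block.size > L2ARC_WRITE_MAX:
                break                                        # this wake-up's feed budget is spent
            l2arc_writes.append(block)                       # async copy; ARC eviction never waits on it
            l2arc_index[block.key] = True                    # the header stays in main memory
            written += block.size
        return written

    # The feed thread repeats this every feed interval, independent of ARC eviction.
    tail = [Block(key=i, size=128 * 1024) for i in range(1000)]
    print(l2arc_feed_cycle(tail, l2arc_writes=[], l2arc_index={}))   # caps out around L2ARC_WRITE_MAX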
Okay.
So when L2 arc, quote, unquote, gets full,
which means we're sort of at the end,
so things get written in a round-robin fashion to all the available L2 arc devices.
So if you have two L2 arc devices or four L2 arc devices,
they start writing in a round-robin
fashion, and then when you get to the end, it just says, okay, let's just go back to
the beginning.
And then, you know, it starts back at the beginning with the new blocks that are being
promoted, and it invalidates the old indexes to these in the headers.
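The rotor and the wrap-around behave roughly like this toy sketch; the device names and sizes are made up and the real code is more involved, but you can see why the old headers get invalidated when we come back around.

    # Toy sketch of the L2ARC device rotor and the wrap-around. Device names and
    # sizes are made up; the point is what happens to the in-memory headers when
    # we come back around to the start of a device.
    class L2ArcDevice:
        def __init__(self, name, capacity):
            self.name, self.capacity, self.offset = name, capacity, 0

        def write(self, key, size, headers):
            if self.offset + size > self.capacity:
                self.offset = 0                              # device "full": wrap to the beginning
            # whatever lived in the region we are about to overwrite loses its header,
            # so a later read of that block becomes an L2ARC miss
            stale = [k for k, (dev, off) in headers.items()
                     if dev == self.name and self.offset <= off < self.offset + size]
            for k in stale:
                del headers[k]
            headers[key] = (self.name, self.offset)          # the index itself lives in RAM
            self.offset += size

    headers = {}                                             # in-RAM L2ARC index
    devices = [L2ArcDevice("nvd0", 100 * 128 * 1024), L2ArcDevice("nvd1", 100 * 128 * 1024)]
    for i in range(500):                                     # feed 128K blocks round-robin
        devices[i % len(devices)].write(key=i, size=128 * 1024, headers=headers)
    print(len(headers))                                      # only the most recent blocks are still indexed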
And, you know, so L2ARC is,
if you have L2ARC in your system,
it is constantly feeding something at some rate.
And that's a lot of what we're going to talk about today is what do we feed it, how fast do we feed it,
that kind of thing.
Under what workload scenarios.
Yes.
Yes. So when there's no L2 arc, these just get reclaimed.
And they disappear from main memory.
They don't have another tier to go to,
so they're going to be on your data VDEV.
They're going to be on your spinning rust
or whatever your slowest,
whatever quote-unquote stable storage is in your system.
So L2ARC is sort of making an effort to, instead of just dumping these all the way down to
stable storage, L2ARC is making an effort to put some things in a middle tier that makes
them maybe useful at some later time. Ideally, L2ARC should be sized based on what you're using it for, obviously, but that's kind of
the old classic pyramid, right?
We teach in computer sciences, you know, your CPU and your on-chip cache is at the very top,
but it's very small, and then your main memory is a little slower and a little bigger,
and then your next tier is a little slower and a little bigger.
So, yeah, that's conceptually the idea.
Yes?
Yes. If you keep writing the block,
so the question was if you keep writing a block
lots of times,
does it end up being
just overwritten in L2ARC? There's two mechanisms to keep that from happening. There's the most
frequently used list in ARC. If something is close to the top of these lists and something
that you're hitting over and over again is going to be close to the top of this list,
it's not going to get down here to the point where it's in L2 arc. And the other mechanism is if it's already in L2 arc,
it's going to look up here and go, oh, I've already got that.
So I'm not going to copy it again.
So any block can be in three states.
It can be either in arc only or L2 arc only.
It can be in both.
But that is only sort of a temporary state.
So it can be in both right now.
I've copied it from the tail of this list.
It's here until it gets evicted.
Now, if this thing gets re-hit and gets promoted back,
you actually have it in both places.
That's what we call a wasted feed.
We've fed something to L2 arc
too early. Basically, we've fed too fast. It never was going to fall off the end of
arc. Any other questions? This is key: if we understand how the L2ARC feeds, we can understand how to tune it in different scenarios.
Again, if an L2 arc block then becomes dirty at some point,
let's say we've got a block here that we've brought into L2 arc,
and it gets written from the outside,
arc is just going to handle that.
Arc is, as we saw on the third slide or whatever,
ARC is going to say, okay, this is a dirty block. I'm going to put it as part of a transaction group, and I'm going to get it to stable storage at some point. And then this block in ARC
is just not going to be used until it circles around and gets written with something different.
So we don't ever have anything dirty in arc because it would be an extra
step to have to flush arc, or sorry, we don't ever have anything dirty in L2 arc because
it would be extra steps to flush L2 arc and slower.
So here's just some notes. These are very text-heavy slides.
So like a true teacher that I am,
I'm going to give you guys homework.
I'm not going to read them all.
You can download the slides if you're very interested.
But this is a very interesting point.
The blocks are variable size.
So if you've got a block workload
or a file workload or an object workload,
whatever your block size that you've set on that particular data set or that particular pool or that particular share, that's what the size of the block is going to be in ARC.
You can have, you know, you can have 4K, you can have 1 meg.
So some of these blocks are 1 meg and some of them are 4K on a mixed use system.
We're just going to talk about blocks. And whatever those are, they are. But that, you
know, takes into, you have to take that into account when you're sizing, especially if
you have everything set to a really small block size, you're going to have lots and
lots of metadata. So you may have to tune your metadata percent of ARC; you've got a
lot of metadata overhead. If you've got a petabyte system with 4K block sizes on everything,
that's probably a misconfiguration.
Your SE is going to be in trouble.
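To put rough numbers on that, just count how many blocks you'd have to track; this is back-of-the-envelope arithmetic, not anything OpenZFS reports.

    # Back-of-the-envelope: how many blocks does a pool have to track at a given
    # recordsize? Nothing here is an OpenZFS constant, it is just arithmetic.
    PiB = 1 << 50
    for recordsize in (4 * 1024, 128 * 1024, 1024 * 1024):
        blocks = PiB // recordsize
        print(f"recordsize {recordsize // 1024:>4} KiB -> {blocks:>15,} blocks to track")
    # recordsize    4 KiB -> 274,877,906,944 blocks to track
    # recordsize  128 KiB ->   8,589,934,592 blocks to track
    # recordsize 1024 KiB ->   1,073,741,824 blocks to track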
Blocks get into ARC either by a demand or prefetch read miss or by a ZFS write.
That's the only way they get into here.
Now, this is worth the price of admission to this talk.
Here's the configuration pitfall.
It can only get into L2ARC if it's been in ARC first, right?
We learned that just a few minutes ago.
So what a lot of people do, like, oh, okay, so let's set our primary cache, let's set our ARC to metadata, because it's fast, and then let's set our L2 ARC to catch everything else, all.
Well, guess what?
You're not going to get anything in L2 ARC, right?
Because it can't get into L2 ARC unless it was already in ARC.
So that's something you have to know about the way ZFS caching works.
Because a lot of people hit this, and we see this about every month on the FreeNAS forums.
They're talking about how their L2ARC never has anything in it.
It's because there's these primary cache, secondary cache settings,
and they've got them set in such a way that you're telling L2ARC to grab things that are
never in ARC.
So it's looking at the tails of the ARC list going, there's nothing there I'm allowed to
copy.
There's nothing there I'm allowed to feed.
All right, so that's a configuration pitfall.
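Here's a rough sketch of how you could catch that misconfiguration from a script; it shells out to the standard zfs get command, and the dataset name at the bottom is just a placeholder.

    import subprocess

    def cache_settings(dataset):
        # Read the primarycache / secondarycache properties via the standard zfs CLI.
        out = subprocess.run(
            ["zfs", "get", "-H", "-o", "property,value", "primarycache,secondarycache", dataset],
            capture_output=True, text=True, check=True).stdout
        return dict(line.split("\t") for line in out.strip().splitlines())

    def warn_if_l2arc_starved(dataset):
        props = cache_settings(dataset)
        # L2ARC can only be fed with blocks that were allowed into ARC first, so
        # primarycache=metadata plus secondarycache=all leaves the L2ARC with no data blocks.
        if props.get("primarycache") == "metadata" and props.get("secondarycache") == "all":
            print(f"{dataset}: secondarycache wants blocks that primarycache never admits")

    # warn_if_l2arc_starved("tank/vmstore")    # dataset name is a placeholder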
L2ARC is not persistent, parentheses yet.
The blocks themselves are persistent in the L2ARC, because you're talking about an NVMe
SSD or you're talking about an SSD or maybe a 15k SAS drive in
some deployments. The actual blocks are there, but what's not there? Something we saw earlier.
The header information, right?
The headers reside in main memory,
so when a user reads, we're going to look in our header,
and if we lose power or reboot the system and lose main memory,
the header information that points to those blocks is gone.
So basically when the system comes back up, it goes,
oh, L2 arc's empty.
I'm going to start at block zero.
That's a configuration pitfall and
something to note there.
Write heavy workloads can churn the most recently used.
It's not going to churn the most frequently used. We talked about that a minute ago.
Prefetch heavy workloads can scan.
So like I was talking about a backup workload can scan the ARC
because if you are the recipient of the backup, you're going to get writes.
That's the first bullet point.
But if you are the source of the backup, you're going to have a lot of sequential reads,
which are just going to go through and your most recently used list is going to completely be scanned
all the time until the backup is complete. That's a neat feature of Arc, how it has the
most frequently and the most recently used lists, because it sort of protects your VM
guest OS data that you want to stay in the most frequently used list, it sort
of stays there. What's happening is you're going to, yeah, let's see, I think I already
talked about all that. Blocks that L2Arc doesn't feed may not be hit ever, or blocks that L2Arc
does feed may never be hit again in the case of scanning and churning, and that's a wasted feed.
Alright, so let's talk about some key performance factors that we get from that, because the first
step is really understanding the architecture.
So what's a random read heavy workload going to look like on this guy?
Well, L2ARC is actually designed for this. Brendan Gregg put L2ARC in ZFS back around 2008, and all of his comments in arc.c
and a lot of his blog posts and some of his papers that he wrote about it, it is designed
for being a random read heavy cache. So we expect it to be very good.
Sequential read-heavy, not originally designed for this, but again, the original design document said future SSDs, future storage technologies might make it such that it's going to work
out.
So we're going to revisit that today.
That's the big data component of today's talk.
Write-heavy workload, it doesn't matter if it's random or sequential,
it's going to cause what we call memory pressure. Every time we write to a ZFS file system we
need that block in memory to be free. And if we're doing a ton of writes it's going
to cause what's called memory pressure so we're going to very, very quickly be reclaiming
blocks from the tails of the ARC lists.
And a lot of those are probably going to go by without being fed into L2 arc.
Some of them are going to be fed into L2 arc and probably never touched again, and that's
a wasted feed.
So the design intention was to do no harm, but we think there may be an impact,
and we're going to show some of that later: the background performance cost of actually feeding
L2ARC. So if you're in this situation, your L2ARC's never really going to be warm, because
it's always going to be grabbing new stuff, because the tails of those lists are going
to be perturbed all the time. So it's going to be grabbing new stuff all the time and it may never be hit again.
So we have to, and actually those mem copies,
they use CPU time, they use resources.
Those NVMe SSD drives, writing to them actually costs us.
So in some cases it may be better not to have the L2 arc
or at least mitigate how much feeding you're doing
of the L2ARC to prevent that. The active data set size, if it's small, it's going to
probably fit in L2ARC or it may never be fully warm, but we'll talk about that. So really,
active data set size is a very important thing for you to know about your workload, about
your solution that's deployed,
because otherwise you're not going to really know if you're going to be all in cache or
in cache plus L2 arc or largely on disk, and that really impacts how you design your solution.
All right, so let's look at a few tribal knowledge things.
Now, I've only been around ZFS land for about a year,
a little more than a year.
So there's probably more secret incantations out there.
But these were ones that I heard amongst our ZFS developers around,
is that L2ARC is not helpful for sequential workloads.
But if you do have a sequential workload,
you need to segregate it to a different pool.
So you have a pool over here that's for sequential
streaming, and you have a pool over here
that's for your random VM stuff.
So you would fix that at config time,
or at deployment time.
There's a no-prefetch setting that basically
says L2ARC, you can feed yourself anything out of arc except for things
that have got there by the prefetcher. Remember that slide we talked about where we missed
something and demand read it up into the arc, and the prefetcher said, well, this looks
like it might be streaming, so I'm going to get these three or four blocks here. There's
actually accounting on this, and the accounting says these blocks were demand, these blocks were prefetched, because that will actually tune
it, the ARC will actually tune itself later if it finds out that almost everything it's
prefetching is a miss. It'll say, okay, maybe let's not do as much of that.
So the ARC is actually a really cool self-tuning thing. Sometimes it's smarter than it needs to be.
And then a lot of people say set the secondary cache so that only metadata gets into the L2 arc for those pools and data sets that are doing streaming workloads. So even though
you may not be hitting, you've got these large streaming files, you may not be hitting the
data blocks very much.
Let's just put the metadata in there, and that's going to help.
So we're going to try and validate some of these.
But they all seem more or less plausible, given what we've learned about the architecture.
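For reference, those pieces of tribal knowledge map onto real knobs; the parameter and property names below are the real OpenZFS ones, but treat the summaries as rough, and check the defaults on your own OpenZFS version.

    # The tribal knowledge above maps onto real knobs. The parameter and property
    # names are the real OpenZFS ones; the summaries are rough, and defaults vary
    # by OpenZFS version, so verify them on your own build.
    L2ARC_MITIGATIONS = {
        "segregate sequential workloads": "separate pool (or dataset) with no cache devices",
        "don't feed prefetched blocks":   "l2arc_noprefetch module parameter / sysctl",
        "cache metadata only":            "zfs set secondarycache=metadata <pool/dataset>",
    }
    for mitigation, knob in L2ARC_MITIGATIONS.items():
        print(f"{mitigation:32} -> {knob}")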
Here's the fun part.
So here's our solution in our test.
I sort of highlighted in yellow the parts that are sort of important.
So we have a 512 gig main memory on this system.
That's more or less what your arc is going to be, the arc size.
Because on a FreeBSD OpenZFS-based storage server,
your arc is going to grow to, if your workload is sufficient,
and in a performance lab it's always sufficient,
your arc is going to grow to consume almost all your main memory.
We have some enterprise-grade dual port,
because just saying an NVMe is not enough.
Just saying, oh, I've got some NVMes in there.
Well, what are you talking about? Are you talking about, you know, Evo 960s or M.2s or U.2s? What are you talking
about? So we've got a lot of hard drives and
we're working with a big dataset, well fairly big dataset.
It's a 1.2 terabyte dataset. It is
specifically designed, that active dataset size was specifically chosen
because it will not fit in ARC.
So we're always gonna be ARC missing a lot.
It's also designed so that the active dataset
will always fit into L2ARC.
Even if I'm using one of those 1.6 terabyte drives,
it'll fit in there.
If I'm using two, it'll fit in there. If I'm using four of them, it'll fit in there. So, and we're doing some preconditioning
here to make sure that this is fully warm before we take our measurements. And my, one of my
acceptance criteria to move on to the measurement period is that the ARC size plus what has been fed to L2ARC is greater than or equal
to 90% of the active data set size.
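On Linux OpenZFS you can script roughly that check against /proc/spl/kstat/zfs/arcstats; FreeBSD exposes the same counters as sysctls. This is just a sketch, and the 1.2 terabyte figure is this test's active data set size, not anything general.

    # Sketch of that warm-up acceptance check using the kstats OpenZFS exposes on
    # Linux at /proc/spl/kstat/zfs/arcstats (FreeBSD has the same counters under
    # kstat.zfs.misc.arcstats sysctls). "size" and "l2_size" are standard arcstats
    # fields; the 1.2 TB figure is just this test's active data set size.
    ACTIVE_DATASET_BYTES = int(1.2e12)

    def arcstats(path="/proc/spl/kstat/zfs/arcstats"):
        stats = {}
        with open(path) as f:
            for line in f.readlines()[2:]:            # first two lines are kstat headers
                name, _kind, value = line.split()
                stats[name] = int(value)
        return stats

    def warm_enough(stats, threshold=0.90):
        covered = stats["size"] + stats["l2_size"]    # bytes in ARC plus bytes fed to L2ARC
        ratio = covered / ACTIVE_DATASET_BYTES
        print(f"coverage: {ratio:.1%} of the active data set")
        return ratio >= threshold

    # warm_enough(arcstats())   # start the measurement period only once this returns True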
This is a pretty safe assumption in my environment because, you know, this is my system and I'm
the only one running workload against that thing.
You know, if you're on a mixed use system, some of that data in L2ARC is not going to
be your data. So this is kind
of a, this is a clean room metric. Possibly some blocks won't be there yet, especially
with random, and we'll see that. We actually rebooted the system when we added, so I tested
one L2ARC device, and then I added a second, and then rebooted the system to basically clean those headers we were talking about earlier.
So this is actually another pitfall.
If you add L2 Arc devices, they can skew results.
Now this is only a pitfall in a testing environment
because in a deployed production environment,
you want to add a second L2 Arc device, you put it in there and add it.
And what's going to happen is over a very, very long period of time, that first L2 arc
device may be very, very full.
And if it's a big NVMe SSD, three or so terabytes or something like that, it's going to take
a while.
And then the other one is basically empty.
It's going to take it a while to load level along those. So I didn't want that, I wanted everything to be starting
from scratch in my test environment so every time I added a new L2 Arc device I rebooted
the system, cleaned out the caches and I validated that there was balanced read write activity
because when I first started doing it I thought I'm just going to add the second L2ARC device and see what happens. And then I pulled my
disk statistics out. And I've got one that's doing like 80% of the L2ARC hits, and I've
got one that's doing 20%. Well, what's going on here? Well, it's because the first one
was loaded, and the second one wasn't loaded yet.
The only really important thing here is active benchmarking.
You guys can go back and look at the rest.
Active benchmarking basically is a term that means we're not just caring about the top-level performance metrics of IOPS and latency and bandwidth.
We're actually caring about what are each individual disks doing?
What are each individual VDEVs doing?
Let's go out and collect the profiles,
the TPM counters and the profiles. Let's go do all this other stuff while the benchmark
is running. So if it's doing something weird or unexpected, we can find out why. I'm not
going to present that information today or very much of it, but as a performance engineer,
I care about not just what the system is doing, but why it's doing it.
So that's an in-house automation that we've worked on the past couple years.
So let's do some 4K.
VDBench is the load generator, so this is a very synthetic random 4K workload.
Okay. The only really unexpected thing is that one and two NVMe devices sort
of perform about the same. They don't scale. But then four really scales. That's a deep dive that I have to do. But you can see here that the L2ARC is very effective for a random read workload.
So more than six times the ops
compared to the no L2 arc case.
And even though something strange is going on here
between 1 and 2 L2 arc,
it's still very beneficial.
I suspect there might be something going on
with the round robinning of the drives.
So I've engaged my ZFS developers about that, and we're looking into it.
It's really cool when you go to a meeting and you're presenting performance numbers,
and your main ZFS developer goes, oh, that's exciting.
You never want to hear a base level, like an operating system level developer look at you and say,
that's exciting, because that's another word for scary.
Yes?
Are you getting super linear scaling
between one drive and two drives?
Super linear scaling between one drive and two.
One drive is, is that 18,000 IOPS,
and two drives is in excess of 40? Yeah, I'm sorry. I should
have called that out. This is no L2 arc. These are one and two L2 arc devices kind of laying on top of each other, which is the weird exciting thing.
And this is four L2 arc devices.
And actually the benefit can be seen even down here at the low thread count, which usually that's not the case.
Usually, you know, your single threaded aggregated workload is very, you know, they're kind of very tight down here.
Yeah, good question.
Let's look at latency.
We have a major reduction in latency,
and actually at the higher thread counts,
the reduction in latency is more significant.
But what I think is really cool is at the low thread counts counts because a lot of media and entertainment customers
are going to have maybe one application running on one workstation looking at a share or data
set. So this is a very important metric if you're playing in that space. Yes? A question about the hardware interface connection for the NAND: for that two-to-four gap, are they plugged into separate sorts of adapters
or places on the front
that might explain why one and two are the same
and four increases?
Right, so the block diagram of our server, essentially,
is what you're talking about,
is whether or not they're plugged into
some sort of PCIe switch behind the scenes. They're actually plugged into a backplane
that does not have any PCIe switches in them, but it is all fed by a single by 16. So we
have a single by 16 on the motherboard, and then that backplane plugs into that,
and it, I guess, aggregates it out to four by fours.
Actually, it's four by eights,
because these are dual port by four.
Okay, we're only using one at a time,
so we're not oversubscribed.
You've got bifurcated PCIe,
and you're not oversubscribed? Yeah. And that's about as deep as I can go into the hardware design of the system.
Yeah. I've taken it apart a lot of times.
But, yeah. Any other questions? These are very good questions. I appreciate them.
This is just basically proving that we're collecting the individual. This is one L2Arc device. We're collecting the individual statistics for each drive. So the client aggregate here
at this thread count, which when I say quote unquote best, what I mean is it is not the maximum ops that we achieved, but it is the highest
ops we achieved before the latency started curving upward. It's kind of the hockey stick
thread count because this is thread scaling. So the client aggregate is getting about 200
megs per second and the single NVMe is servicing half of that, which is pretty impressive. If we look at the arc stats, we're
seeing that even with this massive, by the time this preconditioning is running, it took
12 to 16 hours for the preconditioning to satisfy my metrics. The arc has tuned itself
to the point where it's actually still getting a 20% hit rate at this point,
which is really impressive considering it's 256 gig memory and a 1.2 terabyte active data set size.
We're still getting 20% ARC hit.
It's really well tuned.
We still have some errant feeding.
So what this is, we're actually writing into the NVMe SSD.
NVD 0, you can tell, is the only one active at this point.
So we're getting some errant feeding at the beginning of this.
It is a purely synthetic workload.
So there are probably going to be, even though my NVMe SSD is big enough to hold like every single bite of this workload of this active
data set size, by the time the measurement starts, that's why I had that 90% threshold,
by the time the measurement period begins, there's still going to be some errant feeding
going on here.
We're still going to be churning the tails of the arc a little bit.
Let's see.
So this is with two.
The sawtooth thing you're seeing there is actually just an artifact of how we collect
our stats.
Some of the time series from the thread counts may be shifted a little bit.
Client aggregate again is about 200 mebibytes per second and the two NVMe are servicing about 100. So this result is
similar to the single NVMe result and we can see that each NVMe SSD is doing about 50 mebibytes
per second this time as opposed to 100. So we know that the devices can do more. We just don't know why they aren't in this particular
case. We're going to really see better scaling here with four NVMe devices active, client
aggregate bandwidth at 450 mebibytes per second, and we've got 250 of that coming from the L2ARC devices themselves, and each one is doing,
I would call that 75, which is pretty good.
What I would really look for here, what I saw early on is that early on when my methodology
wasn't right
and I was just adding the devices,
some of these drives were doing a lot more than others.
But we're not out of balance here.
So here's just a, I guess this is the takeaway slide
from the random reads.
This is the L2 arc effectiveness.
So this is the ops versus response time hockey
stick chart and you can see even at the beginning we are far, far lower with this active data
set size. Now these aren't hero numbers. You know our marketing team doesn't care a lot
about these numbers because the ops numbers are actually a lot lower than what we can
achieve with this system. This is a 1.2 terabyte active data set size. I mean, that's something to
get pretty excited about because you're way, way, way bigger than your cache. I mean, we're
not, this is not a cache hit number. So we're at sub-millisecond latency on an active data set size
that is far, far larger than our cache, which is a very exciting result. And we scale out to this magic.
This is a number that the marketing team cares about.
We're actually up over 100,000 ops with that large active data set size.
So this is something they can use in sizing, which is pretty cool.
So a 670% increase in operations per second and a 6.7x reduction in latency at the 16 thread data point.
This is sort of a cautionary tale of quote unquote moving the bottleneck.
So with no L2 arc, this system is very happy.
We're using very little of the CPU. But when you add all of the NVMe SSDs, your CPU is no longer waiting
on slow drive, so your CPU is actually copying buffers. Your CPU is just real busy. So just
as a cautionary tale is, as you incorporate faster and better storage tiers,
you're probably going to need more CPU at some point
because you're sort of moving the bottleneck.
Yes?
Yes.
I do not know that, but I can find that out for you.
Yeah, this is... Okay.
Okay.
That's more of a question for our hardware qualification team
and how they configured these drives.
They were handed to me and said,
these have been qualified for use in our system,
so check this out.
Sequential read heavy.
This is really where the crux of the talk is.
I'm going to spend the rest of my time here.
This is 128K sequential 100% read.
Again, it's very synthetic.
Something cool happens here.
So first of all, the cool thing that's happening here:
the wisdom said L2ARC is no good for sequential workloads.
But we've shown that it's very effective.
It can be very effective with the right devices.
So we're getting 3x the bandwidth versus no L2 arc.
I don't know why the colors changed,
but I apologize for that.
So we're getting three times the bandwidth
compared to no L2 arc, but actually what's happening is
as we increase thread counts,
wow, the performance
goes up.
We sort of did a deep dive on that and we
realized that once we have so many clients
and so many threads happening,
the
traffic as it arrives to
the storage array starts to look random.
It doesn't look like sequential anymore.
So when you've got 12 clients
like I do and you're running 128 or 256 threads per client
of a very synthetic VDBench workload,
it begins to look like random at the server.
So sequential reads begin to look like random reads.
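You can convince yourself of that with a really trivial sketch: interleave enough per-client sequential streams and the arrival order at the server stops looking sequential. The numbers below are made up; it's just the interleaving effect.

    import random

    # Trivial illustration: many interleaved sequential client streams stop looking
    # sequential by the time they arrive at the server. Offsets and counts are made up.
    def sequential_fraction(num_streams, ios_per_stream, io_size=128 * 1024):
        streams = [[s * 10**12 + i * io_size for i in range(ios_per_stream)]
                   for s in range(num_streams)]            # each client reads sequentially
        arrivals = []
        while any(streams):
            s = random.choice([i for i, st in enumerate(streams) if st])
            arrivals.append(streams[s].pop(0))             # requests interleave at the server
        steps = sum(b - a == io_size for a, b in zip(arrivals, arrivals[1:]))
        return steps / (len(arrivals) - 1)

    print(sequential_fraction(1, 1000))     # ~1.0: one client still looks sequential
    print(sequential_fraction(256, 100))    # near 0: at high thread counts it looks random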
Response time is much the same story.
Pre-fetching.
You can actually see, this is how we diagnosed that weird hump we were seeing. You can see that the prefetch reads start to go way down. So the ARC itself
is saying, well, there's nothing here to prefetch. At this point, about this point, ARC says,
well, that's not streaming. I'm not going to mark that as streaming. So
that's actually how we diagnosed the previous, the phenomenon we saw in the previous results.
So, neat stuff. Much of the same, this is something that, you know, we have the balanced performance between all of the NVMes.
Here's the money shot.
And actually, this is a neat, you know, if you see a scaling curve like this, it looks weird.
You're like, why is it doing that?
Well, the reason it's doing that is at this point is where we stop prefetching. At this point is where, from the server perspective,
we're no longer doing a streaming read workload.
We're now doing a random read workload
because the thread count is sufficient
to where things are interleaved
and it starts to look random.
So we get a little bit more scaling.
So really, this changes workload profiles
right in the middle in the synthetic lab environment.
We've got about a 60% or 60 to 70% performance increase out of L2ARC versus no L2ARC for sequential read.
That's a result because a lot of people in the ZFS space are saying don't use it for sequential read.
This is why.
All my data fits in L2ARC.
Well, so this is a testing I did a long time ago
before our active benchmarking automation was complete,
so I don't have a lot of collateral data on this,
but this is a much smaller server,
kind of the same workload,
a 128K VDBench sequential read workload.
But this is a 400 gig SSD as L2ARC instead of an NVMe SSD.
So this is a SAS SSD being used as an L2ARC device.
And this is actually a much smaller active data set size. So we get all of our data in L2,
or in the L2 arc case,
we get all of our data in L2 arc,
and we think, yeah, that's cool.
We're in a faster tier.
Well, that one SSD drive is not nearly as fast
as our 142 SAS 10K hard drives.
So actually what happens is our latencies go up
and all of our requests sort of get
stuck behind that one device, and it's a very
bad thing. So this is why. This is why
that previous assumption existed.
So some people were saying, you know, not to do prefetching into the L2ARC or only do metadata into the L2ARC.
With older and slower and fewer L2ARC devices, this made sense, but in our testing, none of these mitigation factors really made any sense.
If your cache devices are fast enough, it makes no sense to keep things out of L2ARC. It only makes sense to keep things out of L2ARC if your L2ARC
is not well suited to serve the workload.
So I've been given my five minute warning, so I'm going to skip a lot here. Write heavy
workloads are bad, okay.
So basically what's happening is either it does no harm,
just like the designers of L2Arc say,
adding L2Arc does no harm,
or we have some up to 20% regression in IOPS workload because what we're doing is we're spending a lot of CPU cycles
writing to those NVMe devices constantly.
So in those previous results I was showing you,
at some point we're writing, we're writing, we're writing,
we're feeding L2 arc.
At some point the feeding slows down
and we get those CPU cycles back to use for client workload.
This never happens here.
We're constantly churning the ARC.
We're constantly under memory pressure.
We're constantly using those CPU cycles to feed L2ARC, and so our actual client visible performance decreases.
So this is probably the money shot for the rest of the presentation. There are two key
metrics that you can calculate to size a system or configure a system for L2ARC. The first one basically says, hey, is my active data set size going to fit in my L2ARC?
Okay?
The second one basically says, is my L2ARC fast enough to make that a good thing?
Right?
Because you can get all your data in L2ARC just like I showed you before,
but if you get all your data in L2ARC
and L2ARC is slow, that's a bad thing.
Key metrics there.
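Something like the sketch below; every number in it is a placeholder you'd swap for your own active data set size, cache devices, and pool.

    # The two sizing checks, sketched. Every number is a placeholder; plug in your
    # own active data set size, cache device specs, and what the data vdevs can do.
    ACTIVE_DATASET_BYTES  = 1.2e12       # what the clients actually keep touching
    L2ARC_DEVICES         = 4
    L2ARC_DEVICE_BYTES    = 1.6e12       # capacity per cache device
    L2ARC_DEVICE_READ_BPS = 3.0e9        # sustained read rate per cache device
    POOL_READ_BPS         = 1.5e9        # what the data vdevs deliver on their own

    def l2arc_worth_it():
        fits   = ACTIVE_DATASET_BYTES <= L2ARC_DEVICES * L2ARC_DEVICE_BYTES
        faster = L2ARC_DEVICES * L2ARC_DEVICE_READ_BPS > POOL_READ_BPS
        print(f"active data set fits in L2ARC:   {fits}")
        print(f"L2ARC can outrun the data vdevs: {faster}")
        # Only when both are true does caching the working set help. A big but slow
        # L2ARC (the single SAS SSD case above) passes the first check and fails the second.
        return fits and faster

    l2arc_worth_it()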
What type of drives?
Your bandwidth and ops and latency capabilities
are going to rest on your type of drives.
Segregated pools versus mixed pools
with segregated data sets is a couple of things
you can do. So, for example, if you have smaller drives, like you were saying earlier, if you
can only afford very small NVMe SSDs, then you kind of want to segregate those off to the pool or the data set where it makes most sense.
Capacity.
Large devices can hold more of your data.
It will take longer to warm.
In a deployed environment, that probably doesn't matter;
it doesn't matter how long it takes to warm.
You're going to be running for years, hopefully,
so you don't care how long it takes to warm.
I only care about that in my lab.
There is a thing that can happen where if you have like 64 gigs of RAM on your system and you have like, I don't know, 100 terabytes of L2 arc,
you're going to eat up all of your arc with the indexes to your L2 arc in the headers. So your arc hits are actually going to go
down. So beware of that. Probably not a good idea.
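To see why, put rough numbers on that 64 gigs of RAM with 100 terabytes of L2ARC example; the per-block header size below is an assumption that varies by OpenZFS version, so it's the shape of the math that matters, not the exact bytes.

    # Rough arithmetic on that 64 GB of RAM with 100 TB of L2ARC scenario. The
    # per-block header size is an assumption (it varies by OpenZFS version); the
    # shape of the math is the point, not the exact bytes.
    RAM_BYTES    = 64 * 2**30
    L2ARC_BYTES  = 100 * 2**40
    HEADER_BYTES = 96                         # assumed in-RAM header per L2ARC block
    for recordsize in (16 * 1024, 128 * 1024):
        blocks = L2ARC_BYTES // recordsize
        header_ram = blocks * HEADER_BYTES
        print(f"recordsize {recordsize // 1024:>3} KiB: "
              f"{header_ram / 2**30:,.1f} GiB of headers "
              f"({header_ram / RAM_BYTES:.0%} of the RAM in the box)")
    # 16 KiB records: about 600 GiB of headers, nearly ten times the RAM in the box
    # 128 KiB records: about 75 GiB of headers, still more than the entire ARC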
As for device count, our testing shows that device count actually improves latency because it's basically spreading
the load if you have more devices.
What are you going to feed?
If you have a slow L2 arc,
you want it to be demand-fed, not prefetched.
If it's a fast enough L2 arc,
you want to feed everything.
And fast enough is that ratio
I showed a couple slides ago. If you're
writing, you basically want to do those mitigation practices where you segregate the pool.
Finally, know your workload. And this is kind of preachy, teachery stuff, but if you're
a customer looking at evaluating a storage solution, know your workload. Know at least the applications.
And if you're a vendor, know your architecture. Don't just go out there and say, yeah, L2ARC's
awesome, NVMe is a rocket, I'm going to strap myself to that rocket. Not necessarily.
You can actually have problems there. So the big picture here is
with enterprise-grade NVMe SSD devices
as L2ARC,
we have reached the quote-unquote future
that Brendan Gregg mentioned,
and we can now have it as an effective tool
for a streaming workload.
So, thank you.
I'm going to give you some more homework.
Please download the slides for more content,
and please rate my talk in the SDC portal.
Thank you.
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.