Storage Developer Conference - #134: Best Practices for OpenZFS L2ARC in the Era of NVMe
Episode Date: October 7, 2020...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual
Storage Developer Conference.
The link to the slides is available in the show notes
at snia.org slash podcasts.
You are listening to SDC Podcast
Episode 134.
My name is Ryan McKenzie. I am a performance engineer at iXsystems.
And if you don't know what iXsystems does, it's actually a pretty cool place to work,
because they allow us to come to conferences like this and they support us to come and network with everyone.
And then they also allow us to divert ourselves and some of our effort
to do interesting investigations like this and share the results.
We are the maintainers of the FreeNAS project,
and we also have an enterprise storage line called TrueNAS,
which is based on that software with some extra features and stuff.
But just thanks to iXsystems for sending me.
So that's my current life.
In a previous career, I was a lecturer in the Department of Computer Science at University of Kentucky.
So if you need me to slow down or repeat something, that's fine.
I know not everyone speaks banjo.
Banjo is slow.
We're okay with it.
Slow banjo.
All right, cool. But also, you know,
I've been up in front of rooms of 700 college students
But this is a little bit of a different environment
Where there's probably people in the room
That know more than me about ZFS
Anyone from Oracle? So I'll give you guys $20 if you don't grill me too hard on ZFS. It is
OpenZFS we're talking about here today, not Oracle ZFS. But anyway, I'll just go ahead and get
started. That's a little bit about me and where I work. So what are we doing today? Just a brief
overview of how OpenZFS works. So really something that's not on the slides, and I like to talk a
little bit off the slides from time to time.
It's a good thing that when I was a teacher, I kept my students a little more engaged.
The two big pictures for this agenda, if you will: there's a technical big picture. L2ARC, the Level 2 Adaptive Replacement Cache, has for a long time now carried
the recommendation of don't use it with certain workloads, but maybe it'll work with future technologies.
And now that we have NVMe SSDs becoming very available, we are going to reevaluate that.
The other big picture of this talk is sort of a meta point, if you will:
knowing how to tune your system means you must know your solution,
you must know your architecture and your implementation,
and you also must know your workloads and the applications
you're going to deploy them into. So I don't know what's making that noise. Is it the projector?
Maybe my laptop is doing that. I don't know. Yeah, it sure was. Oh, yeah, audio goes over HDMI.
How cool.
Told you I wasn't a developer.
Okay, so let's start with a brief overview of ARC and L2ARC.
So ARC stands for Adaptive Replacement Cache.
It resides in main memory, so it's global.
On the server, the filer, it's shared by all the storage pools. Basically all the incoming
or written data goes through the ARC in one way or another. And the really cool thing
about ARC is it balances between most frequently used, MFU, and most recently used, MRU. And that provides some protection
against things like cache scanning,
where if you have a backup workload or something,
you still have your most frequently used lists
in your ARC that aren't getting really perturbed
by the backup workload and things like that.
So ARC is really neat.
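If you want to see that MRU/MFU balancing idea in code, here's a toy Python sketch; it is not the real arc.c logic, and all the names in it are made up, but it shows why a one-pass scan can't push your hot blocks out.

    from collections import OrderedDict

    class ArcSketch:
        # Toy two-list cache: most recently used (MRU) plus most frequently used (MFU).
        def __init__(self, capacity):
            self.capacity = capacity           # total blocks we can hold
            self.mru = OrderedDict()           # blocks seen once
            self.mfu = OrderedDict()           # blocks hit more than once

        def access(self, block):
            if block in self.mfu:              # repeat hit: stays near the head of MFU
                self.mfu.move_to_end(block)
            elif block in self.mru:            # second touch: promote MRU -> MFU
                del self.mru[block]
                self.mfu[block] = True
            else:                              # first touch: lands on MRU only
                self.mru[block] = True
            while len(self.mru) + len(self.mfu) > self.capacity:
                victim = self.mru if self.mru else self.mfu
                victim.popitem(last=False)     # evict from the cold tail

    cache = ArcSketch(capacity=8)
    for block in ("vm1", "vm2", "vm1", "vm2"):     # hot VM blocks get promoted to MFU
        cache.access(block)
    for block in range(100):                       # a backup-style scan only churns MRU
        cache.access(("backup", block))
    print(sorted(cache.mfu))                       # ['vm1', 'vm2'] survive the scan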
This is the focus of our talk, L2ARC,
the Level 2 Adaptive Replacement Cache.
It's actually configured and added on a per-pool basis,
so some of your pools can have L2ARC
and some of them can go without,
and L2ARC can even be configured differently
on different pools based on what those pools are used for,
which is good.
There's a concept of warm data,
things that are about to be evicted from ARC.
And we'll learn more about that in just a minute,
but the L2 ARC tries to feed itself with blocks
that are about to be evicted from the main memory ARC.
So we'll talk about what that means
and how it determines what is about to be evicted from the main memory ARC.
The other interesting thing to note is that all of the blocks that
are in your L2 arc have a header component that's actually stored in the main memory.
So there's a couple caveats to remember there. I'm going to move really quickly through these introductory slides, but feel free to ask
questions.
When ZFS writes, you've got a write request that comes in.
If it's an async write, as soon as that block gets copied to main memory, we're going to
acknowledge to the requester
that the write is complete. A sync write takes a little bit longer. On
this slide, I want to give a plug for Nick Principe's talk last year at SDC 2018. He
gave a very similar talk to this last year, but his talk was how you could improve this portion of your OpenZFS system using NVDIMMs.
So I'm giving a talk about maybe using NVMe SSDs here.
He gave a talk about using NVDIMMs here.
And the ZIL is the ZFS Intent Log,
and a SLOG is a separate log device for it.
So the ZFS Intent Log, by default, is sort of striped across all of the data VDEVs,
which can be slow if you have a very heavy synchronous write environment.
A lot of NFS environments are pretty synchronous heavy.
So one way to speed up the acknowledgement process is to put a fast device as a separate log, and that can be an SSD, but it can also be an NVDIMM in certain cases,
or an NVMe SSD, either one.
So at some point later in time,
the block that has been written gets copied to stable storage.
It's no longer quote-unquote dirty and all is well with
the world and now we have this neat block in ARC. So we've got a block that sits in ARC and if that
block is going to be re-hit, that block is going to be hot, it will stay there. It will be promoted
to the heads of the most recently slash most frequently used lists in the ARC, just like any other
block.
And if it's just something that gets written once and never gets written again, it will
fall to the cold part, the tail of the ARC, and it will get evicted.
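Just to make that acknowledgement timing concrete, here's a toy sketch; the latencies in it are invented numbers, not measurements, and it's conceptual rather than how the ZIL code is actually structured.

    import time

    # Purely conceptual sketch of when ZFS acknowledges a write. The latency numbers
    # are invented for illustration; they are not measurements of anything.
    RAM_COPY_S  = 0.00001    # copying the block into ARC as dirty data
    ZIL_FLUSH_S = 0.0001     # persisting the intent log record (faster if the ZIL is on a SLOG)

    def async_write(block):
        time.sleep(RAM_COPY_S)        # block lands in main memory
        return "acknowledged"         # caller hears "done" right away;
                                      # the transaction group sync to the data vdevs happens later

    def sync_write(block):
        time.sleep(RAM_COPY_S)        # still lands in main memory...
        time.sleep(ZIL_FLUSH_S)       # ...but we also have to persist it to the ZIL / SLOG
        return "acknowledged"         # only now can we acknowledge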
Reads.
Reads are a little different.
So if a read request comes in and the block we're looking for is in ARC we have an Arc hit, which is awesome. We're hitting main memory. That's what we
want. That's why certain ZFS-based servers have many, many terabytes of main memory,
because Arc will grow to consume all that memory and use it and it'll give it back if your system needs
it, but we want arc hits with reads. Arc hits are really, really low latency, very, very
fast. If it's in L2ARC, we're going to see from our header over here in the ARC, oh, that's a block that I've got stored in my L2 device.
We're going to go get that block from L2, which is another tier that's sort of between your main memory and your slower data VDEVs here.
And then we're going to return that.
In the case of a miss, we have to actually copy data up from our slower data VDEV.
So we're going to copy this block that's been missed on up.
And then we're going to acknowledge the read.
Yes?
All right, real quick.
What is the process by which it moves from ARC to L2ARC?
Is it just a timing issue?
How long has it been in?
Yeah, that'll be one of the next slides.
But yeah, it's a pretty important concept
of how things, once they get warm or cold in ARC,
how they get into L2ARC.
Yeah, that's one of the next slides.
Thank you for the good segue.
So we're going to, this is called a demand read.
So we're demanding data from our VDEVs in our pool up into ARC.
We're going to acknowledge it. Then there's the ARC prefetcher:
there's a part of an ARC read system call
that tries to determine if something is a streaming workload.
And if it thinks it's a streaming workload,
it may not actually be a streaming workload.
It just, if it thinks it is,
it's going to actually copy some more blocks too
to maybe try and avoid another miss.
So we have concepts in Arc.
If you look at the Arc statistics,
we have concepts of Arc miss,
Arc data miss,
Arc prefetch reads,
prefetch misses,
metadata reads,
metadata misses.
There's lots of statistics around it,
which is awesome for a performance person
because if you capture all that stuff,
you can really do some damage.
And then once that stuff is demand read into ARC,
it again stays there.
So this is really, think of this as there's two ways
something gets in ARC.
It either gets in ARC because it was demanded
or it gets in ARC because it was written. Those are the two ways things get in ARC. Now, how do things get in L2
ARC? Oh, wait a minute. Yes, this is how things get in L2 ARC. So as I said before, ARC will
actually grow to consume almost all your system's memory. It's kind of a feature. If you read the forums, a lot
of people don't think it's a feature, but if it's well tuned, it's a feature. So what's
happened is you have to have a headroom of available main memory for incoming writes.
Because think about what would happen if your memory was full and every incoming write has
to hit that memory. If there was no free memory
and you had to immediately sync your dirty data transaction group
down to your slow spinning-rust drives,
your write latency would just go through the roof, right?
So we have to sort of do some housekeeping
and maintain some room in ARC for incoming writes.
So that's what happens. That's also why the L2 arc feed is
completely asynchronous of the arc reclaim or the arc things falling off the end of arc because
imagine if you've got a block that's cold, and it's in your main memory,
and there's a write coming in.
And if that write has to wait for a cold block
to first be copied out to L2ARC, well, this L2ARC is faster
than your data VDEVs, but it's also slower than memory,
so this is actually going to increase your write latency too.
So this whole process being asynchronous,
loading L2 arc, loading L2ARC,
feeding L2ARC asynchronously of ARC evictions and cleaning up ARC, it really is optimized to
keep our write latencies very low. So periodically, and these things in yellow are tunable,
so how often it is scanned, how much it scans, how many blocks it copies into L2ARC is all tunable. So how often it is scanned, how much it scans, how many blocks it copies into
L2 arcs, all tunable. So that's really all three of those things combined are called
the feed rate. The feed rate of L2 arc is tunable. So we're going to go through and
we're going to look at close to the tails. And just imagine if there's a dotted line
here. You're going to look at either more of the tail or less of the tail.
That's called headroom.
That's tunable.
So we're going to look in here and go, okay, these are getting ready to fall out of arc.
We don't really know anything about them, but we're just going to copy them into L2 arc in the background for fun.
So that's really how L2 arc feeds.
It's completely asynchronous.
It doesn't really know anything about those blocks.
It's just saying, oh, these are getting ready to fall out of arc,
so we're going to take those into the next tier down
instead of letting them fall all the way out to slower stable storage.
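Roughly, one wake-up of the feed thread looks like the sketch below; the constants are named after the real OpenZFS tunables, l2arc_write_max and l2arc_headroom, but the values and the loop itself are a simplification for illustration, not the actual arc.c code.

    from collections import namedtuple

    # One wake-up of the L2ARC feed thread, sketched. The constants echo the real
    # OpenZFS tunables (l2arc_write_max, l2arc_headroom), but the values and the
    # loop are an illustration, not the actual arc.c implementation.
    L2ARC_WRITE_MAX = 8 * 1024 * 1024       # max bytes fed per wake-up (the feed rate)
    L2ARC_HEADROOM  = 2                     # scan this many multiples of write_max up the tail

    Block = namedtuple("Block", "key size")

    def l2arc_feed_cycle(arc_tail_blocks, l2arc_writes, l2arc_index):
        # Scan near the cold tails of the ARC lists, copy eligible blocks to L2ARC.
        scanned = written = 0
        for block in arc_tail_blocks:                        # coldest blocks first
            if scanned >= L2ARC_HEADROOM * L2ARC_WRITE_MAX:
                break                                        # only look so far up the tail
            scanned += block.size
            if block.key in l2arc_index:
                continue                                     # already in L2ARC, don't re-copy
            if written + block.size > L2ARC_WRITE_MAX:
                break                                        # this wake-up's feed budget is spent
            l2arc_writes.append(block)                       # async copy; ARC eviction never waits on it
            l2arc_index[block.key] = True                    # the header stays in main memory
            written += block.size
        return written

    # The feed thread repeats this every feed interval, independent of ARC eviction.
    tail = [Block(key=i, size=128 * 1024) for i in range(1000)]
    print(l2arc_feed_cycle(tail, l2arc_writes=[], l2arc_index={}))   # caps out around L2ARC_WRITE_MAX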
Okay.
So when L2 arc, quote, unquote, gets full,
which means we're sort of at the end,
so things get written in a round-robin fashion to all the available L2 arc devices.
So if you have two L2 arc devices or four L2 arc devices,
they start writing in a round-robin
fashion, and then when you get to the end, it just says, okay, let's just go back to
the beginning.
And then, you know, it starts back at the beginning with the new blocks that are being
promoted, and it invalidates the old indexes to these in the headers.
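The rotor and the wrap-around behave roughly like this toy sketch; the device names and sizes are made up and the real code is more involved, but you can see why the old headers get invalidated when we come back around.

    # Toy sketch of the L2ARC device rotor and the wrap-around. Device names and
    # sizes are made up; the point is what happens to the in-memory headers when
    # we come back around to the start of a device.
    class L2ArcDevice:
        def __init__(self, name, capacity):
            self.name, self.capacity, self.offset = name, capacity, 0

        def write(self, key, size, headers):
            if self.offset + size > self.capacity:
                self.offset = 0                              # device "full": wrap to the beginning
            # whatever lived in the region we are about to overwrite loses its header,
            # so a later read of that block becomes an L2ARC miss
            stale = [k for k, (dev, off) in headers.items()
                     if dev == self.name and self.offset <= off < self.offset + size]
            for k in stale:
                del headers[k]
            headers[key] = (self.name, self.offset)          # the index itself lives in RAM
            self.offset += size

    headers = {}                                             # in-RAM L2ARC index
    devices = [L2ArcDevice("nvd0", 100 * 128 * 1024), L2ArcDevice("nvd1", 100 * 128 * 1024)]
    for i in range(500):                                     # feed 128K blocks round-robin
        devices[i % len(devices)].write(key=i, size=128 * 1024, headers=headers)
    print(len(headers))                                      # only the most recent blocks are still indexed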
And, you know, so L2ARC is,
if you have L2ARC in your system,
it is constantly feeding something at some rate.
And that's a lot of what we're going to talk about today is what do we feed it, how fast do we feed it,
that kind of thing.
Under what workload scenarios.
Yes.
Yes. So when there's no L2 arc, these just get reclaimed.
And they disappear from main memory.
They don't have another tier to go to,
so they're going to be on your data VDEV.
They're going to be on your spinning rust
or whatever your slowest,
whatever quote-unquote stable storage is in your system.
So L2ARC is sort of making an effort to, instead of just dumping these all the way down to
stable storage, L2ARC is making an effort to put some things in a middle tier that makes
them maybe useful at some later time. Ideally, L2ARC should be sized based on what you're using it for, obviously, but that's kind of
the old classic pyramid, right?
We teach in computer sciences, you know, your CPU and your on-chip cache is at the very top,
but it's very small, and then your main memory is a little slower and a little bigger,
and then your next tier is a little slower and a little bigger.
So, yeah, that's conceptually the idea.
Yes?
Yes. If you keep writing the block,
so the question was if you keep writing a block
lots of times,
does it end up being
just overwritten in L2ARC? There's two mechanisms to keep that from happening. There's the most
frequently used list in ARC. If something is close to the top of these lists and something
that you're hitting over and over again is going to be close to the top of this list,
it's not going to get down here to the point where it's in L2 arc. And the other mechanism is if it's already in L2 arc,
it's going to look up here and go, oh, I've already got that.
So I'm not going to copy it again.
So any block can be in three states.
It can be either in arc only or L2 arc only.
It can be in both.
But that is only sort of a temporary state.
So it can be in both right now.
I've copied it from the tail of this list.
It's here until it gets evicted.
Now, if this thing gets re-hit and gets promoted back,
you actually have it in both places.
That's what we call a wasted feed.
We've fed something to L2 arc
too early. Basically, we've fed too fast. It never was going to fall off the end of
arc. Any other questions? This is key: if we understand how the L2ARC feeds, we can understand how to tune it in different scenarios.
Again, if an L2 arc block then becomes dirty at some point,
let's say we've got a block here that we've brought into L2 arc,
and it gets written from the outside,
arc is just going to handle that.
Arc is, as we saw on the third slide or whatever,
ARC is going to say, okay, this is a dirty block. I'm going to put it as part of a transaction group, and I'm going to get it to stable storage at some point. And then this block in ARC
is just not going to be used until it circles around and gets written with something different.
So we don't ever have anything dirty in arc because it would be an extra
step to have to flush arc, or sorry, we don't ever have anything dirty in L2 arc because
it would be extra steps to flush L2 arc and slower.
So here's just some notes. These are very text-heavy slides.
So like a true teacher that I am,
I'm going to give you guys homework.
I'm not going to read them all.
You can download the slides if you're very interested.
But this is a very interesting point.
The blocks are variable size.
So if you've got a block workload
or a file workload or an object workload,
whatever your block size that you've set on that particular data set or that particular pool or that particular share, that's what the size of the block is going to be in ARC.
You can have, you know, you can have 4K, you can have 1 meg.
So some of these blocks are 1 meg and some of them are 4K on a mixed use system.
We're just going to talk about blocks. And whatever those are, they are. But that, you
know, takes into, you have to take that into account when you're sizing, especially if
you have everything set to a really small block size, you're going to have lots and
lots of metadata. So you may have to tune your metadata percent of ARC; you've got a
lot of metadata overhead. If you've got a petabyte system with 4K block sizes on everything,
that's probably a misconfiguration.
Your SE is going to be in trouble.
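To put rough numbers on that, just count how many blocks you'd have to track; this is back-of-the-envelope arithmetic, not anything OpenZFS reports.

    # Back-of-the-envelope: how many blocks does a pool have to track at a given
    # recordsize? Nothing here is an OpenZFS constant, it is just arithmetic.
    PiB = 1 << 50
    for recordsize in (4 * 1024, 128 * 1024, 1024 * 1024):
        blocks = PiB // recordsize
        print(f"recordsize {recordsize // 1024:>4} KiB -> {blocks:>15,} blocks to track")
    # recordsize    4 KiB -> 274,877,906,944 blocks to track
    # recordsize  128 KiB ->   8,589,934,592 blocks to track
    # recordsize 1024 KiB ->   1,073,741,824 blocks to track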
Blocks get into ARC either by a demand or prefetch read miss or by a ZFS write.
That's the only way they get into here.
Now, this is worth the price of admission to this talk.
Here's the configuration pitfall.
It can only get into L2ARC if it's been in ARC first, right?
We learned that just a few minutes ago.
So what a lot of people do, like, oh, okay, so let's set our primary cache, let's set our ARC to metadata, because it's fast, and then let's set our L2 ARC to catch everything else, all.
Well, guess what?
You're not going to get anything in L2 ARC, right?
Because it can't get into L2 ARC unless it was already in ARC.
So that's something you have to know about the way ZFS caching works.
Because a lot of people hit this, and we see this about every month on the FreeNAS forums.
They're talking about how their L2ARC never has anything in it.
It's because there's these primary cache, secondary cache settings,
and they've got them set in such a way that you're telling L2ARC to grab things that are
never in ARC.
So it's looking at the tails of the ARC list going, there's nothing there I'm allowed to
copy.
There's nothing there I'm allowed to feed.
All right, so that's a configuration pitfall.
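Here's a rough sketch of how you could catch that misconfiguration from a script; it shells out to the standard zfs get command, and the dataset name at the bottom is just a placeholder.

    import subprocess

    def cache_settings(dataset):
        # Read the primarycache / secondarycache properties via the standard zfs CLI.
        out = subprocess.run(
            ["zfs", "get", "-H", "-o", "property,value", "primarycache,secondarycache", dataset],
            capture_output=True, text=True, check=True).stdout
        return dict(line.split("\t") for line in out.strip().splitlines())

    def warn_if_l2arc_starved(dataset):
        props = cache_settings(dataset)
        # L2ARC can only be fed with blocks that were allowed into ARC first, so
        # primarycache=metadata plus secondarycache=all leaves the L2ARC with no data blocks.
        if props.get("primarycache") == "metadata" and props.get("secondarycache") == "all":
            print(f"{dataset}: secondarycache wants blocks that primarycache never admits")

    # warn_if_l2arc_starved("tank/vmstore")    # dataset name is a placeholder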
L2ARC is not persistent, parentheses yet.
The blocks themselves are persistent in the L2ARC, because you're talking about an NVMe
SSD or you're talking about an SSD or maybe a 15k SAS drive in
some deployments. The actual blocks are there, but what's not there? Something we saw earlier.
The header information, right?
The headers reside in main memory,
so when a user reads, we're going to look in our header,
and if we lose power or reboot the system and lose main memory,
the header information that points to those blocks is gone.
So basically when the system comes back up, it goes,
oh, L2 arc's empty.
I'm going to start at block zero.
That's a configuration pitfall and
something to note there.
Write heavy workloads can churn the most recently used.
It's not going to churn the most frequently used. We talked about that a minute ago.
Prefetch heavy workloads can scan.
So like I was talking about a backup workload can scan the ARC
because if you are the recipient of the backup, you're going to get writes.
That's the first bullet point.
But if you are the source of the backup, you're going to have a lot of sequential reads,
which are just going to go through and your most recently used list is going to completely be scanned
all the time until the backup is complete. That's a neat feature of Arc, how it has the
most frequently and the most recently used lists, because it sort of protects your VM
guest OS data that you want to stay in the most frequently used list, it sort
of stays there. What's happening is you're going to, yeah, let's see, I think I already
talked about all that. Blocks that L2Arc doesn't feed may not be hit ever, or blocks that L2Arc
does feed may never be hit again in the case of scanning and churning, and that's a wasted feed.
Alright, so let's talk about some key performance factors that we get from that, because the first
step is really understanding the architecture.
So what's a random read heavy workload going to look like on this guy?
Well, L2ARC is actually designed for this. Brendan Gregg put L2ARC in ZFS back around 2008, and all of his comments in arc.c
and a lot of his blog posts and some of his papers that he wrote about it, it is designed
for being a random read heavy cache. So we expect it to be very good.
Sequential read-heavy, not originally designed for this, but again, the original design document said future SSDs, future storage technologies might make it such that it's going to work
out.
So we're going to revisit that today.
That's the big data component of today's talk.
Write-heavy workload, it doesn't matter if it's random or sequential,
it's going to cause what we call memory pressure. Every time we write to a ZFS file system we
need that block in memory to be free. And if we're doing a ton of writes it's going
to cause what's called memory pressure so we're going to very, very quickly be reclaiming
blocks from the tails of the ARC lists.
And a lot of those are probably going to go by without being fed into L2 arc.
Some of them are going to be fed into L2 arc and probably never touched again, and that's
a wasted feed.
So the design intention was to do no harm, but we think there may be an impact,
and we're going to show some of that later: the background performance cost of actually feeding
L2ARC. So if you're in this situation, your L2ARC's never really going to be warm, because
it's always going to be grabbing new stuff, because the tails of those lists are going
to be perturbed all the time. So it's going to be grabbing new stuff all the time and it may never be hit again.
So we have to, and actually those mem copies,
they use CPU time, they use resources.
Those NVMe SSD drives, writing to them actually costs us.
So in some cases it may be better not to have the L2 arc
or at least mitigate how much feeding you're doing
of the L2ARC to prevent that. The active data set size, if it's small, it's going to
probably fit in L2ARC or it may never be fully warm, but we'll talk about that. So really,
active data set size is a very important thing for you to know about your workload, about
your solution that's deployed,
because otherwise you're not going to really know if you're going to be all in cache or
in cache plus L2 arc or largely on disk, and that really impacts how you design your solution.
All right, so let's look at a few tribal knowledge things.
Now, I've only been around ZFS land for about a year,
a little more than a year.
So there's probably more secret incantations out there.
But these were ones that I heard amongst our ZFS developers around,
is that L2ARC is not helpful for sequential workloads.
But if you do have a sequential workload,
you need to segregate it to a different pool.
So you have a pool over here that's for sequential
streaming, and you have a pool over here
that's for your random VM stuff.
So you would fix that at config time,
or at deployment time.
There's a no-prefetch setting that basically
says L2ARC, you can feed yourself anything out of arc except for things
that have got there by the prefetcher. Remember that slide we talked about where we missed
something and demand read it up into the arc, and the prefetcher said, well, this looks
like it might be streaming, so I'm going to get these three or four blocks here. There's
actually accounting on this, and the accounting says these blocks were demand, these blocks were prefetched, because that will actually tune
it, the ARC will actually tune itself later if it finds out that almost everything it's
prefetching is a miss. It'll say, okay, maybe let's not do as much of that.
So the ARC is actually a really cool self-tuning thing. Sometimes it's smarter than it needs to be.
And then a lot of people say set the secondary cache so that only metadata gets into the L2 arc for those pools and data sets that are doing streaming workloads. So even though
you may not be hitting, you've got these large streaming files, you may not be hitting the
data blocks very much.
Let's just put the metadata in there, and that's going to help.
So we're going to try and validate some of these.
But they all seem more or less plausible, given what we've learned about the architecture.
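For reference, those pieces of tribal knowledge map onto real knobs; the parameter and property names below are the real OpenZFS ones, but treat the summaries as rough, and check the defaults on your own OpenZFS version.

    # The tribal knowledge above maps onto real knobs. The parameter and property
    # names are the real OpenZFS ones; the summaries are rough, and defaults vary
    # by OpenZFS version, so verify them on your own build.
    L2ARC_MITIGATIONS = {
        "segregate sequential workloads": "separate pool (or dataset) with no cache devices",
        "don't feed prefetched blocks":   "l2arc_noprefetch module parameter / sysctl",
        "cache metadata only":            "zfs set secondarycache=metadata <pool/dataset>",
    }
    for mitigation, knob in L2ARC_MITIGATIONS.items():
        print(f"{mitigation:32} -> {knob}")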
Here's the fun part.
So here's our solution in our test.
I sort of highlighted in yellow the parts that are sort of important.
So we have a 512 gig main memory on this system.
That's more or less what your arc is going to be, the arc size.
Because on a FreeBSD OpenZFS-based storage server,
your arc is going to grow to, if your workload is sufficient,
and in a performance lab it's always sufficient,
your arc is going to grow to consume almost all your main memory.
We have some enterprise-grade dual port,
because just saying an NVMe is not enough.
Just saying, oh, I've got some NVMes in there.
Well, what are you talking about? Are you talking about, you know, Evo 960s or M.2s or U.2s? What are you talking
about? So we've got a lot of hard drives and
we're working with a big dataset, well fairly big dataset.
It's a 1.2 terabyte dataset. It is
specifically designed, that active dataset size was specifically chosen
because it will not fit in ARC.
So we're always gonna be ARC missing a lot.
It's also designed so that the active dataset
will always fit into L2ARC.
Even if I'm using one of those 1.6 terabyte drives,
it'll fit in there.
If I'm using two, it'll fit in there. If I'm using four of them, it'll fit in there. So, and we're doing some preconditioning
here to make sure that this is fully warm before we take our measurements. And my, one of my
acceptance criteria to move on to the measurement period is that the ARC size plus what has been fed to L2ARC is greater than or equal
to 90% of the active data set size.
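On Linux OpenZFS you can script roughly that check against /proc/spl/kstat/zfs/arcstats; FreeBSD exposes the same counters as sysctls. This is just a sketch, and the 1.2 terabyte figure is this test's active data set size, not anything general.

    # Sketch of that warm-up acceptance check using the kstats OpenZFS exposes on
    # Linux at /proc/spl/kstat/zfs/arcstats (FreeBSD has the same counters under
    # kstat.zfs.misc.arcstats sysctls). "size" and "l2_size" are standard arcstats
    # fields; the 1.2 TB figure is just this test's active data set size.
    ACTIVE_DATASET_BYTES = int(1.2e12)

    def arcstats(path="/proc/spl/kstat/zfs/arcstats"):
        stats = {}
        with open(path) as f:
            for line in f.readlines()[2:]:            # first two lines are kstat headers
                name, _kind, value = line.split()
                stats[name] = int(value)
        return stats

    def warm_enough(stats, threshold=0.90):
        covered = stats["size"] + stats["l2_size"]    # bytes in ARC plus bytes fed to L2ARC
        ratio = covered / ACTIVE_DATASET_BYTES
        print(f"coverage: {ratio:.1%} of the active data set")
        return ratio >= threshold

    # warm_enough(arcstats())   # start the measurement period only once this returns True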
This is a pretty safe assumption in my environment because, you know, this is my system and I'm
the only one running workload against that thing.
You know, if you're on a mixed use system, some of that data in L2ARC is not going to
be your data. So this is kind
of a, this is a clean room metric. Possibly some blocks won't be there yet, especially
with random, and we'll see that. We actually rebooted the system when we added, so I tested
one L2ARC device, and then I added a second, and then rebooted the system to basically clean those headers we were talking about earlier.
So this is actually another pitfall.
If you add L2 Arc devices, they can skew results.
Now this is only a pitfall in a testing environment
because in a deployed production environment,
you want to add a second L2 Arc device, you put it in there and add it.
And what's going to happen is over a very, very long period of time, that first L2 arc
device may be very, very full.
And if it's a big NVMe SSD, three or so terabytes or something like that, it's going to take
a while.
And then the other one is basically empty.
It's going to take it a while to load level along those. So I didn't want that, I wanted everything to be starting
from scratch in my test environment so every time I added a new L2 Arc device I rebooted
the system, cleaned out the caches and I validated that there was balanced read write activity
because when I first started doing it I thought I'm just going to add the second L2ARC device and see what happens. And then I pulled my
disk statistics out. And I've got one that's doing like 80% of the L2ARC hits, and I've
got one that's doing 20%. Well, what's going on here? Well, it's because the first one
was loaded, and the second one wasn't loaded yet.
The only really important thing here is active benchmarking.
You guys can go back and look at the rest.
Active benchmarking basically is a term that means we're not just caring about the top-level performance metrics of IOPS and latency and bandwidth.
We're actually caring about what are each individual disks doing?
What are each individual VDEVs doing?
Let's go out and collect the profiles,
the TPM counters and the profiles. Let's go do all this other stuff while the benchmark
is running. So if it's doing something weird or unexpected, we can find out why. I'm not
going to present that information today or very much of it, but as a performance engineer,
I care about not just what the system is doing, but why it's doing it.
So that's an in-house automation that we've worked on the past couple years.
So let's do some 4K.
VDBench is the load generator, so this is a very synthetic random 4K workload.
Okay. The only really unexpected thing is that one and two NVMe devices sort
of perform about the same. They don't scale. But then four really scales. That's a deep dive that I have to do. But you can see here that the L2ARC is very effective for a random read workload.
So more than six times the ops
compared to the no L2 arc case.
And even though something strange is going on here
between 1 and 2 L2 arc,
it's still very beneficial.
I suspect there might be something going on
with the round robinning of the drives.
So I've engaged my ZFS developers about that, and we're looking into it.
It's really cool when you go to a meeting and you're presenting performance numbers,
and your main ZFS developer goes, oh, that's exciting.
You never want to hear a base level, like an operating system level developer look at you and say,
that's exciting, because that's another word for scary.
Yes?
Are you getting super linear scaling
between one drive and two drives?
Super linear scaling between one drive and two.
One drive is, is that 18,000 IOPS,
and two drives is in excess of 40? Yeah, I'm sorry. I should
have called that out. This is no L2 arc. These are one and two L2 arc devices kind of laying on top of each other, which is the weird exciting thing.
And this is four L2 arc devices.
And actually the benefit can be seen even down here at the low thread count, which usually that's not the case.
Usually, you know, your single threaded aggregated workload is very, you know, they're kind of very tight down here.
Yeah, good question.
Let's look at latency.
We have a major reduction in latency,
and actually at the higher thread counts,
the reduction in latency is more significant.
But what I think is really cool is at the low thread counts counts because a lot of media and entertainment customers
are going to have maybe one application running on one workstation looking at a share or data
set. So this is a very important metric if you're playing in that space. Yes? A question about the hardware interface connection for the NAND: for that two-to-four gap, are they plugged into separate sorts of adapters
or places on the front
that might explain why one and two are the same
and four increases?
Right, so the block diagram of our server, essentially,
is what you're talking about,
is whether or not they're plugged into
some sort of PCIe switch behind the scenes. They're actually plugged into a backplane
that does not have any PCIe switches in them, but it is all fed by a single by 16. So we
have a single by 16 on the motherboard, and then that backplane plugs into that,
and it, I guess, aggregates it out to four by fours.
Actually, it's four by eights,
because these are dual port by four.
Okay, we're only using one at a time,
so we're not oversubscribed.
You've got bifurcated PCIe,
and you're not oversubscribed? Yeah. And that's about as deep as I can go into the hardware design of the system.
Yeah. I've taken it apart a lot of times.
But, yeah. Any other questions? These are very good questions. I appreciate them.
This is just basically proving that we're collecting the individual. This is one L2Arc device. We're collecting the individual statistics for each drive. So the client aggregate here
at this thread count, which when I say quote unquote best, what I mean is it is not the maximum ops that we achieved, but it is the highest
ops we achieved before the latency started curving upward. It's kind of the hockey stick
thread count because this is thread scaling. So the client aggregate is getting about 200
megs per second and the single NVMe is servicing half of that, which is pretty impressive. If we look at the arc stats, we're
seeing that even with this massive, by the time this preconditioning is running, it took
12 to 16 hours for the preconditioning to satisfy my metrics. The arc has tuned itself
to the point where it's actually still getting a 20% hit rate at this point,
which is really impressive considering it's 256 gig memory and a 1.2 terabyte active data set size.
We're still getting 20% ARC hit.
It's really well tuned.
We still have some errant feeding.
So what this is, we're actually writing into the NVMe SSD.
NVD 0, you can tell, is the only one active at this point.
So we're getting some errant feeding at the beginning of this.
It is a purely synthetic workload.
So there are probably going to be, even though my NVMe SSD is big enough to hold like every single bite of this workload of this active
data set size, by the time the measurement starts, that's why I had that 90% threshold,
by the time the measurement period begins, there's still going to be some errant feeding
going on here.
We're still going to be churning the tails of the arc a little bit.
Let's see.
So this is with two.
The sawtooth thing you're seeing there is actually just an artifact of how we collect
our stats.
Some of the time series from the thread counts may be shifted a little bit.
Client aggregate again is about 200 mebibytes per second and the two NVMe are servicing about 100. So this result is
similar to the single NVMe result and we can see that each NVMe SSD is doing about 50 mebibytes
per second this time as opposed to 100. So we know that the devices can do more. We just don't know why they aren't in this particular
case. We're going to really see better scaling here with four NVMe devices active, client
aggregate bandwidth at 450 mebibytes per second, and we've got 250 of that coming from the L2ARC devices themselves, and each one is doing,
I would call that 75, which is pretty good.
What I would really look for here, what I saw early on is that early on when my methodology
wasn't right
and I was just adding the devices,
some of these drives were doing a lot more than others.
But we're not out of balance here.
So here's just a, I guess this is the takeaway slide
from the random reads.
This is the L2 arc effectiveness.
So this is the ops versus response time hockey
stick chart and you can see even at the beginning we are far, far lower with this active data
set size. Now these aren't hero numbers. You know our marketing team doesn't care a lot
about these numbers because the ops numbers are actually a lot lower than what we can
achieve with this system. This is a 1.2 terabyte active data set size. I mean, that's something to
get pretty excited about because you're way, way, way bigger than your cache. I mean, we're
not, this is not a cache hit number. So we're at sub-millisecond latency on an active data set size
that is far, far larger than our cache, which is a very exciting result. And we scale out to this magic.
This is a number that the marketing team cares about.
We're actually up over 100,000 ops with that large active data set size.
So this is something they can use in sizing, which is pretty cool.
So a 670% increase in operations per second and a 6.7x reduction in latency at the 16 thread data point.
This is sort of a cautionary tale of quote unquote moving the bottleneck.
So with no L2 arc, this system is very happy.
We're using very little of the CPU. But when you add all of the NVMe SSDs, your CPU is no longer waiting
on slow drive, so your CPU is actually copying buffers. Your CPU is just real busy. So just
as a cautionary tale is, as you incorporate faster and better storage tiers,
you're probably going to need more CPU at some point
because you're sort of moving the bottleneck.
Yes?
Yes.
I do not know that, but I can find that out for you.
Yeah, this is... Okay.
Okay.
That's more of a question for our hardware qualification team
and how they configured these drives.
They were handed to me and said,
these have been qualified for use in our system,
so check this out.
Sequential read heavy.
This is really where the crux of the talk is.
I'm going to spend the rest of my time here.
This is 128K sequential 100% read.
Again, it's very synthetic.
Something cool happens here.
So first of all, the cool thing that's happening here:
the wisdom said L2ARC is no good for sequential workloads.
But we've shown that it's very effective.
It can be very effective with the right devices.
So we're getting 3x the bandwidth versus no L2 arc.
I don't know why the colors changed,
but I apologize for that.
So we're getting three times the bandwidth
compared to no L2 arc, but actually what's happening is
as we increase thread counts,
wow, the performance
goes up.
We sort of did a deep dive on that and we
realized that once we have so many clients
and so many threads happening,
the
traffic as it arrives to
the storage array starts to look random.
It doesn't look like sequential anymore.
So when you've got 12 clients
like I do and you're running 128 or 256 threads per client
of a very synthetic VDBench workload,
it begins to look like random at the server.
So sequential reads begin to look like random reads.
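You can convince yourself of that with a really trivial sketch: interleave enough per-client sequential streams and the arrival order at the server stops looking sequential. The numbers below are made up; it's just the interleaving effect.

    import random

    # Trivial illustration: many interleaved sequential client streams stop looking
    # sequential by the time they arrive at the server. Offsets and counts are made up.
    def sequential_fraction(num_streams, ios_per_stream, io_size=128 * 1024):
        streams = [[s * 10**12 + i * io_size for i in range(ios_per_stream)]
                   for s in range(num_streams)]            # each client reads sequentially
        arrivals = []
        while any(streams):
            s = random.choice([i for i, st in enumerate(streams) if st])
            arrivals.append(streams[s].pop(0))             # requests interleave at the server
        steps = sum(b - a == io_size for a, b in zip(arrivals, arrivals[1:]))
        return steps / (len(arrivals) - 1)

    print(sequential_fraction(1, 1000))     # ~1.0: one client still looks sequential
    print(sequential_fraction(256, 100))    # near 0: at high thread counts it looks random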
Response time is much the same story.
Pre-fetching.
You can actually see, this is how we diagnosed that weird hump we were seeing. You can see that the prefetch reads start to go way down. So the ARC itself
is saying, well, there's nothing here to prefetch. At this point, about this point, ARC says,
well, that's not streaming. I'm not going to mark that as streaming. So
that's actually how we diagnosed the previous, the phenomenon we saw in the previous results.
So, neat stuff. Much of the same, this is something that, you know, we have the balanced performance between all of the NVMes.
Here's the money shot.
And actually, this is a neat, you know, if you see a scaling curve like this, it looks weird.
You're like, why is it doing that?
Well, the reason it's doing that is at this point is where we stop prefetching. At this point is where, from the server perspective,
we're no longer doing a streaming read workload.
We're now doing a random read workload
because the thread count is sufficient
to where things are interleaved
and it starts to look random.
So we get a little bit more scaling.
So really, this changes workload profiles
right in the middle in the synthetic lab environment.
We've got about a 60% or 60 to 70% performance increase out of L2ARC versus no L2ARC for sequential read.
That's a result because a lot of people in the ZFS space are saying don't use it for sequential read.
This is why.
All my data fits in L2ARC.
Well, so this is a testing I did a long time ago
before our active benchmarking automation was complete,
so I don't have a lot of collateral data on this,
but this is a much smaller server,
kind of the same workload,
a 128K VDBench sequential read workload.
But this is a 400 gig SSD as L2ARC instead of an NVMe SSD.
So this is a SAS SSD being used as an L2ARC device.
And this is actually a much smaller active data set size. So we get all of our data in L2,
or in the L2 arc case,
we get all of our data in L2 arc,
and we think, yeah, that's cool.
We're in a faster tier.
Well, that one SSD drive is not nearly as fast
as our 142 SAS 10K hard drives.
So actually what happens is our latencies go up
and all of our requests sort of get
stuck behind that one device, and it's a very
bad thing. So this is why. This is why
that previous assumption existed.
So some people were saying, you know, not to do prefetching into the L2ARC or only do metadata into the L2ARC.
With older and slower and fewer L2ARC devices, this made sense, but in our testing, none of these mitigation factors really made any sense.
If your cache devices are fast enough, it makes no sense to keep things out of L2ARC. It only makes sense to keep things out of L2ARC if your L2ARC
is not well suited to serve the workload.
So I've been given my five minute warning, so I'm going to skip a lot here. Write heavy
workloads are bad, okay.
So basically what's happening is either it does no harm,
just like the designers of L2Arc say,
adding L2Arc does no harm,
or we have some up to 20% regression in IOPS workload because what we're doing is we're spending a lot of CPU cycles
writing to those NVMe devices constantly.
So in those previous results I was showing you,
at some point we're writing, we're writing, we're writing,
we're feeding L2 arc.
At some point the feeding slows down
and we get those CPU cycles back to use for client workload.
This never happens here.
We're constantly churning the ARC.
We're constantly under memory pressure.
We're constantly using those CPU cycles to feed L2ARC, and so our actual client visible performance decreases.
So this is probably the money shot for the rest of the presentation. There are two key
metrics that you can calculate to size a system or configure a system for L2ARC. The first one basically says, hey, is my active data set size going to fit in my L2ARC?
Okay?
The second one basically says, is my L2ARC fast enough to make that a good thing?
Right?
Because you can get all your data in L2ARC just like I showed you before,
but if you get all your data in L2ARC
and L2ARC is slow, that's a bad thing.
Key metrics there.
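Something like the sketch below; every number in it is a placeholder you'd swap for your own active data set size, cache devices, and pool.

    # The two sizing checks, sketched. Every number is a placeholder; plug in your
    # own active data set size, cache device specs, and what the data vdevs can do.
    ACTIVE_DATASET_BYTES  = 1.2e12       # what the clients actually keep touching
    L2ARC_DEVICES         = 4
    L2ARC_DEVICE_BYTES    = 1.6e12       # capacity per cache device
    L2ARC_DEVICE_READ_BPS = 3.0e9        # sustained read rate per cache device
    POOL_READ_BPS         = 1.5e9        # what the data vdevs deliver on their own

    def l2arc_worth_it():
        fits   = ACTIVE_DATASET_BYTES <= L2ARC_DEVICES * L2ARC_DEVICE_BYTES
        faster = L2ARC_DEVICES * L2ARC_DEVICE_READ_BPS > POOL_READ_BPS
        print(f"active data set fits in L2ARC:   {fits}")
        print(f"L2ARC can outrun the data vdevs: {faster}")
        # Only when both are true does caching the working set help. A big but slow
        # L2ARC (the single SAS SSD case above) passes the first check and fails the second.
        return fits and faster

    l2arc_worth_it()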
What type of drives?
Your bandwidth and ops and latency capabilities
are going to rest on your type of drives.
Segregated pools versus mixed pools
with segregated data sets is a couple of things
you can do. So, for example, if you have smaller drives, like you were saying earlier, if you
can only afford very small NVMe SSDs, then you kind of want to segregate those off to the pool or the data set where it makes most sense.
Capacity.
Large devices can hold more of your data.
It will take longer to warm.
In a deployed environment, that probably doesn't matter;
it doesn't matter how long it takes to warm.
You're going to be running for years, hopefully,
so you don't care how long it takes to warm.
I only care about that in my lab.
There is a thing that can happen where if you have like 64 gigs of RAM on your system and you have like, I don't know, 100 terabytes of L2 arc,
you're going to eat up all of your arc with the indexes to your L2 arc in the headers. So your arc hits are actually going to go
down. So beware of that. Probably not a good idea.
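To see why, put rough numbers on that 64 gigs of RAM with 100 terabytes of L2ARC example; the per-block header size below is an assumption that varies by OpenZFS version, so it's the shape of the math that matters, not the exact bytes.

    # Rough arithmetic on that 64 GB of RAM with 100 TB of L2ARC scenario. The
    # per-block header size is an assumption (it varies by OpenZFS version); the
    # shape of the math is the point, not the exact bytes.
    RAM_BYTES    = 64 * 2**30
    L2ARC_BYTES  = 100 * 2**40
    HEADER_BYTES = 96                         # assumed in-RAM header per L2ARC block
    for recordsize in (16 * 1024, 128 * 1024):
        blocks = L2ARC_BYTES // recordsize
        header_ram = blocks * HEADER_BYTES
        print(f"recordsize {recordsize // 1024:>3} KiB: "
              f"{header_ram / 2**30:,.1f} GiB of headers "
              f"({header_ram / RAM_BYTES:.0%} of the RAM in the box)")
    # 16 KiB records: about 600 GiB of headers, nearly ten times the RAM in the box
    # 128 KiB records: about 75 GiB of headers, still more than the entire ARC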
As for device count, our testing shows that device count actually improves latency because it's basically spreading
the load if you have more devices.
What are you going to feed?
If you have a slow L2 arc,
you want it to be demand-fed, not prefetched.
If it's a fast enough L2 arc,
you want to feed everything.
And fast enough is that ratio
I showed a couple slides ago. If you're
writing, you basically want to do those mitigation practices where you segregate the pool.
Finally, know your workload. And this is kind of preachy, teachery stuff, but if you're
a customer looking at evaluating a storage solution, know your workload. Know at least the applications.
And if you're a vendor, know your architecture. Don't just go out there and say, yeah, L2ARC's
awesome, NVMe is a rocket, I'm going to strap myself to that rocket. Not necessarily.
You can actually have problems there. So the big picture here is
with enterprise-grade NVMe SSD devices
as L2ARC,
we have reached the quote-unquote future
that Brendan Gregg mentioned,
and we can now have it as an effective tool
for a streaming workload.
So, thank you.
I'm going to give you some more homework.
Please download the slides for more content,
and please rate my talk in the SDC portal.
Thank you.
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.