Storage Developer Conference - #109: Real-world Performance Advantages of NVDIMM and NVMe

Episode Date: October 1, 2019

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 109. I'd like to talk to you today about the real-world performance advantages of NVDIMM and NVMe, mostly focused on NVDIMM. And I'm doing that by using an OpenZFS file server as the ecosystem we're looking at things in. Having gone to SDC for many years now,
Starting point is 00:01:04 there's been a lot of talk about persistent memory and NVMe, and the performance numbers presented are, you know, as fast as you can get. So I wanted to see what happens when you put those devices in a real system. So let's get to it,
Starting point is 00:01:23 if I can figure out how to advance my slides. Okay, hopefully that will work from now on. So to start out with, I'm going to do what I just said I didn't like very much and walk you through some performance, a device performance survey of flash and persistent memory devices, mostly flash devices, but some different technologies. And I'm going to start by saying what aspect of performance I think matters most when you're looking at strictly flash and persistent memory devices. I don't know if you all will agree with me, but I'll tell you what I think. And then we'll get into the real-world example using persistent memory and flash devices as a slog device in an OpenZFS file server.
Starting point is 00:02:12 I will explain what that means. If you have no idea what a SLOG device is, we'll walk through just some fundamentals of how OpenZFS works. And then we'll go into a SLOG-type performance survey with the persistent memory and flash devices. Uh-oh. Oh, it worked. Okay.
Starting point is 00:02:37 So I think that synchronous write latency is really the key when you're comparing flash devices and persistent memory devices to each other. I think it really shows the differences between the different technologies very well, and we're going to look at some of those results. I will say, don't forget the maximum mebibytes per second that you can get; often you're limited by either your interconnect or your controller on the actual device, which is easy to get caught out by. So, talking about interconnect technology performance, let me grab my pointer here. So why do we care about NVMe? NVMe is the red bars here. You can see where we started out in life.
Starting point is 00:03:32 This is our maximum peak performance that the interconnect will do in one direction. We started back in the old days, SATA 1 and 2. SAS is in gold. So for our enterprise types, we're pretty much on SAS 3 solidly across the industry. I don't think there's any SAS 4 out there. I could be wrong. But you can see that NVMe leads, even at just x2. And keep in mind, if you're dual-porting your NVMe drives, you will be restricted to x2, at least in most of the cases. I think I heard, I had a conversation where someone said that wasn't strictly true. But in general, if you do have a dual-ported NVMe device, you know, the four lanes will be split, so two go to each controller. But if you're doing x4, and of course this is with PCIe Gen 3, with an x4 device, you're obviously way ahead of everything else. So it's really awesome to see that a single device can get that many mebibytes per second. But why do we care about NVDIMM? Now, the exact number here may be in dispute, because in all my years of doing storage performance, I've been focused on storage performance my whole career, about 12,
Starting point is 00:04:51 13 years now. I've never cared about the memory speed. It's always been, if we're hitting memory, we're good. So I've never even really thought about evaluating it. Of course, in OpenZFS, you'd, you know, the whole goal is to be operating at memory speed most of the time. But I've only recently started looking at OpenZFS. So, but, you know, we can just see, you know, what can you get out of your memory bus? The potential bandwidth is so much higher than even NVMe.
Starting point is 00:05:23 So that's why persistent memory is such a hot thing. So let's jump back to what I was talking about before we got sidetracked on interconnects. Why do I think synchronous write performance is a key differentiator? I think it boils down to the usual thing where you can have something fast, safe, and cheap. Oh, no, you can't.
Starting point is 00:05:49 You can pick two of those things in most cases. So do you want to be fast and safe, or fast and cheap? Well, safe and cheap, I guess that's good, too. In general, in flash devices, if you are fast and safe, then usually there's some power-fail-safe cache, usually just supercaps. Generally, that's the differentiator between enterprise devices and consumer devices. And even prosumer and enterprise, that gets a little bit murky. I was trying to pressure the Intel guys on whether my consumer Optane was power-fail safe or not,
Starting point is 00:06:28 and I didn't quite get an answer. Yeah. So, you know, obviously hard drives are slow. So, yeah, writes, reads, it all is kind of not great when you're dealing with a hard drive, at least in our new paradigm of NVMe devices and persistent memory. But, you know, writes aren't that great for NAND flash either, because that's been the whole evolution and the whole difference between, you know, enterprise devices and consumer devices: how much extra capacity do you have to avoid wear-out and ensure you have good performance, so you don't have to wait for a whole bunch of cells to be erased before you can actually write to them?
Starting point is 00:07:12 So it's no walk in the park for NAND flash. I will say that, I mean, I think it's easy for flash to be fast enough. It's like when all-flash arrays first started: it didn't really matter what you did, it was so much faster than what came before that it was fast enough. But now that we have new devices, we have Optane 3D XPoint technology, we have NVMe coming in, now there's more differentiation between all-flash arrays. A basic all-flash array now may not be fast enough for you. It's the same with the device layer. And so, like I said, usually you're limited by your controller or your interconnect for your single flash devices, unless you're using NVMe. So, when I started at iXsystems almost a year ago now, I started buying.
Starting point is 00:08:06 I moved to Tennessee, so I had a little bit more money than where I used to live. So I started buying devices gratuitously off Newegg. And some of my coworkers joined in on this. And because we're a big FreeBSD shop, there actually is a benchmark that's sort of been improved. And it's not actually that old, but the diskinfo command, if you use -w and -S,
Starting point is 00:08:28 it'll actually do a really quick, single-threaded synchronous write test. The intent there is to actually test the suitability of a device for an OpenZFS slug device. So that sort of led to this whole presentation. So as I bought more devices, as other people bought more devices, as other people bought more devices, as our company bought more devices and we brought them in, the first thing we did was we ran
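As a concrete illustration of the test being described (not a command shown in the talk), the FreeBSD run looks roughly like this; /dev/ada1 is just a placeholder for whatever scratch device you point it at:

    # Quick single-threaded synchronous-write latency test, used here as a
    # rough SLOG suitability check. -w allows the destructive write tests,
    # -S selects the sync write test, so only aim it at a disk you can wipe.
    diskinfo -wS /dev/ada1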
Starting point is 00:08:49 Now it's not incredibly scientific. Different hosts were used. They weren't really normalized. It is a quick test. And the devices did range from new to very well-aged. But despite this, we have some cool numbers to work with as we work up the stack. So let's start with hard-drive-based storage. We know it's slow, but whatever. It's served us well for all these years, and with ZFS, actually, there's a pretty good case for using hard drives still. So I looked at two 7.2K hard drives. One is just a consumer laptop hard drive, 2.5-inch. The other is a 10-terabyte, again, consumer, but an HGST helium drive. And the other one was a cool thing. I got a laptop off eBay, and it had a 500-gig hybrid hard drive, so an SSHD. So briefly, these were, I guess, considered a good idea. I think they're sort of fading out of existence,
Starting point is 00:09:55 but they have a small amount of flash in them, and the idea was at least you could acknowledge some writes pretty quickly because they're safe, and I don't know how safe they really are, but also you could cache hot reads. Now, Intel's doing the same thing with their Optane devices now, or they're trying to push that,
Starting point is 00:10:12 so this isn't a totally lost concept, but that is with discrete devices. So anyway, if we look at the two lines down here, the blue line and the red line, these are the two just basic hard drives. And we're looking at the ops per second, so higher is better, and we're looking at what happens to performance
Starting point is 00:10:33 as we scale up I.O. size from 512 bytes, you know, 1K, and then up by powers of 2, up to 8 megs. So we can see that our basic hard drives, my cheap laptop hard drive is, you know, well down in about, you know, 60, is that 60 ops per second? Yeah, it's around 60 ops per second. You know, the large new modern one
Starting point is 00:10:59 is around 100, 120 ops per second or so here at the peak with smaller IO sizes. And there are a few wobbles in there. The real surprise to me was that when we look at the hybrid hard drive, we're up here at 1,000 ops per second. So that's actually equating to about one millisecond for these small I.O. sizes in this part of the curve.
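Because this is a single-threaded test with one outstanding synchronous write at a time, ops per second and average latency are just reciprocals of each other, which is how these chart positions translate into latency:

    latency ~ 1 / (ops per second)
    1,000 ops/s  ->  about 1 ms per synchronous write   (the SSHD)
       60 ops/s  ->  about 17 ms per synchronous write  (the laptop drive)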
Starting point is 00:11:23 So that's actually really good latency, as we'll see in the next slide. But the pure hard drives are in the tens of milliseconds, 10 to 30 milliseconds, for these synchronous writes. Again, single-threaded synchronous writes is all we're looking at here. So let's keep the red line, which is going to change colors,
Starting point is 00:11:43 but it's the 10-terabyte hard drive, and let's keep our SSHD,, but it's the 10 terabyte hard drive, and let's keep our SSHD, just because it's a weird device. And, you know, one millisecond latency, it's still an order of magnitude better than those hard drives for this test. So next, let's step up to the consumer SATA SSD level. I had three of these that I chose from,
Starting point is 00:12:03 a Crucial MX300, 500 gigs, a TeamGroup drive, which was, I think, two for like 35 bucks on Newegg, so a very inexpensive, small 120-gig drive, and a 500-gig Mushkin Reactor, which I bought for my L2ARC in my FreeNAS Mini at home. So consumer SATA SSDs are also around one millisecond. So we can see
Starting point is 00:12:28 they're hovering around the same as that hybrid hard drive, which is now the red line right here that's dotted. So we can see here that they're about the same. The cheaper SSD, the team group one, is a little bit more inconsistent. Again, it's a very short test, so with more conditioning, that may have gone away. But we can see it's about on par with the other one, and for some reason, the Mushkin drive
Starting point is 00:12:57 was lagging behind a little bit. But we're all in the same region. We're one order of magnitude higher than the hard drives, the pure hard drives. So these are great. We need to remember they may not be power fail safe. In fact, I'm pretty sure none of them are. So they're cheap and fast. So for this example, let's keep the yellow line here. That's the Crucial MX300. It's a very stable curve, and it's a pretty good representative sample. So let's go up the stack one more level,
Starting point is 00:13:31 and let's look at what enterprise SATA or SAS SSDs look like. So the SATA device here is just an Intel DC S3700. This is where we get into the well-aged category. This drive has been on for about four years, but the endurance actually is pretty good according to the Intel tool and the SMART stats. So it's been sitting around doing nothing for most of its life.
Starting point is 00:13:50 So that's the green line right here. So we can see that even though it's an enterprise SATA SSD, it's still around one millisecond in the smaller IO size region. It actually is a little bit slower than that MX300. But we do know that this device is power-fail safe. So if we unplug it with no warning, we're not going to lose data that's actually been acknowledged.
Starting point is 00:14:15 That's guaranteed to us. The other devices I looked at are just some SAS SSDs, two HGSTs: an Ultrastar SS300 and an SSD800MH. I forget which line exactly that is. But one's 400 gigs, one's 100. You know, the 100-gig one is a high endurance. The 400-gig one is a lower endurance.
Starting point is 00:14:41 And I also dug up an old STEC ZeusRAM drive, which is actually just DRAM with a six-gig SAS controller in front of it, with, I assume, capacitors to keep it safe. So it can dump to flash if it loses power, but it's sort of like the first NVDIMM, just with SAS. They were really cool, they were just really expensive. But we can see here that the enterprise SAS SSDs,
Starting point is 00:15:10 you know, they're all clustered here, so again, we've moved up one order of magnitude. We're around 0.1 milliseconds in terms of latency with our synchronous writes. And you can see, actually, for the small I/O sizes, you know, that ZeusRAM drive is doing very well, especially considering it's an old six-gig SAS device. But we can see it does fall off
Starting point is 00:15:31 as we get into more of a bandwidth workload. So it does get limited by the interconnect. And the other drives just sort of fall off as well. So let's keep the 400 gig HGST SS300. That one sort of has a nice balanced curve. So what's our next stop? Our next stop is the world of NVMe. So unfortunately I only had two 960 EVOs kicking around.
Starting point is 00:15:58 All my pros were in use and I didn't feel like destroying all the data on them. But we can see here that at least for these devices, and I really would have loved to test a pro. I just did not have a pro, and all of my other NVMe SSDs were just busy. I could not extract them. So, you know, hey, if you install FreeBSD,
Starting point is 00:16:18 send me your numbers, and I'll update the charts. But you can see that they're actually in the same realm as the normal SATA flash, which surprised me. I thought, you know, maybe they'd be a little bit better. They weren't. And there actually was a difference between the 500 gig and the 250 gig. The 500 gig drive was actually slower than the smaller drive.
Starting point is 00:16:38 So maybe it has more extra space, or who knows. Maybe there's, well, there shouldn't be a controller difference, so I really can't explain that. But you can see they're in the same realm, they're in the same SATA flash realm, even though they are NVMe devices, but they are consumer devices. So next step, with some very exciting devices here, the sort of prosumer and enterprise NVMe realm, really just one enterprise, the 1.6 terabyte Samsung PM1725A. You see that, and I've seen that in a couple of presentations here at SDC. But also an HGST SN150.
Starting point is 00:17:20 That was pretty well used, but that's a PCIe form factor. I believe the Intel SSD 750 is as well. That was run by a co-worker on another continent, so I never saw that device. But you can see here we haven't quite jumped up another order of magnitude for the small I.O. sizes, at least transactional stuff. But we're doing fairly well.
Starting point is 00:17:42 We are at, where are we? So we're at around 20 to 30 microseconds there. That's equating here to about 80,000 ops, 60 to 80,000 ops or so, if I had to guess. So that's pretty cool. We can see that there are some differentiations here in the mid-IO sizes. There's a few wobbles going on. Again, maybe with a longer running test, that would change. I mean, I'm not running a whole S3 test suite
Starting point is 00:18:14 to precondition the drives and delete all the content and all that. But this is just a quick example. But we can see they do follow the same trend line. So let's walk forward with the Intel 750. Now, I have a fondness in my heart for Optane devices, and despite the Intel logo on the back of my shirt, I'm not paid to say that.
Starting point is 00:18:39 I just think they're really cool. So the first Optane devices that I'm aware that Intel released were the 16 gig and the 32 gig M.2 Optane devices. And for the price, I thought, wow, that's going to be a really good slog device because we don't need a lot of capacity, and it should be very fast and should have high endurance. And we can see here that even though these devices cost in the order of $100 and change, we can see the latency for small IOs is absolutely fantastic.
Starting point is 00:19:10 It's the same as our prosumer and enterprise NVMe flash devices. So we're ranging from tens of microseconds up here, almost 100,000 ops per second, synchronous writes. But they do have bandwidth limits. And as you roll up your I.O. size, you can see they fall through the entire spectrum of devices all the way down. The small one gets, it matches the hard drive right there,
Starting point is 00:19:44 that HGST Helium drive. So it's very interesting. Like, you know, Optane is, you know, with these small Optane devices with one or two chips on them, you know, you're hitting, it's either a controller limit or it's actually the, you know, the chip. I don't know the architecture of it. But you're hitting basically a controller limit for the device.
Starting point is 00:20:06 So that's why you see this line peel off so quickly. And you do get more out of the 32 gig versus the 16 gig Optane. So it's very interesting. It's a device that spans all of the latency, the whole IOPS range, depending on your IO size. So it's pretty cool. Now, newer Optane devices released,
Starting point is 00:20:26 and my wallet cried out for help, but I couldn't help but buy them. So I got a 118 gig, the Optane 800P, that's an M.2 form factor, and also a 480 gig Optane 900P. I wish I had the 905P, but that's like as much as a mortgage payment, at least in Tennessee.
Starting point is 00:20:43 So I don't have that one yet. But that one, the 900p, is actually a PCIe form factor device. And you can see here that if we look at the 118 gig, which is the yellow line here, we can see it follows the same curve as the other M.2 devices, but it certainly is a little bit more favorable.
Starting point is 00:21:03 And the latency, again, is very good up here. Instead of falling off, you can see the difference down here is not that big. We're comparing against the pink dotted line right here versus that yellow line. So, again, even though it's 120 gigs and, you know, it's a lot better, you know, it's a marginal bump, but the price is also very low. Now, the 480 gig Optane has tons of chips. I assume a much better controller, and I believe it's electrically the same as, like, the data center version. So we can see here that it surpasses, I mean, we'll forgive it for this little difference here, but you can see that it exceeds the very, very expensive, well,
Starting point is 00:21:47 I guess not very expensive, but the fairly expensive 1.2 terabyte Intel SSD 750. And remember, that was about the same as the Samsung, you know, enterprise NVMe device. So this is, I mean, this is really impressive. We have one device, we're doing synchronous writes to it, and we're getting basically 100,000 ops per second. But, NVDIMMs move us up basically almost another order of magnitude, even from that.
Starting point is 00:22:24 We're talking 2.5 microsecond latencies and 400,000 ops per second. So, and this is with a 16 gig NVDIMM in our system. Now, every write sent to this device is actually being mirrored to another controller over PCIe interconnect before it gets acknowledged. And we're still here. So I need to power off one of my other controllers
Starting point is 00:22:52 and run this test again. I've been running tests too hard and fast to actually stop and power off one of my controllers. But even with mirrored writes, this is what we're seeing. And we're also using the NVDIMM device not as memory; we're actually using it through a shim layer that sort of treats it as an actual disk device, a disk-ish device.
Starting point is 00:23:16 So even with all that considered, it's still beating everything. So that's really cool to see. So that's the survey of single device performance. You didn't come here for that. You know, you could do that yourself. But I did it for you, and I have a lot of flash devices to prove it. So let's talk about a more real world example using these devices as a slog, which is what this benchmark was intended to assess,
Starting point is 00:23:45 whether it would be a good fit or not. So for those of you not familiar with OpenZFS, I'm going to go over a quick overview. So we have our FreeBSD plus OpenZFS server here, which is one of the products we sell. Inside of the server, we have the front end, so NIC or HBA, where data comes in from. We have system memory,
Starting point is 00:24:09 a portion of which is used for ARC. And then we have a bunch of disks. So these are just 7.2K hard drives in my case. Those are the data VDEVs. That's where all the data resides when it's considered safe and it's protected by RAID and checksums and all that good stuff.
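As a sketch of the layout being described, with entirely hypothetical pool and device names (the talk doesn't show the actual appliance configuration), a pool with data VDEVs plus the SLOG and L2ARC VDEVs mentioned next might report something like this:

    # Hypothetical 'zpool status' layout: mirrored data VDEVs, a mirrored
    # SLOG, and a single L2ARC (cache) device. All names are placeholders.
      pool: tank
            NAME          STATE
            tank          ONLINE
              mirror-0    ONLINE      <- data VDEVs (7.2K hard drives)
                da0       ONLINE
                da1       ONLINE
              mirror-1    ONLINE
                da2       ONLINE
                da3       ONLINE
            logs
              mirror-2    ONLINE      <- SLOG (separate intent log)
                pmem0     ONLINE
                pmem1     ONLINE
            cache
              nvd0        ONLINE      <- L2ARC device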
Starting point is 00:24:28 And so those are in a zpool, so just a storage pool, a ZFS storage pool. And then we also have some SLOG VDEVs and some L2ARC VDEVs, so we'll get into all those. But let's take a look at ARC. ARC is the adaptive replacement cache in ZFS. It's a global system-wide cache that is composed of main memory. And in a storage appliance, usually almost all the memory in
Starting point is 00:24:52 the system goes into this cache. It's shared by all the pools that are on a system. And it's used to store or cache. Now, it's not a read cache. It's adaptive replacement cache, not read cache. That's an important distinction. So actually all incoming data goes to the ARC first. And then also your hottest data and your metadata, to what extent it can, will be placed in ARC. So it can be serviced with the fastest speed possible. And the ARC actually balances its usage as cache
Starting point is 00:25:26 between most frequently used data and most recently used data. So it's a fairly smart cache. So now that we've talked about the ARC, let's talk about the L2ARC. Now, you'll see it's in a much different place. So L2ARC, your level-two adaptive replacement cache, actually resides on one or more storage devices, and those devices are in the pool.
Starting point is 00:25:53 So these devices will only service I/Os going to a single pool. Usually, these devices are flash devices, usually read-intensive. Obviously the data has to be written there, but the idea is it's written there once and then read back many times. And let's see, what else did I wanna say here? So this caches the warm data and metadata, if there is any, that doesn't fit in ARC. So stuff doesn't really get directly evicted from ARC into L2ARC, but stuff does get moved to L2ARC. So if you have a lot of active data, some will reside in ARC, and then some will be pushed out to L2ARC if it's read a lot. That's weird. Oh, okay, no. I put the red lines there to show you where the ZIL is. So the ZIL is the ZFS intent log. So it's sort of like a journal.
Starting point is 00:27:01 By default, the ZIL actually resides on the data disks in the pool; it just round-robins using different VDEVs inside of the pool. What this does is it's used to quickly store synchronous write operations on persistent storage so you can acknowledge them. Um, so, well, we'll go through exactly what that means later with a diagram; it's hard to explain in words. So, for the client requests, if incoming data comes in,
Starting point is 00:27:30 the client requests can be acknowledged if it's a synchronous write once the data's logged to ZIL because that's on persistent storage. From this, the data is later written into the pool from main memory via a transaction group. So we'll see a demonstration, an animation of that in a few slides here.
Starting point is 00:27:49 The last thing I wanna talk about is the SLOG, slog, whatever you wanna call it; that's actually a separate ZFS intent log, a SLOG. Mr. Dexter has done a lot of explanation of this, and I pulled from some of his materials on the iXsystems webpage, because it's a very hairy subject that confuses people. It is confusing even for me.
Starting point is 00:28:11 I'm fairly new to OpenZFS. So the optional SLOG, it resides on one or more storage devices again, and it's associated with a single pool, just like L2ARC. Here you definitely want flash or better, but it has to be very high endurance. Unless something goes wrong, there are never any reads from this device, but any synchronous writes are written to this device and logged there, so there are a lot of writes. So again, it's added to a single pool, so it only services this single pool. And it allows your ZIL to be separated from your primary pool storage. So in some cases, that really can help your performance a lot.
Starting point is 00:28:55 Now, you may not need a separate log, but zilstat is one utility that comes to mind to let you know if you're using your ZIL heavily, and whether you can then consider maybe separating it out and putting it on a separate device to increase your performance.
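A minimal sketch of that workflow on the command line, with a hypothetical pool named tank and placeholder device names:

    # Watch ZIL activity to judge whether synchronous writes are heavy enough
    # to justify a separate log device.
    zilstat

    # Add a single SLOG device to the pool...
    zpool add tank log nvd0

    # ...or, safer, a mirrored pair of SLOG devices.
    zpool add tank log mirror nvd0 nvd1

    # An L2ARC device is added the same way, as a cache VDEV.
    zpool add tank cache nvd2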
Starting point is 00:29:10 You have to be very careful, though, because if you choose the wrong device here and you hit a bandwidth limit, that's going to be your bandwidth limit if you're doing all-synchronous writes or NFS, for example. You know, if I put that small Optane in here,
Starting point is 00:29:23 I'm going to get, like, 250 megs per second maximum with NFS because it's all gonna be bottlenecked through the slog. So let's walk through what happens during an asynchronous write to OpenZFS. I'm gonna have to hit the button here a lot. So we have a request, data comes in over our front end, the NIC or the HBA, and the system shuttles that forward to OpenZFS.
Starting point is 00:29:51 So OpenZFS immediately accepts the write, puts the data into ARC, which is main system memory, and then once that's done, it can acknowledge that back to the client. So at the next transaction group, which happens about every five seconds, that data will then be moved down onto the data VDEVs in the pool; it gets copied from ARC to the data VDEVs in the pool. The data does remain in ARC because it's most recently used, but it's not dirty anymore.
Starting point is 00:30:31 So nothing will be lost. The data is safe on the hard drives after the transaction group. I think it's pretty similar to other copy-on-write file systems, if you're familiar with any of those. WAFL springs to mind for me. It's not exactly the same, though.
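The roughly five-second cadence mentioned above is the transaction group timeout; on FreeBSD/FreeNAS it's exposed as a sysctl (shown with its usual default, which is an assumption, not a value from the talk):

    # How often dirty data in ARC is flushed down to the data VDEVs, in seconds.
    sysctl vfs.zfs.txg.timeout
    vfs.zfs.txg.timeout: 5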
Starting point is 00:30:50 So how does this change when we're doing a synchronous write and we have a SLOG device in our pool? So, just like before, and I've bolded the steps that are different here: the request comes in, OpenZFS handles it, but the data is written both to the ARC and to the SLOG. As soon as the data hits the SLOG, that can be acked back to the client. So at the next transaction group, that data will be moved from ARC,
Starting point is 00:31:28 or copied from ARC, down to the data VDEVs in the pool. So now your data resides both in ARC and on the pool, the actual hard drives, the persistent storage. And it is technically in the slog, but once the transaction group happens, we don't care anymore,
Starting point is 00:31:44 and that'll be overwritten at a later date. So the data comes from ARC. We never read from the SLOG. The SLOG is only used if there was a power-fail event, and then we had to replay that log to restore the pool to the state that was acknowledged to the clients. So that's the intro; that's how this log is used. So you can see now, for NFS, everything is going to be going through this log if you have one in your pool, because everything is treated as a synchronous write there, or as a synchronous operation. And you can see, if everything's going through this device, you want this device to be as fast as possible. Like I said, if you use that small Optane device, you'll get stuck at some small number of megabytes per second.
Starting point is 00:32:36 And that's just because everything has to go through here to be acked, and then later committed to the pool. Question? That's actually not true, because only writes up to a certain size go through the log; writes that are bigger than a certain size may go to the main pool. The logbias setting of the dataset can be related to that, too. Yes, right. So the assertion was that that's not true, because large writes can go directly to the pool, and there is the logbias setting that you can set per pool, right? Or per dataset, yeah. So with the way it works, at least with FreeBSD and OpenZFS and the way we have it set up in FreeNAS, not many people play with the logbias setting in OpenZFS anymore. And if you do have a SLOG, everything does go through the SLOG. It doesn't bypass. It will bypass if it's using the ZIL. And there is a setting for that. So let's take a look at some real data here. So what I'm doing, I'm actually using iSCSI for this testing
Starting point is 00:33:54 because then I can actually choose: I can set sync=always to make everything go through the SLOG, or I can set sync=standard, and then the SLOG will just not be used. In this case, I'm running through some worst-case scenarios here, basically. We're doing random 4K writes against eight LUNs, and I'm scaling up thread count.
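For reference, the knobs being flipped between these runs are ordinary per-dataset (or per-zvol) ZFS properties; a sketch with a hypothetical zvol name:

    # Force every write through the ZIL/SLOG path, as in the sync=always runs.
    zfs set sync=always tank/iscsi-lun0

    # Default behavior: only writes the client marks synchronous hit the ZIL.
    zfs set sync=standard tank/iscsi-lun0

    # The logbias property raised in the Q&A; throughput biases large writes
    # toward the main pool rather than the log.
    zfs set logbias=throughput tank/iscsi-lun0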
Starting point is 00:34:18 So each of these dots is an additional number of threads per LUN. So our baseline here, where we have the best performance, is this blue line right here. So that's with sync=standard. So we're not using a SLOG, we're not going through the ZIL here. That's our best performance. You can see we're getting, oh, 42,500 ops or so at the peak with that thread count. So for the next run I did, well, let's start from the bottom actually, so we can go up in order of increasing speed. So I took one of the SSDs that I believe we looked at earlier, the SSD800MH, or at least one of the distant relatives of one of them. So that's a single SAS NAND flash device that we're using as our SLOG. So we can see here that that device simply can't service all these ops, so we do get bottlenecked now that we're sync=always, so all of our I.O. has to go through the SLOG. So we hit this limit. Now, if you recall the charts from earlier, we saw that the ZeusRAM device for small I.O.s, like 4K I.O.s, was a little bit faster than this HGST enterprise SAS device. So if we use the ZeusRAM instead for our SLOG, we can see that's a little bit, we do get a little bit more performance out of that. Now obviously we would be bandwidth-limited with the ZeusRAM because it's only a six-gig SAS device.
Starting point is 00:36:00 But for a transactional workload, that's actually not too much of a concern, although I'd have to do the math to see if it was. I don't think it was. Now, I said you can use one or more devices for your SLOG. So I said, okay. Now, I wouldn't recommend doing this, but I took these two devices
Starting point is 00:36:18 and just threw them both in as SLOG devices for that pool. Now, you probably don't want to use disparate devices, but just for the sake of the example, that's what I had on hand, so that's what I did. We can see using both of those devices, so we'll round-robin between them, we were able to actually service more I.O. Let's see, what did we get up to there?
Starting point is 00:36:44 That's almost 30,000. But we're still well short of where, you know, we're not using the slog. So enter nvdim. And again, like I said, in this scenario, we are actually mirroring all the writes to the other nvdim and the other controller before we acknowledge anything back to the other system
Starting point is 00:37:03 so we can acknowledge the client that requested the write. But you can see we're not hitting the knee of the curve really early here. We're actually following the same exact shape. Now we have increased the latency a little bit, so that has pushed our ops a bit lower for each thread count. But certainly the NVDIMM performance is much more favorable than any of these SAS Flash devices. So again, it's a worst-case scenario. You're not often going to be doing all 4K random writes. And with iSCSI, you may or may not decide to change the sync setting
Starting point is 00:37:46 at all. But this just gives us a nice comparison point of, you know, with nvdim we see a totally different shape of that curve. And it matches what we get when we're not even using this log. So the next test I ran was sequential writes, in this case 128K, and I just did the same thread scaling. In this case, it was 12 LUNs, but the same thread counts.
Starting point is 00:38:19 So to build these lines, I just scale up the number of threads, which just scales up the number of outstanding IOs, basically. So our baseline, again, is the blue line with sync standard. So we actually do run into a bottleneck here in terms of the thread counts. Something happens, and we're no longer... we're just backing up our queues
Starting point is 00:38:40 with the higher thread count there. But you can see we're at about 2,750 megabytes per second. So let's go back to the slowest first. I should have reversed that. But let's go to the slowest line that we can see here. And this is with one of our SAS SSDs again. So what's happened here? We've hit the wall from the get-go. Well, if you look
Starting point is 00:39:06 at the spec sheet for this drive, it can do about 410 megabytes per second sequential writes. So in this case, the vendor spec sheet is spot on. That's all that device can do. So we're not going to go any faster. Now, with NVDIMM, you can see, again, we have sort of a similar curve, but we're better able to pump bandwidth through the NVDIMM. It's not the same, certainly, as when we're not forcing all the data through the slog, but this is a much more favorable curve and much more favorable performance than using a SAS SSD that will be bottlenecked here. Now, like we saw before, you could add more SAS SSDs and you'd get up to that point,
Starting point is 00:39:54 but then you're paying for a lot of SAS SSDs and you're taking up your drive bays, whereas if you have a modern motherboard, you have a couple slots that you could easily throw an NVDIMM into. And it's far superior with a single device. In this test, it was, it's a bunch of mirrored 7.2K drives.
Starting point is 00:40:27 It should be somewhere in the order of, it probably varied. It was between 80 and 142 drives. I think in this case it was 142. Or 140. I'm trying, I know that's sort of interesting information I'm just trying not to focus on what I tested I'm not trying to sell that product here so one more chart you'll notice we had a green line that was not evident on the previous chart
Starting point is 00:41:01 that's because I zoomed in here so we could better see the shapes of these curves. So what happens if you take all the slog devices out, but you still leave sync always on in this iSCSI case? So we're going to use the ZIL that resides on the VDEVs in the pool. So you can see actually doing that, because we have a lot of devices,
Starting point is 00:41:23 we can exceed the bandwidth limit that we had when we had that single SAS log device, because we're spreading the load over a lot more hard drives, and those hard drives are all 12 gig SAS hard drives. So each one can do, you know, at least 150 megs, probably maybe a little bit more, maybe up to 300 or so. But you can see here, we can exceed that. We just have higher response time because they're hard drives. So we're using them both to write all the data to, and then we're writing that same data again
Starting point is 00:41:56 later when the transaction group hits. Wait, actually, sorry. We're not doing that because it's 128K. That is the case where we are bypassing the data. We are still writing metadata and some other stuff, thank you, into the slog, and then we have to write that. So we probably are antagonizing the heads and causing some additional seeks. But actually, this is, so 32K and above,
Starting point is 00:42:24 when you're using a ZIL, is when you will bypass. The data gets written directly into the pool, and it doesn't get copied again. But with FreeBSD and FreeNAS, if you have a slog device, then that doesn't happen. You always go through the slog. And then at the transaction group,
Starting point is 00:42:47 the metadata that's changed, that needs to be changed to reflect that new layout on the disk, then gets committed to the actual pool and moved from the ZIL on the drives. So it's a little bit heady, but you know, it's pretty easy to see. If you have one single SAS device, you can easily get bottlenecked.
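The 32K threshold described here corresponds to an OpenZFS tunable; on FreeBSD it has commonly been exposed as a sysctl like the following (the exact name and default can vary by release, so treat this as an assumption):

    # Writes at or above this size are logged indirectly when there's no SLOG:
    # the data goes straight to the pool and the ZIL record only points at it.
    sysctl vfs.zfs.immediate_write_sz
    vfs.zfs.immediate_write_sz: 32768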
Starting point is 00:43:08 You can get very favorable latencies when you're using NVDIMMs, and that does allow for fairly good bandwidth as well. But in some cases, like with this configuration here versus just not having a slog device, chances are, if this is your workload, you would be better served not having a slog at all. So a slog doesn't always increase your performance.
Starting point is 00:43:35 However, now with NVDIMM, you may actually consider using a SLOG for a workload where you may not have considered it before. So that's all I had. I wanna just thank iXsystems for making this talk possible. I spent a good bit of time, and some very, very limited-availability equipment, to run some of these tests.
Starting point is 00:44:02 I got my Twitter, GitHub, and email up there. So thank you, everyone. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the
Starting point is 00:44:37 Storage Developer Conference, visit www.storagedeveloper.org.
