Storage Developer Conference - #64: Past and Present of the Linux NVMe Driver

Episode Date: February 13, 2018

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 64. Hello, everyone, including the latecomers that were enjoying the music outside. So I'm going to talk about the past and present of the Linux NVMe driver, and it's been there for a while. Now I'm one of the co-maintainers, and I'm also active in the NVMe technical working group,
Starting point is 00:00:58 hassling all the device and standards people to think of us poor host people that actually have to deal with drivers. And, well, what's a driver? Everyone will know, and if you think of golf right now, you're in the wrong room. So, a piece of computer software that controls input and output. Sounds harmless, huh? But it could be a whole lot of different things. So just if we're sticking to storage drivers, because we're all storage people, and look
Starting point is 00:01:30 at good old SCSI, you could have a tiny little driver like the virtioSCSI for virtual machines, which is including the header, 1,300 lines of code, or we could be three orders of magnitude bigger for a nice little fiber channel driver that everyone loves. So, well, drivers can be quite a lot of different things. So it could be lots of different hardware types. Like if you have seven generations of different hardware in the same driver, it's going to be big. If you support an initiator and a target and SCSI and NVMe in one driver, well, it's going to be big, and all these factors multiply. So if we're moving to NVMe drivers, this could look a little different, too.
Starting point is 00:02:17 So if we go back to Linux 4.4, our NVMe driver was about 7 and a half K lines of code, split over a few files. Actually, in 4.12, it got a lot smaller, which is interesting. But that's because at that point, we actually had two drivers, and in the end, it's actually more. It's just the one that's actually driving
Starting point is 00:02:40 the little PCIe device you think of when NVMe. It's a lot smaller. And either way, we're doing pretty good compared to other NVMe drivers. So if I look at the open source OFED Windows driver, it's getting closer to our fiber channel monster on the Linux side. And it actually has a lot less functionality, which is pretty interesting. So the humble beginning of our little baby, the NVMe driver, was our friend and colleague Matthew Wilcox back in the day, early 2011.
Starting point is 00:03:13 And if you look at the lines of code, the two files down here, it's actually pretty much exactly the same size as our example of a trivial little driver. The other interesting thing when looking at that commit in GitWeb is if you look at the date of the commit, that was actually a month before the NVMe 1.0 spec got released. Someone had a head start in there. So if you look at this very first commit, there was very little in there. In fact, in some ways, it doesn't really look like
Starting point is 00:03:49 what your friendly marketeer will tell you about NVMe. So there was only single submission queue, single completion queue, only did small data transfer, so only up to single page size of actual data transfers, just did a read and a write and a few admin commands. It was tiny, simple, and didn't actually work. So it took a couple more iterations before it was even
Starting point is 00:04:14 in shape to get merged into the Linux kernel. So January 2012, one year exactly. And at that point, it started to resemble what we think of as NVMe. So it got multiple queue support, basically one per CPU. It was pretty strict about that. It didn't really work well at that point if you had less queues than CPUs. It supported larger data transfers, lots and lots of fixes, and it grown about 800 lines of code.
Starting point is 00:04:43 So, well, quite a bit bigger than before. Well, we had that driver in the main Linux kernel. There was still no NVMe product on the market by then. So just a couple prototypes and labs, giant FPGAs. I think Martin had one of those and really loved it. And continued like that, basically, until we got a few products on the market, got lots of little bug fixes and a couple features that are not too major but interesting. So we got support for the NVMe deallocate command, what we Linux people know of as discards.
Starting point is 00:05:20 So telling a device that the data in a bunch of LBAs is not needed anymore. Go reclaim it. Optimize your garbage collection. Actual support for cache flushing, which had been committed before the driver first got into the Linux kernel, actually started working, because someone tested it with a device that had a volatile write cache, which they probably hadn't done before. We got a feature that has been really useful, and that is the character devices. So normally, if you talk about block storage in Unix systems, you've got your block device node that you do the actual I.O. on, of which for NVMe we've got one per namespace. But if you want to actually do administrative and setup things,
Starting point is 00:06:03 we've got the nice little user space nvme-cli tool that allows you to basically exercise every NVMe command. And for that, we have a character device node that offers the ioctls to do it, which is really useful if you, say, have a drive that doesn't have any namespaces. There's not going to be a block device node for it. Or you have a really broken device that doesn't come up and can't create I/O queues, which apparently is a common failure mode for Intel devices, so you could actually bang that back into a shape where you can use it with a firmware update.
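What the character device offers under the hood is the NVMe pass-through ioctl. A minimal user-space sketch, assuming a controller node at /dev/nvme0 and with error handling trimmed, sending an Identify Controller admin command:

```c
/*
 * Minimal sketch of the character-device pass-through interface: send an
 * Identify Controller admin command to /dev/nvme0 via NVME_IOCTL_ADMIN_CMD.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	struct nvme_admin_cmd cmd;
	void *data;
	int fd;

	fd = open("/dev/nvme0", O_RDONLY);      /* controller character device */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (posix_memalign(&data, 4096, 4096))  /* 4K Identify data buffer */
		return 1;

	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode = 0x06;                      /* Identify */
	cmd.cdw10 = 1;                          /* CNS 1: Identify Controller */
	cmd.addr = (unsigned long long)(uintptr_t)data;
	cmd.data_len = 4096;

	if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
		perror("NVME_IOCTL_ADMIN_CMD");
		return 1;
	}

	/* Bytes 24..63 of the Identify Controller data are the model number. */
	printf("model: %.40s\n", (char *)data + 24);
	close(fd);
	return 0;
}
```

This is essentially what nvme-cli does for you with a nicer command-line interface.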
Starting point is 00:06:38 Then another thing we got in there, which was a little controversial mostly later, is that someone at Intel, not the maintainer, added a weird SCSI translation scheme: you could use the SCSI pass-through ioctl that we have in the SCSI subsystem on the NVMe namespace block device node, and it would silently translate the SCSI commands to NVMe commands and submit them. Turns out that code was extremely buggy, had a few exploits in there, and generally didn't work, but a few people relied on that anyway. And the first really, really major change, which also went along with a bump of the version number to the comfortable sounding 1.0 from the 0.whatever releases, was the switch to blk-mq. So blk-mq is a piece of infrastructure we have in Linux. It's sort of the block layer, or rather the new version of the block layer. So before that, our application, our file system, our block device node, more or less called straight into the driver. So it filled out a little bio structure that describes every block I.O. and dispatched that straight to the driver, which is, well, a very low overhead calling convention for sure, but it also meant that every driver written to that interface
Starting point is 00:07:51 needs to do a lot of work. And the prime user of that interface in Linux traditionally was remapping drivers. So if you do a software RAID or a volume manager, those were the drivers written to that interface. And we always had another layer on top, called the request layer, which did a lot of work that I'll get into on the next slide and really helps with block drivers. But once PCIe flash cards started showing up, the performance of that old layer was just too bad for people to use it, so they started duplicating lots and lots of bits of the infrastructure, and we had to act.
Starting point is 00:08:33 So, first prototyped in 2011, we got this new blk-mq layer, which was designed as a replacement for the request layer. And what it does is split and merge IO requests, because whatever the application or the file system submits might not be what the driver really wants. And the easy case you can think of is you're doing really large IO, but your device can only handle smaller IO, so it needs to be split.
Starting point is 00:09:01 The other is you have a stupid file system, not looking at any ext-whatever developers, that just submits a lot of very small I.O. for what's actually one contiguous big I.O. So we're merging that back together. And it has a couple of other interesting helpers. So it manages multiple submission and completion queues, so this whole spreading of, I have N queues available and M
Starting point is 00:09:25 CPU cores, how do I perfectly spread them so that we're having one as local as possible is taken care of. We've got a command ID or tag allocator, which is actually tied into the management of the per IO data structure. So for every possible outstanding command, we've got a pre-allocated data structure, both for the common block work and a driver specific part, which is indexed by our command ID.
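To make the tag and per-command data idea concrete, here is a rough, hypothetical blk-mq registration sketch. This is not the actual NVMe code, and the callback signatures have shifted a bit across kernel versions, so treat the prototypes as approximate for the 4.1x era:

```c
/* Hypothetical blk-mq driver skeleton: per-command data sized via cmd_size. */
#include <linux/blk-mq.h>

struct example_cmd {                    /* driver-private per-command area */
	dma_addr_t prp_dma;
	int nents;
};

static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
				     const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;
	/* the pre-allocated per-command area lives right behind the request */
	struct example_cmd *cmd = blk_mq_rq_to_pdu(rq);

	(void)cmd;                      /* build and post the hardware command here */
	blk_mq_start_request(rq);
	/* completion happens later, e.g. blk_mq_end_request() from IRQ/poll */
	return BLK_STS_OK;
}

static const struct blk_mq_ops example_mq_ops = {
	.queue_rq	= example_queue_rq,
};

static struct blk_mq_tag_set example_tag_set = {
	.ops		= &example_mq_ops,
	.nr_hw_queues	= 4,                    /* e.g. one per CPU / MSI-X vector */
	.queue_depth	= 1024,                 /* tags double as NVMe command IDs */
	.cmd_size	= sizeof(struct example_cmd),
	.numa_node	= NUMA_NO_NODE,
	.flags		= BLK_MQ_F_SHOULD_MERGE,
};

static int example_init(void)
{
	struct request_queue *q;
	int ret;

	ret = blk_mq_alloc_tag_set(&example_tag_set);
	if (ret)
		return ret;
	q = blk_mq_init_queue(&example_tag_set);
	if (IS_ERR(q)) {
		blk_mq_free_tag_set(&example_tag_set);
		return PTR_ERR(q);
	}
	/* ... attach q to a gendisk as usual ... */
	return 0;
}
```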
Starting point is 00:09:52 So there's one very, very easy to use bitmap allocator that gets you all the data you need for your IO, and there are no massive memory allocations and so on and so on. And as I said, we first got this in 2011, merged it into the Linux kernel three years later, and initially used it for the virtio block driver. And then later that year,
Starting point is 00:10:17 we converted the SCSI layer over to optionally use it, and I did a lot of work on that, because it really helped with performance on things like SRP or high-end RAID adapters that were really limited by the old way we'd done it. And then in the next release, well, in 3.19, we converted over NVMe, and by now we've got another 10, 12, or maybe even 15 drivers. The latest one I saw converted over just two days ago was the IBM s390 DASD driver. So even mainframe technology from the 70s is now using our best and fastest infrastructure. And in general, we didn't have to do much work on blk-mq to fit NVMe in, but there was one really interesting thing where NVMe differs from most of the block storage we're
Starting point is 00:11:11 doing in Linux, and that's the way it describes data transfers for DMA. So what NVMe's got is this concept of PRPs, physical region pages or whatever, and the whole idea is that it's not a scatter-gather list as most IO uses, where you have an offset and a length; it only has offsets. So you can't describe a data transfer with one descriptor that spans more than a page, and a page in that sense is actually an NVMe concept. It's the same concept as the page in a typical operating system, so usually 4K, but it's a different setting, so they don't have to be the same.
Starting point is 00:11:51 And in fact, in Linux, they are usually, but not always, because our NVMe page size is always 4K, while the system page size might be larger at 8, 16, 64K. But this means that, for example, if we've got a transfer where we have two pages that are contiguous, for NVMe, we actually have to set two PRP entries for it.
Starting point is 00:12:14 Well, in a scatter-gather list, the length would just increase. And that was something the Linux block layer didn't really expect. So we had to get new code in there that basically tells the merging code in blk-mq, like, don't merge the IOs together if they would span multiple pages. And well, we got that in.
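To put rough numbers on that difference, here is a tiny standalone illustration (not driver code, and the 4K page size is just the value Linux uses for NVMe): with PRPs every NVMe page needs its own entry even if the memory is physically contiguous, while one SGL descriptor can cover the whole contiguous range.

```c
/* Illustrative arithmetic only: PRP entries vs. SGL descriptors. */
#include <stdio.h>

#define NVME_PAGE_SIZE 4096u            /* NVMe page size as used by Linux */

/* PRP entries needed for a transfer starting at 'offset' into a page */
static unsigned int prp_entries(unsigned int offset, unsigned int len)
{
	return (offset + len + NVME_PAGE_SIZE - 1) / NVME_PAGE_SIZE;
}

int main(void)
{
	/* a 16K transfer that is physically contiguous */
	printf("16K contiguous: %u PRP entries vs 1 SGL descriptor\n",
	       prp_entries(0, 16384));
	/* a 4K transfer: a single PRP entry is smaller than a full SGL descriptor */
	printf("4K transfer:    %u PRP entry\n", prp_entries(0, 4096));
	return 0;
}
```

The same arithmetic is what drives the later discussion of when SGLs start to win over PRPs for larger transfers.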
Starting point is 00:12:39 We rewrote it later, and it was buggy for a while. But in the end, it was really useful, because it turns out there were a couple of other drivers that had the same sort of limitation and worked around it by bounce buffering and doing weird things. So all the RDMA... well, not all of them; Mellanox now has a memory registration type for RDMA that supports scatter-gather lists as a vendor extension, but all the other RDMA memory registration schemes have the same limitation. They're basically using the same data structures as PRPs
Starting point is 00:13:10 and used them even earlier. They just didn't have a fancy name for it. And the Hyper-V virtualized driver for Microsoft, Hyper-V and Azure also has the same strange limitation in there. And, well, in the meantime, NVMe had actually grown support for scatter gather lists. I guess a lot of people complained very loudly, and the rumor was it was the array people
Starting point is 00:13:34 that probably can't as easily fix their stack as we could. So NVMe 1.1 now has SGL support, but it's optional. So very few devices actually support it. And it's only supported for I.O. commands. So when you think of I.O. commands, it's the read and the write. Well, a few others that don't matter. But all the classical admin setup stuff still has to be PRPs. Well, except in NVMe over Fabrics a little bit later, where now everyone uses SGLs, but they're different SGLs, just to make it complicated.
Starting point is 00:14:09 And we've got patches out there for SGL support in Linux. We had the first one a couple years ago, but it didn't come with benchmark numbers and wasn't quite as pretty. But now another engineer started on it and actually backed it up with numbers. And as soon as we've got actually larger contiguous transfers, at least 16K, obviously the SGLs win because they're much more efficient to describe that. While on the other hand, for example, for 4K transfers, the PRP will always be more efficient because it's smaller. So we're doing some more fix-ups
Starting point is 00:14:43 to do the perfect detection of these thresholds. And then we will use whatever fits better, if the hardware actually supports SGLs. And well, after we'd done the blk-mq switch, the usual trickling in of small features continued. So we got support for T10 protection information in February of 2015. A little bit later, we got support for the controller memory buffer.
Starting point is 00:15:09 So that's a little piece of memory in a PCIe bar that the controller can expose where you can place data or submission queue entries in. And we have the basic infrastructure. There's only submission queue entries in there. A lot of people really want to use it for PCIe peer-to-peer transfers eventually, but we're still missing the overall infrastructure for that higher up in the PCIe layer for discovery and higher up in the I.O. subsystems to describe physical memory that's not mapped into the kernel virtual address space. So there's a lot of heavyweight infrastructure going on that before we can make full use of that.
Starting point is 00:15:47 We've got support for a persistent reservation API. So that's something I did for a pNFS layout that was primarily targeted at SCSI but has been extended to NVMe as well. So applications can use ioctls to do persistent reservations without doing weird SCSI pass-through tricks that break more often than not.
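The API in question is the generic block-layer persistent reservation ioctls from linux/pr.h. A minimal sketch, where the device path and the reservation key are just made-up example values:

```c
/*
 * Rough sketch of the block-layer persistent reservation ioctls: register a
 * key and take a write-exclusive reservation on an NVMe namespace.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/pr.h>

int main(void)
{
	struct pr_registration reg;
	struct pr_reservation rsv;
	int fd;

	fd = open("/dev/nvme0n1", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&reg, 0, sizeof(reg));
	reg.new_key = 0xabcd1234;               /* register our reservation key */
	if (ioctl(fd, IOC_PR_REGISTER, &reg) < 0) {
		perror("IOC_PR_REGISTER");
		return 1;
	}

	memset(&rsv, 0, sizeof(rsv));
	rsv.key = 0xabcd1234;
	rsv.type = PR_WRITE_EXCLUSIVE;          /* only the holder may write */
	if (ioctl(fd, IOC_PR_RESERVE, &rsv) < 0) {
		perror("IOC_PR_RESERVE");
		return 1;
	}

	close(fd);
	return 0;
}
```

The same ioctls work on SCSI devices as well; the block layer routes them to whichever transport sits underneath.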
Starting point is 00:16:12 And another really interesting thing is support for Apple devices. And now people might say, well, it's NVMe, right? You should support every device. Well, you haven't talked to Apple. So Apple came up with the idea of building an NVMe device where they used the wrong class code, where things were big endian instead of little endian as in the NVMe specification, where you could not do 64-bit MMIO reads but had to split them into two 32-bit MMIO reads, and where you could not use a queue depth of more than one reliably. But all of that would have been easily caught
Starting point is 00:16:41 if they had just run the freely available NVMe conformance test suite, but apparently they didn't, because it doesn't run on macOS. So we had to work around that a little bit. And the other thing is we got basic SR-IOV support. So single root IO virtualization allows a PCIe device to create virtual functions, which, if you're in Linux, are just like any other PCIe function, so you can either use them in your hypervisor host or assign them to guests.
Starting point is 00:17:14 In other OSs, you can only assign them to guests. I have no idea where they put in that arbitrary limitation. But, yeah, so you can create virtual little NVMe-functions, which at that point are really basic. Now that we have NVMe 1.3, since a couple months now, there is actually much bigger NVMe virtualization specification that actually allows you to do useful things with these SRIOV functions and manage them in a vendor-independent way. But we've not implemented that quite yet. The next big change after the block-and-queue move, which was started early last year, actually really at the end of 2015, was that we moved away from having the NVMe driver,
Starting point is 00:18:01 which we did before, to more of a subsystem model, which is pretty similar to what we're doing in SCSI. So the first thing we did was actually copy a method from SCSI, and the method is that we started to split out the upper level, which fills out a little command structure that describes the NVMe command we're doing and then uses a pass-through layer, which then goes down to the actual driver
Starting point is 00:18:30 that deals with the hardware or transport, because we were going to get some more of these transports pretty soon. And the way we've done this in SCSI for the last 10, 15 years, I guess, it's more than 10 years, about 15 years, is that we've used the concept of a block layer pass-through. So the same request structure that the block layer builds up to pass the file system or application IOs, we abuse a little bit and take it as a pass-through command that has a pointer to the fully initialized command structure,
Starting point is 00:19:01 so SCSI CDB or NVMe submission queue entry, and we can use all the existing queuing infrastructure to handle both cases with the same piece of code. And that allowed to split out the little PCIe driver from the common NVMe code, allowed us to support the fabrics transports, which we're going to go to next. And the other interesting thing it allowed us to do is support different I.O. command sets. And now some people might say, there is no other I.O. command set.
Starting point is 00:19:31 Bill hasn't got his key value store through yet, even if he's talking about it. Well, there is one, it's just not in the standard. So the open-channel LightNVM SSD people, and Matias is not here, I guess. Nope. They've come up with their own NVMe command set for open-channel SSDs, and we actually have support for that in Linux,
Starting point is 00:19:50 where we do an FTL in the host and then inject the low-level, very low-level NAND operations as specific non-standardized NVMe commands, and that sort of structure has really helped with supporting that as well. And, yeah, so now with NVMe over Fabrics, first we've done a move. So before that we had the NVMe driver in two little files in drivers/block, with all the weird random block devices that are not SCSI or ATA. And now that we were going to grow a lot more files in NVMe, we decided it's worth having its own directory. And then we split out a new NVMe core module
Starting point is 00:20:31 from the existing NVMe module, and that kept its NVMe name. It didn't become NVMe PCIe just for backwards compatibility reasons because you can't just change people's driver name. They'll get a little upset. And if you're on PCIe, you'll just need these two little modules. So you've got your low-level PCIe driver and you've got a common shared core code. If you're on Fabrics,
Starting point is 00:20:55 there's actually another layer inserted because we have a common Fabrics library. So there's a fair amount of code that's common between all the Fabrics transports and not shared by PCIe. So we've got another little module for that. And all of that went on for a couple months and finished in June 2016 when we published the NVMe over Fabrics code, basically the day after the specification went public. We've been working on that quite a bit. And if we look at Linux 4.12, let's move that away, it's a little loud.
Starting point is 00:21:36 So we've got all these different bits of the NVMe driver. And the funny thing, it turns out, the biggest part is not the core driver, but the NVMe over Fabrics fiber channel driver, which we got late last year. And it's slightly bigger than the actual real core code, which is about 4,000 lines of code. Then we had another optional bit in the core driver, which is the SCSI translation, which is not much smaller than the actual core NVMe code, despite not actually doing anything
Starting point is 00:22:03 but translating around a bit. We got rid of that in Linux 4.13, so I wanted to have 4.12 so we can see the numbers. Then we've got a little bit of optional code for the open-channel SSDs, which if you pull it in, it's in the core module. We've got a pretty tiny, again, fabrics library at just above 1,000 lines of code.
Starting point is 00:22:26 And we've then got the PCIe driver, which is about 2,300 lines of code. It's still pretty small for a Linux driver. And the RDMA driver, which is even smaller. So all in all, it's still a pretty tiny subsystem. In fact, if you count lines of code if we include our NVMe over fabrics target so the controller side implementation all our NVMe related code is still smaller than the OFED Windows PCIe driver
Starting point is 00:22:57 And well, yeah, as I said, in the meantime the family's grown a little bit. So we got the Fabrics support first posted in June 2016, then merged in July. We got the Fibre Channel support at the end of last year. And well, the next interesting thing is that we put a little lock on your little NVMe devices by getting TCG Opal support, which supports authentication for the device
Starting point is 00:23:23 and full disk encryption. Got that code from Intel, so now you can use your TCG-enabled devices with Linux out of the box. No more weird user space tools that generate initramfs files or whatever. Just a proper unlock on resume from suspend to RAM and from disk, which is pretty important if you want your system to come back working with its disk. And the interesting thing, or maybe not interesting if you're used to Linux development, is that of course we didn't add all of that code to the NVMe driver; we have a common little module in the block layer that contains all the low-level TCG Opal functionality, and we have not even 50 lines of code to wire it up in NVMe. And a couple of months later, I did the same wire-up for ATA disks.
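The user-space side of that shared block-layer module is the sed-opal ioctl interface. A rough sketch, assuming an already-provisioned drive, the global locking range, and a trivially handled passphrase (a real tool would treat the key material much more carefully):

```c
/*
 * Sketch of the sed-opal ioctls from linux/sed-opal.h: unlock a locking
 * range for read/write and save the key in the kernel so the device can be
 * unlocked again on resume from suspend.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/sed-opal.h>

int main(void)
{
	struct opal_lock_unlock lu;
	const char *pass = "example-passphrase";   /* example only */
	int fd;

	fd = open("/dev/nvme0n1", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&lu, 0, sizeof(lu));
	lu.session.who = OPAL_ADMIN1;               /* authority to authenticate as */
	lu.session.opal_key.lr = 0;                 /* locking range 0: global range */
	lu.session.opal_key.key_len = strlen(pass);
	memcpy(lu.session.opal_key.key, pass, lu.session.opal_key.key_len);
	lu.l_state = OPAL_RW;                       /* unlock for read and write */

	if (ioctl(fd, IOC_OPAL_LOCK_UNLOCK, &lu) < 0)
		perror("IOC_OPAL_LOCK_UNLOCK");

	/* remember the key so the kernel can re-unlock after suspend/resume */
	if (ioctl(fd, IOC_OPAL_SAVE, &lu) < 0)
		perror("IOC_OPAL_SAVE");

	close(fd);
	return 0;
}
```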
Starting point is 00:24:13 So the same code will now work not just on your NVMe device, but on your SATA device as well. And in theory, also on your SCSI devices, except that as far as I know, there's not a single SAS or Fibre Channel or whatever device that actually implements TCG Opal. And well, so the next big interesting thing is optimizing things. So let's look at the traditional IO flow of any driver, and thanks to Damien Le Moal, from whom I stole this beautiful graphic and a few more later on. The traditional way we execute block I.O. or network I.O. or any I.O. is interrupt-driven.
Starting point is 00:24:54 So we go down the whole stack, submit our command, let the device execute it, and let the device generate an interrupt when it's done. And then we get a context switch, and we jump all the little layers up again. Turns out that with NVMe and a fast enough device, this was starting to be a bottleneck. So we had optimized a lot in the block path, and somewhere around the time when we did the blk-mq work we optimized even more in it, and we needed to figure out what we could do instead.
Starting point is 00:25:25 So what we looked at instead was to introduce a polling mode. The networking people have been doing some of that. User-space drivers have been doing some of that. So in Linux 4.4, we got the initial polled I.O. mode. So Stephen talked a little bit about that in his talk today, so you might have seen some of that. So we've got a flag and a couple of new system calls that are basically extended versions of read and write. We keep adding more of those.
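The flag being referred to is RWF_HIPRI on the preadv2()/pwritev2() system calls. A minimal sketch, assuming a reasonably recent glibc and kernel, an O_DIRECT-capable device, and polling enabled on the queue (e.g. via /sys/block/nvme0n1/queue/io_poll):

```c
/*
 * Minimal polled-read sketch: submit a 512-byte O_DIRECT read with
 * RWF_HIPRI so the kernel polls for the completion instead of sleeping
 * until the interrupt arrives.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	struct iovec iov;
	void *buf;
	ssize_t ret;
	int fd;

	fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (posix_memalign(&buf, 4096, 512))    /* O_DIRECT needs alignment */
		return 1;
	iov.iov_base = buf;
	iov.iov_len = 512;

	ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);   /* high priority: poll for it */
	if (ret < 0)
		perror("preadv2");
	else
		printf("read %zd bytes (polled completion)\n", ret);

	close(fd);
	return 0;
}
```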
Starting point is 00:25:54 I added two generations of them, by the way. Where you tell it it's a high-priority command, and if it's a high-priority command, we're going to poll for it. And in this very first version, we started polling as soon as you submitted the I.O. to make sure we don't ever lose any of that I.O. So with the beautiful DRAM-based NVMe card I have in my test lab, we got our latency for 512-byte reads down from 6 microseconds to four and a half. And we also got a lot less jitter if you look at it. So that was pretty nice. The only really major downside of
Starting point is 00:26:35 this scheme is you're always using 100% of your CPU while you're polling, which is good for Intel because they can sell you more CPUs, but it's not really very efficient. So a couple of releases later, we got a scheme called hybrid polling. And basically what we're doing is, after the I.O. submission, we don't start polling right away, but we start polling later. And that sounds very easy except for the fact that you need to figure out when to start polling. So the big part of that was building an infrastructure that tracks your average I.O. completion times and, in the first version, starts polling when we're halfway there. And that was pretty nice because it drastically reduces the CPU usage. So instead of 100% CPU usage,
Starting point is 00:27:27 if we go to the adaptive hybrid polling scheme, which is the one we ended up with, we're down to 58% of the CPU used for polling, because we're only starting to poll a little later. If you look at the latency graphs, we're actually getting exactly the same performance as before. It turns out this is still not very optimal, but it's a very safe first guess, because you're very unlikely to miss any of that I.O.
Starting point is 00:27:53 unless your I.O. sizes are drastically different. So the next thing that Stephen, who gave the talk before now, added a new mode where we start polling just before the expected I.O. completion time. And who would have guessed that that provides a lot more savings on CPU cycles, almost down to the same amount of CPU usage as the interrupt-based one. So I'm really excited about that. But at the same time, we'll probably have to improve our I.O. completion time estimation to be bucket-based for different I.O. sizes and so on and so on for this to become production-ready.
Starting point is 00:28:32 And the other interesting polling thing that we don't have in the kernel yet, which Sagi, one of our co-maintainers, has been working on, is an optimistic polling. So this polling is interesting if you've got someone synchronously waiting for your I.O. So your user space database is waiting for exactly that I.O. But if we're using Linux as, say, an NVMe or SCSI target device where we get RDMA ops in and NVMe ops out, we don't really want to poll for a specific I.O.
Starting point is 00:29:01 because there's lots of them. So we came up with a scheme where we have a thread that polls the completion queues of both the RDMA device and the NVMe device and just reaps I.O.s as long as it can, and then sleeps for a while and goes back to it, which still allows us to avoid all the interrupts and do pretty efficient processing of the I.O.s. It just turns out fine-tuning this is pretty hard,
Starting point is 00:29:23 so we don't have it in yet, but we're looking at that for one of the next kernel releases. Well, in the meantime, over the last year the driver's been getting older, getting fatter, lots of exciting new features. The one I really liked was the
Starting point is 00:29:42 range deallocate support. So if you look at the NVMe deallocate command or the ATA trim command or the SCSI UNMAP command, they all don't just contain a single LBA range that is unused now, but they contain a few ranges. Each of them has a slightly different format, but the concept is the same. And it turns out the Linux BIO structure
Starting point is 00:30:03 that the file system uses to submit I.O. is very much built around the fact that it contains one start offset and one length. So that concept didn't fit in very well. Well, our blk-mq merging layer came to help. So I did a little hack to that, so that basically a request can have multiple bios, and in this case we could use that fact for the separate ranges, which provided really big speed-ups to applications that do a lot of deallocate operations. So that was really nice.
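From user space this all hides behind the ordinary discard interfaces (fstrim, mount -o discard, or the raw ioctl). A minimal sketch using BLKDISCARD, with an arbitrary example range, and note that this really does throw away the data in that range:

```c
/*
 * Minimal discard sketch: tell the device a byte range of the block device
 * is no longer needed via the BLKDISCARD ioctl.  The driver turns this into
 * NVMe Deallocate (dataset management).  Destructive, do not run this on a
 * device you care about.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(void)
{
	uint64_t range[2] = { 1024ULL * 1024 * 1024,    /* start:  1 GiB in  */
			      256ULL * 1024 * 1024 };   /* length: 256 MiB   */
	int fd;

	fd = open("/dev/nvme0n1", O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (ioctl(fd, BLKDISCARD, &range) < 0)
		perror("BLKDISCARD");

	close(fd);
	return 0;
}
```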
Starting point is 00:30:31 On well-built devices, there's basically no performance overhead of that anymore. On other devices, not looking at Intel, it's still pretty severe because their de-allocate performance is so bad. So it still depends if you want to use it or not because it's device dependent. Then, now that NVMe is moving into the client space,
Starting point is 00:30:50 so how many people here have already got an NVMe device in their laptop? How many of you are running Linux on it? Awesome. Yeah, once you've got the NVMe device in your laptop, you start caring about power management. You probably don't in your server, but the one exciting feature that's in NVMe
Starting point is 00:31:08 is the autonomous power state transition support. So you basically give the device a little table about which power state to enter after it's been idle for how long, and then the device is going to do automatic transitions between those to make sure it's always saving power but is still available with not too much latency. We added that support in early 2017. It turns out
Starting point is 00:31:32 a lot of devices broke because of that. People were only testing it with the Intel Windows RSTe driver, which had really weird policies for it. Once we used Linux, things started falling apart. So the worst combination was Samsung devices in Dell laptops, which was basically a guarantee for things to break. It might have been partly their PCIe root setup as well. And, well, now we have a blacklist of a couple of devices where we can't use that,
Starting point is 00:32:00 but the other ones are really happy about it. And then we got another feature I did, the host memory buffer support. So after we had the controller memory buffer two years earlier, we now have a host memory buffer. And that name sounds very fancy for having a really cheap device that doesn't have any or very little DRAM and wants to steal it from the host. So what the driver basically does is it allocates a big chunk of memory, gives it to the device, gives it a DMA address for it, and says, here, you can
Starting point is 00:32:29 use that, poor device. Your parents didn't even give you any DRAM. Well, have some from me. And this was just catching up on the old features. And once we got NVMe 1.3, we started adding all the exciting new features. And some of them we even added before NVMe 1.3 went public, because the way NVMe works, once a technical proposal, which are the individual bits that get into the new standards version, is ratified, we're allowed to implement it. So the first one we got was the doorbell buffer config feature,
Starting point is 00:33:06 which sounds a little weird and might make more sense if you have the other name that was proposed for it, which was para-virtualized NVMe. So if you've got a virtual NVMe device in your hypervisor that's exposed to the guests, it's a very nice idea because every modern OS has an NVMe driver and things will just work. It just turns out that any sort of traditional device that does a lot of MMIO operations,
Starting point is 00:33:30 so memory-mapped I.O. operations, leads to pretty bad performance, because every MMIO write is a trap from the guest into the hypervisor. So to get good performance, you want to avoid that by just using a piece of memory instead. And the one scheme that really works well with that is virtio, the Linux transport for virtualized drivers. And Linux people love it. It just turns out other people don't. So for Windows, you will always need a third-party driver and it's a pain or whatever. So some people at Google came up with this really cool idea that they're going to use the event,
Starting point is 00:34:11 like the buffer mechanism from virtio, and retrofitted it into NVMe. And that gives really nice performance numbers. It's just they sent a patch for it at a point where it wasn't standardized. We were like, uh-uh, go to the NVMe working group. We want to have a standardized version of that. It's a cool feature, but as a random, underspecified vendor extension, I don't think we can have that in the driver. So it took about another year, and we got a technical proposal in NVMe that standardized
Starting point is 00:34:38 it, and we've got it supported now. So that's pretty cool. The next one is we've got UUID identifiers, so this is the fourth generation of identifiers in NVMe. NVMe 1.0 did not have any sort of global unique identifier for the namespace, so if you're coming from the SCSI world, that's the device identification VPD page, and it wasn't then there, so NVMe 1.1 added a 64-bit IEEE-assigned identifier, which solved the problem, but it's generally still way too small for the typical use cases.
Starting point is 00:35:13 So NVMe 1.2 added a 128-bit version of it. It just turns out that the IEEE identifier is not very good for software-defined storage, because it requires a vendor prefix and then a statically handed-out number inside of that. And if you're just shipping a Linux-based controller that random people can configure, it's, A, very hard to get a vendor prefix,
Starting point is 00:35:37 and B, very hard to make sure people don't accidentally use the same one. So what SCSI did a couple years ago is they added a new UUID identifier, which uses the RFC4122 randomly generated UUID as an identifier, and we added that to NVMe as well and built a little bit of infrastructure
Starting point is 00:35:57 that when we have to add another identifier type, we can reuse it for that. And then we got support for the streams feature, which we're actually not really using the streams. So it turns out we build an infrastructure where the application can tell the kernel if it's hot data, cold data, medium cold data, really cold data.
Starting point is 00:36:21 and then to the NVMe driver all the way down. We'll try to allocate a few streams and stash the hints into those, which seems to work okay for some people, but it would have been much easier to just pass these hints on to the device. Well, I mean, we had a bit of a big fight in the committee,
Starting point is 00:36:50 and the thing that we're doing in the Linux API is something we Linux people would have loved to just pass to the device, while the people that argued for streams were like, well, it's not just about hot and cold, we want to separate different data streams. But we can't really use that easily, because it's very hard to pass that sort of stream ID all the way through the I.O. stack, through all the remappers and file systems. So we ended up basically implementing the scheme we wanted anyway and translating to streams at the lowest level of the stack.
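The user-facing side of those lifetime hints is the per-file write hint set via fcntl(), which landed around this time. A minimal sketch (the file name is made up, and the fallback defines just mirror the values in include/uapi/linux/fcntl.h for older userspace headers):

```c
/*
 * Sketch of write lifetime hints: the application classifies data as
 * short/medium/long/extreme lived, and the kernel maps that to streams
 * (or whatever the device offers) further down the stack.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>

#ifndef F_SET_RW_HINT                   /* values as in linux/fcntl.h (4.13+) */
#define F_LINUX_SPECIFIC_BASE	1024
#define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
#define RWH_WRITE_LIFE_SHORT	2
#endif

int main(void)
{
	uint64_t hint = RWH_WRITE_LIFE_SHORT;   /* e.g. a log that is soon rewritten */
	int fd;

	fd = open("journal.log", O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)    /* takes a pointer to a u64 */
		perror("F_SET_RW_HINT");

	/* subsequent writes through this inode carry the lifetime hint */
	return 0;
}
```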
Starting point is 00:37:10 So we ended up basically implementing the scheme we wanted anyway and translate between the streams at the lowest, the schemes at the lowest level of the stack. And last but not least, that's the wrong way, and last but not least is my favorite current project, multipathing. Starting in NVMe 1.1, NVMe gained the feature that your controller is not your controller anymore, but your subsystem, and then the subsystem has controllers.
Starting point is 00:37:42 So basically you can now build one NVMe device that has two PCIe connectors. Both of them are called a controller, just to confuse everyone who thinks of that PCIe device as the controller, but the whole thing is the subsystem. And you can access both of these controllers independently, either from two systems, which would be the traditional shared storage use case, or you can access it in two different ways from the same system. And the typical use cases for multipath access are aggregating the bandwidth over multiple connections. While that doesn't really make sense on PCIe, where you can just add a couple more lanes,
Starting point is 00:38:20 you are in there for redundancy, which you're typically doing for network-attached storage. It doesn't really make sense if you've got a single little device that sits inside of your computer and just has two different PCIe connections. Or you can do it for locality of access, which makes a little more sense. Let's say you have a dual-socket system and you want a PCIe connection to each of them and make sure your I.O. stays local and doesn't go over the interconnect. But in general, the PCIe NVMe multipath thing didn't get all that much use. It's a bit of a fringe feature that's supported
Starting point is 00:38:51 by some high-end enterprise controllers, but has very few users. It became interesting with NVMe over Fabrics, where we've now got network storage and the whole redundancy and load balancing becomes interesting. And that always makes me think of driving from San Francisco to San Jose, and then you're at that split of 101 and 280, and it's like, which one do you take? You'll need the most up-to-date information on which of
Starting point is 00:39:17 your paths is actually going to be the good one right now. And that's what a TP that's currently being worked on in NVMe is trying to solve, for NVMe, not for 101. And it's called ANA, asymmetric namespace access, which, as the name suggests, is the logical equivalent of ALUA in SCSI, for all these people that love dealing with SCSI arrays. And the idea is that for each controller and namespace, the controller is telling the host: is this an optimized path? Is this a path that works but is not optimized? Is this a path that's currently offline and can't be used at all? And we're getting notifications for that,
Starting point is 00:39:58 and then the host will decide: am I going to route it through my active path right to the storage device? Am I going through the other path, where we might have to do some I.O. shipping on the other end, but it will still work? Or do I better not send that I.O. at all? And if we look at the existing SCSI multipathing in Linux, we've basically got a layer between SCSI and the I.O. submitter that's called Device Mapper Multipath. It's a little module, a stackable block driver module,
Starting point is 00:40:29 that decides which path are we going to send things down. And it's a little bit messy because the information about the state comes from the SCSI layer, so we have to communicate that up to the other layer. And to make it even more complicated, a lot of the decision making and path probing is actually in a user space daemon, so there's a lot of interaction
Starting point is 00:40:48 between different components that should be separate. And the device handlers are a modular piece of the SCSI layer, yeah, so I could have made this even more complicated. What I'm trying to do in NVMe is to pretty much cut all of that out. So once we've got the NVMe driver, we know which namespace IDs on the controllers refer to the same thing, because NVMe is relatively strict about the identifiers, not perfect, but way better than
Starting point is 00:41:17 SCSI. And the way that ANA is built, it builds on the typical NVMe concepts for that, like asynchronous event notifications, log pages that are all really nicely handled by the driver. So the driver always has the full picture of what's going on. And the other nice thing in NVMe is NVMe doesn't do partial completions of IOs, unlike SCSI. So a command is either done or not. And because of that, we can actually literally bounce I.O. between the different NVMe controllers with no measurable overhead. It's just another 60 x86 instructions and two more cache lines that we're touching. So it's way better than using device mapper multipath on top of NVMe, which some people are trying to do.
Starting point is 00:42:05 And that adds up to 5 to 6 microseconds. And if you looked at my previous numbers, if you're polling, that's the time we need for the whole I.O. in NVMe right now. So it's basically doubling your access latency. And the other really nice bit is you don't need to do a setup in user space with a config file and udev rules.
Starting point is 00:42:23 It's like the kernel driver just creates another node that gives you access to that namespace, a unique identifier through any of the available paths. And I'm pretty much done now. There's a couple nice references, but... And would you like to test it on the VLCs for any fabrics? It's pretty much only used for fabrics. So it will work for PCIe too, but the main use case is fabrics.
Starting point is 00:42:53 Well, you should not use both at the same time because your upper layer will never see never then. But, I mean... Yeah, and we had a lot of arguments about that. It just turns out there's basically no code to the multipathing, right? So the big part about multipathing is just discovery of the topology and reacting to the events for topology changes.
Starting point is 00:43:25 And that is completely transport-specific. That's what we have, say, in the device handlers in SCSI right now. And the part that device mapper multipath does right now is basically it clones the request, it clones the bios, it does a lot of memory allocations to deal with the fact that SCSI can do partial completions and we have to partially retry. But the actual implementation of doing path selection and a failover, it's less than 100 lines of code,
Starting point is 00:43:54 so I have no problem with having that duplicated. And a couple of the helpers in the block layer to deal with that go in the block layer. So there is common code, but in general, there just isn't much code. The whole multipath implementation is less than 2,000 lines of code, out of which most is discovery. We'd still like to get to the plot version.
Starting point is 00:44:15 Do you guys think the same way? Yeah. And I mean, SCSI, as I said, is a little more complicated, mostly because of partial completions, also because historically people did vendor-specific multipathing implementations. So before ALUA came out, there was an EMC way of doing it, an HP way of doing it, an LSI Logic way of doing it, and we had to support all of that. Well, for NVMe, we've already told Huawei, it's like, get lost, join the NVMe working group on ANA. We don't want a vendor-specific way of doing it.
Starting point is 00:44:56 Well, it's not going away, right? People have their existing SCSI setups, but I don't want people to use it on NVMe, and I'll make sure it won't be usable with ANA because we're not going to expose the information to upper layers. And for, especially, you know, assuming we find some big insights, we can get to work on this. We'd like to do it just for a little bit, and we use all these things for some abstract. Yeah. And we can't break people's setups on an upgrade, right? Yeah. Yeah. Just for the shiny, cool new stuff,
Starting point is 00:45:35 use the shiny new cool code, and for your legacy setup, you're stuck with what's there. How do you see that flags field in the new preadv2 and pwritev2 developing over time? How would you like to see it develop? Well, so, I mean, it's a flags field, so we have a couple of other flags in there that are pretty exciting and could be a talk on their own. But the whole idea is, in a typical application, you have foreground and background I.O., right?
Starting point is 00:46:01 So think of a simple log-structured database where your main workload is writing to the log. That's high priority. You really care about your transaction commit, because you're going to reply to the network and your records need to be on disk. And then you go into the cleaning phase, where you can just do background operations
Starting point is 00:46:18 and your active main logging thread will have the high-priority bit set and poll, and the other one will just run in the background, interrupt-driven. I was thinking, like, for caching and streams and other stuff, do you think that same thing would be able to be used for that? So an initial version of our lifetime hints actually abused fields in that flags field. We decided on a different interface in the end that is a little more general.
Starting point is 00:46:48 But yeah, we could use more fields for that. We just want to have reasonable semantics before. We don't want to throw random special cases in there that keep accumulating. How do you manage all the quirks and bugs? Let me open the source file. Okay. The question is, any vendor that will come with every small bug
Starting point is 00:47:23 will be putting the fix here? Well... I was thinking you have this register over there instead of over here, sorry. Well, I mean, it depends, right? I mean, right now, the quirks aren't all that bad. So we've gotten the Intel quirks. They're actually not really quirks.
Starting point is 00:47:41 They're just behavior that's beyond the standard that they really want to do. So they want IOs not to span a certain boundary. We finally actually got that standardized in NVMe 1.3; it's a trivial little 16-bit field in the Identify data. They guarantee they're zeroing out everything if you do a deallocate, which is not guaranteed by the standard. Then, well, we've got a couple of devices that can't enter the deepest power state, and we've actually got another quirk list that's not based on PCI IDs,
Starting point is 00:48:11 but on the Identify string, because all the Samsung devices... a lot of Samsung devices have the same PCI ID but actually behave differently. We have a QEMU virtual device that crashes if you use a new Identify command that was only added in NVMe 1.1. We've got a few adapters that need a little delay after reset before you check that they're ready. We've got the completely fucked up Apple devices.
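Quirks like these end up as flag bits attached to PCI IDs (or to Identify-string matches) inside the driver. Roughly, the table has this shape; the flag names follow the ones the driver used around the 4.1x kernels, but the entries here are illustrative rather than an exact copy of the real list:

```c
/* Illustrative only: the shape of the NVMe PCI quirk table. */
#include <linux/pci.h>

enum example_nvme_quirks {
	NVME_QUIRK_STRIPE_SIZE		= (1 << 0), /* I/Os must not cross a boundary   */
	NVME_QUIRK_IDENTIFY_CNS		= (1 << 1), /* chokes on NVMe 1.1 Identify CNS  */
	NVME_QUIRK_DEALLOCATE_ZEROES	= (1 << 2), /* deallocated blocks read as zero  */
	NVME_QUIRK_DELAY_BEFORE_CHK_RDY	= (1 << 3), /* needs a pause after reset        */
	NVME_QUIRK_NO_DEEPEST_PS	= (1 << 4), /* must not enter deepest power state */
};

static const struct pci_device_id example_nvme_id_table[] = {
	{ PCI_VDEVICE(INTEL, 0x0953),	/* an Intel data-center SSD */
		.driver_data = NVME_QUIRK_STRIPE_SIZE |
			       NVME_QUIRK_DEALLOCATE_ZEROES, },
	{ PCI_DEVICE(0x1c58, 0x0003),	/* an adapter that needs the reset delay */
		.driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
	{ 0, }
};
```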
Starting point is 00:48:37 So we've got a new Samsung device that you can't drive with a high QDAP. So it keeps accumulating, but it's nothing new in Linux, right? The HCI driver looks much worse than the USB storage drivers. So we've been, or SCSI for that matter, yeah. So we've been doing that long. And I mean, at some point, we might say this drive is too broken.
Starting point is 00:48:56 We're not going to support it or we'll have a copy of the driver. But so far, it's a little box here and there that can be worked around. But then there's also things that they can say, right? So we'll logically implement support for one that does have a workaround. little box here and there that can be worked around. Yeah. Yeah, but I mean this is... For the disks and tapes and CD-ROMs. Yeah.
Starting point is 00:49:37 And I mean the same, I mean, look at HCI. HCI is the same model as NVMe. It's a common spec with a common PCIe class code and, like, five gazillion vendors implementing it. And that quirk list in HCI is two pages long. And the NVMe one will grow to that, too. And does it impose more conditionals during the code? It does, but we try to keep them out of the fast path in general.
Starting point is 00:50:05 Yeah. Yeah. Well, if we had to do something for the fast path, life wouldn't be so fun. We'd probably have a different version of the fast path routine that we select at setup time, but I hope we don't have to go there. For HCI, we didn't. HCI quirks are still all in the setup path. There's tons of them, but they are out of the FastPath. Do you respect that? Well, a lot of the time it's not PCI spec, but buying an EMA and getting the IP from somebody that didn't know what they were doing.
Starting point is 00:50:48 Yep. So, thanks everyone. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with
Starting point is 00:51:17 your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
