Storage Developer Conference - #118: Linux NVMe and Block Layer Status Update
Episode Date: January 28, 2020...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual
Storage Developer Conference.
The link to the slides is available in the show notes
at snia.org/podcasts.
You are listening to SDC Podcast
Episode 118.
Thanks for joining me.
I'm talking about the Linux NVMe driver and block layer status update.
I gave a pretty similar talk two years ago here at SDC,
so I'm trying to mostly not get into the older history that we were looking at there.
There's a few references.
Back in September 2017, we were talking about Linux 4.13, which is a couple of releases old, obviously.
Just this month, we got a new Linux 5.3 release.
That's about where we are here.
A couple more recent developments as well.
And going back to that old talk, this is kind of what we mean when we say the Linux NVMe driver.
It's not really a single driver anymore.
It's really a driver stack.
Got a core module down in the lower left corner.
This is like completely generic NVMe code that you use over PCIe, over the various network transports. And then for PCIe, we have the driver that's just called NVMe for historic reasons,
because that's how NVMe started out.
It was all just PCIe, so it's not called NVMe-PCI or NVMe-PCIe,
because we had to keep compatibility.
And then two years ago, we had two transport drivers for the so-called NVMe over Fabric spec,
RDMA and Fiber Channel sitting on top of a shared NVMe over Fabrics module.
And if we really want to put this into the bigger picture,
there is an amazing graph that doesn't even fit on my slide.
So I've taken their diagram directly.
So the props go to my friends and engineers at Thomas-Krenn,
who are a German-slash-Austrian systems integrator
who've been doing great work on these diagrams
and updating them regularly.
So we're talking about the I-O stack. This is kind
of where the I-O originates. It's like the various file systems, block-based, where you
really think of file system, network file systems, or I-O directly on the block device
nodes. And then we go through a couple remapping layers in Linux, like software RAID, volume
management, yada, yada. I'm not going to get into the details here. And then we enter what's really the core of the block layer. And the interesting
part, and both me and other people used to talk a lot about it in the last couple of
years, was the blk-mq layer, the multi-queue block layer, which is really essential to
the functioning of the NVMe driver as we have it today.
And the big news, and I think it was Linux 5.1, a couple of releases ago, is that we
finally got rid of the legacy code base that you see here on the left, like the classic
IO schedulers, the classic request structure, which was old code with a lineage back to
the early 90s, which had huge, huge scalability problems.
That's why we never used it with NVMe
and finally moved SCSI and other protocols over.
And our block layer maintainer, Jens Axboe,
did a lot of work converting random weird drivers
very few people use over to the new infrastructure,
like the Amiga floppy driver and stuff like that.
So that we could finally get rid of the old stuff.
And then once you're down there,
this is where all the drivers get into.
And because this diagram is still a little older,
it's just a single NVMe driver,
but just used it as a cross-reference
to the other diagram I did.
And what's kind of important and interesting, even if it's off topic here, and what people
coming from, say, the Windows or VMware world find interesting, is that NVMe is in no way
under the SCSI layer. So we have a SCSI layer for weird historic reasons. ATA actually sits
underneath it with a translation layer, despite us hating it. It wasn't even always like that.
But that's a totally separate discussion to be had over hard
liquor. But NVMe sits beside it, with no SCSI anywhere,
and we're really glad about that.
And in this code structure, what's new is basically that we got two major new, or three-ish major
new pieces of code since then.
One is a new NVMe over Fabrics TCP transport.
It's just another of these network-based NVMe transports.
One that a lot of people find very exciting. Not really because it's the fastest or the
most advanced or anything, but it doesn't really require
any special hardware. It runs on every network card
you can imagine, because TCP IP is just everywhere.
So it's really exciting to drive NVMe into
spaces where before people would have to deal with
expensive RDMA or Fibre Channel hardware and
associated switches and new knowledge and so on and
so on. And there are a couple additions to
the core code, and there's actually a third one
that's pretty small but useful, I forgot here.
The big one is multipathing, the support to access
an NVMe namespace through multiple controllers in a, well, fast and well-defined way.
We'll have a couple slides on that soon.
The other is that we now added NVMe-specific tracing support.
And the one that's missing here is that we actually gained support for failure injection
so that we have better ways to test all kinds of error handling paths
that sane hardware should never trigger, but insane hardware,
unfortunately, does,
and not everyone can have every piece of insane hardware.
And with those additions, with the TCP one,
we just got another new driver, basically,
that is equal to RDMA and fiber channel
in that it uses the shared fabrics helper
in addition to the
core NVMe code and this is kind of how this looks in the actual code base.
So we've got our core.c, which is really most of the core code.
It's a pretty big chunk of code.
The fabrics library, including its header, is pretty tiny,
and the fault injection code the same.
Fiber channel is actually pretty gigantic,
and that's despite sucking in a lot of other layers of code, too,
and having some of the biggest drivers in the kernel
only dwarfed by the GPU drivers,
which are basically each of them two OSs and a couple
frameworks. And we've got the LightNVM stuff, which is basically early-ish open channel
hardware support, which we're actually kind of hoping to maybe eventually get rid of because
it didn't work out too well. The multipath code, a common header,
PCI, which is actually pretty big,
but the PCI NVMe driver actually drives hardware,
so it's really a low-level hardware driver,
unlike the Fabrics drivers, which are just protocol drivers,
which, after a couple layers of abstraction,
eventually end up with a real hardware driver underneath
that we're not counting here, like the RDMA HCA drivers, the fiber channel HBA drivers, and so on.
RDMA is next: again, not tiny, but actually pretty small for a transport. TCP is
a little larger, then the tracing code, and altogether it's not even 20,000 lines of code. So we're still pretty lean compared to some others.
I think last time I compared it to the Windows OFA driver,
which even just for basic NVMe PCIe functionality
was about twice the size of what we do for all protocols.
Okay, so multipathing, the big new thing,
except for TCP, but TCP, while interesting,
I can't really explain in much detail here.
So as I mentioned before, in NVMe, just like in SCSI,
unlike some simpler protocol like ATA,
you have the concept that there are multiple paths, multiple ways to access
a namespace in NVMe terms, in SCSI it would be logical unit, the concepts are pretty similar. And the interesting thing in NVMe is that NVMe actually had a very tight architecture
of what really is valid as another access path.
So in SCSI, a lot of it is just,
yeah, maybe it's going to work.
But NVMe has this concept of a subsystem.
Controllers are part of a subsystem.
We can actually check for a subsystem NQN identifier
to make sure it really is the same thing.
And when doing the ANA protocol later,
we even further tighten up some of these requirements
so the host can actually be relatively safe, or actually is safe, binding to controllers
that have the same namespace, that it really is the same one and not different ones, which
used to be a problem in SCSI in earlier days. And as I mentioned, the thing that actually makes the multipathing interesting, at least
for fabrics or more complicated setups, is a protocol extension called ANA, asymmetric
namespace access.
And those who have been around in the SCSI world, the naming similarity to ALUA and SCSI
is not coincidental.
It's modeled after that with a lot of lessons learned,
and simplified in many ways
to only allow one way to do it
instead of many ways to do it,
where we finally had to agree on one.
And we allow multipath access not only with ANA,
because, for example, if you're having a PCI NVMe controller, a lot of the more expensive
enterprise controllers actually are dual ported. So you have two different PCI ports to access
them and sometimes people use it for failover. But one interesting use case that we specifically
had in mind here is to connect it to NUMA sockets. You have a dual socket system with very high cost
of going across the nodes to the other socket.
And in that case, you can just have your NVMe device
attached to both of them, and you're actually
doing local I/O access instead of going over
the interconnect that might slow you down.
And one of the things that's really interesting,
especially in this PCIe NUMA case
but also in highly optimized RDMA environments
is that we actually really, really do care
about IOPS, about latency, about these
modern performance characteristics
and not just maxing out the bandwidth as the old array people
really like to do with their multipathing implementations
where people were happy as soon as you could fill up the wire.
And while we're talking about these older multipathing architectures, so what Linux
does in the SCSI world is the SCSI layer itself has almost no logic related to multipathing.
I say almost because there's actually some logic
to detect what we're dealing with,
and something we call a device handler,
but it's all actually driven by a middleman
called the device mapper multipath code,
which is a kernel component in the actual IOPath
and a daemon in user space that manages it.
And it's, some people said it's the biggest pain of enterprise Linux deployments.
I'm not sure it is the biggest, but it's definitely up there.
And because your device nodes change if you're on a single path or multiple paths,
you've got that daemon to deal with.
It's a lot of hassle.
So for NVMe, both because of the performance reasons
and the manageability reasons,
and the fact that we have a very tightly integrated spec,
decided that we're actually going to do
a new multipathing implementation
that's part of the NVMe driver.
And as you've seen before,
I mean, the actual .c file is 700-something lines of code.
There's little stuff in headers and other files.
But it's a very, very small addition.
And it just shows up transparently.
So your block devices referring to the namespaces
will just use all the available paths
without you doing anything.
And the pathing decisions are based on the ANA state.
So if the controller, well, the subsystem through the controller,
tells you a path is not available or not optimized,
we're not going to try to use that when we've got an actual optimized path.
Then, we try to use the NUMA proximity,
and then, for people coming in kind of from the Fibre Channel-ish background,
we later added a round-robin mode.
And in many ways, round-robin is not a very smart idea
if you have high IOPS storage where you have cache line misses
every time you hit another path.
But they still wanted to use up the bandwidth of their dual-ported card
in a single PCIe socket, and that's kind of the only way to do it.
So I'm not happy about the round-robin,
but there are use cases, and people like it.
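To make the path-selection logic a bit more concrete, here is a minimal sketch, not the actual kernel code: the ANA group state names follow the NVMe specification, while the path structure, the NUMA distance field, and the ranking are made up for illustration.

```c
/* Illustrative sketch only, not the in-kernel NVMe multipath code. The ANA
 * group states follow the NVMe spec; the path struct, the NUMA distance
 * field and the ranking are made up for illustration. */
#include <limits.h>
#include <stddef.h>

enum ana_state {
	ANA_OPTIMIZED,
	ANA_NONOPTIMIZED,
	ANA_INACCESSIBLE,
	ANA_PERSISTENT_LOSS,
	ANA_CHANGE,
};

struct path {
	enum ana_state state;
	int numa_distance;	/* distance from the submitting CPU's node */
};

/* Prefer optimized paths, break ties by NUMA proximity, and only fall back
 * to non-optimized paths when no optimized one exists. */
static struct path *pick_path(struct path *paths, int npaths)
{
	struct path *best = NULL;
	int best_rank = INT_MAX;

	for (int i = 0; i < npaths; i++) {
		int rank;

		if (paths[i].state == ANA_OPTIMIZED)
			rank = paths[i].numa_distance;
		else if (paths[i].state == ANA_NONOPTIMIZED)
			rank = 1000 + paths[i].numa_distance;
		else
			continue;	/* inaccessible, lost, or changing */

		if (rank < best_rank) {
			best_rank = rank;
			best = &paths[i];
		}
	}
	return best;
}
```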
And I've seen some numbers from people I've worked with
at Western Digital for NVMe over Fabrics
where we're seeing six times better IOPS
with the NVMe multipath
compared to using device-mapper multipath, even on NVMe,
while actually using less CPU cycles,
because device-mapper multipath was CPU cycle-bound
in that use case.
So the other interesting bit,
and I think I'm just going to jump ahead there,
is the tracing.
So Linux has a pretty nice trace event framework
where you have static trace points
inserted in functions that are pre-formatted,
a couple tools to parse it.
Also, I personally prefer the text output like here,
but there's graphic versions too.
And on the block layer, we used to have, well, we still have something called blktrace,
which is a really, really amazingly useful tracer
for any block layer interaction
because it traces the block IO requests
from the file system that issues it
through remapping, through the actual issue in the driver.
But by being generic,
it works on generic block layer concepts,
and now we have some low-level NVMe tracing
where we can trace things like the
actual hardware queue we go out to, the command IDs, like, additional command fields if we're
using T10 DIF, or the DSM bits. And for anyone who's trying to figure out what
actually goes out on the wire, what goes on at a very low level, it's a really nice tool.
But remember, it's in no way a
replacement for blktrace. It doesn't have anywhere near
the functionality. It's just another tidbit for very
low level tracing.
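If you want to play with it, a minimal user-space sketch of turning the NVMe trace events on and streaming them looks roughly like this; it assumes tracefs is mounted at /sys/kernel/tracing (older kernels expose it under /sys/kernel/debug/tracing), that the event group is named nvme, and that it runs as root.

```c
/* Minimal sketch: turn on the NVMe trace events and stream the text output.
 * Assumes tracefs is mounted at /sys/kernel/tracing (older kernels:
 * /sys/kernel/debug/tracing), that the event group is named "nvme", and
 * that this runs as root. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/kernel/tracing/events/nvme/enable", O_WRONLY);
	if (fd < 0) {
		perror("enable nvme events");
		return 1;
	}
	if (write(fd, "1", 1) != 1)
		perror("write");
	close(fd);

	/* trace_pipe blocks and returns formatted records as they arrive. */
	FILE *tp = fopen("/sys/kernel/tracing/trace_pipe", "r");
	if (!tp) {
		perror("trace_pipe");
		return 1;
	}
	char line[512];
	while (fgets(line, sizeof(line), tp))
		fputs(line, stdout);
	fclose(tp);
	return 0;
}
```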
So yeah, these were the completely new code blocks,
and we've also redone a lot of stuff. And
I guess the most exciting part is really the
I/O polling rework for people
who want extremely high-performance I/O. Back a couple years ago, we did the first
version of the polling support in Linux, which landed in Linux 4.4, where an application
that has a high-priority I/O, and which we had to specifically enable for the driver, can
request that, before returning
that synchronous read or write request to the application, we actually poll the
completion queue instead of waiting for an interrupt. And this actually gives pretty
nice performance already but has the big, big limitation that it's limited to a queue
depth of one. And in the classic polling version,
kind of like what you think of as polling,
it would literally spend all the CPU cycles you have on that core just polling
because it doesn't do anything else.
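On current kernels, the per-I/O way to ask for this from user space is the RWF_HIPRI flag on a preadv2() call; a minimal sketch, with a made-up device path and size, and with the assumption that the block device actually has polling enabled:

```c
/* Sketch of a polled synchronous read from user space: RWF_HIPRI on a
 * preadv2() call against an O_DIRECT file descriptor asks the kernel to
 * poll for the completion instead of sleeping on an interrupt. The device
 * path and size are made up, and the block device needs polling enabled
 * for the flag to have any effect. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, 4096, 4096))	/* O_DIRECT needs alignment */
		return 1;
	struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

	/* RWF_HIPRI: spin on the completion queue for this one request. */
	ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
	if (ret < 0)
		perror("preadv2");
	else
		printf("read %zd bytes (polled)\n", ret);

	close(fd);
	return 0;
}
```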
A little later, Damien, who sits there,
came up with the concept of a hybrid polling
where you keep statistics
about average IO completion latencies,
and we only start polling
when we actually start to expect that completion to happen
instead of all the time,
which gives gigantic reductions in CPU cycle usage
while being almost as fast.
But all that was still pretty limited.
And then, mostly driven by Jens Axboe,
our block layer maintainer,
we came up with a new interface called io_uring.
And if you're interested in that,
there's actually a talk just on that here at SDC on Thursday
called Improved Storage Performance
Using the New Linux Kernel I/O Interface.
And if this sounds interesting to you, go there.
It's cool.
And the idea behind that interface
is that we have a ring-based asynchronous IO interface.
You have a submission ring, a completion ring.
People familiar with NVMe might understand that concept.
And part of that, besides a couple other improvements, is the idea of a dedicated polling thread,
similar to what some user space drivers like SPDK have already been doing,
where instead of your application process that submits the
I.O., we have some other thread hogging a core to do polling.
We can actually combine that to some extent with hybrid polling, but basically you have
your application just submitting I/O through one ring, and it has another ring where it can get
notifications for completions, and then the actual polling of the hardware is done by another thread.
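A minimal liburing sketch of that model: IORING_SETUP_IOPOLL asks the kernel to poll the device's completion queue for this ring (which needs O_DIRECT I/O to a device with poll queues), and the dedicated submission thread mentioned above would be the IORING_SETUP_SQPOLL variant. The device path and sizes are made up.

```c
/* Minimal liburing sketch of a polled read, matching the model described
 * above as far as a small user-space demo can. Assumes liburing is
 * installed; device path and sizes are made up, and IORING_SETUP_IOPOLL
 * requires O_DIRECT I/O to a device whose poll queues are enabled.
 * Build with something like: gcc demo.c -o demo -luring */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;

	/* One submission/completion ring pair; completions are polled, not
	 * interrupt driven. IORING_SETUP_SQPOLL would add the dedicated
	 * kernel submission thread on top of this. */
	if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL) < 0) {
		fprintf(stderr, "io_uring_queue_init failed\n");
		return 1;
	}

	int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, 4096, 4096))	/* O_DIRECT alignment */
		return 1;
	struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
	io_uring_prep_readv(sqe, fd, &iov, 1, 0);
	io_uring_submit(&ring);

	struct io_uring_cqe *cqe;
	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		printf("read returned %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}
```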
And initially, we've just done that for PCIe.
By now, we also have it for RDMA and TCP.
And this is actually a really nice graph
from benchmarks from Jens that he recently tweeted.
I think the numbers are actually not that recent
because we've improved a little since then.
But we're at 1.6 million IOPS here without any real work,
just a little user space application using the io_uring polling thread.
And even without polling, io_uring gives a pretty nice performance advantage
where the classic asynchronous I/O interface just leveled out.
But again, the NVMe changes were actually relatively small
compared to the actual user space interface
and the block layer changes.
First off, yeah, I want to mention that at that point,
we had support for interrupt-less polls.
Oh, yeah.
That was pretty significant, both in performance and in the effort to make it.
Yeah, that's actually because we have those separate polling queues
and don't share the queues with the normal ones, which was part of that.
That was Keith who, by the way, is one of the other Linux NVMe driver maintainers
out of the three of us, so he really knows what he's talking about.
And the big advantage of having those separate queues
in addition to avoiding false cache line contention
is that we don't actually have to enable interrupts
in hardware for that queue.
So NVMe, when you create the queue,
you can tell it's like, do I want interrupts enabled or not?
And if you don't enable them, the hardware
doesn't even have to generate the interrupts
and no one has to handle them
and nothing actually happens.
And I'll actually get into a few other optimizations
that kind of came out of that later
and I think this is actually mentioned as a bullet point too
but it's good that we covered it here.
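Conceptually, the Create I/O Completion Queue admin command has an Interrupts Enabled bit, and a poll queue is simply created with that bit cleared. Here is a rough sketch of building such a command, with a simplified SQE struct and helper that are illustrative rather than driver code, and the field layout as I read it from the NVMe spec:

```c
/* Illustrative sketch of building a Create I/O Completion Queue command with
 * interrupts disabled. The field layout follows the NVMe spec as I read it;
 * the simplified struct and the helper are made up, not driver code. */
#include <stdint.h>

struct nvme_sqe {			/* simplified 64-byte submission entry */
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t command_id;
	uint32_t nsid;
	uint64_t rsvd;
	uint64_t mptr;
	uint64_t prp1;			/* physical address of the CQ memory */
	uint64_t prp2;
	uint32_t cdw10;
	uint32_t cdw11;
	uint32_t cdw12;
	uint32_t cdw13;
	uint32_t cdw14;
	uint32_t cdw15;
};

#define NVME_ADMIN_CREATE_CQ	0x05
#define NVME_CQ_PHYS_CONTIG	(1u << 0)	/* CDW11: PC bit */
#define NVME_CQ_IRQ_ENABLED	(1u << 1)	/* CDW11: IEN bit */

static void build_create_cq(struct nvme_sqe *cmd, uint16_t qid,
			    uint16_t qsize, uint64_t dma_addr, int polled)
{
	cmd->opcode = NVME_ADMIN_CREATE_CQ;
	cmd->prp1   = dma_addr;
	cmd->cdw10  = qid | ((uint32_t)(qsize - 1) << 16);  /* 0-based size */
	cmd->cdw11  = NVME_CQ_PHYS_CONTIG;
	if (!polled)
		cmd->cdw11 |= NVME_CQ_IRQ_ENABLED;  /* poll queues skip this */
	/* An interrupt-driven queue would also put its interrupt vector into
	 * CDW11 bits 31:16; a polled queue never programs one. */
}
```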
And yeah, so we have pretty nice performance numbers.
The other thing I'm pretty excited about,
work that Chaitanya did about two years ago,
the scatter-gather list support.
So scatter-gather lists are kind of what
just about every storage interface
but AHCI and the initial NVMe uses.
And it's just a way to efficiently describe
variable length data for transfers.
And NVMe 1.0 did not support these SGLs,
just something called PRPs.
NVMe 1.1, I think, finally added SGL support to PCIe.
The various fabrics transports always used SGLs
in a slightly different form.
And Linux 4.15 finally gained the SGL support for PCIe.
And this is kind of, so if I'm doing a large data transfer,
so with a PRP, which always just describes a page,
and a page could be a couple different sizes,
but for practical purposes is 4K most of the time.
With a PRP, you really need yet another entry
for every 4K chunk. So if you're doing,
say, I.O. on a single x86 huge page, which is two megabytes in size, you have a giant PRP list,
because every little 4K chunk needs an entry. With scatter-gather lists, on the other hand,
you have exactly one entry. So there's one entry: start here, and there's a length field,
and that really helps generating much more
efficient I/O as soon as transfers get larger
than a page or two. And I think that's
about the threshold we have, right? Sixteen K
or thirty-two K.
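The difference is easy to put into numbers: with PRPs, every 4K page of a transfer needs its own 8-byte entry, while one physically contiguous buffer needs exactly one 16-byte SGL data block descriptor. A little back-of-the-envelope sketch:

```c
/* Back-of-the-envelope comparison of PRP entries vs. SGL descriptors for a
 * physically contiguous transfer: 4K pages, and the 2 MB huge page from the
 * example above. */
#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;
	unsigned long xfer = 2UL * 1024 * 1024;		/* one x86 huge page */

	unsigned long prp_entries = xfer / page_size;	/* one 8-byte PRP per 4K */
	unsigned long sgl_entries = 1;			/* one 16-byte addr+len descriptor */

	printf("2 MB contiguous transfer: %lu PRP entries (%lu bytes of list) "
	       "vs. %lu SGL descriptor (16 bytes)\n",
	       prp_entries, prp_entries * 8, sgl_entries);
	return 0;
}
```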
Yeah. And something that kind of goes along
with this, even if it was done at a
different time separately, but really helps with the same
sort of problems
is a block layer thing.
And it's something called the multi-page bio_vec structure.
And the bio_vec structure is a very central structure
in the Linux block I/O subsystem.
It's basically this tuple of a page struct,
which is a Linux-internal abstraction,
but for our purpose here,
it's a placeholder
for a physical address. That's the interesting part in this case. So we can generate the
physical address of the memory from here. We have a length field and an offset field
into the page because the page is always aligned. So the physical address is split into the
page frame number, as some low-level hardware people might know, and an offset into it.
And the structure as is would already be pretty flexible
because it's got these 32-bit lengths and offset fields.
But in practice, at least at the upper layer of the stacks,
we always just used it to describe transfers inside a single page,
kind of like the PRPs I was just complaining about.
We did the same thing.
And Ming Lei from Red Hat, based on patches from Kent Overstreet from a long time ago,
finally got around to fixing up a lot of places in the block layer
to get rid of these assumptions,
which now means that, again, if we're doing I.O.
on, say, a single huge page of two megs,
we can store that in one of those structures
instead of a lot of them.
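For reference, the structure being described really is just that tuple; this is essentially its shape as in include/linux/bvec.h, and the multi-page change means bv_len may now span many physically contiguous pages:

```c
/* The bio_vec as described above: a (page, length, offset) tuple, shaped
 * essentially as in include/linux/bvec.h. After the multi-page bio_vec work,
 * bv_len is no longer limited to what fits inside bv_page itself and can
 * describe a physically contiguous range spanning many pages, e.g. a whole
 * 2 MB huge page. */
struct page;				/* kernel-internal page abstraction */

struct bio_vec {
	struct page	*bv_page;	/* first page of the segment */
	unsigned int	bv_len;		/* length of the segment in bytes */
	unsigned int	bv_offset;	/* byte offset into bv_page */
};
```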
Now, we actually used to merge these together before
or slightly after they hit the driver, kind of
depending on where you see it. So it, it
never went out to the wire like this, unless
a driver like NVMe forced it. But there
was still a lot of, like, merging, splitting, thrashing
a lot of cache lines for no good reason before
we got this in. So in Linux 5.0, this got fixed, and the other interesting thing is
just in Linux 5.4 merge window, which is going on, like, right now, we finally got a serious
merge that switches to networking stack to use the same structure. They basically had
their own version of it before, so it really helps to make, again, a couple of these data transfers that fly from the
block layer to the network stack, like NVMe over TCP, a little nicer to handle.
Another thing that kind of came out of this is optimizations for single segments. So if we just have a single one of these bio_vecs, we can actually handle it a little more efficiently than the normal I/O path.
So the normal I/O path from the bio_vecs creates a struct that we call a scatterlist,
which is a really weirdly designed structure because it basically contains the same fields that we already have
in the bio_vec and then contains another two fields
looking like a scatterlist entry, which contain a
DMA address, as we call it on Linux.
I mean, this is the address as seen by the I/O device,
which for many cases might be the same as a physical
address but it could be the physical address with an offset
or it could be something entirely different if you're using an IOMMU.
And what the scatter list structure kind of helps with is the concept of merging segments
in the IOMMU.
So most IOMMUs have the idea that you give them multiple discontinuous scatter gather
elements, and they actually merge them into a single continuous range out on the wire.
And for that, the scatterlist kind of makes sense, but it's just a very
cache-line-inefficient way of doing that. And I've
actually been working for a while, including taking over maintainership of the
DMA mapping subsystem, to come up with a better way of doing it, but I've already spent
two years on it, so I decided to take a little shortcut just for the small I/Os instead.
And the idea is, if we only have a single
bio_vec, it's very easy to not bother with all of that, because obviously the IOMMU is not going
to merge anything. There's really only a single segment. There's no reason to optimize it.
And then we just map it directly and store the DMA address in our NVMe private request structure.
We don't even need a separate length field, because we've already got one.
Nothing got merged.
And with a PRP, that means we can only easily do it
for a very small size of IO,
but with a scatter gather list,
this can actually, as long as it's one
physically contiguous entry, it can actually be gigantic.
Again, the prototype example of a huge page
when people are doing databases or HPC
really helps for that.
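A rough sketch of that shortcut, in the spirit of the driver but not its actual code: when a request has exactly one bio_vec, skip the scatterlist entirely, map the segment directly, and stash the DMA address and length in some per-request private data (the struct here is hypothetical):

```c
/* Sketch of the single-segment shortcut (kernel-flavored pseudocode of the
 * idea, not the nvme driver's actual implementation): one bio_vec in, one
 * DMA address out, no scatterlist and no merging. The per-request struct is
 * hypothetical. */
#include <linux/bvec.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>

struct my_iod {				/* hypothetical per-request private data */
	dma_addr_t dma_addr;
	unsigned int dma_len;
};

static int map_single_segment(struct device *dev, struct bio_vec *bv,
			      struct my_iod *iod, enum dma_data_direction dir)
{
	iod->dma_addr = dma_map_page(dev, bv->bv_page, bv->bv_offset,
				     bv->bv_len, dir);
	if (dma_mapping_error(dev, iod->dma_addr))
		return -EIO;
	iod->dma_len = bv->bv_len;	/* the bio_vec already has the length */
	return 0;
}
```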
And I saw on my test setups, I actually saw
some pretty small speedups with that,
and then Jens tested it on his io_uring benchmarking rig,
and he actually saw a 4% speedup,
which some people would kill for, and which surprised me.
And this just means, again, I mean, with modern NVMe devices, modern I.O. interfaces, we're
now living in a world where every single cache line counts.
It's like, every time you touch another cache line you don't have to do, it's gonna show
up in benchmarks.
So there, we're getting to the point where we're really just, there's not much fat to
trim anymore.
But it's exciting.
And because of that, there's all these little
performance optimizations that aren't really huge changes
to the code, but kind of nice and useful.
So one interesting thing that also sort of fell
out of the explicit poll queues in the io_uring work,
even if it's not directly related, just using the infrastructure,
is the idea of dedicated read queues.
So instead of having, like, writes and deallocate commands and reads all on the same queue,
this directs the reads to a dedicated set of NVMe queues, and Jens, who works for Facebook,
apparently has a couple sensitive critical read workloads where he'd rather not have
big writes disturb the
reads in the same queues so that they are not blocked. I'm not sure if anyone else
has ever been interested in using it actually. But yeah.
Lockless completion queues. So actually I didn't have the interrupt disabling in there,
but that's kind of coming from the same thing. Now that we don't have an interrupt and a
user process both banging on the same CQ, but only the interrupt,
which has exclusion by definition,
we could actually get rid of the spinlocks
and interrupt disabling on reading the completion queues,
optimizing away a few more atomic instructions,
getting rid of a few fields.
I think Keith actually has a couple more patches
to shrink the queue structure so that it fits in
a single cache line, getting back to our old topic.
We might finally find some time to get back to that.
We have a concept of batched doorbell writes.
So the NVMe interface, when you submit to a submission queue, works by placing the submission queue entry in the queue,
it works by placing the submission queue entry in the queue
and then you eventually ring the doorbell
after you place one or more of these entries
to actually tell the hardware,
okay, there's something to read now.
And traditional Linux driver, like most drivers,
was always putting in one entry
and then ringing the doorbell, which works okay.
It's good for latency, but if you actually have
a lot of SQEs that you know are batched up already, it makes sense to defer that a little bit and batch them up.
And this is especially true for virtualized NVMe controllers
where the MMIO writes that a doorbell ring does
are really, really expensive.
So if you can reduce that a little bit,
it gives it a little speed up.
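The idea in a nutshell, as an illustrative sketch rather than the driver's code: copy the SQEs into the queue and advance the tail locally, and only do the expensive MMIO doorbell write once per batch:

```c
/* Illustrative sketch of batched doorbell writes, not the driver's code:
 * post several SQEs locally, then do one MMIO write of the tail pointer. */
#include <linux/io.h>
#include <linux/string.h>
#include <linux/types.h>

struct my_sq {				/* hypothetical submission queue state */
	void *sqes;			/* queue memory, entries of sqe_size bytes */
	u32 sqe_size;			/* 64 bytes for standard NVMe */
	u16 tail, depth;
	void __iomem *doorbell;		/* SQ tail doorbell register */
};

static void sq_post(struct my_sq *sq, const void *sqe)
{
	memcpy(sq->sqes + (size_t)sq->tail * sq->sqe_size, sqe, sq->sqe_size);
	if (++sq->tail == sq->depth)
		sq->tail = 0;
	/* Deliberately no doorbell write here. */
}

static void sq_ring_doorbell(struct my_sq *sq)
{
	/* One MMIO write covers everything posted since the last ring; this
	 * is the part that gets really expensive when it traps into a
	 * hypervisor for an emulated controller. */
	writel(sq->tail, sq->doorbell);
}
```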
And because we gave so much love to PCIe here,
there's another nice
little thingy for RDMA. And RDMA has this concept of an inline segment. So you send
your RDMA packet that contains the submission queue entry over to the controller,
and normally the controller would then use RDMA primitives to actually read the data
that the other side writes to itself.
It's kind of like two protocol round trips.
And it also has the concept of an inline data
where you can just add a segment of data
right at the end of your initial write
that contains the SQE.
And that is nice because it avoids a round trip,
reduces latency, the downside is that you need
to pre-allocate that space in every buffer,
so it bloats the amount of space used by the queues.
And what we did initially is to only support that
for a single segment, like a single bio_vec.
And Steve Wise figured out that even if we have the same size,
it's actually not very costly to allow multiple segments.
So if we have, say, an 8K buffer,
there's no reason not to allow up to four segments, even if they're 512-byte non-contiguous ones, because you can get that
pretty much for free and it actually helps to speed up some workloads that, say, always
do 8K writes, which often might be two segments because you're not using contiguous memory.
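The relaxation can be pictured like this sketch, which is illustrative rather than the actual nvme-rdma code: instead of bailing out on anything that isn't a single segment, accept any I/O whose total length still fits the pre-allocated inline space, up to a small segment budget (the constants and the struct here are made up):

```c
/* Illustrative sketch of the relaxed inline-data check, not the actual
 * nvme-rdma code: accept multi-segment I/O as long as the total length fits
 * the pre-allocated inline space and a small SGE budget. Constants and the
 * segment struct are made up. */
#include <stdbool.h>
#include <stdint.h>

#define INLINE_DATA_SIZE	8192	/* e.g. 8K of inline space per queue entry */
#define MAX_INLINE_SEGMENTS	4

struct segment {
	uint64_t addr;
	uint32_t len;
};

static bool fits_inline(const struct segment *segs, int nsegs)
{
	uint32_t total = 0;

	if (nsegs > MAX_INLINE_SEGMENTS)
		return false;		/* the old check bailed out on nsegs > 1 */
	for (int i = 0; i < nsegs; i++)
		total += segs[i].len;
	return total <= INLINE_DATA_SIZE;
}
```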
All right. So the other interesting, really, really interesting, in every sense of the
word, enterprise feature, is the PCIe peer-to-peer support.
Some people might know, others not.
So PCI Express, while for the host kind of looking like good old PCI with, like, the physical
signaling, is really
a network protocol underneath.
It's pretty well hidden, but for anyone
reading the spec, it's like that. And one thing that
PCI allows is to not just have
your host CPU
through the root port talk to
a device in pack, but the devices can
actually under circumstances talk
to each other. This is especially
interesting if you've got a PCI switch in there
so that your uplink to the host isn't touched,
but there's a couple other use cases for it.
And we finally grew support in Linux
to have a limited version of that PCIe peer-to-peer support
in Linux 4.19.
And that support is basically a generic support
in the PCI layer, which handles PCIe as well,
to register BARs,
like the big memory windows PCIe cards have,
with that layer, so they can hand out allocations,
and then a way to figure out
if two devices can actually talk to each other,
depending where they sit in the hierarchy
and what kind of settings, like ACS. What's ACS? Advanced control?
Access control.
Access control services, yeah. The access control services allow for it. So basically
it's like, hello, can we actually talk to each other or do we always have to go through
mom? And if we can, we can then initiate the direct transfers. And right now there's
basically just two entities that can talk to each other
because nothing else has been wired up.
And one is the NVMe PCIe driver, which can export the CMB, the controller memory buffer,
which is basically just a giant memory BAR that has no real functionality,
but which the NVMe device can read from and write to internally, as this peer-to-peer memory,
and then RDMA cards can address that
so that if you're doing an NVMe or Fabrics target,
that can directly access the PCIe bar
on the NVMe controller from the RDMA HCA.
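The "can we talk directly or do we go through mom" check essentially boils down to whether the two devices sit behind a common upstream switch port, plus the ACS settings; here is a very rough sketch of just the hierarchy-walking part, using the kernel's pci_upstream_bridge() helper and ignoring everything the real pci_p2pdma code additionally checks:

```c
/* Very rough sketch of the peer-to-peer eligibility idea: walk up from both
 * devices and see whether they share an upstream bridge, i.e. sit behind the
 * same switch, before reaching the root. The real pci_p2pdma code also
 * checks ACS settings, whitelisted root complexes, distance, and more. */
#include <linux/pci.h>

static bool share_upstream_bridge(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *up_a, *up_b;

	for (up_a = pci_upstream_bridge(a); up_a;
	     up_a = pci_upstream_bridge(up_a))
		for (up_b = pci_upstream_bridge(b); up_b;
		     up_b = pci_upstream_bridge(up_b))
			if (up_a == up_b)
				return true;	/* common upstream port */
	return false;
}
```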
And there's all kinds of work going on,
like actually supporting setups that have IOMMUs in use instead
of directly mapped DMA addresses.
And there's one really big warning. So CMB, as
specified in NVMe 1.3 and earlier, is really
gravely broken for typical virtualization setups. NVMe 1.4 has
kind of a shoehorn fix that solves some of that.
Some of this might be dangerous if you're using interesting remappings with IOMMUs.
And it's not actually Linux's fault.
Is that Linux now dedicating the queue for reads?
Is there any reason why we're not telling the NVMe devices that reads are on a specific read queue
and they can take advantage of that?
Okay, so the question was,
now that Linux has a dedicated read queue,
can't we tell the device that it is a dedicated read queue?
So one correction first is,
Linux, by default, it doesn't do that.
There is an option to do it,
which Jens, I guess, at Facebook uses, which I don't think many people
use, just as a spoiler. But if we do this and we had
a way in the standard to tell that device, we'd happily do it. I don't think there is a way right now.
Yeah, but if
there was, we would happily set that bit or two.
Oh, yeah.
We can do this for FICON today, okay?
Yeah. Over Fibre Channel?
Yeah.
All right.
It's part of mainframe.
Yeah.
Will you be able to do this not just for devices that are in the same PCI bus,
PCIe bus, which is what you're talking about here,
but for devices that are on, let's say,
point-to-point fiber channel fabric
that are in two different PCIe buses.
So if you've got, like, two different PCIe root ports
on the CPU complex, it'd be...
You've got a system here and a system here,
and I want to do the peer-to-peer transfer
without dragging it through the host
and exposing it to the network.
Well, not as a single hop thing.
I mean, you could probably build something, but it's not like a single entity that, I
mean, not at this level.
So PCIe, this is just the peer-to-peer just at the PCIe level, and you are either behind
the same root port with a switch, or you have some CPUs that actually allow that routing
from one root port to another,
but it's basically PCIe-specific.
Now, you could do things like, for example,
Mellanox has a NIC that can actually directly do peer-to-peer
and do NVMe transfers without going to the host.
I think it's kind of sketchy, but I think it's cool.
And if you have that on both sides,
you could actually do multiple hops without ever touching a CPU, but that think it's cool. And if you have that on both sides, you could actually do multiple hops
without ever touching a CPU,
but that's not directly solved by this.
The reason we do that on FICON
is because in analyzing over many
years the amount of data
that's moved every day
in a typical setup,
the majority of the data moves device to device.
Well, there is another
thing, sort of in that area,
which we're not doing, which I know SCSI does and NFS does.
They have this idea of a target-to-target in SCSI
or server-to-server in NFS copy,
where the client just says, okay, you talk to you now over the network.
But that's not really related to this.
And then we'll announce this next month for heterogeneous devices.
It'll go from disk to tape. Tape.
Yeah, but that's, I mean, that's higher level storage protocols than this.
This is like super low level, bus level.
Yeah, so after all this interesting performance and enterprise stuff, there's another big
angle, and that is consumer-grade NVMe has finally fully arrived.
If you buy a decently high-end laptop these days, you will get it with an NVMe device.
Only the shitty stuff will still come with SATA.
And the really shitty stuff will come with eMMC and UFS. Don't buy those.
So all these beautiful
little M.2 devices, or
BGA, which is the soldered-on version
of it, or Toshiba at FMS
has announced another weird small form
factor, and there's like CF
Express and SD Express, which is
NVMe-based compact
flash and SD card.
So it's getting tinier and tinier.
It's getting everywhere, and it's getting consumer-grade.
And consumer-grade means a lot more buggy devices.
I mean, the enterprise ones are buggy too, but usually, except for a few ones,
not in a way that we really have to work around them in the driver. For consumer stuff,
we've now used up 13 of our quirk bits,
so we have 13 different misbehaviors we need to work around
for specific devices, and that doesn't even count the ones
we work around for all devices, even the ones that don't need it.
What's really interesting in that thing
is Linux 5.4, which is developed right now,
will contain support for these so-called Apple NVMe controllers
that you find in recent MacBooks, which look a lot like NVMe, except that it turns out
the submission queue entry, which in NVMe has to be 64 bytes, it's actually 128 bytes.
They apparently have something in there, but you don't actually need it to work. And they
won't work if you use more than a single interrupt vector. And
they actually use a shared tag space
between all the queues, so if you have a command
ID of 1 on your admin queue, you'd better
not use that on an I/O queue, because it's going to
blow up. But
fortunately, Linux kind of has support
for all this in the block layer, so it's just setting
a few bits here and there, and things will actually
work. And the people who reverse-engineered it were
pretty happy that Linux now runs nicely on these MacBooks. But it's weird stuff, and
I hope people actually read the NVMe spec next time before they implement it.
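In driver terms such workarounds end up as quirk flags attached to the PCI ID table entry; the flag names below match the ones added for this class of controller as far as I can tell, they live in the driver-internal nvme.h header, and the device ID shown is a placeholder rather than the real one:

```c
/* Sketch of how such workarounds are expressed: a quirk bitmask on the PCI
 * ID table entry. The flag names mirror the ones added for these controllers
 * (defined in the driver-internal nvme.h header); the device ID below is a
 * placeholder, not necessarily the real one. */
#include <linux/pci.h>

static const struct pci_device_id example_id_table[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0xffff /* placeholder ID */),
		.driver_data = NVME_QUIRK_SINGLE_VECTOR |   /* one IRQ vector only */
			       NVME_QUIRK_128_BYTES_SQES |  /* oversized SQEs */
			       NVME_QUIRK_SHARED_TAGS },    /* one tag space for all queues */
	{ 0, }
};
```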
The other thing that is really not 100% specific to consumer devices but 98% is power management.
So NVMe has this concept of power states.
You can either manually manage them
where the operating system always says,
okay, go into power state N.
And we think that's kind of a nanny approach
and the device should really know better.
So the Microsoft Windows driver
actually does the explicit management. And Linux
and some other drivers tell the device, it's just, okay, here's a couple
parameters. Please just pick your power state that you think is best.
And in a lot of cases, this works really well and gives really nice power
savings. In a few cases, it spectacularly blows up.
And the interesting thing is that
very often, it's not even specific to a particular NVMe M.2 device. It's particular to a specific
firmware version on a specific device combined with a specific BIOS version of a specific
laptop. And that leads to some interesting workarounds.
Since Linux 5.3, when Keith did some cool work after Dell tried to do some
not so cool work first, we can also use APST for a system-level suspend. So not just for
runtime power management, but when you close the lid or put the thing away. And before
that, we basically fully shut down the thing, and you'd think that would save the most
power, right? It's off. It turns out, in many cases, not all of them,
it's not. And that's because
Microsoft has pushed the concept
of a modern standby on
people, where they'd rather do that,
and if you don't do that, the
laptop does weird things.
So for a couple platforms,
this, again, helped with power saving.
A couple others got nicely
broken, so
it's a bit of a pain.
Well, and
last but not least, we have the sad puppy
thanks to Intel. So one of our
biggest, biggest issues
in consumer
NVMe is
Intel's chipset division, who's really
out there to screw up our life.
And Intel has something called the RAID mode.
Don't think of RAID because it has nothing to do with that.
So the great idea is you have an AHCI controller on your device, which then hides one or more,
I think up to three, NVMe controllers inside its PCIe BAR, behind the AHCI BAR.
There is no sane way to enumerate it
because it's not documented.
All our quirks for the buggy devices I just mentioned
based on the PCI IDs couldn't ever work
because they're hiding the PCI ID for the real device.
And there's no way to do things like SR-IOV,
creating new functions, assigning functions to guests.
It's like it's all broken because Intel's BIOS and chipset people decided, oh, there's
an NVMe device.
We're not actually going to show it to you.
Your NVMe device is not going to show up in many laptops with Intel chipsets if your BIOS
is in the wrong mode and some of them are hardwired to that.
Instead, there will be some magic in an AHCI device
that isn't even otherwise used.
And it very much seems like an intentional sabotage.
So we had one Intel guy, kernel developer,
trying to post draft patches for it,
and ever since he couldn't even talk about that anymore
because he got some gag order.
And, yeah, this is our biggest problem, in addition to
power management. Everything else is cool to me. Thank you
very much.
Questions?
Yeah, do you have the same automated setup on the
back end?
Yeah, automated setup basically means as soon as we discover multiple controllers
on the same subsystem,
we just link them together in the kernel,
and you will just get multipath access.
You don't do anything.
It's just there.
So that's about as automated
as it gets.
What is the timeline for 5.4?
The timeline for Linux 5.4? So we don't
even have the first release candidate yet. So it's another seven weeks from now-ish.
Eight, maybe.
Seven to eight weeks from now.
Okay, so the question was if it's possible for kernel users
to use io_uring.
And the direct answer is no.
But you get,
so you don't get the ring,
but you can use things
like the dedicated polling thread
from kernel space too, and we've
actually looked at that for the NVMe over
fabrics target. So you'll get, you'll get a
similar use case, it's just not exactly the same
interface, because kernel interfaces just look very different.
Or is it because it needs a rewrite?
So, I mean, first, the question
was, the Fibre Channel code base is so big, is that
because it's so complicated or it's junk and needs a rewrite?
So it's not really 20% of the kernel.
It's a huge part of the NVMe driver.
It's also a huge part of the SCSI layer, but overall, it's not anywhere near as much.
And I'm not sure.
I mean, one problem I know with Fiber Channel is just that a lot of stuff that really is
protocol generic isn't really in a generic layer,
but either in the drivers or in firmware.
And a lot of the drivers support
very different hardware revisions.
They're very different.
No one's really done NVMe for Fiber Channel.
I mean, not NVMe over Fiber Channel,
but an NVMe-like hardware interface for Fiber Channel.
So there's a couple drivers from different companies.
Each of them supports very different hardware generations,
and each of them duplicates a lot of code.
And I don't think there's one single clear answer,
but you kind of get where this is heading to.
So the canonical example,
which I guess might even be the most interesting one,
is to inject the timeout and figure out how the timeout handling works.
Something like that.
Asymmetric, yeah. You're right. Sorry.
No, you're right. My name is on the TP and I still can't spell it.
Damn it. Yeah, well, it's, anyway, I have a hard time getting that over, but I'll try to
explain it. No, so actually for a long time there was no I.O. scheduling for NVMe at all. I don't
think I said that earlier, but it definitely is true. And with the move to blk-mq, we then later
gained support for I/O schedulers for that,
but they've always been shared by all drivers
using that infrastructure, which includes SCSI.
So they're not NVMe-specific,
but there are I.O. schedulers now
that didn't used to be there,
and they're not NVMe-specific.
Okay.
Okay, remaining questions are to be asked over coffee.
Thank you very much.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe@snia.org. Here you can
ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.