Storage Developer Conference - #37: NVMe Over Fabrics Support in Linux
Episode Date: March 21, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast.
You are listening to SDC Podcast Episode 37.
Today we hear from Christoph Hellwig, consultant, and Sagi Grimberg, co-founder and principal architect at Lightbits Labs, as they present NVMe over Fabrics support in Linux from the 2016 Storage Developer Conference. So, us here is me, Christoph Hellwig, and Sagi. We're both long-term Linux kernel developers. I've been doing a lot of filesystem work, including NFS serving and block storage, and I only got into RDMA for this project.
I've been working on RDMA for the last, I don't know, five years or so. I worked for Mellanox and was basically involved in the whole range of storage RDMA use cases and applications.
Yes, so to talk about NVMe over Fabrics, we'll first have to assume you at least know what NVMe is and what NVMe over Fabrics is. I hope you attended the last talk or followed all the hype, because we're not going to get into that again here. So, NVMe over Fabrics and Linux.
Basically, there's been some sort of Linux involvement since the beginning.
So the first demo that kind of had the NVMe over Fabrics name, and I'm not sure if other people did similar technology, was at, I think, IDF in 2014 by Intel and Chelsio. They said they were using Linux, but none of us was really involved with that, so we can't comment.
Then during early spec development,
we had at least two Linux-based prototypes.
So one was done by Intel with a few other people chiming in,
and none of us was directly involved,
but Sagi might know a little bit about it. Not sure if he can comment on that or if it even matters.
But I've been driving the implementation we did at HGST, so I can talk a lot about it, and we will a little bit later on.
And well, a lot of Linux developers,
us too, Steve Bates, who's in the room here,
Ming Lin from Samsung, have
been involved with both the various implementations and the actual spec development. So I've been
pretty active on the spec, too.
And Jay.
Yeah, and Jay. And, well, he's not a long-term Linux developer, but he's been really
helpful with our project, one of the Intel guys. And Sagi and I are both on the technical proposal
and spend a lot of time working on the spec and so on.
So what was our little HGST prototype? My manager there came to me: oh, this NVMe over Fabrics hype, what is it actually about? Well, I don't know. The latency numbers they claim for it I already achieved in my previous project for SanDisk over SCSI, so I'm not sure what the hype is about. And, well, so we looked at it,
and we knew we could get about 10 microsecond added latency over SCSI with SRP before, because
that's the last big project I did for SanDisk, as I said, which happens to be merged with
Western Digital just like HGST now, so we're all one big family again.
So what I did was: oh well, I know there's this RDMA protocol, we can already get these really good numbers out of it, actually better than the previous demo, so let's just tunnel NVMe commands over it. Turns out that obviously worked pretty well, but it also obviously wasn't quite what the existing spec did, which I only got access to later. The spec at that point was actually pretty horrible; it tried to emulate PCIe doorbell writes over NVMe over Fabrics and created giant packets over the wire. The performance was actually worse than what we saw with the existing SRP protocol. So as part of that, I got involved with the spec, tried to move the spec over to what we did, and moved our implementation towards what the spec did,
and we were basically converging from both sides
to get to something everyone was happy with.
And in 2015, so last year,
the different people with the different prototypes finally tried
to get together and there was a lot of weird legalese involved with that, sharing code
between possible competitors, industry associations and so on.
So we need a lawyer here to explain the details.
But in the end, we managed to get a development group started
that was organized as a working group
of the NVMeXpress organization,
where we could do our Linux development.
And we tried to make it follow Linux-style development as much as possible, so that we wouldn't end up with a big code drop later that doesn't really look like what anyone wants. So we set up a Git repository, set up mailing lists, appointed maintainers, one of them is Sagi here, and basically worked as if we were running a normal Linux development mailing list, but had to do it in private for now because the spec was under NDA.
But even before we could release our actual new code and tell people what's in this NDA
part of the spec, we tried to get ahead and actually prepare Linux for landing this codebase
so that we wouldn't have to start all the discussions that code refactoring would involve.
And one thing that was very obvious, and that we could go public with before even showing the spec to anyone, was that the existing NVMe over PCI Express driver in Linux should clearly evolve into more of a subsystem, similar to what we have for SCSI or ATA, for example, so that we have a split between the actual command set that we're talking (and NVMe, in theory, supports multiple command sets, even if only one is defined in the spec right now), a core part, and then the actual transport drivers.
So the first thing I did for that
is to rewrite how the NVMe driver submits
internal commands.
Internal commands are those that don't come from a file system or a raw block device and just do read and write operations, but do internal things like scanning for namespaces, initializing the hardware, all kinds of cases where you submit commands internally. And what we've been doing for SCSI forever is to reuse the structure we use for submitting I/O requests to block drivers, which is called struct request, and be able to attach a fully built-up command to it.
So in SCSI, you have your SCSI pass-through command, which
basically means you have your CDB attached to it,
you have your payload attached to it,
and then the low level driver doesn't
try to do any translation to the SCSI command because it already has it.
And that scheme was something we could apply almost one to one to NVMe as well.
It's a little more difficult with NVMe because NVMe has more data in what's called the submission
queue entry or just NVMe CMD in the Linux structure.
So it has the actual data pointers in it, which SCSI doesn't do.
It has the namespace number in it, which SCSI got rid of 20 years ago for a good reason.
So there's a few things we might have to clear out, but the scheme is pretty much the same.
So that was pretty easy to do ahead of time.
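As a rough illustration of the pass-through scheme described here, the sketch below builds an Identify Controller command and hands it to the core, which wraps it in a struct request on the admin queue. The helper nvme_submit_sync_cmd exists in the Linux NVMe core, but exact field layouts and signatures have shifted between kernel versions, so treat this as a sketch rather than the exact upstream code.

```c
/*
 * Sketch of the internal-command pass-through scheme: build a full NVMe
 * command (here: Identify Controller) and let the core attach it to a
 * struct request on the admin queue.  Helper names follow the Linux NVMe
 * core of that era; exact signatures vary between kernel versions.
 */
#include <linux/slab.h>
#include "nvme.h"		/* the NVMe host driver's internal header */

static int identify_ctrl_example(struct nvme_ctrl *ctrl,
				 struct nvme_id_ctrl **id)
{
	struct nvme_command c = { };

	c.identify.opcode = nvme_admin_identify;
	c.identify.cns = cpu_to_le32(1);	/* CNS 1 = identify the controller */

	*id = kmalloc(sizeof(**id), GFP_KERNEL);
	if (!*id)
		return -ENOMEM;

	/*
	 * The core wraps this in a struct request on the admin queue and
	 * executes it synchronously; no transport-specific code involved.
	 */
	return nvme_submit_sync_cmd(ctrl->admin_q, &c, *id, sizeof(**id));
}
```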
And then we could start splitting the structures, so that something that is a super generic concept and doesn't have any knowledge of PCIe, or just MMIO, memory-mapped I/O, in general, can move to a common layer. Everything that looks very PCI specific stays in the PCIe driver. And for the few places where we still had to call out into the low-level driver and couldn't handle it through this command pass-through, so non-command-based abstractions, we could add a little operations vector.
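To make that little operations vector concrete, here is a simplified sketch modeled loosely on the Linux struct nvme_ctrl_ops; the real structure has more members and has changed over time, so the exact fields here are illustrative.

```c
/*
 * Simplified sketch of the per-transport operations vector: the handful of
 * non-command interactions the core needs from a controller.  Loosely
 * modeled on the Linux struct nvme_ctrl_ops; member names are illustrative
 * and the real structure has more entries.
 */
#include <linux/types.h>

struct nvme_ctrl;			/* opaque here */

struct nvme_ctrl_ops_sketch {
	const char *name;		/* "pcie", "rdma", "loop", ... */
	int  (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
	int  (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
	int  (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
	void (*free_ctrl)(struct nvme_ctrl *ctrl);
	void (*submit_async_event)(struct nvme_ctrl *ctrl);
	void (*delete_ctrl)(struct nvme_ctrl *ctrl);
};
```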
And the other big thing we could do was actually change the file structure to prepare for this. Before, the NVMe driver was two source files just sitting in the generic drivers/block directory. We knew that we would get a lot more different source files for that split, so we started creating a hierarchy where we could just deal with that. And once we got the actual spec out, we could fully complete the split and actually split it into two loadable modules, not just at the source level. And now we have a pretty nice little structure.
So we've got our completely transport-independent NVMe core module. This includes the actual NVM I/O command set support, a lot of the block layer interaction, and so on and so on.
And then we have an NVMe fabrics module.
This includes everything that is the common part
of the NVMe fabrics specification.
So if you look at the specification right now,
it has an RDMA transport and a fiber channel transport.
We've actually added another non-standard transport
in Linux that we're gonna talk about later.
And people are already thinking of more transports.
And this is the bit that's shared between them all.
And then in terms of the transport drivers, we have the refactored existing PCI Express
driver.
And we've actually got a RDMA front end driver.
We don't have the fiber channel driver yet.
And we're only going to talk about it a little bit, partially because the spec isn't done yet. It's not done by the NVM Express organization, it's done by T11, and T11 needs a lot of time for its back-room politics, so it's going to take a while.
But we already have a prototype code that's been posted
to the Linux mailing lists and so on.
So it's not gonna take too long until we'll be there.
And the architecture that I just drafted in words, we've also got a nice little diagram of it. So we tried to reuse the existing code as much as possible. Turns out it wasn't actually all that much, but still significant.
And one important thing that you'll see in this diagram is that if we're in the fast path, so the actual block device I/O code that comes from the file system or straight from the application, we don't really dive into common code. Well, actually, we do a little bit, but there is no layering that puts the common code in between the actual transport driver and the fast-path I/O. We'll actually call into little library functions to format the actual command structure, but that's about it. So there are no additional indirect function calls or heavy-handed abstractions. Instead, we have the transport driver in full control of its structure layouts and cache line footprint and so on.
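A rough sketch of what that fast path looks like from a transport's perspective: the transport's queue_rq callback stays in control and only calls a small core helper to build the 64-byte command. The example_request and example_post_cmd names are hypothetical, and nvme_setup_cmd's exact signature and return type differ between kernel versions.

```c
/*
 * Sketch of the fast path from a transport's point of view: queue_rq stays
 * in full control of its own structures and only calls a small core helper
 * to fill in the 64-byte submission queue entry.  example_request and
 * example_post_cmd are hypothetical; nvme_setup_cmd details vary by kernel.
 */
#include <linux/blk-mq.h>
#include "nvme.h"		/* NVMe host driver internal header */

struct example_request {			/* per-request PDU (hypothetical) */
	struct nvme_command	cmd;
};

/* hypothetical transport-specific submission, e.g. post an RDMA send */
static int example_post_cmd(void *queue, struct example_request *req);

static int example_queue_rq(struct blk_mq_hw_ctx *hctx,
			    const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;
	struct example_request *req = blk_mq_rq_to_pdu(rq);
	int ret;

	/* Small library call: translate the block request into an NVMe SQE. */
	ret = nvme_setup_cmd(rq->q->queuedata, rq, &req->cmd);
	if (ret)
		return ret;

	/*
	 * From here on the transport owns the I/O: map the data and post the
	 * command itself, with no further layering or indirect calls.
	 */
	return example_post_cmd(hctx->driver_data, req);
}
```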
This was something the PCIe people insisted on
because they've been optimizing for crazy
high-speed I.O. and I actually did that at the same time.
So I actually just did a new block device I.O. implementation for that.
And it also really helped us with the Fabrics development.
What we do use the common code for is, for example, user-space pass-through I/O tools. In Linux we have the nice little nvme-cli tool that allows you to send every crazy admin command ever invented and even some vendor-specific commands, and this makes sure the handling is one and the same and bug-for-bug compatible.
Yeah, yeah, it's really great for finding bugs.
Yes, and the nvme-cli tool is in fact what we use for our own new commands, so to connect to the remote controller, which, unlike a PCIe controller, is not immediately found. We'll talk about that a little bit later. And we added all of that to nvme-cli so that you have one single tool to go to for all the NVMe-related interactions on a Linux system.
So, features supported: various RDMA transports. My favorite one is still InfiniBand, even if even Mellanox tries to talk about Ethernet all the time. It's a lot less painful. But we support the various RoCE formats as well, and iWARP. We don't support Intel Omni-Path, which is another little hype thing at the moment, not because we wouldn't like to, but because their driver is so bad and doesn't support everything we need. Eventually we'll get there. We have dynamic connect and disconnect to remote controllers, in basically unlimited numbers in theory; at some point we'll run into scalability issues.
We've got the discovery support. We'll explain a little bit later what discovery actually is; well, actually we won't, because we moved the slide. So, discovery is just a way to find remote controllers. In PCIe it's pretty easy to find the controllers: you've got your PCI Express bus, there is a defined enumeration over it, and your OS will just find them. Well, now you're on, say, RoCE v2 or iWARP, and in theory every possible IPv6 address in the world might have a controller behind it, so we need some way to find them.
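For reference, this is roughly what one entry of the discovery log page looks like per the NVMe over Fabrics spec: enough information for the host to reach one subsystem. The struct name here is illustrative and the layout is a simplified rendering of the spec, not the kernel's exact definition.

```c
/*
 * Rough rendering of one discovery log page entry as laid out in the NVMe
 * over Fabrics spec: everything the host needs to reach one subsystem.
 * Illustrative only, not the kernel's exact definition.
 */
#include <linux/types.h>

struct nvmf_disc_entry_sketch {
	__u8	trtype;		/* transport type: RDMA, Fibre Channel, ... */
	__u8	adrfam;		/* address family: IPv4, IPv6, IB, FC */
	__u8	subtype;	/* discovery subsystem or NVM subsystem */
	__u8	treq;		/* transport requirements, e.g. secure channel */
	__le16	portid;
	__le16	cntlid;		/* 0xffff means dynamic controller model */
	__le16	asqsz;		/* admin submission queue size */
	__u8	resv10[22];
	char	trsvcid[32];	/* transport service id, e.g. RDMA port number */
	__u8	resv64[192];
	char	subnqn[256];	/* NVMe qualified name of the subsystem */
	char	traddr[256];	/* transport address, e.g. an IP address */
	__u8	tsas[256];	/* transport specific address subtype */
};
```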
We've got support for the Keep Alive extensions
that were added to NVMe at the same time as
the Fabric support.
So for Fabrics where we don't have a reliable connection loss indicator, we have a way to
probe for that.
And we have some very, very basic multi-pathing support, thanks to Sagi using the existing
device mapper code.
It's very basic, mostly for the reason that NVMe doesn't really have something similar to ALUA in SCSI, so we don't really know which path is active, which is passive, which is preferred, port grouping and so on.
But if you just have a simple failover multipathing where you just try all paths and if one fails,
you keep using the other ones, it'll just work.
Features that are not supported yet: one is any form of authentication, mostly because that part of the spec has been outsourced to the TCG group instead of NVM Express, and it hasn't happened. You'll see a theme here about outsourced parts of the spec: it's usually the ones that are delayed.
And the other one briefly mentioned is fiber channel support, which is a work in progress.
The existing fiber channel vendors are all over it, let's say it that way.
It's life or death for them.
But, I mean, the other thing is if you care more about IOPS than latency, which is reasonable for some workloads, Fiber Channel actually does really well, just at a price tag.
So it's not like it's completely stupid, but it's definitely not what we focused on here.
Someone's going to sue me for that.
So I didn't say that.
Ignore it.
As we mentioned earlier, most of the code is shared between the different transports, but it's barely a majority; I think it's about 52 or 53%, as you can see on this part of the pie chart. Then the PCIe driver is still pretty big, a lot of code in there, because it's actually way more low level than our verbs-based RDMA driver. The RDMA driver, I just checked, the host driver is 1,900 lines of code, so that's another big chunk, and the actual Fabrics-specific common code is pretty small. And not quite two-thirds, but more than half, of the common code is actually a deprecated SCSI emulation in the NVMe driver which we'll hopefully get rid of. So if you remove that, the common core is even smaller. That SCSI emulation is actually the largest source file in the NVMe tree. It's a weird translation to support SCSI commands sent to the device, which doesn't make any sense, but Intel really wanted it added back in the early PCIe days.
So, and in addition to this host driver, which you use to access your remote NVMe devices,
the other thing we implemented was a target device.
And target is actually a word that doesn't appear in the NVMe spec ever.
Unfortunately, it doesn't have a word for that concept at all, that target system you know from SCSI.
It only has names for the various subcomponents of it, which make it very awkward to talk
about an implementation.
So I made a habit of always calling it target and maybe eventually getting the word in the
spec.
And the NVMe target supports implementing NVMe controllers in Linux. Right now it's just Fabrics controllers, the RDMA Fabrics controllers we were talking about and, well, anything else Fabrics. But it's prepared for adding something that either is or looks like a real PCI Express controller in the future. So we try to keep the Fabrics-specific parts separate and at least allow for concepts that only exist for PCIe, like mapping multiple submission queues to a single completion queue, something we don't do in Fabrics, something we avoid doing in PCIe, but it's still allowed by the spec.
You can't do it for Fabrics.
No, you can't.
I mean, in terms of NVMe submission completion queues,
the spec explicitly disallows it.
Not RDMA completion queues.
We'll talk about that later.
So what we do is we have our RDMA target,
the in progress fiber channel target,
and the other thing that was the mystery transport
I mentioned earlier is we've got a loop transport.
And the loop transport is basically
a host and a target driver at the same time,
and just does an injection of one into the other, which
allows us to do local testing and feature development
without ever having to go out to a wire.
And it's proven really, really useful for testing as well.
And one thing that we try to make sure is that the NVMe target can use any Linux block
device.
So what we're not doing by default, and currently actually not at all, is command pass-through. We always implement the actual NVMe engine in Linux. That means your backend can be NVMe, but it can also be a SAS or SATA SSD, a RAM disk, virtio if you're running in a VM, crazy fast directly mapped persistent memory devices, and so on and so on. And it just uses the normal Linux block layer to communicate with the device. In fact, if you look at the actual block I/O path of the NVMe target, it's about 100 lines of code. It's literally nothing, which also means we can't add much latency or overhead in it. That's always a good thing.
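The idea behind that tiny block I/O path, sketched loosely after the nvmet block-device backend: turn the NVMe read or write into an ordinary bio against the backing device. Function and parameter names here are illustrative, error handling is omitted, and bio allocation details vary by kernel version.

```c
/*
 * Sketch of the small block I/O path in the target: an NVMe read or write
 * that arrived over the fabric becomes an ordinary bio against whatever
 * Linux block device backs the namespace.  Loosely modeled on the nvmet
 * block-device backend; simplified, not the exact upstream code.
 */
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/scatterlist.h>

static void example_execute_rw(struct block_device *bdev, sector_t lba_sector,
			       struct scatterlist *sgl, int sg_cnt,
			       unsigned int op, bio_end_io_t *done,
			       void *private)
{
	struct scatterlist *sg;
	struct bio *bio;
	int i;

	bio = bio_alloc(GFP_KERNEL, sg_cnt);	/* newer kernels take bdev/opf here */
	bio->bi_bdev = bdev;			/* newer kernels: bio_set_dev() */
	bio->bi_iter.bi_sector = lba_sector;	/* from the command's starting LBA */
	bio->bi_opf = op;			/* REQ_OP_READ or REQ_OP_WRITE */
	bio->bi_end_io = done;			/* completes the fabrics request */
	bio->bi_private = private;

	/* Attach the data the transport already placed in the scatterlist. */
	for_each_sg(sgl, sg, sg_cnt, i)
		bio_add_page(bio, sg_page(sg), sg->length, sg->offset);

	submit_bio(bio);
}
```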
We also played around
with the NVMe command pass-through,
so I mostly did it for features
we don't want to support in a target
like security command pass-through.
Some others thought it would be useful
for performance, which I can't really see
as it would use exactly the same infrastructure,
just instead of being a generic request,
we would attach that little 64 byte
NVMe command block to it.
And the supported features are, well, all the mandatory NVMe I/O commands, of which there are only three: read, write, and flush, aka cache flush. We support the full set of admin commands we need to support, which is a little smaller for Fabrics than for PCIe. We support the Dataset Management command, which has a couple of different sub-functions, but only one that is really useful, which is deallocate in NVMe speak, aka trim in ATA speak, aka unmap in SCSI speak, aka discard in Linux speak. Basically, you deallocate or trim blocks that were written so that the device knows they're unused now and can make better garbage collection decisions or deallocate backing store.
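As a sketch of how one deallocate range can be mapped onto the Linux discard machinery, assuming the usual blkdev_issue_discard helper (its signature has varied a bit across kernel versions):

```c
/*
 * Sketch: mapping one Dataset Management "deallocate" range from the wire
 * onto the Linux discard machinery.  blkdev_issue_discard() is the real
 * block-layer helper, though its arguments have varied across kernel
 * versions; the shift handling here is simplified for illustration.
 */
#include <linux/blkdev.h>
#include <linux/nvme.h>

static int example_execute_deallocate(struct block_device *bdev,
				      struct nvme_dsm_range *range)
{
	/*
	 * NVMe ranges are in logical blocks; the block layer wants 512-byte
	 * sectors, so shift by (log2 of the block size minus 9).
	 */
	unsigned int shift = blksize_bits(bdev_logical_block_size(bdev)) - 9;
	sector_t sector = (sector_t)le64_to_cpu(range->slba) << shift;
	sector_t nr_sects = (sector_t)le32_to_cpu(range->nlb) << shift;

	return blkdev_issue_discard(bdev, sector, nr_sects, GFP_KERNEL, 0);
}
```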
The whole provisioning of subsystems, controllers, and namespaces is entirely dynamic. So you can add a new namespace to a controller that a host is currently connected to; the host will get an asynchronous event notification and be able to see the new namespace, and so on and so on. And we include the discovery service, which I mentioned earlier, which is used to find the devices, including support for referrals. So even if you go discovering on one of our targets, it can just refer you to another one on the other side of the globe. Not really useful right now, but it will be useful at some point, especially when you do fully clustered implementations, which we kind of support at the moment, but without ALUA or some other distributed semantics it's not that useful yet.
Which actually brings us to the next slide with features. So, features not supported. One is a couple of log pages, like the SMART/usage log and the error log. The first one actually has a patch out on the mailing list now, which we'll hopefully have soon.
Well, again, no authentication. An important one is persistent reservations. NVMe has something that is almost, but not quite, the same as SCSI-3 persistent reservations, and we'll hopefully support it in the Linux target. We just want to have a common API so that we don't have to implement it on our own and are able to use clustered backends. There are two clustered persistent reservation backends for Linux at the moment: one is using the device, and one is using a distributed lock manager when you do more active-passive failover things using DRBD. And they're both in weird out-of-tree shape, so we're trying to make them all sit behind an API so that the SCSI and NVMe targets can use the same backend code.
Yeah?
What's the target timeline for adding those?
None really at the moment. Well, for persistent reservations... actually, the SMART log will come really soon. The error log we just don't plan to do anytime soon. The spec requires the error log to be there, so we have it; it's just always zero. The way the NVMe error log works is that when you get back an error, there's a bit set in it that tells you there might be more information about it in the error log. A lot of PCIe controllers never set that bit, so we followed their lead. You can get your error log, there's just not going to be anything in it.
And the real driver of implementing that is a host that actually makes use of it.
So right now, none of the NVMe drivers I know would even ever ask for it, which makes it a little pointless to implement.
But if people come up with good use cases for it,
it should be a fairly simple implementation.
Not in the code base right now, but I've been prototyping various things.
It's just that the current hinting story in NVMe
isn't all that nice, so Martin and I and some other guy have been trying to change that, but it's going to take a while until we actually have useful hints in the spec.
How about the authentication?
Well, the problem is, again,
there's no spec for that right now.
So what the NVMe spec basically says,
it defines two new fabric-specific commands,
authentication send and authentication receive,
which then tunnel a security protocol. None of those protocols is defined yet, so NVMe is working with the Trusted Computing Group to define a spec for that.
It's just that nothing has happened so far.
Okay, so the next one is fused commands, and there's actually only one of them: compare and write. In SCSI that's something VMware really likes and no one else ever uses, and in SCSI it's implemented really awkwardly, with a command that has a data buffer that is actually split into two buffers in a magic way, which creates real problems for implementations. What NVMe did for that is something slightly less ugly, but still a little hairy. You get two commands, a compare command and a write command, and they're linked by the fused bits. Only if the compare succeeds will it then execute the write as part of the fused command. So think of it a bit like a load-linked/store-conditional over a storage protocol.
That becomes very well defined when you have relaxed ordering... well, but NVMe doesn't have relaxed ordering, so anyway, it's a bit painful. I think it's actually less painful than the SCSI version of it, but the concept is hairy one way or another.
And we're most likely not going to see it until there is a VMware host implementation, or someone actually starts using it on Linux, which I don't expect.
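Purely as a conceptual model of the fused compare-and-write semantics just described, not driver code:

```c
/*
 * Conceptual model only: the fused compare-and-write pair.  The compare
 * command and the write command are linked by the fused bits, and the
 * write executes only if the compare matches.  A real implementation must
 * also keep the pair atomic with respect to other commands on that range.
 */
#include <stdbool.h>
#include <string.h>

static bool fused_compare_and_write(unsigned char *lba_range, size_t len,
				    const unsigned char *expected,
				    const unsigned char *new_data)
{
	/* First half of the pair: COMPARE against the expected data. */
	if (memcmp(lba_range, expected, len) != 0)
		return false;	/* both halves fail with a compare mismatch */

	/* Second half: WRITE, executed only because the compare passed. */
	memcpy(lba_range, new_data, len);
	return true;
}
```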
Then, independent of the authentication, there are also the traditional NVMe security protocols, which are more about drive lock and unlock, either by a plain-text PIN or TCG magic. Right now we don't have an implementation of that either.
We're thinking about a few ideas there internally at HGST, but not anytime soon.
The other thing we haven't done isn't really a protocol feature, but more an implementation feature that a lot of people are playing around with right now, including HGST, or WD, for which I work: some form of PCI peer-to-peer I/O. So that we do transfers straight to the device, say an NVMe controller memory buffer, or say a PCIe card that just has directly LBA-addressable storage. For that we don't want to RDMA into host memory first, but go straight to the device. It's very hard to get a framework in for that, and we'll need to do a lot more talking that doesn't involve just the NVMe target, but the PCI subsystem maintainers, the driver maintainers, memory management people, and so on.
Yeah.
Yeah, so listen to Stephen Bates, you'll find him on the schedule.
He's probably the one who's been playing
with that for the longest.
So then we again got our little pie chart.
So again, the most code is common.
In fact, just a little bit more than on the host, but at least we don't have a crazy SCSI
emulation here, so there's actually real work done in the common code.
Then we have our loop device, which is pretty easy.
We've got the RDMA implementation, which looks larger in this chart, but it's actually smaller
than on the host.
So our RDMA transport driver is only slightly less than 1,500 lines of code. And then we've got a little bit of Fabrics-specific code too, but it's actually fairly trivial. The size of the whole core is actually smaller than some of the SCSI target transport drivers, because we try to keep it very simple, because the spec is very simple and we implement the minimum possible or necessary. The other reason is that we try to aggressively offload work that is not really NVMe-specific to other people, which mostly meant the two of us just wearing different hats. So we did a lot of RDMA subsystem changes, which Sagi is going to talk about after I finish one more slide or something like that.
The other is the configfs subsystem. It's a kernel file system that allows you to configure kernel subsystems using a file-system-like interface. I actually started maintaining that just for this, so I could push it all in there.
And as I mentioned on the last slide, we plan to go on like that, especially for persistent reservations, which is another giant code blob that we really don't want to have in our driver, because it's not specific to us and has multiple possible backend implementations.
So, NVMe target configuration: we have this little configfs interface that I mentioned, which lets user-space tools configure it. It's kind of modeled after what's done in the Linux kernel for the SCSI target, but simplified down to the minimum. Part of that is just that the NVMe spec is much more coherent than the various SCSI transports, which have a lot of leeway to do things differently, use different identifiers, different features and so on. And part of that is that we developed everything in one go instead of adding things piecemeal, and tried to look at the big picture. And we've got a user-space utility that is written in Python and uses an existing library for these configfs interactions, which allows... actually, I should just move to the next slide, which has a picture of it. So it's this little command-line interface, a little bit graphical, using colors and ASCII art. There's tab completion for everything. So it's a nice little interactive tool, and it also allows you to save and restore the configuration. We ship a systemd unit file so that you automatically get everything restored at system boot and so on.
And after this, I'm going to hand over to Sagi who's going to tell a lot more about the RDMA subsystem work we did as part of this project.
Hi, is it on?
All right.
Okay, so I'm going to talk about all the RDMA stuff and all the work that was done in the
RDMA stack in the scope of the NVMe RDMA driver development.
Basically, when we came in and implemented the code, we started looking at all the other storage RDMA protocols that exist in Linux today, and we identified some spots that duplicate code and basically re-implement a lot of boilerplate for things that really are generic and should live in the generic RDMA stack.
So this is a very simplified view of the Linux RDMA stack today. In green you have the RDMA protocol drivers. We call them ULPs, upper-layer protocols. These are the protocol implementations that work on top of the RDMA core. The RDMA core is basically an API layer combined with management. And underneath we have the specific device drivers that implement the RDMA: we have iWARP drivers like cxgb3 and cxgb4, we have IB and RoCE drivers like Mellanox mlx4 and mlx5, and we have others as well. Among the ULPs we have SRP, the SCSI RDMA protocol, iSCSI over RDMA which is iSER, NFS over RDMA, which Chuck has been doing a lot of work on lately, and the new NVMe RDMA transport.
So as I said, the stack is logically separated into these three components, and we basically did a lot of code centralization, and we'll start enumerating the pieces one by one. The first was memory registration. That was a big pain, or very painful, as an RDMA developer or someone who is involved in the stack. Basically, memory registration, if you attended the last talk, is the way to allow remote peers to access your local buffers. You basically need to pass the physical mappings of the buffers themselves to the HCA so it will know how to access them directly. Then you get some kind of token, it's called a remote key in RDMA language, and you pass it to a peer, and it uses that to basically stream data into your local buffers. So that is called memory registration.
And funnily enough, the stack offered maybe five or six different ways to do basically the same thing. Different methods had different semantics and different speeds. And I think at some point I tried to introduce a seventh way, but I started getting anonymous threats in my mailbox. So, okay, I said we should rethink the whole thing.
So again, it was a lot of effort. I think the most ambitious driver was the NFS client, which actually tried to support all of them. But Chuck did great work cleaning that up and taking the NFS client to a new level, ending up with only two different ways of doing it. So we started a discussion on the mailing list, because we wanted to converge on a single method.
We basically have one mission, we want to register memory,
we want to find a single way to do that,
we want new developers basically to have a natural choice
and they know how to do that.
So we converged on fast registration work requests, which was one of the methods that was available. It was the most widely supported across the various device drivers and devices, and it basically performed very well.
Because it's fast.
Because it's fast. We called it fast because it's fast. Remember, the name is important. It's very fast.
Then we looked into the FRWR interface that the core layer presents, and we found that the amount of work a ULP or protocol driver needs to do is way too much, and we saw the same code replicated in each driver. So what we did is we took all the shared or replicated code and put it inside the RDMA core, and we provide very intuitive and simple interfaces to the different ULPs. We allow them to pass scatterlists rather than construct their own page vectors with the alignment constraints of each RNIC or device. We also looked at the provider drivers, the device drivers themselves, and we saw suboptimal implementations. Most of them used something like shadow page vectors for endianness conversion and flagging, which was really unnecessary, so we also offered a set of helpers from the core to do that correctly. And we basically did that and migrated all the ULPs to use this registration method.
I think RDS is the last one that needs conversion.
We have it too now.
Yeah, all right.
So we migrated all the ULPs, and I think now, if you're going to write a new RDMA protocol driver, it should be a fairly simple task. So that was good for everyone, and everyone benefited from that.
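A sketch of what fast registration looks like for a ULP with the consolidated interface: hand the core a scatterlist, get back an rkey. ib_alloc_mr, ib_map_mr_sg and IB_WR_REG_MR are the real verbs-level pieces, but argument details differ between kernel versions and error handling is trimmed here.

```c
/*
 * Sketch of fast registration with the consolidated core interface: the
 * ULP hands over a scatterlist and gets back an rkey; building the page
 * vector now happens in the RDMA core and the device driver.  Real calls,
 * but simplified, with error handling omitted.
 */
#include <rdma/ib_verbs.h>

static u32 example_register(struct ib_pd *pd, struct ib_qp *qp,
			    struct scatterlist *sg, int sg_nents)
{
	struct ib_mr *mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, sg_nents);
	struct ib_reg_wr reg = { };

	/* The core walks the scatterlist and builds the page list for us. */
	if (ib_map_mr_sg(mr, sg, sg_nents, NULL, PAGE_SIZE) != sg_nents)
		return 0;	/* couldn't map everything */

	reg.wr.opcode = IB_WR_REG_MR;	/* the fast registration work request */
	reg.mr = mr;
	reg.key = mr->rkey;
	reg.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;

	ib_post_send(qp, &reg.wr, NULL);	/* older kernels need a bad_wr arg */

	return mr->rkey;	/* the token handed to the remote peer */
}
```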
Another aspect that we looked at was the completion queue API. Basically, the completion queue polling implementation was replicated through every ULP that existed in the stack, in some form. Some implemented the API correctly, but we found a lot of...
No, no, I don't think any of them got it right.
Each driver has its own mistakes in terms of correctly re-arming the CQ to generate more interrupts, fairness, and context abuses. Basically, we saw drivers that could stay in hard IRQ or soft IRQ context forever.
But what we wanted to do is basically abstract away all the details of how to do correct polling of the RNICs and provide a simple and intuitive interface to all the ULPs. So we took all the good ideas from all the ULPs and put them in the core layer, both the allocation and the polling implementation itself. We basically allow the ULPs to pass a done function, which executes once the work request completes. It's very similar to the Linux bio interface, if you're familiar with it. And for the different modes of ULPs, whether it's a server or an initiator mode driver, we offer several completion queue polling contexts. One is soft IRQ, with a new library called irq-poll, which was migrated from an implementation that lived in the block layer; we made it generic and used that. We also have a workqueue interface for target mode, which usually runs to completion from the completion all the way down to the backend, and we also offer direct polling for application-driven polling. And we migrated all the ULPs, or the vast majority of them, to use this API, and
the nice thing, which was our indication that we got it right, is that we had several developers from the RDMA subsystem ecosystem converting to this API themselves. Once they got it working and it was very intuitive for them, we knew that we basically got the API right.
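A sketch of the completion queue API being described, based on ib_alloc_cq() and the embedded struct ib_cqe with its done callback; the example_ names are hypothetical and error handling is omitted.

```c
/*
 * Sketch of the new completion queue API: the ULP embeds a struct ib_cqe
 * with a 'done' callback in its request and lets the core handle polling,
 * re-arming and the execution context.  Simplified, illustrative only.
 */
#include <rdma/ib_verbs.h>

struct example_rdma_request {
	struct ib_cqe	cqe;		/* embedded completion entry */
	/* ... protocol state ... */
};

static void example_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct example_rdma_request *req =
		container_of(wc->wr_cqe, struct example_rdma_request, cqe);

	if (wc->status != IB_WC_SUCCESS) {
		/* error handling / queue teardown would go here */
	}
	/* complete 'req' back to the upper layer */
}

static struct ib_cq *example_create_cq(struct ib_device *dev, int nr_cqe,
				       int comp_vector)
{
	/*
	 * Initiator-style soft-IRQ polling via the irq-poll library; a
	 * target-mode driver would typically pick IB_POLL_WORKQUEUE.
	 */
	return ib_alloc_cq(dev, NULL, nr_cqe, comp_vector, IB_POLL_SOFTIRQ);
}

static void example_init_request(struct example_rdma_request *req)
{
	req->cqe.done = example_send_done;	/* runs when the WR completes */
}
```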
Another aspect that we found useful is pooling completion queues. Basically, RDMA applications can achieve better parallelism, or completion processing efficiency, if they use pools of completion queues, the stack correctly assigns IRQ affinity to cores, and multiple queue pairs are stacked on the same completion queue. You basically get the benefit of better completion aggregation per interrupt and an overall reduced number of interrupts per completion queue. And smart assignment of IRQ affinity, as I said, helps the overall parallelism of processing in the software.
So what we did is move all the completion queue allocation API into the RDMA core, basically hinting at QP creation time which CQ you want to use. So we moved it to the RDMA core, and completion queue pools are allocated on demand, in a lazy fashion, with per-core semantics. The ULP can pass an affinity hint for which core or completion vector it wants to see interrupts on, and it can leave it blank for a wildcard assignment.
And the nice thing is that all the wisdom that we found in the individual ULPs now moves to the core, and ULPs that weren't aware of smart assignments and correct pooling benefit from using nice and simple APIs. It's a work in progress; I expect it to make kernel 4.9 once we get the SRP bug out of the way.
Any questions? None. All right.
Too hardcore.
Too hardcore. Yeah, you need to bear with me. It's a very high-level description of the low-level implementations that we did.
It's funny that you talk about pooling and polling in this.
Yeah, yeah. It is. It is.
I'm gonna get some.
Yeah, all right, okay.
So, one more aspect that we noticed is duplicated across all the protocol drivers was the generation of the RDMA data transfers in each driver itself. All the RDMA storage protocols share the same classic RPC model; you've seen it in the last slides if you attended. For reads, the initiator sends a read request to the target, which is followed by one or more RDMA writes of the data, and the result is reported back in a completion, which is basically another send work request. Writes work the same way: you send a write request, it's followed by a series of RDMA reads, and you're notified back with the write completion. This is not mandatory; if the data is small enough, it can be carried inline in the write request on some protocols.
What we saw is that basically every single ULP implemented the same sequence of operations to transfer data from a local scatter-gather list of buffers into remote scattered buffers. Lots of code duplication. Inevitably, or not surprisingly, some got it right and some got it wrong. And the most surprising thing is that all the logic existed in the ULPs; none existed in the core layer or the provider drivers, which would just pass the whole thing through to the wire. So we needed some high-level API to make all the data transfer transparent, so that the code that should exist once is implemented correctly, with a nice API for the individual ULPs.
So what we did is create a new core library to generate a series of RDMA reads and writes in a very generic way, with a very simple interface. Basically you get three functions: an init of an RDMA read/write context, a destroy of that context, and a post, which just posts the series of RDMA reads or writes that were set up. It works on common scatterlists, which is what I think most of the RDMA ULPs use; I think NFS needs a bit of conversion to use it seamlessly. But it's very useful to work on the generic, common scatterlists that are used in Linux.
And it masks away all the details of DMA mapping, and, let's say, it implements all the wisdom of correct completion signaling, which completions can be signaled or suppressed, and correct batching of multiple RDMA work requests, to basically get the highest performance out of a ULP implementation. It also has support for the Mellanox signature extensions to implement T10 PI, which can be implemented in NVMe over RDMA as well; it should be easy enough to add.
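The three functions described here map onto the upstream rdma_rw_ctx API roughly like this; the example_ wrappers are hypothetical and error handling is trimmed.

```c
/*
 * Sketch of the generic RDMA READ/WRITE library: init / post / destroy,
 * operating directly on a Linux scatterlist (here: a target writing read
 * data back to the host).  Simplified, error handling trimmed.
 */
#include <rdma/rw.h>

static int example_write_data_to_host(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
				       u8 port_num, struct scatterlist *sgl,
				       u32 sg_cnt, u64 remote_addr, u32 rkey,
				       struct ib_cqe *done_cqe)
{
	int ret;

	/*
	 * DMA-maps the scatterlist and builds the chain of RDMA WRITE work
	 * requests, including any registration the device needs (e.g. iWARP).
	 */
	ret = rdma_rw_ctx_init(ctx, qp, port_num, sgl, sg_cnt, 0,
			       remote_addr, rkey, DMA_TO_DEVICE);
	if (ret < 0)
		return ret;

	/* Post the whole chain; done_cqe fires when the last WR completes. */
	return rdma_rw_ctx_post(ctx, qp, port_num, done_cqe, NULL);
}

/* Called from the completion handler, after done_cqe has fired. */
static void example_write_done(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
			       u8 port_num, struct scatterlist *sgl, u32 sg_cnt)
{
	rdma_rw_ctx_destroy(ctx, qp, port_num, sgl, sg_cnt, DMA_TO_DEVICE);
}
```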
And again, after we did all that work, we migrated the individual ULPs, at least iSER and SRP, to use the new API; NFS still needs a conversion, and we're hoping to get that soon. I wish I had captured the number of code lines that were eliminated for each driver. For iSER, I just remember it was very, very significant. I think it was bigger than our whole driver.
Yeah, probably.
OK, so performance results, I'm going
to give it back to Christoph.
Any questions before I give it back to Christoph?
Does the user-space implementation also get the benefit of this consolidation?
Good question. The short answer is no, they get none of these benefits. In RDMA, the user-space data plane is pretty much disjoint from the kernel data path implementation, so all of this would need to be re-implemented in user space for user-space applications to get the benefits. That's basically why user-space applications will not see the same benefits. But once someone picks up the gauntlet and re-implements it in user space, they can. So, short answer: it's doable, but no one's done it yet.
We're almost running out of time, so let's go on to some benchmarks. They're a little bit older, but the basic scheme still holds. What we've done here is a benchmark of little queue-depth-1 reads, 512-byte reads. The things we're comparing are a RAM disk and an NVMe SSD. And if you glance at the numbers, you'll notice it's definitely not a flash SSD; it's one of our HGST prototypes that's actually DRAM backed, so it's way, way faster than what you'd expect from a real SSD. And then we've benchmarked that either locally or going out over NVMe over Fabrics on a Mellanox HCA, with all the little optimizations you can apply.
So for example, we're avoiding the memory registrations for that by using a nice to
use but not quite safe feature that all the RDMA protocol drivers use to
avoid memory registration. Not going to get into that. Don't do that at home.
So basically, the local RAM disk latency is very, very low, and then we're at about four microsecond latency with our NVMe DRAM device. Oh, and this is the percentile of the I/Os, so these are the latency spikes we might actually see in the implementation if you don't just go for the average latency but for the worst-case latency. That's actually something that's gotten a lot better recently. Then we get about seven to eight microseconds of latency if we do NVMe over Fabrics to that synchronous RAM disk on a real remote machine, so that's a good proxy if you care about remote persistent memory; it will not be quite as fast as the local RAM disk, but very close. And then we get 12 to 13-ish microseconds of latency for a real NVMe device behind the PCIe bus on the remote system, even if it's a very fast one, for our queue-depth-1 random reads.
And this is the normal interrupt-driven performance.
What is the I/O size?
The I/O size? 512-byte I/Os, I mentioned that earlier. So the smallest possible block size on any block device in general.
So the other thing we did was add polling support, and for one, we have polling support in the NVMe over Fabrics host driver. What we do there, basically, is that once we've submitted an I/O and we know the tag, we start hammering the RDMA completion queue from the submission context. We actually have experimental polling support in the target driver as well, which gives another slightly better number, but in practice it's pretty useless, because once you start polling for one tag you can't handle another one. For real server applications it's a downgrade, but it's trivial, so we left it in.
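As a sketch of that host-side polling idea: spin on the completion queue from the submitting context using the core's direct-poll helper (this assumes a CQ created with IB_POLL_DIRECT; the io_done/tag plumbing is purely illustrative).

```c
/*
 * Sketch of the host-side polling experiment: after submitting an I/O,
 * hammer the RDMA completion queue from the submitting context instead of
 * waiting for an interrupt.  ib_process_cq_direct() is the real core
 * helper and requires a CQ allocated with IB_POLL_DIRECT; the io_done/tag
 * plumbing here is purely illustrative.
 */
#include <rdma/ib_verbs.h>

static void example_poll_for_completion(struct ib_cq *cq,
					bool (*io_done)(void *tag), void *tag)
{
	/*
	 * Each call processes up to 'budget' completions and invokes their
	 * done() callbacks, one of which will mark our request as finished.
	 */
	while (!io_done(tag))
		ib_process_cq_direct(cq, 8 /* budget */);
}
```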
With that, our latency numbers get a lot lower. Actually, the local RAM disk case is the same, but for NVMe over Fabrics with polling, which is the number that goes right in here, we go down to about 10.3 microseconds, I think, as the best reproducible number we got.
Do you have the P99 and 99 point whatever
broken out anywhere?
Damien has it somewhere.
Where's Damien when you need him?
He was here in the last talk.
Yeah, I only have the ones I talked about. Not anymore, okay, yeah.
But anyway, we need to rerun a lot of this because we've made even further improvements to the RDMA stack. But the whole idea is that once we start doing intelligent polling on the host, the target doesn't really matter, because it's pretty much busy all the time anyway, and we can get another two to three microseconds out of it. Another project recently was to actually rewrite the Linux polled block I/O, block direct I/O code for small I/Os, and that gave us another 18 percent reduction in local latency. So there's work we have to do in all the different paths in there; it's not really
just the protocol. What's the distribution of the spy data from one way or the other?
Okay.
Was that on 100 gigabit?
What's in here? Not 100; 40 or 56, whatever the smaller one is.
FDR, 56 gig.
Yes, 56, yeah. Not sure really, so.
Yeah.
Yeah.
So right now we're spending a lot of time looking at these graphs just for local I/O, and once we're done with local, we'll move back to Fabrics and see if there's more to get out of the RDMA stack.
The block stack rewrite, or the whole lot, is that coming in 4.9 as well?
No. The basic block polling support has been in for a few releases; Jens and I did that. The rewrite I won't have ready for 4.9, but I have a Git tree. It's actually 200 lines of code, because I've ripped out everything I didn't need and then made it go faster. So I'm hoping for 4.10, but we'll have some more discussion on that.
And so that's about it.
I think you wanted to do the status slide again.
Okay.
I'll go real quick because the time is up.
All the code is merged into the 4.8 kernel. Fiber Channel support is still pending, and we have updated nvme-cli and nvmetcli tools that will get into distributions soon. You can find the links to the NVMe over Fabrics code, either upstream or in the for-next tree that we maintain, and to the nvme-cli and nvmetcli code repositories.
Thank you.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the developer community.
For additional information about the Storage Developer Conference, visit storagedeveloper.org.