Storage Developer Conference - #76: Accelerating Storage with NVM Express SSDs and P2PDMA

Episode Date: October 22, 2018

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 76. It's going to be a little bit of a kind of walkthrough here. This is a topic that is kind of near and dear to my heart. I've worked on this in various different shapes and forms for quite some time.
Starting point is 00:00:58 But we're going to talk a little about how we can improve the way data moves around in high-performance PCIe slash NVMe-based systems using a framework that we are proposing for both SPDK and the Linux kernel called P2P DMA, peer-to-peer DMA. So I'll talk a little about the motivation for this, and this is something some of you may be very familiar with and some of you may not have heard of at all. I'll talk a little about what exactly is a peer-to-peer DMA and how that differs from a traditional DMA in a PCIe-based system. I'm actually going to get down and dirty with some of the APIs that we're proposing. So be prepared for actual code. Horror, horror.
Starting point is 00:01:43 An overview of peer-to-peer DMA for both SPDK, the Storage Performance Development Kit. Jim Harris is actually giving a great talk on SPDK right now, so feel free to go to his talk. And also the Linux kernel. We're upstreaming, or we're in the middle of upstreaming, some code for peer-to-peer DMA for the Linux kernel. So I want to talk about that.
Starting point is 00:02:04 One of the most relevant consumers or utilizers of this framework for the people in this room is probably the NVMe controller memory buffer, which is an optional feature of an NVMe SSD that actually plays very nicely into this peer-to-peer DMA framework. And then I'm actually going to give you a couple of real examples of peer-to-peer DMA in action. So one of them is optimizing the data path in NVMe over Fabric targets. So how do we get the data from the networking device to the drives in as efficient a manner as possible? And then I'm actually going to go do an interesting application
Starting point is 00:02:46 that we did at Flash Memory Summit, which is actually offloading compression in a hyperconverge environment and why we would do that and the benefits of doing that. And, you know, let's keep it kind of open. So if you have any questions or anything you want to talk about or any kind of comments, just feel free to jump in. We can keep this pretty open. There's lots of expert faces in the room. So I'd be interested in hearing that. So one of the motivations for this is that PCIe devices are getting really, really, really fucking fast. Like we now have, you know,
Starting point is 00:03:22 Mellanox and Broadcom have both announced, I don't know if they're sampling, but they've certainly announced 200 gigabit NICs. So, PAM4, 56 gigabits per second per lane; PAM4 at 28 gigabaud, right? We have NVMe SSDs that can easily push a million 4K IOPS, right? Even more. We have GPGPUs that can easily hit the limits of 16 lanes of Gen 3, and with Gen 4 coming, we're about to see a doubling of PCIe bandwidth. It doesn't take a lot of PCIe devices to hit an awful lot of throughput, right? Oh my goodness, they really want me to join this internet. I don't want to join your effing internet. CPUs have lots and lots and lots of PCIe lanes, like the new AMD-based EPYC servers. There was a lot of press. 128 lanes.
Starting point is 00:04:14 But they have a lot of lanes. That means you can attach a lot of devices without requiring switches. And of course, PCIe switches are quite common. I wouldn't say they're incredibly common, but they're quite common in things like storage servers and also these big artificial intelligence graphic boxes where we stuff a whole pile of GPGPUs into the system. So I don't have to architect a very, very challenging system. And I can easily hit 50 gigabytes per second of DMA traffic. So traffic between the NICs, between the drives, between the GPUs, and between the drives, and between the processors and the drives, I can easily get 50 gigabytes per second of IO traffic. And of course, today, all that traffic, all that DMA traffic has to go through the memory subsystem of the processor.
Starting point is 00:05:08 So if it's a DDR, DDR is not bidirectional. It's not full duplex, right? It's a single set of IO in both directions. So if I've got 50 gigabytes per second of DMA traffic, I've got 100 gigabytes per second of DDR traffic. And that's a pretty big number. That's quite a lot of DDR channels. And all they're doing is DMAing traffic around. There's not even any guarantee that the CPU needs to do any loads and stores on that traffic at all. So why on earth is it going through system memory? So if all the DMA traffic has to pass through the CPU memory subsystem,
Starting point is 00:05:43 we get a bottleneck, right? We get a bottleneck either in the PCIe subsystem or in the memory subsystem. And even before we hit those bottlenecks, quality of service for applications that are doing loads and stores on the processor cores on that CPU are getting impacted because they're trying to get loads and stores that will occasionally cash miss and have to be serviced from DRAM. But at the same time, you have all this DMA traffic also going through the same DDR interfaces. And so there is a contention issue and there will be quality of service issues. The other thing, of course, is that if everything is going through
Starting point is 00:06:21 a bottleneck, i.e. the DDR interfaces, you will eventually hit some kind of scalability issue. Peer-to-peer DMA is a framework that tries to get around or address all of those concerns. So what is a P2P DMA? Basically, it's a DMA that bypasses system memory. Rather than using system memory as the DMA target, we actually use memory that's provided by the PCIe endpoint itself. That could be, for example, just a standard PCIe bar, or it could be something that's a little more standardized, which the most classic or the most relevant example, I think, is an NVM Express controller memory buffer.
Starting point is 00:07:06 And more and more drives are starting to appear that have these controller memory buffers. And my company, Adetacom, this is our logo here, we provide a device today that has a very high-performance controller memory buffer on it. So traditionally, when you do a DMA, if you try to move some data between two NVMe drives using any standard OS today, the buffer that's used for the DMA is allocated out of system memory. Now it may end up in your L3 cache, your data direct IO, you have other things. But it's going to basically go up to the processor's memory subsystem. And then this device is basically going to suck that data back in as a separate DMA. What the peer-to-peer DMA framework allows you to do, if you want to,
Starting point is 00:07:53 is rather than allocating memory from this pool, we can allocate memory from this pool. And now the DMA basically becomes here, and then the second DMA is actually internal to the device in this particular case, and doesn't actually hit the PCIe subsystem at all. In other cases, you might have an intermediary device acting as a kind of proxy or a buffer here, and you may have a DMA in and then another DMA out. So there's many ways you can skin the cat, and we kind of want to be able to turn them all on. But the reality is that's not enabled in operating systems today. There's been kind of hacky ways of doing this for quite some time. GPU Direct is maybe something some of you have heard of or worked with.
Starting point is 00:08:39 That never made it upstream. It's NVIDIA-specific, and NVIDIA is Anthema to the Linux kernel because they have binary drivers, so we don't like them very much. So an overview of peer-to-peer DMAs in SPDK. We pushed some code back in February. It was actually Valentine's Day, I think, when it went in. It was Jim Harris' Valentine's Day present to me. That's how much he cares. He's in the other room, so I, when it went in. It was Jim Harris' Valentine's Day present to me. That's how much he cares.
Starting point is 00:09:05 He's in the other room, so I know I can see him. So SPDK, as many of you know, is a free and open-source user space framework for high-performance storage. So rather than doing things in kernel space, SPDK basically passes the device through the kernel using something like VFIO or UIO. And Hyatt Regency, will you disappear, please?
Starting point is 00:09:27 There you go. I'm getting interrupted. Context switching. And we operate on the device from user space. And they have a real focus on NVMe and NVMe over fabrics. And there's some very empirical data that shows in certain cases that there's some advantages to using SPDK over using an internal solution. So what we could do is we could actually say,
Starting point is 00:09:53 well, if the NVMe device has a controller memory buffer, let's actually utilize that if we want and allocate pages for DMA from the controller memory buffer rather than from the normal traditional memory allocator. And then in order to enable or to illustrate how that API should be used, there's a very simple application that's now part of the examples folder in SPDK called CMB Copy, which basically is just an illustration of how to use the CMB API.
Starting point is 00:10:29 So you can go to the upstream SPDK today, download this. You can run this example if you have the right hardware, and you can actually look at how the traffic flow is different than if you'd done a traditional copy. And here's a line of code. um but basically we created a new api which is this this function call here and you pass in a pointer to a nvme controller that has a cmb and then you pass in how much memory you'd like to get from that cmb and there's an allocator behind that api that makes sure there's enough memory left to give that particular process memory pages. If there isn't, it will return a failure. If it returns a failure, you can decide, do I error out or do I just keep going with normal system memory?
Starting point is 00:11:16 And the plan normally is if you can't get CMB memory, just use traditional system memory. But maybe your application wants to error out. That's something the application can decide. So this basically allows us to take a certain amount of memory from a controller memory buffer, allocate it, essentially pin it so that nobody else gets it, and allow that application to use it for DMA. What basically happens is there's a mapping that occurs. So as you construct the NVMe command, in the NVMe command, there's normally either a PRP or an SGL.
Starting point is 00:11:53 And rather than that PRP pointing to host memory, it now points to the controller memory buffer. As far as the NVMe drive is concerned, it has no idea it's doing a peer-to-peer DMA. Why would it? It's just being-peer DMA, right? Why would it? It's just being told, its DMA engine is being told to start issuing memread or memwrite TLPs against an address. That's what it does. If error codes come back, it errors out. If it doesn't, and data comes back, you're good, right? So, you know, that code is in SPDK. There are some issues that we need to go address in SPDK around this framework, and that's something that's kind of on the to-do list.
Starting point is 00:12:34 So we need to make some improvements to the allocator. We need to add VFIO support. Right now, we just have UIO support. And there are some issues around controller memory buffers and virtualized environments that I can't really go into today because we're kind of in the middle of sorting those out inside NVMe. And then there is this thing called access control services, which is a feature of PCIe. And that's a whole pile of fun and games in and of itself that we can get into over beer, preferably sooner rather than later. ACS always seems to go with beer. All right. So that's SPDK. It's upstream. You can go get that in the upstream version today.
Starting point is 00:13:18 Moving over to the Linux kernel. So the Linux kernel, we are trying to do something quite a bit more general than what SPDK is doing. And on top of that, the criteria for acceptance into the Linux kernel is higher than it is for something like SPDK, which is very kind of market vertical specific. The Linux kernel is running on your phone. It's running in an industrial phone. It's running in an industrial center. It's running in a data center. To get something into the Linux kernel, you have to address a lot more concerns than you do for SPDK. So it's undergone a lot more rigor.
Starting point is 00:13:56 It's also a lot more general. So what we're doing with P2P DMA for the Linux kernel is to support any PCIe device that is interested in either one of two different things. Either it has some kind of IO memory, some kind of bar or CMB that it would like to contribute to the framework. So it could be the device that goes, hey, I'm an NVMe CMB. I've got 16 megabytes of CMB.
Starting point is 00:14:24 I'm going to keep four for myself. The other 12 I'm going to give to peer-to-peer DMA. And I'm going to show you driver code that lets you do that. Or I would if this thing didn't show up. Bugger off! I could just join. I could. I could, but I won't because then I'll get all these messages and they'll be like, oh, crap. And then the second part is if you are a DMA consumer,
Starting point is 00:14:48 as in if you perform DMAs, and every single PCIe device out there is a DMA master, otherwise what the heck is it doing on the PCIe bus, right? You basically give it an API for saying, hey, rather than using system memory, if possible, if it's there and if it makes sense, I'd like to use peer-to-peer memory and maybe get a more optimal data flow in my system. So not only are we providing drivers that do this, we're also providing the framework that sits over the whole PCI subsystem and allows any device to basically play at that party. So this isn't just an NVMe thing. This is not just an RDMA thing.
Starting point is 00:15:30 This is not just a GPU thing. Any PCIe endpoint that wants to play in this space can play in this space. And part of the patch series includes driver notes for driver writers to say, if you want to be either one of these or one of these or potentially both, these are the kind of things you might want to do in your driver in order to take advantage of this new framework. So if you are a driver developer for a graphics card company or an FPGA accelerator company,
Starting point is 00:15:58 you may want to go take a look at the driver notes set of patches associated with this particular framework. We have a central framework for managing all this memory that's getting contributed by these different devices coming from the different device drivers. So there is a central entity that has its own kind of sysfs and configfs and so forth. Drivers do need to be updated to donate and or consume memory from this framework. I'll talk about that a little bit.
Starting point is 00:16:29 And currently, we have about half of this code upstream. Anything that's not upstream, and I will be setting, I'll have a slide that people should take a picture of with all the links to the different URLs. But currently, we keep this code on a Linux P2P mem GitHub under my account.
Starting point is 00:16:49 So I said we were going to get down and dirty with code. This might be a little hard to read. But this is a function that basically says PCI, P2P DMA, add resource. And then there's basically a structure to a PCIe device structure, which is a Linux kernel structure. There's an integer pointing to a bar. There's a size T that's saying how big in bytes. And then there's an offset. This is the function you could basically add if you're a driver writer, if you wanted to give some memory to the peer-to-peer framework. So the NVMe driver may get updated to allow people to give some of their CMB
Starting point is 00:17:26 or all of their CMB or none of their CMB to this particular framework. And this is the function that you want to go take a look at. So anyone who has their kernel, anyone who has the Linux kernel downloaded, if you actually did a grep for this, you wouldn't find it yet because it's not upstream. But if you went to Rtree and did a grep, you would find quite a few occurrences of this. And obviously the hope
Starting point is 00:17:49 is that once we're upstream, other drivers will also update to contribute to this particular, to contribute their memory if they have some. And it's not, you know, it's not as common for a PCIe endpoint to have a lot of spare memory as it is for it to be a DMA master. Pretty much every PCIe endpoint is a DMA master and can do very high-performance DMAs. Not that many have a lot of spare DRAM to contribute to this framework. But they do have a lot of memory. Graphic cards, in particular, have an awful lot of memory. And maybe they want to carve out just a few meg that they contribute to this framework, right?
Starting point is 00:18:29 NVMe CMBs, right? Drives have a lot, you know, enterprise drives have quite a lot of DRAM on them for metadata and so forth. And maybe they contribute some of that as a CMB, okay? When you give, when a driver gives memory to the central pool, we basically manage that using a gen pool allocator, which is a generic allocator that many subsystems in the Linux kernel use. So this is an allocator we didn't develop. It's one that's in there. It's well tested. It's well beaten on. Any kinks that are in there have been ironed out.
Starting point is 00:19:03 And we'll see later, basically, people borrow memory from that allocator. And it keeps track of who's got what, how many ref counts are kept on any particular page. It's responsible for doing defragmentation and taking care of all that kind of stuff. So that's the allocator. Now, the other secret sauce of this is the fucking Hyatt thing. It's going to get in there. I'm going to weave it into the story somehow. In order to do a DMA in the Linux kernel today, we have to have what's called struck page backing for the physical pages we're allocating to DMA. That's not a physical requirement,
Starting point is 00:19:47 but it is a Linux kernel requirement. If you try to pass a physical page of memory to an API call in the Linux kernel that doesn't have struct page mapping, somewhere in get user pages, there will be an error path, and basically the software will die. So in order for us to do DMAs with this new memory, this memory that the driver has donated, we have to get struck page backing. And luckily there's a device-specific function that we can use. This is not our function.
Starting point is 00:20:22 This is a function that's been around for quite a while called devm memory map, which basically performs a remapping of that memory and also generates struck pages for the pages in that mapping. So the great thing is I basically get to map these new physical pages. So I've plugged in some new memory. This particular part of the code, which is transparent to everybody out there who's writing their drivers, right? You don't even need to worry about this if you're just updating your drivers. In the background, we basically make this call. We put struck page backing on all those pages. If you give us 20 terabytes of memory, if someone's got a 20 terabyte bar, which you could have.
Starting point is 00:21:03 I mean, your BIOS is probably going to die if you do, but you could have. This might actually take a little bit of a while because we have to go get basically struck page backing for all those terabytes of memory. So once we start getting really, really, really big, byte addressable, persistent memory, we may want to rethink how we do this.
Starting point is 00:21:22 But right now I can basically get like one gig for a million dollars from Everspin or whatever. Million dollars a gig! So once it gets a little cheaper, things will move along. Sorry if there's anyone... That's my joke. That's not the true price. So it's just... So there are some interesting dependencies, and anyone who was in at the start who saw my kernel compile blow up.
Starting point is 00:21:50 If you want to take advantage of DevM memory out pages, you need something called zone device. To use zone device, you need memory hot plug. To need memory hot plug, you need sparse mem. And some architectures do not have all those dependencies in them yet. So the patches right now only support x86-64, so AMD and Intel. Because only they have all the dependencies on which zone device relies. So if you try to set your.config, if you take our patches and you try to set your.config for ARM and try to enable all this, it's not going to work. The.config will be invalid because you've tried to turn things on that it doesn't actually support.
Starting point is 00:22:34 We do actually have some out-of-tree patches for ARM64. And we are working on out-of-tree for RISC-V as well. But right now, the upstream effort will be exclusively Intel and AMD. And if you want to, you can buy me beer letter, and I'll explain exactly why that is. But it's a little too detailed to get into right now, getting into the bowels of the MMU and so forth. So again, this slide is very cluttered. But basically, once we've donated memory, other devices might want to use it. And the reality is that giving out peer-to-peer memory willy-nilly doesn't make any sense. For example, someone asked me earlier, what happens if the device that wants to borrow the memory is on a different socket than the device that wants to donate it?
Starting point is 00:23:24 That's a bad decision. You probably don't want to do peer-to-peer there. borrow the memory is on a different socket than the device that wants to donate it. Okay, that's a bad decision. You probably don't want to do peer-to-peer there. So we had to come up with some kind of find function that allowed us to put in place some rules, some policies around which we could decide, is this a good device to donate some peer-to-peer memory for? And if so, which device in our system that has peer-to-peer memory is the right one to do it from? So we're not going to go through all this in Munisha, but basically we have a function that says, hey, I've got a bunch of devices
Starting point is 00:23:55 that probably want to do peer-to-peer among each other. Is there a peer-to-peer memory donator somewhere out there that's a good fit for this list? And that's basically what this find function does. So we basically pass in a list of all the things that want to do peer-to-peer between each other. So that could be all the NVMe drives off a switch, or that could be a graphics card and an NVMe drive and an RDMA NIC that are connected to a single root port. And as long as the rules are, the policy rules are in place, and right now those policy rules are a certain way, but this is open source Linux. People
Starting point is 00:24:31 can contribute their own. And we may even have plugins where people can plug in their policy rules, either through something like Berkeley Packet Filter or through ConfigFS or something like that. All right. So we had to come up with some kind of formula. We used something that currently is based on distance, which is a metric of how many hops within the PCIe tree. We have rules right now that everything has to be underneath the same PCIe domain. So that has certain implications for how your devices have to be physically located in order for all this to work.
Starting point is 00:25:03 But it's open source. If you want to change it, you can. If you want to change it in a way that you think makes sense for the Encol community, you change it and you submit a patch. So these things are not set in stone and nor should they be. We have certain things like if two devices are both equally good at donating memory, we randomly pick between them because we don't want to always rely on one. We want to be taking memory from both. All of this can be overridden by configfs. If you really want this device to do peer-to-peer
Starting point is 00:25:35 with that device, if you really want that to happen, of course you can make it happen. It might be shitballs crazy, but if you want to do it, I'm not going to stop you. Once you've got your donator device, you can basically say, I'd actually like to allocate some memory from that device to do a DMA. So this is, again, this is the device that wants to do a DMA. At some point, we'll basically do this call in order to request some pages that it can do this peer-to-peer DMA with.
Starting point is 00:26:11 And this function may fail because the device that you're asking to borrow from may have nothing left. It may be so busy servicing other clients that it hasn't got any pages left to give you. So you have to handle that. The pages that it gives you, like I said earlier, are struct page packed. So you can pass them in to any DMA API in the Linux kernel, including the block layer. And you can be assured that that DMA will work.
Starting point is 00:26:44 And if there's an IOMMU involved, then it will go through the IOMMU API, but some interesting things can happen there. So we definitely warn people to be a little careful if they're trying to do some of this in virtualized environments. And there will be some interesting things that will happen as people try to do this more and more in those environments.
Starting point is 00:27:05 At some point, you know, in the deconstructor, you want to call the free function to give those pages back. And one thing that's kind of interesting to note, but what we actually store in the GenPool allocator is the PCI bus address, not the virtual address. And that saves us a lookup when we come to do the DMA. Yes? Yes.
Starting point is 00:27:36 So the question was, I have to find and ask, so there's obviously some kind of interdependency there. So that's a good point. This platform does accommodate hot plug, assuming the system supports it. So you can actually plug in new things and take away old things. But when you do the find,
Starting point is 00:28:03 you will get the best device at the point at which you issued the find. And that's actually in the driver documentation. So if you know you're in a system where there's a lot of hot plug activity, you may want to reissue find. You can have a callback. If anything changes in the PCIe sub system, call me back on this function call. That's something any PCIe driver can have. And basically, your driver could say, if something changes in the PCIe topology, I will recall this find function in my callback,
Starting point is 00:28:33 and I'll get the new best handler. So the Linux kernel already accommodates for things that might change. As in, you can recall find, and you don't even need to do it randomly. You can do it based on when the PCI layer has realized something has changed because a new device has been added
Starting point is 00:28:49 or a device has been removed, which is very important, obviously, in NVMe systems. So in find, specifically, the find doesn't take into account that it's busy or full or... No, right now it's a distance metric, right now. But we're certainly, I think once we get this upstream, I think there will be some good discussions around
Starting point is 00:29:12 do we need multiple policies? Do people want to have a plug-in where they can plug in their own policy? This is where Berkeley Packet Filter could be a great example, right? Because you could actually push in from user space your own policies based on compiled BPF code, which I think is a great way of doing it, to be honest.
Starting point is 00:29:31 So we can have conventional policies in the kernel. If you want to do your own thing, you could inject it through BPF. A good question. Yes? Is this working in this way when you can intercept some topological change of PCIe, but in case of some ongoing EMA transfer from one place to another, so then something wrong can happen, right?
Starting point is 00:29:58 Well, you can get this pop-up for a start. So are you saying if you're halfway through a DMA and someone pulls a PCIe device out of the system, you know, if that happens, you're kind of screwed no matter what you're doing. So, you know. There is no good way to intercept such this error and handle that gracefully. So if you talk to the PCIe switch vendors, they'll talk about downstream port containment and raising an AER. But at the end of the day, you have basically pulled a device out in the middle of a whole bunch of TLPs. And now between hardware and the operating system, you have to step in and clean everything up. It's doable, but it's not easy either for system memory.
Starting point is 00:30:46 And I don't think it's particularly harder or easier, no matter whether it's peer-to-peer memory or whether it's system memory. And most systems are just going to die, right? Unless you have DPC and some very fancy operating system stuff going on. Maybe even BIOS. Yeah, I mean, because UFI is going to get called back
Starting point is 00:31:05 and if the UEFI can't handle the re-enumeration of the... So, in theory, it's all doable. I'm sure there's plenty of people in this room who work for companies that are working on PCIe surprise removal. The ones with no hair. Or hair at odd angles
Starting point is 00:31:22 because you pull it out. Yeah, so it's a tricky one. Yeah, we are cognizant of that. Good question. So I talked a little about DMAs. Obviously, one of the block subsystems that use DMAs a lot is the block layer, right? And we wanted to tie into that because NVMe and NVMever Fabrics are kind of our first example applications. So we actually have a couple of patches that touch the block layer. And Jens, who's the maintainer, has been
Starting point is 00:31:54 giving us some feedback on those. We don't hit the block layer too hard, but we do need to be able to indicate if a scatter gather list is backed by traditional memory or peer-to-peer memory. And there's a little bit of ongoing work. So this is probably one of the areas that still might see a little bit of change before it goes upstream. But I think we're converging here. And basically, you can test whether the scatter gather,
Starting point is 00:32:18 which is not necessarily a physical scatter gather, it's just what we call it in the block layer, is backed by peer-to-peer pages or not. The great thing is that this actually, assuming we get the block layer support in, any device that leverages the block layer, obviously for us NVMe is the most pertinent and timely, but SCSI, MQ, and other devices could also take advantage of this at the block layer level. So it should be quite easy to get. For example, if you had a RAID adapter in your system and that RAID adapter had some DRAM on it, as RAID adapters tend to do,
Starting point is 00:32:55 then that could also be a donator and also could participate in peer-to-peer DMAs as well. So it doesn't have to be NVMe. Or it could be, yeah, i think you all get the idea right take a picture of this slide i'm not going to walk through it basically someone who might be in this room might not ask me for like an update on where we were with uh all things peer-to-peer and so i wrote an email and i went this is a really fucking good email. So I'm actually going to put it in my slides and people can take a picture. I put it on LinkedIn as well. But basically, there are, you know, basically a whole bunch of links in here related to peer-to-peer
Starting point is 00:33:38 patches, how we expose to user space, which is not going upstream the way we currently do it, but we'll talk about that later. QEMU for testing. You can actually test peer-to-peer in QEMU without any physical hardware because we have CMB models for NVMe drives in QEMU, and I know because I put in those patches. And we also have a little bash script that builds a Debian file with the very latest kernel and header images in it. So if you're a Debian-based distro company, which we are, then you can basically just run this script, this last one, at any time. And you basically get.deb packages to install the peer-to-peer kernel. And I try to keep those as up-to-date as possible. So that's kind of like where we are in terms of the patches, and I'm going to talk about a couple
Starting point is 00:34:31 of use cases now. So again, feel free to jump in with questions. But if I walk through this slide, this is basically going to show how we used peer-to-peer DMAs to optimize an NVMe over Fabrics target. So the hardware that we had was an Intel Skylake CPU with a bunch of DRAM attached to it. We had a micro-semi-PCIe switch. We had our Adetacom device, which is NVMe-based, but it has a controller memory buffer, a big one, and a very high performance one that supports write data and read data. And then we actually had a JBoff, a Celestica Nebula. And then we had a bunch of drives. We had some Intel drives and some Seagate drives, actually. So a bit of a mix there. And then here we actually had a Mellanox CX5. And what we did is we did some performance analysis, and this thing popped up.
Starting point is 00:35:33 We did some performance analysis for the vanilla NVMe over Fabrics target. And what happens in a vanilla NVMe over Fabrics target is, assuming we're writing to the drives, basically data and commands come in. They get put in a buffer in DRAM by the RDMA NIC. The NIC raises an interrupt, or we poll, and basically the operating system realizes there's data here. And then basically it processes a little bit of the data to see, hey, what am I actually doing with this incoming data? Which drive am I writing it to? And then it issues some NVMe write commands to the drives, assuming the drives are actually NVMe. You know, the dirty secret of NVMe over Fabrics is these could be SCSI drives, right?
Starting point is 00:36:15 There's a layer of indirection, so you could have SCSI, but assuming it's NVMe, these drives are here. So, you know, if I'm doing an awful lot of I.O., back to my motivation point, right, all that I.O. has to go into this DRAM, and some of it may end up in DDIO. We have L3 cache, but a lot of it's going to leak into DRAM. This means I might need quite a few channels of DRAM here, even if this thing isn't really doing very much,
Starting point is 00:36:39 apart from being a DRAM buffer. Maybe I'm spending a lot of money on a processor, but all I need is DDR bandwidth. Now, in a hyperconverged environment, I have a different problem. I might be using these cores quite a lot in hyperconverge, but now they're fighting with the DRAM for quality of service on the memory, right? So either way, I have a problem. So what we can do, and we work with Mellanox to do this, is we can basically enable changes in the operating system so that the data path now uses memory
Starting point is 00:37:08 from our controller memory buffer rather than from here. And this is something that's in the patches. This is one of the applications that we included in the patch set that's going hopefully upstream. And so what happens is the new data path becomes RDMA NIC, registers memory against the CMB. It does a DMA into here. This thing still gets notified when that DMA is done,
Starting point is 00:37:33 and now it issues NVMe commands that have pointers to here rather than to here. So this guy is still running all the I.O. He's still in control. If anything goes wrong, you still have an operating system to come in and help tidy up. But now my data path, my hot path, my DMA path is completely off the processor. That lets me do one of two things. Either I'm hyper-converged and my customers are happier because now they get better quality of service in their VMs.
Starting point is 00:38:04 Or I didn't need this thing and now I can replace this with a RISC-V SOC. Why not? I like RISC-V. There you go. Thank you. Somebody's paying attention. We didn't all fall asleep after lunch. Buy that man a beverage of his choice. Thank you. Thank you for attention. We didn't all fall asleep after lunch. Buy that man a beverage of his choice. Thank you. Thank you for pointing that out. Yes, the slides. There you go. I just screwed up the entire talk. So we actually did, I'll go back to the last slide because we did one other thing. So I didn't talk about, not peer-to-peer directly, but one of the other things we did with Mellanox is, and I mentioned this in the talk before lunch, we actually programmed the
Starting point is 00:38:50 offload engine in the Mellanox NIC as well. So in that case, the Mellanox NIC isn't just DMAing to here. It's also now the one ringing the doorbells on the drives and putting submission queue entries in here as well. At that point, this guy doesn't really need to be here anymore. Pretty soon you put a little OS here and you've got a blue field. There's many ways to skin an NVMe target. It's kind of fun looking at all the opportunities. So we end up with four rows here. We have vanilla NVMe over Fabrics, as it is today in the Linux kernel and SPDK.
Starting point is 00:39:28 We have the ConnectX 5 offload, so the CX5 is still using system memory, but now the commands are being issued by the NIC. We have back to the CPU doing the commands, but now we're using peer-to-peer memory. And then we have both the offload and the peer-to-peer. And basically, the CPU utilization changes and goes down. We actually get some saving here because it's not using as much of its own memory, so the memory allocators are down. This one went up, and I think this is DDIO and L3 cache hot or not hot.
Starting point is 00:40:08 Because this one went up because in the vanilla one, the NVMe over fabrics code is touching certain lines of code and pulling those lines into L2 and L3 cache. Whereas here, the ConnectX 5 is touching those lines of code, and they do not become hot in the processor's cache hierarchy. And the way we got these numbers is we used the PMON counters. We were actually measuring the registers in the integrated memory controllers. So what we need to go back and look at is the L3 hit versus miss rate, and I don't have that data yet.
Starting point is 00:40:38 But this was a pretty fun number, and I think it's related to L, last level cache. Artifacts. number and i think it's related to l last level cache artifacts um you're gonna drive me crazy on the uh and then when we when we turned on the peer-to-peer memory we definitely saw a significant memory drop and now that's basically because the dma traffic is almost all well all the dma traffic is going through the no-load and not through system memory. Obviously, there's still some traffic going out to memory because the commands are still getting processed by the CPU, so there's a little bit of variability there.
Starting point is 00:41:23 Moving on to the other one, I know that was a lot of data, but the basic upshot is peer-to-peer DMA can definitely help on the memory side. So it's kind of interesting there. This is an actual customer use case that we turned into a Flash Memory Summit demo. The customer is HyperConverge. They wanted to be able to do compression of their data in a way where the data didn't have to go through the memory subsystem because they wanted to keep the memory subsystem for the VMs. So they required libz compatible compression, and they wanted to use the peer-to-peer DMA to minimize the DMA impact. They wanted a U.2 form factor for the accelerator, so that's kind of one of the reasons they came to us. and their input and output data
Starting point is 00:42:05 had to be located on standard storage based nvme ssds with a very standard ext4 file system running across those drives i'm not going to go into this we don't have time but we have this awesome device come buy some from me make me rich thank you um offload it. So this is what we did. We ended up basically with uncompressed data on some SSDs. In this case, just one, but we can do it across multiple SSDs. We have an application running on the AMD EPYC, and we used an AMD processor because the customer wanted a lot of PCIe lanes. And also, the AMD has very good peer- a lot of PCIe lanes. And also,
Starting point is 00:42:50 the AMD has very good peer-to-peer performance between its root ports. And that's not something the Intel processors have been as good at doing. They tend to require an external switch. So the AMD is very interesting because there's a lot of PCIe lanes and good root port to root port peer-to-peer data transfer. So we had some NVMe SSDs in. We had some of our U.2 no-loads with compression algorithms loaded onto them. And then we had output SSDs, and the application was here managing all the I.O. And basically what would happen is we'd issue an NVMe read command to these SSDs
Starting point is 00:43:23 with the read address pointing to our CMB. So data would get DMA'd from there to here. We would then compress it, put it in a different place in our CMB, and then the application would write a write command here and pull our compressed data to the output drives. And so what was happening, basically, is we had this peer-to-peer compression path,
Starting point is 00:43:44 and we were completely avoiding the memory subsystem on the AMD EPYC. We were measuring, this is a Gen 3 by 4, so the best we're going to do is about 3.4 gigabytes per second. With this particular setup, we were hitting about 3.1 gigabytes per second of input. And then we were roughly compressing by a factor of two in this particular setting, so we also had 1.5 of output. And we were getting about 99% DMA offload from the processor.
Starting point is 00:44:16 So what you're saying is that the address on the node is mapped to both cases? Yep, using the peer-to-peer DMA framework I showed you earlier. So basically, there's an application here that basically has taken some of the pages of the CMB and give it to this guy for DMA, and then given different pages to this guy for DMA. Yep. And that application is actually on GitHub, and you can go take a look at it yes
Starting point is 00:44:47 the controller memory buffer the U.2 we have has 8 gigabytes of DRAM on it and we can expose anything between 0 to 8 gigabytes yeah I mean to be honest you don't need a huge CMB because it's basically the delay bandwidth product. So it doesn't need to be massive.
Starting point is 00:45:12 We normally ship with like 512 megabytes just because why not? But we typically don't need it. Thank you. So to wrap it up, we think there's a good justification for having peer-to-peer DMA as a framework in the Linux kernel. As these devices get faster and as we maybe change our compute hierarchy or infrastructure so the processor is not necessarily the god of all things, we move to a world of more heterogeneous SOC-based type systems. SPDK already has upstream support for this,
Starting point is 00:45:45 but there is some work that needs to be done to kind of harden it. Linux kernel, we're doing really well. We got some, we got, last week, we actually got Axe from the PCIe maintainer. So we're really, really at this point just sorting out the block layer, couple of things,
Starting point is 00:46:00 and then I'm actually hoping we may hit 420 or 421. There we go. The initial applications that are in the Linux kernel, at least, are NVMe-centric, but I'm hoping others will follow, and some people even earlier today were telling me they're playing with graphic cards and RDMA NICs and so forth.
Starting point is 00:46:20 In the words of Linus, who's currently taking a break, go forth and test. Thank you very much. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at sneha.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
