Storage Developer Conference - #36: Enabling Remote Access to Persistent Memory on an IO Subsystem Using NVM Express
Episode Date: March 13, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast.
You are listening to SDC Podcast Episode 36.
Today we hear from Stephen Bates, Senior Technical Director at Microsemi,
as he presents Enabling Remote Access to Persistent Memory on an IO subsystem using NVM Express and RDMA
from the 2016 Storage Developer Conference.
My name is Steve Bates. I work for Microsemi,
and I have the longest, most incoherent title of any talk at Storage Developers Conference.
I think I submitted this abstract just before I went on parental leave and I put in every single keyword that I possibly could.
So enabling remote access to persistent memory on an IO subsystem using NVM
express and RDMA. I decided that I was going to change the title.
So, iopmem: PMEM for MMIO.
Is that any better?
Maybe not any better.
So PMem I will talk about a little bit.
The last speaker, in fact, all the speakers today
have been kind of a great segue into what
I'm going to talk about today, because I'm
going to touch on NVMe, NVMe over Fabrics,
and also persistent memory, i.e.
accessing memory using load store type semantics
as opposed to block type semantics.
And I'm also going to touch on some of the stuff that's been happening in the Linux kernel.
So the last talk, if you were here, was quite interesting because we were talking about
how do we write high level languages like C and Java and how do we put in place tool
chains and libraries that are
architecture and OS independent that are going to let us basically get to code that can run
on any architecture and provide us with very nice memory semantics.
We're looking at this and we've been looking at this for a long time and it's kind of scary.
So it's interesting that one of the things that's happened in the Linux kernel is they
went this load store stuff is really, really
interesting, but let's take a step back
and let's do something with persistent memory
that we can take advantage of right now.
And one of the things they've come up with
is something called PMEM, which is a driver that turns an NVDIMM
into a block device, which makes it very easy for applications
to talk to it because it's talking to a block device.
We know how to talk to block devices.
We've been doing it for a very long time.
So that takes away some of the challenges.
But of course it doesn't give us all the optimizations.
So I'm going to talk a little bit about that and how we can extend that to talk about not
just memory that's hanging off the DDR interface or off the memory channel as we call it, because
it might not be DDR, right?
It could be Hybrid Memory Cube or some other kind
of protocol.
And how we might use it to talk to memory that's
sitting on the I.O. system, for example, PCIe memory,
which is something that we're all pretty familiar with.
And if you work with NVMe drives,
you actually do that kind of thing
all the time through the driver, though not necessarily
with persistence.
So anyway, that's what we're gonna get to.
Before I jump into that, a few of us,
for some reason my name is the lead name on it,
but a few of us, including Tom who's here,
are giving a keynote later this week.
This slide was a slide I really wanted
to put in the keynote.
We didn't have enough space for it,
so I'm gonna do it now because I really like this slide.
I spent a lot of time thinking about it.
One of the things that somebody told me a very, very long time ago, and it's kind of
stuck with me, is that throughput is easy, latency is hard.
Now, there will be those among us who work on things like storage controllers and whatever.
They go, you know, throughput is not that easy.
But I think the way to summarize it, the way to think about it is throughput's an engineering problem, right?
And the way I think about throughput is
I'm sitting on the highway,
and I'm counting how many cars per hour go past me, right?
And if this is a very wide highway,
the cars don't have to be going super fast,
and I can see a lot of car, you know,
a lot of vehicles every hour go past me, right?
That's throughput.
I can get more throughput
by increasing the width of my highway, by adding more lanes, adding more parallelism. That's what we do
with storage controllers for years. We fan out to many, many drives. The latest NVMe
controllers, they use a lot of flash die and they bring in those flash die and then present
it as an NVMe. Parallelism gives you throughput. Latency is something very different.
Latency is me getting in my car in San Francisco
and going, I wanna be in Los Angeles in three hours.
Is that possible?
I don't know.
It is if you've got one of these.
You know what?
I was half expecting you to go, that's my car.
Do you have a Chiron?
No, an old Volkswagen.
They were faster, right?
Yeah, quality German engineering.
So latency is basically trying to get from a location to another location.
One specific IO, right?
It's not one of many I/Os, it's like I need
this I/O to get here as quick as possible.
And that is a physics problem because time doesn't go backwards.
In fact unless you're willing to travel very close to the speed of light it doesn't even
slow down.
So the problem with time is that once you spend a bit of time accessing the media, you don't get that back.
Once you spend a bit of time going through the SCSI stack,
you don't get that back.
They add together.
And Christoph showed some slides on that.
Intel have a very famous slide they've used for a very long time, showing that the total latency is the sum of all the parts,
and there's no minus sign in that equation.
If there was, we'd be in a very different space.
So throughput is easy.
Latency is hard.
Persistent memory is all about latency.
And that means that everything we do
has to be thought about very carefully,
because you cannot take it back.
Once you've spent a microsecond implementing 20, 40 functions
because you want to have some kind of safeguard,
you don't get that time back.
So we've got to think very carefully because it only goes one way. One thing I was thinking is,
and I might put a patent on this, you could issue an I.O. and travel away from the computer at almost
the speed of light and then slow down and then come back. And you have 100 microseconds here.
You've only gone 10 microseconds. I got a 10 microsecond I.O. out of 100 microseconds.
I've expended an awful lot of energy to accelerate to the speed of light.
And I can't even see the IO cuz now I'm mush,
cuz the human body would have collapsed under the gravitational forces.
Shit, but anyway.
Anyway, there'll be a lot more slides like that on Wednesday, so come along.
It's gonna be a good session. The motivation behind today's talk though is really around PCIe I.O. devices.
All right, Sagi used to work for Mellanox, used to work for Libet, Christoph, and, you
know, everyone in this room probably either works for a company that
makes a high-speed I/O device or takes advantage of them in some way
in their software stack.
RDMA NICs, graphic cards, FPGA accelerator PCIe cards, NVMe devices, right? They're all PCIe detached. They all tend to have incredibly high throughputs, right? We have a product at
Microsemi that's capable of doing, you know, up to five gigabytes per second of PCIe I/O. So we're getting an awful lot of bandwidth out of these devices.
These devices tend to have DMA engines.
Someone mentioned it earlier. I love DMA
because DMA is basically the CPU going, I gotta move a lot of data.
I could sit here for the next 7,000 instructions doing load stores. I could
do that. That's a waste of a very expensive processor that's capable of running much more
complicated instructions than just loads and stores. I actually want to do something where
I issue a request to this stupid DMA engine that's a lot simpler than I am and say, hey,
you go and move the data.
I'm going to go generate some revenue by running some analytics. And you tell me when you're done,
either by me polling for a completion, which takes CPU cycles, or by you raising an interrupt
through some interrupt scheme. And then I'll come back, wake up, because the OS tells me to,
and service the I/O. Right, that's kind of the way all these PCIe devices want to work. So, you know, most
NVMe devices including ours most all RDMA devices will have very high
performance DMA engines they'll support scatter gather lists they'll support
multiple contexts right and we can take advantage of those and program those and so forth.
The other thing PCIe devices are getting is exposed bars,
right, regions of memory, some of which we mmap
for the driver functionality, right?
So we mmap them in the driver and then we peek and poke
at the registers and they do things, right?
They issue I/O, they reset the device, or whatever.
But other bars are more generic.
And a classic example of that is, for example,
something like a frame buffer in a graphics card,
where you can actually have a region of memory.
And if you write to that region, something
appears on a monitor.
It's like, woo.
We were doing that for a long time.
But we're starting to look at ways
we can take advantage of that again.
So PCI devices tend to have very good DMA engines,
and they also can expose bars.
Those bars don't have to be mapped or, sorry,
backed by actual memory.
I can write some firmware on a device
that exposes a humongous bar.
It might crash the server, because the server
tries to enumerate, and it's like, whoa.
But there's nothing stopping me exposing
multiple terabytes of bar space.
Now, if you write to that bar in some random region,
can I guarantee I'm going to keep that data for you forever?
Probably not, unless I have a whole bunch of hard drives
behind me.
But maybe there's games we can play there.
Interesting things could be done there.
Something to think about.
This is where Christoph's going to jump on me. Until this work, pretty much
any high-performance transfer of information between two PCIe devices has required the
use of a buffer and system memory. And Sagi and Christoph put up some very detailed slides
earlier showing how data flows in the NVMe over Fabrics driver. And if you weren't here for that, the basic idea is that we move some data from a remote
host to the target's DRAM, the target's system memory, and then we move the data from the
DRAM to the NVMe devices through the block layer.
Or a piece of DRAM in the NVMe?
Potentially, assuming it has some kind of DRAM in the NVMe device,
which most of the high-performance ones will.
And it comes from something like DRAM.
Exactly.
So we're using DRAM in three different places.
We're using it on the host initiator,
we're using it on the target,
and we're using it on the drive itself, typically.
And the problem, of course, is on the target side,
it could be talking to hundreds of hosts at the same time. Scalability becomes a problem. I have to reserve memory for all of them. Memory bandwidth is not infinite,
right? You know, people think it is, but it's not, right? So you also have
to worry about not only how much volume of memory am I using, but am I also cutting or eating into my DRAM bandwidth, which
again could compromise performance
for other parts of my system.
If I'm a hyper-converged system, I'm not just a storage device.
I may also be doing analytics, processing, et cetera.
Every gigabyte per second of DRAM bandwidth
I lose because of my storage stack, I don't get back.
Again,
it's not additive. So we potentially want to try and take that buffering step out of
that and that's kind of one of the main things that we look at in this paper, removing that.
We're working in the Linux space so this is actually results that are based on a patch
to the Linux kernel. I'm going to get to it at the end but there's actually other ways
of solving this problem.
This one I'm just going to talk about because it's
quite pertinent.
But I'm not necessarily saying this
is the way we have to do it in the Linux kernel,
because Christoph would definitely
hit me if I said that.
So I'm not going to say that.
There's definitely other ways of doing it.
I don't even necessarily think this is the right way,
but it's a very interesting way to look at.
And while we're there, we're also
going to touch on some fun things around NVMe.
So before I go any further, people are going to maybe get a little bored with
this, but I wanted to give you a very quick 101 on the PCIe version of NVM Express.
Until recently we didn't have to qualify it with PCIe, but now, thanks to the work
by the NVMe working group, we also have NVMe
over RDMA, and pretty soon we'll have NVMe over Fibre Channel as well.
The PCIe version, it works reasonably straightforwardly in the sense that, and this is one of the
things I really like about NVMe, the NVMe group defined a boundary down here at the
PCIe layer.
They went, if you want to make an NVMe device, you have to do this at the PCIe layer.
You must present a certain region of memory, a PCIe bar essentially, and the contents of
that bar have to follow this spec.
And that was great because unlike RDMA, it meant we could have a common low-level driver
for every NVMe device.
So it's not like we need a hardware driver shim,
like we do in RDMA for the Mellanox products,
for the Chelsio products, for somebody else's.
We can have a common NVMe driver for any NVMe device.
And I love that.
I love the fact that HGST sent me a sample,
I plug it in the server, I boot it up, it just works.
It doesn't matter, I get one from Memblaze,
I get one from Intel, it just works.
It's fantastic.
Basically the driver works by memory mapping
or IO remapping a region of memory,
and then it knows, well basically now
I can control this device
because I'm basically going to issue reads and writes against that memory region on the
device. Very standard Linux driver kind of work.
Admin commands are used at load and probe time to configure things like queues and doorbells.
And because we probe it, you can do things like hot plug. So I can plug in a new drive even when the system's up, assuming the hardware supports it, and it should just
go, hey, there's a new NVMe device, and please not do a kernel oops.
Sagi's laughing.
In theory, that's how it works.
I actually find this very, very interesting and I kind of wanted to show it just to give a little bit of background.
And I don't know how many people have really done this,
but I put a protocol analyzer between the CPU and the NVMe drive,
and I issued some individual I.O.
I just wanted to see, you know, what do you actually see on the PCIe bus
if you do one NVMe command, right?
You know, it seems like it should be obvious, but it's kind of fun to actually go and go, well, the
spec says this should happen, but what actually does happen?
So here, I'm sorry, it's probably not super clear, but here we have a single NVMe command.
It kind of starts here and it ends down here.
And this is considering a system that's quiescent. So take a system that's completely quiet and just do one NVMe read.
Okay?
That's it.
Don't worry about trying to interleave commands or anything.
And so what happens is, you actually, the first thing that happens is the host
rings a doorbell on the drive by issuing a simple TLP to a very specific address.
It knows where it is. It was decided during startup.
So basically the host starts and goes, hey, NVMe drive, wake up.
What happens next is the NVMe drive goes and pulls in the head of that submission queue, 64 bytes of data.
That's going to tell the drive what to do.
So the first thing that happened is we rang a doorbell on the drive.
The next thing that happens is the drive issues a memory read essentially to main memory because
the queues are stored in main memory right now.
Pulls it in, 64 bytes, and somewhere in that drive there's a little piece of hardware or
firmware or software, depending on your implementation, that goes, this is a 64 byte NVMe command.
This is an NVMe read.
They've asked to read this particular LBA.
Maybe there's some other things that I need to do with that.
But that's pretty much it.
What happens then depends on the media
that that NVMe drive has.
But somehow, you're going to go find those 512 bytes,
in this case, or maybe 4K in this case.
You're going to go find them.
You're going to energize a NAND die.
You're going to talk to resistive RAM.
You're going to go to DRAM.
Whatever it takes, depending on the media,
it's kind of irrelevant to NVMe.
You could, in theory, go to a spinning disk.
NVMe doesn't care.
What happens next is you DMA the data
to the address that was given to you in the NVMe command.
So part of the NVMe command said,
here's a region of physical memory, probably in system memory.
Can you please DMA to that address?
It's not a virtual address because DMAs have no idea what virtual addresses are, right?
So it has to be a physical address,
and we have to do something in the driver, get user pages or whatever,
to make sure that if we're talking virtual addresses in our program, we're talking physical addresses
by the time the information gets to the drive.
Once the DMA is done, we write to the completion queue.
We have to tell the system the read is complete.
And then, depending on whether we're polling or interrupting or doing whatever, in this case we actually issued an MSI-X interrupt,
which is a type of PCI interrupt which basically tells the processor that something's happened,
there's hardware in there that works out which handler is associated with this interrupt. That handler is tied to a thread, that thread is woken up, and it goes, oh, we're done.
And the data is now in my system memory.
I can pass it back through the page cache or if it's direct IO, I can copy it.
You can do all kinds of things.
It depends on the OS and so forth.
So that's the anatomy of an NVMe command.
It's worth keeping that in mind.
And in a little bit, we're actually going to look at how long does that take and what is the kind of thing that
can affect the jitter on this kind of command.
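Just to make that flow concrete, here is a rough C sketch of the two host-side pieces: the 64-byte submission queue entry and the doorbell write. The layout follows the public NVMe spec for a read command, but the names and the helper are simplified illustrations rather than actual driver code.

```c
/* Simplified sketch of an NVMe read submission, per the public NVMe
 * spec (not real driver code).  The host builds a 64-byte submission
 * queue entry in host memory, then rings the doorbell with a single
 * 32-bit write to BAR 0; everything after that is driven by the
 * drive's own DMA engine. */
#include <stdint.h>

struct nvme_rw_command {            /* 64 bytes total */
    uint8_t  opcode;                /* 0x02 = read                  */
    uint8_t  flags;
    uint16_t command_id;            /* echoed back in the completion */
    uint32_t nsid;                  /* namespace identifier          */
    uint64_t rsvd2;
    uint64_t metadata;
    uint64_t prp1;                  /* physical address for the data */
    uint64_t prp2;
    uint64_t slba;                  /* starting LBA to read          */
    uint16_t length;                /* number of LBAs, zero-based    */
    uint16_t control;
    uint32_t dsmgmt;
    uint32_t reftag;
    uint16_t apptag;
    uint16_t appmask;
};

/* Ring the submission queue tail doorbell: one MMIO write telling the
 * drive "the queue tail is now here, go fetch the new entries". */
static inline void ring_sq_doorbell(volatile uint32_t *sq_tail_db,
                                    uint32_t new_tail)
{
    *sq_tail_db = new_tail;
}
```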
The other thing is this is a block-based interface.
We are not talking memory semantics here.
So this is not load store.
This is DMA, ask for something to be done, have it done.
There are some nice things about this.
One, it uses DMAs.
DMAs save you using CPUs, and CPUs are expensive.
DMA engines are not.
You can do things like data integrity.
You can do atomics.
You can say, either this write will complete in full,
or it won't happen at all.
There's good things about the block layer. Let's not
forget that. So some of the hardware that I've been using to look at some of this. I
wanted a very low latency NVMe device. Just so happens Micro Semi makes one. I actually
have one here. If anyone wants to take a look at it. So this is shipping in volume today.
Underneath this heat sink is what's called Princeton.
It's one of our NVMe controllers.
Some of you may be familiar with it.
Some of you may even ship product with it.
Rather than using Flash, we use a whole bunch of DRAM for the principal store.
And then this is a little Flash card.
And there's also, not shown here, of course,
because marketing, there's a big, chunky capacitor thing.
Because it's here.
And the idea is that if you lose power,
all this DRAM gets vaulted into this flash.
And then on reboot, it comes back.
But the great thing is that basically, I
have a super low latency.
I had an Optane SSD before Intel did. Albeit at a higher price. I haven't seen the price yet, Jim.
It's going way, way faster than all these other things.
That's right. And also I have infinite endurance. So I can write this puppy all day long. It's
DRAM. Doesn't matter.
Power fail every two seconds.
Yeah, that's true. This guy only has so many cycles. Exactly. So the interesting thing also about this device
is, for reasons that will remain unclear,
we didn't just present this as an NVMe device to target.
Actually, a customer asked for it.
One of our customers was like, hey,
this is cool that this is an NVMe device.
Capacity is not great, but for a write cache,
it's a really good thing. But
can you expose that memory or some part of it as an additional bar? So can we have it
as both a block device and also a memory device? And we went, yeah, we can do that. And the
interesting thing is that that's actually a controller memory buffer. Now, this was before controller memory buffers were even in the spec, so it's not standard, but it is a controller memory buffer.
So the interesting thing is if you stick one of those in your server,
and you do, look how geeky is this,
if you do an lspci -vvv, so you get very verbose, and you look at the device, you'll even see we still have the old PCI signature.
So PMC is now Microsemi, but there's a PMC-Sierra ID there.
And you'll see that there's two memory regions, two bars that
are exposed.
There's one here, which is non-prefetchable and 16k.
And then there's this really big one here.
This one is the standard NVMe one.
Any NVMe device has to show this one, right?
Otherwise it's not NVMe. And this is the one the driver will ioremap or whatever so
that it can talk to it. This is basically one gig of our DRAM capacity that we've exposed
essentially as a controller memory buffer. This can be ioremapped, this can be mmapped, this can be anything you like, right?
And now this region is memory addressable,
albeit on the I-O subsystem, not in standard DRAM, right?
So it's not system memory, it's not coherent,
it's got an L3 cache in front of it.
Performance in certain instances will suck,
and I'll talk about that in a minute.
But it's interesting.
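As a side note, if you want to poke at a bar like that from user space, one way is to mmap the matching sysfs resource file. This is only a hedged sketch; the PCI address and the resource index are made up for illustration, and you would pick whichever bar lspci shows for your device.

```c
/* Hedged user-space sketch: map an exposed PCIe bar through sysfs and
 * write to it directly.  The device address and 'resource4' index are
 * illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:04:00.0/resource4";
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    size_t len = 1UL << 20;          /* map the first 1 MiB of the bar */
    void *bar = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Load/store semantics straight into device memory.  Remember the
     * caveats from the talk: this path is not coherent and is slow
     * compared with letting a DMA engine move the data. */
    memcpy(bar, "hello bar", 10);

    munmap(bar, len);
    close(fd);
    return 0;
}
```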
And the question is, given that we have that,
what do we do with it?
So one of the things you want to do with it is standardize it.
So just very quickly, I won't go into too much detail.
This is all in the public NVMe spec,
which anyone here can go and download from nvmexpress.org.
This specifies a couple of registers that are in that NVMe region of the bar, that first
bar region zero, that tells the driver about the CMB capabilities of the drive.
So basically somewhere in the driver eventually there's going to be a line that goes, if the NVMe version of this drive is 1.2 or later and if the CMB size register is not equal to zero, let's go and do something.
And that code isn't necessarily really there today.
There's some initial work on the CMB in the Linux driver.
I don't know about the status of the Windows and VMware and other drivers.
But in the Linux driver we have some initial CMB support.
Hopefully more of it is coming soon.
And basically this really is just going to describe the size of the bar, the location
of the bar, which you can get from the PCIe data anyway, but it's also going to tell you
about the capabilities that this drive is willing to support for that bar.
And right now we have a couple of different capabilities, and they're not actually on
this slide, but they basically say what kind of things would I like this bar to be used for?
Can you use it for submission queue entries and you use it for completion queue entries?
Can you use it for write data?
Can you use it for read data and then one of the things we don't have in there right now
But we're thinking about adding is a flag that says this is a persistent bar. If you write to this bar,
I will keep your data for you, right? Which actually turns it into kind of an NVDIMM on the PCIe system.
So that's kind of where we're going with that.
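To give a feel for what that driver check could look like, here is a rough sketch against the register layout in the public NVMe 1.2 spec (CMBLOC at offset 0x38, CMBSZ at 0x3C). It is an illustration, not the actual Linux NVMe driver code.

```c
/* Rough sketch, not the actual Linux NVMe driver: a probe-time check of
 * the Controller Memory Buffer registers defined in NVMe 1.2.  'bar0'
 * is assumed to be the ioremap()ed BAR 0 of the controller. */
#define NVME_REG_VS      0x08   /* controller version                   */
#define NVME_REG_CMBLOC  0x38   /* CMB location: which bar, what offset */
#define NVME_REG_CMBSZ   0x3c   /* CMB size and capability bits         */

static void check_cmb(void __iomem *bar0)
{
	u32 vs     = readl(bar0 + NVME_REG_VS);
	u32 cmbloc = readl(bar0 + NVME_REG_CMBLOC);
	u32 cmbsz  = readl(bar0 + NVME_REG_CMBSZ);

	/* CMB registers only exist from NVMe 1.2 on, and CMBSZ == 0 means
	 * the controller does not implement a CMB. */
	if (vs < 0x00010200 || cmbsz == 0)
		return;

	/* Low bits of CMBSZ advertise what the CMB may be used for:
	 * submission queues, completion queues, PRP/SGL lists,
	 * read data, write data. */
	u32 sqs = cmbsz & (1 << 0);	/* submission queue support */
	u32 cqs = cmbsz & (1 << 1);	/* completion queue support */
	u32 wds = cmbsz & (1 << 4);	/* write data support       */

	/* CMBLOC says which bar holds the CMB and the offset into it,
	 * in units of the size unit (SZU) from CMBSZ. */
	u32 bir  = cmbloc & 0x7;
	u32 ofst = cmbloc >> 12;

	/* ... map that bar region and decide how to use it ... */
	(void)sqs; (void)cqs; (void)wds; (void)bir; (void)ofst;
}
```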
So before we jump right into that,
I just kind of want to take a quick step back and look at some latency numbers so that we can compare latency from the NVMe device with latency from some of the other
things. So, you know, as Intel and Christoph and Sagi mentioned earlier, very fast NVMe devices have latency down around the 10 microsecond mark, so it's pretty
interesting to go and work out where's that latency coming from?
How much of it is dependent on the SSD itself?
How much of it is NVMe protocol overhead?
How much of it is driver?
How much of it is interrupt-based?
And like I said to Christoph earlier,
if you have an outlier,
so I'm a statistician by background,
and of course average is interesting,
but distribution is also incredibly important,
right? Thank you. Yes, I did appreciate that. Because a lot of people say average, and that
doesn't help you if every time, you know, one in a hundred is way out here, right? Your application
cares about here, especially if you're doing multiple reads, and the result that you give
back to the user is dependent on the result of all those reads. If I issue a hundred reads
and I have to add the results together to give someone an answer,
my performance is bounded by the worst case read out of those 100.
So if your average is 1 but your 1 in 100 is 10, then you're stuck with the 10, right?
Average is not that useful to you there.
So we did, like I said, we put the LeCroy analyzer here.
And we did some I.O. and we did some measurements.
This graph is probably not super clear.
But basically, we broke it down into the I.O. contribution
time.
And what we find is that the drive itself,
our drive is a little slower than your drive.
But the NVRAM device, as we call it, was pretty consistent and had
very good consistency in terms of its outliers as well, around eight to nine
microseconds. But what we actually noticed was there was occasional reads
that were taking much much longer than that and we tracked that down to the MSIX
interrupt times and you can kind of see that on this graph where I've actually
broken the latency up.
The blue is the total latency measured by FIO up in user
space.
The red is the NVRAM times measured by the protocol
analyzer.
And then the green is basically the difference,
which we ended up working out was certainly the variability
was mainly due to MSIX interrupt times.
And from that we got into polling.
Christoph's been doing quite a bit of work on polling in the block layer.
So I've been doing quite a bit of testing of that work,
looking at the impact of applying polling.
So very interestingly, if you actually plot the latency
of the NVRAM component of the
distribution it's very nicely bell-shaped. Central limit theorem is
obviously working for us here. So that tells me I'm submitting lots of random
variables together which tells me that's probably pretty good. It tells me that
there's probably not something deterministic in there. This is the
addition of multiple random variables which always converges to the central
limit theorem. So that's kind of nice. It's always converges to the central limit theorem. So that's kind of nice
It's always nice to get the central limit theorem into a presentation like that
So so pretty fun and for anyone who cares
this is really just the measurement of the time between seeing the submission queue doorbell and the time that we raise the MSI-X interrupt. All right, so, yeah.
All right, so that's kind of the NVMe background.
Now I wanna start getting into PMEM
and IO-PMEM a little bit more.
So I'm gonna have to use
some Linux kernel terminology here.
There's been a lot of contributions
around the NVDIMM region of the Linux kernel
in the last little while.
And it may or may not be because one of the large CPU vendors has some interesting technology that
they want to hang off the memory bus. You know who I'm talking about. But the great thing is that
this is really important work. And I think the entire industry benefits from it. So there may be some ulterior motive for it,
but I think we can all take advantage of the fact
that this is being done.
And there's a lot of very good programmers who are contributing
to that space.
It gives us a really good insight
into what's coming down the pipe, which
is one of the reasons why I really
enjoy tracking the Linux kernel development,
because it really tells you what kind of devices are
going to start appearing depending on the activity factor.
And I have a little script that basically just runs off the commit logs and works out
what areas of the kernel are being worked on the most based on submissions.
And that tells me where I need to go work.
So in order to prepare for this memory channel, and this is not the only work, but this is
some of the work, there's a few things that have been going along.
We have zone device.
Zone device has been around for a while,
but really it's a way of saying,
I have a lot of memory in my system.
I have a lot of stuff hanging off my DRAM bus.
I wanna split it up into zones.
So this is saying I've got this much physical memory,
and I'm gonna split it into regions.
We kind of do that with zone DMA, but that's more for historical reasons, about certain memory regions having
certain properties and not being accessible. This is really saying I'm going to have regions
of memory that maybe have different access characteristics than others. Maybe it's slower.
Maybe it doesn't have infinite endurance. Maybe it's something I want to reserve for use with a driver. And that's the second
thing. PMEM is a driver in the Linux kernel that basically takes a zone device region that's
allocated by the user at start time and says, rather than throwing this into the big pool of
memory that's part of the memory subsystem that you can give out through, you know, get_user_pages or kzalloc or whatever, I would like you to reserve this
memory region for a driver and that driver will be defined by the P-MEM driver.
So, actually, zone device is always reserved and will never go into the generic
memory.
Thank you.
Very good. So zone device basically says, this is a certain region.
Everything else, go in the standard pool.
Thank you, Christoph.
It's good having an expert in the room.
So basically what it means is that memory is not
available for your system memory.
So if you put, for example, if you've got a 64 gig system
and you put a 16 gig NVDIMM in and say I'm going to reserve
that 16 gig as a NVDIMM PMEM region, you don't have 64 gig for everything else.
It doesn't work that way.
Sorry, guys.
You're giving up memory capacity in order to have this particular region and service
it.
If you're going to do it, make sure you keep that region busy because you've taken memory
away from the system.
Now, obviously, NVDIMM is only so interesting because it's DRAM cost and so forth.
There's stuff coming from certain CPU vendors that shall remain nameless, Intel, that may have much
larger capacities and have properties where you really don't want to treat it like DRAM. So this makes a lot of sense for that. DAX, Christoph mentioned earlier,
DAX is really direct access. It's a way of basically telling the operating system, you
know, I may look like a file system. I may, you know, you may think that I'm basically
backed by a block device, but I'm a special kind of block device. I have special properties
in the sense that I'm kind of addressable at lower
than LBA level. I'm going to see if Christoph nods.
So basically what that means is you can,
you can optimize because you know the underlying system supports cache line addressability.
But at the downside is there's certain problems with doing that.
You no longer necessarily have the atomicity that you want.
And also what happens to things like the RAID stack when you start being able to change
things at byte or cache line level as opposed to block level?
Right, some issues there.
So the DAX Framework is definitely something you can take advantage of.
It's something that's provided as a service to file systems.
So file systems can take advantage of it.
Right now, ext4, which apparently is broken, and XFS are certainly two that I know of.
Ext2, but don't touch it either.
XFS, Jeff.
And then the last thing, which is kind of important, this is something that got added
a little later.
There was some discussion around struct page support.
So what's important about struct page support?
Well the interesting thing is if you want to do a DMA, right now in the Linux kernel
if you do a DMA you basically at some point probably do a get user page.
And somewhere in the lines
of code for get user page, there's a little bit of code that says basically if the memory
doesn't have struct page support, just bork and don't do the DMA because we're not going
to do it. We're not going to let you do it. So you can't DMA to a physical location that
doesn't have struct page backing essentially. Now initially, this was a problem even for the NVDIMM type
work because there were people who were saying we're going to have such large memory attached
devices, do you really want a struct page piece of metadata for every 4K page? At the
time we're talking 64 bytes for 4K. You can do huge pages and gigantic pages and I'm sure
at some point we'll do super gigantic
pages to reduce the overhead.
We already got super gigantic pages.
Can we call them super awesome?
Anyway, so what we did is we've actually added an option in the kernel config that says if
you want struct page backing for this memory,
you can have it.
And that's a kernel config.
And that's useful for things like direct IO and DMA.
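As a rough illustration of why struct page backing matters for DMA, the pin-then-map step a driver does looks something like the sketch below. It is approximate; get_user_pages_fast() in particular has changed signature across kernel versions.

```c
#include <linux/mm.h>           /* get_user_pages_fast(), put_page()   */
#include <linux/dma-mapping.h>  /* dma_map_page(), dma_mapping_error() */

/* Sketch of the usual pin-then-map dance a driver does before a DMA.
 * get_user_pages_fast() can only hand back struct page pointers, so if
 * the user mapping points at memory without struct page backing (for
 * example a plain ioremap()ed bar), this path fails and no DMA happens.
 * The signature shown is the 4.x-era one; it has changed since. */
static int pin_and_map_one_page(struct device *dev, unsigned long uaddr,
				struct page **page, dma_addr_t *dma)
{
	int got = get_user_pages_fast(uaddr, 1, 1 /* write */, page);

	if (got != 1)
		return got < 0 ? got : -EFAULT;

	/* Turn the struct page into a bus address the device can DMA to. */
	*dma = dma_map_page(dev, *page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev, *dma)) {
		put_page(*page);
		return -EIO;
	}
	return 0;
}
```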
So just to summarize, how does PMEM work?
Everyone can go do this.
You don't actually need an NVDIMM.
You can practice with just normal DRAM.
When you boot your kernel, you want to basically have a boot option which says memmap=, and then the size of the PMEM region, and where you want it to start; for example, memmap=8G!16G reserves 8 gig of physical address space starting at 16 gig as a PMEM region.
And the idea is that that's supposed to line up with where an NVDIMM plugs into your system.
So if you have an 8 gig NVDIMM, you'll have an 8G there.
And just make sure it sits where you want it to sit.
And you're going to have to do that to make sure that that's the case.
Otherwise, you've basically taken some DRAM and turned it into an NVDIMM, which doesn't work, because when you take the power away it's not going to keep your data. Or you've got an NVDIMM that could store your data for you, but you're just throwing it into the generic memory pool. So please make sure it lines up right. What happens is the PMEM driver will bind to this reserved region and it will
register a device in the device registration structure of the Linux kernel, and that will appear as /dev/pmem0 or 1 or 2, depending on how many of these things you have.
And it's just a block device.
And the great thing is all the block device goodness that lives in the kernel and user
space applies to this.
You can put a file system on it.
You can put a DAX file system on it.
You can put a non-DAX file system on it.
You're going to lose some performance by doing that, but you can do it.
You can make it part of a multi-disk structure.
Plexistor, I don't know if anyone from Plexistor is here, but they are talking about something
called M1FS, which takes PMEM devices and NVMe devices and puts a file system over the two and does
auto tiering and caching between the two.
Uses the fast memory where it can, uses the slow memory where it can't, hides that all
from you, the user.
You just see an awesome fast file system.
Put databases on there, you can do whatever.
It makes it easy to work
with persistent memory. We can do this today. Everyone knows how to use a block device.
So don't get me wrong, I do think the future is load store memory access. I think that's
kind of where it's going, but this is something we can do today and we get pretty interesting
performance. So let's talk about performance. PMEM performance.
So I have here some results.
The first number is latency in microseconds.
The second number is bandwidth in megabytes per second.
For all you geeks out there, the gory details are in the bottom.
I'm not going to go through them.
I'll send you that slide.
It should be online for anyone who wants it.
And then we have three different columns. I have QDepth 1, number of threads 1. So this is my
QDepth 1 results. You can see our NVMe device is going pretty high there. I'll talk about that in
a second. It's pretty bad. Never. Don't buy our product. QDepth 128. I thought this one was interesting because these numbers are, latency-wise,
these are kind of the same. And I think one of the things that's happening here is the PMEM driver
is servicing one I.O. at a time per thread. Maybe someone can correct me if I'm wrong,
but I thought that was kind of interesting. And then what's also interesting, I find, is that
as you basically get up to the higher QDepths, you definitely still get a lot of benefits.
But all this talk about sub-microsecond latency or very, very fast devices, it only applies
at QDepth 1.
And are we really going to be running things at QDepth 1?
Because at QDepth 1, the bandwidth is kind of sucky.
So it's like all this talk about ultra-low latency and whatnot, true at QDepth 1, but the number of I/Os at QDepth 1, even with
very small amounts of latency, it's not huge. Yes?
I suffer from my... yeah, I deserve that question. These are averages. Yeah, I don't have outliers.
I should, I should, but I don't have them here.
Yeah, I'll leave that as an exercise to the user.
All these scripts are online.
I'm pretty good at putting stuff on GitHub.
You can ping me if you wanna try running these
on your own systems.
Yes?
When you run these,
is it with MSI-X interrupts or with the polling?
Yeah, that's a good question.
So these results are with MSIX interrupts,
and we should go back and look at polling.
Yes, I agree.
Again, blame me for that.
And like I said, I'm a big fan of just giving you one set of data
and all the tools you need to generate your own.
So that's my excuse.
But I'll let you do that.
Interestingly, I mean, the PMEM driver, it's pretty fast, three microseconds.
That's pretty fast, right, for a block device.
You do a little faster.
Oh, Chris does.
Yeah.
I was surprised, actually, that you get that much latency.
Yeah, I was a little surprised.
I kind of dug into it a little bit.
As I briefly mentioned, right now we get reproducible four microsecond latency with NVMe.
Yeah, yeah, yeah. I think you have a godlike system or something.
Yeah, I think one of the things I've noticed is there's a lot of variability, right? You change
your system, you change the OS, you change the time of day sometimes, right? There's a lot of
stuff that's going on. This really low latency performance testing,
it's definitely a bit of a fine art right now.
Intel processors and other processors
put themselves in low power states.
You've got to be careful about that, careful about clocking,
all these things that you've got to keep an eye on.
And it means sometimes you get data,
and you're like, this data is insane.
So you have to apply a little bit of judgment.
So this is definitely a snapshot.
Don't treat it as de facto. I recommend people go measure it for themselves. Linux kernel is free.
You can have it on your laptop in 20 minutes. You can even do measurements like this on a VM.
You're not going to get great data, but you can do it. I've done it. It's kind of fun. You SSH,
and you think you're on a bare metal machine. I've SSH into my VM before. Done the measurements.
Gone, oh, that doesn't look right.
You're like, oh, I'm on my VM.
That's why.
Like a QEMU KVM.
It has a virtio emulation that might be quite slow.
Yeah, yeah.
So what did we do to change PMEM?
PMEM is in the kernel.
You can go grab it.
We had to make some changes.
And I want to talk a little about those changes.
Like I said, we're a great believer in putting the code
that we talk about online.
So there's a GitHub repository at the bottom
with a fork of the Linux kernel.
It's a little out of date.
I think it's 4.5.
We're now almost at 4.8. Um, but that's, uh, something you can go pull.
We did 88 lines of changes that basically enabled struct page support for zone
device memory that resides on IO memory space.
Our change was really this part here.
We already had struct page support for zone device.
We just wanted to make sure that even if the zone device sits on IO memory, it gets struct page backing.
Okay, that's the change.
I've got some good news for you.
What?
The ARM people are getting an operation that allows you to map the PCIe bar through the IOMMU.
Excellent.
So just implement support for all the x86 IOMMUs and you'll get there.
How many of those are there?
I think it's just Intel and AMD, basically.
Yeah, yeah, okay.
Legacy, IBM and SGI.
I have a slide on this, Christoph.
We'll get there.
Now, pmem.c won't work with our change, because pmem.c is trying to look at system memory.
We need a PCIe driver, right?
It turns out PCIe drivers are pretty easy to write because the PCIe bus subsystem has a lot of functionality you can
just tie into.
So we wrote an example driver.
So we didn't change the NVMe driver, but I'll talk about
that in a minute.
What we did is we went, if you want your device to be an IO
PMEM device, then use this driver.
This driver basically says, if there's an IO bar, let's turn it into a block device. So let's do what we did for PMEM, for iopmem memory. So if you go back to one of the earlier slides where I showed two bars, basically what it will do is it will take that one gig bar and turn it into a one gig block device called /dev/iopmem0, right? That you can start hammering with IO, okay?
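The core of a driver like that is pretty small. Here is a hedged sketch of the idea, not the actual iopmem.c from our RFC; devm_memremap_pages() has changed signature over kernel versions, and iopmem_attach_disk() is a made-up placeholder for the gendisk and request queue setup.

```c
#include <linux/pci.h>
#include <linux/memremap.h>

/* Hedged sketch of the iopmem idea, not the actual RFC patch: claim a
 * PCIe bar, give it ZONE_DEVICE struct page backing, and register it as
 * a block device much as pmem.c does for NVDIMM regions. */
static int iopmem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct resource *bar;
	void *addr;
	int err;

	err = pcim_enable_device(pdev);
	if (err)
		return err;

	/* Which bar to expose was a module parameter in our driver;
	 * bar 4 here is only an example. */
	bar = &pdev->resource[4];

	/* Map the bar and create struct pages for it, so that
	 * get_user_pages() and DMA mapping can target this I/O memory
	 * much like ordinary RAM.  Signature varies by kernel version. */
	addr = devm_memremap_pages(&pdev->dev, bar);
	if (IS_ERR(addr))
		return PTR_ERR(addr);

	/* From here on it is ordinary block device plumbing: a request
	 * handler that memcpy()s to and from 'addr', surfaced as
	 * /dev/iopmem0 with DAX enabled.  iopmem_attach_disk() is a
	 * placeholder for that setup. */
	return iopmem_attach_disk(pdev, addr, resource_size(bar));
}
```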
This is my disclaimer,
so I don't get beaten up by the Linux kernel people.
We submitted this to the kernel as an RFC
intentionally to generate discussion.
We were not necessarily saying in its current form,
this must be accepted
because I think there's other ways of solving this problem.
Christoph mentioned one.
So this is an interesting idea that moves us down a path,
but it's not necessarily, even in my opinion,
the best way that we want to solve this problem.
That said, I do want to talk about some of the things
that it lets us do.
So the example driver, iopmem.c, it's
a self-contained PCIe driver.
You could take parts of it and put it in the NVMe PCIe host
driver and take that functionality.
You could do that.
In our case, what we were doing is unbinding the NVMe driver
from our card that I showed you earlier,
and then binding this driver to it.
We had a module parameter to identify
which bar we were exposing.
In theory, though, you would actually tie
it into that CMB part of the NVMe driver. Thank you. At the back. For now, we use the
entire bar. We have a DAX enabled block device and basically you can put a file system on
top of that. One of the DAX enabled file systems. We also, just for fun, put in an mmap operation in that driver.
So now it starts to get funky, because we are essentially ioremapping something to turn it into a block device that we can then mmap.
So worlds within worlds.
But that is interesting.
It did work.
And it would actually let you essentially mmap the bar into a virtual process space
and do things in that.
The other thing that we can do is if we put a DAX-enabled file system on there, you can
put files on it and m-map those files.
That's totally legitimate.
We do that with block devices all the time.
And that gives you basically cache line accessibility
into files on a file system,
which is often easier to work with
than the LBAs of a raw block device, right?
We like, you know, file systems are good for a reason.
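For a feel of what that looks like from an application, here is a minimal user-space sketch. The mount point and file name are assumptions for illustration, and real code would likely use libpmem or explicit cache flushes rather than plain msync().

```c
/* Minimal user-space sketch: map a file that lives on a DAX-mounted
 * filesystem and store to it with plain load/store semantics.  The
 * mount point and file are assumptions; durability handling is
 * simplified. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* /mnt/iopmem is an assumed DAX mount on /dev/iopmem0 (or
	 * /dev/pmem0); the file is assumed to exist and be at least
	 * one page long. */
	int fd = open("/mnt/iopmem/log.bin", O_RDWR);
	if (fd < 0)
		return 1;

	char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Byte and cache line granular access, no block I/O in the path. */
	memcpy(buf, "hello, persistent world", 24);

	/* msync() is the portable way to ask for durability; on DAX this
	 * flushes the relevant CPU cache lines rather than writing blocks. */
	msync(buf, 4096, MS_SYNC);

	munmap(buf, 4096);
	close(fd);
	return 0;
}
```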
So all that done, let's talk about some performance data.
Anyone recognize this?
It looks a lot like an NVMe over Fabrics deployment.
So let's imagine that we have a couple of processors.
Maybe one of them has an IOP mem attached.
Ultimately, maybe we can take IOP mem
and replace it with NVMe with a controller memory buffer.
Here we have an RDMA NIC.
And in a standard NVMe over Fabrics flow, for a write,
what happens is I write some data
or I write a request for a command,
assuming it doesn't go in-capsule.
There's going to be a buffer set up here in DRAM.
Basically, at some point, this RDMA NIC
is going to come and request the data.
It gets DMAed to this NIC, encapsulated in your favorite fabric protocol, whether it's RoCE, iWARP, InfiniBand, you can do RDMA over anything.
We could do RDMA over squirrels if you wanted.
Performance would suck, but the squirrels would...
It's just a way of carrying a message, right?
So it comes over here, the DRAM or the data ends up here,
and then the block device layer basically issues an I.O. which goes down to the device.
Now though, there's memory here. Can I register this memory against this as a memory region?
Yes, I can. Okay? With our change, you can. So now, this is where the data ends up. So
my NVMe over Fabrics write actually becomes here, here, here, through a PCIe switch straight
into here.
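In user-space verbs terms, registering that mapped bar so the NIC can DMA straight into it looks like the sketch below. Take it as an analogy only; the in-kernel NVMe over Fabrics target uses the kernel RDMA API rather than libibverbs, and the cmb pointer is assumed to be the mmapped bar from earlier.

```c
/* User-space analogy for registering a mapped CMB/bar region as an RDMA
 * memory region, so the NIC can place incoming write data directly into
 * device memory rather than into a DRAM bounce buffer. */
#include <infiniband/verbs.h>

struct ibv_mr *register_cmb(struct ibv_pd *pd, void *cmb, size_t len)
{
	/* Registration pins the memory and hands the NIC a key for it.
	 * With struct page backing in place for the bar, that pinning
	 * can succeed even though this isn't system DRAM. */
	return ibv_reg_mr(pd, cmb, len,
			  IBV_ACCESS_LOCAL_WRITE |
			  IBV_ACCESS_REMOTE_WRITE |
			  IBV_ACCESS_REMOTE_READ);
}
```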
Do I need the switch?
No you don't.
Are there some performance issues with Intel's peer-to-peer PCIe transfers?
Yes there is.
All right?
By the way, this company sells PCIe.
By the way, thank you, Christoph.
We just happen to sell very, very good PCIe switches as well.
I have one in my pocket.
So we can move data.
We were able to transfer data into this device at about four gigabytes per second, so over
the whole network.
Not super fast, but the limiting factor was actually this device.
It's just a limitation of its capabilities.
If you have more than one device, you can scale that up
until this RDMA NIC or the Fabric or something else becomes the limitation.
And, of course, competitors or other people could come up with other devices
that have much better performance numbers.
Reads, we were doing 1.2 gigabytes per second.
And again, that's just the limitation
of this particular device.
Other devices should be able to do much better, worse,
depending on how you implement it.
And of course, remember, latency is hard.
Throughput is easy.
Throughput is just IOP mem, IOP mem, IOP mem, IOP mem.
Do it in parallel.
All right.
Latency was kind of interesting.
We're not talking additive latency.
We were getting sub-three microsecond latency for RDMA reads off that IOP mem.
So maybe it's worth going back.
For this guy to access this bar, performance sucks.
Everybody knows that, right? You don't treat a bar on
a device as something to load store into from your local CPU. You've got L3 caches. There's
a reason we have an L3 cache. It's because writes suck. You want to actually cache the
writes and flush them when they get bigger. So small writes kind of suck. Reads are also
not great. So local access typically is not great, which is why we put in DMA engines
Interestingly because this guy has a DMA engine performance here is actually very very good
Alright, so the interesting thing about IOP mem is it serves best as a peer-to-peer communicator not a peer to the host communicator
So we want to look at applications which are more peer-to-peer. I'll get to those.
I don't know if I have that with me today. Sub-three-microsecond read times, depending on the block size. Pretty low latency access. We also did transfers between the iopmem device and an NVMe SSD. Think data replication.
This could be an NVMe SSD.
That could be an NVMe SSD.
So very quickly, because of time,
I want to get into some use cases.
So background copying between NVMe devices.
We could write something in the host,
tie it into the RAID stack or the multi-device layer,
the device mapper layer, the MD layer,
where we could basically get lazy data replication
in the background. These guys communicate with each other and they basically take copies
of each other's data and tile them out to give you some form of lazy data replication.
The host OS is still in control. It's still issuing all the I.O. It's not like I'm getting devices to talk to each other without the OS's knowing. The OS is the conductor.
The drives are the orchestra.
So they communicate with each other. The other great thing is, like I said earlier, there's no
data traffic flow between the memory subsystem here. This is staying pretty idle.
Even though this could be
scrapped, it's almost like a duck, right?
You look at the legs under the water, they're going crazy.
But this guy is moving gracefully.
Just.
Let's imagine you don't want to put CMBs on your NVMe SSD.
Let's just say the market isn't there yet or whatever,
there's nothing to stop you having standard non-CMB enabled SSDs and an accelerator device
with IOP mem capabilities, like an FPGA card. Then that FPGA device could be doing all kinds
of things, erasure coding, RAID parity generation, background deduplication, maybe security scrubbing,
maybe doing some kind of analytics on the data,
looking for certain structures, right?
These are all things that can be done, right?
And you could even expose this device
as another type of NVMe device if you so wanted, right?
You could tie it into the NVMe system.
You can tie it into the software stacks
that are running on the processor.
This one, obviously, NVMe over Fabrics,
tie it in to the RDMA NIC.
So rather than having all the data hit DRAM and go down,
it just goes to the specific drive,
and then the I-O execution pulls it into the NVMe devices.
And then the last one is, again, if you
want to save some money, potentially you only
have one NVMe SSD with a CMB. It acts like a write cache, and then you lazily copy it out to
other devices later in time.
And then another one, which I'm not going to get into.
So the Linux kernel has been changing.
It always changes and it's lovely to track it
and see where it's changing.
Lots of new ways to attach NVM.
PMEM is a really easy one for us to target,
because it treats PMEM as a block device.
It's not optimal, but it's something we can work with today.
We did some extensions to the Linux kernel
that turn on iopmem, which is related to controller memory buffers in the NVMe spec.
And I think there's a lot of interesting use cases
we can take advantage of, where we offload both DMA traffic
and functionality from the CPU.
Next steps.
So right now, nothing upstream supports this DMA
between PCIe devices.
There's been a few proposals. We've had PeerDirect for a long time from Mellanox for GPU RDMA.
IOPmem, we also have a recent submission based around DMA buffers.
And Christoph just mentioned another one tied into the IOMMU of ARM cores that we need to go take a look at.
I think as a whole, though, I mean,
the community in Linux works best when everyone gets
together, sits down, and goes, there's
lots of ways to solve this problem,
but which one's the right one?
Which one is the one that gives us a good API that consumers
can enjoy and consume that's going
to be best long-term going forward, that addresses
the majority of use cases, the majority
of concerns, the majority of issues.
I don't think individual companies throwing things at Linux RDMA is going to be the right
way to solve it.
I think the right way is to get the right people talking about it.
And that's something that we're trying to do.
There are issues around this.
There's security.
You're going to allow devices to DMA to places they haven't typically been allowed to DMA to. There's routing issues if you have complicated switching
and bridging. Can the devices actually see each other the way you think they
can? There's coherency, right? PCIe memory is not coherent necessarily, or at all.
And there's architecture specifics, right? ARM is different to x86, is different to MIPS, is different to anything else.
But I think there's a lot of potential behind this idea.
I think the fact that it ties into NVMe CMBs,
ties into NVMe over fabrics, ties into acceleration,
makes it very timely.
And, you know, I'm looking forward to getting the industry to have a discussion around it and moving forward.
Thank you very much.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the developer community.
For additional information about the Storage Developer Conference, visit storagedeveloper.org.