Storage Developer Conference - #67: p2pmem: Enabling PCIe Peer-2-Peer in Linux

Episode Date: March 21, 2018

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast Episode 67. Thanks for coming. I'm Steve Bates.
Starting point is 00:00:42 I'm CTO at Eideticom, and I've also got a consulting company. What I'm talking about today is a little more general than that. As a kind of pet passion of mine, I've been on this four or five year journey now of making PCIe endpoints do DMAs to each other. And right now, I feel like we've gotten to the point where, with regards to the Linux community, a lot of people see that there's potential value in this. And there are challenges around what it is that we're trying to do, but there's also so much potential benefit that it feels like it's more a case now of working out how we get this upstream rather than whether it goes upstream at all.
Starting point is 00:01:27 We'll see. I do want to talk about the latest incarnation of our fork of the Linux kernel that is a framework for helping PCIe endpoints talk to each other, still under OS supervision, but on the data plane at least trying to keep the traffic away from the root complex and the DDR. So I'm talking about that. And then very nicely, at the same time as I've been working on this over the last few years, both with people like Tom in the room here on the RDMA side and then
Starting point is 00:01:59 a whole bunch of us on the NVMe side, I've actually been looking at, if we'd actually start to work within this framework, what are some of the interesting things we can do around persistent memory that could be potentially remote addressable that isn't an NVDIMM. So... [Audience member] Can I interrupt you? I apologize.
Starting point is 00:02:18 Can you switch off your mic? I mean, switch back on again. It sounds like perhaps there's a frequency something issue there. Any better? Hello? Hello? There's some noise on there. That's weird.
Starting point is 00:02:32 Where did that come from? Do you have a phone in your pocket? No. I'm actually a robot. That could be the issue. Yeah. Yeah. Let's plow ahead. Yeah. Yeah. No worries.
Starting point is 00:02:48 Thank you. So there's some interesting things that are happening there. We actually have some kind of advanced products, prototypes that we're using with things like controller memory buffers and even now the new NVMe persistent memory regions. So we're actually in our lab starting to do some real remote persistent memory over fabrics to PCIe-attached memory on the target rather than just NVDIMM type stuff. And I think that's kind of cool.
Starting point is 00:03:14 So anyway, what's the rationale? Or, first, a little bit of nomenclature. I always have to put in a picture of my kids. My son coined the phrase blucky. It's the combination of blah and yucky. But PCIe peer-to-peer using p2pmem is not blucky. So it's actually pretty cool. The rationale: we're drinking from a fire hose.
Starting point is 00:03:37 I didn't have a fire hose, but it's a cute picture. This is not one of my kids. RDMA NICs are getting faster and faster and faster. I keep telling Adan to slow down. Stop, stop, stop. But he's going, no, Stephen, no, no. Michael is beating my ass. 50, we were doing 10.
Starting point is 00:03:56 We were doing 40. Then Brad Booth and the 802.3 guys got smacked down. And they went 25 gig a lane. Now we can do 25, 50, 100. 200 is coming pretty soon. The NICs are incredibly fast, right? NVMe SSDs, a million IOPS per SSD. Samsung gave a talk yesterday and like we were looking at numbers and they were like pretty big numbers, like, you know, eight or nine gigabytes per second. But I went, hang on, that's only like half of the potential of those four or five SSDs. It's very easy now to get a PCIe subsystem that's
Starting point is 00:04:28 running at multiple tens of gigabytes per second. Like, easy. Like, a flash array: throw in some graphics cards, throw in a couple of RDMA NICs, you're talking 20, 30 gigabytes per second. At the same time, we're starting to see the NVMe storage systems come on the market. Newisys had a couple on show. Celestica, I don't know if you saw there, but they had some on show.
Starting point is 00:04:51 So we're starting to be able to buy these hardware devices that can allow us to populate them with lots of endpoints. And all of that data is coming in at this incredible rate. And the CPU is like, Jesus Christ, guys, come on. It's the... ah, whatever. We basically have got to the point where even if it's only a couple of instructions per IO, you're absolutely screwed. You're just dead.
Starting point is 00:05:17 And there's not that much you can do in a couple of instructions per IO. So we've gotta start thinking about how do we help the CPU out here and maybe let it focus on the important thing. Because the other problem that we have is the second point on this slide. Right now, certainly in the major OSs and for sure in Linux, every single I.O. that's coming in on an RDMA NIC that has to go to an NVMe drive requires a bounce buffer that's in system memory. And that means that if you're doing 20 million IOPS, all those 20 million IOPS, and Dan showed his curvy diagram earlier today. It's part of the reason why I'm so tired. Every IO has to go to system
Starting point is 00:05:57 memory and down to the drives. Every read has to come out of the drives, be DMA'd to system memory, and out through the RDMA NIC. If you have a graphics card that's sharing some stuff and you're not using out-of-tree kernel hacks like GPU Direct, it has to go through system memory. So the poor processor is like, I don't even want this effing data. I don't even want it, and you're passing it through my root complex,
Starting point is 00:06:19 and I have to pump it out through the integrated memory controller to the DDR and then pull it back in through the integrated memory controller. I've got threads here that are trying to make me some money. I'm watching my Netflix movies. I'm doing my fintech. I'm checking the stock market. And those poor processors are doing load-stores through the same integrated memory controller that's getting annihilated with like 20 gigabytes per second. And I'm sorry, memory bandwidths are pretty big, but fuck, that's not cool, right? So the quality of service of the load-stores that are being done by the cores, that's not good, right?
Starting point is 00:06:57 So that's kind of the noisy neighbor problem. So we still like to have an operating system. There's some people who are moving away from having an operating system at all, but I think that's pretty insane because I kind of like being able to work out what happens when things go wrong. So it's kind of nice to think about a paradigm where the operating system is still potentially, it's still the management, it's still maybe issuing the IO requests and managing what happens when things go wrong and maybe servicing some interrupts and so forth. But the data path, if we can, keep it away from the root complex of the processor. So there's many ways of solving that problem. I've been looking at the Linux kernel specifically,
Starting point is 00:07:43 and some of the people in the room have. And, you know, there's been a few different ways. I actually presented this same kind of topic last year, but it was quite a different set of kernel patches than this one. This one is maybe better, it's maybe worse, whatever. And there's a few people out there in the community who have kind of contributed or said, hell no, this is not getting in. And that's all fine because, you know, the kernel has an awful lot of people who care about it and it does a lot of different things. And we have to be very careful about what we put in
Starting point is 00:08:11 because once it's in, it's pretty much never coming out. So we want to make sure we get this right. So this is just an example that we've proposed and we're discussing in the community. And maybe it will go upstream, maybe something different to it will. But the main idea is really, right now in the Linux kernel, if you just pass a random 64-bit value down to the DMA API, it's probably pretty sensible for the kernel to do some checks on that, because you may have just given it a completely random
Starting point is 00:08:40 address. And the DMA engines that are in these NICs and whatever, they have no idea where they're DMAing to. They're just a piece of RTL where you give it some values, you know, via a PCIe driver, and you say, whatever comes in on your Ethernet port, depacketize it and push it like crazy to this address range. That's fine if it's an address range that you've reserved. If that's the kernel stack, bad things are going to happen. If that's the VGA memory map location, you will see some mad shit on your screen.
Starting point is 00:09:10 And if that's the blow-up-the-computer-please poke address, then you better make sure you're not in the room. So the kernel likes to make sure that things are being checked. And right now, if you try to pass a PCIe bar address to the Linux DMA API, it will return an error code: I'm sorry, that's not valid. And one of the reasons why that is, is because IO memory doesn't have a struct page backing to it.
Starting point is 00:09:37 And there's a lot of reasons about why that is. But the main thing we should understand is that right now, I can't set up a DMA engine on an RDMA NIC to push data to a PCIe bar that's somewhere else in the PCIe subsystem. Yes? Yeah. Well, they're doing it from user space.
Starting point is 00:09:59 So you can use SPDK or a hacked version of SPDK to do this, but the upstream kernel, the one you get from Linus, will not let you do what I just said. So my emphasis is, I want everybody to benefit from this, and I would like it to be in the upstream kernel. I can do it today with my own hack of the kernel. You could do it with your hack, and you could do it with your hack, and Excelero can do it with theirs. We can add it to SPDK, and they'll be able to do it there.
Starting point is 00:10:27 But this particular conversation is around the Linux kernel. You see what I mean? So there's nothing physically stopping anybody doing this. It's just, if you want to do it in an upstream Linux environment, then we have to obviously get it upstream. And they have requirements that things like Excelero don't. Because Excelero is probably specific to a couple of architectures, and they're probably willing to do some things that an operating system is probably not going to want to let you do, and stuff like that.
Starting point is 00:10:56 Around security, IOMMUs, stuff like that. So a little bit of background on some of the things that we use within the kernel. The kernel is open source. Everybody probably knows where it is. You can go and grab the kernel code for some of these things. But some of these are quite important. So zone device is basically the ability for us to take a range of memory addresses, which in Linux are called physical frame numbers, PFNs.
Starting point is 00:11:25 So every 4K of memory in Linux basically has a PFN associated with it. And then some of those PFNs have struct page backing, but not all of them. So PCIe bars may have a PFN associated with them, but they don't have a struct page in the upstream kernel. Basically, this allows you to introduce memory to the system but not throw it into the standard memory management system. So this might even be an NVDIMM-N that you don't want the
Starting point is 00:11:52 operating system to use for physical memory allocation in the way that it otherwise would, because you're going to mount a file system on it and treat it like a PMEM device or something like that. Intel were actually the guys who introduced that, mainly looking at the Apache Pass and NVDIMM kind of stuff. PMEM is basically a device driver that lives on top of a range of PFNs
Starting point is 00:12:16 and turns it into a block device. So if you want to have an NVDIMM block device in your system, one of a few different ways of doing that is to register it as its own device and then put the PMEM driver on top. You get a block device. And then if you want, you can put a file system on top. Just be pretty careful about what file system you put on,
Starting point is 00:12:34 because some of them don't work so well, including ones that actually claim that they did work, like EXT4 with DAX. DAX is a framework that we added kind of at the file system layer that says, hey, some of these block devices that you're talking to now aren't actually block devices. They don't have the same sector atomicity that we expect from a block device. They're actually constructed from physical page ranges.
Starting point is 00:13:01 And so what DAX allows you to do is to take a file system that's just a normal legacy file system like EXT4, and it actually provides a framework, and we thought it worked, to help actually make it work on physical memory. But surprise, surprise, as you dig into it, you start finding corner cases, and there's issues with that. But this is the way in which we can upgrade file systems, like EXT4 and XFS and all that kind of thing, to make them work on an NVDIMM,
Starting point is 00:13:33 or to make them work on anything else that looks like a bunch of physical memory. And then struct page support. Basically, if you want to use the DMA API or some other things to do with memory, it's useful to have this. It's a little bit of metadata for every page of data. So for basically every page in your system, and in Intel systems that's 4K, on ARM systems sometimes it's 4K, sometimes it's 64K. You basically have 64 bytes of metadata that tells you stuff about what that physical data is doing.
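For anyone following along in the code, here is a very rough sketch of that zone device plus struct page idea from a driver's point of view, assuming a recent kernel. The exact devm_memremap_pages signature and dev_pagemap fields have shifted over the years, and the 4.12-era fork discussed here used an older form, so treat the names as approximate.

```c
#include <linux/memremap.h>
#include <linux/pci.h>

/*
 * Ask the kernel for struct pages covering a PCIe BAR without handing
 * the range to the normal page allocator.
 */
static void *map_bar_as_pages(struct pci_dev *pdev, int bar,
			      struct dev_pagemap *pgmap)
{
	pgmap->range.start = pci_resource_start(pdev, bar);
	pgmap->range.end   = pci_resource_end(pdev, bar);
	pgmap->nr_range    = 1;
	pgmap->type        = MEMORY_DEVICE_PCI_P2PDMA;  /* it's IO memory */

	/* On success, every 4K page in the BAR now has a struct page. */
	return devm_memremap_pages(&pdev->dev, pgmap);
}
```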
Starting point is 00:14:02 So these are all very much in the plumbing. GitHub, I love GitHub. So anybody who wants to go, it's really long, but anyway. You can go to that tree. There's a couple of other trees that are up there. But linux-p2pmem is where we kind of keep this work right now. It's all just GPL-licensed code. Anybody can take it. You can do whatever
Starting point is 00:14:25 you like with it. Basically, there's a whole bunch of patches. You can't see them all, but we have about 20 patches that sit on top. And right now, we're rebased off 4.12.3. So, we're based off a stable. We haven't rebased on the 4.13. We typically rebase on every major, but we haven't done it for a little while, I guess. We've been busy doing a startup. But a whole bunch of code. And basically what we're doing here is the biggest change is basically in the section of zone device code I talked about earlier, we've added some additional flags that say, hey, we know you typically don't do this,
Starting point is 00:15:07 but we're going to add some new memory in the system. And can you give it struct page backing? And oh, by the way, it's IO memory. And it's a little bit naughty, and we've had some discussions on the kernel community about why that's naughty. But it certainly works. It's just maybe it's not upstreamable. But it does work.
Starting point is 00:15:22 We added some additional changes on top of that. And what we did is we created a new device class called p2pmem. And you can think of p2pmem as like a kernel-wide orchestrator, which is a device in the device tree that drivers can basically either donate memory to or take memory from. And we built it on top of genalloc, which is a standard allocator in the Linux kernel for handing out memory that's a special type of memory. So basically what we do is we give something like
Starting point is 00:15:51 a PCIe driver a framework that it can call into and say, hey, I have a bar. I don't need this bar. It's not for configuration. It's not for my driver. It's a bar. Maybe it's backed by DRAM. Maybe it's backed by spin-torque MRAM. Maybe it's backed by some awesome new memory technology that may or may not eventually appear from Intel. Maybe it's backed by something else, right? It's just a bar. Maybe it's backed by nothing. Maybe it's just fake, right? I mean, you can do that. I would like to give that memory to the Linux kernel. And I would like other drivers, if they like, to basically do something with that memory. So p2pmem becomes basically the go-to place to either donate or grab memory from.
Starting point is 00:16:38 So you can think about it like that. And it has a character device. You have /dev/p2pmem. And you can go in and do ioctls and open and read and write and stuff like that. Okay, so that's the first step. The next step is you take a PCIe driver, either one you write yourself or you hack one that already exists. I happen to hack the NVMe driver. It's already there, and we have a standard way in NVMe for exposing a bar. Well, there's now two ways of doing it. One is the controller memory buffer, and the other is the persistent memory region.
Starting point is 00:17:14 And they're totally standard. There's config registers you can set to advertise if you have a CMB, which bar it's on, does it have an offset, what kind of data can you put in the CMB, and now we have ways of saying that it's a persistent memory region. We have even ways of saying how you guarantee consistency, so we have a mechanism for doing that. That's pretty cool. We can advertise the size of it.
Starting point is 00:17:38 I think we can even advertise things like how long it takes for it to flush on a power fail, and we have a way for it to say it's not quite ready yet, even if the rest of the drive is ready. Blah, blah, blah. So one of these changes, and you can just go to GitHub and look at it, takes the NVMe driver, and there's not a lot of changes.
Starting point is 00:17:58 It's like maybe 10 lines of changes to the NVMe driver. And it says, I am going to look at my CMB registers and see, do I have a bar? And if I have a bar, I'm going to give this much of it to p2pmem. So p2pmem is now aware of it and can give it to somebody else. So that's the mechanism by which suddenly the kernel knows: ah, Stephen's just plugged in his lovely new startup's NVMe drive that has a controller memory buffer, and that controller memory buffer supports all data types
Starting point is 00:18:25 and it's this big. And now I, as the kernel, I'm aware of that. I create a genalloc pool out of it. And other people can now come to me and do genalloc requests against the memory. And I can choose to give it to them, or I can say, nope, I'm not giving it to you because I gave it to them, or I don't like who you are, or, you know, you're not trusted. So the kernel is completely in control here.
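To make that flow concrete, here is a hypothetical sketch of the sort of ten-line NVMe driver change being described. The p2pmem_* names are made-up stand-ins for the out-of-tree API; only the CMBLOC and CMBSZ register offsets come from the NVMe spec.

```c
#include <linux/err.h>
#include <linux/io.h>
#include <linux/pci.h>

/* Made-up names standing in for the out-of-tree p2pmem API. */
struct p2pmem_dev;
struct p2pmem_dev *p2pmem_create(struct device *parent);
int p2pmem_add_range(struct p2pmem_dev *p, resource_size_t start,
		     resource_size_t size);

static int nvme_donate_cmb(struct pci_dev *pdev, void __iomem *regs)
{
	u32 cmbloc = readl(regs + 0x38);     /* NVMe CMBLOC register */
	u32 cmbsz  = readl(regs + 0x3c);     /* NVMe CMBSZ register  */
	int bar    = cmbloc & 0x7;           /* BIR: which bar holds the CMB */
	struct p2pmem_dev *p2p;

	if (!cmbsz)
		return -ENODEV;              /* controller has no CMB */

	p2p = p2pmem_create(&pdev->dev);     /* shows up as /dev/p2pmemN */
	if (IS_ERR(p2p))
		return PTR_ERR(p2p);

	/*
	 * For brevity, donate the whole bar; a real driver would apply the
	 * offset and size decoded from CMBLOC/CMBSZ before handing the
	 * range to the genalloc pool behind p2pmem.
	 */
	return p2pmem_add_range(p2p, pci_resource_start(pdev, bar),
				pci_resource_len(pdev, bar));
}
```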
Starting point is 00:18:42 So p2pmem, basically: if we have a bar, we can do stuff with it. So what can we do with this? Like, what can we do when we have this? So I'm going to go into a couple of examples. And this isn't just about RDMA.
Starting point is 00:19:09 Sorry, guys. It's not just about NVMe. This is any PCIe device that either wants to borrow memory or wants to DMA to another device or that has memory that it can donate. Why would you donate some of your memory? Well, let's say we have two NVMe drives and they want to copy data from one to the other. The classic standard way of doing that is this is our setup.
Starting point is 00:19:38 We have an Intel server. We have the Microsemi switch. I used to work for them. We have some standard NVMe SSDs. These do not have CMBs. These are just standard NVMe SSDs. And then in this case here, I actually have a Microsemi NVRAM card,
Starting point is 00:19:54 which basically has a large bar on it. And so it basically becomes a P2P mem device, and the CMB basically gets mapped in. What I can do is I can write some code now that says, can you please take a lot of data, like all the data on this drive, and put all of that data on this drive, but don't copy to DRAM and down.
Starting point is 00:20:16 Can you copy to this and down? I don't have to change user space at all. It just works. Basically, what happens in user space is, when this CMB appears up here, it shows up as a /dev/p2pmem device with its own major number: p2pmem0 if it's the first one in the system, 1 if it's the second.
Starting point is 00:20:36 We have a sysfs framework for working out which p2pmem belongs to which CMB, you'll be glad to hear, so you know where in the system your memory sits. And you open that, you mmap it, and you basically use that as your buffer for your read or your preadv or whatever, and everything just starts to work. And what happens is all the data gets copied, but none of it goes up here. So basically what happens is the copy speed is about the same; we're not really seeing a throughput advantage, but the upstream port TLP count and the DRAM requirements go literally 10 orders of magnitude down.
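A minimal user-space sketch of that copy, assuming the first p2pmem device, two hypothetical NVMe block devices, and O_DIRECT so the page cache stays out of the data path; the mmap'd buffer sits in the donated bar, so both SSDs DMA straight into and out of it rather than into DRAM.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)   /* copy in 1 MiB pieces (arbitrary) */

int main(void)
{
	int p2p = open("/dev/p2pmem0", O_RDWR);
	int src = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	int dst = open("/dev/nvme1n1", O_WRONLY | O_DIRECT);

	/* The buffer lives in the donated PCIe bar, not in system DRAM. */
	void *buf = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE, MAP_SHARED, p2p, 0);
	if (p2p < 0 || src < 0 || dst < 0 || buf == MAP_FAILED)
		return 1;

	for (off_t off = 0; ; off += CHUNK) {
		ssize_t n = pread(src, buf, CHUNK, off);   /* SSD DMAs into the bar  */
		if (n <= 0)
			break;
		if (pwrite(dst, buf, n, off) != n)         /* SSD DMAs out of the bar */
			return 1;
	}
	return 0;
}
```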
Starting point is 00:21:04 Yeah, it's going to suck bad. That's why we want to know where in our PCIe tree the devices are. If this switch isn't here and you're directly connected, sometimes it doesn't even fucking work. Because sometimes people's PCIe is so shit that it doesn't let you do peer-to-peer. That's how you really do this.
Starting point is 00:21:47 And it's not just Intel. I've tested on POWER. I've tested on ARM. I can't say exactly which ones fail and which work. Some of those are under NDA. But let's just say it happens more often than not. So this is going to be something that's maybe going to have to be constrained to switch environments. But all-flash arrays and GPU clusters
Starting point is 00:22:05 are the kind of places where this might actually be interesting anyway. But definitely something to be cognizant of. And like I said, that's why we have the sysfs, so we can actually troll through the tree. We have got a compatibility function in there, so we can have rules on when P2P is allowed and when it's not. The operating system can step in and impose certain rules
Starting point is 00:22:23 on top of that. So there's ways and means that we have of ensuring that it's done only when it should be done. And right now, the rule is basically: if you're both connected to the same switch, okay. If you're not both connected to the same switch, for now we're saying no. But that's just a rule. That's a line of code. And the kernel community can decide as a collective what the rules should be, or whether we need some other methodology for doing that.
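As a sketch of what that same-switch rule can look like in code, here is a hypothetical helper, not the actual patch: walk each endpoint up to its bridge and allow peer-to-peer only if both sit below the same switch upstream port.

```c
#include <linux/pci.h>

static bool p2p_peers_share_switch(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *up_a = pci_upstream_bridge(a);  /* downstream port above a */
	struct pci_dev *up_b = pci_upstream_bridge(b);  /* downstream port above b */

	if (!up_a || !up_b)
		return false;   /* hanging straight off a root bus: be conservative */

	/* Downstream ports of one switch share the same upstream port. */
	return pci_upstream_bridge(up_a) &&
	       pci_upstream_bridge(up_a) == pci_upstream_bridge(up_b);
}
```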
Starting point is 00:22:49 So this particular example involved two DMAs for every movement, right? This guy DMAs to the buffer, and then this guy does a DMA from the buffer into whatever's behind it, Optane or NAND or spin torque. But the PCIe bar could live in one of the drives. That's a CMB.
Starting point is 00:23:18 So the last example I call the bounce buffer example. We introduce a device that acts as the bounce buffer. But now one of the bounce buffers could be one of the NVMe drives. So one of the drives could have a CMB, and the other could not. And that CMB could be a persistent memory region. I'm going to skip this slide quickly. No, I'm not, because it's kind of a plug. But anyway, we have a device that we're working on.
Starting point is 00:23:44 Some of you may have seen the talk on Monday from Eideticom. We have an FPGA card that is an NVMe-compliant device, and we've implemented a fully functional CMB. Of the other drives in the market today, there's a couple I've seen now that have CMBs, but most of them right now are only supporting submission queues. So we did something that would support everything.
Starting point is 00:24:05 You know, basically I wanted something to play with in the lab, so we did it. And because we're a true CMB, people can DMA to us and then issue an NVMe command that points inside our own bar, and we can detect that, right? We know the PCIe config space; the BIOS has told us the range of our bar.
Starting point is 00:24:26 So when the PRP comes in, we do a check and go, is this PRP inside us or outside us? And if it's inside us, we do an internal DMA rather than an external DMA, right? And the thing about that is now I'm actually saving PCIe bandwidth because now I'm able to do copies with one external DMA. So in the last example, I was having to do DMA to there and then DMA back to here. In this example, it's a true CMB. So I can say copy all the data from here and put it on here, assuming I have the capacity to do that, which our particular thing doesn't today. But that's just, you know, that can be changed. And other people will have drives with much bigger capacities and so forth.
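That check is really just an address-range comparison on the device side. A sketch, with the bar base and size standing in for whatever the BIOS programmed:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device-side view: bar base and size as programmed by the host. */
struct cmb_window {
	uint64_t base;   /* bus address of the CMB bar */
	uint64_t size;   /* bytes exposed through it   */
};

/* Does this PRP from the incoming command point back into our own bar? */
static bool prp_is_internal(const struct cmb_window *cmb, uint64_t prp)
{
	return prp >= cmb->base && prp - cmb->base < cmb->size;
}

/* The controller then picks an internal copy path instead of an external
 * DMA:  prp_is_internal(&cmb, prp) ? internal_copy() : external_dma(); */
```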
Starting point is 00:25:06 So what happens is basically this guy does a DMA. The physical address is on the CMB. When it's done, it raises a completion. This guy then triggers the write on this side. This guy detects it's inside the CMB and just sucks it in. And it's done. And in fact, maybe it can acknowledge it immediately because it's power loss protected. It knows everything at that point.
Starting point is 00:25:25 We've had this conversation earlier today. Once it gets the completion, or once the submission command realizes that everything that drive needs is already inside of it, it can just do an ACK, if it has the physical power-loss-protection capabilities to guarantee that acknowledgement. So that's kind of cool. And again, it's the same idea. The speed stays the same. The CPU load has gone down
Starting point is 00:25:48 orders and orders and orders of magnitude. Only one external DMA. We got some results. I'm not going to go into too much detail, but basically we've been doing a lot of testing with our card. There was some more data on Monday, but basically the CMB can be incredibly fast.
Starting point is 00:26:07 It's a very fast path in our particular topology. So we can do, you know, we're saturating PCIe Gen 3 x8, and we're almost saturating PCIe Gen 3 x16. So there's a lot of capacity there. I think what's more interesting is: what else can we do with this framework? Well, I've shown you NVMe drives, but let's introduce a different type of device.
Starting point is 00:26:30 Let's bring in an RDMA NIC. So now we can move data between NVMe and RDMA. Where have we seen NVMe and RDMA get together in a really sexy way recently? Oh, NVMe over Fabrics, of course. So this is, I think this would have been an FMS demo if it hadn't set the joint on fire. It wasn't us, it wasn't us. So basically this is an NVMe over Fabrics demo
Starting point is 00:27:00 we did with Mellanox and Celestica at FMS. Well, we didn't, Microsemi did, but Microsemi was paying me lots of money. And basically, we had a JBOF here. We had Mellanox CX5s. And what we did is we used the Microsemi NVRAM in this case, and we used it as a p2pmem. And it became the buffer for all the IO rather than this DRAM.
Starting point is 00:27:21 And basically, we would run in one mode for a few minutes, and then we'd switch to the other mode. And we have counters here. And we could show that basically, in both modes, the throughput was the same in terms of IOPS. But again, just orders of magnitude less load here. I mean, it's awesome to watch, because it goes literally from 10 gigabytes per second to a megabyte per second,
Starting point is 00:27:45 depending on the IO size. Because we still have submission queue entries and completion queue entries and so forth. But the data is now all going from here into here, and then these drives are DMAing out of there into their non-volatile memory. I mean, yeah, not processor core load. Not load as in what htop reports as load. I measure the load on this link between the... yeah, I measure how much traffic is going. Yeah.
Starting point is 00:28:19 Both, both, yeah. It's pretty much the same because there's no need... Well, what we're doing, we're doing like 20 gigabytes per second on... like 10 gigabytes per second. So the L3 is going to flush. I mean, it's like, how long does it take to fill a 128 megabyte cache at 10 gigabytes per second, right? DDIO doesn't help you here. There's so much data going through the system. Like, what are you going to do with the 129th megabyte that comes in, right?
Starting point is 00:28:51 It's like, basically, it's like the L3 cache is like, it's like a write-back cache that just fills up instantly. It's exactly the same effect. DDIO just doesn't work in that kind of environment for this kind of thing. But we can argue about that. Yeah. So that was all, to be honest, very exciting. But this is the one that really gets me excited. Because if you look at all the pieces now, we have RDMA, which gives me low latency, CPU offloaded, remote connection to stuff.
Starting point is 00:29:29 I have a standard way of exposing a bar and a persistent bar on the PCIe bus now with NVMe. So PMR, NVMe PMRs, mean that I can standardly, with multiple vendors working to the same spec, go out and buy devices that maybe expose persistent memory that isn't block-based. It's byte-addressable. It's IO memory. I have a Linux-based framework that allows that particular NVMe device to tell the OS what it has. And because of PMR, there's a standard way of also saying that it's persistent. And we can tie that into p2pmem or some other thing to say, hey, the particular attribute of this p2pmem is that it's persistent and it's an NVMe PMR. And so device drivers then can use that to go, OK, well, I know I'm an RDMA NIC.
Starting point is 00:30:22 You've just told me your PMR with a flush mechanism that involves writing to a specific address. I know what that address is. I have a DMA engine. I can certainly poke that address. We can do something around guaranteeing data is actually consistent. So rather than it being just a temporary buffer,
Starting point is 00:30:41 it can become the way we access the non-volatile memory directly over RDMA.
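Purely as an illustration of how those attributes could travel with a p2pmem region from the NVMe driver to a consumer like an RDMA driver; none of these names come from the actual patches.

```c
#include <linux/types.h>

/* Illustrative only; invented names, not the out-of-tree API. */
enum p2pmem_flush_method {
	P2PMEM_FLUSH_NONE,       /* volatile CMB, no persistence guarantee  */
	P2PMEM_FLUSH_READ_ANY,   /* read anywhere in the bar to flush       */
	P2PMEM_FLUSH_READ_ADDR,  /* read one specific advertised address    */
};

struct p2pmem_attrs {
	bool persistent;                  /* PMR rather than plain CMB        */
	enum p2pmem_flush_method flush;   /* what the device advertised       */
	phys_addr_t flush_addr;           /* only for P2PMEM_FLUSH_READ_ADDR  */
};

/* An RDMA driver could then map "make this durable" onto, say, an RDMA
 * read of flush_addr, or refuse if the region is not persistent at all. */
```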
Starting point is 00:31:01 All right. [Audience member] Why didn't you register that memory block? No, that doesn't change. The verbs normally act on a struct page. So that's the change that we did. We changed lower down, underneath, and we've given this memory struct page backing.
Starting point is 00:31:14 So you just pass the memory. You've just got to find a handle that points to that memory, which for us is /dev/p2pmem0. So you literally just open /dev/p2pmem0, and
Starting point is 00:31:40 then you basically mmap it, and then you pass that into ibv_reg_mr. Or, for ib_write_bw, we added an upstream-accepted patch that lets you pass in an mmap option, so you just do ib_write_bw --mmap /dev/p2pmem0 and it just works. There's no other changes needed in libibverbs. No change needed in IB core. Thank you.
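A minimal sketch of that user-space sequence, with the device path as an assumption and error handling trimmed; the ib_write_bw --mmap invocation mentioned above is doing essentially the same thing inside the perftest tool.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>
#include <infiniband/verbs.h>

/* Open the p2pmem char device, mmap a chunk of the donated bar, and
 * register that mapping as an RDMA memory region like ordinary memory. */
static struct ibv_mr *register_p2pmem(struct ibv_pd *pd, size_t len)
{
	int fd = open("/dev/p2pmem0", O_RDWR);   /* first p2pmem in the system */
	if (fd < 0)
		return NULL;

	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);                               /* the mapping outlives the fd */
	if (buf == MAP_FAILED)
		return NULL;

	return ibv_reg_mr(pd, buf, len,
			  IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
			  IBV_ACCESS_REMOTE_READ);
}
```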
Starting point is 00:31:58 So this is literally what we're doing. Like, I was working on this this morning even. We basically, our friends at Mellanox have given us CX5s, we have the Microsemi switch, we have the Eideticom NoLoad. You know, right now, our PMR is a fake PMR. I don't have it, but I can go buy MRAM and put it in there, or we can work with Everspin and put it in there.
Starting point is 00:32:19 That's really, you know, kind of a second-order thing. But we can advertise as a PMR, and we can basically have it so p2pmem realizes that it's persistent, and it can tell this guy. And then what happens over here is basically my application can now use RDMA verbs to access this. And this PMR could be pretty big. Like, this could be terabytes.
Starting point is 00:32:41 I mean, BIOSes might complain, but they won't complain for a while. But it could certainly be quite big. We could have CMBs and PMRs. We could have CMB and PMR and block devices. We could have NVMe devices that have zero block-based capacity, which I know Amber will slap me for saying. Because NVMe is supposed to be about block, but that's kind of where we are right now. So we've been doing some experiments.
Starting point is 00:33:11 This is pretty early days. We're just starting to get some of this data. But I mean, no surprise that using peer-to-peer instead of going through the DRAM, it just means that we're not loading the processor again. So the other advantage, of course, now is that normally what would have happened in RDMA is I would have gone up here, but now I'm going straight here.
Starting point is 00:33:34 So this means basically I can do NVDIMMs without an NVDIMM. I can do it on the PCIe bus if I'm trying to expose it over the network. So if you want to do persistent memory over fabrics, you basically now have a choice. You can go and buy NVDIMMs and put them in with an RDMA NIC, and that will all work fine. And now you can also go buy PMR-enabled NVMe drives.
Starting point is 00:33:56 You can plug those in on a PCIe chassis, like a Celestica or like the Newisys or whatever, like an all-flash array essentially, but now it's a persistent-memory-over-fabrics all-flash array. You can use p2pmem then to basically access it. And on the client side, it doesn't even know. It doesn't care. It's just using verbs, right?
Starting point is 00:34:16 Tom, now it's up to you to find the last little gaps because I know that's what you're really good at. You know, we definitely aren't all the way there. We still have some work to do with RDMA around flushes and fencing and things like that. Well, flush isn't so bad because, like I said, PMR, very cleverly, I did the initial PMR, and it's so much better than it was when I started it. I initially proposed it and the standards people did the right thing. They went, this isn't what we really need.
Starting point is 00:34:47 And one of the things they added above what I proposed was: we are going to have a way for the drive to say, this is how you make me persistent. And I guarantee, if you do this, whatever it is, and we have different things that you're allowed to do, it's persistent by nature. [Audience comment about flushing what's in flight.]
Starting point is 00:35:04 Yeah, and so one of the methods is basically just a read from the bar. If you do a read from the bar, yeah, exactly. And one says read anywhere in the bar; the other one says you must read from this specific address. And I think there's a few others.
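To illustrate the read-to-flush idea with libibverbs: after RDMA-writing into the remote PMR, post a small RDMA read of the advertised flush address (or of anywhere in the bar, depending on which method the drive advertised) and wait for its completion. The address and the exact guarantee come from the device, so treat this as a sketch rather than the spec.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post a small RDMA read of the PMR's advertised flush address; once the
 * read completes, the RDMA writes posted before it are taken as durable. */
static int flush_remote_pmr(struct ibv_qp *qp, struct ibv_mr *scratch_mr,
			    uint64_t flush_addr, uint32_t rkey)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t)scratch_mr->addr,
		.length = 8,                     /* a tiny read is enough */
		.lkey   = scratch_mr->lkey,
	};
	struct ibv_send_wr wr, *bad;

	memset(&wr, 0, sizeof(wr));
	wr.opcode              = IBV_WR_RDMA_READ;
	wr.sg_list             = &sge;
	wr.num_sge             = 1;
	wr.send_flags          = IBV_SEND_SIGNALED;
	wr.wr.rdma.remote_addr = flush_addr;     /* address the PMR advertised */
	wr.wr.rdma.rkey        = rkey;

	return ibv_post_send(qp, &wr, &bad);     /* caller then polls the CQ */
}
```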
Starting point is 00:35:21 Yeah, yeah. But because we're advertising that through a standard driver, we can then pass that information through another standard driver, the RDMA ones, so that the RDMA cards know what to do. I got half a tick of the acceptance. I got half a tick, I got half a tick. Yeah.
Starting point is 00:35:40 I think the promising one is the one written to a specific address, not... Yeah. [Inaudible exchange.] So what about the ecosystem for PMR?
Starting point is 00:35:52 Well, things already exist. So even outside, you know, this week we've had Everspin. They could expose some of their capacity. It's very small capacity, but they could expose that as a bar. And that's a permanent bar. It's backed by spin-torque MRAM, so that can be a PMR. The Microsemi guys for ages have had this NVRAM card, which is a supercap-backed NVRAM device. It can expose some of its DRAM as a PMR, and the supercap ensures your persistency.
Starting point is 00:36:20 We're working on a framework. We're software-centric. We'll deploy on other people's hardware. If that hardware happens to have spin-torque MRAM or some other kind of memory-level persistency, we can expose a PMR as well. And other drive vendors: Toshiba I know has talked about PMR, and I'm sure other drive vendors will be going down that path as well.
Starting point is 00:36:43 So where are we in terms of upstream? The reality is that we're getting there. I think this slide is interesting. Basically, there's three steps. You need zone device, we didn't do zone device. You need P2PMEM for the specific architecture. This is not architecture agnostic. And then upstream.
Starting point is 00:37:01 So basically where we are is: for x86, we have this, we have this, but it's not upstream. PowerPC has added its own zone device support. I haven't tested this, but I'm hoping this is an easy one to get to, because the problem is right now with ARM, we don't have this one. So I'm having conversations with the ARM people.
Starting point is 00:37:20 This is important for ARM anyway because it's NVDIMM support as well. This is basically tied to memory hot plug. So now that you have Qualcomm and Cavium and so forth doing data center server tech, NVDIMM support upstream seems like it would be good for ARM64. So where are we going?
Starting point is 00:37:42 Back to your point earlier, this has been a very Linux-centric stack in terms of what we've done, but we don't have to use Linux. I mean, there's other operating systems and there's also user space. So SPDK is something that we're thinking about, like maybe we do it in SPDK. Certainly if the kernel community
Starting point is 00:38:00 continues to say it's not acceptable and we can't go into the kernel, which would make me very sad if we can't find any way of making that happen, then obviously this kind of thing is the way it would end up going. We can tie in, I mean, I talked just about RDMA and NVMe, but there's other things like graphics cards. And AMD have something called ROCm, which actually kind of exposes some of the memory on the card, and that could actually be tied into some of this. And then, of course, in the long term,
Starting point is 00:38:31 all of these problems go away, because we're just going to use OpenGenCCIX, which is my name for what happens when we get all these silly people with different ideas about the same thing in a room and make sure they just form one standard rather than three. That's it. Mark.
Starting point is 00:38:55 So I'm kind of thinking how this would affect the programming model. Yeah. Does that abstraction have any meaning for this type of thing? Yeah. Yeah. So this is very much for over fabrics, right? It would be pretty foolish, I think, to try and build that for... if you want locally attached persistent memory,
Starting point is 00:39:24 don't put it on the PCIe bus, because the loads and stores now have to go all the way out and all the way back, and performance sucks. That's why we have DMA engines on PCIe devices. So I would very much suggest, advise people not to build this for local stuff and expect it to work at all, like, performantly. So it's really all about the remote stuff, Mark. And I think there, the API is identical to what it is for NVDIMM, for remote access. Adan says it's even better.
Starting point is 00:40:00 Okay, yeah. So it has ordering, yeah, that's very true. The other thing, I mean, so I think this is a good conversation to have, though, Mark. I think now that the ecosystem is getting to the point where we can see a path to this kind of thing, we need to have SNIA looking at it and going, where are the holes? Where does the programming model fit into this? Is it just identical? And I think also SNIA needs to be cognizant in telling its
Starting point is 00:40:26 members that you do not want to do this for direct attached persistent memory. This is persistent memory over fabrics. That's where it's going to be good. And don't have people coming to you going, but you said it was going to be awesome, and it's not awesome. Yeah? I'm curious to see how this works with PCIe
Starting point is 00:40:43 So that's a good question. The question was, does this work with PCIe non-transparent bridging? I, so, I mean, I just worked with a guy on NTB bridging; we did an upstream driver for the Microsemi switch for NTB support. And I can't get my head around how that ties in. I'm trying to work out what would be broken and what wouldn't be broken. But basically, it should work.
Starting point is 00:41:12 I just worry about where on earth the data's going in order to get to where it needs to go. Because it should work. It should work. We create a new PCIe bar that's on the switch. And so as long as that's the PFN that's given to the DMA engine, it's going to push it to the switch. And the switch doesn't care whether it came from host memory or somewhere else. It's just going to pass it through, translate it, and send it to whatever the NTB setup destination was.
Starting point is 00:41:46 My feeling is it should work. I actually have the equipment to test it, so that's something that we could do. Yeah, I mean the application that comes to mind for this is. Yeah, exactly. We can do interesting things around multicasting as well. Come in from the RDMA, we could actually scatter it out and do all kinds of stuff with it. But yes, yeah, for replication, it's a pretty interesting idea. Yeah, it may not have to be NTB even.
Starting point is 00:42:13 It could just be multicast. But NTB would be a lot more, I think, flexible, which is pretty neat. Yeah. The bandwidth went to crap. And sometimes, depending on what you're doing, this doesn't even work. And you get a UR, an unsupported request, back. So sometimes the PCIe will actually just not even let
Starting point is 00:42:50 that TLP through. And it did work in my experiment. Yeah. Yeah. So I think you mentioned that you were sort of not allowing it if the switch was not there.
Starting point is 00:43:04 Yeah, so, yes. [Audience suggestion about not adding the check in the kernel.] I mean, this is all Linux kernel code. It's open source. It's not mine. I wrote it, but it's not mine. It's GPL.
Starting point is 00:43:17 It's the community's, right? So it's not what I... As a community, we need to make some decisions around: are we going to have policies on this? Are we going to enforce them in the kernel? Are we going to expose them to user space so that user space can make some decisions? These are not trivial questions, because it's a
Starting point is 00:43:34 dangerous thing. We're playing with DMA engines and letting them do things that they normally can't. So it wouldn't surprise me if I suggested something and even Linus would step in and say it was batshit crazy or whatever. Which would be great, because then I'd have a flame email from Linus and I'd put it on my wall. I'm actually important enough that he actually called me an asshole.
Starting point is 00:43:57 That's pretty cool. It's like a badge of honor, right? So what we have done in our current version is because we basically say if you're both connected to the same switch, you can do this, and if you're not, you can't. And that's what's in our lab right now, but that's not necessarily what's gonna go upstream. It's not even necessarily what I'm proposing to put upstream.
Starting point is 00:44:19 I'm saying let's have a conversation. And like I said, maybe it's something the user defines, and we have a set of rules, and the user just selects a level. You can have a low-security rule, medium security, high security, depending on what they set. There's many different ways of doing that. And as a community, we need to decide. But right now, the one that I find always works is: both the
Starting point is 00:44:38 endpoints are on the same switch. It's going to work. And right now, that's good enough for what I need to do and the experiments that we're doing. And also, the switches are pretty darn expensive, and I don't work for Microsemi anymore. So I can't put five of them chained together and see what happens. Tom? You said a really interesting thing about setting the block size of an NVMe device to zero. Was that about the PMR or the NVMe device?
Starting point is 00:45:06 Yeah. So I'm saying, let's imagine we now, with the spec we have, or will have, once everything gets approved, if it does, we would be able to build a standard NVMe device that's an NVMe device to the host that has a PMR that's a very large size, several gigabytes, and potentially has no namespaces at all. No namespaces?
Starting point is 00:45:29 Well, it has no block storage at all. Basically, if you ask it how many blocks it has to be able to turn into namespaces, it's zero. Yeah. A memory-only device, but nonetheless an NVMe. It's still NVMe. Admin commands, queues. Queues become a little more weird because you don't really
Starting point is 00:45:47 need queues. You need an admin queue. SR-IOV would still work. So if you wanted to have different people managing it and splitting it up, you could virtualize the PMR and give different parts of it to different VMs running on top of the hypervisor.
Starting point is 00:46:07 And we get all the ecosystem. We get two-and-a-half-inch drives. We know it's going to work. It's NVMe. But it's now a memory-addressable one. And this is where Amber hits me. She's like, you've taken NVMe? It's NVMe, but it's not NVMe. But that's like any standard. As a community, we decide where it goes, right?
Starting point is 00:46:27 How do you deal with such a device's latency discrepancies, right? Its read latency, it's going to be wildly different. Its write latency, because of what it's plugged into. And it depends what it's plugged into. Right. We don't know what it's plugged into yet.
Starting point is 00:46:45 it could be plugged into a memory we don't even know exists yet, right? Would you expose such an attribute? That's a very good question. It seems critically important. It's not ordinary. So I don't think Tom's here, but Tom Friend from Toshiba was pretty involved in PMR, and he already has a proposal for V2,
Starting point is 00:47:06 which digs into exactly some of that. We want to give more information in a standard way that's related to how the PMR is going to behave in the system. And some of that would be around what type of non-volatile memory is backing this, right? What is it? Is it a memory type? Is it a huge ton of DRAM, or is it phase change? I don't think we'll get to the point of saying what type of memory it is. The underlying technology... I was more asking about the protocol by which you perform a read, right?
Starting point is 00:47:37 I mean, it would be very different. You'd block on a response, line by line, from this device. Well, I mean, it's a DMA engine that's doing the read, typically, right? Because we're not doing local access. Yeah, but the DMA engine will have the same problem. NVMe gets around this problem by saying, please write it. So in our lab, we right now are doing like 14 gigabytes per second reads out of it, out of our CMB. Well, we're not going to get more than 15, because it's
Starting point is 00:48:12 like 16 lanes of Gen 3. If you're doing small, random byte accesses, it's not going to be so good. It depends on what type of memory. That's what applications will do. Well, that's not what they do to a block device. They don't know that it's memory. It looks like memory, but it ain't. That's my point.
Starting point is 00:48:35 That's my point. So we'll see. Yeah, we've got to see what the implications of that are. This is incredibly, like literally, we're getting data literally day by day. And I think that will help us understand, is this a cool idea or a sucky idea? I do think it's interesting to be able to offer people the
Starting point is 00:48:53 choice of NV-DMM or PCIe. And over a network, it looks identical, and it's all lib verbs compliant. Yeah. I do have one other question. Your little example that showed the switch with two devices and they're like chatting with each other and not buffering.
Starting point is 00:49:12 Is there any problem with saturating that switch, system-stability-wise, when those guys start using it so heavily? No, not system-stability-wise, otherwise Microsemi is in a world of pain, right? Yeah. So I mean, Microsemi are pretty good at validating that
Starting point is 00:49:29 kind of thing. So if you're using the CPU as a root complex, it's a different story. I mean, the switch architecture doesn't really care whether a port is an upstream port or not. It's a switch. Yeah, exactly. And the way that, you know, not the.
Starting point is 00:49:42 Someone in the front of this room may have been involved in the architecture of that switch at some point in their career. So you don't have a system integrity QoS issue? No, no, I mean, it would depend on the switch, and people would have to go to the different switch vendors and validate that for the workflows that they're interested in,
Starting point is 00:49:59 the switch isn't gonna start doing blocking. And so people would make demands saying, what is your non-blocking throughput between all these ports? That's why you have these meetings with the architects and say, I'm not going to buy your switch because it sucks. I'm going to go buy PLX.
Starting point is 00:50:14 Yeah, no, it's very true. And we do have to be careful about that. I think we're out of time, and I need to run for the plane. But super, thank you very much, everyone. Thanks for listening.
Starting point is 00:50:34 If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
