Storage Developer Conference - #18: Donard: NVM Express for Peer-2-Peer between SSDs and other PCIe Devices
Episode Date: August 20, 2016...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
Today you are listening to SDC Podcast Episode 18.
We hear from Stephen Bates, Technical Director with PMC, as he presents
Donard: NVM Express for Peer-to-Peer between SSDs and Other PCIe Devices
from the 2015 Storage Developer Conference.
Hi, I'm Steve Bates.
I'm a Senior Technical Director at PMC.
We do enterprise storage, and I'll talk a little bit about that in a second.
I'm going to talk today about how we're leveraging
some of our NVM Express product portfolio
and IP and technology to look at how do we combine things like storage
and networking and compute that can all live on the PCIe bus
and make them all work very interestingly together
without necessarily having to go through the central processing
unit, or as a good friend of mine likes to call it, the computational periphery unit,
and do some interesting things that might tie into NVM over Fabrics. Amber hits me when
I say that because it's not standard yet, so I'd say NVMe over RDMA instead. I fear the wrath of Amber. It ties into hyperconverged,
so what happens if I take a box and I put some fast storage, maybe some processing,
like a GPGPU or an FPGA card, and then something like an RDMA NIC, and then put some software
on top and then tile that out like a billion times
or however many times these huge hyperscale companies are doing it.
Looking also at things like next generation all-flash arrays, which maybe don't use SATA or SAS.
Instead, they use RDMA to get into the box.
And then inside we have a PCIe fabric and NVMe drives that are built out of things like NAND, and maybe things
that aren't NAND, like 3D XPoint or resistive RAM or whatever like that. So this work is
all CTO office work; it's proof of concept, it's exploratory. I see some really friendly faces
in here, and we've been working with a lot of interesting, wonderful companies and people
and kind of seeing what all this does.
I don't have all the answers,
so this is really very much a technology discussion
and trying to work out what's interesting,
how do we make an ecosystem out of it,
what standards might have to be, you know,
worked on or impacted
in order to go make some of this happen.
So feel free to jump in with comments or questions as we go.
I'm going to try and move reasonably fast
and then have some time for discussions afterwards.
I have to do this bit.
So PMC, actually we have a dilemma.
My boss would probably kill me if he sees this slide,
but we're very much an old school, new school kind of company right now.
We traditionally have done very well over here.
And we've enabled storage companies, HP is a big customer, EMC, NetApp,
and also some of the newer storage companies in the public cloud,
some of the hyperscalers, to connect up a lot of drives,
a lot of hard drives typically.
And we manage those drives very well and provide that connectivity. And I'll be honest, we make a pretty nice sum
of money, not a huge sum of money, but a pretty nice sum of money doing that. And we've solved
hard problems for people over the course of the evolution of SAS quite nicely. But we've
also realized that things are changing,
and this business is not going to go away overnight,
and I don't think it will go away anytime soon.
But there's a lot of really interesting things that are happening,
and we need to take part of that.
So we have a new school division as well,
which is much more focused on some of these new things that we've been talking about at this conference and other places.
NVMe SSD controllers, that's something that we sell.
It's public knowledge that companies like HGST
and OCZ, which is part of Toshiba,
use us in their NVMe offerings
for their high-end enterprise-slash-datacenter-type SSDs.
And again, we also sell to the hyperscalers
who do weird and wonderful things
with our controllers
that are nothing like NVMe. To be honest, sometimes I don't even know what they're doing
with them. PCIe switching is a new business that we've got into. It's very nice because
it ties in very nicely with NVMe. NVMe runs over PCIe today, and NVMe over Fabrics is
all about providing distance and scale
to NVM Express by letting it run over other protocols like RDMA and Fibre Channel.
We also have, as well as simple fan-out switches, we have some what we call our switch tech
storage switches, which have some weird and wonderful things in them that they can do,
like non-transparent bridging, so multiple hosts can share drives. And anyone who was at IDF last month, we did a demo with
Intel that showed how multiple hosts could share a pool of NVMe drives and semi-dynamically
repartition those drives between multiple hosts at the same time, which is not something
that really anyone else has been able to do until now. But that's a problem we've got
to solve, because in the future,
you're all going to want to deploy lots of NVMe drives,
you're probably going to want to virtualize some layer over the top,
and you're going to want to be able to go, this guy doesn't need all this data,
or doesn't need all these drives, and I want to move some of them over here.
And these are problems that we're having to solve.
This is legacy, but it makes tons of money,
and this is new and groovy.
So what have we been looking at?
We've been building some ideas around this very straightforward architecture.
So this is going to be familiar to everybody.
We have a host CPU.
I don't really care what architecture it is,
but if you're in the world of enterprise and data center,
I think it's public knowledge that Intel have 108% market share.
What a joke.
We obviously connect DRAM.
There's not that many systems that will work that well without DRAM.
Unless, in fact, your workload is totally sequential.
Then you don't actually need DRAM because you can just page it in,
because you know in advance the data you want.
But that's a really weird problem that you're trying to solve, if you already know the data.
Then we have some kind of in-server fabric.
That's pretty much exclusively PCIe today.
That makes a lot of sense.
Intel, as one example,
gives us about a billion PCIe lanes for free
out of each processor.
And, you know, they will continue to do that.
And we connect wonderful devices
on these PCIe channels
that come directly out of the processor.
So I can put my NVMe SSD,
I can put some compute,
whether that's a GPGPU or an FPGA card
or something like a Knights Landing,
the Xeon Phi from Intel.
You know, we don't really care
as long as we have some way of making it do interesting data manipulation for us.
And then we obviously want to probably pull data in from the outside world,
otherwise we're just working in a vacuum.
And so, you know, we have networking devices which may or may not support RDMA, and may
or may not be based on Ethernet. So that's kind of the model that we're working in.
Excuse me. You know, the work that we've been doing is building on top of the standard NVM Express. Linux has been great in the sense that it's supported NVMe with an inbox driver for quite some time.
Can you pronounce the word at the top for me?
Donard.
So that's a good point.
Thank you for raising that.
So all the projects that I start at PMC are named after mountains in my home country of Ireland.
So Slieve Donard. Slieve is the Gaelic word for mountain.
Slieve Donard is a mountain just south of where I was born, in Belfast.
Thank you.
So, you know, I'm in the CTO office, so I want to be able to collaborate with other people.
This project is pretty much all open source.
I have blogs and white papers that tell you this is the hardware we use.
You can go order that or something that you think will be compatible with that
or something completely different if that's what you want to work on.
We have GitHubs that have forks of the Linux kernel.
We have branches off that for some of our different work.
And I'll talk about that in a little bit.
We have user space libraries that me and my team
have put together that tie some of this together
and do things like performance benchmarking.
And all of that is open source. It's licensed under GPL
where necessary because of the kernel;
the Linux kernel is GPL, so you have to keep that.
But anywhere else, the user space code is Apache licensed,
so you can pretty much do whatever the heck you like with it.
We don't really care.
And the idea is that we build this as a sandbox for people to play with.
And I know for a fact that some people in this room
have recreated some semblance of this
and done some work with us on that.
The one thing that I haven't got in this diagram that I
should probably put in that is optional, but I'll talk about why I think it should be a
little more mandatory than just optional, is a PCIe switch. Obviously with Intel giving
us a billion PCIe lanes, we might not need a PCIe switch because we can just connect
everything up to the CPU. And I'll talk a little about the pros and cons of that a little later.
So what are some of the goals and objectives of what we're trying to do here?
A lot of what got me started on Project Donard was thinking about, you know,
we build an NVMe SSD controller that when you configure it correctly, can
do a million IOPS. Now, if you configure it incorrectly, it will do minus seven IOPS.
But it's pretty easy to get it wrong. But, you know, there's products that are in the
market today, you know, like I said, HGST and OCZ, they're doing between three quarters of a
million and a million 4K random IOPS.
So we don't do the low-end stuff.
And in fact, some people might claim
our stuff is a little too high up,
but that's the data.
You can argue over that.
Companies like Mellanox and Chelsio,
I see what way I'm going here,
and I see you down over there.
These guys are doing these awesome RDMA NICs
that can do dual port 40
gig, 56 gig. They're now doing 25, 50, 100 gig. You convert that to IOPS, those are millions,
literally millions of IOPS. So this guy can do a million IOPS. This guy can do a couple
of million going even higher as we transition to 100 gig. These poor guys are trying to
do some hyperconverged, let's process some of this data,
it's literally drinking from a fire hose.
No processor on the planet is going to be able to do
any significant amount of data manipulation
on a data stream that's traveling that fast.
So if you're trying to do image detection,
or you're trying to do some kind of searching algorithm,
the reality
is that these guys are not the bottleneck. The bottleneck's over here. It's either processing
on the core, it's either on the fabric itself, the PCIe subsystem, or it's going to be something
to do with either the DRAM bandwidth or the volume of DRAM at your disposal. Something in here is probably going to get it.
So with Donard, I wanted to explore what happens when we repartition that working set.
So let's say, for example,
we introduce some element of computation here.
And rather than having data flows
that always have to go through the computational periphery,
maybe we can direct traffic from the networking device
directly to storage, and vice versa.
Is there a framework that we can build that allows this to happen?
This guy is still here.
He's still running.
You pretty much still need an operating system somewhere.
Somebody somewhere has to have some kind of,
even just for error handling and things like that, but maybe this guy,
I like to think of it more as the conductor of the orchestra rather than
someone playing the instruments. So this guy is managing flows, providing quality of
service metrics, responding to things in the outside world. But a lot of the data path, well, all of the data path, preferably,
is going what I call east-west.
On this diagram, it's north-south.
This slide is in the deck, so I'm going to kind of skip it.
It's just the hardware platform.
If somebody really cares, we can talk a little about that.
I'm going to talk about a couple of the pieces that we
use to build our puzzle.
And it's also like a plug for the company.
This slide,
you see it in a...
What's that?
You gave that as part of the...
Oh, yeah.
Oh, my goodness. What am I doing?
Sorry. You mean this one here?
Yeah, so what this is trying to show is that maybe if all this is doing is management,
I don't need a Xeon class and I maybe don't need as much DRAM.
Because the problem with NVMe normally is that you stage something in a DRAM buffer. So if I'm using standard NVMe,
if I wanted to pull an RDMA,
in fact Intel did a lovely kind of discussion on this
just earlier today,
but let's say I wanted to do an NVMe write to this drive
from somewhere else using RDMA.
Right now, what I'd have to do is RDMA in here
and then do that.
So you're effectively double buffering. But another
path is to maybe go direct, straight into the drive. And that's part of what we were looking at with
some of this work. Which means you might not need as much DRAM or as much DRAM bandwidth.
So I'll skip that. So we have this product.
PMC is not in the business of making solid-state drives. We enable our customers to do that.
With that being said, we do build board-level products.
So this started as something a customer asked us to do,
and then we ended up turning it into a generally available product.
So under this heat sink is our NVMe controller.
And this thing appears as an NVMe drive,
but it's not backed by flash.
It's backed by DRAM.
So it's basically very low capacity,
incredibly low latency.
You know, you think 3D cross-point is fast.
Well, DRAM's faster.
You know, it just costs more.
And it's got really, really good endurance
because it's DRAM.
I can just keep writing to it.
So this sounds crazy.
Why would I use this?
This is great for next generation
all-flash-array write caching
because I can write this guy
over and over and over and over
and over again in a log structure.
And as long as I have at least enough capacity
to store all my writes that are in flight, I'm good.
I don't need terabytes, I need gigabytes.
So this is not for everybody, but it is quite useful.
The other thing that's really, really good about it
is our controller presents to the operating system
as a block NVMe device, but
it has a second access methodology. We can expose memory as a PCIe BAR, and we can mmap
that into user space, and then I can basically do cache line accesses that change cache lines
on the DRAM. So this, for us, is, even if I told my CEO, I don't care if we sell one
of these, because I am learning so much about what next generation SSDs are going to look like from this, because
I have cache line accessibility, I have a memory semantic way of talking to this, I
still have my NVMe if I want it, which I do still want, because I want DMA engines, and
I get all of that on a PCIe interface.
So this is part of
you don't have to use this in
Donard, you can use anything you like
but this was very useful for me
because it exposes both
a cache line or memory
semantic access methodology
and also a block based methodology
and there's not really a lot of devices
out there right now that do both
block and memory
kind of access semantics like that.
Very useful for research.
And like I said, we are actually selling some of them.
We sell seven.
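As a rough illustration of that second access methodology, a minimal user-space sketch might look like the following; the sysfs path, BAR index, and mapping size are placeholder assumptions, not the actual product's layout.

```c
/* Sketch: map a PCIe BAR into user space and touch it with loads/stores.
 * The sysfs path and BAR index below are hypothetical placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* resource0 corresponds to BAR0 of the device at this (made-up) address */
    const char *bar = "/sys/bus/pci/devices/0000:04:00.0/resource0";
    size_t len = 4096;                 /* map one page of the exposed window */

    int fd = open(bar, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint64_t *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    /* Plain loads and stores now turn into PCIe memory read/write TLPs
     * that the controller services against its DRAM. */
    mem[0] = 0xdeadbeefULL;            /* a cache-line-sized write           */
    printf("read back: 0x%llx\n", (unsigned long long)mem[0]);

    munmap((void *)mem, len);
    close(fd);
    return 0;
}
```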
So I understand you access the same memory,
the same cells, both by the block interface and the memory?
So you can set it up either that way, or...
slap the BAR over the whole device.
You could make the BAR bigger than you have space for, as long as you have some way of handling page holes.
So, it has a PMEM device style behavior as well.
We know exactly what...
So, everything that you've done in PMEM on an NVDIMM, we've done on NVDIMMs
and this. And in fact, on the I/O bus instead of the memory bus. Exactly. And so a lot of
the work that, you know, Intel and you guys and the SNIA work, all the NVDIMM work that's
happening, it has been fantastic because it's really helped us understand this as a concept
as well. So it's a good point. I mean, this is
basically an NVDIMM on the PCIe bus. When I expose it as a memory device, it's just
like having an NVDIMM.
Well, until you try to read.
What's that?
Read performance.
Well, sorry. So let's ignore performance. I'm talking about as an entity in the system.
There's always performance issues.
Writing performance will be pretty good, but reading performance will be remarkably low.
But then we have DMA engines, which are very, very fast.
Sure, but the block interface is probably a better way to read this.
Exactly. Very good point that you've got.
Sorry, just the gentleman in the back, and then Terry, I'll get to you.
Yeah, no, sorry, you were there.
How low is the low latency?
So I've got some slides on that.
So if you want to read a cache line of DRAM on this thing,
if you just assume we're up in user space,
we've mmapped this in, and we do a read,
that read gets serviced
in about,
it's around
between 600 and 800 microseconds.
But it's architecture-dependent.
So,
that's the kind of number you get.
No, did I say microseconds?
I meant nanoseconds.
Sorry, holy crap.
Otherwise this would be called a hard drive.
So basically, in that memory mode,
what happens is the operating system goes,
hey, I want to read this cache line on this mmapped file.
I'm talking Linux, and I will talk Linux semantics
all the way through, because that's the one I understand.
Apologies to anybody who works for another operating
system. But yeah, so if I try to read that mmap, what essentially happens is that falls
through the driver stack. It turns into a PCIe TLP, a memory read, basically. It comes
down, works out, oh, the memory read request is not for DRAM. I'm actually going out through an IOMMU, out onto the PCIe subsystem, enumerating my buses.
I'm going out to this device.
Basically, that memory read PCIe TLP hits this guy, hits our controller.
We service that TLP.
We pass that back as a completion.
That goes back through the stack and then into the OS.
There you go.
So then why do you use both of those things?
Why not always do them?
Because load-store semantics take up CPU cycles.
The reason we have DMA engines is because I can have a thread that goes,
I want to read a megabyte of data, and I don't want to sit here waiting. I'm going to switch to a different thread,
and you, DMA engine,
you're going to service that,
put the data in a buffer,
and you're going to raise an interrupt when you're done.
And so basically you switch threads,
you go and do something that's going to earn you some money,
hopefully, right?
And then at some point in the future,
the MSI-X interrupt triggers,
interrupts the operating system, and says,
hey, that thread that asked for that data,
it's time to go back and do your work because your data's now there.
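To make the trade-off concrete, here is a hedged sketch of the two access styles side by side; the device path is an assumption for illustration, and a real O_DIRECT buffer would need to be block-aligned (for example via posix_memalign).

```c
/* Sketch: the two access styles the card offers. Device paths are
 * illustrative assumptions, not the product's real names. */
#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)

/* Block path: the kernel builds an NVMe command, the drive's DMA engine
 * moves the megabyte, and this thread sleeps until the MSI-X interrupt
 * completes it -- the core is free to run something else meanwhile.
 * With O_DIRECT, dst must be block-aligned (e.g. from posix_memalign). */
static ssize_t read_via_dma(void *dst)
{
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, dst, CHUNK, 0);
    close(fd);
    return n;
}

/* Memory path: every byte is moved by CPU load instructions that become
 * PCIe memory-read TLPs -- very low latency for small accesses, but it
 * occupies the core for the whole transfer. */
static void read_via_loads(const void *bar, void *dst, size_t len)
{
    memcpy(dst, bar, len);
}
```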
There's a really, I have some data a little later that looks at what happens
when we raise an interrupt in an operating system.
I hear some people laughing. This is data that I had to go measure
and I couldn't believe nobody had put this online.
It's really very interesting.
It's not so interesting when you're reading a hard drive
where the access times are so long
that you could literally read a newspaper.
But as we move to Optane-based drives
from Intel and Micron,
the time taken for the driver to do its work
and the time taken for the hardware to do an MSI-X,
I think Jim
showed some of this, you know, the media access time
is really absolutely nothing.
Everything else is taking up
the bulk of the latency. And I think that's a problem
we need to go think about.
Either we move to the memory channel
in entirety and we give up
on super fast SSDs,
or we go, yeah, I see.
I think I have an idea, so, you know, this is, sorry, go ahead.
I just want to follow up on that.
So it's 600 to 800 nanoseconds for a read of a cache line.
And when I want to flush that, when I push that cache line back out.
Remember, this is a PCIe device, so this is a cache line going across PCIe.
This is not mapped into your normal memory space,
so it's not necessarily cache backed
by your L2, L3, or whatever cache.
That's architecture dependent,
and you're starting to get into some of the finer details
of what happens in your memory subsystem
on the given processor that you're working on.
We can start talking Intel specific, but I got to wonder when we hit NDA
problems and stuff like that. But you're right, Terry, there are things
you need to think about. Very similar to some of the NVDIMM stuff. Is it in a cache?
If I write it, can I guarantee it's got to this device and it's not stuck
in some stupid cache that somebody put there to make performance better, but now it's hurting me
because I'm trying to go to persistence.
Very, very, very similar problems
to what SNIA is working on
with the NVDIMM work.
But slightly different
because we're on the I/O bus.
Mind you, SNIA isn't solving those problems.
It's just waving red flags, saying,
somebody solve these problems.
Yeah, yeah.
That's fair enough.
Okay.
So, really, really interesting device.
And like I said, if anyone wants one,
well, if you're a customer of ours,
I'm sure we'll get you a loaner.
But if you want to buy one,
I'm sure we can sell you one for sure.
Very interesting device.
The other piece of the puzzle,
and this is the block I didn't have
in my architecture diagram,
but we have a PCIe switch product.
I'm not going to go into a lot of detail,
but it basically allows me to connect some of the things
and some of the devices.
And it's common knowledge.
GPU Direct, I don't know if anyone has worked with GPU Direct
or heard of it, but it's worth looking up.
I mean, basically, everything I'm doing here, I didn't invent.
I basically stole it from GPU Direct,
which is RDMA into the memory,
the I.O. memory on a graphics card,
and they use it in HPC all the time.
But when they first started doing that,
they realized, if I have my graphic card
and my NIC connected directly to the X86,
and I try to do direct traffic between the two,
something in the I.O. memory controller
in the Intel architecture
screws up east-west traffic.
Surprise, surprise.
The optimal performance path
is to go from the I.O. device
up into the memory system,
and then down from the memory system.
That's probably the path
that they cared the most about.
I apologize if anyone missed it.
But the problem is that if I want to start doing east-west traffic, as I call it,
I do hit performance issues as I try to move large PCIe memory transactions east-west through that subsystem.
So that is going to change as we go.
I have no control over that.
Some of the people in the room probably have more influence with Intel than I do.
Go tell them about it.
Go see if they're going to do anything about it.
Or if you slap a PCI switch in front of it, you're good.
Yeah, exactly.
So the switch...
So Terry works for Everspin,
but is also a member of our sales team,
I'd like to point out.
And he works for Microsoft,
who is also part of our sales team.
The point I'd like to make is that that east-west problem goes away
if you have a switch here.
For CPUs from whatever vendor, I can't say;
I only know that east-west traffic is treated badly by an Intel root complex.
I hope to do this work on an OpenPOWER server very soon
and compare, and I haven't tested on ARM,
so I can't comment on their east-west, but I can on Intel,
certainly on the current architecture.
What's the latency of the switch?
The switch from one port to another is about 160 microseconds.
Nanoseconds.
I'm getting a message.
Oh my gosh.
160 nanoseconds.
So it's not zero, but it's not huge either.
And obviously that's per hop.
So putting a lot of those in is going to cost you more.
That's just part of the thing.
So that's another piece of the puzzle we'll talk about in a minute.
So the great thing about having the switch is that the east-west traffic goes there.
I'm not going to hang around too long on this slide.
GitHub, Donard; not too many things are called Donard in the world.
You'll get a picture of a mountain
in Ireland, and you'll probably get
this GitHub site. We have
user space code, we have a fork of the
kernel. Right now, I think we're
rebased off 3.19.
Since then, there's actually been all the great
NVDIMM work that's gone in.
We need to rebase off 4.3
or 4.2-rc2
or something, because a lot of the new patches
from the
Intel folk and other NVDIMM
people are actually going to be quite useful
for this so it's quite an exciting time
for some of this work.
So some actual results.
We started off by doing some experiments
and these are quite old experiments
with GPUs.
These were Kepler class NVIDIA cards.
These don't have any kind of graphic port on the back.
They're not for that.
These are designed to crunch numbers.
Then we have an NVMe SSD.
This is one of our eval cards.
That's why it looks like a complete piece of crap.
It's not complete.
But this is an NVMe device.
That's what I had when I did this testing.
So what we were trying to do is,
let's just go directly from the storage device
to the memory on a GPU.
And NVIDIA have a driver.
Now, NVIDIA are well known for being
a-holes, for want of a better word,
in the Linux community,
because their drivers are somewhat proprietary,
in binary form,
and binary blobs are pretty much anathema
to the Linux community.
But they have to expose the symbols
for the functions in their driver,
because otherwise, how do you know
where their functions live in memory space?
We know what those functions do,
so you can actually tie in
to the functions that they're providing.
So basically, we rode on the coattails of GPU Direct,
and we basically invented something
that I called NVMe Direct,
except I knew that
Amber would hit me, so I changed it to Donard. And we basically are able to say, the operating
system can say, hey, I have a file on a file system or I have a region of LBAs on this
device, Mr. DMA engine on this NVMe SSD, can you please send it to these cache line addresses or these bus addresses? And this guy
will go, yes, I certainly will do that. And it starts pumping out data.
The PCIe system realizes that the destination for those TLPs
is actually in memory. It's exposed by this guy. And so
the data comes directly through here. If they're both connected directly to the
CPU, it'll have to go up, hit the CPU, go through the IO subsystem, and then out. If you have a
PCIe switch in there, it has the enumeration tables. It just throws the traffic directly
from one device to the other. When we did that, we measured two different things. This
is classical, and this is the Donard method. We measure just
raw bandwidth. How quickly can I do this? And then we also measure how much DRAM am
I consuming on the central processor? Because in the original approach, you're actually
double buffering. Because in normal NVMe, you would have to pull the data off the drive, put it in a DRAM
buffer and push that DRAM
buffer down to the graphics card.
In the new version you don't need
that. So this one is just
raw throughput. This one
is how much DRAM volume
I save. So it's kind of a figure
of merit.
So both figures of merit got better.
I'll be honest, I haven't...
I did these experiments before we actually
had a PCIe switch at PMC.
We only just got that chip back
from Fab a couple of months ago,
and this work was done probably almost a year ago now.
I need to go back and redo these
now that I have a switch,
because I think I can make this number even better.
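Reduced to pseudocode, the flow being described looks roughly like the sketch below: pin GPU memory so it has stable PCIe bus addresses, then aim the SSD's DMA engine at those addresses instead of a DRAM bounce buffer. The helper functions are hypothetical stand-ins, not the real Donard or GPUDirect entry points.

```c
/* Hypothetical sketch of the peer-to-peer read path described above.
 * None of these helpers are the real Donard/GPUDirect entry points;
 * they just name the steps. */
#include <stddef.h>
#include <stdint.h>

struct pinned_gpu_buf {
    uint64_t *bus_addrs;   /* PCIe bus addresses of the pinned GPU pages */
    size_t    npages;
};

/* 1. Ask the GPU driver (GPUDirect-style) to pin device memory and hand
 *    back bus addresses that other PCIe devices can DMA to.            */
extern int gpu_pin_pages(void *gpu_ptr, size_t len, struct pinned_gpu_buf *out);

/* 2. Build an NVMe read whose data pointers reference those bus
 *    addresses, and hand it to the drive's submission queue.           */
extern int nvme_read_to_bus_addrs(int nvme_fd, uint64_t slba, uint32_t nlb,
                                  const struct pinned_gpu_buf *dst);

int donard_style_read(int nvme_fd, void *gpu_ptr, size_t len,
                      uint64_t slba, uint32_t nlb)
{
    struct pinned_gpu_buf buf;

    if (gpu_pin_pages(gpu_ptr, len, &buf))
        return -1;

    /* The SSD's DMA engine now emits TLPs whose destination is the GPU's
     * BAR; with a PCIe switch in the path they never touch host DRAM.  */
    return nvme_read_to_bus_addrs(nvme_fd, slba, nlb, &buf);
}
```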
Why do you need any DRAM at all in the Donard case?
In this particular case?
Yeah, I mean, what's it doing?
Yeah, well, I mean, in this particular case,
I don't need any DRAM
except somebody's running an operating system.
Yeah, some OS orchestrated the two guys to talk to each other.
Yeah, exactly.
He doesn't even know it's happening?
No, he doesn't even know it's happening.
Well, he initiated it. He said, please move this from here to here.
But I have some cases later, which are RDMA-based,
where he doesn't even know it's happening.
So today, the GPU doesn't know how to run NVMe drivers.
This is why the CPU has to be involved here.
Yeah, it doesn't know how to run NVMe drivers.
Yeah. Yeah.
Yeah. Because theoretically you could have the GPU please...
Well, yeah, on a hardware level you could get rid of everything and just say, but, you know...
Yeah, and I know NVMe over Fabrics is kind of going down that path, but...
Yeah, in theory these guys could do this by just not even having a central processing unit at all.
Not at all.
It's just a PCI rack at that point, right?
And this guy is a DMA.
I mean, basically the rule is if one of them is a DMA master and the other one is a slave,
things are going to work.
If you've got two slaves, they can't do anything very interesting.
If you've got two masters, they can't do anything interesting because they don't have a destination to go to.
You can't DMA into another DMA's mailbox. You want an exclusive.
Yes?
Is that at a constant queue depth or IO depth?
This was large. This was a, how do I get the biggest number here, so this looks good. So this was pretty big queue depth, pretty big IO.
I don't have a latency number here, but we can go back and measure that.
Yeah, and I got a, you know, some of these results are definitely, you know, I would
like to do, redo a lot of this.
I have a switch, I want to measure latency, I want to measure how busy is the
CPU working from a perf point of view when I'm doing it this way versus that way. To
be honest, I can't do everything I want to do. I'm hoping some people in the room will
find it interesting enough to come and ride on this roller coaster.
So, pretty interesting. For this particular,
and all the code to generate these numbers
is all available in the open source repositories,
so people can certainly try and recreate them
and help with that.
So, for that GPU example,
we actually decided to write something
that's a little more interesting.
This is for a demo that we were doing for somebody.
So, we did a needle in a haystack.
So what we did is we went to, I think it's MIT,
they have this big image database
that's used by academics for certain image recognition.
And what we did is a needle in a haystack problem.
We took the PMC logo and we randomly...
Where's Wally?
It's where's Wally.
Yeah, exactly.
So we basically took this PMC logo
and we basically buried it in these images,
a small set of the huge database.
And then we basically put that entire database on the NVMe device
and we wrote some code using CUDA
to go do convolution on the images with our needle,
which is our logo,
and go, please, you know, out of these 10,000 images,
find the ones that had this logo in them.
And basically that's what we did.
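For reference, the CPU baseline of that needle-in-a-haystack search is essentially a sliding-window comparison like this illustrative sketch (grayscale images, sum of absolute differences rather than a true convolution; it is not the CUDA code from the demo).

```c
/* Illustrative CPU-only needle-in-haystack match: slide the logo over the
 * image and report the best-matching offset. Purely illustrative. */
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

void find_needle(const uint8_t *img, int iw, int ih,
                 const uint8_t *logo, int lw, int lh,
                 int *best_x, int *best_y)
{
    long best = LONG_MAX;

    for (int y = 0; y + lh <= ih; y++) {
        for (int x = 0; x + lw <= iw; x++) {
            long sad = 0;   /* sum of absolute differences for this offset */
            for (int j = 0; j < lh; j++)
                for (int i = 0; i < lw; i++)
                    sad += labs((long)img[(y + j) * iw + (x + i)] -
                                (long)logo[j * lw + i]);
            if (sad < best) {
                best = sad;
                *best_x = x;
                *best_y = y;
            }
        }
    }
}
```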
We did some comparisons between
how many pixels per second can I do
when I'm just using a CPU.
We did it using CUDA,
but without the Donard methodology of DMAing directly.
And then we also did it with Donard.
And we compared a hard drive,
a solid-state drive,
obviously the NVMe drive,
and we got a pretty good speed-up with the SSD.
Obviously, we can't do the Donard path with the HDD
because there isn't an NVMe-compliant HDD on the market.
I think somebody might be working on one.
Anyway.
But nothing that we did on this slide is non-standard NVMe.
Right?
It's just NVMe.
This should work with any NVMe 1.1 compliant drive.
The performance numbers will change, obviously,
depending on the drive, but that's what we got. And then we also did some analysis to
work out where's the bottleneck. So, interestingly, the bottleneck was different in each of the
three places. For the case where the CPU was doing the image convolution, it was processor
cycles that were the bottleneck. Even
with multiple threads, it just takes time to go through all those images and say, does
this convolution of these two generate some kind of impulse? That's not surprising, in my
opinion. With the CUDA version, because we were now pushing the inner core of that image
recognition onto the graphics card, the problem actually became DRAM bandwidth.
That's the problem.
Because we're moving data all the way into DRAM
and all the way out again,
so it's basically an in-and-out that we don't want.
And then with Donard,
we were actually able to make
the limiting factor what we want it to be:
the graphics card.
So we were able to repartition the problem
and push the bottleneck to where I think I wanted it to be,
which is on the graphics card.
Because I can always deploy more graphic cards.
I can have one SSD and two Teslas
or whatever kind of machine.
And I am going to run way over if I keep going at this pace,
so I'm going to jump in.
We did some demonstration work.
We did some work with Chelsio.
I see you over there.
Thank you for that.
And we did some work
with Mellanox as well. So we're definitely looking more
at RDMA. These results are now looking at RDMA and NVMe. This is a very interesting
area as well. This was a problem that we wanted to solve. My background is enterprise storage
and write commits are really, really important. We talked about that as well in the end. If
I'm a remote client and I want to write some data, I can't let go of that data until the
write has been acknowledged. Because I don't know if I'm going to get the act back
until I get it.
The response time for an
acknowledgement on a right is normally
a pretty interesting thing
to know in a storage system.
It doesn't matter whether it's direct attached
storage or network storage or some other
kind of storage. Typically
you're interested in how quickly
can I get my write to persistence
and get an acknowledgement back to whoever initiated the write. Okay, pretty, pretty
important stuff. Now, you know, in an RDMA meets NVMe kind of world, a standard write
path for a write would be this blue line. So the write would come in through the RDMA
connection. You would have an MR, an RDMA memory region, declared somewhere in DRAM. And the data would end up there.
You would basically notify this processor that there's new data, or it would be polling.
And it would basically then do an NVMe command to move data from that buffer out to the NVMe
device. And then you have to, once that write is acknowledged
back to the driver, it then has to generate an acknowledgement over the RDMA
network back to the initiator.
That's your delay path. So network delay, DRAM delay,
context switch due to the interrupt generated by the SSD,
and then the acknowledgement back to the client.
Lots of steps there. So what we were doing in this work is looking at, can I actually
just push directly into some kind of persistent zone here? Now, this is a DMA master. This
is an RDMA NIC. Typically RDMA NICs can only be masters.
I'm sure these guys can correct me, but typically they're masters.
This guy in normal NVMe mode is a master. I already said two masters
can't talk. But our NVRAM card can be a master and a slave.
Slave is the DMI mode, or the direct memory interface mode. So we actually basically used our card as a proxy for a next generation NVMe SSD
that is capable of exposing some memory type semantics on it.
So we're now getting into the world of non-standard, or NVMe as it doesn't exist today,
but maybe NVMe might exist sometime in the not too distant future.
We did some comparisons on, you know, what was my bandwidth?
I don't have latency numbers here.
I really should have latency numbers here.
Sorry, I apologize for that.
It's pretty important.
But I don't have them right here.
And we can certainly work to get those numbers for anyone who's interested
and then we also looked at
the
DRAM utilization
and what we did there is there's a program,
I think it's called NPR or MBR,
and what it does is it runs a background
process on available threads
and it basically hammers DRAM
and then sees
how quickly it can get to DRAM.
Back to something again we talked about
in one of the talks earlier.
If I'm using my DRAM
bandwidth to do all this blue path,
how much do I have left over
for processes and threads that might be trying
to make me some money? Because I'm not making
money moving data. I'm making money
telling Terry
what kind of car he wants to buy next week
or working out what kind of advert
to put in front of Stephen
as he's having his dinner.
So moving data is important,
but we don't directly make money
by moving data.
And now we start to get into
the interesting world of NVMe over Fabrics,
NVMe over RDMA.
So we did some work with our friends at Mellanox.
And this work is from demonstrations that we did at Flash Memory Summit.
And there's lots of blog information on that.
But we did an example, a prototype of NVMe over RDMA, or NVMe over Fab fabrics. So what we did is we have a server here. This
server actually had a PCIe switch. We got our PCIe switch back from TSMC two and a half
weeks before Flash Memory Summit. And I went to my guys and went, I'm buying everybody
in that team a beer if I can have it in the demo. And they were like, two and a half weeks?
One beer?
They thought it was one beer each.
Oh, dear.
So we have the awesomeness of having a PCIe switch here.
And that's not so relevant for this slide,
but the next slide is very relevant.
And what we did is we took the standard inbox NVMe driver,
and one of our guys basically worked it.
He wrote a client version, and he wrote a server version.
Now, you could do it as one module
and then use module parameters or sysfs or whatever to do it that way, but we just split it in two.
The client driver is here on the client.
The client device has no direct attached storage.
It doesn't have an SSD.
But what happens is we expose a virtual device, a /dev/nvme0n1.
As far as this guy's operating system is concerned,
he has a direct attached NVMe SSD connected to him. He can do admin commands, he can do
command line interface, he can do FIO. It's just a block device. The performance, obviously,
is going to be a little different because he doesn't really have real hardware there. When he issues, for example, an NVMe read, what happens is it goes into
the NVMe driver. We intercept that. We go, hang on, this is NVMe over RDMA. I'm going
to take this IO. I'm going to pass it over to the RDMA part of the kernel. It gets RDMA
encapsulated by the RDMA NIC. We've already established a connection beforehand, so assume that bit's already been taken care of.
The data basically gets
RDMA'd over to a buffer
that we've already set up in the memory region
here in DRAM.
This side still has DRAM, so just bear with me.
At that point,
the driver on this side kicks in.
We have a polling operation
on a completion queue.
And the driver on this side goes, I've got new data.
He takes that data.
He works out his NVMe command.
He basically passes that command to essentially the standard NVMe driver.
And that performs the operation on the NVMe drive.
And then the response comes back the same way,
and the acknowledgement goes back to the initiator.
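Conceptually, the target side described here is a polling dispatch loop. The sketch below is a heavily simplified illustration with hypothetical structures and hooks, not the actual prototype driver.

```c
/* Hypothetical sketch of the target-side dispatch loop described above. */
#include <stdint.h>

struct fabric_cmd {
    volatile uint32_t ready;     /* set by the RDMA NIC when a command lands */
    uint8_t  nvme_cmd[64];       /* encapsulated NVMe submission entry       */
};

/* Illustrative hooks, not real kernel interfaces. */
extern int  local_nvme_submit(const uint8_t *sqe);      /* hand to NVMe driver */
extern void rdma_send_completion(int slot, int status); /* ack the initiator   */

void target_poll_loop(struct fabric_cmd *ring, int depth)
{
    int head = 0;

    for (;;) {
        struct fabric_cmd *c = &ring[head];

        if (!c->ready)
            continue;            /* spin: polling instead of taking interrupts */

        int status = local_nvme_submit(c->nvme_cmd);
        rdma_send_completion(head, status);

        c->ready = 0;
        head = (head + 1) % depth;
    }
}
```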
We did this in... We basically were seeing about 6 microseconds of additional latency
through the RDMA stack.
So if I do a read on the direct-attached drive,
I got 40 to 50 microseconds.
It's a NAND-based SSD. With an Optane drive,
I think Intel are publicly quoting something around 10 in their IDF demo.
But we're adding about 6 microseconds.
I think NVMe over Fabrics is targeting under 10 as a good place to be.
And obviously the individual vendors, of which there are several
in this room who are active in this space,
obviously compete with each other and try to get
those numbers to where they think makes
sense for the market.
Well, if you have Ethernet here,
it's going to take you a few microseconds
just to go back and forth.
So, you did pretty well.
Yeah, so we did this, this was RoCE v2,
so we're running over Ethernet.
We did direct attach, so we didn't actually have a switch off here.
Switches, they have latency too, right?
But, you know, this gives you a ballpark of where we are.
We actually also ran it, the Mellanox NICs we had also run InfiniBand,
so we also ran it in IB mode.
We didn't see a big difference between the two,
but we didn't have any switches in there.
That's going to, like I said, that will change going forward.
I'm going to skip this for the sake of time and kind of get to the good stuff. So this is basically
the interesting modification that we made
when we realized that our NVMe drive
can be both a DMA master and slave.
And here we have some interesting repercussions
for latency and for access semantics
across a network
tied into, again, the NVDIMM work.
But in this case, what we did is we basically said,
rather than having to kind of go all the way through DRAM,
is there a way that we can do it so we can go directly into this device?
And because this device is, like I said,
both a block-based device and a memory access mode.
We were able to do that.
And so in this particular example,
we weren't doing NVMe anymore,
certainly not standard NVMe,
but we were basically accessing persistent memory
on an I.O. memory device
behind a PCIe switch using RDMA.
So this guy doesn't need to do that.
This guy ain't doing jack. Well Well he's initiating the connection, he's going to manage any error
events, he's going to run an OS, but because it's RDMA, he doesn't even get notified of
anything in this particular scenario. I can come in and I can make cache line modifications
on this PCIe card. The traffic is all behind this switch.
And it's all happening from this client control.
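A hedged sketch of the key step that makes this possible: mmap the card's BAR and register it as an RDMA memory region so a remote client can RDMA-write straight into it. The device path is a placeholder, queue pair and connection setup are omitted, and registering I/O memory like this generally depends on the Peer-Direct patches mentioned later in the talk.

```c
/* Sketch: expose a PCIe BAR as an RDMA-writable memory region.
 * Paths are placeholders; QP/connection setup is omitted for brevity. */
#include <fcntl.h>
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int expose_bar_for_rdma(struct ibv_pd *pd)
{
    const size_t len = 2 * 1024 * 1024;   /* window size: illustrative only */

    int fd = open("/sys/bus/pci/devices/0000:05:00.0/resource0",
                  O_RDWR | O_SYNC);
    if (fd < 0) { perror("open BAR"); return -1; }

    void *bar = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap BAR"); return -1; }

    /* Register the mapping so the NIC may DMA into it; remote peers then
     * RDMA-write using this MR's rkey, and the traffic flows NIC -> switch
     * -> card without ever touching host DRAM.                           */
    struct ibv_mr *mr = ibv_reg_mr(pd, bar, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return -1; }

    printf("rkey for the remote writer: 0x%x\n", mr->rkey);
    return 0;
}
```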
Now, we get into some interesting things like,
if I'm getting cache line writes,
atomicity becomes a problem.
If I want to do a 4K write,
but I want to have the rule that that 4K write either happens in its entirety or not at all,
pretty standard feature of a block I.O.
But what if something
crashes halfway through the transaction?
I don't have a time machine.
This device could provide that.
We could provide that as a service by
doing some kind of double buffering and going
I'll take your data,
I'm not going to commit it to the memory until I've got it all,
and then I'm going to commit it.
At which point it looks a little like a block device.
Because that's what block devices do.
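The atomicity service being sketched here is essentially a staged commit: land the incoming cache-line-sized pieces in a scratch buffer and only copy the full block to its home location once the last piece has arrived. An illustrative sketch, not firmware:

```c
/* Illustrative staged-commit scheme for making sub-block writes atomic. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLK 4096u              /* 4K block, staged as 64 x 64-byte chunks */

struct staged_write {
    uint8_t  scratch[BLK];     /* incoming pieces accumulate here          */
    uint64_t received;         /* bitmap of 64-byte chunks seen so far     */
};

/* Accept one 64-byte piece; returns true when the whole 4K is present. */
static bool stage_piece(struct staged_write *w, unsigned chunk,
                        const uint8_t *data)
{
    memcpy(&w->scratch[chunk * 64], data, 64);
    w->received |= 1ull << chunk;
    return w->received == ~0ull;              /* all 64 chunks arrived     */
}

/* Commit the block in one shot; if we crash before this point the old
 * contents are untouched -- either the whole write happens or none of it. */
static void commit_block(struct staged_write *w, uint8_t *home)
{
    memcpy(home, w->scratch, BLK);
    w->received = 0;
}
```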
But it is interesting for a couple of reasons.
One is that we're basically completely isolating ourselves from this part of the system.
And another is that we're basically allowing this guy to continue to do whatever it is that makes us the money
that we want him to make.
Whereas this guy is just really moving data.
We've almost split the system into the money-making part
and the part doing the heavy lifting.
So I think there's some interesting ideas there.
And that's something that we're working through with...
What was your end-to-end latency?
What's that?
End-to-end latency.
It's about...
If you want an acknowledgement,
it's about the same as before.
It's about five to six additional microseconds.
Five microseconds to do this.
Because remember, we're a write-back cache,
and we're a region of guaranteed persistence
because we have a capacitor system.
So all we have to do
is get the data in us. We don't
have to put it in the nonvolatile memory.
As soon as we've got that TLP,
the onus is on us
as this device manufacturer
to make a quality of service
statement that we will guarantee
that data is safe.
That's what a power fail storage device does.
So we can make that kind of...
So we don't need to commit to memory,
so we can acknowledge the right very quickly.
So we were seeing the 5 to 6 microsecond round trip.
But we're now able to do small IO.
We can do 64 bytes.
We can do 128 bytes.
We can do much smaller modifications.
And I think with DRAM, it's somewhat interesting,
but it's pretty expensive.
With 3D cross-point or resistive RAM
or whatever other wonderful memory technologies that are coming,
I think the cost point of this per gigabyte will come down, and the performance
won't necessarily get any worse,
because I don't think the memory
access times are particularly
the problem here, given that we already have
microseconds of network time.
So, yeah.
I don't... I see a lot of
interesting things I can do with this.
I think there's people in the room who can probably pretty quickly think up things
that I have not even thought about yet.
So does this describe writes only, or also reads?
So, I mean, the reads will be very similar to the writes in the sense that, you know...
Then the master will be the NIC.
Yeah, yeah, exactly.
Well, yeah, in all these diagrams, yes.
Yeah, I mean, in all of these, the last two slides,
this NIC is always a master.
He's always a master in these slides.
In the graphic card slides,
the graphics card was the slave.
In this mode, we're the slave.
And the previous slide, the NVMe over Fabrics one:
They were both masters, and we were using the DRAM off the processor.
So in this example, he's a master, he's a slave.
He's a master, he's a slave.
But we're having to double buffer, basically.
So where are we going?
This is one idea of where you go with this.
We go to one of our friendly RDMA vendors.
We put an RDMA NIC here.
Maybe we put our NVRAM cards I showed you earlier here,
the DMA masters and DMA slaves.
Maybe we put some NVMe SSDs here.
Maybe some of these are NAND based.
Maybe some of them are not NAND based, which makes them ANN based.
We've got a PCIe switch here because we want to have good east-west traffic flow.
Maybe because we have to keep the legacy side of PMC happy,
we have an HBA so we can go out to a bunch of rusty disks.
I put it in there.
And maybe we have this guy, but at this point, what's this guy really doing?
Maybe this guy can be here.
Maybe he can be here.
Maybe he can be in here.
Maybe he can be here.
He's not doing a lot of heavy lifting,
so maybe he doesn't have to be the biggest processor in the world.
We can use this NVRAM as essentially a shared pool of persistence
for write caching,
and then later on move the data using the DMA engines in these guys
out to the storage.
So I can acknowledge rights quickly.
Maybe I can provide quality
of service by taking
these drives, some of which are Optane-based,
some of which are NAND-based,
and
dividing that up as a pool
and providing quality of service metrics.
You want to pay for it? You get the Optane.
You don't want to pay for it? You get the 3D
TLC.
We can do all kinds of things there.
We're using peer-direct,
it's maybe not a word I've used in the presentation before.
Again, borrowing off the shoulders of giants,
peer-direct is a Mellanox-based patch set
in the Linux kernel that enables this east-west flow.
It's part of what makes GPUDirect work,
and it's part of what makes
Donard work. And
Mellanox have that publicly available.
It's not upstream, so you've got to pull it in
to your kernel and recompile.
But you're all smart people, I'm sure.
So, I've probably run way over.
Most of you are still here, so hopefully
you're still interested. Where
do I think... We're out of time.
That's all right.
So, since we're out of time, I'm going to keep talking unless they have to kick me out.
Yeah.
Where do I think we want this to go?
I think NVMe SSDs would be very interesting if we have a standard way of exposing some kind of memory access into them.
So taking what we do in a proprietary fashion with our NVRAM product,
but integrate that into the NVM Express standard.
There's actually already something that kind of does that.
It's called controller memory buffers.
It's not quite where I think it needs to be,
but I think with the right people in the right rooms,
we can take it from where it is today to where I think it needs to go.
All these themes are tying into why I think that's the case.
In RDMA and NVMe over Fabrics,
they need a DMA slave as a destination.
Why can't that be on the drive itself?
We've got NVMe 1.2.
It's already a starting point.
We can go from there.
We've got next-gen NVM that doesn't need to have erases. It's
byte-addressable, or somewhat byte-addressable. It's faster. It looks more like memory. So
having a memory semantic way of accessing it makes a lot of sense, even if it's not
on the DDR interface. It still makes sense because I can mmap that and give it to my applications.
I can take advantage of all the NVDIMM work that SNIA and others have been doing, and
we can leverage that.
But also have a block device access
methodology as well. Maybe they overlap.
Maybe they don't overlap.
Thinking about
non-volatile memory as memory rather
than storage. Do I need to have
MSI-X interrupts? Do I need
to have atomicity? Do I need
to have some of the other services? Maybe
there's ways I can solve those problems in my applications or in my operating system so that I can take
better advantage of every dollar I invest in non-volatile memory. PCIe switches allow
east-west traffic flow and get the CPU up to a higher level where it can make me my money.
And I can only take advantage of that if I have a memory region on the PCIe device.
I need to have somewhere to put the data.
So I have some more slides, but we've gone way over.
That's really my call to arms.
Very happy, as always, to talk about this.
As you can tell, I'm pretty passionate about it.
And I hope you guys enjoyed it.
I'm out of time, and I'm going to go for a drink.
Thank you.
See you guys soon.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with
your peers in the developer community. For additional information about the
Storage Developer Conference, visit storagedeveloper.org.