Storage Developer Conference - #81: FPGA-Based ZLIB/GZIP Compression Engine as an NVMe Namespace
Episode Date: December 3, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode 81.
All right, we'll get started. So I'm not Saeed, I'm Steve Bates, I'm the CTO. This is work that we did at Eideticom.
We are a startup working on computational storage using NVM Express.
That is not something that is currently part of the NVMe standard,
but the wonderful thing about standards is they evolve.
And maybe NVMe and other standards will evolve to take advantage of computational storage.
And we think that's a very interesting idea, and we hope you do too.
And in order to kind of justify why things like NVMe or SCSI or Fibre Channel might change
to allow for computation on or near the storage, we kind of have to showcase some examples
of getting benefit by doing something.
So this is actually work that came out of a customer engagement.
So I'll give you a little history on what the customer was looking for
and kind of how we solved that.
I can't name this customer, but they're quite a large data center customer.
So that's pretty good.
And we're continuing to work with them as they kind of evaluate and go pre-production into production and so forth.
So kind of pretty neat to see, but we'll talk a little about that.
Before I jump into that, most of you, I'm sure, in this room have heard of NVM Express.
If you haven't, why on earth are you here?
You're in the wrong conference.
Please go away. It's a standard specification that was designed from the ground
up for accessing non-volatile media, originally for PCIe; now, obviously, with fabrics we have
multiple transports. One of the things that was always interesting: I used to work for
a pretty big SCSI shop for a while,
and we were always like, is NVMe actually really better?
And there are actually some reasons around why it is better,
particularly around multi-queue and stuff like that.
And now, obviously, it's very hard. Well, it's not hard, but you have to look around now
to find a solid-state drive that isn't PCIe based, NVMe based.
You can still get them for sure, and people are still using them, but it's getting harder
and harder. It is designed to be high speed, high throughput, and pretty CPU efficient,
and very importantly, designed for parallelism. One of the things about NVMe is it supports
many hardware queues, much more than most people put on their controllers.
And you can basically take those queues and map them to the multiple cores you have
running on your processors, because
Dennard scaling says we
can't go to 8-gigahertz cores, so
we end up having two 4-gigahertz cores
or four 2-gigahertz cores.
Recently I had
a system in my lab that had
256 cores.
That was a bit painful.
But we take advantage of all those cores with their different threads
and issue I/O from all those cores against these drives.
For us as a small startup, this next point was very important.
We had a decision to make.
We're going to do computational storage stuff. Do we go and write an Eideticom driver that's kind
of shitty because we're not great at writing kernel code? And then do we go to the customers
and say, hey, customers, I know you're using Ubuntu, but can you install this shitty out-of-tree
driver that might not work with your particular backports and might cause a
bug on your bare metal and bring down 10,000 of your servers and then wonder why we sell absolutely
none of them, right? So for us, it wasn't even a choice. We had to align with something that's
already in upstream kernels. We're going after hyperscalers. We're going after data center
customers. We can't provide shitty kernel drivers on our website, right? We have to be
working with inbox drivers. On top of that, the NVMe driver, you know, I haven't done the math,
but if you look at the hours devoted and the average salary of the person who's working on
the NVMe driver, that driver is probably tens, if not a hundred, million dollars' worth of software development salary. So, yeah, I'll use that.
It's like, why wouldn't I, right?
And I'll show you in a minute why it makes a lot of sense.
So one of the other things, I talked about this yesterday,
NVMe has standardized something called a controller memory buffer.
It's basically a PCIe BAR.
And that's a piece of memory that can be used for DMAs.
But in order to utilize that, we need the operating systems to support it.
So we also have been working on something called the peer-to-peer DMA framework.
I talked about that yesterday.
If you're interested in learning more about that and where we are in terms of upstreaming,
and the answer is hopefully very, very close now, you can
come chat to me.
And the peer-to-peer transfer is there to reduce the load on the CPU system memory and
to free up CPU time.
So NVMe can be used as a high-speed platform for sharing accelerators.
So we use NVMe today to talk to traditional NVMe SSDs
from Intel or Samsung or Seagate or whoever. But we can introduce these new devices. You know,
our product is just one example. There's other companies like NGD Systems and ScaleFlux and
some of the larger companies are starting to take a look at this. And these are also NVMe devices,
but they don't have any storage.
Some of them do.
Ours do not.
We don't store a damn thing.
I spent a long time working for a company
that makes SSD controllers.
I spent six years of my life
trying to get very fucking clever companies
to get SSDs that work based on our controllers.
And I went, there's no way I'm ever fucking doing a startup
that talks to NAND ever again.
Can I be more clear about that?
So I think it's a great business and I love SSDs and that's great. But if I'm doing a small startup,
I don't want to be an SSD. I want to be a computational engine. Why try and solve two
hard problems at the same time? So these are these new accelerators, but they just happen to present
to the OS as NVMe devices. So you take one of our
cards and you plug it in your system. The class code is non-volatile memory, and the subclass code
is NVM Express. And the driver that it binds to is the inbox NVMe driver. And then we go from there.
So now I've got an interesting system, because now I've got high performance, low latency storage.
This could be NAND. This could be something that's not NAND, which makes it AND.
Boolean joke.
It could be Optane or even something like spin-torque MRAM
or Nantero's stuff.
But the interface is NVMe.
This could even be over fabrics.
This could be a high performance networking device,
RDMA, or TCP/IP coming soon, or Fibre Channel.
And then you could have your storage
somewhere else; I don't care where it is. But now on top of my storage, I also
have this accelerator compute. And this could be doing a number of different things. I'm going to
focus on compression in this talk. But we are working either ourselves in-house
or with partners to put other interesting
PCIe acceleration functions here.
Artificial intelligence inference, right?
Because our investors say we're worth
two times more than we are if we'd say AI.
So, you know, anything,
and because we're moving to a world
where things are more heterogeneous,
AI is not best served on an instruction set architecture,
at least the ones we get from AMD and Intel.
So if I'm running a lot of AI,
we're already seeing a lot of servers where we have a lot of PCIe accelerators.
They might be Google's TPU, or they might be an NVIDIA card,
or they might be an FPGA card.
And we're struggling to come up with good frameworks for how do we manage these accelerators? How do we
tie them into applications? And one of the things people want to do also is they don't want stranded
resources. So if I have a server that has 16 accelerators in it, and I don't need all 16,
how do I let this server over here that needs 18 borrow the two?
And I talked a little about that yesterday morning with Sean,
and we show you how NVMe can actually allow for that disaggregation,
which for me is another great benefit for using NVMe.
All right.
So I'll dig into the specifics of our platform.
I actually...
Did I bring it?
Did I bring it?
Where is it?
Oh, I didn't bring it.
That's very strange.
Normally I have one with me, but I forgot it today.
Or somebody stole it.
Somebody stole it.
We call it NoLoad.
Is my mic rattling?
We call it NoLoad.
Bang, bang, bang, bang, bang.
Sorry about that.
So NoLoad stands for NVMe offload,
and basically it can present, in this case, FPGA accelerators.
I mean, we could do an ASIC if we raised enough money
and wanted to go through all that pain,
but we're not quite going to do that just yet.
So we deploy on FPGA, and there's
other good reasons to use FPGAs, because they can change their functionality over time. So that's
something that we're doing. But we present, like I said, we present that computation, that FPGA
resource as an NVMe endpoint. So we have an NVMe endpoint in there. And then we can basically use namespaces,
which is an NVMe standard construct,
to present the different accelerators in a way that the operating system can carve up
and give to different applications.
So in the same way you can take a chunk of NAND
and carve it into different namespaces
and share that from behind a controller,
we can take some amount of compute resources,
chunk it up into namespaces,
and give those to different applications that need them.
It also allows us to have
one accelerator be a compression engine,
another be erasure coding for RAID or Reed-Solomon calculations.
It could be an artificial intelligence inference
or training or something.
But once that's tied in,
we can use standard NVMe tooling to manage this device. So you can imagine if our customer is
already on a path towards NVMe and they're writing management code, the fact that our accelerator
presents as an NVMe device and can be managed with the same piece of software that they're
managing the drives,
that's a win, right?
Because they don't have to write a different management stack for their accelerators.
You know, we can do things like enclosure management.
We can do, you know, we can follow
all the rules around LEDs.
We can come in the same form factors as FPGA,
or sorry, as NVMe drives can.
So we have a U.2 form factor; the hardware was actually from Nallatech.
Alan, just wave.
Alan's been a great partner on the hardware side.
He builds the hardware.
We put the bitfile on
that turns it into NoLoad
and everybody wins, right?
But we can also deploy
as an add-in card.
We could deploy on something
that has lots of FPGAs.
We're not particularly fussy
in terms of the physical deployment.
We're more interested in the smarts that we enable on that platform.
And then we have an API that sits in user space,
and I'll show you some of that in a little bit,
that uses the NVMe driver to access the hardware.
So we still have the kernel in the path,
which is great for isolation and security and high performance. But we basically
then operate using a library and user space. And then we present an API that someone like
yourselves could use to tie into your application. So you could do Ceph acceleration, you could do
something else. And we have the APIs to do that. So on that point, you know, the software stack for us is very important. We do not change a line of
code in kernel space. Never touch kernel space if you want to be a successful small
startup, because you will die under the workload. All right, the great thing is, you know, this driver
is very well defined. One of the things I love about NVMe is they didn't do what the IBTA did.
They didn't define verbs somewhere up here in this wishy-washy software space.
They said, your PCIe device will have fucking registers at this fucking offset that do exactly this.
So, you know, and then we still get quirks.
I mean, we still have drives that don't all work quite the same way, but at least it's a lot better than RDMA, where you have to have a driver for different NICs, for those particular NICs, right? So the
same driver works for Intel drives as it does for us, as it does for Seagate, as it does for anyone
else. And that's true for VMware, and for Windows, and for Linux, and do we really care about any
other operating system after those three? Also, if you want to do things in user space using things like SPDK, we're supported there. We're just an NVMe device. Anything that
works with NVMe works with us. It's really quite that simple. And then the APIs we provide are free
and permissively licensed under the Apache license on our GitHub account.
So if you're interested, you can go there right now and take a look at our code.
And we provide some example applications that build on that API
to actually do interesting things.
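To make the "it's just an NVMe device" point concrete, here is a minimal sketch of mine, not Eideticom's actual API: an application can reach an accelerator namespace through the stock Linux NVMe driver's passthrough interface. The device node and the vendor opcode 0x99 are assumptions for illustration only.

    /* A minimal sketch (assumptions, not Eideticom's real API): drive a
     * hypothetical compute namespace through the stock Linux NVMe driver's
     * passthrough ioctl. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        int fd = open("/dev/nvme2n1", O_RDWR);      /* hypothetical accelerator namespace */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        size_t len = 128 * 1024;
        if (posix_memalign(&buf, 4096, len)) return 1;
        memset(buf, 'A', len);                      /* data we want processed */

        struct nvme_passthru_cmd cmd = {
            .opcode   = 0x99,                       /* hypothetical vendor I/O opcode */
            .nsid     = 1,
            .addr     = (unsigned long long)(uintptr_t)buf,
            .data_len = (unsigned int)len,
            .cdw10    = (unsigned int)(len / 512),  /* e.g. payload length in blocks */
        };

        /* The inbox driver queues this like any other NVMe command. */
        if (ioctl(fd, NVME_IOCTL_IO_CMD, &cmd) < 0)
            perror("NVME_IOCTL_IO_CMD");
        else
            printf("completed, result=0x%x\n", cmd.result);

        free(buf);
        close(fd);
        return 0;
    }

The point of the sketch is simply that nothing vendor-specific is needed in the kernel: the inbox driver, the standard queues, and the standard ioctl carry the command.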
We have some customers who have asked us to do certain things in the kernel,
and for them we have made kernel modifications.
So one of our customers is very big on ZFS,
and, yeah, it's public, right?
It's Los Alamos National Labs.
Great guys, great partners,
really enjoying working with them.
And they were like, well, you know,
we do RAID and compression in ZFS.
Can we offload that to your device?
We can't do that in user space.
ZFS is in, well, it has a user space, but it has a kernel space,
and that's the one we were interested in.
So we actually went and took a look.
The compression part of ZFS already has support for accelerators.
Intel's QuickAssist was added as a supported device.
It was literally two or three lines of code for us to tie in
to that particular part of the stack.
What's kind of interesting about that work is the way that we did it is we actually changed the driver
so that when we advertised as an NVMe device, we said,
oh, and if the NVMe device is an Eideticom device, don't register it as a disk.
Register it as a special pointer that we can pass to the compression and RAID parts of ZFS, and you can use this in
kernel. And if we change the NVMe standard to standardize that, then we wouldn't have to make
those changes out of tree. The NVMe standard itself could come up with a way of saying, hey,
I'm a namespace type. I'm not a storage namespace. I'm a computational namespace.
Treat me differently. And we can actually have people submit patches to the Linux kernel and the other operating systems in order to support anyone who wants to build NVMe devices that can do compute as well as or instead of storage.
And there was a birds of a feather last night within SNIA.
We're kind of moving forward on discussions on exactly that kind of topic.
I'm not going to say for sure, hand on heart, exactly what's going to happen because
I can't predict standards. I don't think anybody can. But it's going to be an interesting discussion.
For those of you who are familiar with NVMe, you've probably used NVMe CLI. It's a free tool, available online, from Keith Busch at Intel.
It's kind of one of the de facto tools for managing the drives in your system.
So there's a command called NVMe list, which you have to run as root unless you've set up permissions on your disk devices.
And you get a whole bunch of information about the namespaces in your system at the current time.
You can see here we actually have three Intel drives,
and then we have four namespaces associated with our Eideticom device. So it's pretty typical today
that you get one namespace on a drive. Some other drives have namespace management, so they might
have quite a few. But this device here actually has four. Now, NVMe CLI allows for vendor-specific plugins. So Intel has a plugin, WD have a plugin,
Seagate have a plugin. We have a plugin. Ours is eid. So when you do nvme eid list, basically,
it only lists Eideticom devices, and it looks at the vendor-specific field of the namespace
identifier to extract more information that's vendor-specific about that namespace. And this
is very standard. All the drive vendors do it. This is how you get things like firmware versions
and all kinds of other wonderful stuff. What we use it for is to identify in a human-readable way
what our different namespaces do from a computation point of view. So we have one RAM drive, which we use for test purposes,
and we have three compression cores
in this particular incarnation.
If we burnt a different bit file,
that list could be different.
We could have Reed-Solomon, we could have SHA,
we could have artificial intelligence, blah, blah, blah.
And maybe some of those names become standard.
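For reference, a vendor plugin like that is just parsing the vendor-specific region of the standard Identify Namespace data structure. Here is a hedged sketch of pulling those bytes out yourself; the controller node and namespace ID are assumptions, and what the bytes mean is of course vendor-specific and not shown.

    /* Sketch: issue Identify Namespace through the admin passthrough ioctl and
     * dump the start of its vendor-specific region (bytes 384-4095 per the NVMe
     * spec). */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        int fd = open("/dev/nvme2", O_RDONLY);      /* controller character device */
        if (fd < 0) { perror("open"); return 1; }

        uint8_t *id;
        if (posix_memalign((void **)&id, 4096, 4096)) return 1;
        memset(id, 0, 4096);

        struct nvme_passthru_cmd cmd = {
            .opcode   = 0x06,                       /* Identify */
            .nsid     = 1,                          /* namespace of interest */
            .addr     = (unsigned long long)(uintptr_t)id,
            .data_len = 4096,
            .cdw10    = 0x00,                       /* CNS = 0: Identify Namespace */
        };

        if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) { perror("identify"); return 1; }

        printf("vendor-specific bytes: ");
        for (int i = 384; i < 384 + 16; i++)        /* just the first 16 of them */
            printf("%02x ", id[i]);
        printf("...\n");

        free(id);
        close(fd);
        return 0;
    }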
Yes?
How come I see the Eideticom RAM drive on the bottom,
but I don't see it on the top?
Because there's four here, and there's four here.
In this particular parsing,
because you're not using the plug-in,
you have no idea about non-vendor-specific stuff.
You can only go to the standard defined fields,
and this is basically our namespace name.
So if I look below the left-hand column, it says n1 through n4, and n1 is then the
Eideticom RAM drive.
Yep.
So that's the same as...
Yep.
Yeah, exactly.
So this particular function call can't do anything vendor-specific, because you haven't
called the plugin.
This particular function call, because you're calling the Eideticom plugin, can look at vendor-specific fields,
and it understands how we've taken the vendor-specific field,
which is quite a big field,
and carved it up into subsections that are useful.
And then this prints it out in human-readable form.
So this particular version is a little older.
Our current version actually dumps, like,
the build time of the FPGA
and the version number of the accelerator
so we can obviously keep track of versions and stuff.
Interestingly, this works over fabrics as well.
If our FPGA was plugged into an NVMe over fabrics target
that was then exposing the namespaces
over Fibre Channel or TCP/IP or RDMA, you could
extract the same information
even if the accelerator is
no longer direct attach, but it's
connected over fabrics.
What's that?
Well, that's vendor specific.
So one of my apps guys
would know what it means. I have no fucking idea.
Basically, it means everything's working really well.
Yeah, there's information in there,
and obviously it just depends exactly.
It could be a date stamp.
Yeah, it could be a date stamp.
All right, so that's kind of the NoLoad.
That's the product that we're bringing to market
and we're going to sell for loads of money
and we'll be flying around in private jets
this time next year, apparently,
if you believe the hype.
But on top of that,
one of the things that we recognized
as we were thinking about NVMe for acceleration
was also the fact that PCIe-based servers are getting
very, very fast, with an awful lot of I/O. So you put in a couple of 100-gig or now even 200-gig Ethernet NICs,
you put in a bunch of NVMe drives, you put in a bunch of graphic cards, you could easily get 50
or more gigabytes per second of sustained data movement around PCIe. And in the Linux or in
all operating systems today, the way that that works, if you move data between two PCIe endpoints,
so I have an example here of a copy between two NVMe drives, the path today is DMA to system memory
and you might get last level cache. It might not all go to DRAM.
There's DDIO and other things that can happen.
But the path is traditionally you allocate pages out of your main memory pool, typically DRAM.
The DMA goes there, and then when it's done, you get a completion,
and you issue another IO to the other drive saying, hey, can you DMA this data into you?
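In its simplest user-space form, that legacy path is just a read into a DRAM buffer followed by a write out of the same buffer. A rough sketch, with placeholder device paths:

    /* Sketch of the legacy copy path: the source drive DMAs into a buffer in
     * system DRAM, then the destination drive DMAs back out of that same buffer. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)                         /* 1 MiB per I/O */

    int main(void)
    {
        int src = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        int dst = open("/dev/nvme1n1", O_WRONLY | O_DIRECT);
        if (src < 0 || dst < 0) { perror("open"); return 1; }

        void *buf;                                  /* lands in DRAM (or the LLC via DDIO) */
        if (posix_memalign(&buf, 4096, CHUNK)) return 1;

        for (;;) {
            ssize_t n = read(src, buf, CHUNK);      /* drive 0 DMAs into system memory */
            if (n <= 0) break;
            if (write(dst, buf, n) != n) {          /* drive 1 DMAs back out of it */
                perror("write");
                break;
            }
        }

        free(buf);
        close(src);
        close(dst);
        return 0;
    }

Every byte of the payload crosses the memory channels twice, which is exactly the contention being described.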
Now imagine I've got 200 drives and RDMA NICs and graphics cards, and they're all doing
different things.
All of that data is going either over this interface or it's hitting the last level cache.
That is a huge, huge problem.
That either means one of two things.
Either you need like six DDR channels just to get the bandwidth,
in which case you're paying for a processor
whose instruction cycles you may not even need or want,
but you just need memory channels, right?
Or you're in a hyper-converged environment
and you do need those Xeon instruction cycles
because they're running VMs and applications
and containers and Kubernetes.
And in which case, they are fighting for the same DRAM that the DMA traffic is using.
So there's a quality of service.
There's a contention issue.
So either one of those is bad.
So what peer-to-peer DMAs do is they allow you optionally to say, hey, rather than allocate my DMA buffers from here,
can I allocate them from memory
that's on the PCIe bus already?
That's what we're enabling
with the peer-to-peer DMA framework.
And the most classic,
the most relevant example of a device
with memory already on it,
on a PCIe endpoint,
that's done in a standards-based way,
is the NVMe controller memory buffer.
So other devices have memory. Graphics cards obviously have a ton of memory on them,
but they don't necessarily expose it, and they certainly don't do it in a standards-based way.
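On the kernel side, the framework being upstreamed lets a driver donate BAR memory roughly like this. This is a hedged fragment, not a complete module; the calls are from the pci-p2pdma API discussed here, and it mirrors what the NVMe driver does for a CMB, with error handling and the surrounding driver omitted.

    /* Kernel-side fragment (not a complete module): donating part of a BAR to
     * the pci-p2pdma framework, as the NVMe driver does for a CMB. */
    #include <linux/pci.h>
    #include <linux/pci-p2pdma.h>

    static int example_register_cmb(struct pci_dev *pdev, int bar, size_t size)
    {
        int rc;

        /* Hand the BAR (or part of it) to the p2pdma allocator. */
        rc = pci_p2pdma_add_resource(pdev, bar, size, 0);
        if (rc)
            return rc;

        /* Advertise the memory so other devices' DMAs may use it. */
        pci_p2pdma_publish(pdev, true);

        /* Later, whoever sets up a transfer carves buffers out of it. */
        {
            void *buf = pci_alloc_p2pmem(pdev, 1UL << 20);

            if (buf)
                pci_free_p2pmem(pdev, buf, 1UL << 20);
        }

        return 0;
    }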
So now what we can do is the DMA can basically be this DMA here, and then this becomes an
internal ingestion,
an internal DMA.
It doesn't even go back out on the PCIe device.
We have to make sure,
there's all kinds of issues we have to worry about we can get into.
Certain PCIe topologies have problems with this.
Access control services is very, very scary.
Don't do this in a hypervisor environment.
Don't do this from a VM, please, right now.
We'll get there, but right now this is kind of a bare metal
only thing
So that legacy
path there, could there be
legit reasons that you want to use that?
You want to change the data?
Exactly.
So is anybody
thinking about being able to
break up the peer-to-peer transfer?
Say, look, at this offset, these number of bytes,
do you want it to go up to the DRAM complex?
Yeah, well, so there's a couple of ways you can do it right now.
I mean, right now, if you're a user space process,
the upstream patches don't actually expose peer-to-peer in any way to user space yet,
but we have a hack that does that; it's a driver called p2pmem.
And what you can do in your application is you can allocate buffers in both normal memory and peer-to-peer memory,
and you can make decisions as the application writer on which buffers to use at which given point in time,
given who's talking to who.
So, you know,
SPDK is a good example of that. SPDK, you can
either call from the CMB
to allocate memory, or you can call from something
else. You can make decisions based on
your understanding of the PCIe topology
and what you're doing.
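The user-space hack looks roughly like the sketch below: mmap a chunk of the CMB through a p2pmem character device and use it, instead of DRAM, as the buffer for an O_DIRECT copy. The /dev/p2pmem0 node name is an assumption, and, as discussed, this is bare metal only.

    /* Sketch of the user-space p2pmem hack: the copy buffer is backed by PCIe
     * BAR memory rather than system DRAM. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)

    int main(void)
    {
        int src = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        int dst = open("/dev/nvme1n1", O_WRONLY | O_DIRECT);
        int p2p = open("/dev/p2pmem0", O_RDWR);     /* hypothetical p2pmem device */
        if (src < 0 || dst < 0 || p2p < 0) { perror("open"); return 1; }

        /* This mapping points at CMB memory on the PCIe bus. */
        void *buf = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE, MAP_SHARED, p2p, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        for (;;) {
            ssize_t n = read(src, buf, CHUNK);      /* drive 0 writes straight into the CMB */
            if (n <= 0) break;
            if (write(dst, buf, n) != n)            /* drive 1 reads straight out of it */
                break;
        }

        munmap(buf, CHUNK);
        close(src); close(dst); close(p2p);
        return 0;
    }

The application gets to decide, buffer by buffer, whether to go through DRAM or through the CMB, which is the point being made here.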
In the kernel,
we will have to have discussions
around that. We will have the discussions around
user space APIs that make sense.
But is that an example of what you want to push upstream?
Yeah, I mean, all the things that we are pushing upstream right now are typically,
if peer-to-peer doesn't work, fall back to the legacy path.
Like, we're not saying break the DMA, right?
People may want to break the DMA if it doesn't work, but that's a decision for down the road.
So you're kind of making use of the stock driver
or making the most of it,
but it's not optimal in any sense?
Not yet. No, not at all.
No, and I think there's a lot to be learned.
I think once we get upstream,
that's basically the green light
for a lot of people to start playing with it,
and we're going to get a lot of input
and discussion around how best do we use this, how best do we use it in the kernel? How best do we use it
from user space? What are some of the things we might need to look at to make those
judgment calls?
A quick question, then a longer question. You have CMB and BAR; what's a BAR?
So, a BAR is a base address register that's a PCIe construct,
a PCI construct, and basically it's memory-mapped I/O memory.
So the other question I have is,
in your right-hand drawing,
you kind of have this magic line that goes through the CPU.
It doesn't say where...
Oh, it does show a CMB.
Is there a CMB in the NVMe side?
No, no.
So you only need...
Yeah, you actually only need one device
to have a CMB to start playing.
And it could even be that the devices you're copying between, neither of them have a CMB, but another device does.
In which case, you would have a green line to here, and then you would have another green line into the other device.
So you can do peer-to-peer DMAs between two PCIe endpoints,
but actually use a third PCIe endpoint
as the donator of the peer-to-peer memory. So are CMBs in some sense faster than BARs?
No, no, no. A CMB is just an NVMe definition on top of a BAR. Physically, they are the same.
The CMB is just a way, an NVMe standard way of allowing the operating system,
or informing the operating system, that part of this BAR has certain properties that may be
desirable for the operating system to use. But from a functional point of view, it's just a BAR
or a part of a BAR. Yes?
So for the peer-to-peer to work,
you need to have the actual physical address.
Correct.
And that's part of what a peer-to-peer framework does.
So it has to go to the CPU's VT-d table.
Only if the IOMMU is on.
And even then, it doesn't necessarily,
because you may have translated it before you passed it down.
Yeah.
But typically, we assume the IOMMU is off.
That's why I'm saying don't do this in virtualized environments today
because we haven't really got there yet.
I think we will, but it's going to take a bit of work.
And then you also obviously want to disable access control services
on the path between these endpoints so that the DMA can go.
The SSD in this picture has the address for the CMB.
Correct.
That's what you give it.
It has to work with the IOMMU to get the real address.
It has to work either that or with the hypervisor.
It doesn't actually physically go to the IOMMU per se,
or it doesn't have to.
There's a couple of different ways of doing it.
But right now, like I said,
we are advising people who want to do peer-to-peer
to do it on bare metal
and preferably just disable the IOMMU.
That's what we're advising as people ramp up.
We are starting to have discussions
at the highest level around,
okay, now that it looks like
we're getting the bare metal stuff upstream,
how do we start doing things
in more virtualized environments
where an IOMMU must be on
and we have access control services
and, for example, we might pass through
two virtual functions, right?
So this is a device.
It could have SR-IOV.
It could have some virtual functions.
And we pass a VF of this to a VM. And we pass a VF of this to a VM, the same VM, right? How do
they do peer-to-peer? Can they? Should they? Should we just say never, never do that? And then even
more interestingly, or maybe not, you know, we pass a VF of this to one VM. We pass a VF of this
to a different VM. And we want to do a peer-to-peer between the two. That is starting to terrify me, and
I'm not even a hypervisor person. So, yeah, it's a good question. We don't have all the
answers yet. The great thing about open source software is it evolves. Yes?
How do you work with the multi-sockets?
Yeah.
So it's a good question, yeah.
So what happens in a multi-socket environment?
You know, the reality is that right now,
the code that's upstream will recognize that,
and it won't do the DMA.
It will fall back to using system memory,
because it's a really stupid idea.
We do have some configfs stuff that you can use to override that. If you know your
system is good and you want to do that, then technically it works. The performance may vary
and you're certainly going to load up your socket to socket bus with traffic. So you can do it and
it probably will work, but it might not.
So peer-to-peer DMAs, I mean, the kind of customers we're working with on this right now
are very much people who understand the environment in which they're doing this
and have a good sense of the PCIe topology
and are making well-informed decisions around static systems
that are not going to be changing all the time.
Over time, there may be other people
who want to use it, but we're not there yet.
And if somebody's so unaware of their system
that they try to do peer-to-peer across the socket,
they probably deserve everything they get
because they weren't making good decisions.
This is not something that's going to be on by default.
You will have to go and do a bit of work as a systems architect
to actually turn this on, even in the operating system.
So we took those two different pieces together
and combined them for a customer.
So the customer was basically deploying on, they were actually on an Intel
based system, but we ended up doing a demo with AMD and I'll talk about why that is in a minute.
But they were asking us to do compression on data that they had on some NVMe drives.
And they wanted to get the compressed data onto another set of NVMe drives. And then they were
actually buffering it out the back to a capacity storage tier. But they actually wanted this path to be done in a
peer-to-peer fashion so that they weren't impacting the load store of the applications that were
running on the processor. So what we did is basically we took our libnoload, we used
the inbox NVMe driver, and we wrote an application that basically did peer-to-peer data movement
with compression as part of that movement,
using the compression engines in the NoLoad to do that.
So the way that that works out is basically you end up doing an NVMe read command
to the input drive, and this works through VFS.
So you can have a file system on here.
We had EXT4,
and basically you can start issuing NVMe IO
against this drive,
and as the gentleman mentioned earlier,
the pointers, the SGLs or the PRPs
in those NVMe commands
point somewhere in our controller memory buffer.
So this guy starts issuing memwrite TLPs,
and hopefully they get routed by PCIe.
Certainly in the AMD system,
which has good peer-to-peer characteristics
between its root ports,
we were seeing very good performance here.
So the memwrite TLPs would come in,
and we'd put them in our CMB.
And then when this guy had issued his last memwrite
for the DMA, he raises an MSI-X interrupt.
The NVMe driver gets called back,
it handles the completion. At that point, the application knows that that I/O has completed
successfully and data is on our CMB. What it then actually did is it actually sent us a command to
actually process that data via the compression namespaces and put the result back in the CMB at a different location.
So that's basically another NVMe command basically saying,
hey, take the data, put it there,
and then put it back in a different place in the CMB.
And obviously, there's multiple threads
doing this simultaneously to different parts of the CMB.
So we've allocated the CMB in different chunk sizes
to different threads,
so nobody can sit on top of each other, and nobody can DMA over the top of others. And we use the
NVMe completions to know when everything's done and run that through pthreads in a thread-safe
way. And then once the data is compressed, we basically issue a completion and raise an interrupt and tell the application, hey, we've done the compression, and by the way, the data is where you asked us to put it, somewhere else in our CMB.
And then what it can do is the drive can issue NVMe write commands to the output drive, pointing again to our CMB, to the compressed data, and the compressed data gets written to the EXT4 file system on the output.
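Putting that whole flow together, here is a heavily hedged sketch of the data path: read a chunk into CMB-backed memory, ask the compression namespace to compress it within the CMB, then write the result out, with the payload never touching system DRAM. The device names, the vendor opcode 0x9A, the CMB layout, and the way the compressed length comes back are all illustrative assumptions, not Eideticom's real interface.

    /* Sketch of the read -> compress -> write path through the CMB. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    #define CHUNK (1 << 20)

    int main(void)
    {
        int in  = open("/mnt/in/data.raw", O_RDONLY | O_DIRECT); /* ext4 on the input drive */
        int out = open("/mnt/out/data.gz", O_WRONLY | O_CREAT, 0644);
        int nl  = open("/dev/nvme3n2", O_RDWR);     /* hypothetical compression namespace */
        int p2p = open("/dev/p2pmem0", O_RDWR);     /* hypothetical CMB window */
        if (in < 0 || out < 0 || nl < 0 || p2p < 0) { perror("open"); return 1; }

        /* Two slots inside the CMB: one for input data, one for compressed output. */
        uint8_t *cmb = mmap(NULL, 2 * CHUNK, PROT_READ | PROT_WRITE, MAP_SHARED, p2p, 0);
        if (cmb == MAP_FAILED) { perror("mmap"); return 1; }

        ssize_t n = read(in, cmb, CHUNK);           /* 1: input drive DMAs into the CMB */
        if (n <= 0) return 1;

        struct nvme_passthru_cmd comp = {           /* 2: compress CMB -> CMB */
            .opcode   = 0x9A,                       /* hypothetical vendor opcode */
            .nsid     = 2,
            .addr     = (unsigned long long)(uintptr_t)cmb,
            .data_len = (unsigned int)n,
            .cdw10    = CHUNK,                      /* e.g. offset of the output slot */
        };
        if (ioctl(nl, NVME_IOCTL_IO_CMD, &comp) < 0) { perror("compress"); return 1; }

        size_t clen = comp.result;                  /* pretend the engine returns the size */
        if (write(out, cmb + CHUNK, clen) < 0)      /* 3: compressed data out to the drive */
            perror("write");

        munmap(cmb, 2 * CHUNK);
        close(in); close(out); close(nl); close(p2p);
        return 0;
    }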
So given the FPGA that we had here,
we can only get three compression cores on that,
and we can do about one and a bit gigabyte per second
of input data compression per core.
So if we had a bigger FPGA, we could go more.
Also, they wanted U.2
because they were deploying in this kind of server, so
U.2 was a good form factor, in which case
you have a PCIe Gen 3
by 4 limitation until
we get Gen 4 processors,
which are starting to appear,
but AMD
are on a path to Gen 4.
I think Intel are on a path to Gen 5 at this point,
but I don't know. I haven't seen their
roadmaps.
Question?
So you said you fit three instances, three GZIP instances, into that particular FPGA.
Can you say whether it was LUT limited or block RAM limited?
That's a good question.
I think it was actually reasonably balanced,
and I think it was really more just closing place and route,
given that we were at like 75% resource utilization.
And was each instance encode and decode?
That's a good question.
So in this particular one I showed you, they were all encode,
and the customer's doing the decompression in software,
just because it's much less expensive.
But we have other customers who want us to do compression,
but they're more concerned about data corruption,
so we actually put decompression in,
so we basically do a SHA on the input data,
and then we compress the data,
and then we decompress that data,
generate a SHA, and make sure the SHA match.
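A software model of that verify scheme, just to show the idea (the product does it inline in RTL); this sketch uses zlib and OpenSSL purely for illustration and is not the hardware path.

    /* Hash the input, compress, decompress, hash again, and only trust the
     * compressed block if the hashes match. Build with: cc verify.c -lz -lcrypto */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>
    #include <openssl/sha.h>

    static int compress_and_verify(const unsigned char *in, uLong in_len,
                                   unsigned char *out, uLongf *out_len)
    {
        unsigned char sha_before[SHA256_DIGEST_LENGTH];
        unsigned char sha_after[SHA256_DIGEST_LENGTH];
        static unsigned char scratch[1 << 20];      /* decompression scratch buffer */
        uLongf scratch_len = sizeof(scratch);

        SHA256(in, in_len, sha_before);             /* hash the original data */

        if (compress2(out, out_len, in, in_len, Z_DEFAULT_COMPRESSION) != Z_OK)
            return -1;

        if (uncompress(scratch, &scratch_len, out, *out_len) != Z_OK)
            return -1;

        SHA256(scratch, scratch_len, sha_after);    /* hash the round-tripped data */

        /* Only trust the compressed block if lengths and hashes agree. */
        if (scratch_len != in_len ||
            memcmp(sha_before, sha_after, sizeof(sha_before)) != 0)
            return -1;

        return 0;
    }

    int main(void)
    {
        const unsigned char msg[] = "hello hello hello hello";
        unsigned char out[1024];
        uLongf out_len = sizeof(out);

        if (compress_and_verify(msg, sizeof(msg), out, &out_len) == 0)
            printf("verified: %zu -> %lu bytes\n", sizeof(msg), (unsigned long)out_len);
        return 0;
    }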
There's no dedupe in here, right?
That particular one does not,
but we do have dedupe engines, yeah.
Question?
Yes.
When you basically have commands with pointers
before and after compression in the CMB, are there two different PCI BARs? No, no, just one BAR.
So the peer-to-peer DMA framework
is responsible for allocating memory out of that bar.
And it's a very safe, well-tested kernel allocator.
So it's not going to give two pages to two different...
Or sorry, the same page to two different processes,
so you're not going to end up with two devices
trying to DMA over the top of each other.
And then the driver,
you know,
when you're done with the memory,
when you free it,
it goes back to the allocator
and the idea is, obviously, the allocator can fail.
If you've allocated all the memory
and somebody comes and says,
can I have some more pages, please?
They're going to get a -ENOMEM.
And they'll have to handle that.
The application has to handle that error somehow.
Yes?
So your comment about decompression being faster in software,
is it because your FPGA was too small to fit more
into the FPGA?
No, so I mean, deflate is a pretty simple algorithm, right?
So it's just not very computationally intensive
on any processor.
So the compression algorithm,
because of the static Huffman table,
or dynamic Huffman in this case,
and some of the other things,
zlib in particular is quite onerous
on something like an x86-64 instruction set.
So thank you.
So it made sense to target that.
And this particular customer did want zlib compatibility.
There are other compression algorithms that get reasonably good compression without being as taxing on the processor.
So LZ4, for example.
But they wanted zlib.
And there's obviously, everybody has their own favorite version of compression
at any given time of any given day.
In our particular product, we ship, we actually,
the U.2 has 8 gigabytes of DRAM on it.
We typically expose 512 megabytes as a CMB,
but only for the reason that it's a nice round power of two.
We can make it bigger or smaller, and, yeah, we do.
512 is a lot.
Can I make sure I heard that correctly?
You allocate 512 megabytes for a CMB?
For the CMB.
Yeah.
I don't think in this particular application
it even used a quarter of that amount.
Like, I think, you know, 16 megabytes would have been enough.
Think of it as a delay-bandwidth product thing.
Basically, your buffer needs to be
your throughput multiplied by the time
to write to the output.
So it's just a delay-bandwidth product.
Yeah.
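(As a rough illustration, with numbers assumed rather than taken from the talk: at about 3 gigabytes per second of compressed output and, say, 5 milliseconds to complete a write to the output drive, the delay-bandwidth product is roughly 3 GB/s x 0.005 s, which is about 15 megabytes per pipeline. That is why 16 megabytes would have done and 512 megabytes is very comfortable.)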
I guess we're looking at the example of one direct one that we could do this for that
other microphone.
Yeah.
So that's a good question.
Yeah.
So what about scaling?
So there's a couple of things that happen.
So one of them is, I think if anyone's going to XDF next week, we basically are doing this
demo scaled up there.
So basically more NVMe drives,
more NoLoads,
and more output drives.
And we're basically showing that
if you do that correctly,
you get linear scaling and performance.
Because you're using peer-to-peer,
there's no bottleneck anymore.
In theory, because everything's doing DMA,
I can partition my system in the
PCIe domain in a way that
I can just basically stamp out
this block as many times as I like.
This block being input drives,
no load, and output drives, and
my performance will just continue to scale
ad infinitum.
Are all those devices on the same PCIe
bus? Because at some point, if they were,
that would become the problem.
Yeah, the SSDs and the NoLoads
themselves.
So,
in this particular system,
and it'd be good to draw a diagram of this,
but basically, every single device
on its own
is connected to a single root port
on the AMD EPYC.
And the AMD EPYC was designed to have
great peer-to-peer performance between the root ports,
especially when those root ports reside on the same socket.
But even across sockets, they claim good peer-to-peer
performance.
As long as my atomic unit, as long as my atomic building
block connects to root ports that are always on the same die,
there is no contention between this atomic unit
talking to these root ports
and this atomic unit talking to those root ports, right?
Because they don't even need to talk to each other.
So that, in theory, could scale ad infinitum, right?
How do you make sure that the inbox driver
uses the same address?
That's what the peer-to-peer DMA framework
that we're upstreaming into the Linux kernel does.
So go take a look at the patches,
and that will definitively tell you exactly how that is done.
Basically, if you go take a look at our patches
to the NVMe pci.c file,
you'll see, well, it's not even in there,
it's in the block layer, we basically have the ability
to submit a scatter gather list,
which is backed in part by peer to peer pages.
The NVMe driver has no idea.
It has no idea it's doing a peer to peer DMA.
Because all you're doing is you're asking a DMA master
to start issuing some Memread or Memwrite TLPs.
That DMA master has no need nor should it know whether it's DMA in system memory or somewhere else in the system.
That's not its job to know that.
In fact, we don't trust these things one bit. That's why we have ACS in the first place.
Because we don't trust the endpoints to do shit.
We tell them what to do.
So the NVMe driver doesn't really have to be peer-to-peer aware, per se.
It just needs to pass through the PRPs
that are allocated from the PRP allocator,
which is done at the block layer.
Yes?
So when you've got the data in the CMB
and you want to run the compression command on it,
is there a driver down there that does that?
Do you send a command, or is that in hardware? Yeah, it's actually RTL.
And we're not going to even get to those slides.
But in the deck that will go online,
we actually have some numbers in terms of the RTL
that we wrote to do the compression engine.
So you're right.
It could have been a software compression engine.
To use which device? Oh, yes, yes, yes. So actually, the full version, which will go online,
compares what we're doing to the QuickAssist in a bit more detail. The problem with the QuickAssist
is, A, this is an AMD EPYC, so it doesn't have a QuickAssist. And B, normally the QuickAssist is on the PCH device,
which is DMI connected, and isn't necessarily a peer in the PCIe subsystem. So if you want to do
peer-to-peer with a quick assist, it's not going to work because the peer-to-peer is not peer-to-peer.
It's because the quick assist is hanging off the PCH, and normally your high-performance PCIe
is hanging off the processor core sockets themselves.
Now, you can also go buy a Quick Assist as a PCIe card,
and that could certainly have replaced us,
but that just means Intel make money and I don't.
So I'm not a big fan of that particular model.
Well, there you go.
So we'll do that.
We'll just stop my company and we'll sell more Quick Assist
because I'll do anything to make Intel happy.
Yeah, exactly.
And we do have some performance numbers
that compare to the Intel Quick Assist,
but we're not going to have anything like enough time
to do that.
So, yeah, we're not going to get into that.
Let's not bother.
So if you're really interested in the minutiae of the RTL we wrote to do the compression,
then feel free to come to me.
I can't even answer the question because Saeed's the best person to talk about the specifics of our compression core. We did do some comparisons to QuickAssist
and some of the other hardware-based compression research
that's been done,
and we feel like we have a pretty good compression core
in terms of compression ratio
and also throughput on standards-based corpuses
like Calgary Corpus and some of the other ones.
But for me, the story is not just
do we have the best compression in the world. For me,
it's much more do we have a framework that's pretty awesome. Compression is one example of
what I put in here. But imagine it's pattern matching or image processing or security or
data recognition or deduplication or whatever. Part of me doesn't even care what service is
behind here. And part of our business model is actually now we're starting to work with some partners
who want to use us as the platform but push their own accelerators, not developed by us,
but push their own accelerators that are maybe based on SDAccel, and map them in as NoLoad namespaces.
In which case we become essentially a platform on which other people can innovate
right
Yeah, so our abstraction is very similar
to SDAccel. So if
somebody has an accelerator that's already
SDAccel, so anyone who's doing Amazon F1
work, for example, that's pretty much
SDAccel,
they can take that accelerator and they
can pretty seamlessly, without
us really having to look at any of their proprietary code,
we can turn that through a wrapper process into something that,
assuming it fits in the FPGA,
would be deployable behind this NVMe front end.
So that's kind of where we're going.
So I'm sorry I didn't get it to all the compression stuff,
but all this other stuff.
Yes?
What is the ratio of scalability when using one?
If you have multiple drives on a server,
how many drives can you use to offload the CPU?
Yeah, so the question there was around scalability.
In terms of how many drives can...
Obviously, we have some pretty hard limits here, right?
So we have a PCIe Gen 3... Well, we have a PCIe Gen 4 by 4 interface here,
but the servers in which we deploy today are Gen 3,
so we don't get the Gen 4 benefit, but there is a throughput thing there.
You know, the actual throughput that we can hit is always a function of the size of the FPGA.
So some of our customers, while they like this
form factor, don't necessarily like this FPGA. And they can work with people like Alan to see
if they can fit a bigger FPGA here or a smaller one or redesign their own U.2 form factor. So
that's always going to be a function. There is a DDR bandwidth issue inside here. You don't get
that for free. But there's certain tricks you can play around caching and on-chip memory that can come to bear there. But the other thing is, once you've
got something that works, and let's say this is working at three gigabytes per second, if I can
just replicate it, if I can just tile it over here, I get six gigabytes. If I can tile it again, I get
nine. If I tile it again, there's no interdependencies between these units.
The only final interdependency, thank you,
would be if my processors couldn't issue any more I/O.
But that's an awful lot of I/Os.
That would be really the main limitation that would come into play.
All right.
Thanks for your time.
We're getting kicked out.
Birds of a feather tonight on NVMe.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers
in the storage developer community.
For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.