Storage Developer Conference - #115: Accelerating RocksDB with Eideticom’s NoLoad NVMe-based Computational Storage Processor

Episode Date: December 4, 2019

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast, episode 115. I'm going to talk a little about the value proposition of computational storage as I see it. To be honest, today I want to talk a little more about software consumption than the actual specifics of the title of this talk,
Starting point is 00:00:53 which is one specific instance where computational storage at least provided some value. I think we had a really animated, shall I say, birds of a feather last night. And, you know, among the rambunctiousness, there were some very good points made by some very serious people around, you know, why on earth would I do this? Why don't I just buy a bigger Intel processor? You know, and somebody just commented there's no such thing. At some point, you can't buy a bigger one because you're literally buying the biggest one that there is.
Starting point is 00:01:28 But it's a fair point, and one of the things I say to a lot of people right now, and people ask me, why do you work with Scott and Thad? Why did you guys get together and try to create? And I'm like, because my biggest competitor right now isn't any of them. It's just an Intel Xeon. And every single account that I try to sell into, what it always comes down to is an Excel spreadsheet
Starting point is 00:01:49 of this is what I pay CapEx and OpEx-wise with an Intel Xeon, and this is what I'll pay if I do it with you guys. And if the numbers come out positive, they're going to buy my stuff. And if the number comes out negative, they're not going to buy my stuff. And it's literally that simple.
Starting point is 00:02:06 So the value proposition we're finding is there. Now, is it there every single time? No. And is it going to be there for every single application? No. But there are definitely instances that we're finding, and I'm finding them, and Thad's finding them, and Scott's finding them, and other people are finding them,
Starting point is 00:02:24 where the Excel spreadsheet comes out with a positive number. And these people are rational human beings. They're like, this will cost me less. This makes sense, right? And sometimes it doesn't. So that's kind of talking about that. I'm going to talk a little about some of the things I think we're trying to achieve with this technical working group.
Starting point is 00:02:45 One of the things that is very important to me is how that interface between what we're trying to do in the hardware level with accelerators and offloads and in-situ processing, how that interfaces with operating systems and applications, the software consumption model for computational storage. And so one of the things that we're trying to do this time around is, like I said, all of this has been done before. We've been using accelerators for a long time. We've been using accelerators in storage for a long time. And even prior to this, we have tried to standardize some of that. But again, this time I think we are really trying to have a concerted effort to standardize some aspects of this
Starting point is 00:03:28 so that there is vendor interoperability where it can be there and where basically we can get support into operating systems and open source user space libraries to make it easier for end users to deploy this stuff. Because if everyone's proprietary and everyone's bespoke, you end up with this cabling nightmare like this poor gentleman
Starting point is 00:03:48 here, and it gets very painful. So let there be light, and now we've had this light since about FMS last year, so just over a year. We've had the technical working group working for about 10 months. Last night there was a lot
Starting point is 00:04:03 of discussion around, show me the specifics of the spec. You know, unfortunately standards don't happen that quick. Looking at a bunch of standards, people nodding their heads, right? We're moving. Do I wish we were moving faster? Yes. You know, I think everybody would like that.
Starting point is 00:04:18 But standards just take time. Somebody was saying, you know, how long did it take to go from Fusion-io to NVMe, right? It took a while, an awful long time, right? And so these things, especially in the storage world, tend to take a while, right? But at least we have a lot of good people, and I think a lot of the right people, talking to each other, trying to make some of that happen. There's a speed limit on that, right? This one here, that was a birds of a feather at Flash Memory Summit.
Starting point is 00:04:58 We hadn't decided definitively at that meeting that SNIA was the right home. That was the meeting where we decided to approach SNIA. So SW, it's a good clarification, SNIA came after that. And I apologize for this weird thing in the bottom of my slides, but I'm not going to try and fix that because I'll just break something else.
Starting point is 00:05:16 So what are we trying to achieve? What are the benefits of computational storage? Scott did a great job earlier talking about some of them. The way I think about it is certain tasks are better done by an accelerator than they are by an instruction set architecture. There's certain things, compression, encryption, maybe artificial intelligence inference, maybe video transcoding.
Starting point is 00:05:38 There's definitely cases where doing something using an instruction set-based approach is less efficient in terms of picojoules per bit, which turns into watts once you've normalized by time. Doing that on a Xeon versus doing it on an Intel QuickAssist, or versus doing it on an NVIDIA graphics card, versus doing it on Fazil's accelerators on a SmartNIC, or doing it on our solution, or doing it on Scott's.
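To make that picojoules-per-bit comment concrete: energy per bit multiplied by throughput is power. The 1 pJ/bit and 100 Gb/s numbers below are purely illustrative, not figures from the talk.

```latex
P = \frac{E}{\text{bit}} \times \frac{\text{bits}}{\text{s}}
\quad\Longrightarrow\quad
1\,\text{pJ/bit} \times 100\,\text{Gb/s}
= 10^{-12}\,\tfrac{\text{J}}{\text{bit}} \times 10^{11}\,\tfrac{\text{bit}}{\text{s}}
= 0.1\,\text{W}
```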
Starting point is 00:06:12 And again, it's an Excel spreadsheet. It depends on the application, it depends on the transformation you're trying to do, and it depends on the energy efficiency of both the CPU and the accelerator. But you can tell when you go out into the data centers today and look at all the different heterogeneous accelerators that there's definitely people crunching these Excel spreadsheets and making the decision to go, I can't run all the code on Xeon. And I apologize to AMD and ARM, but I'm just going to keep saying Xeon because that's what 99.999 recurring percent of the market cares about. So, you know, there's always going to be certain cases where it is just better to keep it on the Xeon.
Starting point is 00:06:52 And there's also cases sometimes where the customer is like, well, you know what? I got the Xeon, and I got it anyway. And I got the one I want because it was the one that they gave me a deal on. And I got spare cycles anyway. So what do I care? And those are very hard accounts to win, right? But it's hard to compete with free or I've already bought it already.
Starting point is 00:07:11 But there's many, many other cases where we're finding people are going, well, you know, to be honest, either I can't do what I want to do with the Xeon because it just isn't even capable of it, or it's just not very energy efficient and I want to reduce that. So taking certain computational tasks and moving them off a Xeon processor
Starting point is 00:07:28 and putting it onto some other processor, now whether that's an ARM in an SSD or whether that's an FPGA or whether that's a GPU, that's almost like a second order thing. It's the fact that they've made that decision that it's more efficient to do that certain operation somewhere else. And that's one part of computational storage. The other one that Scott also covered well earlier today is the reduction of data movement. So if I can have that second point of computation be closer to my data, and that could be on a smart NIC if you want to process data that's ingesting from
Starting point is 00:08:05 a network. That could be on an SSD if you want to process data that's on the NAND and you don't want to have to go out the interface. It could be a computation element on a storage array that processes across all the drives that are in that array, right? It could be on a CXL or a CCIX card with persistent memory. It could be any of those places, but it's somewhere that reduces the amount of data movement required. I don't have to take all my data, DMA it or network it to some host memory and process it there.
Starting point is 00:08:39 And again, that's essentially picojoules per bit. PCIe lanes consume power. Moving data consumes power. And power is very important to a lot of my customers. And they have Excel spreadsheets. And they negotiate their electricity rates. And they build their own power plants. So power is very important.
Starting point is 00:09:02 And again, that's something you put in an Excel spreadsheet. There are other benefits to not moving data that don't get discussed, and they're a little harder to convey to customers sometimes. One of these is that sometimes you've added memory to your Xeon processor not because you needed the capacity, but because you needed more memory bandwidth, right? So typically, you know, you'll get a number of different channels on these Xeons or these AMD devices, and you can populate them in different ways, and depending on how much DRAM you put in, sometimes you add DRAM to get more capacity of your host memory, and sometimes you actually just do it because you need more bandwidth. And so by removing data movement to that host memory,
Starting point is 00:09:47 sometimes you can make a volumetric reduction in DRAM, and sometimes you can actually just get away with fewer channels, which sometimes means a cheaper Xeon processor. Because some people buy Xeons for the memory controllers. They don't actually buy them for the CPU cores. Somebody nodding their head over there. And that happens quite a bit. There's another interesting thing that happens.
Starting point is 00:10:12 If you take one of Fazil's very high-performance RDMA NICs and you start ingesting a lot of traffic, and then if you go buy a whole bunch of NVMe drives from our friends at Samsung or Intel or SK Hynix or whatever, before long, you realize you now have 10, 20, maybe even 30 gigabytes per second of traffic just flowing through that network to storage path. And the processor may not even have to touch a single bit of that, because it's basically just saying, I need to ingest, and I need to push it onto block storage. And normally, we'll provide some kind of services on that data, but let's just
Starting point is 00:10:49 imagine the case where you don't. At this point, now you're basically consuming 20, 30 gigabytes of data, and that has to go, potentially, it doesn't go into the host memory, because you could have L3 caching, and now we start playing all kinds of fun games with DDIO. But the reality is with that amount of data, you're probably going to flood your caches. And so all that traffic is probably going out on the DRAM channels and it's probably coming back in again. So now it's 60 gigabytes per second of DDR traffic. And in certain applications, maybe that doesn't matter. This is a case where you might have to buy the Xeon just for the memory channels, not for the cores. But in other cases, you care a lot because there are actually applications running on those Xeon cores. For example, in hyper-converge.
Starting point is 00:11:35 So in hyper-converge, you now have the problem that you have VMs that you're probably renting to people who are probably measuring the quality of service. And they start noticing that because of this DMA traffic, when they try to go out to host memory to get a cache line, the quality of that DRAM access is non-deterministic, right? Because it's now fighting with all this DMA traffic. So we've actually had cases where we've been able to go in and use measurement tools on the memory controllers and measurement tools provided by PMON,
Starting point is 00:12:06 Performance Monitoring Counters. And you can actually show the customer: if I get rid of that data movement, the quality of service on the VMs accessing their memory footprint has just gone up by an order of magnitude. I've pulled the tails in by a factor of 20. Some people care about that. It's harder to
Starting point is 00:12:26 put that in an Excel spreadsheet. It does translate into better quality of service and perhaps a competitive advantage over your competitors. And it's of great interest to people in, for example, multi-tenant public cloud or hyper-converge, right? So there's nuances, things that happen there. But that data reduction definitely reduces energy, probably reduces memory requirements in both capacity and bandwidth, and provides additional quality of service. So that's a really important part of what we're trying to do with computational storage, is reduce that data movement. And then the third one, which is the one I'm really keen to focus on,
Starting point is 00:13:04 is to bring a vendor neutrality to this. So most of the accelerators that you can get today to do this kind of stuff, they're incredibly vendor specific. They're even vendor specific in terms of the PCIe interface requires a vendor specific driver. So there's no standardization around the PCIe interface. There's no standardization around discovery.
Starting point is 00:13:29 There's no standardization around configuration. There's no standardization around manageability. And then you go to the large consumers of accelerators and talk to them about this. This is a problem for them because companies will turn up once a week and say, hey, we just raised $50 million. We're doing an AI chip.
Starting point is 00:13:45 It's faster than everybody else's AI chip. And by the way, it has a completely nonstandard PCIe interface that your kernel team are going to have to download a terrible driver from our website under NDA and integrate that with their existing kernel, which runs live on all their critical services that they make trillions of dollars on. And by the way, the user space library is kind of written
Starting point is 00:14:07 in some weird language with some weird compiler that obviously doesn't work with anything that you currently do. Could you give us 10 of your software engineers to integrate this and deploy it so you can try out our hardware? And what do you think the Facebooks and the Googles and the Amazons of the world say to that? They say no, right?
Starting point is 00:14:24 Their software teams are oversubscribed by 200%. They don't have time. They don't, they will not touch their kernel. They certainly won't introduce third-party stuff into their kernel, right? That's insane. So they want some consistency here. And that, for me, is one of the important things that the technical working group needs to bring for this to be successful, is that consistency.
Starting point is 00:14:50 And the way that I say it, but I'm not allowed to say it because my VP of marketing tells me I'm not allowed to say it, is that I want to make computational storage consumable by idiots. I used that in a Google meeting, and the guy was like, did you just call me an idiot? I said, no, the idea is that if an idiot can do it, people who are brighter than an idiot can do it. But that isn't always necessarily true, as my wife will tell me when I try to do it. You know, one of the demos that we did at Supercomputing a couple of years ago was awesome because we were offloading compression in a file system. And basically, as we were plugging in more NoLoads, the fans kept going down. Because basically, it was getting more efficient and the UEFI code or whatever was detecting that things were running cooler
Starting point is 00:15:31 and it could power down the fans. And that's actually the one the customers enjoyed the best. If it had one NoLoad, the fans go down by 10%. Put in two, the fans wind down again. Three, etc. So the system was running more efficiently because of the computational storage. And I didn't have to do a lot of hard things to make that happen. And ideally, it's vendor neutral
Starting point is 00:15:55 so that the hardware people can fight over who has the best hardware, but the software people don't have to worry about vendor A's code being different to vendor B's code to being different to vendor C's code. So what do we actually do? We basically use, leverage, abuse NVMe. That was a debated term last night.
Starting point is 00:16:19 And we present accelerator functions through a standard NVMe endpoint. And this is some marketing bullshit. Apparently we're NVMe compliant or whatever that means. The key thing is, the way I look at it, our engineering team is kind of split in two. We have some people, or sorry, the hardware part of our engineering team is split in two. And basically some groups work on a very high-performing NVMe interface that we've done ourselves, and we've made some modifications that make it more appropriate for computation and storage.
Starting point is 00:16:52 But at the end of the day, it is an NVMe front end. NVMe, I'll talk about in a minute, is pretty awesome for talking to accelerators, just like it's awesome for talking to storage. And then the other part of the hardware team are basically a whole bunch of super geeks who work on these different types of accelerators. And we tend to do things in RTL. We've tried higher-level things like HLS and OpenCL. But typically what our customers care about
Starting point is 00:17:19 is how many gigabytes per second can we process per LUT or per gate, and how many picojoules per bit. And the way you get those numbers where you need them to be is handcrafted RTL. It's not the easiest to work with and normally requires bright PhD people, but we have those, and that's what they do. And the great thing is no matter what accelerators we come up with, they're always exposed through NVMe. So that part of the story stays the same. And that,
Starting point is 00:17:46 for me, is very important. So as we bring in other acceleration functions, or you do, or somebody else does, the consumption model from a software point of view stays the same. And it's aligned with a protocol that many, many of our customers are already very comfortable with because they're deploying NVMe SSDs at scale. These are people who are deploying tens of thousands, if not hundreds of thousands, of NVMe devices, and we're just another NVMe device. We just don't store data.
Starting point is 00:18:16 We process it. I'm probably not going to walk through this. I tend to, whatever. So why NVMe? Before this startup, I was in the CTO office of PMC-Sierra and then Microsemi. We were doing a lot of work on NVMe. We did an SSD controller that we were very successful with,
Starting point is 00:18:43 and hyperscalers and storage OEMs used that controller to build SSDs. So I spent a lot of time with those customers, a lot of time. And one of the things I kept seeing is that this NVMe protocol seemed like it was so good, it should really be used in a broader way. NVMe is a transport as opposed to just for storage. And when you look at what an accelerator requires versus what NVMe provides, the matchup is awesome. So low latency, Yep. NVMe is a very low latency protocol. It has to be because it's talking to NAND. And even more
Starting point is 00:19:11 recently, it's now talking to things like Optane, which are not NAND. So that makes them AND. That's my Boolean joke. Many of you will have heard that one before, but I love it. So I keep using it. High throughput. You know, NVMe is pretty darn efficient. Now, there's definitely a couple of things, like WD had an interesting kind of twist on NVMe that was more efficient, but it's still very efficient. I can get a lot of NVMe commands
Starting point is 00:19:36 through a single thread on a Xeon processor. I think there was a talk on Monday that was like 10 million IOPS on a single core. So if all I'm doing is issuing and processing NVMe commands, I can get a lot of those done per second, and I don't necessarily need to use a lot of cores to get that. So NVMe is soft touch. It's lightweight.
Starting point is 00:19:55 The people working on the NVMe driver, like, I couldn't hire that team even if I wanted to, and I had infinite resources, right? So, you know, we have Christoph. I don't know if he's in the room. We've got Sagi. We've got Keith. The team that work on the NVMe driver in Linux and other operating systems are world class.
Starting point is 00:20:16 And they write really, really good code. And they care about the hot path. How many IO per second and how many CPU cores does it take me to get that IO per second? Efficiency, efficiency, efficiency. Multi-core aware. So NVMe was designed from the ground up to be aware of the fact that processors are no longer single-threaded
Starting point is 00:20:36 that run really, really fast. Has anyone got a 12 gigahertz Xeon? If you did, you would have a fire, right? We've gone multi-core and if you look at the new Rome, I think the new Rome has 8,000 cores. I don't know. I lost count after 100 and whatever. So having a protocol that understands that,
Starting point is 00:20:56 that has multiple queues, then queues can be assigned to the threads that run on different cores, makes a lot of sense. So it scales on a multi-core environment. And quality of service. So we have NVMe as a very rich queuing structure. You can have quality of service on the queues. We have things, you know,
Starting point is 00:21:13 we have all kinds of interesting things that we can either do today or we will be able to do in the future around how do I prioritize traffic that's going into an NVMe controller. Do I have certain queues that are more important than other queues? Can I have priorities within a single queue?
Starting point is 00:21:29 These are all things that NVMe understands because these are things we care about from the storage side, but they also happen to translate incredibly well over to that thing. So the real question is, why not NVMe? We could invent something from scratch, but it would take a long time, and it would look a lot like NVMe. So I want to talk a little about how we use NVMe today inside of Eideticom, which is not standard. And I'm not even proposing that the way we do it becomes the standard. I'm not going to be that hubristic.
Starting point is 00:22:03 But I'm going to help kind of describe what we're doing and hopefully the working group and others can provide feedback on what parts suck and what parts don't suck quite so much. So, presentation. If you look, if you put one of our devices, and you're welcome to buy several from me at any time, into a system, a system running Linux. And if you do lspci, you will see a PCIe device that has the Eideticom vendor ID. And we just use standard PCIe to recognize it. And then as the operating system boots up,
Starting point is 00:22:41 it will look at the class code of that device and go, oh, this thing is telling me it's an NVMe drive. Now, I guess I bind it to the NVMe driver. I don't have any UDEV rules telling me to do something else. So I'll just bind to the NVMe driver. So we don't make any modifications to the kernel. We use the inbox NVMe driver. And what happens then is the accelerators basically turn up as namespaces, NVMe namespaces behind that controller. And if you have a lot of accelerators, you'll see a lot of namespaces,
Starting point is 00:23:12 and there's some games that we play there, but that's the basic idea. Now, the thing is that NVMe as it stands today has no concept of accelerator namespaces. We've talked a little publicly about computational namespaces, but they are in no way officially a thing. So what we do today is, if you look at the namespace identify field in NVMe, you see that the wise people who work on the standard saw that there was a need for a vendor-specific field
Starting point is 00:23:40 in that identifier, and so we can get to put whatever we like in that vendor-specific field. So we put in what our accelerators do. So now you know you have an Eideticom device. You can go, well, I'll do a namespace ID and get all the namespaces, and then I can go to each of those namespaces and say, I know you're an Eideticom device, so tell me exactly
Starting point is 00:24:01 what your namespace does. And we'll say things like Zlib compression or erasure coding or whatever. And the great thing is that now you have a path to a user space library or even an internal consumer who can say, well, if it's a Zlib compression engine, I'm going to use that to do Zlib compression.
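As a concrete illustration of that discovery step, here is a minimal user-space sketch using the Linux NVMe admin passthrough ioctl to read the Identify Namespace structure and peek at its vendor-specific tail. The device path, the namespace ID, and especially the meaning of the vendor-specific byte are assumptions for this example; Eideticom's actual encoding is not public and is not what is shown here.

```c
/*
 * Hypothetical sketch: read the Identify Namespace structure through the
 * kernel's NVMe admin passthrough ioctl and look at the vendor-specific
 * bytes at the tail of it. The single "type" byte decoded below is purely
 * illustrative -- it is not Eideticom's actual vendor-specific layout.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

#define NVME_ADMIN_IDENTIFY   0x06
#define IDENTIFY_NS_VS_OFFSET 384   /* vendor-specific area of Identify Namespace (NVMe 1.3) */

int main(void)
{
    uint8_t id_ns[4096];
    int fd = open("/dev/nvme0", O_RDONLY);        /* controller character device */
    if (fd < 0) { perror("open"); return 1; }

    struct nvme_admin_cmd cmd = {
        .opcode   = NVME_ADMIN_IDENTIFY,
        .nsid     = 1,                            /* namespace to inspect */
        .addr     = (uint64_t)(uintptr_t)id_ns,
        .data_len = sizeof(id_ns),
        .cdw10    = 0,                            /* CNS = 0: Identify Namespace */
    };

    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) != 0) {
        fprintf(stderr, "identify failed\n");
        return 1;
    }

    /* Pretend the first vendor-specific byte encodes the accelerator type. */
    uint8_t type = id_ns[IDENTIFY_NS_VS_OFFSET];
    printf("namespace 1 vendor-specific type byte: 0x%02x (%s)\n",
           type, type == 0x01 ? "zlib compression engine (hypothetical)" : "unknown");

    close(fd);
    return 0;
}
```

The point is only that everything needed for discovery already exists in stock NVMe plus the inbox driver.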
Starting point is 00:24:25 How do you configure these things? Well, NVMe has admin queues. Every controller must, must have an admin queue. And those admin queues have vendor-specific commands. So without abusing the NVMe spec, but by doing something that's vendor-specific but still within NVMe, we can configure our namespaces today. We can set compression levels. We can insert Galois field coefficients for erasure coding. And all of that can be done through the standard NVMe admin path, which has a user space interface.
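A hedged sketch of what that configuration path can look like from user space is below. The opcode and the meaning of the command dwords are invented for illustration (the NVMe spec only reserves admin opcodes 0xC0-0xFF for vendor-specific use), so this is not the actual Eideticom command set.

```c
/*
 * Hypothetical sketch: configure an accelerator namespace with a
 * vendor-specific admin command. The opcode (0xC1) and the meaning of
 * cdw10 are invented for this example; the NVMe spec only says admin
 * opcodes 0xC0-0xFF are vendor specific.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

static int set_compression_level(const char *ctrl, unsigned nsid, unsigned level)
{
    int fd = open(ctrl, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    struct nvme_admin_cmd cmd = {
        .opcode = 0xC1,     /* vendor-specific admin opcode (illustrative) */
        .nsid   = nsid,     /* which accelerator namespace to configure */
        .cdw10  = level,    /* e.g. a zlib level of 1..9 (illustrative) */
    };

    int ret = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
    if (ret != 0)
        fprintf(stderr, "vendor admin command failed (%d)\n", ret);
    close(fd);
    return ret;
}

int main(void)
{
    /* Ask namespace 2 of controller nvme0 for a hypothetical "level 6" compression. */
    return set_compression_level("/dev/nvme0", 2, 6) ? 1 : 0;
}
```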
Starting point is 00:24:44 And then, obviously, how you consume this is very important. So right now, what I would love to have is NVMe commands that are tuned for computation. So maybe pointer to pointer. Maybe some other things. But I don't get that yet.
Starting point is 00:25:07 I don't get to write the NVMe spec. So I've got to live with what's available today. And NVMe commands right now are basically things like reads and writes. But that's okay. I can use writes to get data in to the namespaces, and I can use reads to get results out. And it gets a little bit funky when the output size is different than the input size, but it's not that funky.
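Purely to illustrate that write-in, read-out flow, here is a minimal sketch against a hypothetical accelerator namespace. The device path, transfer size, and the idea that offset 0 is both the input and output window are assumptions for the example, not the real NoLoad data-path conventions.

```c
/*
 * Hypothetical sketch of "writes to get data in, reads to get results out":
 * push a buffer at an accelerator namespace with pwrite() and pull the
 * transformed result back with pread(). Offsets, sizes, and the notion
 * that offset 0 is both the input and output window are invented here;
 * the real NoLoad data-path conventions are vendor specific.
 */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t chunk = 128 * 1024;              /* illustrative transfer size */
    char *in  = aligned_alloc(4096, chunk);       /* block I/O generally wants aligned buffers */
    char *out = aligned_alloc(4096, chunk);
    if (!in || !out) return 1;
    memset(in, 'A', chunk);                       /* stand-in for the data to be transformed */

    int fd = open("/dev/nvme0n2", O_RDWR | O_DIRECT);  /* hypothetical accelerator namespace */
    if (fd < 0) { perror("open"); return 1; }

    /* 1. DMA the source buffer down to the accelerator namespace. */
    if (pwrite(fd, in, chunk, 0) != (ssize_t)chunk) { perror("pwrite"); return 1; }

    /* 2. DMA the (possibly differently sized) result back up. */
    ssize_t got = pread(fd, out, chunk, 0);
    if (got < 0) { perror("pread"); return 1; }

    printf("pushed %zu bytes, pulled %zd bytes back\n", chunk, got);
    close(fd);
    free(in);
    free(out);
    return 0;
}
```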
Starting point is 00:25:30 And if you've got a bunch of bright engineers like we do, you can find a way of solving that. And so now I have NVMe compliance, and I also have the ability to do things either in kernel space or user space to discover, configure, and use those computational resources. And like I said, I'm not proposing in any way that this is the way that the TWG needs to go, but it is an example of something that's out there. And we can learn from it, and we can do things that make sense
Starting point is 00:26:02 and other things that don't. A couple of other things that are interesting, because these are NVMe namespaces and the interface is fabric agnostic, we can literally have a PCIe NoLoad, or I can put the same NoLoad, and I've done this with Fazil from Broadcom, I can put it on the other side of an NVMe over Fabrics link
Starting point is 00:26:23 and expose it over the fabric, can put it on the other side of an NVMe over Fabrics link and expose it over the Fabric, and the software on the host neither knows nor cares that that device is no longer local. And I can use NVMe discovery to find it, and I can consume it. So now I have accelerators over Fabrics. So now I could build a rack-scale architecture
Starting point is 00:26:39 where the accelerators sit at the top, I use Ethernet and NVMe over Fabrics, and my compute nodes don't have any accelerators. They just borrow them when they need them. And with namespace management commands, I can program the bit file on the FPGA, if it's an FPGA, in order to get it to do the functions I want
Starting point is 00:26:57 before I borrow it. And all of that is with NVMe. And that's today. That's NVMe as it is today. As we think about adding computation, we can obviously take that in a lot of different ways. One of my favorite questions when a customer asks it is, what's your management story, Stephen? Because I get to say NVMe MI, and I have yet to have a case where that has not raised a smile in my customer. Like every single time. Because
Starting point is 00:27:23 they're like, that's awesome because I've literally got 20 engineers who are writing NVMe MI code right now because I'm literally transitioning to NVMe or I'm kind of in that phase and I just get you for free because you're just an NVMe device. I've got another good story where one of these AI accelerator companies went to a large hyperscaler and the data path stuff looked really good. And then my friend asked them, so what's your management story? And they were, oh, we don't have a management interface. They had built a chip with no management interface, and they expected to sell it to a hyperscaler who was going to, if they were going to deploy it, would deploy it in the tens of thousands.
Starting point is 00:28:02 Management is so important for the consumption of accelerators at scale. Security, we've talked quite a lot about security within the technical working group. Obviously, NVMe is designed to store people's data, and it's being used in public cloud. So NVMe is being deployed in VMs, public rentable VMs. So NVMe is very security conscious already.
Starting point is 00:28:28 And the customers, the big customers of NVMe, are continuing to push features into NVMe to try to make it even more secure. So there are security things that are in there. And there's certainly more that we're going to have to think about. But there's definitely things that we can leverage there today. Long term, like I said, where this goes from a standards point of view, I'm not sure. But this is what we do in production today.
Starting point is 00:28:56 So just for fun, I always like to have a little bit of code. So this is an example of, I don't know if Keith's here, but Keith Busch kind of started this tool, the NVMe command line interface, NVMe CLI. Maybe a lot of you have used that to kind of look at your NVMe devices. So this is the NVMe CLI output for a version of NVMe CLI that we have where we have a plug-in. So many vendors have plug-ins for NVMe CLI. That's not unusual. Intel and Micron and WD and most of the major vendors have a plug-in to do kind of vendor-specific stuff. But we can use that command with our plug-in
Starting point is 00:29:34 to, for example, get what type of namespace we have. So here we actually have a means of communicating to the user space that there's a couple of namespaces. And I use a UDEV rule here to change the name of my computational storage. That's just a UDEV rule. It's not a big deal. But this is just saying that there's a – I made this vendor neutral. I don't quite know why.
Starting point is 00:30:01 There's two namespaces on this device. The device is made by vendor A. That's because I don't want my customers to know they have to call me. Namespace, a number, and then here I've actually pretended we do have namespace types. And I said this is a conventional LBA-based namespace, so this would be a storage-centric namespace. And this one's a computational namespace. And then we can have subtypes. So we know this is a computational namespace. So for storage namespaces, maybe we print out the format that it's been formatted in. But for computational types, maybe we say what type of computational namespace it is,
Starting point is 00:30:38 whether it's compression or artificial intelligence inference, or perhaps vendor-specific, because there will be a lot of vendor-specific computation where the main job of the operating system will be just pass it up to user space, and somebody up there will know what to do with it. So we can do all that. And again, this all works for both PCIe and fabric-attached namespaces. And we can leverage all the discovery mechanisms that currently exist in NVMe. And as a disclaimer, I did just want to make very, very clear, I am not proposing this in any way, shape, or form as the way the working group must go. I'm just being illustrative. So, you know, we're a software company.
Starting point is 00:31:20 I have no intention of doing hardware. And so what we deploy on today is basically different hardware that's typically FPGA. It doesn't have to be, but it tends to be FPGA based. But it can come in a range of different form factors. And because we align with NVMe, it kind of makes sense to go with NVMe-centric form factors. And a lot of our customers actually deploy on their own hardware, so we're actually in a form factor of their choosing. But some customers are deploying
Starting point is 00:31:53 on form factors like these. So the U.2 makes a lot of sense. Sometimes people want a lot of horsepower, so they'll go add-in card. And things like E1.S, we did a launch of that at FMS. But they're never our hardware. And I tell people we're not a hardware company. Like I would deploy on any form factor at all.
Starting point is 00:32:12 I don't really care. And for me, it's much more about the software and the consumption model through that software. That's interesting. Obviously, the differences from a customer-facing point of view is that typically these would be larger pieces of silicon that you can get more work done on. So you'd be able to get more of different types of acceleration capabilities here than you would here. But you're probably going to pay more for this one.
Starting point is 00:32:37 And so customers have to make decisions on what they want to do and so on and so forth. And this slide is much more interesting to me because this is the software side of things. So like I showed you a couple of slides ago, we don't typically, well, that's not true. We can do a model where we don't touch kernel space. And that was very important.
Starting point is 00:33:01 I made it very clear to the engineering team. We must have a mode where we can just have a user space library and use the Linux kernel or the operating system that the customer already has. And the engineering team ideally would not have made that decision, but I forced that as a technical constraint that they had to live within. Because that basically allows customers who want to, to just use a user space library that we call libnoload, and they actually do the namespace detection
Starting point is 00:33:29 and the utilization of those namespaces. And what we do on the top end of libnoload is basically say NoLoad compress, provide that as a binding to a function in the user space library that basically then calls down to the NoLoad and offloads the compression, right? So now their application can basically call NoLoad compress
Starting point is 00:33:47 rather than calling Xeon compress, and the compression happens down here by a DMA and a DMA back, or maybe a peer-to-peer to a drive if you want, but that's a different story for another day. And people can consume it that way. Now, there is no path for a file system down here to consume that. Well, there is, but it's not a very nice path. So we'll ignore that for now.
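For a sense of the shape of that user-space binding, here is a hypothetical libnoload-style wrapper. The name, signature, and the simplistic write-then-read protocol are assumptions for illustration, not the real libnoload API.

```c
/*
 * Hypothetical shape of a libnoload-style binding: the application calls
 * noload_compress() wherever it would have called a CPU compressor, and
 * the wrapper moves the data to the accelerator namespace and reads the
 * result back. The name, signature, and the simplistic write-then-read
 * protocol are assumptions for illustration, not the real libnoload API.
 */
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the number of "compressed" bytes placed in dst, or -1 on error. */
ssize_t noload_compress(int ns_fd, const void *src, size_t src_len,
                        void *dst, size_t dst_cap)
{
    /* Push the source buffer down to the accelerator namespace... */
    if (pwrite(ns_fd, src, src_len, 0) != (ssize_t)src_len)
        return -1;

    /* ...and pull the compressed result back up. */
    return pread(ns_fd, dst, dst_cap, 0);
}
```

An application would call something like this wherever it previously called a CPU-based compressor.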
Starting point is 00:34:17 So if a customer does want to consume us in kernel space, we do play in kernel space, and we have kernel developers who work for us, who will work with customers on that side of things. But in my view, the long-term vision is that we standardize something that we can push upstream into everybody's Linux, where we extend NVMe to support computation, and then we have that kernel space consumption. And that's something that we're going to be spending quite a bit of time looking at in the working group over the next little while. It doesn't have to be an operating system consumption model. It can be a user space consumption model. So we have worked with the SPDK community, and some of our customers do deploy us through SPDK. SPDK, I jokingly swear at it from time to time, but those guys know that I love them very much, and it can be a very useful consumption model. But again, very much aligned with the NVMe standard
Starting point is 00:35:06 as a framework for consuming those different resources. So after computational storage, everything is beautiful and light, and we are all happy, me, myself especially. So one specific point example. So I will actually talk to... I don't know what we're doing for time. Oh my God. Perfect. So we did an example around RocksDB. Here's some figures of merit in terms of we got six times more transactions per second. We were offloading compression. So
Starting point is 00:35:40 doing the compression on us versus the Xeon, we got more compression. It was more efficient. There was actually some quality of service metrics, et cetera, et cetera. There's Excel spreadsheets that will decide whether that made sense or not. We actually did that on a Xilinx Alveo card. And we did it in an AMD Naples-based system with a bunch of NVMe drives. And it was using the standard Linux operating system. And basically what we did is we used that libnoload. And we made, I think, 20 lines of code changes to upstream RocksDB. So basically there's a compression plug-in in RocksDB. And we just created a new one for NoLoad and then tied that in.
Starting point is 00:36:22 And basically we had it basically run the NoLoad compression rather than run the Snappy or whatever compression that was there before. So the software consumption model, in this case, is a user space one. So we have the application, which does have to be modified. So in this model, the application has to be modified. Now, the problem with libnoload is that libnoload is specific to Eideticom, so the customer has to make that modification,
Starting point is 00:36:49 and it's never going to be accepted upstream because it's not a standard. Well, maybe it would, but it's harder to upstream that. Now, if this was lib computational storage and multiple vendors were designing to it through a standard, we could openly collaborate on this and probably have much better likelihood of getting it upstream. And if somebody upstreamed it for RocksDB once,
Starting point is 00:37:09 that would be sufficient for all the vendors to enjoy it, right? So one person could update Hadoop and one person updates Cassandra. So that's one way of doing it. The other way of doing it is, you know, we could have done something in the kernel and we could have actually used a file system that supports compression like Btrfs or something like that, where we actually hide the compression from the application, because the application
Starting point is 00:37:35 is just talking to files through a file system. If the file system is doing the compression offload the application doesn't have to change. And I don't have a lot of slides on that but that is also a model. And I don't have a lot of slides on that, but that is also a model. And obviously what happens is RocksDB talks to our API bindings. Our API bindings basically talk to the device
Starting point is 00:37:53 through ioctl calls through the /dev/nvme device. We can also use POSIX commands if you want to do reads and writes. We are actually very much currently working on io_uring, which, for those of you who don't know, is just awesome. I'm not going to say any more about it. Go look it up.
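For readers who do want a taste, here is a minimal, self-contained io_uring sketch (using liburing, built with -luring) of an asynchronous write to a block device. The device path and buffer are illustrative, and nothing here is NoLoad-specific or part of any computational-storage standard.

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    int ret = io_uring_queue_init(8, &ring, 0);       /* small submission/completion ring */
    if (ret < 0) { fprintf(stderr, "queue_init: %s\n", strerror(-ret)); return 1; }

    int fd = open("/dev/nvme0n2", O_RDWR);            /* hypothetical accelerator namespace */
    if (fd < 0) { perror("open"); return 1; }

    static char buf[4096];
    memset(buf, 0xAB, sizeof(buf));                   /* stand-in for data to be processed */

    /* Queue one asynchronous write at offset 0 and submit it to the kernel. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    /* Block until the completion arrives, then check its result. */
    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) { fprintf(stderr, "wait_cqe: %s\n", strerror(-ret)); return 1; }
    printf("write completed, res=%d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```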
Starting point is 00:38:14 And that's another really cool way to go from user space to kernel space. And then down here, everything. So the storage is using the NVMe block driver, and we are going through the NVMe driver as well, and everything is kind of going through that. But the storage path is still going through a file system, and it's going through the block layer, and it's going through that. So this particular demonstration did not do peer-to-peer. I could talk another talk on peer-to-peer, but I'm not even going
Starting point is 00:38:36 to get, I'm not going near it today. It's awesome, and it's going to be great. We've done the same demo, exactly the same demo, with exactly the same software over Fabrics. So the demo that I showed you here, you can take exactly the same code, exactly the same software code, exactly the same kernel code, exactly the same user space code,
Starting point is 00:39:08 and rather than talking to a PCIe-attached NoLoad, you can put the NoLoad on the other end of an NVMe over Fabrics network, and the same software can run on the compute node without modification. So this is beautiful, because now you can Kubernetesify all of this and make it composable. And it all works today. And I don't, like, Bitfusion just got bought by VMware. They should have bought me instead.
Starting point is 00:39:31 I'm very upset about that. Because the way Bitfusion does it is crap, and the way that we do it is awesome. But anyway. But I'm not a marketing guy, so what are you going to do? But this disaggregation of accelerators and the composability of accelerators is going to be a big deal, I think, over the next little while. So if we can align with NVMe in a way that allows that to happen, I think that would be potentially very, very interesting in certain markets. The ability to compose your accelerators as well as your compute and your memory is very, very interesting. And of course, now we have things like CXL.
Starting point is 00:40:10 There's an entire another talk that I could do and I have done a couple of months ago on where CXL will fit into all of this. So it's all very beautiful. So to conclude, I'm only saying things that I'm hearing from the people I have the pleasure of talking to when I go and visit the large hyperscalers. That they are telling me, so this isn't me saying it in the sense that I'm making it up. But what I'm hearing from the large consumers of accelerators is that they want vendor-agnostic, consumable interfaces, software, and management stacks.
Starting point is 00:40:50 And the conjecture is that NVMe gives them all of that. This second line is marketing crap. So using the NoLoad framework today, we can either accelerate applications in user space, or we can make changes in the operating system and accelerate things like file systems and other internal consumers of accelerators. And like I said, this is just a point solution, but I think that there's things that could be learned from the way we're doing things today, and maybe there's good things and maybe there's bad things. And as an industry, we can work towards a place where there's also a vendor kind of interoperability around some of this or some incarnation of something like this. And so, you know, I think the work that the TWG is doing is super important, and I very much appreciate the amount of time and effort quite a few people have put into it.
Starting point is 00:41:42 And I hope that continues, and I hope other people continue to join us and get involved and help to shape what I think could be a very interesting future for computational storage. So thank you very much. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Starting point is 00:42:10 Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the storage developer conference, visit www.storagedeveloper.org.
