Storage Developer Conference - #90: FPGA Accelerator Disaggregation Using NVMe-over-Fabrics
Episode Date: March 25, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 90. So I'm Sean Gibb, VP of Software at
Eideticom. Stephen Bates, our CTO, is going to be talking a little bit later in the presentation,
but I'm going to start off kind of with an overview, set the stage for some of the discussion,
then I'll let Stephen finish up the discussion.
Today I'm going to be talking about FPGA accelerator disaggregation using NVMe over Fabrics. And to set the stage for how we're disaggregating FPGAs
and how we're using even NVMe to talk to an accelerator,
I'll kind of go through some high-level intro slides
into how we're using NVMe
and then some of the strengths of using NVMe
and that decision we made to use NVMe
as a protocol to talk to accelerators.
And then we'll dig into the meat of accelerator disaggregation using NVMe over Fabrics. Okay, so to begin with, just a very
simple slide on acceleration. Not much to say here that's new. Just, you know, we have a typical
system maybe where you have a host CPU, some kind of PCIe bus, just as a very high-level discussion.
And sitting on that bus, you might have NVMe SSDs.
You might have HDDs, GPGPUs, RDMA NICs, and then this little NoLoad accelerator card that we show in the bottom corner here. And what this NoLoad acceleration card is, is that's our product that I'm going to be talking about that is an NVMe offload engine.
And that NVMe offload engine is an NVMe-compliant controller that sits in front of a whole bunch of acceleration algorithms.
And it uses NVMe as the protocol to talk to those accelerators.
Now because it's an FPGA,
it can come in many form factors.
We have a U.2 form factor
that we ran some of the results
that we're going to be showing later in the presentation on.
That U.2 form factor is developed by Nallatech.
We have some commercial off-the-shelf FPGA architectures. We've tested our stuff on
Amazon F1 Cloud as well. So because we're an FPGA bit file, really it's quite easy to retool or
retarget to different FPGA platforms, ranging from, like I said, this U.2, which is a
more recent development accelerator platform. We have our higher performance accelerator platforms
that would live on a more traditional FPGA card, and then cloud services as well.
So why would we use NVMe for acceleration? Well, really what it boils down to from our perspective
is accelerators need several things,
and what they need is low latency.
You don't want the data to take forever
to get out to your accelerator
and then forever to get the data back.
You want it to be quick.
You want high throughput.
So you would like to nominally max the PCIe bandwidth
if your accelerator is living on a PCIe bus.
You want low CPU overhead,
so you don't want to tax your CPU
with the staging of data and retrieving data
or getting that data into and out of the accelerator.
We don't want to tax our CPU.
Multi-core awareness would be very important
in an acceleration algorithm.
And then quality of service awareness
as well.
Oh, there's a typo
on this.
This should say
NVMe, or
I don't know what it should say anymore.
I'm totally confused.
Yes,
NVMe supplies maybe. I've been up since three o'clock
your time because I flew down today, so my brain is not firing on all cylinders. So NVMe supplies
the following things. It supplies low latency. It supplies high throughput. It supplies low CPU
overhead, multi-core awareness,
quality of service awareness.
All of the things that we want for our acceleration,
NVMe supplies those things.
And then on top of it all,
it supplies something I didn't put on here,
but I put it down here.
It supplies world-class driver writers.
We're a small company, so we can write good software, but we're not a world
class driver writing team, and I don't think that I want to be that either. So because of that,
I would say the real question is why not use NVMe for accelerators?
So this is a view, and I would say that it's our NoLoad accelerator board,
but it's pretty common for just an NVMe device in general.
So what do we have here?
We have the host CPU with DDR attached,
and then across the PCIe bus we have our accelerator board,
and on that accelerator board we have an actual NVMe controller that we wrote on an in-house built RISC-V controller.
So that's the brains of our acceleration algorithm.
We've built all kinds of muscles to beef up the RISC-V performance where we need it.
But really, the heart of it is this RISC-V controller.
And then that board that we have has DDR, external DDR,
and in that DDR, a portion of that DDR
we have allocated to be a controller memory buffer.
And that's going to be very important
for a lot of the discussions we have today.
The controller memory buffer, or CMB,
plays a very important role
in a lot of the things we want to do
in terms of acceleration offload.
And in order to wrap all of this, I'll just come over to this side. So on this side, we have our PCIe controller and our DMA engines. And we have special DMA
engines that we've written in-house plus external DMA engines. Those allow us to talk to the CMB
very rapidly. We have our NVMe controller, like I said, RISC-V,
and then we have acceleration algorithms
that can be plugged in.
And I'm going to be talking about two in particular today
in terms of examples.
And all of that connects through our DDR controller
out to our memory.
Okay, so now NVMe for accelerators.
We present as an NVMe 1.3 compliant device
with multiple namespaces.
And what we've chosen to do is we map
one namespace per accelerator function.
So, you know, that's going to be important
when we talk to the over-fabrics portion,
but really that's the heart of it.
We have the namespaces,
and each of those namespaces
will represent one accelerator on the card.
And once you make that mapping,
then the next step is that
when you go to try to discover
what acceleration functions are available on the card,
you can use the identify namespace command
because you know that each namespace maps to one accelerator, so
each call to identify namespace for a different namespace
can give you information very specific to that accelerator.
And the way we do that,
in addition to using the standard fields
that are available in the identify namespace command,
we use the vendor-specific fields
to provide certain accelerator-specific information.
So for instance, what kind of accelerator is it?
Is it a compression core?
Is it some kind of RAID acceleration core?
Is it encryption cores?
What kind of acceleration algorithm
do you want to have there?
So that's one of the things it provides.
Second thing it provides
would be a subtype of that acceleration.
So if it's a compression algorithm,
is it GZIP compatible?
Is it LZ4?
Is it BZIP?
What kind of compression do you support in there?
And then inside of there,
there'd be version numbers of the accelerator
and block sizes,
various things that are very important
for the acceleration algorithm
and the software that would be running
on top of that in the user space.
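To make that concrete, here is a minimal sketch of what such a discovery step could look like from user space, going through the inbox Linux driver's passthrough ioctl. The Identify opcode, the CNS value, and the fact that the tail of the Identify Namespace structure is vendor specific come from the NVMe spec; the accel_info layout packed into those vendor-specific bytes is purely hypothetical, not Eideticom's actual encoding.

```c
/* Sketch: discover an accelerator by issuing a standard Identify Namespace
 * command through the inbox Linux NVMe driver.  The accel_info layout in
 * the vendor-specific region is a made-up example, not the real encoding. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

struct accel_info {              /* hypothetical vendor-specific layout */
	uint16_t type;           /* e.g. 1 = compression, 2 = RAID/EC, 3 = encryption */
	uint16_t subtype;        /* e.g. GZIP vs. LZ4 vs. BZIP2 for a compression core */
	uint32_t version;        /* accelerator core version */
	uint32_t block_size;     /* preferred block size in bytes */
};

int main(int argc, char **argv)
{
	uint8_t id_ns[4096];                     /* Identify Namespace data */
	struct nvme_admin_cmd cmd = {
		.opcode   = 0x06,                /* Identify */
		.nsid     = 1,                   /* one namespace per accelerator */
		.addr     = (uint64_t)(uintptr_t)id_ns,
		.data_len = sizeof(id_ns),
		.cdw10    = 0,                   /* CNS 0 = Identify Namespace */
	};
	int fd = open(argc > 1 ? argv[1] : "/dev/nvme0", O_RDONLY);

	if (fd < 0 || ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) != 0) {
		perror("identify namespace");
		return 1;
	}

	/* Bytes 384..4095 of Identify Namespace are vendor specific (NVMe 1.3). */
	struct accel_info info;
	memcpy(&info, &id_ns[384], sizeof(info));
	printf("accelerator type %u subtype %u version %u block size %u\n",
	       info.type, info.subtype, info.version, info.block_size);

	close(fd);
	return 0;
}
```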
Now, because each accelerator is a namespace,
there needs to be a way to configure the accelerators
to do jobs you want to do.
And we have two ways we do that.
One is we overload vendor-specific commands to provide control to configure and set up an acceleration algorithm. The other way we do it is with in-situ or in-datapath configuration.
And that's tended to be our go-to now because it allows us to do things like pass in a configuration to the
acceleration function followed by a whole bunch of data and then stage the next one with configuration
and data, configuration and data, just keep staging commands, feeding them in very rapidly into the accelerator. And we do that using this in-datapath configuration. Oh, and one other thing I want to emphasize.
So when we pass input data into the accelerator,
we do that using just basic built-in NVMe writes.
So nothing special, no magic.
The inbox driver is used exactly as is
using NVMe writes to provide data into that acceleration function.
Now, on the output side, when we have to get data, we stage using NVMe reads, we stage
pulls from the accelerator back out to the host, for instance, to say, give me the data
of that last acceleration algorithm I just had you do. And the status is we can either retrieve using vendor-specific commands or, again,
in-data path reads. And we tend to favor the in-data path reads again because of the reason
that we can stage a whole bunch of data reads followed by a status read at the end to acquire,
you know, what was the output results of that acceleration algorithm.
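As a very rough illustration of that staging pattern, the sketch below pushes a configuration record and then the input data with ordinary writes, and pulls results back with a read, all against the namespace's block device. The write-config-then-data and read-results flow is what the talk describes; the specific offsets and the idea that plain file offsets delimit the regions are invented here for illustration.

```c
/* Sketch of in-datapath staging: configuration, then data, via plain NVMe
 * writes through the inbox driver, with results pulled back via NVMe reads.
 * The offsets and record layout are hypothetical; a real implementation
 * would also use O_DIRECT with suitably aligned buffers. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CFG_OFFSET    0           /* hypothetical: job configuration record */
#define DATA_OFFSET   4096        /* hypothetical: input data follows       */
#define RESULT_OFFSET (1 << 20)   /* hypothetical: results + status region  */

int run_job(const char *ns_dev, const void *cfg, size_t cfg_len,
	    const void *in, size_t in_len, void *out, size_t out_len)
{
	int fd = open(ns_dev, O_RDWR);
	if (fd < 0)
		return -1;

	/* Stage the job: configuration, then data.  Several of these can be
	 * queued back to back to keep the accelerator fed and hide latency. */
	if (pwrite(fd, cfg, cfg_len, CFG_OFFSET) < 0 ||
	    pwrite(fd, in, in_len, DATA_OFFSET) < 0) {
		close(fd);
		return -1;
	}

	/* Pull the output back; a small trailing status read would report how
	 * the job went (e.g. the compressed length for a compression core). */
	ssize_t got = pread(fd, out, out_len, RESULT_OFFSET);

	close(fd);
	return got < 0 ? -1 : 0;
}
```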
So our in-house NVMe controller supports advanced features, including the entire CMB
standards. We include, or we support, submission queues, completion queues in the CMB.
We support data in the CMB.
We support SGLs in the CMB, PRPs in the CMB.
And because of that, you know,
we get to take advantage of things like NVMe over Fabrics.
We also, because of this, support peer-to-peer operation, which we're not going to dive into too heavily in this presentation,
but this does allow us to support peer-to-peer operation.
And all of this, again, just to drive home the point,
is done with the inbox drivers.
We didn't have to write one line of driver code
to make any of this work.
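The CMB features a controller claims are visible in two spec-defined registers, CMBLOC and CMBSZ, so nothing proprietary is needed to see them. Here is a small sketch that decodes CMBSZ the way the kernel driver does; mapping BAR0 through sysfs from user space like this is for illustration only, and the PCI address is a placeholder.

```c
/* Sketch: decode the spec-defined CMBSZ register (offset 0x3C in the NVMe
 * controller register space) to see which CMB features are advertised:
 * submission/completion queues, PRP/SGL lists, read data and write data.
 * In practice the inbox kernel driver does this; reading BAR0 from user
 * space via sysfs is illustrative only and needs root. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Placeholder PCI address for the accelerator's PCIe function. */
	int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (regs == MAP_FAILED) { perror("mmap"); return 1; }

	uint32_t cmbsz = regs[0x3C / 4];
	uint64_t unit  = 4096ULL << (4 * ((cmbsz >> 8) & 0xF));  /* SZU field */
	uint64_t size  = (uint64_t)(cmbsz >> 12) * unit;         /* SZ field  */

	printf("CMB size: %llu bytes\n", (unsigned long long)size);
	printf("  submission queues: %s\n", (cmbsz & 0x01) ? "yes" : "no");
	printf("  completion queues: %s\n", (cmbsz & 0x02) ? "yes" : "no");
	printf("  PRP/SGL lists:     %s\n", (cmbsz & 0x04) ? "yes" : "no");
	printf("  read data:         %s\n", (cmbsz & 0x08) ? "yes" : "no");
	printf("  write data:        %s\n", (cmbsz & 0x10) ? "yes" : "no");

	munmap((void *)regs, 4096);
	close(fd);
	return 0;
}
```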
And one other nice feature that we benefited from
in terms of this for deploying accelerators
is we can leverage industry-standard NVMe tools
such as NVMe CLI or FIO
to test our performance of our controller
and some of our acceleration blocks.
And this, again, like I said,
has assisted with deployment and benchmarking
when a customer gets this.
We get them to usually run through
some pretty standard NVMe-type deployment steps,
like run NVMe CLI on our device.
You should be able to see it show up
with these certain kind of standard fields.
Then go and run FIO on these particular devices.
This is the performance we would expect to see.
We didn't have to write any tools ourselves
to take advantage of that.
Second thing is that we get to leverage
the rich NVMe ecosystem,
which includes, you know,
the highlight of this presentation
is that because of this ability,
we can disaggregate our accelerators over fabrics
because NVME just does that for us.
We don't have to think about it.
We just expose our accelerators over the fabrics,
and then we're able to take advantage of that.
So a quick little look at the software stack.
Anywhere you see an Eideticom symbol,
we've either written software
or contributed software. So the primary thing here is we developed an API. And like I said,
it all lives on top of the built-in drivers. But there's certain tasks that are pretty common that
you want to do. Like you want to go say, find me all the accelerators that I'm aware of. And then
this API will go find you all the accelerators and it'll get you handles to them
so that you can easily access them.
You can say, find me all the accelerators
with these certain sets of features.
It'll give those to you.
And then, you know, you want to be able to lock an accelerator
so that you can say, okay,
my process is now using this accelerator.
I don't want someone else trouncing on my data
as I provide it to the accelerator.
So you can lock it for the duration
of your acceleration functions
and then choose to unlock it
when you're done with the resource.
So our API provides that.
It provides a very thin wrapper
over reading and writing operations,
although really, like I said,
it's about four or five lines of code
over basically just a read and a write call, system call. So really a very thin API to make it easier
to use the accelerators. And SPDK, Stephen contributed CMB support to SPDK. And then just at the bottom to show our complete stack,
we can talk through any operating system.
And on any processor: we've connected our accelerators to Intel processors,
of course, AMD processors, ARM processors,
and now POWER processors, and
now RISC-V. So because it's NVMe, we haven't had to write any drivers to go and connect
to different kinds of processor ecosystems. And our API is BSD licensed and available
on our public GitHub.
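For a feel of how thin that wrapper can be, here is a minimal, self-contained sketch in the same spirit. The function names, the flock()-based locking, and the fixed device path are my own illustrative assumptions, not the actual API from the public GitHub repo.

```c
/* A minimal sketch of the kind of thin accelerator API described here:
 * open a namespace, lock it for exclusive use, and wrap read/write.  The
 * names, the flock()-based lock, and the device path are illustrative
 * assumptions, not the actual BSD-licensed library. */
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

struct accel { int fd; };                      /* handle: one fd per namespace */

/* In the real API the device would come from Identify Namespace enumeration. */
static int accel_open(struct accel *a, const char *ns_dev)
{
	a->fd = open(ns_dev, O_RDWR);
	return a->fd < 0 ? -1 : 0;
}

/* Lock so another process doesn't trample our data stream mid-job. */
static int accel_lock(struct accel *a)   { return flock(a->fd, LOCK_EX); }
static int accel_unlock(struct accel *a) { return flock(a->fd, LOCK_UN); }

/* The "four or five lines over a read and a write call" part. */
static ssize_t accel_write(struct accel *a, const void *b, size_t n, off_t off)
{
	return pwrite(a->fd, b, n, off);
}
static ssize_t accel_read(struct accel *a, void *b, size_t n, off_t off)
{
	return pread(a->fd, b, n, off);
}

int main(void)
{
	struct accel a;
	char in[4096] = "data to accelerate", out[4096];

	if (accel_open(&a, "/dev/nvme0n1") || accel_lock(&a))   /* placeholder device */
		return 1;
	accel_write(&a, in, sizeof(in), 0);
	accel_read(&a, out, sizeof(out), 0);
	accel_unlock(&a);
	close(a.fd);
	return 0;
}
```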
So just a very short slide on controller performance.
This has nothing to do with the accelerators themselves.
A couple of highlights.
For 32K blocks, we are saturating Gen 3x8,
and for 16K blocks, we saturate Gen 3x4,
which would be our U.2 form factor.
Our focus today has been on accelerators that have greater than or equal to 16K data blocks
for acceleration, but we do have a multi-core RISC-V
that we're working on in-house
that will allow us to drastically increase the performance
for smaller block transfers.
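As a rough sanity check on those numbers (my arithmetic, not from the slides): a Gen 3 x8 link is good for roughly 7.9 GB/s and a Gen 3 x4 link roughly 3.9 GB/s before protocol overhead, so saturating either at those block sizes means the controller is sustaining on the order of 240,000 commands per second:

\[
\frac{7.9\ \text{GB/s}}{32\ \text{KiB}} \approx 2.4\times10^{5}\ \text{commands/s},
\qquad
\frac{3.9\ \text{GB/s}}{16\ \text{KiB}} \approx 2.4\times10^{5}\ \text{commands/s}
\]

Halving the block size doubles the command rate needed for the same bandwidth, which is why a multi-core controller matters for smaller transfers.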
Okay, so that kind of takes us through the basic rundown of how are we doing NVMe over
fabrics.
Sorry, NVMe for acceleration.
Now on to NVMe over fabrics.
So, basic thing of NVMe over fabrics.
Well, the whole notion of NVMe over Fabrics, irrespective of what we're doing
with acceleration, is it allows resources, NVMe resources, to be accessible remotely on a client
over a network or to be shared over a network. And the way it does that is it exposes NVMe namespaces
to client machines.
And I'm going to have a picture of our accelerators coming up shortly, a screenshot showing our accelerators
over a Fabrics connection.
But remember that accelerators, again,
just to go back to what I've talked about,
are mapped to namespaces on a one-to-one basis.
And with NVMe over Fabrics,
because we're just a built-in standard NVMe device
that looks like a hard drive to the operating system,
you can do a one-to-one mapping of accelerators
to namespaces on the remote client machine.
And we didn't have to write any custom code to do that. It just happens. So here's an
example, for instance, which I'll kind of build on in a moment. But suppose that we have an
accelerator and some hard drives that are remote over here. The clients on the left side, one of
the clients may say, I want to perform a RAID acceleration. And that client can
say, give me your RAID acceleration resources, which have already been mapped using the over
fabrics infrastructure in the kernel. So he can say, give me your resource for doing RAID.
And that accelerator will essentially appear to the client
as though it's locally attached.
The client, the user software in the client,
will have no idea that that software,
that that accelerator is not directly attached to it.
It doesn't have to have any special code to talk to it.
It's all handled behind the scenes
by NVMe and the NVMe over Fabrics drivers.
So this allows us to disaggregate the accelerators.
We don't have to have them on all the local clients.
Or if we do have them on a local client, but the resource is being unused,
other clients, if they've provided that acceleration,
or they've mapped the namespaces using over Fabrics,
can share those namespaces with other machines.
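On the target side, that sharing is just the stock Linux nvmet configfs plumbing, the same thing nvmetcli normally scripts for you. As a rough sketch, and with the NQN, backing device, IP address, and port all placeholders, exporting one accelerator namespace looks something like this:

```c
/* Sketch: export an accelerator namespace over NVMe-oF with the standard
 * Linux nvmet configfs interface (what nvmetcli normally scripts for you).
 * The NQN, backing device, IP address and port are placeholders. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define NVMET  "/sys/kernel/config/nvmet"
#define SUBSYS NVMET "/subsystems/nqn.2019-03.io.example:noload"  /* placeholder NQN */

static int put(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* Create the subsystem; allow any host to connect (demo only). */
	mkdir(SUBSYS, 0755);
	put(SUBSYS "/attr_allow_any_host", "1");

	/* One namespace per accelerator: back nsid 1 with the local device. */
	mkdir(SUBSYS "/namespaces/1", 0755);
	put(SUBSYS "/namespaces/1/device_path", "/dev/nvme0n1");  /* placeholder */
	put(SUBSYS "/namespaces/1/enable", "1");

	/* Create a fabrics port -- RDMA here; "tcp" works the same way. */
	mkdir(NVMET "/ports/1", 0755);
	put(NVMET "/ports/1/addr_trtype", "rdma");
	put(NVMET "/ports/1/addr_adrfam", "ipv4");
	put(NVMET "/ports/1/addr_traddr", "192.168.0.10");        /* placeholder */
	put(NVMET "/ports/1/addr_trsvcid", "4420");

	/* Bind the subsystem to the port so clients can discover and connect. */
	return symlink(SUBSYS,
		       NVMET "/ports/1/subsystems/nqn.2019-03.io.example:noload");
}
```

The client then does a normal nvme connect to that address, and the accelerator namespace shows up as if it were a local device.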
So this is a little example of...
We mapped this one.
It's got six accelerators, I guess, sitting on it,
plus a couple of other drives.
And one thing I wanted to highlight here is you can see that we have all the information
on these accelerators here. That was enabled by some
pass-through patches in the kernel that allow us to get
the full identify command through to the client machine in the over-fabrics connection.
So typically it would look more like this
if we didn't have the pass-through patches,
and you wouldn't know the exact specifics of the accelerator that are there
or the devices that are there.
But these pass-through patches, which work for all NVMe devices,
allow us to discover information about our accelerators
in a very easy way using standard drivers.
Okay, so one case study.
So the first thing we did,
and this was a demonstration or a demo we had at FMS.
And in this demo, what we did is we had a local machine that was running the client application,
which was a process that was requesting compression acceleration.
And that hooked up through a NIC, a high-speed network, to a Broadcom Stingray that had a JBOF behind it.
And inside of that JBOF, there were some SSDs,
NVMe SSDs, and there was a NoLoad card.
So what we did is...
Oh, I just want to take a side note.
We have tested this on both RoCE and TCP/IP.
It works well on both,
and the results that I have for the compression
would be equivalent in both networking protocols.
So this one was our U.2 form factor for this demonstration.
So because of that, it's Gen 3 by 4 currently,
so we would anticipate we'll get about 3.4,
you know, a maximum of 3.4 gigabytes
per second of data rate into and out of that accelerator. And the local client was unaware
that it was using the over-fabrics connection. And in the demonstration, just as a side note,
because like I said, we demonstrated this at FMS. In that demonstration, we had this running in the Broadcom booth
doing an over-fabrics test.
In Xilinx's booth, we had the exact same user code,
not one line of code change, running a peer-to-peer example.
So this is really one of the strengths of using NVMe
and the NVMe over-fabrics and the peer-to-peer functionality
of NVMe is that user software doesn't have to be aware of the architecture of the network.
So what we anticipate will happen is that we have this software running on the client
over here, and it goes and requests an acceleration function
to be performed.
You can consider because it's been mapped over fabrics,
it appears like it's here.
So what it'll do is the client will say,
acceleration resource, please run this algorithm,
compression, for instance, on the device.
It will push the data in,
which it's either reading from files
or generating on the fly,
push the results into the accelerator,
and as the acceleration function completes,
it will then go and read the data
back into the system memory on the local client,
and then will do with it whatever it wants,
which might be, which it was in this case,
mapping it out to another NVMe SSD,
which may or may not be local. It could itself
be an RDMA or over-fabrics SSD. So that data will, you know, in the network sense,
it will travel through to the accelerator, perform the acceleration function, come back up,
and then be written to the SSD, which may be there, it may be here, it may be somewhere else.
And here's the results we got for running two of our compression cores.
We can fit more on the card; this demo just had two.
So you can see that we're showing the results.
Each core we have is capable of generating about one to one and a half gigabytes per second of compression.
And these two in this particular case
with this set of files were generating
one gigabyte per second of data.
And I mean, all this is really showing is that the data across the network
is the 2.1 gigabytes per second. Now, one thing I should note is there is a little bit of fabrics
latency compared to direct attached. And that fabrics latency, the performance that we saw here
is exactly the same as the performance with this data set on a direct attached. And the way we're
able to do that, of course, is because we have multiple configurations and data sets in flight at a time,
the latency is hidden from the throughput. I mean, there's still, the latency is still there,
of course, when the data's gone out and come back on a single file, but the latency is hidden in
terms of throughput. And we'll talk a little bit about the impact on the target machine in a moment.
There is some impact on the target machine
if you think about how the data has traveled in the previous discussion,
but that's true of the next example as well.
So in this case, we did erasure coding (EC) over fabrics,
and we have a core that supports up to 32 plus 4 disk groups
with block sizes ranging from 16K to 128K bytes,
and this one is our Gen 3 by 16 form factor.
And in this case, both NoLoads...
Well, in this case, NoLoad, again,
was on the remote server,
and we're going to pass data into and out of it
from the local client.
And again, same as the other thing,
the user code is completely unaware
of the acceleration algorithm it's doing.
So in this case,
the client software for this one,
I think this was a 10 plus 2,
so based on the size of the block,
we'll get different performance results,
but this was a 10 plus 2,
so we had about 6.4 gigabytes per second in
with 2 gigabytes per second coming back out.
And in this case,
this is an older acceleration algorithm for us.
The results have a very small latency penalty
versus the direct-attach results.
They'd be a tiny bit better with direct-attach.
So, again,
kind of hit this point where the data is traveling in and out. So what is the performance impact on
the target? And for that, I'm going to pass the mic over to Steve. And so he can talk about what
are the performance impacts and what are some different creative ways we can mitigate those
performance impacts? Thanks, Sean.
Before I go any further,
I want to shout out two people in the room, actually.
So Alan Cantle from Nallatech
has been a huge hardware partner of ours.
Alan, stand up for a sec there.
So he's instrumental in getting
our U.2 form factor device into the market.
So we're an IP company.
We put the bit files
on the FPGA. We rely on partners to build interesting hardware. So, you know, Sean showed
a slide showing the U.2 cloud deployment and add-in card. Obviously, things like ruler form
factor and next generation form factors are also interesting. And to be honest, for us,
it's just a board spin. And, and you know alan's company is really awesome at
doing that the other person i want to thank is uh chitania i think i saw you down the back
yeah wait stand up there so sean mentioned pass-through patches uh for the nvme over
fabrics target so right now if you go to the inbox driver for nvme certain commands are not
passed directly to the backing nvme drives There's a layer of indirection.
In order to do what we want to do, which is present those namespaces pretty much as is over the network,
we have to apply his pass-through patches, which he'd very kindly written and submitted.
So we didn't have to do that.
Just another advantage of using NVMe.
I think one more point I want to make clear.
One of the things I think that's very interesting about this work,
Sean mentioned it a couple of times,
but to my knowledge, this is the only way
you can have agnostic user space code running
that's identical regardless of whether the PCIe accelerator
is in the box with the application
or remotely connected
to the application, in theory over TCP/IP, which means it could be anywhere else on the planet.
And, you know, there may be other frameworks that can do that, I certainly don't know of them,
but the advantage, you know, the NVMe over Fabrics thing really does mean that user space has no idea
whether that NVMe namespace is in the box with the application
or somewhere else on the planet entirely.
And what's interesting about that is
if the application is, for example, a virtual machine
or a Docker container or some other runtime,
you don't have to change your code to migrate your application.
And you can migrate your application
from things that have accelerators in the box
to things that have the accelerator's network attached.
And the application may suffer some quality of service, but it will certainly continue to run.
And you haven't had to recompile or touch anything.
And I think, I don't know if kubernetesify is a word, but I'm inventing it if it is not.
So basically this lets you kubernetesify FPGA acceleration or any other. It doesn't have
to be FPGA, anything else. So obviously, you know, this is all very interesting, but there are
repercussions to physically moving a device from one place to another, right? There is this thing
called physics, apparently, which, you know, impacts us. And fake news might get around a lot of things
but it doesn't get around physics
and there is latency and so forth involved.
The other thing that we're doing
is we're taking an accelerator
that's in a box with a server
and we're basically putting it on a target system
so there's now two processors involved.
There's the processor on the client
on the compute node
and there's the processor on the target
the storage
controller, as we often refer to it. And both of them now have to execute CPU cycles in order to
get things done. But what's interesting is that we actually have in some of the new hardware that's
coming from people like ourselves and Mellanox and others, to name a few. We have some interesting hardware that's going to help us offload
some of the repercussions from the target side.
So I'm going to talk about two of them.
One of them is memory offload.
So how do we get the DDR subsystem
on the storage controller out of the way?
I'm going to talk about this very briefly here,
but after lunch, I'm going to be talking about it a lot more. And you can either go to Jim Harris's talk or you can come to my talk.
It's up to you. They're both going to be awesome. I'm going to go to Jim's talk.
So one of them is to get the DDR. And the reason why, I'm going to talk about this a little bit
more after lunch, but the reason why I want to get the DDR subsystem
out of the way on targets
is because I don't want to use big processors there.
I want to use RISC-V SOCs.
And they don't necessarily have a lot of DDR channels,
but if I want to do 20 gigabyte per second of IO,
I better not have the DDR of that little SOC in the hot path.
I've got to have that in the cold path.
So there's something that we can use to take advantage of that.
I'll talk about it a little bit here.
I'll talk about it more after lunch.
And then the other thing is, of course,
those little SOCs don't necessarily have a lot of processor cores.
They might do, but they might not.
And they have to execute a lot of IOPS,
and every IOP is some lines of C code that are in a driver
that have to be executed on the instruction set architecture of that target.
So there are some ways and means that we can get around that.
I'm going to look at those a little bit.
After we've looked at them, I'm kind of showing you the end game before the slides.
But basically, what we can do, at least on the CPU utilization side, is we can take advantage
of things like NVMe offload.
So this is a feature from Mellanox.
It's provided in both Bluefield and the CX5.
It's a state machine that's in the PCIe endpoint
that can essentially administer NVMe commands on your behalf.
So the driver basically says,
I'm not going to be the one ringing the doorbell
or pushing
submission queue entries or polling for completion. I'm going to use a little piece of hardware to do
this. This is either an awesome idea or it's a crazy idea or it's a crazy awesome idea. As someone
who works on operating systems, the thought of having a little piece of hardware doing this kind
of scares me a little bit. But at the same time, it's pretty interesting. So we've implemented code that takes advantage of that little state machine.
And we're seeing pretty much 98% CPU offload.
So the CPU load goes from a nominal factor of 100 to a nominal factor of 2.
The processor basically isn't doing anything anymore.
And we're still doing an awful lot of I.O.
So that's very compelling.
There's issues around error handling that have to be thought about,
but that's a huge potential win.
That's maybe moving away from a pretty serious storage controller processor
to something that's a lot more lightweight.
So very, very interesting.
And then on the memory side, like I said,
I will talk about this a lot more after lunch
but there's a framework that myself and others have been working on for quite some time
that's on its way, hopefully, into Linux pretty soon here; we're getting some
good acknowledgements even in the last couple of weeks. It lets us do DMAs directly
from one PCIe device to another, what we call peer-to-peer DMA, P2P DMA.
Traditionally, a DMA does not do that. Traditionally, if you want to move data from
device A to device B, you have to do a DMA through system memory. Now it may, if you have things like
Data Direct I/O, it may get stuck in your L3 cache, which can be a good thing or a bad thing, depending
on what that L3 cache is trying to do.
But often it will end up in DRAM.
And that's memory bandwidth that has to be provided.
If you're doing 20 gigabytes per second, that's 40 gigabytes per second of DDR bandwidth.
That's quite a lot of DDR channels, right?
So that's money and power and so forth.
If you have memory on the PCIe endpoints, then they can be the DMA destination or source.
And one of the most famous recent examples of a standard that defines memory on a PCIe device is the controller memory buffer or persistent memory region of an NVMe device. So now we have a
standards-based way of providing memory for this framework.
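For the kernel-minded, here is roughly what handing a CMB to that peer-to-peer framework looks like, modeled on the p2pdma API as it was being proposed for Linux around this time; the exact function signatures have shifted between kernel versions, so treat this as a sketch rather than copy-paste code.

```c
/* Kernel-side sketch: expose an NVMe controller memory buffer (CMB) to the
 * peer-to-peer DMA framework so other PCIe devices (an RDMA NIC, say) can
 * DMA straight to it instead of bouncing through system DRAM.  Modeled on
 * the Linux p2pdma API of this era; signatures vary by kernel version. */
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

static int expose_cmb(struct pci_dev *pdev, int bar, size_t cmb_size,
		      u64 bar_offset)
{
	int rc;

	/* Hand the BAR region backing the CMB to the p2pdma allocator. */
	rc = pci_p2pdma_add_resource(pdev, bar, cmb_size, bar_offset);
	if (rc)
		return rc;

	/* Advertise it so consumers like the NVMe-oF target can find it and
	 * prefer it as the DMA destination/source for peer-to-peer traffic. */
	pci_p2pmem_publish(pdev, true);
	return 0;
}

/* A consumer (the fabrics target, for example) then allocates from it: */
static void *grab_p2p_buffer(struct pci_dev *pdev, size_t len)
{
	return pci_alloc_p2pmem(pdev, len);    /* paired with pci_free_p2pmem() */
}
```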
And if we take advantage of that, basically, as well as getting the factor of 50 offload in the processor, we get roughly the same, about a factor of 50 offload in the DDR bandwidth
on the storage controller processor while we're still achieving the same amount of throughput.
So again, this is another big bang for the buck.
And the great thing about this one compared to the other one
is I'm not taking the OS out of the path.
The operating system for the memory offload
is still potentially the one doing the IO.
It's just that the DMAs are not going through
my memory subsystem on my processor anymore.
So that's a win-win.
So if we put all the things together that we've
talked about, basically we have the ability to disaggregate accelerators that could be FPGA-based,
but they could also be SoC-based. They could be GPGPU-based. They could be something else.
We can disaggregate them across networks. That network could be fiber channel. It could be
Ethernet-based with RDMA.
It could even be, coming soon (Sagi, I think I saw him earlier, he maybe stepped out),
TCP/IP as well.
Totally standard using inbox drivers.
That's a pretty important point.
And then combining it with techniques like this,
we can build some very, very interesting NVMe over fabric target appliances that don't necessarily need a
lot of processor horsepower, but contain an awful lot of accelerator capabilities that we can push
out onto the network. The other part of this is anyone who's still alive at seven o'clock tonight,
come and join us for the Birds of a Feather on computational storage standardization.
What we're talking about today is a little bit, we're not the only company doing it,
but everyone's kind of, right now, we're all doing it a little bit differently.
Standards are the way forward, right?
So this is kind of showing the path.
This is saying, this is interesting.
There's value in doing this kind of thing.
Let's get people in a room and knock our heads together and work out how we standardize this. How do you standardize it in NVMe? How do you standardize it in SCSI?
How do you standardize it in Ethernet? Where does it need to be standardized, and what does
that look like? We push that standard into the drivers, into the operating systems,
and we create a vendor agnostic battlefield in which we can all go and try and carve out viable companies that make money and make people.
Computational abilities over NVMe enable a lot of stuff, kind of for free, because it's already there.
Come to the Birds of a Feather, standardize it.
If you're interested in peer-to-peer, I'm talking about that after lunch.
Thank you very much.
Questions?
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers
in the storage developer community. For additional information about the storage
developer conference, visit www.storagedeveloper.org.