Storage Developer Conference - #90: FPGA Accelerator Disaggregation Using NVMe-over-Fabrics
Episode Date: March 25, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 90. So I'm Sean Gibb, VP of Software at
Eideticom. Stephen Bates, our CTO, is going to be talking a little bit later in the presentation,
but I'm going to start off kind of with an overview, set the stage for some of the discussion,
then I'll let Stephen finish up the discussion.
Today I'm going to be talking about FPGA accelerator disaggregation using NVMe over Fabrics. And to set the stage for how we're disaggregating FPGAs
and how we're using even NVMe to talk to an accelerator,
I'll kind of go through some high-level intro slides
into how we're using NVMe
and then some of the strengths of using NVMe
and that decision we made to use NVMe
as a protocol to talk to accelerators.
And then we'll dig into the meat of accelerator disaggregation using NVMe over Fabrics. Okay, so to begin with, just a very
simple slide on acceleration. Not much to say here that's new. Just, you know, we have a typical
system maybe where you have a host CPU, some kind of PCIe bus, just as a very high-level discussion.
And sitting on that bus, you might have NVMe SSDs.
You might have HDDs, GPGPUs, RDMA NICs, and then this little NoLoad accelerator card that we show in the bottom corner here. And what this NoLoad acceleration card is, is that's our product that I'm going to be talking about that is an NVMe offload engine.
And that NVMe offload engine is an NVMe-compliant controller that sits in front of a whole bunch of acceleration algorithms.
And it uses NVMe as the protocol to talk to those accelerators.
Now because it's an FPGA,
it can come in many form factors.
We have a U.2 form factor
that we ran some of the results
that we're going to be showing later in the presentation on.
That U.2 form factor is developed by Nallatech.
We have some commercial off-the-shelf FPGA architectures. We've tested our stuff on
Amazon F1 Cloud as well. So because we're an FPGA bit file, really it's quite easy to retool or
retarget to different FPGA platforms, ranging from, like I said, this U.2, which is a
more recent development accelerator platform. We have our higher performance accelerator platforms
that would live on a more traditional FPGA card, and then cloud services as well.
So why would we use NVMe for acceleration? Well, really what it boils down to from our perspective
is accelerators need several things,
and what they need is low latency.
You don't want the data to take forever
to get out to your accelerator
and then forever to get the data back.
You want it to be quick.
You want high throughput.
So you would like to nominally max the PCIe bandwidth
if your accelerator is living on a PCIe bus.
You want low CPU overhead,
so you don't want to tax your CPU
with the staging of data and retrieving data
or getting that data into and out of the accelerator.
We don't want to tax our CPU.
Multi-core awareness would be very important
in an acceleration algorithm.
And then quality of service awareness
as well.
Oh, there's a typo
on this.
This should say
NVMe, or
I don't know what it should say anymore.
I'm totally confused.
Yes,
NVMe supplies maybe. I've been up since three o'clock
your time because I flew down today, so my brain is not firing on all cylinders. So NVMe supplies
the following things. It supplies low latency. It supplies high throughput. It supplies low CPU
overhead, multi-core awareness,
quality of service awareness.
All of the things that we want for our acceleration,
NVMe supplies those things.
And then on top of it all,
it supplies something I didn't put on here,
but I put it down here.
It supplies world-class driver writers.
We're a small company, so we can write good software, but we're not a world
class driver writing team, and I don't think that I want to be that either. So because of that,
I would say the real question is why not use NVMe for accelerators?
So this is a view, and I would say that it's our NoLoad accelerator board,
but it's pretty common for just an NVMe device in general.
So what do we have here?
We have the host CPU with DDR attached,
and then across the PCIe bus we have our accelerator board,
and on that accelerator board we have an actual NVMe controller that we wrote on an in-house built RISC-V controller.
So that's the brains of our acceleration algorithm.
We've built all kinds of muscles to beef up the RISC-V performance where we need it.
But really, the heart of it is this RISC-V controller.
And then that board that we have has DDR, external DDR,
and in that DDR, a portion of that DDR
we have allocated to be a controller memory buffer.
And that's going to be very important
for a lot of the discussions we have today.
The controller memory buffer, or CMB,
plays a very important role
in a lot of the things we want to do
in terms of acceleration offload.
And in order to wrap all of this, I'll just come over to this side. So on this side, we have our PCIe controller and our DMA engines. And we have special DMA
engines that we've written in-house plus external DMA engines. Those allow us to talk to the CMB
very rapidly. We have our NVMe controller, like I said, RISC-V,
and then we have acceleration algorithms
that can be plugged in.
And I'm going to be talking about two in particular today
in terms of examples.
And all of that connects through our DDR controller
out to our memory.
Okay, so now NVMe for accelerators.
We present as an NVMe 1.3 compliant device
with multiple namespaces.
And what we've chosen to do is we map
one namespace per accelerator function.
So, you know, that's going to be important
when we talk to the over-fabrics portion,
but really that's the heart of it.
We have the namespaces,
and each of those namespaces
will represent one accelerator on the card.
And once you make that mapping,
then the next step is that
when you go to try to discover
what acceleration functions are available on the card,
you can use the identify namespace command
because you know that each namespace maps to one accelerator, so
each call to identify namespace for a different namespace
can give you information very specific to that accelerator.
And the way we do that,
in addition to using the standard fields
that are available in the identify namespace command,
we use the vendor-specific fields
to provide certain accelerator-specific information.
So for instance, what kind of accelerator is it?
Is it a compression core?
Is it some kind of RAID acceleration core?
Is it encryption cores?
What kind of acceleration algorithm
do you want to have there?
So that's one of the things it provides.
Second thing it provides
would be a subtype of that acceleration.
So if it's a compression algorithm,
is it GZIP compatible?
Is it LZ4?
Is it BZIP?
What kind of compression do you support in there?
And then inside of there,
there'd be version numbers of the accelerator
and block sizes,
various things that are very important
for the acceleration algorithm
and the software that would be running
on top of that in the user space.
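To make that concrete, here is a minimal sketch of what such a discovery step could look like from user space, going through the inbox Linux driver's passthrough ioctl. The Identify opcode, the CNS value, and the fact that the tail of the Identify Namespace structure is vendor specific come from the NVMe spec; the accel_info layout packed into those vendor-specific bytes is purely hypothetical, not Eideticom's actual encoding.

```c
/* Sketch: discover an accelerator by issuing a standard Identify Namespace
 * command through the inbox Linux NVMe driver.  The accel_info layout in
 * the vendor-specific region is a made-up example, not the real encoding. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

struct accel_info {              /* hypothetical vendor-specific layout */
	uint16_t type;           /* e.g. 1 = compression, 2 = RAID/EC, 3 = encryption */
	uint16_t subtype;        /* e.g. GZIP vs. LZ4 vs. BZIP2 for a compression core */
	uint32_t version;        /* accelerator core version */
	uint32_t block_size;     /* preferred block size in bytes */
};

int main(int argc, char **argv)
{
	uint8_t id_ns[4096];                     /* Identify Namespace data */
	struct nvme_admin_cmd cmd = {
		.opcode   = 0x06,                /* Identify */
		.nsid     = 1,                   /* one namespace per accelerator */
		.addr     = (uint64_t)(uintptr_t)id_ns,
		.data_len = sizeof(id_ns),
		.cdw10    = 0,                   /* CNS 0 = Identify Namespace */
	};
	int fd = open(argc > 1 ? argv[1] : "/dev/nvme0", O_RDONLY);

	if (fd < 0 || ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) != 0) {
		perror("identify namespace");
		return 1;
	}

	/* Bytes 384..4095 of Identify Namespace are vendor specific (NVMe 1.3). */
	struct accel_info info;
	memcpy(&info, &id_ns[384], sizeof(info));
	printf("accelerator type %u subtype %u version %u block size %u\n",
	       info.type, info.subtype, info.version, info.block_size);

	close(fd);
	return 0;
}
```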
Now, because each accelerator is a namespace,
there needs to be a way to configure the accelerators
to do jobs you want to do.
And we have two ways we do that.
One is we overload vendor-specific commands to provide control to configure and set up an acceleration algorithm. The other way we do it is with in-situ or in-datapath configuration.
And that's tended to be our go-to now because it allows us to do things like pass in a configuration to the
acceleration function followed by a whole bunch of data and then stage the next one with configuration
and data, configuration and data, just keep staging commands, feeding them in very rapidly into the accelerator. And we do that using this in-datapath configuration. Oh, and one other thing I want to emphasize.
So when we pass input data into the accelerator,
we do that using just basic built-in NVMe writes.
So nothing special, no magic.
The inbox driver is used exactly as is
using NVMe writes to provide data into that acceleration function.
Now, on the output side, when we have to get data, we stage using NVMe reads, we stage
pulls from the accelerator back out to the host, for instance, to say, give me the data
of that last acceleration algorithm I just had you do. And the status is we can either retrieve using vendor-specific commands or, again,
in-data path reads. And we tend to favor the in-data path reads again because of the reason
that we can stage a whole bunch of data reads followed by a status read at the end to acquire,
you know, what was the output results of that acceleration algorithm.
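As a very rough illustration of that staging pattern, the sketch below pushes a configuration record and then the input data with ordinary writes, and pulls results back with a read, all against the namespace's block device. The write-config-then-data and read-results flow is what the talk describes; the specific offsets and the idea that plain file offsets delimit the regions are invented here for illustration.

```c
/* Sketch of in-datapath staging: configuration, then data, via plain NVMe
 * writes through the inbox driver, with results pulled back via NVMe reads.
 * The offsets and record layout are hypothetical; a real implementation
 * would also use O_DIRECT with suitably aligned buffers. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CFG_OFFSET    0           /* hypothetical: job configuration record */
#define DATA_OFFSET   4096        /* hypothetical: input data follows       */
#define RESULT_OFFSET (1 << 20)   /* hypothetical: results + status region  */

int run_job(const char *ns_dev, const void *cfg, size_t cfg_len,
	    const void *in, size_t in_len, void *out, size_t out_len)
{
	int fd = open(ns_dev, O_RDWR);
	if (fd < 0)
		return -1;

	/* Stage the job: configuration, then data.  Several of these can be
	 * queued back to back to keep the accelerator fed and hide latency. */
	if (pwrite(fd, cfg, cfg_len, CFG_OFFSET) < 0 ||
	    pwrite(fd, in, in_len, DATA_OFFSET) < 0) {
		close(fd);
		return -1;
	}

	/* Pull the output back; a small trailing status read would report how
	 * the job went (e.g. the compressed length for a compression core). */
	ssize_t got = pread(fd, out, out_len, RESULT_OFFSET);

	close(fd);
	return got < 0 ? -1 : 0;
}
```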
So our in-house NVMe controller supports advanced features, including the entire CMB
standards. We include, or we support, submission queues, completion queues in the CMB.
We support data in the CMB.
We support SGLs in the CMB, PRPs in the CMB.
And because of that, you know,
we get to take advantage of things like NVMe over Fabrics.
We also, because of this, support peer-to-peer operation, which we're not going to dive into too heavily in this presentation,
but this does allow us to support peer-to-peer operation.
And all of this, again, just to drive home the point,
is done with the inbox drivers.
We didn't have to write one line of driver code
to make any of this work.
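The CMB features a controller claims are visible in two spec-defined registers, CMBLOC and CMBSZ, so nothing proprietary is needed to see them. Here is a small sketch that decodes CMBSZ the way the kernel driver does; mapping BAR0 through sysfs from user space like this is for illustration only, and the PCI address is a placeholder.

```c
/* Sketch: decode the spec-defined CMBSZ register (offset 0x3C in the NVMe
 * controller register space) to see which CMB features are advertised:
 * submission/completion queues, PRP/SGL lists, read data and write data.
 * In practice the inbox kernel driver does this; reading BAR0 from user
 * space via sysfs is illustrative only and needs root. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Placeholder PCI address for the accelerator's PCIe function. */
	int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (regs == MAP_FAILED) { perror("mmap"); return 1; }

	uint32_t cmbsz = regs[0x3C / 4];
	uint64_t unit  = 4096ULL << (4 * ((cmbsz >> 8) & 0xF));  /* SZU field */
	uint64_t size  = (uint64_t)(cmbsz >> 12) * unit;         /* SZ field  */

	printf("CMB size: %llu bytes\n", (unsigned long long)size);
	printf("  submission queues: %s\n", (cmbsz & 0x01) ? "yes" : "no");
	printf("  completion queues: %s\n", (cmbsz & 0x02) ? "yes" : "no");
	printf("  PRP/SGL lists:     %s\n", (cmbsz & 0x04) ? "yes" : "no");
	printf("  read data:         %s\n", (cmbsz & 0x08) ? "yes" : "no");
	printf("  write data:        %s\n", (cmbsz & 0x10) ? "yes" : "no");

	munmap((void *)regs, 4096);
	close(fd);
	return 0;
}
```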
And one other nice feature that we benefited from
in terms of this for deploying accelerators
is we can leverage industry-standard NVMe tools
such as NVMe CLI or FIO
to test our performance of our controller
and some of our acceleration blocks.
And this, again, like I said,
has assisted with deployment and benchmarking
when a customer gets this.
We get them to usually run through
some pretty standard NVMe-type deployment steps,
like run NVMe CLI on our device.
You should be able to see it show up
with these certain kind of standard fields.
Then go and run FIO on these particular devices.
This is the performance we would expect to see.
We didn't have to write any tools ourselves
to take advantage of that.
Second thing is that we get to leverage
the rich NVMe ecosystem,
which includes, you know,
the highlight of this presentation
is that because of this ability,
we can disaggregate our accelerators over fabrics
because NVME just does that for us.
We don't have to think about it.
We just expose our accelerators over the fabrics,
and then we're able to take advantage of that.
So a quick little look at the software stack.
Anywhere you see an Eideticom symbol,
we've either written software
or contributed software. So the primary thing here is we developed an API. And like I said,
it all lives on top of the built-in drivers. But there's certain tasks that are pretty common that
you want to do. Like you want to go say, find me all the accelerators that I'm aware of. And then
this API will go find you all the accelerators and it'll get you handles to them
so that you can easily access them.
You can say, find me all the accelerators
with these certain sets of features.
It'll give those to you.
And then, you know, you want to be able to lock an accelerator
so that you can say, okay,
my process is now using this accelerator.
I don't want someone else trouncing on my data
as I provide it to the accelerator.
So you can lock it for the duration
of your acceleration functions
and then choose to unlock it
when you're done with the resource.
So our API provides that.
It provides a very thin wrapper
over reading and writing operations,
although really, like I said,
it's about four or five lines of code
over basically just a read and a write call, system call. So really a very thin API to make it easier
to use the accelerators. And SPDK, Stephen contributed CMB support to SPDK. And then just at the bottom to show our complete stack,
we can talk through any operating system.
And on any processor: we've connected our accelerators to Intel processors,
of course, AMD processors, ARM processors,
and now POWER processors, and
now RISC-V. So because it's NVMe, we haven't had to write any drivers to go and connect
to different kinds of processor ecosystems. And our API is BSD licensed and available
on our public GitHub.
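For a feel of how thin that wrapper can be, here is a minimal, self-contained sketch in the same spirit. The function names, the flock()-based locking, and the fixed device path are my own illustrative assumptions, not the actual API from the public GitHub repo.

```c
/* A minimal sketch of the kind of thin accelerator API described here:
 * open a namespace, lock it for exclusive use, and wrap read/write.  The
 * names, the flock()-based lock, and the device path are illustrative
 * assumptions, not the actual BSD-licensed library. */
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

struct accel { int fd; };                      /* handle: one fd per namespace */

/* In the real API the device would come from Identify Namespace enumeration. */
static int accel_open(struct accel *a, const char *ns_dev)
{
	a->fd = open(ns_dev, O_RDWR);
	return a->fd < 0 ? -1 : 0;
}

/* Lock so another process doesn't trample our data stream mid-job. */
static int accel_lock(struct accel *a)   { return flock(a->fd, LOCK_EX); }
static int accel_unlock(struct accel *a) { return flock(a->fd, LOCK_UN); }

/* The "four or five lines over a read and a write call" part. */
static ssize_t accel_write(struct accel *a, const void *b, size_t n, off_t off)
{
	return pwrite(a->fd, b, n, off);
}
static ssize_t accel_read(struct accel *a, void *b, size_t n, off_t off)
{
	return pread(a->fd, b, n, off);
}

int main(void)
{
	struct accel a;
	char in[4096] = "data to accelerate", out[4096];

	if (accel_open(&a, "/dev/nvme0n1") || accel_lock(&a))   /* placeholder device */
		return 1;
	accel_write(&a, in, sizeof(in), 0);
	accel_read(&a, out, sizeof(out), 0);
	accel_unlock(&a);
	close(a.fd);
	return 0;
}
```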
So just a very short slide on controller performance.
This has nothing to do with the accelerators themselves.
A couple of highlights.
For 32K blocks, we are saturating Gen 3x8,
and for 16K blocks, we saturate Gen 3x4,
which would be our U.2 form factor.
Our focus today has been on accelerators that have greater than or equal to 16K data blocks
for acceleration, but we do have a multi-core RISC-V
that we're working on in-house
that will allow us to drastically increase the performance
for smaller block transfers.
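As a rough sanity check on those numbers (my arithmetic, not from the slides): a Gen 3 x8 link is good for roughly 7.9 GB/s and a Gen 3 x4 link roughly 3.9 GB/s before protocol overhead, so saturating either at those block sizes means the controller is sustaining on the order of 240,000 commands per second:

\[
\frac{7.9\ \text{GB/s}}{32\ \text{KiB}} \approx 2.4\times10^{5}\ \text{commands/s},
\qquad
\frac{3.9\ \text{GB/s}}{16\ \text{KiB}} \approx 2.4\times10^{5}\ \text{commands/s}
\]

Halving the block size doubles the command rate needed for the same bandwidth, which is why a multi-core controller matters for smaller transfers.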
Okay, so that kind of takes us through the basic rundown of how are we doing NVMe over
fabrics.
Sorry, NVMe for acceleration.
Now on to NVMe over fabrics.
So, basic thing of NVMe over fabrics.
Well, the whole notion of NVMe over Fabrics, irrespective of what we're doing
with acceleration, is it allows resources, NVMe resources, to be accessible remotely on a client
over a network or to be shared over a network. And the way it does that is it exposes NVMe namespaces
to client machines.
And I'm going to have a picture of our accelerators coming up shortly, a screenshot showing our accelerators
over a Fabrics connection.
But remember that accelerators, again,
just to go back to what I've talked about,
are mapped to namespaces on a one-to-one basis.
And with NVMe over Fabrics,
because we're just a built-in standard NVMe device
that looks like a hard drive to the operating system,
you can do a one-to-one mapping of accelerators
to namespaces on the remote client machine.
And we didn't have to write any custom code to do that. It just happens. So here's an
example, for instance, which I'll kind of build on in a moment. But suppose that we have an
accelerator and some hard drives that are remote over here. The clients on the left side, one of
the clients may say, I want to perform a RAID acceleration. And that client can
say, give me your RAID acceleration resources, which have already been mapped using the over
fabrics infrastructure in the kernel. So he can say, give me your resource for doing RAID.
And that accelerator will essentially appear to the client
as though it's locally attached.
The client, the user software in the client,
will have no idea that that software,
that that accelerator is not directly attached to it.
It doesn't have to have any special code to talk to it.
It's all handled behind the scenes
by NVMe and the NVMe over Fabrics drivers.
So this allows us to disaggregate the accelerators.
We don't have to have them on all the local clients.
Or if we do have them on a local client, but the resource is being unused,
other clients, if they've provided that acceleration,
or they've mapped the namespaces using over Fabrics,
can share those namespaces with other machines.
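On the target side, that sharing is just the stock Linux nvmet configfs plumbing, the same thing nvmetcli normally scripts for you. As a rough sketch, and with the NQN, backing device, IP address, and port all placeholders, exporting one accelerator namespace looks something like this:

```c
/* Sketch: export an accelerator namespace over NVMe-oF with the standard
 * Linux nvmet configfs interface (what nvmetcli normally scripts for you).
 * The NQN, backing device, IP address and port are placeholders. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define NVMET  "/sys/kernel/config/nvmet"
#define SUBSYS NVMET "/subsystems/nqn.2019-03.io.example:noload"  /* placeholder NQN */

static int put(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* Create the subsystem; allow any host to connect (demo only). */
	mkdir(SUBSYS, 0755);
	put(SUBSYS "/attr_allow_any_host", "1");

	/* One namespace per accelerator: back nsid 1 with the local device. */
	mkdir(SUBSYS "/namespaces/1", 0755);
	put(SUBSYS "/namespaces/1/device_path", "/dev/nvme0n1");  /* placeholder */
	put(SUBSYS "/namespaces/1/enable", "1");

	/* Create a fabrics port -- RDMA here; "tcp" works the same way. */
	mkdir(NVMET "/ports/1", 0755);
	put(NVMET "/ports/1/addr_trtype", "rdma");
	put(NVMET "/ports/1/addr_adrfam", "ipv4");
	put(NVMET "/ports/1/addr_traddr", "192.168.0.10");        /* placeholder */
	put(NVMET "/ports/1/addr_trsvcid", "4420");

	/* Bind the subsystem to the port so clients can discover and connect. */
	return symlink(SUBSYS,
		       NVMET "/ports/1/subsystems/nqn.2019-03.io.example:noload");
}
```

The client then does a normal nvme connect to that address, and the accelerator namespace shows up as if it were a local device.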
So this is a little example of...
We mapped this one.
It's got six accelerators, I guess, sitting on it,
plus a couple of other drives.
And one thing I wanted to highlight here is you can see that we have all the information
on these accelerators here. That was enabled by some
pass-through patches in the kernel that allow us to get
the full identify command through to the client machine in the over-fabrics connection.
So typically it would look more like this
if we didn't have the pass-through patches,
and you wouldn't know the exact specifics of the accelerator that are there
or the devices that are there.
But these pass-through patches, which work for all NVMe devices,
allow us to discover information about our accelerators
in a very easy way using standard drivers.
Okay, so one case study.
So the first thing we did,
and this was a demonstration or a demo we had at FMS.
And in this demo, what we did is we had a local machine that was running the client application,
which was a process that was requesting compression acceleration.
And that hooked up through a NIC, a high-speed network, to a Broadcom Stingray that had a JBOF behind it.
And inside of that JBOF, there were some SSDs,
NVMe SSDs, and there was a NoLoad card.
So what we did is...
Oh, I just want to take a side note.
We have tested this on both RoCE and TCP/IP.
It works well on both,
and the results that I have for the compression
would be equivalent in both networking protocols.
So this one was our U.2 form factor for this demonstration.
So because of that, it's Gen 3 by 4 currently,
so we would anticipate we'll get about 3.4,
you know, a maximum of 3.4 gigabytes
per second of data rate into and out of that accelerator. And the local client was unaware
that it was using the over-fabrics connection. And in the demonstration, just as a side note,
because like I said, we demonstrated this at FMS. In that demonstration, we had this running in the Broadcom booth
doing an over-fabrics test.
In Xilinx's booth, we had the exact same user code,
not one line of code change, running a peer-to-peer example.
So this is really one of the strengths of using NVMe
and the NVMe over-fabrics and the peer-to-peer functionality
of NVMe is that user software doesn't have to be aware of the architecture of the network.
So what we anticipate will happen is that we have this software running on the client
over here, and it goes and requests an acceleration function
to be performed.
You can consider because it's been mapped over fabrics,
it appears like it's here.
So what it'll do is the client will say,
acceleration resource, please run this algorithm,
compression, for instance, on the device.
It will push the data in,
which it's either reading from files
or generating on the fly,
push the results into the accelerator,
and as the acceleration function completes,
it will then go and read the data
back into the system memory on the local client,
and then will do with it whatever it wants,
which might be, which it was in this case,
mapping it out to another NVMe SSD,
which may or may not be local. It could itself
be an RDMA or over-fabrics SSD. So that data will, you know, in the network sense,
it will travel through to the accelerator, perform the acceleration function, come back up,
and then be written to the SSD, which may be there, it may be here, it may be somewhere else.
And here's the results we got for running two of our compression cores.
We can fit more on the card; this demo just had two.
So you can see that we're showing the results.
Each core we have is capable of generating about one to one and a half gigabytes per second of compression.
And these two in this particular case
with this set of files were generating
one gigabyte per second of data.
And I mean, all this is really showing is that the data across the network
is the 2.1 gigabytes per second. Now, one thing I should note is there is a little bit of fabrics
latency compared to direct attached. And that fabrics latency, the performance that we saw here
is exactly the same as the performance with this data set on a direct attached. And the way we're
able to do that, of course, is because we have multiple configurations and data sets in flight at a time,
the latency is hidden from the throughput. I mean, there's still, the latency is still there,
of course, when the data's gone out and come back on a single file, but the latency is hidden in
terms of throughput. And we'll talk a little bit about the impact on the target machine in a moment.
There is some impact on the target machine
if you think about how the data has traveled in the previous discussion,
but that's true of the next example as well.
So in this case, we did erasure coding (EC) over fabrics,
and we have a core that supports up to 32 plus 4 disk groups
with block sizes ranging from 16K to 128K bytes,
and this one is our Gen 3 by 16 form factor.
And in this case, both NoLoads...
Well, in this case, NoLoad, again,
was on the remote server,
and we're going to pass data into and out of it
from the local client.
And again, same as the other thing,
the user code is completely unaware
of the acceleration algorithm it's doing.
So in this case,
the client software for this one,
I think this was a 10 plus 2,
so based on the size of the block,
we'll get different performance results,
but this was a 10 plus 2,
so we had about 6.4 gigabytes per second in
with 2 gigabytes per second coming back out.
And in this case,
this is an older acceleration algorithm for us.
The results have a very small latency penalty
versus the direct-attach results.
They'd be a tiny bit better with direct-attach.
So, again,
kind of hit this point where the data is traveling in and out. So what is the performance impact on
the target? And for that, I'm going to pass the mic over to Steve. And so he can talk about what
are the performance impacts and what are some different creative ways we can mitigate those
performance impacts? Thanks, Sean.
Before I go any further,
I want to shout out two people in the room, actually.
So Alan Cantle from Nallatech
has been a huge hardware partner of ours.
Alan, stand up for a sec there.
So he's instrumental in getting
our U.2 form factor device into the market.
So we're an IP company.
We put the bit files
on the FPGA. We rely on partners to build interesting hardware. So, you know, Sean showed
a slide showing the U.2 cloud deployment and add-in card. Obviously, things like ruler form
factor and next generation form factors are also interesting. And to be honest, for us,
it's just a board spin. And, and you know alan's company is really awesome at
doing that the other person i want to thank is uh chitania i think i saw you down the back
yeah wait stand up there so sean mentioned pass-through patches uh for the nvme over
fabrics target so right now if you go to the inbox driver for nvme certain commands are not
passed directly to the backing nvme drives There's a layer of indirection.
In order to do what we want to do, which is present those namespaces pretty much as is over the network,
we have to apply his pass-through patches, which he'd very kindly written and submitted.
So we didn't have to do that.
Just another advantage of using NVMe.
I think one more point I want to make clear.
One of the things I think that's very interesting about this work,
Sean mentioned it a couple of times,
but to my knowledge, this is the only way
you can have agnostic user space code running
that's identical regardless of whether the PCIe accelerator
is in the box with the application
or remotely connected
to the application, in theory over TCP/IP, which means it could be anywhere else on the planet.
And, you know, there may be other frameworks that can do that, I certainly don't know of them,
but the advantage, you know, the NVMe over Fabrics thing really does mean that user space has no idea
whether that NVMe namespace is in the box with the application
or somewhere else on the planet entirely.
And what's interesting about that is
if the application is, for example, a virtual machine
or a Docker container or some other runtime,
you don't have to change your code to migrate your application.
And you can migrate your application
from things that have accelerators in the box
to things that have the accelerator's network attached.
And the application may suffer some quality of service, but it will certainly continue to run.
And you haven't had to recompile or touch anything.
And I think, I don't know if kubernetesify is a word, but I'm inventing it if it is not.
So basically this lets you kubernetesify FPGA acceleration or any other. It doesn't have
to be FPGA, anything else. So obviously, you know, this is all very interesting, but there are
repercussions to physically moving a device from one place to another, right? There is this thing
called physics, apparently, which, you know, impacts us. And fake news might get around a lot of things
but it doesn't get around physics
and there is latency and so forth involved.
The other thing that we're doing
is we're taking an accelerator
that's in a box with a server
and we're basically putting it on a target system
so there's now two processors involved.
There's the processor on the client
on the compute node
and there's the processor on the target
the storage
controller, as we often refer to it. And both of them now have to execute CPU cycles in order to
get things done. But what's interesting is that we actually have in some of the new hardware that's
coming from people like ourselves and Mellanox and others, to name a few. We have some interesting hardware that's going to help us offload
some of the repercussions from the target side.
So I'm going to talk about two of them.
One of them is memory offload.
So how do we get the DDR subsystem
on the storage controller out of the way?
I'm going to talk about this very briefly here,
but after lunch, I'm going to be talking about it a lot more. And you can either go to Jim Harris's talk or you can come to my talk.
It's up to you. They're both going to be awesome. I'm going to go to Jim's talk.
So one of them is to get the DDR. And the reason why, I'm going to talk about this a little bit
more after lunch, but the reason why I want to get the DDR subsystem
out of the way on targets
is because I don't want to use big processors there.
I want to use RISC-V SOCs.
And they don't necessarily have a lot of DDR channels,
but if I want to do 20 gigabyte per second of IO,
I better not have the DDR of that little SOC in the hot path.
I've got to have that in the cold path.
So there's something that we can use to take advantage of that.
I'll talk about it a little bit here.
I'll talk about it more after lunch.
And then the other thing is, of course,
those little SOCs don't necessarily have a lot of processor cores.
They might do, but they might not.
And they have to execute a lot of IOPS,
and every IOP is some lines of C code that are in a driver
that have to be executed on the instruction set architecture of that target.
So there are some ways and means that we can get around that.
I'm going to look at those a little bit.
After we've looked at them, I'm kind of showing you the end game before the slides.
But basically, what we can do, at least on the CPU utilization side, is we can take advantage
of things like NVMe offload.
So this is a feature from Mellanox.
It's provided in both Bluefield and the CX5.
It's a state machine that's in the PCIe endpoint
that can essentially administer NVMe commands on your behalf.
So the driver basically says,
I'm not going to be the one ringing the doorbell
or pushing
submission queue entries or polling for completion. I'm going to use a little piece of hardware to do
this. This is either an awesome idea or it's a crazy idea or it's a crazy awesome idea. As someone
who works on operating systems, the thought of having a little piece of hardware doing this kind
of scares me a little bit. But at the same time, it's pretty interesting. So we've implemented code that takes advantage of that little state machine.
And we're seeing pretty much 98% CPU offload.
So the CPU load goes from a nominal factor of 100 to a nominal factor of 2.
The processor basically isn't doing anything anymore.
And we're still doing an awful lot of I.O.
So that's very compelling.
There's issues around error handling that have to be thought about,
but that's a huge potential win.
That's maybe moving away from a pretty serious storage controller processor
to something that's a lot more lightweight.
So very, very interesting.
And then on the memory side, like I said,
I will talk about this a lot more after lunch
but there's a framework that myself and others have been working on for quite some time
that's on its way, hopefully, into Linux pretty soon here; we're getting some
good acknowledgements even in the last couple of weeks. It lets us do DMAs directly
from one PCIe device to another, what we call peer-to-peer DMA, P2P DMA.
Traditionally, a DMA does not do that. Traditionally, if you want to move data from
device A to device B, you have to do a DMA through system memory. Now it may, if you have things like
Data Direct I/O, it may get stuck in your L3 cache, which can be a good thing or a bad thing, depending
on what that L3 cache is trying to do.
But often it will end up in DRAM.
And that's memory bandwidth that has to be provided.
If you're doing 20 gigabytes per second, that's 40 gigabytes per second of DDR bandwidth.
That's quite a lot of DDR channels, right?
So that's money and power and so forth.
If you have memory on the PCIe endpoints, then they can be the DMA destination or source.
And one of the most famous recent examples of a standard that defines memory on a PCIe device is the controller memory buffer or persistent memory region of an NVMe device. So now we have a
standards-based way of providing memory for this framework.
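For the kernel-minded, here is roughly what handing a CMB to that peer-to-peer framework looks like, modeled on the p2pdma API as it was being proposed for Linux around this time; the exact function signatures have shifted between kernel versions, so treat this as a sketch rather than copy-paste code.

```c
/* Kernel-side sketch: expose an NVMe controller memory buffer (CMB) to the
 * peer-to-peer DMA framework so other PCIe devices (an RDMA NIC, say) can
 * DMA straight to it instead of bouncing through system DRAM.  Modeled on
 * the Linux p2pdma API of this era; signatures vary by kernel version. */
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

static int expose_cmb(struct pci_dev *pdev, int bar, size_t cmb_size,
		      u64 bar_offset)
{
	int rc;

	/* Hand the BAR region backing the CMB to the p2pdma allocator. */
	rc = pci_p2pdma_add_resource(pdev, bar, cmb_size, bar_offset);
	if (rc)
		return rc;

	/* Advertise it so consumers like the NVMe-oF target can find it and
	 * prefer it as the DMA destination/source for peer-to-peer traffic. */
	pci_p2pmem_publish(pdev, true);
	return 0;
}

/* A consumer (the fabrics target, for example) then allocates from it: */
static void *grab_p2p_buffer(struct pci_dev *pdev, size_t len)
{
	return pci_alloc_p2pmem(pdev, len);    /* paired with pci_free_p2pmem() */
}
```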
And if we take advantage of that, basically, as well as getting the factor of 50 offload in the processor, we get roughly the same, about a factor of 50 offload in the DDR bandwidth
on the storage controller processor while we're still achieving the same amount of throughput.
So again, this is another big bang for the buck.
And the great thing about this one compared to the other one
is I'm not taking the OS out of the path.
The operating system for the memory offload
is still potentially the one doing the IO.
It's just that the DMAs are not going through
my memory subsystem on my processor anymore.
So that's a win-win.
So if we put all the things together that we've
talked about, basically we have the ability to disaggregate accelerators that could be FPGA-based,
but they could also be SoC-based. They could be GPGPU-based. They could be something else.
We can disaggregate them across networks. That network could be fiber channel. It could be
Ethernet-based with RDMA.
It could even be, coming soon (Sagi, I think I saw him earlier, he maybe stepped out),
TCP/IP as well.
Totally standard using inbox drivers.
That's a pretty important point.
And then combining it with techniques like this,
we can build some very, very interesting NVMe over fabric target appliances that don't necessarily need a
lot of processor horsepower, but contain an awful lot of accelerator capabilities that we can push
out onto the network. The other part of this is anyone who's still alive at seven o'clock tonight,
come and join us for the Birds of a Feather on computational storage standardization.
What we're talking about today is a little bit, we're not the only company doing it,
but everyone's kind of, right now, we're all doing it a little bit differently.
Standards are the way forward, right?
So this is kind of showing the path.
This is saying, this is interesting.
There's value in doing this kind of thing.
Let's get people in a room and knock our heads together and work out how we standardize this. How do you standardize it in NVMe? How do you standardize it in SCSI?
How do you standardize it in Ethernet? Where does it need to be standardized, and what does
that look like? We push that standard into the drivers, into the operating systems,
and we create a vendor agnostic battlefield in which we can all go and try and carve out viable companies that make money and make people.
Computational abilities over NVMe enable a lot of stuff, kind of for free, because it's already there.
Come to the Birds of a Feather, standardize it.
If you're interested in peer-to-peer, I'm talking about that after lunch.
Thank you very much.
Questions?
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers
in the storage developer community. For additional information about the storage
developer conference, visit www.storagedeveloper.org.