Storage Developer Conference - #115: Accelerating RocksDB with Eideticom’s NoLoad NVMe-based Computational Storage Processor
Episode Date: December 4, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode 115.
I'm going to talk a little about the value proposition of computational storage as I see it.
To be honest, today I want to talk a little more about software consumption
than the actual specifics of the title of this talk,
which is one specific instance where computational storage at least provided some value.
I think we had a really animated, shall I say, birds of a feather last night. And, you know, among the rambunctiousness,
there were some very good points made by some very serious people around,
you know, why on earth would I do this?
Why don't I just buy a bigger Intel processor?
You know, and somebody just commented there's no such thing.
At some point, you can't buy a bigger one
because you're literally buying the biggest one that there is.
But it's a fair point,
and one of the things I say to a lot of people right now,
and people ask me, why do you work with Scott and Thad?
Why did you guys get together and try to create this?
And I'm like, because my biggest competitor right now isn't any of them.
It's just an Intel Xeon.
And every single account that I try to sell into,
what it always comes down to is an Excel spreadsheet
of this is what I pay CapEx and OpEx-wise
with an Intel Xeon,
and this is what I'll pay if I do it with you guys.
And if the numbers come out positive,
they're going to buy my stuff.
And if the number comes out negative,
they're not going to buy my stuff.
And it's literally that simple.
So the value proposition we're finding is there.
Now, is it there every single time?
No.
And is it going to be there for every single application?
No.
But there are definitely instances that we're finding,
and I'm finding them, and Thad's finding them,
and Scott's finding them, and other people are finding them,
where the Excel spreadsheet comes out with a positive number.
And these people are rational human beings.
They're like, this will cost me less.
This makes sense, right?
And sometimes it doesn't.
So that's kind of talking about that.
I'm going to talk a little about some of the things I think we're trying to achieve
with this technical working group.
One of the things that is very important to me is how what we're trying to do at the hardware level, with accelerators and offloads and in-situ processing, interfaces with operating systems and applications: the software consumption model for computational storage. And so one of the things that we're trying to do this time around is,
like I said, all of this has been done before.
We've been using accelerators for a long time.
We've been using accelerators in storage for a long time.
One of the things is that, even prior to this, we have tried to standardize some of that.
But again, this time I think we are really trying to have a concerted effort
to standardize some aspects of this
so that there is vendor interoperability
where it can be there
and where basically we can get support
into operating systems and open source user space libraries
to make it easier for end users to deploy this stuff.
Because if everyone's proprietary
and everyone's bespoke, you end up
with this cabling nightmare like this poor gentleman
here, and
it gets very painful.
So let there be light, and now we've had
this light since about FMS
last year, so just over a year.
We've had the technical working group working
for about 10 months.
Last night there was a lot
of discussion around, show me the specifics of the spec.
You know, unfortunately standards don't happen that quickly.
I'm looking at a bunch of standards people nodding their heads, right?
We're moving.
Do I wish we were moving faster?
Yes.
You know, I think everybody would like that.
But standards just take time.
Somebody was saying, you know, how long did it take to go from Fusion-io to NVMe, right? It took a while, an awful long time, right? And so these things, especially
in the storage world, tend to take a while, right? But at least we have a lot of good
people, and I think a lot of the right people, talking to each other, trying to make some
of that happen.
There's a speed limit on that, right? This one here, that was a birds of a feather at Flash Memory Summit.
We hadn't decided definitively at that meeting that SNIA was the right venue.
That was the meeting where we decided to approach SNIA.
So, SW, it's a good clarification.
SNIA came after that.
And I apologize for this weird thing in the bottom of my slides,
but I'm not going to try and fix that
because I'll just break something else.
So what are we trying to achieve?
What are the benefits of computational storage?
Scott did a great job earlier talking about some of them.
The way I think about it is certain tasks
are better done by an accelerator
than they are by an instruction set architecture.
There are certain things: compression, encryption,
maybe artificial intelligence inference, maybe video transcoding.
There are definitely cases where doing something using an instruction set-based approach
is less efficient in terms of picojoules per bit,
which turns into watts once you multiply by throughput; 10 picojoules per bit at 100 gigabits per second, for example, is one watt.
Doing that on a Xeon versus doing it on an Intel QuickAssist,
or doing it on an NVIDIA graphics card,
or doing it on Fazil's accelerators on a SmartNIC,
or doing it on our solution, or doing it on Scott's. And again, it's an Excel spreadsheet. It depends on the application,
it depends on the transformation you're trying to do, and it depends on the energy efficiency
of both the CPU and the accelerator. But you can tell when you go out into the data centers today
and look at all the different heterogeneous accelerators
that there's definitely people crunching these Excel spreadsheets
and making the decision to go, I can't run all the code on Xeon.
And I apologize to AMD and ARM, but I'm just going to keep saying Xeon
because that's what 99.999 recurring percent of the market cares about.
So, you know, there's always going to be certain cases where it is just better to keep it on the Xeon.
And there's also cases sometimes where the customer is like, well, you know what?
I've got the Xeon anyway.
And I got the one I wanted because it was the one that they gave me a deal on.
And I've got spare cycles anyway.
So what do I care?
And those are very hard accounts to win, right?
But it's hard to compete with free
or I've already bought it already.
But there's many, many other cases
where we're finding people are going,
well, you know, to be honest,
either I can't do what I want to do with the Xeon
because it just isn't even capable of it,
or it's just not very energy efficient
and I want to reduce that.
So taking certain computational tasks and moving them off a Xeon processor
and putting it onto some other processor,
now whether that's an ARM and an SSD or whether that's an FPGA or whether that's a GPU,
that's almost like a second order thing.
It's the fact that they've made that decision
that it's more efficient to do that certain operation somewhere else. And that's
one part of computational storage. The other one that Scott also covered well earlier today is the
reduction of data movement. So if I can have that second point of computation be closer to my data,
and that could be on a smart NIC if you want to process data that's ingesting from
a network. That could be on an SSD if you want to process data that's on the NAND and you don't
want to have to go out the interface. It could be a computation element on a storage array that
processes across all the drives that are in that array, right? It could be on a CXL or a CCIX card
with persistent memory.
It could be any of those places,
but it's somewhere that reduces the amount of data movement required.
I don't have to take all my data,
DMA it or network it to some host memory and process it there.
And again, that's essentially a picojoules-per-bit argument.
PCIe lanes consume power.
Moving data consumes power.
And power is very important to a lot of my customers.
And they have Excel spreadsheets.
And they negotiate their electricity rates.
And they build their own power plants.
So power is very important.
And again, that's something you put in an Excel spreadsheet.
There are other benefits to not moving data that don't get discussed, and they're a little harder to convey to customers sometimes. One of these is that sometimes you've added memory to your Xeon
processor not because you needed the capacity but because you needed more memory bandwidth.
So typically, you know, you'll get a number of different memory channels on these Xeons or these AMD devices,
and you can populate them in different ways, and depending on how much DRAM you put in,
sometimes you add DRAM to get more capacity of your host memory,
and sometimes you actually just do it because you need more bandwidth.
And so by removing data movement to that host memory,
sometimes you can make a volumetric reduction in DRAM,
and sometimes you can actually just get away with fewer channels,
which sometimes means a cheaper Xeon processor.
Because some people buy Xeons for the memory controllers.
They don't actually buy them for the CPU cores.
Somebody nodding their head over there.
And that happens quite a bit.
There's another interesting thing that happens.
If you take one of Fazil's very high-performance RDMA NICs
and you start ingesting a lot of traffic,
and then if you go buy a whole bunch of NVMe drives
from our friends at Samsung or Intel or SK Hynix or whatever,
before long, you realize you now have 10, 20, maybe even 30 gigabytes per second of
traffic just flowing through that network to storage path.
And the processor may not even have to touch a single bit of that, because it's basically
just saying, I need to ingest, and I need to push it onto block storage. And normally, we'll provide some kind of services on that data, but let's just
imagine the case where you don't. At this point, now you're basically consuming 20, 30 gigabytes
of data, and that has to go, potentially, it doesn't go into the host memory, because you
could have L3 caching, and now we start playing all kinds of fun games with DDIO.
But the reality is with that amount of data, you're probably going to flood your caches.
And so all that traffic is probably going out on the DRAM channels and it's probably coming back in again. So now it's 60 gigabytes per second of DDR traffic. And in certain applications,
maybe that doesn't matter. This is a case where you might have to buy the Xeon just for the memory channels, not for the cores.
But in other cases, you care a lot because there are actually applications running on those Xeon cores.
For example, in hyperconverged infrastructure.
So in hyperconverged, you now have the problem that you have VMs that you're probably renting to people
who are probably measuring the quality of service.
And they start noticing that because
of this DMA traffic, when they try to go out to host memory to get a cache line, the quality
of that DRAM access is non-deterministic, right?
Because it's now fighting with all this DMA traffic.
So we've actually had cases where we've been able to go in and use measurement tools on
the memory controllers and measurement tools provided by PMON,
Performance Monitoring Counters.
And you can actually show the customer that,
if I get rid of that data movement,
the quality of service on the VMs
accessing their memory footprint
has just gone up by an order of magnitude.
I've pulled the tails in by a factor of 20.
Some people care about that. It's harder to
put that in an Excel spreadsheet. It does translate into better quality of service and perhaps a
competitive advantage over your competitors. And it's of great interest to people in, for example,
multi-tenant public cloud or hyperconverged infrastructure, right? So there are nuances to the things that happen there. But that data reduction definitely reduces energy,
probably reduces memory requirements in both capacity and bandwidth,
and provides additional quality of service.
So that's a really important part of what we're trying to do with computational storage,
is reduce that data movement.
And then the third one, which is the one I'm really keen to focus on,
is to bring a vendor
neutrality to this.
So most of the accelerators that you can get today to do this kind of stuff are incredibly
vendor specific.
They're even vendor specific in the sense that the PCIe interface requires a vendor-specific
driver.
So there's no standardization around the PCIe interface.
There's no standardization around discovery.
There's no standardization around configuration.
There's no standardization around manageability.
And then you go to the large consumers of accelerators
and talk to them about this.
This is a problem for them
because companies will turn up once a week and say,
hey, we just raised $50 million.
We're doing an AI chip.
It's faster than everybody else's AI chip.
And by the way, it has a completely nonstandard PCIe interface
that your kernel team are going to have to download
a terrible driver from our website under NDA
and integrate that with their existing kernel,
which runs live on all their critical services
that they make trillions of dollars on.
And by the way, the user space library is kind of written
in some weird language with some weird compiler
that obviously doesn't work with anything that you currently
do. Could you give us 10 of your software
engineers to integrate this and deploy it
so you can try out our hardware?
And what do you think the Facebooks
and the Googles and the Amazons of the world say
to that? They say no, right?
Their software teams are oversubscribed by 200%.
They don't have time.
They will not touch their kernel.
They certainly won't introduce third-party stuff
into their kernel, right?
That's insane.
So they want some consistency here.
And that, for me, is one of the important things that the technical working group needs to bring for this to be successful, is that consistency.
And the way that I say it, but I'm not allowed to say it because my VP of marketing tells me I'm not allowed to say it, is that I want to make computational storage consumable by idiots.
I used that in a Google meeting, and the guy was like, did you just call me an idiot?
I said, the implication is that if an idiot can do it, people who are brighter than an idiot can do it. But that isn't always
necessarily true, as my wife will tell me when I try to do it.
You know, one of the demos that we did at Supercomputing a couple of years ago was awesome
because we were offloading compression in a file system. And basically, as we were plugging
in more NoLoads, the fans kept going down.
Because basically, it was getting more efficient, and the UEFI code or whatever was detecting that things were running cooler
and it could power down the fans. And that's actually the one the customers
enjoyed the best. Put in one NoLoad, the fans go down
by 10%. Put in two, the fans wind down again. Three, etc.
So the system was running more efficiently
because of the computational storage.
And I didn't have to do a lot of hard things
to make that happen.
And ideally, it's vendor neutral
so that the hardware people can fight
over who has the best hardware,
but the software people don't have to worry
about vendor A's code being different
to vendor B's code to being different to vendor C's code.
So what do we actually do?
We basically use, leverage, abuse NVMe.
That was a debated term last night.
And we present accelerator functions through a standard NVMe endpoint.
And this is some marketing bullshit.
Apparently we're NVMe compliant, or whatever that means.
The key thing is, the way I look at it, our engineering team is kind of split in two.
We have some people, or sorry, the hardware part of our engineering team is split in two.
And basically some groups work on a very high-performing NVMe interface that we've done ourselves, and we've made some modifications
that make it more appropriate for computation and storage.
But at the end of the day, it is an NVMe front end.
NVMe, I'll talk about in a minute, is pretty awesome for talking to accelerators,
just like it's awesome for talking to storage.
And then the other part of the hardware team are basically a whole bunch of super geeks
who work on these different types of accelerators.
And we tend to do things in RTL.
We've tried things like high-level synthesis (HLS) and OpenCL.
But typically what our customers care about
is how many gigabytes per second can we process
per LUT or per gate, and how many picojoules per bit.
And the way you get those numbers where you need them to be is handcrafted RTL.
It's not the easiest to work with and normally requires bright PhD people,
but we have those, and that's what they do.
And the great thing is no matter what accelerators we come up with,
they're always exposed through NVMe.
So that part of the story stays the same. And that,
for me, is very important. So as we bring in other acceleration functions, or you do,
or somebody else does, the consumption model from a software point of view stays the same.
And it's aligned with a protocol that many, many of our customers are already very comfortable with
because they're deploying NVMe SSDs at scale.
These are people who are deploying tens of thousands,
if not hundreds of thousands, of NVMe devices,
and we're just another NVMe device.
We just don't store data.
We process it.
I'm probably not going to walk through this.
I tend to, whatever.
So why NVMe?
Before this startup, I was in the CTO office of PMC-Sierra and then Microsemi.
We were doing a lot of work on NVMe.
We did an SSD controller that we were very successful with,
and hyperscalers and storage OEMs used that controller to build SSDs. So I spent a lot of time with those customers, a lot of time.
And one of the things I kept seeing is that this NVMe protocol seemed like it was so good,
it should really be used in a broader way.
NVMe is a transport as opposed to just for storage.
And when you look at what an accelerator requires versus what NVMe provides,
the matchup is awesome.
So, low latency? Yep. NVMe
is a very low latency protocol. It has to be because it's talking to NAND. And even more
recently, it's now talking to things like Optane, which are not NAND. So that makes them AND. That's
my Boolean joke. Many of you will have heard that one before, but I love it. So I keep using it.
High throughput. You know, NVMe is pretty darn efficient.
Now, there's definitely a couple of things,
like WD had an interesting kind of twist on NVMe
that was more efficient,
but it's still very efficient.
I can get a lot of NVMe commands
through a single thread on a Xeon processor.
I think there was a talk on Monday
that was like 10 million IOPS on a single core.
So if all I'm doing is issuing and processing NVMe commands,
I can get a lot of those done per second,
and I don't necessarily need to use a lot of cores to get that.
So NVMe is soft touch.
It's lightweight.
The people working on the NVMe driver,
like, I couldn't hire that team even if I wanted to,
and I had infinite resources, right?
So, you know, we have Christoph.
I don't know if he's in the room.
We've got Sagi.
We've got Keith.
The team that work on the NVMe driver in Linux and other operating systems are world class.
And they write really, really good code.
And they care about the hot path.
How many IO per second and how many CPU cores does it take me to get that IO per second?
Efficiency, efficiency, efficiency.
Multi-core aware.
So NVMe was designed from the ground up
to be aware of the fact
that processors are no longer single-threaded
that run really, really fast.
Has anyone got a 12 gigahertz Xeon?
If you did, you would have a fire, right?
We've gone multi-core, and if you look at the new Rome,
I think the new Rome has 8,000 cores.
I don't know.
I lost count after 100 and whatever.
So having a protocol that understands that,
that has multiple queues,
then queues can be assigned to the threads
that run on different cores, makes a lot of sense.
So it scales on a multi-core environment.
And quality of service.
So NVMe has a very rich queuing structure.
You can have quality of service on the queues.
We have things, you know,
we have all kinds of interesting things
that we can either do today
or we will be able to do in the future
around how do I prioritize traffic
that's going into an NVMe controller.
Do I have certain queues
that are higher priority than other queues?
Can I have priorities within a single queue?
These are all things that NVMe understands
because these are things we care about from the storage side,
but they also happen to translate incredibly well over to accelerators.
So the real question is, why not NVMe?
We could invent something from scratch, but it would take a long time, and it would look a lot like NVMe.
So I want to talk a little about how we use NVMe today inside of Eideticom, which is not standard.
And I'm not even proposing that the way we do it becomes the standard.
I'm not going to be that hubristic.
But I'm going to help kind of describe what we're doing and hopefully the working
group and others can provide feedback on what parts suck and what parts
don't suck quite so much. So, presentation.
If you look, if you put one of our
devices, and you're welcome to buy several from me at any time,
into a system, a system
running Linux. And if you do lspci, you will see a PCIe device that has the Eideticom vendor
ID. And we just use standard PCIe to recognize it. And then as the operating system boots up,
it will look at the class code of that device and go, oh, this thing is telling me it's an NVMe drive.
Now, I guess I bind it to the NVMe driver.
I don't have any UDEV rules telling me to do something else.
So I'll just bind to the NVMe driver.
So we don't make any modifications to the kernel.
We use the inbox NVMe driver.
And what happens then is the accelerators basically turn up as namespaces, NVMe namespaces
behind that controller. And if you have a lot of accelerators, you'll see a lot of namespaces,
and there's some games that we play there, but that's the basic idea. Now, the thing is that
NVMe as it stands today has no concept of accelerator namespaces. We've talked a little
publicly about computational namespaces,
but they are in no way officially a thing.
So what we do today is,
if you look at the Identify Namespace data structure in NVMe,
you see that the wise people who work on the standard
saw that there was a need for a vendor-specific field
in that structure,
and so we get to put whatever we like
in that vendor-specific
field.
So we put in what our accelerators do.
So now you know you have an Eideticom device.
You can go, well, I'll do a namespace ID list and get all the namespaces, and then I can go
to each of those namespaces and say, I know you're an Eideticom device, so tell me exactly
what your namespace does.
And we'll say things like Zlib compression or erasure coding or whatever.
And the great thing is that now you have a path
to a user space library or even an internal consumer
who can say, well, if it's a Zlib compression engine,
I'm going to use that to do Zlib compression.
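To make that discovery flow concrete, here is a minimal user-space sketch. It is an illustration only, not Eideticom's library code: the device path, the namespace ID, and the idea of dumping the tail of the Identify Namespace structure as the vendor-specific area are assumptions on my part; the passthrough ioctl and the Identify opcode are standard Linux and NVMe.

    /* Sketch: pull the 4 KiB Identify Namespace structure through the inbox
     * NVMe driver's admin passthrough ioctl and dump some trailing bytes,
     * which is where a vendor could advertise what an accelerator namespace
     * does. Device path, nsid, and the dumped offsets are illustrative. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        uint8_t id[4096];                      /* Identify Namespace data structure */
        struct nvme_admin_cmd cmd;
        int fd = open("/dev/nvme0", O_RDONLY); /* controller character device (example) */

        if (fd < 0) { perror("open"); return 1; }

        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode   = 0x06;                   /* Identify */
        cmd.nsid     = 1;                      /* namespace of interest */
        cmd.addr     = (uint64_t)(uintptr_t)id;
        cmd.data_len = sizeof(id);
        cmd.cdw10    = 0x0;                    /* CNS = 0: Identify Namespace */

        if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) { perror("identify"); return 1; }

        /* A vendor-aware library would parse the vendor-specific region here
         * to learn, for example, "this namespace is zlib compression". */
        for (int i = 4064; i < 4096; i++)
            printf("%02x ", id[i]);
        printf("\n");
        close(fd);
        return 0;
    }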
How do you configure these things?
Well, NVMe has admin queues.
Every controller must, must have an admin queue.
And those admin queues have vendor-specific commands.
So without abusing the NVMe spec,
but by doing something that's vendor-specific
but still within NVMe,
we can configure our namespaces today.
We can set compression levels.
We can configure Galois field parameters for erasure coding.
And all of that can be done through the standard NVMe admin path,
which has a user space interface.
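Here is a similar sketch of that configuration path. The mechanism, a vendor-specific admin opcode sent through the same passthrough ioctl, is standard NVMe (the spec reserves admin opcodes 0xC0 through 0xFF for vendors); the specific opcode and the meaning of the command dword (a compression level, say) are hypothetical.

    /* Sketch: configure an accelerator namespace with a vendor-specific admin
     * command. Opcode 0xC1 and "cdw10 = compression level" are made up for
     * illustration; only the passthrough mechanism is standard. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    static int set_compression_level(int fd, unsigned nsid, unsigned level)
    {
        struct nvme_admin_cmd cmd;

        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode = 0xC1;      /* hypothetical vendor-specific "configure namespace" */
        cmd.nsid   = nsid;
        cmd.cdw10  = level;     /* hypothetical: compression level 1..9 */

        return ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
    }

    int main(void)
    {
        int fd = open("/dev/nvme0", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        if (set_compression_level(fd, 1, 6) < 0) { perror("configure"); return 1; }
        close(fd);
        return 0;
    }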
And then, obviously, how you consume this is very important.
So right now, what I would love to have is NVMe commands
that are tuned for computation.
So maybe pointer to pointer.
Maybe some other things.
But I don't get that yet.
I don't get to write the NVMe spec.
So I've got to live with what's available today.
And NVMe commands right now
are basically things like reads and writes.
But that's okay.
I can use writes to get data into the namespaces,
and I can use reads to get results out.
And it gets a little bit funky when the output size is different than the input size, but it's not that funky.
And if you've got a bunch of bright engineers like we do, you can find a way of solving that.
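A minimal sketch of that write-in, read-out consumption model follows, assuming a hypothetical compute namespace exposed as /dev/nvme0n2. The offsets and the idea that the result comes back at offset zero are illustrative only; a real library would also deal with output sizes that differ from input sizes.

    /* Sketch: push a block of input down to an accelerator namespace with an
     * NVMe write and pull the transformed result back with an NVMe read. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t blk = 4096;
        char *in, *out;
        int fd = open("/dev/nvme0n2", O_RDWR);  /* hypothetical compute namespace */

        if (fd < 0) { perror("open"); return 1; }
        if (posix_memalign((void **)&in, blk, blk) ||
            posix_memalign((void **)&out, blk, blk))
            return 1;

        memset(in, 'A', blk);                   /* data to be transformed */

        /* NVMe write: DMA the input buffer down to the accelerator. */
        if (pwrite(fd, in, blk, 0) != (ssize_t)blk) { perror("pwrite"); return 1; }

        /* NVMe read: DMA the result back out. */
        if (pread(fd, out, blk, 0) != (ssize_t)blk) { perror("pread"); return 1; }

        printf("first result byte: 0x%02x\n", (unsigned char)out[0]);
        free(in); free(out);
        close(fd);
        return 0;
    }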
And so now I have NVMe compliance, and I also have the ability to do things either in kernel
space or user space to discover, configure, and use those computational resources.
And like I said, I'm not proposing in any way
that this is the way that the TWG needs to go,
but it is an example of something that's out there.
And we can learn from it,
and we can do things that make sense
and other things that don't.
A couple of other things that are interesting,
because these are NVMe namespaces
and the interface is fabric agnostic,
we can literally have a PCIe NoLoad,
or I can take the same NoLoad,
and I've done this with Fazil from Broadcom,
and put it on the other side of an NVMe over Fabrics link
and expose it over the fabric,
and the software on the host
neither knows nor cares that that device
is no longer local.
And I can use NVMe discovery to find it,
and I can consume it. So now I have
accelerators over Fabrics. So now
I could build a rack-scale architecture
where the accelerators sit at the top,
I use Ethernet and NVMe over Fabrics,
and my compute nodes don't have any accelerators.
They just borrow them when they need them.
And with namespace management commands,
I can program the bit file on the FPGA,
if it's an FPGA,
in order to get it to do the functions I want
before I borrow it.
And all of that is with NVMe.
And that's today.
That's NVMe as it is today.
As we think about adding computation,
we can obviously take that in a lot of different ways. One of my favorite questions when a customer
asks it is, what's your management story, Stephen? Because I get to say NVMe-MI, and I have yet to
have a case where that has not raised a smile in my customer. Like every single time. Because
they're like, that's awesome because I've literally
got 20 engineers who are writing NVMe-MI code right now because I'm literally transitioning
to NVMe or I'm kind of in that phase and I just get you for free because you're just an NVMe device.
I've got another good story where one of these AI accelerator companies went to a large hyperscaler
and the data path stuff looked really good. And then my friend asked them, so what's your management story?
And they were, oh, we don't have a management interface. They had built a chip with no
management interface, and they expected to sell it to a hyperscaler who was going to,
if they were going to deploy it, would deploy it in the tens of thousands.
Management is so important for the consumption of accelerators at scale.
Security, we've talked quite a lot about security
within the technical working group.
Obviously, NVMe is designed to store people's data,
and it's being used in public cloud.
So NVMe is being deployed in VMs, public rentable VMs.
So NVMe is very security conscious already.
And the customers, the big customers of NVMe,
are continuing to push features into NVMe
to try to make it even more secure.
So there are security things that are in there.
And there's certainly more that we're going to have to think about.
But there's definitely things that we can leverage there today.
Long term, like I said, where this goes from a standards point of view, I'm not sure.
But this is what we do in production today.
So just for fun, I always like to have a little bit of code.
So this is an example of, I don't know if Keith's here, but Keith Busch kind of started this tool, the NVMe command line interface, NVMe CLI.
Maybe a lot of you have used that to kind of look at your NVMe devices.
So this is the NVMe CLI output for a version of NVMe CLI that we have where we have a plug-in.
So many vendors have plug-ins for NVMe CLI.
That's not unusual. Intel and Micron and WD and most of the major vendors
have a plug-in to do kind of vendor-specific stuff.
But we can use that command with our plug-in
to, for example, get what type of namespace we have.
So here we actually have a means of communicating
to the user space that there's a couple of namespaces.
And I use a UDEV rule here to change the name of my computational storage.
That's just a UDEV rule.
It's not a big deal.
But this is just saying that, well, I made this vendor neutral,
I don't quite know why.
There are two namespaces on this device.
The device is made by vendor A.
That's because I don't want my customers to know they have to call me.
Namespace, a number, and then here I've actually pretended we do have namespace types.
And I said this is a conventional LBA-based namespace, so this would be a storage-centric namespace.
And this one's a computational namespace.
And then we can have subtypes. So we know this is a computational namespace. So for storage namespaces, maybe we print out the format that it's been formatted in.
But for computational types, maybe we say what type of computational namespace it is,
whether it's compression or artificial intelligence inference, or perhaps vendor-specific,
because there will be a lot of vendor-specific computation where the main job of the operating system will be just pass it up to user space,
and somebody up there will know what to do with it. So we can do all that. And again,
this all works for both PCIe and fabric-attached namespaces. And we can leverage all the discovery
mechanisms that currently exist in NVMe. And as a disclaimer, I did just want to make very, very clear,
I am not proposing this in any way, shape, or form as the way the working group must go.
I'm just being illustrative.
So, you know, we're a software company.
I have no intention of doing hardware.
And so what we deploy on today is
basically different hardware that's typically FPGA. It doesn't have to be, but it tends to be FPGA
based. But it can come in a range of different form factors. And because we align with NVMe,
it kind of makes sense to go with NVMe-centric form factors. And a lot of our customers actually deploy
on their own hardware,
so we're actually in a form factor of their choosing.
But some customers are deploying
on form factors like these.
So the U.2 makes a lot of sense.
Sometimes people want a lot of horsepower,
so they'll go add-in card.
And things like E1.S, we did a launch of that at FMS.
But these are never our hardware.
And I tell people we're not a hardware company.
Like I would deploy on any form factor at all.
I don't really care.
And for me, it's much more about the software and the consumption model through that software.
That's interesting.
Obviously, the difference from a customer-facing point of view is that typically these would be larger pieces of silicon
that you can get more work done on.
So you'd be able to get more of different types of acceleration capabilities here
than you would here.
But you're probably going to pay more for this one.
And so customers have to make decisions on what they want to do
and so on and so forth.
And this slide is much more interesting to me
because this is the software side of things.
So like I showed you a couple of slides ago,
we don't typically, well, that's not true.
We can do a model where we don't touch kernel space.
And that was very important.
I made it very clear to the engineering team.
We must have a mode where we can just have a user space library
and use the Linux kernel or the operating system that the customer already has.
And the engineering team ideally would not have made that decision,
but I forced that as a technical constraint that they had to live within.
Because that basically allows customers who want to,
to just use a user space library that we call libnoload,
and they actually do the namespace detection
and the utilization of those namespaces.
And what we do on the top end of libnoload
is basically say NoLoad compress,
provide that as a binding to a function
that the user space library implements,
which basically then calls down to the NoLoad
and offloads the compression, right?
So now their application can basically call NoLoad compress
rather than calling Xeon compress,
and the compression happens down here by a DMA and a DMA back,
or maybe a peer-to-peer to a drive if you want,
but that's a different story for another day.
And people can consume it that way.
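As an illustration of that binding idea, here is what the top of such a library could look like. The names and signatures below are hypothetical, not the real libnoload API; the point is that the application sees one ordinary compress call while the library hides the namespace discovery, the DMA down, and the DMA back.

    #include <stddef.h>

    /* Hypothetical libnoload declarations (illustrative only). */
    struct noload_ctx;
    struct noload_ctx *noload_open(void);             /* find and claim a compression namespace */
    int noload_compress(struct noload_ctx *ctx,
                        const void *src, size_t src_len,
                        void *dst, size_t *dst_len);  /* offloaded zlib-style compression */
    void noload_close(struct noload_ctx *ctx);

    /* What an application-facing binding could look like: one call, and the
     * library takes care of discovery and data movement. */
    int compress_block(const void *src, size_t src_len, void *dst, size_t *dst_len)
    {
        struct noload_ctx *ctx = noload_open();
        int rc;

        if (!ctx)
            return -1;            /* a real library would fall back to a CPU codec */
        rc = noload_compress(ctx, src, src_len, dst, dst_len);
        noload_close(ctx);
        return rc;
    }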
Now, there is no path for a file system down here
to consume that. Well, there is, but it's not a very nice path. So we'll ignore that for now.
So if a customer does want to consume us in kernel space, we do play in kernel space. And
we have kernel developers who work for us, who will work with customers on that side of things.
But in my view, the long-term vision is that we standardize something that we can push upstream into everybody's Linux,
where we extend NVMe to support computation, and then we have that kernel space consumption.
And that's something that we're going to be spending quite a bit of time looking at in the working group over the next little while.
It doesn't have to be an operating system consumption model. It can be a user space consumption model. So we have worked with SPDK
community, and some of our customers do deploy us through SPDK. SPDK, I jokingly swear at it from
time to time, but those guys know that I love them very much, and it can be a very useful consumption
model. But again, very much aligned with the NVMe standard
as a framework for consuming those different resources.
So after computational storage,
everything is beautiful and light,
and we are all happy, me, myself especially.
So one specific point example.
So I will actually talk to... I don't know what we're
doing for time. Oh my God. Perfect. So we did an example around RocksDB. Here's some figures of
merit in terms of we got six times more transactions per second. We were offloading compression. So
doing the compression on us versus on the Xeon, we got more compression. It was more efficient.
There was actually some quality of service metrics, et cetera, et cetera.
There's Excel spreadsheets that will decide whether that made sense or not.
We actually did that on a Xilinx Alveo card.
And we did it in an AMD Naples-based system with a bunch of NVMe drives. And it was using the standard
Linux operating system. And basically what we did is we used that libnoload. And we
made, I think, 20 lines of code changes to upstream RocksDB. So basically there's a compression
plug-in in RocksDB. And we just created a new one for NoLoad and then tied that in.
And basically we had it run the NoLoad compression
rather than run the Snappy or whatever compression was there before.
So the software consumption model, in this case, is a user space one.
So we have the application, which does have to be modified.
So in this model, the application has to be modified.
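The shape of that change is roughly the following sketch. It is not the actual RocksDB patch: compress_block is the hypothetical libnoload wrapper from the earlier sketch, and the fallback uses Snappy's real C API, which is what was being called before.

    #include <stddef.h>
    #include <snappy-c.h>

    /* compress_block() is the hypothetical libnoload wrapper from the earlier sketch. */
    int compress_block(const void *src, size_t src_len, void *dst, size_t *dst_len);

    /* Roughly the shape of the change: try the offload first, fall back to
     * the CPU codec that was there before. */
    int compress_for_rocksdb(const char *src, size_t src_len, char *dst, size_t *dst_len)
    {
        if (compress_block(src, src_len, dst, dst_len) == 0)
            return 0;                            /* compression ran on the NoLoad */

        return snappy_compress(src, src_len, dst, dst_len) == SNAPPY_OK ? 0 : -1;
    }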
Now, the problem with libnoload
is that libnoload is specific to Eideticom,
so the customer has to make that modification,
and it's never going to be accepted upstream
because it's not a standard.
Well, maybe it would, but it's harder to upstream that.
Now, if this was lib computational storage
and multiple vendors were designing to it
through a standard,
we could openly collaborate on this
and probably have much better likelihood of getting it upstream. And if somebody upstreamed it for RocksDB once,
that would be sufficient for all the vendors to enjoy it, right? So one person could update Hadoop
and one person updates Cassandra. So that's one way of doing it. The other way of doing it is,
you know, we could have done something in the kernel and we could
have actually used a file system that
supports compression, like Btrfs,
or something like that where we
actually hide the
compression from the application because the application
is just talking to files through a file system.
If the file system is doing the compression offload
the application doesn't have to change.
And I don't have a lot of slides on that,
but that is also a model.
And obviously what happens is RocksDB talks to our API bindings.
Our API bindings basically talk to the device
through ioctl calls on the /dev/nvme device nodes.
We can also use POSIX commands
if you want to do reads and writes.
We are actually very much currently working on io_uring,
which, for those of you who don't know, is just awesome.
I'm not going to say any more about it.
Go look it up.
And that's another really cool way
to go from user space to kernel space.
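For flavor, here is a tiny liburing sketch, again illustrative rather than anything Eideticom ships, that pushes one block of data to a hypothetical compute namespace through io_uring; the device path and offset are assumptions.

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char *buf;
        int fd = open("/dev/nvme0n2", O_RDWR);      /* hypothetical compute namespace */

        if (fd < 0 || posix_memalign((void **)&buf, 4096, 4096))
            return 1;
        memset(buf, 'A', 4096);                     /* block of data to push down */

        io_uring_queue_init(8, &ring, 0);           /* small submission/completion ring */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, buf, 4096, 0); /* ends up as an NVMe write */
        io_uring_submit(&ring);

        io_uring_wait_cqe(&ring, &cqe);             /* wait for the completion */
        printf("write completed: %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        free(buf);
        close(fd);
        return 0;
    }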
And then down here, everything.
So the storage is using the NVMe block driver,
and we are using the NVMe character driver,
and everything is kind of going
through that. But the storage path is still going through a file system, and it's going
through the block layer, and it's going through that. So this particular demonstration did
not do peer-to-peer. I could talk another talk on peer-to-peer, but I'm not even going
to get, I'm not going near it today. It's awesome, and it's going to be great.
We've done the same demo, exactly the same demo,
with exactly the same software over Fabrics.
So the demo that I showed you here,
you can take exactly the same code,
exactly the same software code,
exactly the same kernel code,
exactly the same user space code,
and rather than talking to a PCIe-attached NoLoad, you can put the NoLoad on the other end
of an NVMe over Fabrics network, and
the same software can run on the compute node without modification.
So this is beautiful, because now you can Kubernetesify all of this
and make it composable.
And it all works today.
And I don't, like, Bitfusion just got bought by VMware.
They should have bought me instead.
I'm very upset about that.
Because the way Bitfusion does it is crap, and the way that we do it is awesome. But anyway.
But I'm not a marketing guy, so what are you going to do? But this disaggregation of accelerators and the composability
of accelerators is going to be a big deal, I think, over the next little while. So if we can align
with NVMe in a way that allows that to happen, I think that would be potentially very, very
interesting in certain markets. The ability to compose your accelerators as well as your compute and your memory
is very, very interesting.
And of course, now we have things like CXL.
There's an entire another talk that I could do
and I have done a couple of months ago
on where CXL will fit into all of this.
So it's all very beautiful.
So to conclude,
I'm only saying things that I'm hearing from the people I have the pleasure of talking to when I go and visit the large hyperscalers.
That they are telling me, so this isn't me saying it in the sense that I'm making it up.
But what I'm hearing from the large consumers of accelerators is that they want vendor-agnostic, consumable interfaces, software, and management stacks.
And the conjecture is that NVMe gives them all of that.
This second line is marketing crap. So using the NoLoad framework today, we can either accelerate applications in user space,
or we can make changes in the operating system and accelerate things like file systems and other internal consumers of accelerators.
And like I said, this is just a point solution, but I think that there's things that could be learned from the way we're doing things today,
and maybe there's good things and maybe there's bad things.
And as an industry, we can work towards a place where there's also a vendor kind of interoperability around some of this or some incarnation of something like this.
And so, you know, I think the work that the TWG is doing is super important,
and I very much appreciate the amount of time and effort quite a few people have put into it.
And I hope that continues, and I hope other people continue to join us
and get involved and help to shape what I think could be
a very interesting future for computational storage.
So thank you very much.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer
community. For additional information about the storage developer conference, visit www.storagedeveloper.org.