Storage Developer Conference - #81: FPGA-Based ZLIB/GZIP Compression Engine as an NVMe Namespace
Episode Date: December 3, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode 81.
All right, we'll get started. So I'm not Saeed, I'm Steve Bates, I'm the CTO. This is work that we did at Eideticom.
We are a startup working on computational storage using NVM Express.
That is not something that is currently part of the NVMe standard,
but the wonderful thing about standards is they evolve.
And maybe NVMe and other standards will evolve to take advantage of computational storage.
And we think that's a very interesting idea, and we hope you do too.
And in order to kind of justify why things like NVMe or SCSI or Fibre Channel might change
to allow for computation on or near the storage, we kind of have to showcase some examples
of getting benefit by doing something.
So this is actually work that came out of a customer engagement.
So I'll give you a little history on what the customer was looking for
and kind of how we solved that.
I can't name this customer, but they're quite a large data center customer.
So that's pretty good.
And we're continuing to work with them as they kind of evaluate and go pre-production into production and so forth.
So kind of pretty neat to see, but we'll talk a little about that.
Before I jump into that, most of you, I'm sure, in this room have heard of NVM Express.
If you haven't, why on earth are you here?
You're in the wrong conference.
Please go away. It's a standard specification that was designed from the ground
up for accessing non-volatile media, originally for PCIe; now, obviously, with fabrics we have
multiple transports. One of the things that was always interesting: I used to work for
a pretty big SCSI shop for a while,
and we were always like, is NVMe actually really better?
And there are actually some reasons around why it is better,
particularly around multi-queue and stuff like that.
And now, obviously, it's very hard. Well, it's not hard, but you have to look around now
to find a solid-state drive that isn't PCIe based, NVMe based.
You can still get them for sure, and people are still using them, but it's getting harder
and harder. It is designed to be high speed, high throughput, and pretty CPU efficient,
and very importantly, designed for parallelism. One of the things about NVMe is it supports
many hardware queues, much more than most people put on their controllers.
And you can basically take those queues and map them to the multiple cores you have
running on your processors, because
Dennard scaling says we
can't go to 8-gigahertz cores, so
we end up having two 4-gigahertz cores
or four 2-gigahertz cores.
Recently I had
a system in my lab that had
256 cores.
That was a bit painful.
But we take advantage of all those cores with their different threads
and issue I/O from all those cores against these drives.
For us as a small startup, this next point was very important.
We had a decision to make.
We're going to do computational storage stuff. Do we go and write an Eideticom driver that's kind
of shitty because we're not great at writing kernel code? And then do we go to the customers
and say, hey, customers, I know you're using Ubuntu, but can you install this shitty out-of-tree
driver that might not work with your particular backports and might cause a
bug on your bare metal and bring down 10,000 of your servers and then wonder why we sell absolutely
none of them, right? So for us, it wasn't even a choice. We had to align with something that's
already in upstream kernels. We're going after hyperscalers. We're going after data center
customers. We can't provide shitty kernel drivers on our website, right? We have to be
working with inbox drivers. On top of that, the NVMe driver, you know, I haven't done the math,
but if you look at the hours devoted and the average salary of the person who's working on
the NVMe driver, that driver is probably tens, if not a hundred, million dollars' worth of software development salary. So, yeah, I'll use that.
It's like, why wouldn't I, right?
And I'll show you in a minute why it makes a lot of sense.
So one of the other things, I talked about this yesterday,
NVMe has standardized something called a controller memory buffer.
It's basically a PCIe BAR.
And that's a piece of memory that can be used for DMAs.
But in order to utilize that, we need the operating systems to support it.
So we also have been working on something called the peer-to-peer DMA framework.
I talked about that yesterday.
If you're interested in learning more about that and where we are in terms of upstreaming,
and the answer is hopefully very, very close now, you can
come chat to me.
And the peer-to-peer transfer is there to reduce the load on the CPU system memory and
to free up CPU time.
So NVMe can be used as a high-speed platform for sharing accelerators.
So we use NVMe today to talk to traditional NVMe SSDs
from Intel or Samsung or Seagate or whoever. But we can introduce these new devices. You know,
our product is just one example. There's other companies like NGD Systems and ScaleFlux and
some of the larger companies are starting to take a look at this. And these are also NVMe devices,
but they don't have any storage.
Some of them do.
Ours do not.
We don't store a damn thing.
I spent a long time working for a company
that makes SSD controllers.
I spent six years of my life
trying to get very fucking clever companies
to get SSDs that work based on our controllers.
And I went, there's no way I'm ever fucking doing a startup
that talks to NAND ever again.
Can I be more clear about that?
So I think it's a great business and I love SSDs and that's great. But if I'm doing a small startup,
I don't want to be an SSD. I want to be a computational engine. Why try and solve two
hard problems at the same time? So these are these new accelerators, but they just happen to present
to the OS as NVMe devices. So you take one of our
cards and you plug it in your system. The class code is non-volatile memory, and the subclass code
is NVM Express. And the driver that it binds to is the inbox NVMe driver. And then we go from there.
So now I've got an interesting system, because now I've got high performance, low latency storage.
This could be NAND. This could be something that's not NAND, which makes it AND.
Boolean joke.
It could be Optane or even something like spin-torque MRAM
or Nantero's stuff.
But the interface is NVMe.
This could even be over fabrics.
This could be a high performance networking device,
RDMA, or TCP/IP coming soon, or Fibre Channel.
And then you could have your storage
somewhere else; I don't care where it is. But now on top of my storage, I also
have this accelerator compute. And this could be doing a number of different things. I'm going to
focus on compression in this talk. But we are working either ourselves in-house
or with partners to put other interesting
PCIe acceleration functions here.
Artificial intelligence inference, right?
Because our investors say we're worth
two times more than we are if we'd say AI.
So, you know, anything,
and because we're moving to a world
where things are more heterogeneous,
AI is not best served on an instruction set architecture,
at least the ones we get from AMD and Intel.
So if I'm running a lot of AI,
we're already seeing a lot of servers where we have a lot of PCIe accelerators.
They might be Google's TPU, or they might be an NVIDIA card,
or they might be an FPGA card.
And we're struggling to come up with good frameworks for how do we manage these accelerators? How do we
tie them into applications? And one of the things people want to do also is they don't want stranded
resources. So if I have a server that has 16 accelerators in it, and I don't need all 16,
how do I let this server over here that needs 18 borrow the two?
And I talked a little about that yesterday morning with Sean,
and we show you how NVMe can actually allow for that disaggregation,
which for me is another great benefit for using NVMe.
All right.
So I'll dig into the specifics of our platform.
I actually...
Did I bring it?
Did I bring it?
Where is it?
Oh, I didn't bring it.
That's very strange.
Normally I have one with me, but I forgot it today.
Or somebody stole it.
Somebody stole it.
We call it NoLoad.
Is my mic rattling?
We call it NoLoad.
Bang, bang, bang, bang, bang.
Sorry about that.
So NoLoad stands for NVMe offload,
and basically it can present, in this case, FPGA accelerators.
I mean, we could do an ASIC if we raised enough money
and wanted to go through all that pain,
but we're not quite going to do that just yet.
So we deploy on FPGA, and there's
other good reasons to use FPGAs, because they can change their functionality over time. So that's
something that we're doing. But we present, like I said, we present that computation, that FPGA
resource as an NVMe endpoint. So we have an NVMe endpoint in there. And then we can basically use namespaces,
which is an NVMe standard construct,
to present the different accelerators in a way that the operating system can carve up
and give to different applications.
So in the same way you can take a chunk of NAND
and carve it into different namespaces
and share that from behind a controller,
we can take some amount of compute resources,
chunk it up into namespaces,
and give those to different applications that need them.
It also allows us to have
one accelerator be a compression engine,
another be erasure coding for RAID or Reed-Solomon calculations.
It could be an artificial intelligence inference
or training or something.
But once that's tied in,
we can use standard NVMe tooling to manage this device. So you can imagine if our customer is
already on a path towards NVMe and they're writing management code, the fact that our accelerator
presents as an NVMe device and can be managed with the same piece of software that they're
managing the drives,
that's a win, right?
Because they don't have to write a different management stack for their accelerators.
You know, we can do things like enclosure management.
We can do, you know, we can follow
all the rules around LEDs.
We can come in the same form factors as FPGA,
or sorry, as NVMe drives can.
So we have a U.2 form factor; the hardware was actually from Nallatech.
Alan, just wave.
Alan's been a great partner on the hardware side.
He builds the hardware.
We put the bitfile on
that turns it into NoLoad
and everybody wins, right?
But we can also deploy
as an add-in card.
We could deploy on something
that has lots of FPGAs.
We're not particularly fussy
in terms of the physical deployment.
We're more interested in the smarts that we enable on that platform.
And then we have an API that sits in user space,
and I'll show you some of that in a little bit,
that uses the NVMe driver to access the hardware.
So we still have the kernel in the path,
which is great for isolation and security and high performance. But we basically
then operate using a library and user space. And then we present an API that someone like
yourselves could use to tie into your application. So you could do Ceph acceleration, you could do
something else. And we have the APIs to do that. So on that point, you know, the software stack for us is very important. We do not change a line of
code in kernel space. Never touch kernel space if you want to be a successful small
startup, because you will die under the workload. All right, the great thing is, you know, this driver
is very well defined. One of the things I love about NVMe is they didn't do what the IBTA did.
They didn't define verbs somewhere up here in this wishy-washy software space.
They said, your PCIe device will have fucking registers at this fucking offset that do exactly this.
So, you know, and then we still get quirks.
I mean, we still have drives that don't all work quite the same way, but at least it's a lot better than RDMA, where you have to have a driver for different NICs, for those particular NICs, right? So the
same driver works for Intel drives as it does for us, as it does for Seagate, as it does for anyone
else. And that's true for VMware, and for Windows, and for Linux, and do we really care about any
other operating system after those three? Also, if you want to do things in user space using things like SPDK, we're supported there. We're just an NVMe device. Anything that
works with NVMe works with us. It's really quite that simple. And then the APIs we provide are free
and permissively licensed under the Apache license on our GitHub account.
So if you're interested, you can go there right now and take a look at our code.
And we provide some example applications that build on that API
to actually do interesting things.
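To make the "it's just an NVMe device" point concrete, here is a minimal sketch of mine, not Eideticom's actual API: an application can reach an accelerator namespace through the stock Linux NVMe driver's passthrough interface. The device node and the vendor opcode 0x99 are assumptions for illustration only.

    /* A minimal sketch (assumptions, not Eideticom's real API): drive a
     * hypothetical compute namespace through the stock Linux NVMe driver's
     * passthrough ioctl. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        int fd = open("/dev/nvme2n1", O_RDWR);      /* hypothetical accelerator namespace */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        size_t len = 128 * 1024;
        if (posix_memalign(&buf, 4096, len)) return 1;
        memset(buf, 'A', len);                      /* data we want processed */

        struct nvme_passthru_cmd cmd = {
            .opcode   = 0x99,                       /* hypothetical vendor I/O opcode */
            .nsid     = 1,
            .addr     = (unsigned long long)(uintptr_t)buf,
            .data_len = (unsigned int)len,
            .cdw10    = (unsigned int)(len / 512),  /* e.g. payload length in blocks */
        };

        /* The inbox driver queues this like any other NVMe command. */
        if (ioctl(fd, NVME_IOCTL_IO_CMD, &cmd) < 0)
            perror("NVME_IOCTL_IO_CMD");
        else
            printf("completed, result=0x%x\n", cmd.result);

        free(buf);
        close(fd);
        return 0;
    }

The point of the sketch is simply that nothing vendor-specific is needed in the kernel: the inbox driver, the standard queues, and the standard ioctl carry the command.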
We have some customers who have asked us to do certain things in the kernel,
and for them we have made kernel modifications.
So one of our customers is very big on ZFS,
and, yeah, it's public, right?
It's Los Alamos National Labs.
Great guys, great partners,
really enjoying working with them.
And they were like, well, you know,
we do RAID and compression in ZFS.
Can we offload that to your device?
We can't do that in user space.
ZFS is in, well, it has a user space, but it has a kernel space,
and that's the one we were interested in.
So we actually went and took a look.
The compression part of ZFS already has support for accelerators.
Intel's QuickAssist was added as a supported device.
It was literally two or three lines of code for us to tie in
to that particular part of the stack.
What's kind of interesting about that work is the way that we did it is we actually changed the driver
so that when we advertised as an NVMe device, we said,
oh, and if the NVMe device is an Eideticom device, don't register it as a disk.
Register it as a special pointer that we can pass to the compression and RAID parts of ZFS, and you can use this in
kernel. And if we change the NVMe standard to standardize that, then we wouldn't have to make
those changes out of tree. The NVMe standard itself could come up with a way of saying, hey,
I'm a namespace type. I'm not a storage namespace. I'm a computational namespace.
Treat me differently. And we can actually have people submit patches to the Linux kernel and the other operating systems in order to support anyone who wants to build NVMe devices that can do compute as well as or instead of storage.
And there was a birds of a feather last night within SNIA.
We're kind of moving forward on discussions on exactly that kind of topic.
I'm not going to say for sure, hand on heart, exactly what's going to happen because
I can't predict standards. I don't think anybody can. But it's going to be an interesting discussion.
For those of you who are familiar with NVMe, you've probably used NVMe CLI. It's a free tool, available online, from Keith Busch at Intel.
It's kind of one of the de facto tools for managing the drives in your system.
So there's a command called NVMe list, which you have to run as root unless you've set up permissions on your disk devices.
And you get a whole bunch of information about the namespaces in your system at the current time.
You can see here we actually have three Intel drives,
and then we have four namespaces associated with our Eideticom device. So it's pretty typical today
that you get one namespace on a drive. Some other drives have namespace management, so they might
have quite a few. But this device here actually has four. Now, NVMe CLI allows for vendor-specific plugins. So Intel has a plugin, WD have a plugin,
Seagate have a plugin. We have a plugin. Ours is eid. So when you do nvme eid list, basically,
it only lists Eideticom devices, and it looks at the vendor-specific field of the namespace
identifier to extract more information that's vendor-specific about that namespace. And this
is very standard. All the drive vendors do it. This is how you get things like firmware versions
and all kinds of other wonderful stuff. What we use it for is to identify in a human-readable way
what our different namespaces do from a computation point of view. So we have one RAM drive, which we use for test purposes,
and we have three compression cores
in this particular incarnation.
If we burnt a different bit file,
that list could be different.
We could have Reed-Solomon, we could have SHA,
we could have artificial intelligence, blah, blah, blah.
And maybe some of those names become standard.
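For reference, a vendor plugin like that is just parsing the vendor-specific region of the standard Identify Namespace data structure. Here is a hedged sketch of pulling those bytes out yourself; the controller node and namespace ID are assumptions, and what the bytes mean is of course vendor-specific and not shown.

    /* Sketch: issue Identify Namespace through the admin passthrough ioctl and
     * dump the start of its vendor-specific region (bytes 384-4095 per the NVMe
     * spec). */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        int fd = open("/dev/nvme2", O_RDONLY);      /* controller character device */
        if (fd < 0) { perror("open"); return 1; }

        uint8_t *id;
        if (posix_memalign((void **)&id, 4096, 4096)) return 1;
        memset(id, 0, 4096);

        struct nvme_passthru_cmd cmd = {
            .opcode   = 0x06,                       /* Identify */
            .nsid     = 1,                          /* namespace of interest */
            .addr     = (unsigned long long)(uintptr_t)id,
            .data_len = 4096,
            .cdw10    = 0x00,                       /* CNS = 0: Identify Namespace */
        };

        if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) { perror("identify"); return 1; }

        printf("vendor-specific bytes: ");
        for (int i = 384; i < 384 + 16; i++)        /* just the first 16 of them */
            printf("%02x ", id[i]);
        printf("...\n");

        free(id);
        close(fd);
        return 0;
    }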
Yes?
How come I see the Eideticom RAM drive on the bottom,
but I don't see it on the top?
Because there's four here, and there's four here.
In this particular parsing,
because you're not using the plug-in,
you have no idea about non-vendor-specific stuff.
You can only go to the standard defined fields,
and this is basically our namespace name.
So if I look below the left-hand column, it says n1 through n4, and n1 is then the
Eideticom RAM drive.
Yep.
So that's the same as...
Yep.
Yeah, exactly.
So this particular function call can't do anything vendor-specific, because you haven't
called the plugin.
This particular function call, because you're calling the Eideticom plugin, can look at vendor-specific fields,
and it understands how we've taken the vendor-specific field,
which is quite a big field,
and carved it up into subsections that are useful.
And then this prints it out in human-readable form.
So this particular version is a little older.
Our current version actually dumps, like,
the build time of the FPGA
and the version number of the accelerator
so we can obviously keep track of versions and stuff.
Interestingly, this works over fabrics as well.
If our FPGA was plugged into an NVMe over fabrics target
that was then exposing the namespaces
over Fibre Channel or TCP/IP or RDMA, you could
extract the same information
even if the accelerator is
no longer direct attach, but it's
connected over fabrics.
What's that?
Well, that's vendor specific.
So one of my apps guys
would know what it means. I have no fucking idea.
Basically, it means everything's working really well.
Yeah, there's information in there,
and obviously it just depends exactly.
It could be a date stamp.
Yeah, it could be a date stamp.
All right, so that's kind of the NoLoad.
That's the product that we're bringing to market
and we're going to sell for loads of money
and we'll be flying around in private jets
this time next year, apparently,
if you believe the hype.
But on top of that,
one of the things that we recognized
as we were thinking about NVMe for acceleration
was also the fact that PCIe-based servers are getting
very, very fast, with an awful lot of I/O. So you put in a couple of 100-gig or now even 200-gig Ethernet NICs,
you put in a bunch of NVMe drives, you put in a bunch of graphic cards, you could easily get 50
or more gigabytes per second of sustained data movement around PCIe. And in the Linux or in
all operating systems today, the way that that works, if you move data between two PCIe endpoints,
so I have an example here of a copy between two NVMe drives, the path today is DMA to system memory
and you might get last level cache. It might not all go to DRAM.
There's DDIO and other things that can happen.
But the path is traditionally you allocate pages out of your main memory pool, typically DRAM.
The DMA goes there, and then when it's done, you get a completion,
and you issue another IO to the other drive saying, hey, can you DMA this data into you?
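In its simplest user-space form, that legacy path is just a read into a DRAM buffer followed by a write out of the same buffer. A rough sketch, with placeholder device paths:

    /* Sketch of the legacy copy path: the source drive DMAs into a buffer in
     * system DRAM, then the destination drive DMAs back out of that same buffer. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)                         /* 1 MiB per I/O */

    int main(void)
    {
        int src = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        int dst = open("/dev/nvme1n1", O_WRONLY | O_DIRECT);
        if (src < 0 || dst < 0) { perror("open"); return 1; }

        void *buf;                                  /* lands in DRAM (or the LLC via DDIO) */
        if (posix_memalign(&buf, 4096, CHUNK)) return 1;

        for (;;) {
            ssize_t n = read(src, buf, CHUNK);      /* drive 0 DMAs into system memory */
            if (n <= 0) break;
            if (write(dst, buf, n) != n) {          /* drive 1 DMAs back out of it */
                perror("write");
                break;
            }
        }

        free(buf);
        close(src);
        close(dst);
        return 0;
    }

Every byte of the payload crosses the memory channels twice, which is exactly the contention being described.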
Now imagine I've got 200 drives and RDMA NICs and graphics cards, and they're all doing
different things.
All of that data is going either over this interface or it's hitting the last level cache.
That is a huge, huge problem.
That either means one of two things.
Either you need like six DDR channels just to get the bandwidth,
in which case you're paying for a processor
whose instruction cycles you may not even need or want,
but you just need memory channels, right?
Or you're in a hyper-converged environment
and you do need those Xeon instruction cycles
because they're running VMs and applications
and containers and Kubernetes.
And in which case, they are fighting for the same DRAM that the DMA traffic is using.
So there's a quality of service.
There's a contention issue.
So either one of those is bad.
So what peer-to-peer DMAs do is they allow you optionally to say, hey, rather than allocate my DMA buffers from here,
can I allocate them from memory
that's on the PCIe bus already?
That's what we're enabling
with the peer-to-peer DMA framework.
And the most classic,
the most relevant example of a device
with memory already on it,
on a PCIe endpoint,
that's done in a standards-based way,
is the NVMe controller memory buffer.
So other devices have memory. Graphics cards obviously have a ton of memory on them,
but they don't necessarily expose it, and they certainly don't do it in a standards-based way.
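On the kernel side, the framework being upstreamed lets a driver donate BAR memory roughly like this. This is a hedged fragment, not a complete module; the calls are from the pci-p2pdma API discussed here, and it mirrors what the NVMe driver does for a CMB, with error handling and the surrounding driver omitted.

    /* Kernel-side fragment (not a complete module): donating part of a BAR to
     * the pci-p2pdma framework, as the NVMe driver does for a CMB. */
    #include <linux/pci.h>
    #include <linux/pci-p2pdma.h>

    static int example_register_cmb(struct pci_dev *pdev, int bar, size_t size)
    {
        int rc;

        /* Hand the BAR (or part of it) to the p2pdma allocator. */
        rc = pci_p2pdma_add_resource(pdev, bar, size, 0);
        if (rc)
            return rc;

        /* Advertise the memory so other devices' DMAs may use it. */
        pci_p2pdma_publish(pdev, true);

        /* Later, whoever sets up a transfer carves buffers out of it. */
        {
            void *buf = pci_alloc_p2pmem(pdev, 1UL << 20);

            if (buf)
                pci_free_p2pmem(pdev, buf, 1UL << 20);
        }

        return 0;
    }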
So now what we can do is the DMA can basically be this DMA here, and then this becomes an
internal ingestion,
an internal DMA.
It doesn't even go back out on the PCIe device.
We have to make sure,
there's all kinds of issues we have to worry about we can get into.
Certain PCIe topologies have problems with this.
Access control services is very, very scary.
Don't do this in a hypervisor environment.
Don't do this from a VM, please, right now.
We'll get there, but right now this is kind of a bare metal
only thing
So that legacy
path there, could there be
legit reasons that you want to use that?
You want to change the data?
Exactly.
So is anybody
thinking about being able to
break up the peer-to-peer transfer?
Say, look, at this offset, these number of bytes,
do you want it to go up to the DRAM complex?
Yeah, well, so there's a couple of ways you can do it right now.
I mean, right now, if you're a user space process,
the upstream patches don't actually expose peer-to-peer in any way to user space yet,
but we have a hack that does that; it's a driver called p2pmem.
And what you can do in your application is you can allocate buffers in both normal memory and peer-to-peer memory,
and you can make decisions as the application writer on which buffers to use at which given point in time,
given who's talking to who.
So, you know,
SPDK is a good example of that. SPDK, you can
either call from the CMB
to allocate memory, or you can call from something
else. You can make decisions based on
your understanding of the PCIe topology
and what you're doing.
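The user-space hack looks roughly like the sketch below: mmap a chunk of the CMB through a p2pmem character device and use it, instead of DRAM, as the buffer for an O_DIRECT copy. The /dev/p2pmem0 node name is an assumption, and, as discussed, this is bare metal only.

    /* Sketch of the user-space p2pmem hack: the copy buffer is backed by PCIe
     * BAR memory rather than system DRAM. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)

    int main(void)
    {
        int src = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        int dst = open("/dev/nvme1n1", O_WRONLY | O_DIRECT);
        int p2p = open("/dev/p2pmem0", O_RDWR);     /* hypothetical p2pmem device */
        if (src < 0 || dst < 0 || p2p < 0) { perror("open"); return 1; }

        /* This mapping points at CMB memory on the PCIe bus. */
        void *buf = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE, MAP_SHARED, p2p, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        for (;;) {
            ssize_t n = read(src, buf, CHUNK);      /* drive 0 writes straight into the CMB */
            if (n <= 0) break;
            if (write(dst, buf, n) != n)            /* drive 1 reads straight out of it */
                break;
        }

        munmap(buf, CHUNK);
        close(src); close(dst); close(p2p);
        return 0;
    }

The application gets to decide, buffer by buffer, whether to go through DRAM or through the CMB, which is the point being made here.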
In the kernel,
we will have to have discussions
around that. We will have the discussions around
user space APIs that make sense.
But is that an example of what you want to push upstream?
Yeah, I mean, all the things that we are pushing upstream right now are typically,
if peer-to-peer doesn't work, fall back to the legacy path.
Like, we're not saying break the DMA, right?
People may want to break the DMA if it doesn't work, but that's a decision for down the road.
So you're kind of making use of the stock driver
or making the most of it,
but it's not optimal in any sense?
Not yet. No, not at all.
No, and I think there's a lot to be learned.
I think once we get upstream,
that's basically the green light
for a lot of people to start playing with it,
and we're going to get a lot of input
and discussion around how best do we use this, how best do we use it in the kernel? How best do we use it
from user space? What are some of the things we might need to look at to make those
judgment calls?
A quick question, then a longer question. You have CMB and BAR; what's a BAR?
So, a BAR is a base address register that's a PCIe construct,
a PCI construct, and basically it's memory-mapped I/O memory.
So the other question I have is,
in your right-hand drawing,
you kind of have this magic line that goes through the CPU.
It doesn't say where...
Oh, it does show a CMB.
Is there a CMB in the NVMe side?
No, no.
So you only need...
Yeah, you actually only need one device
to have a CMB to start playing.
And it could even be that the devices you're copying between, neither of them have a CMB, but another device does.
In which case, you would have a green line to here, and then you would have another green line into the other device.
So you can do peer-to-peer DMAs between two PCIe endpoints,
but actually use a third PCIe endpoint
as the donator of the peer-to-peer memory. So are CMBs in some sense faster than BARs?
No, no, no. A CMB is just an NVMe definition on top of a BAR. Physically, they are the same.
The CMB is just a way, an NVMe standard way of allowing the operating system,
or informing the operating system, that part of this BAR has certain properties that may be
desirable for the operating system to use. But from a functional point of view, it's just a BAR
or a part of a BAR. Yes?
So for the peer-to-peer to work,
you need to have the actual physical address.
Correct.
And that's part of what a peer-to-peer framework does.
So it has to go to the CPU's VT-d table.
Only if the IOMMU is on.
And even then, it doesn't necessarily,
because you may have translated it before you passed it down.
Yeah.
But typically, we assume the IOMMU is off.
That's why I'm saying don't do this in virtualized environments today
because we haven't really got there yet.
I think we will, but it's going to take a bit of work.
And then you also obviously want to disable access control services
on the path between these endpoints so that the DMA can go.
The SSD in this picture has the address for the CMB.
Correct.
That's what you give it.
It has to work with the IOMMU to get the real address.
It has to work either that or with the hypervisor.
It doesn't actually physically go to the IOMMU per se,
or it doesn't have to.
There's a couple of different ways of doing it.
But right now, like I said,
we are advising people who want to do peer-to-peer
to do it on bare metal
and preferably just disable the IOMMU.
That's what we're advising as people ramp up.
We are starting to have discussions
at the highest level around,
okay, now that it looks like
we're getting the bare metal stuff upstream,
how do we start doing things
in more virtualized environments
where an IOMMU must be on
and we have access control services
and, for example, we might pass through
two virtual functions, right?
So this is a device.
It could have SR-IOV.
It could have some virtual functions.
And we pass a VF of this to a VM. And we pass a VF of this to a VM, the same VM, right? How do
they do peer-to-peer? Can they? Should they? Should we just say never, never do that? And then even
more interestingly, or maybe not, you know, we pass a VF of this to one VM. We pass a VF of this
to a different VM. And we want to do a peer-to-peer between the two. That is starting to terrify me, and
I'm not even a hypervisor person. So, yeah, it's a good question. We don't have all the
answers yet. The great thing about open source software is it evolves. Yes?
How do you work with the multi-sockets?
Yeah.
So it's a good question, yeah.
So what happens in a multi-socket environment?
You know, the reality is that right now,
the code that's upstream will recognize that,
and it won't do the DMA.
It will fall back to using system memory,
because it's a really stupid idea.
We do have some configfs stuff that you can use to override that. If you know your
system is good and you want to do that, then technically it works. The performance may vary
and you're certainly going to load up your socket to socket bus with traffic. So you can do it and
it probably will work, but it might not.
So peer-to-peer DMAs, I mean, the kind of customers we're working with on this right now
are very much people who understand the environment in which they're doing this
and have a good sense of the PCIe topology
and are making well-informed decisions around static systems
that are not going to be changing all the time.
Over time, there may be other people
who want to use it, but we're not there yet.
And if somebody's so unaware of their system
that they try to do peer-to-peer across the socket,
they probably deserve everything they get
because they weren't making good decisions.
This is not something that's going to be on by default.
You will have to go and do a bit of work as a systems architect
to actually turn this on, even in the operating system.
So we took those two different pieces together
and combined them for a customer.
So the customer was basically deploying on, they were actually on an Intel
based system, but we ended up doing a demo with AMD and I'll talk about why that is in a minute.
But they were asking us to do compression on data that they had on some NVMe drives.
And they wanted to get the compressed data onto another set of NVMe drives. And then they were
actually buffering it out the back to a capacity storage tier. But they actually wanted this path to be done in a
peer-to-peer fashion so that they weren't impacting the load store of the applications that were
running on the processor. So what we did is basically we took our libnoload, we used
the inbox NVMe driver, and we wrote an application that basically did peer-to-peer data movement
with compression as part of that movement,
using the compression engines in the NoLoad to do that.
So the way that that works out is basically you end up doing an NVMe read command
to the input drive, and this works through VFS.
So you can have a file system on here.
We had EXT4,
and basically you can start issuing NVMe IO
against this drive,
and as the gentleman mentioned earlier,
the pointers, the SGLs or the PRPs
in those NVMe commands
point somewhere in our controller memory buffer.
So this guy starts issuing memwrite TLPs,
and hopefully they get routed by PCIe.
Certainly in the AMD system,
which has good peer-to-peer characteristics
between its root ports,
we were seeing very good performance here.
So the memwrite TLPs would come in,
and we'd put them in our CMB.
And then when this guy had issued his last memwrite
for the DMA, he raises an MSI-X interrupt.
The NVMe driver gets called back,
it handles the completion. At that point, the application knows that that I/O has completed
successfully and data is on our CMB. What it then actually did is it actually sent us a command to
actually process that data via the compression namespaces and put the result back in the CMB at a different location.
So that's basically another NVMe command basically saying,
hey, take the data, put it there,
and then put it back in a different place in the CMB.
And obviously, there's multiple threads
doing this simultaneously to different parts of the CMB.
So we've allocated the CMB in different chunk sizes
to different threads,
so nobody can sit on top of each other, and nobody can DMA over the top of others. And we use the
NVMe completions to know when everything's done and run that through pthreads in a thread-safe
way. And then once the data is compressed, we basically issue a completion and raise an interrupt and tell the application, hey, we've done the compression, and by the way, the data is where you asked us to put it, somewhere else in our CMB.
And then what it can do is the drive can issue NVMe write commands to the output drive, pointing again to our CMB, to the compressed data, and the compressed data gets written to the EXT4 file system on the output.
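Putting that whole flow together, here is a heavily hedged sketch of the data path: read a chunk into CMB-backed memory, ask the compression namespace to compress it within the CMB, then write the result out, with the payload never touching system DRAM. The device names, the vendor opcode 0x9A, the CMB layout, and the way the compressed length comes back are all illustrative assumptions, not Eideticom's real interface.

    /* Sketch of the read -> compress -> write path through the CMB. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    #define CHUNK (1 << 20)

    int main(void)
    {
        int in  = open("/mnt/in/data.raw", O_RDONLY | O_DIRECT); /* ext4 on the input drive */
        int out = open("/mnt/out/data.gz", O_WRONLY | O_CREAT, 0644);
        int nl  = open("/dev/nvme3n2", O_RDWR);     /* hypothetical compression namespace */
        int p2p = open("/dev/p2pmem0", O_RDWR);     /* hypothetical CMB window */
        if (in < 0 || out < 0 || nl < 0 || p2p < 0) { perror("open"); return 1; }

        /* Two slots inside the CMB: one for input data, one for compressed output. */
        uint8_t *cmb = mmap(NULL, 2 * CHUNK, PROT_READ | PROT_WRITE, MAP_SHARED, p2p, 0);
        if (cmb == MAP_FAILED) { perror("mmap"); return 1; }

        ssize_t n = read(in, cmb, CHUNK);           /* 1: input drive DMAs into the CMB */
        if (n <= 0) return 1;

        struct nvme_passthru_cmd comp = {           /* 2: compress CMB -> CMB */
            .opcode   = 0x9A,                       /* hypothetical vendor opcode */
            .nsid     = 2,
            .addr     = (unsigned long long)(uintptr_t)cmb,
            .data_len = (unsigned int)n,
            .cdw10    = CHUNK,                      /* e.g. offset of the output slot */
        };
        if (ioctl(nl, NVME_IOCTL_IO_CMD, &comp) < 0) { perror("compress"); return 1; }

        size_t clen = comp.result;                  /* pretend the engine returns the size */
        if (write(out, cmb + CHUNK, clen) < 0)      /* 3: compressed data out to the drive */
            perror("write");

        munmap(cmb, 2 * CHUNK);
        close(in); close(out); close(nl); close(p2p);
        return 0;
    }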
So given the FPGA that we had here,
we can only get three compression cores on that,
and we can do about one and a bit gigabyte per second
of input data compression per core.
So if we had a bigger FPGA, we could go more.
Also, they wanted U.2
because they were deploying in this kind of server, so
U.2 was a good form factor, in which case
you have a PCIe Gen 3
by 4 limitation until
we get Gen 4 processors,
which are starting to appear,
but AMD
are on a path to Gen 4.
I think Intel are on a path to Gen 5 at this point,
but I don't know. I haven't seen their
roadmaps.
Question?
So you said you fit three instances, three GZIP instances, into that particular FPGA.
Can you say whether it was LUT limited or block RAM limited?
That's a good question.
I think it was actually reasonably balanced,
and I think it was really more just closing place and route,
given that we were at like 75% resource utilization.
And was each instance encode and decode?
That's a good question.
So in this particular one I showed you, they were all encode,
and the customer's doing the decompression in software,
just because it's much less expensive.
But we have other customers who want us to do compression,
but they're more concerned about data corruption,
so we actually put decompression in,
so we basically do a SHA on the input data,
and then we compress the data,
and then we decompress that data,
generate a SHA, and make sure the SHA match.
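A software model of that verify scheme, just to show the idea (the product does it inline in RTL); this sketch uses zlib and OpenSSL purely for illustration and is not the hardware path.

    /* Hash the input, compress, decompress, hash again, and only trust the
     * compressed block if the hashes match. Build with: cc verify.c -lz -lcrypto */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>
    #include <openssl/sha.h>

    static int compress_and_verify(const unsigned char *in, uLong in_len,
                                   unsigned char *out, uLongf *out_len)
    {
        unsigned char sha_before[SHA256_DIGEST_LENGTH];
        unsigned char sha_after[SHA256_DIGEST_LENGTH];
        static unsigned char scratch[1 << 20];      /* decompression scratch buffer */
        uLongf scratch_len = sizeof(scratch);

        SHA256(in, in_len, sha_before);             /* hash the original data */

        if (compress2(out, out_len, in, in_len, Z_DEFAULT_COMPRESSION) != Z_OK)
            return -1;

        if (uncompress(scratch, &scratch_len, out, *out_len) != Z_OK)
            return -1;

        SHA256(scratch, scratch_len, sha_after);    /* hash the round-tripped data */

        /* Only trust the compressed block if lengths and hashes agree. */
        if (scratch_len != in_len ||
            memcmp(sha_before, sha_after, sizeof(sha_before)) != 0)
            return -1;

        return 0;
    }

    int main(void)
    {
        const unsigned char msg[] = "hello hello hello hello";
        unsigned char out[1024];
        uLongf out_len = sizeof(out);

        if (compress_and_verify(msg, sizeof(msg), out, &out_len) == 0)
            printf("verified: %zu -> %lu bytes\n", sizeof(msg), (unsigned long)out_len);
        return 0;
    }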
There's no dedupe in here, right?
That particular one does not,
but we do have dedupe engines, yeah.
Question?
Yes.
When you basically have commands with pointers
before and after compression in the CMB, are there two different PCI BARs? No, no, just one BAR.
So the peer-to-peer DMA framework
is responsible for allocating memory out of that bar.
And it's a very safe, well-tested kernel allocator.
So it's not going to give two pages to two different...
Or sorry, the same page to two different processes,
so you're not going to end up with two devices
trying to DMA over the top of each other.
And then the driver,
you know,
when you're done with the memory,
when you free it,
it goes back to the allocator
and the idea is, obviously, the allocator can fail.
If you've allocated all the memory
and somebody comes and says,
can I have some more pages, please?
They're going to get a -ENOMEM.
And they'll have to handle that.
The application has to handle that error somehow.
Yes?
So your comment about decompression being faster in software,
is it because your FPGA was too small to fit more
into the FPGA?
No, so I mean, deflate is a pretty simple algorithm, right?
So it's just not very computationally intensive
on any processor.
So the compression algorithm,
because of the static Huffman table,
or dynamic Huffman in this case,
and some of the other things,
zlib in particular is quite onerous
on something like an x86-64 instruction set.
So thank you.
So it made sense to target that.
And this particular customer did want zlib compatibility.
There are other compression algorithms that get reasonably good compression without being as taxing on the processor.
So LZ4, for example.
But they wanted zlib.
And there's obviously, everybody has their own favorite version of compression
at any given time of any given day.
In our particular product, we ship, we actually,
the U.2 has 8 gigabytes of DRAM on it.
We typically expose 512 megabytes as a CMB,
but only for the reason that it's a nice round power of two.
We can make it bigger or smaller, and, yeah, we do.
512 is a lot.
Can I make sure I heard that correctly?
You allocate 512 megabytes for a CMB?
For the CMB.
Yeah.
I don't think in this particular application
it even used a quarter of that amount.
Like, I think, you know, 16 megabytes would have been enough.
Think of it as a delay-bandwidth product thing.
Basically, your buffer needs to be
your throughput multiplied by the time
to write to the output.
So it's just a delay-bandwidth product.
Yeah.
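(As a rough illustration, with numbers assumed rather than taken from the talk: at about 3 gigabytes per second of compressed output and, say, 5 milliseconds to complete a write to the output drive, the delay-bandwidth product is roughly 3 GB/s x 0.005 s, which is about 15 megabytes per pipeline. That is why 16 megabytes would have done and 512 megabytes is very comfortable.)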
I guess we're looking at the example of one direct one that we could do this for that
other microphone.
Yeah.
So that's a good question.
Yeah.
So what about scaling?
So there's a couple of things that happen.
So one of them is, I think if anyone's going to XDF next week, we basically are doing this
demo scaled up there.
So basically more NVMe drives,
more NoLoads,
and more output drives.
And we're basically showing that
if you do that correctly,
you get linear scaling and performance.
Because you're using peer-to-peer,
there's no bottleneck anymore.
In theory, because everything's doing DMA,
I can partition my system in the
PCIe domain in a way that
I can just basically stamp out
this block as many times as I like.
This block being input drives,
no load, and output drives, and
my performance will just continue to scale
ad infinitum.
Are all those devices on the same PCIe
bus? Because at some point, if they were,
that would become the problem.
Yeah, the SSDs and the NoLoads
themselves.
So,
in this particular system,
and it'd be good to draw a diagram of this,
but basically, every single device
on its own
is connected to a single root port
on the AMD EPYC.
And the AMD EPYC was designed to have
great peer-to-peer performance between the root ports,
especially when those root ports reside on the same socket.
But even across sockets, they claim good peer-to-peer
performance.
As long as my atomic unit, as long as my atomic building
block connects to root ports that are always on the same die,
there is no contention between this atomic unit
talking to these root ports
and this atomic unit talking to those root ports, right?
Because they don't even need to talk to each other.
So that, in theory, could scale ad infinitum, right?
How do you make sure that the inbox driver
uses the same address?
That's what the peer-to-peer DMA framework
that we're upstreaming into the Linux kernel does.
So go take a look at the patches,
and that will definitively tell you exactly how that is done.
Basically, if you go take a look at our patches
to the NVMe pci.c file,
you'll see, well, it's not even in there,
it's in the block layer, we basically have the ability
to submit a scatter gather list,
which is backed in part by peer to peer pages.
The NVMe driver has no idea.
It has no idea it's doing a peer to peer DMA.
Because all you're doing is you're asking a DMA master
to start issuing some Memread or Memwrite TLPs.
That DMA master has no need nor should it know whether it's DMA in system memory or somewhere else in the system.
That's not its job to know that.
In fact, we don't trust these things one bit. That's why we have ACS in the first place.
Because we don't trust the endpoints to do shit.
We tell them what to do.
So the NVMe driver doesn't really have to be peer-to-peer aware, per se.
It just needs to pass through the PRPs
that are allocated from the PRP allocator,
which is done at the block layer.
Yes?
So when you've got the data in the CMB
and you want to run the compression command on it,
is there a driver down there that does that?
Do you send a command, or is that in hardware? Yeah, it's actually RTL.
And we're not going to even get to those slides.
But in the deck that will go online,
we actually have some numbers in terms of the RTL
that we wrote to do the compression engine.
So you're right.
It could have been a software compression engine.
To use which device? Oh, yes, yes, yes. So actually, the full version, which will go online,
compares what we're doing to the QuickAssist in a bit more detail. The problem with the QuickAssist
is, A, this is an AMD EPYC, so it doesn't have a QuickAssist. And B, normally the QuickAssist is on the PCH device,
which is DMI connected, and isn't necessarily a peer in the PCIe subsystem. So if you want to do
peer-to-peer with a quick assist, it's not going to work because the peer-to-peer is not peer-to-peer.
It's because the quick assist is hanging off the PCH, and normally your high-performance PCIe
is hanging off the processor core sockets themselves.
Now, you can also go buy a Quick Assist as a PCIe card,
and that could certainly have replaced us,
but that just means Intel make money and I don't.
So I'm not a big fan of that particular model.
Well, there you go.
So we'll do that.
We'll just stop my company and we'll sell more Quick Assist
because I'll do anything to make Intel happy.
Yeah, exactly.
And we do have some performance numbers
that compare to the Intel Quick Assist,
but we're not going to have anything like enough time
to do that.
So, yeah, we're not going to get into that.
Let's not bother.
So if you're really interested in the minutiae of the RTL we wrote to do the compression,
then feel free to come to me.
I can't even answer the question because Saeed's the best person to talk about the specifics of our compression core. We did do some comparisons to QuickAssist
and some of the other hardware-based compression research
that's been done,
and we feel like we have a pretty good compression core
in terms of compression ratio
and also throughput on standards-based corpuses
like Calgary Corpus and some of the other ones.
But for me, the story is not just
do we have the best compression in the world. For me,
it's much more do we have a framework that's pretty awesome. Compression is one example of
what I put in here. But imagine it's pattern matching or image processing or security or
data recognition or deduplication or whatever. Part of me doesn't even care what service is
behind here. And part of our business model is actually now we're starting to work with some partners
who want to use us as the platform but push their own accelerators, not developed by us,
but push their own accelerators that are maybe based on SDAccel, and map them in as NoLoad namespaces.
In which case we become essentially a platform on which other people can innovate
right
Yeah, so our abstraction is very similar
to SDAccel. So if
somebody has an accelerator that's already
SDAccel, so anyone who's doing Amazon F1
work, for example, that's pretty much
SDAccel,
they can take that accelerator and they
can pretty seamlessly, without
us really having to look at any of their proprietary code,
we can turn that through a wrapper process into something that,
assuming it fits in the FPGA,
would be deployable behind this NVMe front end.
So that's kind of where we're going.
So I'm sorry I didn't get it to all the compression stuff,
but all this other stuff.
Yes?
What is the ratio of scalability when using one?
If you have multiple drives on a server,
how many drives can you use to offload the CPU?
Yeah, so the question there was around scalability.
In terms of how many drives can...
Obviously, we have some pretty hard limits here, right?
So we have a PCIe Gen 3... Well, we have a PCIe Gen 4 by 4 interface here,
but the servers in which we deploy today are Gen 3,
so we don't get the Gen 4 benefit, but there is a throughput thing there.
You know, the actual throughput that we can hit is always a function of the size of the FPGA.
So some of our customers, while they like this
form factor, don't necessarily like this FPGA. And they can work with people like Alan to see
if they can fit a bigger FPGA here or a smaller one or redesign their own U.2 form factor. So
that's always going to be a function. There is a DDR bandwidth issue inside here. You don't get
that for free. But there's certain tricks you can play around caching and on-chip memory that can come to bear there. But the other thing is, once you've
got something that works, and let's say this is working at three gigabytes per second, if I can
just replicate it, if I can just tile it over here, I get six gigabytes. If I can tile it again, I get
nine. If I tile it again, there's no interdependencies between these units.
The only final interdependency, thank you,
would be if my processors couldn't issue any more I/O.
But that's an awful lot of I/Os.
That would be really the main limitation that would come into play.
All right.
Thanks for your time.
We're getting kicked out.
Birds of a feather tonight on NVMe.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers
in the storage developer community.
For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.