Storage Developer Conference - #37: NVMe Over Fabrics Support in Linux
Episode Date: March 21, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast.
You are listening to SDC Podcast Episode 37.
Today we hear from Christoph Hellwig, consultant, and Sagi Grimberg, co-founder and principal architect at Lightbits Labs, as they present NVMe over Fabrics support in Linux from the 2016 Storage Developer Conference. So, us here is me, Christoph Hellwig, and Sagi. We're both long-term Linux kernel developers. I've been doing a lot of filesystem work, including NFS serving and block storage, and I only got into RDMA for this project.
I've been working on RDMA for the last, I don't know, five years or so. I worked for Mellanox and was basically involved in the whole range of storage RDMA use cases and applications.
Yes, so to talk about NVMe over Fabrics, we'll first have to assume you at least know what NVMe is and what NVMe over Fabrics is. I hope you attended the last talk or followed all the hype, because we're not going to get into that again here. So, NVMe over Fabrics and Linux.
Basically, there's been some sort of Linux involvement since the beginning.
So the first demo that kind of had the NVMe over Fabrics name, and I'm not sure if other people did similar technology, was at, I think, IDF in 2014 by Intel and Chelsio. They said they were using Linux, but none of us was really involved with that, so we can't comment.
Then during early spec development,
we had at least two Linux-based prototypes.
So one was done by Intel with a few other people chiming in,
and none of us was directly involved,
but Sagi might know a little bit about it. Not sure if he can comment on that or if it even matters.
But I've been driving the implementation we did at HGST, so I can talk a lot about it, and we will a little bit later on.
And well, a lot of Linux developers,
us too, Steve Bates, who's in the room here,
Ming Lin from Samsung, have
been involved with both the various implementations and the actual spec development. So I've been
pretty active on the spec, too.
And Jay.
Yeah, and Jay. And, well, he's not a long-term Linux developer, but he's been really
helpful with our project, one of the Intel guys. And Sagi and I are both on the technical proposal
and spend a lot of time working on the spec and so on.
So what was our little HGST prototype? My manager there came to me: oh, this NVMe over Fabrics hype, what is it actually about? Well, I don't know. The latency numbers they claim for it I already achieved in my previous project for SanDisk over SCSI, so I'm not sure what the hype is about. And, well, so we looked at it,
and we knew we could get about 10 microsecond added latency over SCSI with SRP before, because
that's the last big project I did for SanDisk, as I said, which happens to be merged with
Western Digital just like HGST now, so we're all one big family again.
So what I did was: oh well, I know there's this RDMA protocol, we can already get these really good numbers out of it, actually better than the previous demo, so let's just tunnel NVMe commands over it. Turns out that obviously worked pretty well, but it also obviously wasn't quite what the existing spec did, which I only got access to later. The spec at that point was actually pretty horrible; it tried to emulate PCIe doorbell writes over NVMe over Fabrics and created giant packets over the wire. The performance was actually worse than what we saw with the existing SRP protocol. So as part of that, I got involved with the spec, tried to move the spec over to what we did, and moved our implementation towards what the spec did,
and we were basically converging from both sides
to get to something everyone was happy with.
And in 2015, so last year,
the different people with the different prototypes finally tried
to get together and there was a lot of weird legalese involved with that, sharing code
between possible competitors, industry associations and so on.
So we need a lawyer here to explain the details.
But in the end, we managed to get a development group started
that was organized as a working group
of the NVMeXpress organization,
where we could do our Linux development.
And we tried to make it follow Linux-style development as much as possible, so that we wouldn't end up with a big code drop later that doesn't really look like what anyone wants. So we set up a Git repository, set up mailing lists, appointed maintainers, one of them is Sagi here, and basically worked as if we were running a normal Linux development mailing list, but had to do it in private for now because the spec was under NDA.
But even before we could release our actual new code and tell people what's in this NDA
part of the spec, we tried to get ahead and actually prepare Linux for landing this codebase
so that we wouldn't have to start all the discussions that code refactoring would involve.
And one thing that was very obvious, and that we could go public with before even showing the spec to anyone, was that the existing NVMe over PCI Express driver in Linux should clearly evolve into more of a subsystem, similar to what we have for SCSI or ATA, for example, so that we have a split between the actual command set that we're talking (and NVMe, in theory, supports multiple command sets, even if only one is defined in the spec right now), a core part, and then the actual transport drivers.
So the first thing I did for that
is to rewrite how the NVMe driver submits
internal commands.
Internal commands are those that don't come from a file system or a raw block device and just do read and write operations, but do internal things like scanning for namespaces, initializing the hardware, all kinds of cases where you submit commands internally. And what we've been doing for SCSI forever is to reuse the structure we use for submitting I/O requests to block drivers, which is called struct request, and be able to attach a fully built-up command to it.
So in SCSI, you have your SCSI pass-through command, which
basically means you have your CDB attached to it,
you have your payload attached to it,
and then the low level driver doesn't
try to do any translation to the SCSI command because it already has it.
And that scheme was something we could apply almost one to one to NVMe as well.
It's a little more difficult with NVMe because NVMe has more data in what's called the submission
queue entry or just NVMe CMD in the Linux structure.
So it has the actual data pointers in it, which SCSI doesn't do.
It has the namespace number in it, which SCSI got rid of 20 years ago for a good reason.
So there's a few things we might have to clear out, but the scheme is pretty much the same.
So that was pretty easy to do ahead of time.
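As a rough illustration of the pass-through scheme described here, the sketch below builds an Identify Controller command and hands it to the core, which wraps it in a struct request on the admin queue. The helper nvme_submit_sync_cmd exists in the Linux NVMe core, but exact field layouts and signatures have shifted between kernel versions, so treat this as a sketch rather than the exact upstream code.

```c
/*
 * Sketch of the internal-command pass-through scheme: build a full NVMe
 * command (here: Identify Controller) and let the core attach it to a
 * struct request on the admin queue.  Helper names follow the Linux NVMe
 * core of that era; exact signatures vary between kernel versions.
 */
#include <linux/slab.h>
#include "nvme.h"		/* the NVMe host driver's internal header */

static int identify_ctrl_example(struct nvme_ctrl *ctrl,
				 struct nvme_id_ctrl **id)
{
	struct nvme_command c = { };

	c.identify.opcode = nvme_admin_identify;
	c.identify.cns = cpu_to_le32(1);	/* CNS 1 = identify the controller */

	*id = kmalloc(sizeof(**id), GFP_KERNEL);
	if (!*id)
		return -ENOMEM;

	/*
	 * The core wraps this in a struct request on the admin queue and
	 * executes it synchronously; no transport-specific code involved.
	 */
	return nvme_submit_sync_cmd(ctrl->admin_q, &c, *id, sizeof(**id));
}
```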
And then we could start splitting the structures, so that something that is a super generic concept and doesn't have any knowledge of PCIe, or just MMIO, memory-mapped I/O, in general, can move to a common layer. Everything that looks very PCI specific stays in the PCIe driver. And for the few places where we still had to call out into the low-level driver and couldn't handle it through this command pass-through, so non-command-based abstractions, we could add a little operations vector.
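To make that little operations vector concrete, here is a simplified sketch modeled loosely on the Linux struct nvme_ctrl_ops; the real structure has more members and has changed over time, so the exact fields here are illustrative.

```c
/*
 * Simplified sketch of the per-transport operations vector: the handful of
 * non-command interactions the core needs from a controller.  Loosely
 * modeled on the Linux struct nvme_ctrl_ops; member names are illustrative
 * and the real structure has more entries.
 */
#include <linux/types.h>

struct nvme_ctrl;			/* opaque here */

struct nvme_ctrl_ops_sketch {
	const char *name;		/* "pcie", "rdma", "loop", ... */
	int  (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
	int  (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
	int  (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
	void (*free_ctrl)(struct nvme_ctrl *ctrl);
	void (*submit_async_event)(struct nvme_ctrl *ctrl);
	void (*delete_ctrl)(struct nvme_ctrl *ctrl);
};
```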
And the other big thing we could do was actually change the file structure to prepare for this. Before, the NVMe driver was two source files just sitting in the generic drivers/block directory. We knew that we would get a lot more different source files for that split, so we started creating a hierarchy where we could just deal with that. And once we got the actual spec out, we could fully complete the split and actually split it into two loadable modules, not just at the source level. And now we have a pretty nice little structure.
So we've got our completely transport-independent NVMe core module. This includes the actual NVM I/O command set support, a lot of the block layer interaction, and so on and so on.
And then we have an NVMe fabrics module.
This includes everything that is the common part
of the NVMe fabrics specification.
So if you look at the specification right now,
it has an RDMA transport and a fiber channel transport.
We've actually added another non-standard transport
in Linux that we're gonna talk about later.
And people are already thinking of more transports.
And this is the bit that's shared between them all.
And then in terms of the transport drivers, we have the refactored existing PCI Express
driver.
And we've actually got a RDMA front end driver.
We don't have the fiber channel driver yet.
And we're only going to talk about it a little bit, partially because the spec isn't done yet. It's not done by the NVM Express organization, it's done by T11, and T11 needs a lot of time for its back-room politics, so it's going to take a while.
But we already have a prototype code that's been posted
to the Linux mailing lists and so on.
So it's not gonna take too long until we'll be there.
And the architecture that I just drafted in words, we've also got a nice little diagram of it. So we tried to reuse the existing code as much as possible. Turns out it wasn't actually all that much, but still significant.
And one important thing that you'll see in this diagram is that if we're in the fast path, so the actual block device I/O code that comes from the file system or straight from the application, we don't really dive into common code. Well, actually, we do a little bit, but there is no layering that puts the common code in between the actual transport driver and the fast-path I/O. We'll actually call into little library functions to format the actual command structure, but that's about it. So there are no additional indirect function calls or heavy-handed abstractions. Instead, we have the transport driver in full control of its structure layouts and cache line footprint and so on.
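A rough sketch of what that fast path looks like from a transport's perspective: the transport's queue_rq callback stays in control and only calls a small core helper to build the 64-byte command. The example_request and example_post_cmd names are hypothetical, and nvme_setup_cmd's exact signature and return type differ between kernel versions.

```c
/*
 * Sketch of the fast path from a transport's point of view: queue_rq stays
 * in full control of its own structures and only calls a small core helper
 * to fill in the 64-byte submission queue entry.  example_request and
 * example_post_cmd are hypothetical; nvme_setup_cmd details vary by kernel.
 */
#include <linux/blk-mq.h>
#include "nvme.h"		/* NVMe host driver internal header */

struct example_request {			/* per-request PDU (hypothetical) */
	struct nvme_command	cmd;
};

/* hypothetical transport-specific submission, e.g. post an RDMA send */
static int example_post_cmd(void *queue, struct example_request *req);

static int example_queue_rq(struct blk_mq_hw_ctx *hctx,
			    const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;
	struct example_request *req = blk_mq_rq_to_pdu(rq);
	int ret;

	/* Small library call: translate the block request into an NVMe SQE. */
	ret = nvme_setup_cmd(rq->q->queuedata, rq, &req->cmd);
	if (ret)
		return ret;

	/*
	 * From here on the transport owns the I/O: map the data and post the
	 * command itself, with no further layering or indirect calls.
	 */
	return example_post_cmd(hctx->driver_data, req);
}
```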
This was something the PCIe people insisted on
because they've been optimizing for crazy
high-speed I.O. and I actually did that at the same time.
So I actually just did a new block device I.O. implementation for that.
And it also really helped us with the Fabrics development.
What we do use the common code for is, for example, user-space pass-through I/O tools. In Linux we have the nice little nvme-cli tool that allows you to send every crazy admin command ever invented and even some vendor-specific commands, and this makes sure the handling is one and the same and bug-for-bug compatible.
Yeah, yeah, it's really great for finding bugs.
Yes, and the nvme-cli tool is in fact what we use for our own new commands, so to connect to the remote controller, which, unlike a PCIe controller, is not immediately found. We'll talk about that a little bit later. And we added all of that to nvme-cli so that you have one single tool to go to for all the NVMe-related interactions on a Linux system.
So, features supported: various RDMA transports. My favorite one is still InfiniBand, even if even Mellanox tries to talk about Ethernet all the time. It's a lot less painful. But we support the various RoCE formats as well, and iWARP. We don't support Intel Omni-Path, which is another little hype thing at the moment, not because we wouldn't like to, but because their driver is so bad and doesn't support everything we need. Eventually we'll get there. We have dynamic connect and disconnect to remote controllers, in basically unlimited numbers in theory; at some point we'll run into scalability issues.
We've got the discovery support. We'll explain a little bit later what discovery actually is; well, actually we won't, because we moved the slide. So, discovery is just a way to find remote controllers. In PCIe it's pretty easy to find the controllers: you've got your PCI Express bus, there is a defined enumeration over it, and your OS will just find them. Well, now you're on, say, RoCE v2 or iWARP, and in theory every possible IPv6 address in the world might have a controller behind it, so we need some way to find them.
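For reference, this is roughly what one entry of the discovery log page looks like per the NVMe over Fabrics spec: enough information for the host to reach one subsystem. The struct name here is illustrative and the layout is a simplified rendering of the spec, not the kernel's exact definition.

```c
/*
 * Rough rendering of one discovery log page entry as laid out in the NVMe
 * over Fabrics spec: everything the host needs to reach one subsystem.
 * Illustrative only, not the kernel's exact definition.
 */
#include <linux/types.h>

struct nvmf_disc_entry_sketch {
	__u8	trtype;		/* transport type: RDMA, Fibre Channel, ... */
	__u8	adrfam;		/* address family: IPv4, IPv6, IB, FC */
	__u8	subtype;	/* discovery subsystem or NVM subsystem */
	__u8	treq;		/* transport requirements, e.g. secure channel */
	__le16	portid;
	__le16	cntlid;		/* 0xffff means dynamic controller model */
	__le16	asqsz;		/* admin submission queue size */
	__u8	resv10[22];
	char	trsvcid[32];	/* transport service id, e.g. RDMA port number */
	__u8	resv64[192];
	char	subnqn[256];	/* NVMe qualified name of the subsystem */
	char	traddr[256];	/* transport address, e.g. an IP address */
	__u8	tsas[256];	/* transport specific address subtype */
};
```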
We've got support for the Keep Alive extensions
that were added to NVMe at the same time as
the Fabric support.
So for Fabrics where we don't have a reliable connection loss indicator, we have a way to
probe for that.
And we have some very, very basic multi-pathing support, thanks to Sagi using the existing
device mapper code.
It's very basic, mostly for the reason that NVMe doesn't really have something similar to ALUA in SCSI, so we don't really know which path is active, which is passive, which is preferred, port grouping and so on.
But if you just have a simple failover multipathing where you just try all paths and if one fails,
you keep using the other ones, it'll just work.
Features that are not supported yet: one is any form of authentication, mostly because that part of the spec has been outsourced to the TCG group instead of NVM Express, and it hasn't happened. You'll see a theme here about outsourced parts of the spec: it's usually the ones that are delayed.
And the other one briefly mentioned is fiber channel support, which is a work in progress.
The existing fiber channel vendors are all over it, let's say it that way.
It's life or death for them.
But, I mean, the other thing is if you care more about IOPS than latency, which is reasonable for some workloads, Fiber Channel actually does really well, just at a price tag.
So it's not like it's completely stupid, but it's definitely not what we focused on here.
Someone's going to sue me for that.
So I didn't say that.
Ignore it.
As we mentioned earlier, most of the code is shared between the different transports, but it's barely a majority; I think it's about 52 or 53%, as you can see on this part of the pie chart. Then the PCIe driver is still pretty big, a lot of code in there, because it's actually way more low level than our verbs-based RDMA driver. The RDMA driver, I just checked, the host driver is 1,900 lines of code, so that's another big chunk, and the actual Fabrics-specific common code is pretty small. And not quite two-thirds, but more than half, of the common code is actually a deprecated SCSI emulation in the NVMe driver which we'll hopefully get rid of. So if you remove that, the common core is even smaller. That SCSI emulation is actually the largest source file in the NVMe tree. It's a weird translation to support SCSI commands sent to the device, which doesn't make any sense, but Intel really wanted it added back in the early PCIe days.
So, and in addition to this host driver, which you use to access your remote NVMe devices,
the other thing we implemented was a target device.
And target is actually a word that doesn't appear in the NVMe spec ever.
Unfortunately, it doesn't have a word for that concept at all, that target system you know from SCSI.
It only has names for the various subcomponents of it, which make it very awkward to talk
about an implementation.
So I made a habit of always calling it target and maybe eventually getting the word in the
spec.
And the NVMe target supports implementing NVMe controllers in Linux. Right now it's just Fabrics controllers, the RDMA Fabrics controllers we were talking about and, well, anything else Fabrics. But it's prepared for adding something that either is or looks like a real PCI Express controller in the future. So we try to keep the Fabrics-specific parts separate and at least allow for concepts that only exist for PCIe, like mapping multiple submission queues to a single completion queue, something we don't do in Fabrics, something we avoid doing in PCIe, but it's still allowed by the spec.
You can't do it for Fabrics.
No, you can't.
I mean, in terms of NVMe submission completion queues,
the spec explicitly disallows it.
Not RDMA completion queues.
We'll talk about that later.
So what we do is we have our RDMA target,
the in progress fiber channel target,
and the other thing that was the mystery transport
I mentioned earlier is we've got a loop transport.
And the loop transport is basically
a host and a target driver at the same time,
and just does an injection of one into the other, which
allows us to do local testing and feature development
without ever having to go out to a wire.
And it's proven really, really useful for testing as well.
And one thing that we try to make sure is that the NVMe target can use any Linux block
device.
So what we're not doing by default, and currently actually not at all, is command pass-through. We always implement the actual NVMe engine in Linux. That means your backend can be NVMe, but it can also be a SAS or SATA SSD, a RAM disk, virtio if you're running in a VM, crazy fast directly mapped persistent memory devices, and so on and so on. And it just uses the normal Linux block layer to communicate with the device. In fact, if you look at the actual block I/O path of the NVMe target, it's about 100 lines of code. It's literally nothing, which also means we can't add much latency or overhead in it. That's always a good thing.
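The idea behind that tiny block I/O path, sketched loosely after the nvmet block-device backend: turn the NVMe read or write into an ordinary bio against the backing device. Function and parameter names here are illustrative, error handling is omitted, and bio allocation details vary by kernel version.

```c
/*
 * Sketch of the small block I/O path in the target: an NVMe read or write
 * that arrived over the fabric becomes an ordinary bio against whatever
 * Linux block device backs the namespace.  Loosely modeled on the nvmet
 * block-device backend; simplified, not the exact upstream code.
 */
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/scatterlist.h>

static void example_execute_rw(struct block_device *bdev, sector_t lba_sector,
			       struct scatterlist *sgl, int sg_cnt,
			       unsigned int op, bio_end_io_t *done,
			       void *private)
{
	struct scatterlist *sg;
	struct bio *bio;
	int i;

	bio = bio_alloc(GFP_KERNEL, sg_cnt);	/* newer kernels take bdev/opf here */
	bio->bi_bdev = bdev;			/* newer kernels: bio_set_dev() */
	bio->bi_iter.bi_sector = lba_sector;	/* from the command's starting LBA */
	bio->bi_opf = op;			/* REQ_OP_READ or REQ_OP_WRITE */
	bio->bi_end_io = done;			/* completes the fabrics request */
	bio->bi_private = private;

	/* Attach the data the transport already placed in the scatterlist. */
	for_each_sg(sgl, sg, sg_cnt, i)
		bio_add_page(bio, sg_page(sg), sg->length, sg->offset);

	submit_bio(bio);
}
```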
We also played around
with the NVMe command pass-through,
so I mostly did it for features
we don't want to support in a target
like security command pass-through.
Some others thought it would be useful
for performance, which I can't really see
as it would use exactly the same infrastructure,
just instead of being a generic request,
we would attach that little 64 byte
NVMe command block to it.
And the supported features are, well, all the mandatory NVMe I/O commands, of which there are only three: read, write, and flush, aka cache flush. We support the full set of admin commands we need to support, which is a little smaller for Fabrics than for PCIe. We support the Dataset Management command, which has a couple of different sub-functions, but only one that is really useful, which is deallocate in NVMe speak, aka trim in ATA speak, aka unmap in SCSI speak, aka discard in Linux speak. Basically, you deallocate or trim blocks that were written so that the device knows they're unused now and can make better garbage collection decisions or deallocate backing store.
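As a sketch of how one deallocate range can be mapped onto the Linux discard machinery, assuming the usual blkdev_issue_discard helper (its signature has varied a bit across kernel versions):

```c
/*
 * Sketch: mapping one Dataset Management "deallocate" range from the wire
 * onto the Linux discard machinery.  blkdev_issue_discard() is the real
 * block-layer helper, though its arguments have varied across kernel
 * versions; the shift handling here is simplified for illustration.
 */
#include <linux/blkdev.h>
#include <linux/nvme.h>

static int example_execute_deallocate(struct block_device *bdev,
				      struct nvme_dsm_range *range)
{
	/*
	 * NVMe ranges are in logical blocks; the block layer wants 512-byte
	 * sectors, so shift by (log2 of the block size minus 9).
	 */
	unsigned int shift = blksize_bits(bdev_logical_block_size(bdev)) - 9;
	sector_t sector = (sector_t)le64_to_cpu(range->slba) << shift;
	sector_t nr_sects = (sector_t)le32_to_cpu(range->nlb) << shift;

	return blkdev_issue_discard(bdev, sector, nr_sects, GFP_KERNEL, 0);
}
```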
The whole provisioning of subsystems, controllers, and namespaces is entirely dynamic. So you can add a new namespace to a controller that a host is currently connected to; the host will get an asynchronous event notification and be able to see the new namespace, and so on and so on. And we include the discovery service, which I mentioned earlier, which is used to find the devices, including support for referrals. So even if you go discovering on one of our targets, it can just refer you to another one on the other side of the globe. Not really useful right now, but it will be useful at some point, especially when you do fully clustered implementations, which we kind of support at the moment, but without ALUA or some other distributed semantics it's not that useful yet.
Which actually brings us to the next slide with features. So, features not supported. One is a couple of log pages, like the SMART/usage log and the error log. The first one actually has a patch out on the mailing list now, which we'll hopefully have soon.
Well, again, no authentication. An important one is persistent reservations. NVMe has something that is almost, but not quite, the same as SCSI-3 persistent reservations, and we'll hopefully support it in the Linux target. We just want to have a common API so that we don't have to implement it on our own and are able to use clustered backends. There are two clustered persistent reservation backends for Linux at the moment: one is using the device, and one is using a distributed lock manager when you do more active-passive failover things using DRBD. And they're both in weird out-of-tree shape, so we're trying to make them all sit behind an API so that the SCSI and NVMe targets can use the same backend code.
Yeah?
What's the target timeline for adding those?
None really at the moment. Well, for persistent reservations... actually, the SMART log will come really soon. The error log we just don't plan to do anytime soon. The spec requires the error log to be there, so we have it; it's just always zero. The way the NVMe error log works is that when you get back an error, there's a bit set in it that tells you there might be more information about it in the error log. A lot of PCIe controllers never set that bit, so we followed their lead. You can get your error log, there's just not going to be anything in it.
And the real driver of implementing that is a host that actually makes use of it.
So right now, none of the NVMe drivers I know would even ever ask for it, which makes it a little pointless to implement.
But if people come up with good use cases for it,
it should be a fairly simple implementation.
Not in the code base right now, but I've been prototyping various things.
It's just that the current hinting story in NVMe
isn't all that nice, so Martin and I and some other guy have been trying to change that, but it's going to take a while until we actually have useful hints in the spec.
How about the authentication?
Well, the problem is, again,
there's no spec for that right now.
So what the NVMe spec basically says,
it defines two new fabric-specific commands,
authentication send and authentication receive,
which then tunnel a security protocol. None of those protocols is defined yet, so NVMe is working with the Trusted Computing Group to define a spec for that.
It's just that nothing has happened so far.
Okay, so the next one is fused commands, and there's actually only one of them: compare and write. In SCSI that's something VMware really likes and no one else ever uses, and in SCSI it's implemented really awkwardly, with a command that has a data buffer that is actually split into two buffers in a magic way, which creates real problems for implementations. What NVMe did for that is something slightly less ugly, but still a little hairy. You get two commands, a compare command and a write command, and they're linked by the fused bits. Only if the compare succeeds will it then execute the write as part of the fused command. So think of it a bit like a load-linked/store-conditional over a storage protocol.
That becomes very well defined when you have relaxed ordering... well, but NVMe doesn't have relaxed ordering, so anyway, it's a bit painful. I think it's actually less painful than the SCSI version of it, but the concept is hairy one way or another.
And we're most likely not going to see it until there is a VMware host implementation, or someone actually starts using it on Linux, which I don't expect.
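Purely as a conceptual model of the fused compare-and-write semantics just described, not driver code:

```c
/*
 * Conceptual model only: the fused compare-and-write pair.  The compare
 * command and the write command are linked by the fused bits, and the
 * write executes only if the compare matches.  A real implementation must
 * also keep the pair atomic with respect to other commands on that range.
 */
#include <stdbool.h>
#include <string.h>

static bool fused_compare_and_write(unsigned char *lba_range, size_t len,
				    const unsigned char *expected,
				    const unsigned char *new_data)
{
	/* First half of the pair: COMPARE against the expected data. */
	if (memcmp(lba_range, expected, len) != 0)
		return false;	/* both halves fail with a compare mismatch */

	/* Second half: WRITE, executed only because the compare passed. */
	memcpy(lba_range, new_data, len);
	return true;
}
```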
Then, independent of the authentication, there are also the traditional NVMe security protocols, which are more about drive lock and unlock, either by a plain-text PIN or TCG magic. Right now we don't have an implementation of that either.
We're thinking about a few ideas there internally at HGST, but not anytime soon.
The other thing we haven't done isn't really a protocol feature, but more an implementation feature that a lot of people are playing around with right now, including HGST, or WD, for which I work: some form of PCI peer-to-peer I/O. So that we do transfers straight to the device, say an NVMe controller memory buffer, or say a PCIe card that just has directly LBA-addressable storage. For that we don't want to RDMA into host memory first, but go straight to the device. It's very hard to get a framework in for that, and we'll need to do a lot more talking that doesn't involve just the NVMe target, but the PCI subsystem maintainers, the driver maintainers, memory management people, and so on.
Yeah.
Yeah, so listen to Stephen Bates, you'll find him on the schedule.
He's probably the one who's been playing
with that for the longest.
So then we again got our little pie chart.
So again, the most code is common.
In fact, just a little bit more than on the host, but at least we don't have a crazy SCSI
emulation here, so there's actually real work done in the common code.
Then we have our loop device, which is pretty easy.
We've got the RDMA implementation, which looks larger in this chart, but it's actually smaller
than on the host.
So our RDMA transport driver is only slightly less than 1,500 lines of code. And then we've got a little bit of Fabrics-specific code too, but it's actually fairly trivial. The size of the whole core is actually smaller than some of the SCSI target transport drivers, because we try to keep it very simple, because the spec is very simple and we implement the minimum possible or necessary. The other reason is that we try to aggressively offload work that is not really NVMe-specific to other people, which mostly meant the two of us just wearing different hats. So we did a lot of RDMA subsystem changes, which Sagi is going to talk about after I finish one more slide or something like that.
The other is the configfs subsystem. It's a kernel file system that allows you to configure kernel subsystems using a file-system-like interface. I actually started maintaining that just for this, so I could push it all in there.
And as I mentioned on the last slide, we plan to go on like that, especially for persistent reservations, which is another giant code blob that we really don't want to have in our driver, because it's not specific to us and has multiple possible backend implementations.
So, NVMe target configuration: we have this little configfs interface that I mentioned, which lets user-space tools configure it. It's kind of modeled after what's done in the Linux kernel for the SCSI target, but simplified down to the minimum. Part of that is just that the NVMe spec is much more coherent than the various SCSI transports, which have a lot of leeway to do things differently, use different identifiers, different features and so on. And part of that is that we developed everything in one go instead of adding things piecemeal, and tried to look at the big picture. And we've got a user-space utility that is written in Python and uses an existing library for these configfs interactions, which allows... actually, I should just move to the next slide, which has a picture of it. So it's this little command-line interface, a little bit graphical, using colors and ASCII art. There's tab completion for everything. So it's a nice little interactive tool, and it also allows you to save and restore the configuration. We ship a systemd unit file so that you automatically get everything restored at system boot and so on.
And after this, I'm going to hand over to Sagi who's going to tell a lot more about the RDMA subsystem work we did as part of this project.
Hi, is it on?
All right.
Okay, so I'm going to talk about all the RDMA stuff and all the work that was done in the
RDMA stack in the scope of the NVMe RDMA driver development.
Basically, when we came in and implemented the code, we started looking at all the other storage RDMA protocols that exist in Linux today, and we identified some spots that duplicate code and basically re-implement a lot of boilerplate for things that really are generic and should live in the generic RDMA stack.
So this is a very simplified view of the Linux RDMA stack today. In green you have the RDMA protocol drivers. We call them ULPs, upper-layer protocols. These are the protocol implementations that work on top of the RDMA core. The RDMA core is basically an API layer combined with management. And underneath we have the specific device drivers that implement the RDMA: we have iWARP drivers like cxgb3 and cxgb4, we have IB and RoCE drivers like Mellanox mlx4 and mlx5, and we have others as well. Among the ULPs we have SRP, the SCSI RDMA protocol, iSCSI over RDMA which is iSER, NFS over RDMA, which Chuck has been doing a lot of work on lately, and the new NVMe RDMA transport.
So as I said, the stack is logically separated into these three components, and we basically did a lot of code centralization, and we'll start enumerating the pieces one by one. The first was memory registration. That was a big pain, or very painful, as an RDMA developer or someone who is involved in the stack. Basically, memory registration, if you attended the last talk, is the way to allow remote peers to access your local buffers. You basically need to pass the physical mappings of the buffers themselves to the HCA so it will know how to access them directly. Then you get some kind of token, it's called a remote key in RDMA language, and you pass it to a peer, and it uses that to basically stream data into your local buffers. So that is called memory registration.
And funnily enough, the stack offered maybe five or six different ways to do basically the same thing. Different methods had different semantics and different speeds. And I think at some point I tried to introduce a seventh way, but I started getting anonymous threats in my mailbox. So, okay, I said we should rethink the whole thing.
So again, it was a lot of effort. I think the most ambitious driver was the NFS client, which actually tried to support all of them. But Chuck did great work cleaning that up and taking the NFS client to a new level, ending up with only two different ways of doing it. So we started a discussion on the mailing list, because we wanted to converge on a single method.
We basically have one mission, we want to register memory,
we want to find a single way to do that,
we want new developers basically to have a natural choice
and they know how to do that.
So we converged on fast registration work requests, which was one of the methods that was available. It was the most widely supported across the various device drivers and devices, and it basically performed very well.
Because it's fast.
Because it's fast. We called it fast because it's fast. Remember, the name is important. It's very fast.
Then we looked into the FRWR interface that the core layer presents, and we found that the amount of work a ULP or protocol driver needs to do is way too much, and we saw the same code replicated in each driver. So what we did is we took all the shared or replicated code and put it inside the RDMA core, and we provide very intuitive and simple interfaces to the different ULPs. We allow them to pass scatterlists rather than construct their own page vectors with the alignment constraints of each RNIC or device. We also looked at the provider drivers, the device drivers themselves, and we saw suboptimal implementations. Most of them used something like shadow page vectors for endianness conversion and flagging, which was really unnecessary, so we also offered a set of helpers from the core to do that correctly. And we basically did that and migrated all the ULPs to use this registration method.
I think RDS is the last one that needs conversion.
We have it too now.
Yeah, all right.
So we migrated all the ULPs, and I think now, if you're going to write a new RDMA protocol driver, it should be a fairly simple task. So that was good for everyone, and everyone benefited from that.
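A sketch of what fast registration looks like for a ULP with the consolidated interface: hand the core a scatterlist, get back an rkey. ib_alloc_mr, ib_map_mr_sg and IB_WR_REG_MR are the real verbs-level pieces, but argument details differ between kernel versions and error handling is trimmed here.

```c
/*
 * Sketch of fast registration with the consolidated core interface: the
 * ULP hands over a scatterlist and gets back an rkey; building the page
 * vector now happens in the RDMA core and the device driver.  Real calls,
 * but simplified, with error handling omitted.
 */
#include <rdma/ib_verbs.h>

static u32 example_register(struct ib_pd *pd, struct ib_qp *qp,
			    struct scatterlist *sg, int sg_nents)
{
	struct ib_mr *mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, sg_nents);
	struct ib_reg_wr reg = { };

	/* The core walks the scatterlist and builds the page list for us. */
	if (ib_map_mr_sg(mr, sg, sg_nents, NULL, PAGE_SIZE) != sg_nents)
		return 0;	/* couldn't map everything */

	reg.wr.opcode = IB_WR_REG_MR;	/* the fast registration work request */
	reg.mr = mr;
	reg.key = mr->rkey;
	reg.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;

	ib_post_send(qp, &reg.wr, NULL);	/* older kernels need a bad_wr arg */

	return mr->rkey;	/* the token handed to the remote peer */
}
```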
Another aspect that we looked at was the completion queue API. Basically, the completion queue polling implementation was replicated through every ULP that existed in the stack, in some form. Some implemented the API correctly, but we found a lot of...
No, no, I don't think any of them got it right.
Each driver has its own mistakes in terms of correctly re-arming the CQ to generate more interrupts, fairness, and context abuses. Basically, we saw drivers that could stay in hard IRQ or soft IRQ context forever.
But what we wanted to do is basically abstract away all the details of how to do correct polling of the RNICs and provide a simple and intuitive interface to all the ULPs. So we took all the good ideas from all the ULPs and put them in the core layer, both the allocation and the polling implementation itself. We basically allow the ULPs to pass a done function, which executes once the work request completes. It's very similar to the Linux bio interface, if you're familiar with it. And for the different modes of ULPs, whether it's a server or an initiator mode driver, we offer several completion queue polling contexts. One is soft IRQ, with a new library called irq-poll, which was migrated from an implementation that lived in the block layer; we made it generic and used that. We also have a workqueue interface for target mode, which usually runs to completion from the completion all the way down to the backend, and we also offer direct polling for application-driven polling. And we migrated all the ULPs, or the vast majority of them, to use this API, and
the nice thing, which was our indication that we got it right, is that we had several developers from the RDMA subsystem ecosystem converting to this API themselves. Once they got it working and it was very intuitive for them, we knew that we basically got the API right.
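A sketch of the completion queue API being described, based on ib_alloc_cq() and the embedded struct ib_cqe with its done callback; the example_ names are hypothetical and error handling is omitted.

```c
/*
 * Sketch of the new completion queue API: the ULP embeds a struct ib_cqe
 * with a 'done' callback in its request and lets the core handle polling,
 * re-arming and the execution context.  Simplified, illustrative only.
 */
#include <rdma/ib_verbs.h>

struct example_rdma_request {
	struct ib_cqe	cqe;		/* embedded completion entry */
	/* ... protocol state ... */
};

static void example_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct example_rdma_request *req =
		container_of(wc->wr_cqe, struct example_rdma_request, cqe);

	if (wc->status != IB_WC_SUCCESS) {
		/* error handling / queue teardown would go here */
	}
	/* complete 'req' back to the upper layer */
}

static struct ib_cq *example_create_cq(struct ib_device *dev, int nr_cqe,
				       int comp_vector)
{
	/*
	 * Initiator-style soft-IRQ polling via the irq-poll library; a
	 * target-mode driver would typically pick IB_POLL_WORKQUEUE.
	 */
	return ib_alloc_cq(dev, NULL, nr_cqe, comp_vector, IB_POLL_SOFTIRQ);
}

static void example_init_request(struct example_rdma_request *req)
{
	req->cqe.done = example_send_done;	/* runs when the WR completes */
}
```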
Another aspect that we found useful is pooling completion queues. Basically, RDMA applications can achieve better parallelism, or completion processing efficiency, if they use pools of completion queues, the stack correctly assigns IRQ affinity to cores, and multiple queue pairs are stacked on the same completion queue. You basically get the benefit of better completion aggregation per interrupt and an overall reduced number of interrupts per completion queue. And smart assignment of IRQ affinity, as I said, helps the overall parallelism of processing in the software.
So what we did is move all the completion queue allocation API into the RDMA core, basically hinting at QP creation time which CQ you want to use. So we moved it to the RDMA core, and completion queue pools are allocated on demand, in a lazy fashion, with per-core semantics. The ULP can pass an affinity hint for which core or completion vector it wants to see interrupts on, and it can leave it blank for a wildcard assignment.
And the nice thing is that all the wisdom that we found in the individual ULPs now moves to the core, and ULPs that weren't aware of smart assignments and correct pooling benefit from using nice and simple APIs. It's a work in progress; I expect it to make kernel 4.9 once we get the SRP bug out of the way.
Any questions? None. All right.
Too hardcore.
Too hardcore. Yeah, you need to bear with me. It's a very high-level description of the low-level implementations that we did.
It's funny that you talk about pooling and polling in this.
Yeah, yeah. It is. It is.
I'm gonna get some.
Yeah, all right, okay.
So, one more aspect that we noticed is duplicated across all the protocol drivers was the generation of the RDMA data transfers in each driver itself. All the RDMA storage protocols share the same classic RPC model; you've seen it in the last slides if you attended. For reads, the initiator sends a read request to the target, which is followed by one or more RDMA writes of the data, and the result is reported back in a completion, which is basically another send work request. Writes work the same way: you send a write request, it's followed by a series of RDMA reads, and you're notified back with the write completion. This is not mandatory; if the data is small enough, it can be carried inline in the write request on some protocols.
What we saw is that basically every single ULP implemented the same sequence of operations to transfer data from a local scatter-gather list of buffers into remote scattered buffers. Lots of code duplication. Inevitably, or not surprisingly, some got it right and some got it wrong. And the most surprising thing is that all the logic existed in the ULPs; none existed in the core layer or the provider drivers, which would just pass the whole thing through to the wire. So we needed some high-level API to make all the data transfer transparent, so that the code that should exist once is implemented correctly, with a nice API for the individual ULPs.
So what we did is create a new core library to generate a series of RDMA reads and writes in a very generic way, with a very simple interface. Basically you get three functions: an init of an RDMA read/write context, a destroy of that context, and a post, which just posts the series of RDMA reads or writes that were set up. It works on common scatterlists, which is what I think most of the RDMA ULPs use; I think NFS needs a bit of conversion to use it seamlessly. But it's very useful to work on the generic, common scatterlists that are used in Linux.
And it masks away all the details of DMA mapping, and, let's say, it implements all the wisdom of correct completion signaling, which completions can be signaled or suppressed, and correct batching of multiple RDMA work requests, to basically get the highest performance out of a ULP implementation. It also has support for the Mellanox signature extensions to implement T10 PI, which can be implemented in NVMe over RDMA as well; it should be easy enough to add.
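The three functions described here map onto the upstream rdma_rw_ctx API roughly like this; the example_ wrappers are hypothetical and error handling is trimmed.

```c
/*
 * Sketch of the generic RDMA READ/WRITE library: init / post / destroy,
 * operating directly on a Linux scatterlist (here: a target writing read
 * data back to the host).  Simplified, error handling trimmed.
 */
#include <rdma/rw.h>

static int example_write_data_to_host(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
				       u8 port_num, struct scatterlist *sgl,
				       u32 sg_cnt, u64 remote_addr, u32 rkey,
				       struct ib_cqe *done_cqe)
{
	int ret;

	/*
	 * DMA-maps the scatterlist and builds the chain of RDMA WRITE work
	 * requests, including any registration the device needs (e.g. iWARP).
	 */
	ret = rdma_rw_ctx_init(ctx, qp, port_num, sgl, sg_cnt, 0,
			       remote_addr, rkey, DMA_TO_DEVICE);
	if (ret < 0)
		return ret;

	/* Post the whole chain; done_cqe fires when the last WR completes. */
	return rdma_rw_ctx_post(ctx, qp, port_num, done_cqe, NULL);
}

/* Called from the completion handler, after done_cqe has fired. */
static void example_write_done(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
			       u8 port_num, struct scatterlist *sgl, u32 sg_cnt)
{
	rdma_rw_ctx_destroy(ctx, qp, port_num, sgl, sg_cnt, DMA_TO_DEVICE);
}
```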
And again, after we did all that work, we migrated the individual ULPs, at least iSER and SRP, to use the new API; NFS still needs a conversion, and we're hoping to get that soon. I wish I had captured the number of code lines that were eliminated for each driver. For iSER, I just remember it was very, very significant. I think it was bigger than our whole driver.
Yeah, probably.
OK, so performance results, I'm going
to give it back to Christoph.
Any questions before I give it back to Christoph?
Does the user-space implementation also get the benefit of this consolidation?
Good question. The short answer is no, they get none of these benefits. In RDMA, the user-space data plane is pretty much disjoint from the kernel data path implementation, so all of this would need to be re-implemented in user space for user-space applications to get the benefits. That's basically why user-space applications will not see the same benefits. But once someone picks up the gauntlet and re-implements it in user space, they can. So, short answer: it's doable, but no one's done it yet.
We're almost running out of time, so let's go on to some benchmarks. They're a little bit older, but the basic scheme still holds. What we've done here is a benchmark of little queue-depth-1 reads, 512-byte reads. The things we're comparing are a RAM disk and an NVMe SSD. And if you glance at the numbers, you'll notice it's definitely not a flash SSD; it's one of our HGST prototypes that's actually DRAM backed, so it's way, way faster than what you'd expect from a real SSD. And then we've benchmarked that either locally or going out over NVMe over Fabrics on a Mellanox HCA, with all the little optimizations you can apply.
So for example, we're avoiding the memory registrations for that by using a nice to
use but not quite safe feature that all the RDMA protocol drivers use to
avoid memory registration. Not going to get into that. Don't do that at home.
So basically, the local RAM disk latency is very, very low, and then we're at about four microsecond latency with our NVMe DRAM device. Oh, and this is the percentile of the I/Os, so these are the latency spikes we might actually see in the implementation if you don't just go for the average latency but for the worst-case latency. That's actually something that's gotten a lot better recently. Then we get about seven to eight microseconds of latency if we do NVMe over Fabrics to that synchronous RAM disk on a real remote machine, so that's a good proxy if you care about remote persistent memory; it will not be quite as fast as the local RAM disk, but very close. And then we get 12 to 13-ish microseconds of latency for a real NVMe device behind the PCIe bus on the remote system, even if it's a very fast one, for our queue-depth-1 random reads.
And this is the normal interrupt-driven performance.
What is the I/O size?
The I/O size? 512-byte I/Os, I mentioned that earlier. So the smallest possible block size on any block device in general.
So the other thing we did was add polling support, and for one, we have polling support in the NVMe over Fabrics host driver. What we do there, basically, is that once we've submitted an I/O and we know the tag, we start hammering the RDMA completion queue from the submission context. We actually have experimental polling support in the target driver as well, which gives another slightly better number, but in practice it's pretty useless, because once you start polling for one tag you can't handle another one. For real server applications it's a downgrade, but it's trivial, so we left it in.
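As a sketch of that host-side polling idea: spin on the completion queue from the submitting context using the core's direct-poll helper (this assumes a CQ created with IB_POLL_DIRECT; the io_done/tag plumbing is purely illustrative).

```c
/*
 * Sketch of the host-side polling experiment: after submitting an I/O,
 * hammer the RDMA completion queue from the submitting context instead of
 * waiting for an interrupt.  ib_process_cq_direct() is the real core
 * helper and requires a CQ allocated with IB_POLL_DIRECT; the io_done/tag
 * plumbing here is purely illustrative.
 */
#include <rdma/ib_verbs.h>

static void example_poll_for_completion(struct ib_cq *cq,
					bool (*io_done)(void *tag), void *tag)
{
	/*
	 * Each call processes up to 'budget' completions and invokes their
	 * done() callbacks, one of which will mark our request as finished.
	 */
	while (!io_done(tag))
		ib_process_cq_direct(cq, 8 /* budget */);
}
```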
With that, our latency numbers get a lot lower. Actually, the local RAM disk case is the same, but for NVMe over Fabrics with polling, which is the number that goes right in here, we go down to about 10.3 microseconds, I think, as the best reproducible number we got.
Do you have the P99 and 99 point whatever
broken out anywhere?
Damien has it somewhere.
Where's Damien when you need him?
He was here in the last talk.
Yeah, I only have the ones I talked about. Not anymore, okay, yeah.
But anyway, we need to rerun a lot of this because we've made even further improvements to the RDMA stack. But the whole idea is that once we start doing intelligent polling on the host, the target doesn't really matter, because it's pretty much busy all the time anyway, and we can get another two to three microseconds out of it. Another project recently was to actually rewrite the Linux polled block I/O, block direct I/O code for small I/Os, and that gave us another 18 percent reduction in local latency. So there's work we have to do in all the different paths in there; it's not really
just the protocol. What's the distribution of the spy data from one way or the other?
Okay.
Was that on 100 gigabit?
What's in here? Not 100; 40 or 56, whatever the smaller one is.
FDR, 56 gig.
Yes, 56, yeah. Not sure really, so.
Yeah.
Yeah.
So right now we're spending a lot of time looking at these graphs just for local I/O, and once we're done with local, we'll move back to Fabrics and see if there's more to get out of the RDMA stack.
The block stack rewrite, or the whole lot, is that coming in 4.9 as well?
No. The basic block polling support has been in for a few releases; Jens and I did that. The rewrite I won't have ready for 4.9, but I have a Git tree. It's actually 200 lines of code, because I've ripped out everything I didn't need and then made it go faster. So I'm hoping for 4.10, but we'll have some more discussion on that.
And so that's about it.
I think you wanted to do the status slide again.
Okay.
I'll go real quick because the time is up.
All the code is merged into the 4.8 kernel. Fiber Channel support is still pending, and we have updated nvme-cli and nvmetcli tools that will get into distributions soon. You can find the links to the NVMe over Fabrics code, either upstream or in the for-next tree that we maintain, and to the nvme-cli and nvmetcli code repositories.
Thank you.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the developer community.
For additional information about the Storage Developer Conference, visit storagedeveloper.org.