Storage Developer Conference - #64: Past and Present of the Linux NVMe Driver
Episode Date: February 13, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair.
Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community.
Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode 64.
Hello, everyone, including the latecomers that were enjoying the music outside.
So I'm going to talk about the past and present of the Linux NVMe driver, and it's been there for a while.
Now I'm one of the co-maintainers, and I'm also active in the NVMe technical working group,
hassling all the device and standards people to think of us poor host people that actually have to deal with the
drivers.
And, well, what's a driver?
Everyone will know, and if you think of golf right now, you're in the wrong room.
So, piece of computer software that controls input and output.
Sounds harmless, huh?
But it could be a whole lot of different things.
So just if we're sticking to storage drivers, because we're all storage people, and look
at good old SCSI, you could have a tiny little driver like virtio-scsi for virtual machines,
which, including the header, is 1,300 lines of code, or we could be three orders of magnitude
bigger for a nice little fiber channel driver that everyone loves.
So, well, drivers can be quite a lot of different things.
So it could be lots of different hardware types.
Like if you have seven generations of different hardware in the same driver, it's going to be big.
If you support an initiator and a target and SCSI and NVMe in one driver, well, it's going to be big, and all these factors multiply.
So if we're moving to NVMe drivers, this could look a little different, too.
So if we go back to Linux 4.4, our NVMe driver was about 7 and a half K lines of code,
split over a few files.
Actually, in 4.12, it got a lot smaller,
which is interesting.
But that's because at that point,
we actually had two drivers,
and in the end, it's actually more code overall.
It's just that the one actually driving
the little PCIe device you think of when you hear NVMe
is a lot smaller.
And either way, we're doing pretty good compared to other NVMe drivers.
So if I look at the open source OFED Windows driver,
it's getting closer to our fiber channel monster on the Linux side.
And it actually has a lot less functionality, which is pretty interesting.
So the humble beginning of our little baby, the NVMe driver, was our friend and colleague
Matthew Wilcox back in the day, early 2011.
And if you look at the lines of code, the two files down here, it's actually pretty
much exactly the same size as our example of a trivial little driver.
The other interesting thing when looking at that commit in GitWeb is if you look at the date of the commit,
that was actually a month before the NVMe 1.0 spec
got released. Someone had a head start in there.
So if you look at this very first
commit, it was very little in there.
In fact, in some ways, it doesn't really look like
what your friendly marketeer will tell you about NVMe.
So there was only a single submission queue,
a single completion queue,
it only did small data transfers,
so only up to a single page of actual data,
and it just did read, write, and a few admin commands.
It was tiny, simple, and didn't actually work.
So it took a couple more iterations before it was even
in shape to get merged into the Linux kernel.
So January 2012, one year exactly.
And at that point, it started to resemble what we think of as NVMe.
So it got multiple queue support, basically one per CPU.
It was pretty strict about that.
It didn't really work well at that point if you had fewer queues than CPUs.
It supported larger data transfers, lots and lots of fixes,
and it had grown by about 800 lines of code.
So, well, quite a bit bigger than before.
Well, we had that driver in the main Linux kernel.
There was still no NVMe product on the market by then.
So just a couple of prototypes in labs, giant FPGAs.
I think Martin had one of those and really loved it.
And continued like that, basically, until we got a few products on the market,
got lots of little bug fixes and a couple features that are not too major but interesting.
So we got support for the NVMe deallocate command, what we Linux people know of as discards.
So telling the device that the data in a bunch of LBAs is not needed anymore. Go reclaim it. Optimize your garbage collection.
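As a rough illustration of what that looks like from user space: the hedged sketch below issues a discard on a byte range of a namespace block device through the generic block-layer BLKDISCARD ioctl, which the NVMe driver turns into a deallocate. The device path and range are placeholders, and the call destroys data in that range.

```c
/* Minimal sketch: discard a byte range on an NVMe namespace block device.
 * The kernel turns this into an NVMe Deallocate (Dataset Management) command.
 * /dev/nvme0n1 and the range are placeholders; this destroys data. */
#include <fcntl.h>
#include <linux/fs.h>           /* BLKDISCARD */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
        uint64_t range[2] = { 0, 1024 * 1024 };  /* offset, length in bytes */
        int fd = open("/dev/nvme0n1", O_WRONLY);

        if (fd < 0 || ioctl(fd, BLKDISCARD, &range) < 0) {
                perror("BLKDISCARD");
                return 1;
        }
        close(fd);
        return 0;
}
```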
Actual support for cache flushing, which had been committed before the driver first got into the Linux kernel, only really started working because someone tested it with a device that had a volatile write cache, which they probably hadn't done before.
We got a feature that has been really useful,
and that is the character devices.
So normally if you talk about block storage in Unix systems,
you've got your block device node that you do the actual I.O. on,
which for NVMe we've got one per namespace.
But if you want to actually do administrative and set up things,
we've got the nice little user space NVMe CLI tool
that allows you to basically exercise every NVMe command. And for that, we have a character
device node that offers the ioctls to do it, which is really useful if you, say, have
a drive that doesn't have any namespaces. There's not going to be a block device node
for it. Or you have a really broken device that doesn't come up and can't create I/O queues,
which apparently is a common failure mode for Intel devices,
so you could actually bang that back into a shape where you can use it with a firmware update.
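As a hedged sketch of that character-device path, the snippet below sends an Identify Controller admin command through the per-controller node with the NVME_IOCTL_ADMIN_CMD pass-through ioctl, the same interface nvme-cli is built on; /dev/nvme0 is a placeholder and error handling is minimal.

```c
/* Minimal sketch: send an Identify Controller admin command through the
 * per-controller character device, the same path nvme-cli uses.
 * Works even if the controller has no namespaces. */
#include <fcntl.h>
#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
        uint8_t id[4096];
        struct nvme_admin_cmd cmd = {
                .opcode   = 0x06,                       /* Identify */
                .nsid     = 0,
                .addr     = (uint64_t)(uintptr_t)id,
                .data_len = sizeof(id),
                .cdw10    = 1,                          /* CNS 1: Identify Controller */
        };
        int fd = open("/dev/nvme0", O_RDWR);

        if (fd < 0 || ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
                perror("NVME_IOCTL_ADMIN_CMD");
                return 1;
        }
        printf("model: %.40s\n", (char *)&id[24]);      /* MN field at byte 24 */
        close(fd);
        return 0;
}
```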
Then another thing we got in there, which was a little controversial mostly later,
is that someone at Intel, not the maintainer, added a weird SCSI translation scheme so that you could use the SCSI pass-through ioctls that we have in the SCSI subsystem on the NVMe namespace block device node, and it would silently translate them to NVMe commands and submit them.
Turns out that code was extremely buggy, had a few exploits in there, and generally didn't work, but a few people relied on that anyway. And the first really, really major change, which also went along with a bump of the version
number to the comfortable-sounding 1.0 from the 0.x, whatever, releases, was the switch to
blk-mq. So blk-mq is a piece of infrastructure we have in Linux. It's sort of the block layer, or rather the new version of the block layer.
So before that, our application, our file system, our block device node, more or less called straight into the driver.
So it filled out a little bio structure that describes every block I.O. and dispatches that straight to the driver,
which is, well, a very low-overhead
calling convention for sure, but it also meant that every driver written to that interface
needs to do a lot of work.
And the prime user of that interface in Linux traditionally was remapping drivers.
So if you do a software RAID or a volume manager, those were the drivers written to that interface.
And we always had another layer called the request layer that sits there on top of the drivers,
which did a lot of work that I'll get into on the next slide
and really helps with block drivers.
But once PCIe flash cards started showing up, the performance of that old layer was just too bad for people to use it, so they
started duplicating lots and lots of bits of the infrastructure, and we had to act.
So, first prototype in 2011, we got this new blk-mq layer, which was designed
as a replacement for the request layer. What it does is it splits and merges I/O requests,
because whatever the application or the file system submits
might not be what the driver really wants.
And the easy case you can think of is
you're doing really large IO,
but your device can handle smaller IO,
so it needs to be split.
The other is you have a stupid file system,
not looking at any ext-whatever developers here, that just submits a lot of very
small I/O for what's actually one contiguous big I/O, so we're
merging that back together.
And it has a couple other interesting helpers.
So it manages multiple submission and completion
queues, so this whole problem of, I have N queues available and M
CPU cores, how do I spread them so that everyone gets
one as local as possible, is taken care of.
We've got a command ID or tag allocator, which is actually
tied into the management of the per IO data structure.
So for every possible outstanding command, we've got
a pre-allocated data structure,
both for the common block work and a driver specific part,
which is indexed by our command ID.
So there's one very, very easy to use bitmap allocator
that gets you all the data you need for your IO
and there's no massive memory allocations
and so on and so on.
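To make the queue, tag, and per-command-data ideas concrete, here is a loose, hypothetical sketch of how a blk-mq driver registers itself; the mydrv_* names are made up, and the exact kernel signatures have shifted between releases, so treat it as illustration rather than the real NVMe driver code.

```c
/* Loose, hypothetical sketch of blk-mq driver registration: hardware queues,
 * a fixed queue depth, and a per-command "pdu" that blk-mq pre-allocates and
 * hands back for every tag. */
#include <linux/blk-mq.h>

struct mydrv_cmd {                      /* per-I/O driver data, indexed by tag */
        dma_addr_t prp_dma;
};

static blk_status_t mydrv_queue_rq(struct blk_mq_hw_ctx *hctx,
                                   const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;
        struct mydrv_cmd *cmd = blk_mq_rq_to_pdu(rq);  /* pre-allocated, no kmalloc */

        blk_mq_start_request(rq);
        /* ... build the submission queue entry from rq and ring the doorbell ... */
        (void)cmd;
        return BLK_STS_OK;
}

static const struct blk_mq_ops mydrv_mq_ops = {
        .queue_rq = mydrv_queue_rq,
};

static struct blk_mq_tag_set mydrv_tag_set = {
        .ops          = &mydrv_mq_ops,
        .nr_hw_queues = 4,                      /* e.g. one per CPU / MSI-X vector */
        .queue_depth  = 1024,
        .cmd_size     = sizeof(struct mydrv_cmd),
        .flags        = BLK_MQ_F_SHOULD_MERGE,
};
/* blk_mq_alloc_tag_set(&mydrv_tag_set) followed by blk_mq_init_queue() then
 * hands back a request_queue to attach to the gendisk. */
```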
And as I said, we first got this in 2011,
merged it into the Linux kernel three years later
and initially used it for the virtio block driver.
And then later that year,
we converted the SCSI layer over to optionally use it,
and I did a lot of work on that
because it really helped with performance on things like SRP or high-end RAID adapters that were really
limited by the old way we'd done it. And then in the next year,
well, the next release, 3.19, we converted over NVMe, and by now we've got another 10, 12, or
maybe even 15 drivers. The latest one I saw converted over just two days ago was the IBM S/390 DASD driver.
So even mainframe technology from the 70s is now using our best and fastest infrastructure.
And in general, we didn't have to do much work on block and queue to fit NVMe in, but there was one really interesting thing where NVMe differs from most of the block storage we're
doing in Linux, and that's the way it describes data transfers for DMA.
So what NVMe's got is the concept of PRPs, physical region pages, and the whole idea is that it's not a scatter-gather
list, as most I/O uses, where you have an offset and a length. It
only has offsets, so you can't describe a data transfer with one descriptor
that spans a page boundary. And a page in that sense is actually an NVMe concept. It's
the same concept as the page in a typical operating system,
so usually 4K, but it's a separate setting,
so they don't have to be the same.
And in fact, in Linux, they usually are, but not always,
because our NVMe page size is always 4K,
while the system page size might be larger at 8, 16, or 64K.
But this means that, for example,
if we've got a transfer
where we have two pages that are contiguous,
for NVMe, we actually have to set
two PRP entries for it.
Well, in a scatter-gather list,
the length would just increase.
And that was something
the Linux block layer didn't really expect.
So we had to get new code in there that basically tells the
merging code in blk-mq: don't merge the I/Os together
if they would span multiple pages.
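A tiny standalone calculation, not driver code, shows why that merging limit exists: every NVMe page touched needs its own PRP entry, while a length-based SGL could describe the same contiguous transfer with a single descriptor.

```c
/* Illustration of why PRPs limit merging: for a transfer of `len` bytes
 * starting at `offset_in_page` within a 4K NVMe page, count how many PRP
 * entries are needed versus a single length-based SGL descriptor. */
#include <stdio.h>

#define NVME_PAGE_SIZE 4096u

static unsigned prp_entries(unsigned offset_in_page, unsigned len)
{
        /* every NVMe page touched needs its own PRP entry */
        return (offset_in_page + len + NVME_PAGE_SIZE - 1) / NVME_PAGE_SIZE;
}

int main(void)
{
        /* two physically contiguous 4K pages */
        printf("8K contiguous: %u PRP entries vs 1 SGL descriptor\n",
               prp_entries(0, 8192));
        /* a misaligned 4K transfer spans two pages as well */
        printf("4K at offset 512: %u PRP entries vs 1 SGL descriptor\n",
               prp_entries(512, 4096));
        return 0;
}
```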
And well, we got that in.
We rewrote it later, and it was buggy for a while.
But in the end, it was really useful because it turns
out there were a couple other drivers that had the same sort of limitation and worked around it by
bounce buffering and doing weird things. So all the RDMA drivers, well, not all of them, Mellanox now has a
memory registration type for RDMA that supports scatter-gather lists as a vendor extension, but all
the other RDMA memory registration schemes
have the same limitation.
They're basically using the same data structures as PRPs
and used them even earlier.
They just didn't have a fancy name for it.
And the Hyper-V virtualized storage driver for Microsoft Hyper-V and Azure also has the same strange limitation in there.
And, well, in the meantime,
NVMe had actually grown support for scatter gather lists.
I guess a lot of people complained very loudly,
and the rumor was it was the array people
that probably can't as easily fix their stack as we could.
So NVMe 1.1 now has SGL support, but it's optional.
So very few devices actually support it.
And it's only supported for I.O. commands.
So when you think of I.O. commands, it's the read and the write.
Well, a few others that don't matter.
But all the classical admin setup stuff had to be PRPs.
Well, except in NVMe over Fabrics a little bit later, where now everyone uses SGLs, but they're different SGLs, just to make it complicated.
And we've got patches out there for SGL support in Linux.
We had the first one a couple years ago, but it didn't come with benchmark numbers and wasn't quite as pretty.
But now another engineer started on it and actually backed it up with numbers.
And as soon as we've got actually larger contiguous transfers,
at least 16K, obviously the SGLs win because they're much more efficient to describe that.
While on the other hand, for example, for 4K transfers,
the PRP will always be more efficient because it's smaller.
So we're doing some more fix-ups
to do the perfect detection of these thresholds.
And then we will use whatever fits better if the hardware
actually supports SGLs.
And well, after we'd done the blk-mq switch, the
usual small trickling in of features continued.
So we got support for T10 protection
information in February of 2015.
A little bit later, we got support for the controller memory buffer.
So that's a little piece of memory in a PCIe bar that the controller can expose
where you can place data or submission queue entries in.
And we have the basic infrastructure for that,
though so far only submission queue entries go in there.
A lot of people really want to use it for PCIe peer-to-peer transfers eventually,
but we're still missing the overall infrastructure for that higher up in the PCIe layer for discovery
and higher up in the I.O. subsystems to describe physical memory that's not mapped into the kernel virtual address space.
So there's a lot of heavyweight infrastructure work to do before we can make full use of that.
We've got support for a persistent reservation API.
So that's something I did for a pNFS layout
that was primarily targeted at SCSI
but has been extended to NVMe as well.
So applications can use ioctls
to do persistent reservations without doing weird SCSI
pass-through tricks that break more often than not.
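A minimal sketch of that API, assuming a placeholder device and key: register a reservation key and take a write-exclusive reservation through the generic ioctls in <linux/pr.h>.

```c
/* Minimal sketch of the block-layer persistent reservation API: register a
 * key and take a write-exclusive reservation on a namespace.
 * /dev/nvme0n1 and the key value are placeholders. */
#include <fcntl.h>
#include <linux/pr.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/nvme0n1", O_RDWR);
        struct pr_registration reg = { .old_key = 0, .new_key = 0xabcd1234 };
        struct pr_reservation rsv  = { .key = 0xabcd1234,
                                       .type = PR_WRITE_EXCLUSIVE };

        if (fd < 0 || ioctl(fd, IOC_PR_REGISTER, &reg) < 0 ||
            ioctl(fd, IOC_PR_RESERVE, &rsv) < 0) {
                perror("persistent reservation");
                return 1;
        }
        close(fd);
        return 0;
}
```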
And another really interesting thing is support for Apple devices.
And now people might say, well, it's NVMe, right? You should support every device. Well,
you haven't talked to Apple. So Apple came up with the idea of building an NVMe device where
they used the wrong class code, were big-endian instead of little-endian
as in the NVMe specification,
where you could not do 64-bit MMIO reads
but had to split them into two 32-bit MMIO reads,
and you could not reliably use a queue depth of more than one.
But all of that would have been easily caught
if they had just run the freely available
NVMe conformance test suite,
but apparently they didn't because it doesn't run on macOS.
So we had to work around that a little bit.
And the other thing is we got basic SRIOV support.
So the single root IO virtualization allows a PCI device to create virtual subfunctions,
which, if you're in Linux, are just like any other PCIe function,
so you can either use them in your hypervisor host or assign them to guests.
In other OSes, you can only assign them to guests.
I have no idea why they put in that arbitrary limitation.
But, yeah, so you can create little virtual NVMe functions, which at that point are really basic.
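For completeness, a hedged sketch of how those virtual functions get turned on from user space, through the generic PCI sysfs knob rather than anything NVMe-specific; the PCI address is a placeholder, and the device, firmware, and kernel all have to actually support SR-IOV.

```c
/* Minimal sketch: enable 4 virtual functions on an SR-IOV capable NVMe
 * controller through the generic PCI sysfs interface.  The PCI address
 * 0000:03:00.0 is a placeholder; requires root and hardware support. */
#include <stdio.h>

int main(void)
{
        const char *path =
                "/sys/bus/pci/devices/0000:03:00.0/sriov_numvfs";
        FILE *f = fopen(path, "w");

        if (!f || fprintf(f, "4\n") < 0) {
                perror(path);
                return 1;
        }
        fclose(f);
        return 0;
}
```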
Now that we have NVMe 1.3, as of a couple of months ago, there is actually a much bigger NVMe virtualization specification
that actually allows you to do useful things with these SRIOV functions and manage them in a vendor-independent way.
But we've not implemented
that quite yet. The next big change after the block-and-queue move, which was started early
last year, actually really at the end of 2015, was that we moved away from having the NVMe driver,
which we did before, to more of a subsystem model,
which is pretty similar to what we're doing in SCSI.
So the first thing we did was actually copy a method from SCSI,
and the method is that we started to
split out the upper level,
which fills out a little command structure
that describes the NVMe command we're doing, then uses a
pass-through layer, which then goes down to the actual driver
that deals with the hardware or transport, because we were
going to get some more of these transports pretty soon.
And the way we've done this in SCSI for the last, I guess it's more than 10 years, about 15 years, is that we've used the concept of a block layer pass-through.
So the same request structure that the block layer builds up
to pass the file system or application IOs,
we abuse a little bit and take it as a pass-through command
that has a pointer to the fully initialized command structure,
so SCSI CDB or NVMe submission queue entry,
and we can use all the existing
queuing infrastructure to handle both cases with the same piece of code.
And that allowed us to split out the little PCIe driver from the common NVMe code, and allowed
us to support the fabrics transports, which we're going to go to next.
And the other interesting thing it allowed us to do is support different I.O. command sets.
And now some people might say,
there is no other I.O. command set.
Bill hasn't got his key value store through yet,
even if he's talking about it.
Well, there is, it's just not in the standard.
So the Open-Channel SSD people, the LightNVM folks,
and Matias is not here, I guess.
Nope.
They've come up with their own NVMe command set for open channel SSDs,
and we actually have support for that in Linux,
where we do an FTL in the host and then inject the low-level,
very low-level NAND operations in specific non-standardized NVMe commands,
and that sort of structure has really helped with supporting that as well.
And, yeah, so now with NVMe over Fabrics, first we've done a move.
So before that, we had the NVMe driver in two little files in drivers/block, with all the
weird random block devices that are not SCSI or ATA.
And now that we were going to grow a lot more files in NVMe, we decided it's worth having its own directory.
And then we split out a new NVMe core module
from the existing NVMe module,
and that kept its NVMe name.
It didn't become NVMe PCIe
just for backwards compatibility reasons
because you can't just change people's driver name.
They'll get a little upset.
And if you're on PCIe, you'll just need these two little modules. So you've got your
low-level PCIe driver and you've got a common shared core code. If you're on Fabrics,
there's actually another layer inserted because we have a common Fabrics library. So there's a
fair amount of code that's common between all the Fabrics transports and not shared by PCIe.
So we've got another little module for that.
And all of that went on for a couple months and finished in June 2016
when we published the NVMe over Fabrics code,
basically the day after the specification went public.
We've been working on that quite a bit.
And if we look at Linux 4.12, let's move that away, it's a little loud.
So we've got all these different bits of the NVMe driver.
And the funny thing, it turns out, the biggest part is not the core driver, but the NVMe over Fabrics fiber channel driver, which we got late last year.
And it's slightly bigger than the actual real core code,
which is about 4,000 lines of code.
Then we had another optional bit in the core driver,
which is the SCSI translation, which is not much smaller
than the actual core NVMe code,
despite not actually doing anything
but translating around a bit.
We got rid of that in Linux 4.13,
so I wanted to have 4.12 so we can see the numbers.
Then we've got a little bit of optional code
for the open-channel SSDs,
which if you pull it in, it's in the core module.
We've got a pretty tiny, again, fabrics library
at just above 1,000 lines of code.
And we've then got the PCIe driver,
which is about 2,300 lines of code.
It's still pretty small for a Linux driver.
And the RDMA driver, which is even smaller.
So all in all, it's still a pretty tiny subsystem.
In fact, if you count lines of code, even if we include our
NVMe over Fabrics target, so the controller-side implementation, all our NVMe-related code is still smaller
than the OFED Windows PCIe driver.
And well, yeah, as I said, in the meantime the family's grown a little bit.
So we got the fabric support first posted in June 2016,
then merged in July.
We got the fiber channel support in the end of last year.
And well, the next interesting thing
is that we put a little lock on your little NVMe devices
by getting TCG Opal support, which
supports authentication for the device
and full disk encryption.
We got that code from Intel, so now you can use your TCG-enabled devices with Linux out of the box. No more weird user-space tools that generate initramfs files or whatever.
Just a proper unlock on resume from suspend-to-RAM and suspend-to-disk,
which is pretty important if you want to have your system come back with a working disk.
And the interesting thing, or maybe not interesting if you're used to Linux development, is that
of course we didn't add all of that code to the NVMe driver, but we have a common little
module in the block layer that contains all the low-level TCG Opal functionality. And we have not even 50 lines of code to wire it up in NVMe.
And a couple months later, I did the same wire up for ATA disks.
So the same code will now work not just on your NVMe device,
but on your SATA device as well.
And in theory, also on your SCSI devices,
except that as far as I know, there's no single SAS or fiber
channel or whatever device that actually implements TCG Opal. And well, so the next big interesting
thing is optimizing things. So if we look at the traditional I/O flow of any driver, and thanks to
Damien Le Moal, from whom I stole these beautiful graphics and a few more later on,
the traditional way we execute block I/O or network I/O or any I/O is interrupt-driven.
So we go down the whole stack, submit our command, let the device execute it, and let the device generate an interrupt when it's done.
And then we get a context switch, and we jump all the little layers up again.
Turns out that with NVMe and a fast enough device,
this was starting to be a bottleneck.
So we had optimized a lot in the block path,
and somewhere around the time when we did the blk-mq conversion,
we optimized even more of it,
and we needed to figure out what we could do instead.
So what we looked for instead was to introduce a polling mode.
The networking people have been doing some of that.
User-space drivers have been doing some of that.
So in Linux 4.4, we got the initial polled I/O mode.
So Stephen talked a little bit about that in his talk today,
so you might have seen some of that.
So we've got a flag and a couple of new system calls that are basically extended versions of read and write.
We keep adding more of those.
I added two generations of them, by the way.
There you tell it it's a high-priority command, and if it's a high-priority command, we're going to poll for it. And in this very first version, we started polling as soon
as you submitted the I/O to make sure we don't ever lose any of that I/O.
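As a hedged example of that interface: an O_DIRECT read submitted with preadv2() and the RWF_HIPRI flag, so the completion is polled for rather than interrupt-driven. The device path is a placeholder, and the queue needs polling enabled (the io_poll sysfs attribute) for this to do anything interesting.

```c
/* Minimal sketch of the polled-I/O path from user space: an O_DIRECT read
 * submitted with preadv2() and RWF_HIPRI.  Assumes /dev/nvme0n1, a kernel
 * and glibc new enough for preadv2, and polling enabled on the queue. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        if (posix_memalign(&buf, 4096, 4096))
                return 1;

        struct iovec iov = { .iov_base = buf, .iov_len = 4096 };
        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

        if (fd < 0 || preadv2(fd, &iov, 1, 0, RWF_HIPRI) < 0) {
                perror("preadv2(RWF_HIPRI)");
                return 1;
        }
        close(fd);
        free(buf);
        return 0;
}
```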
So with the beautiful DRAM-based NVMe card I have
in my test lab, we got our
latency for 512 byte reads down for
6 microseconds to four and a half. And also got
a lot less jitter if you look at it. So that was pretty nice. The only really major downside of
this scheme is you're always using 100% of your CPU while you're polling, which is
good for Intel because they can sell you more CPU, but it's not really very efficient.
So a couple of releases later, we got a scheme called hybrid polling.
And basically what we're doing is, after the I/O submission, we don't start polling right away, but only a bit later.
And that sounds very easy except for the fact that you need to figure out when do we start polling. So the big part of that was building an infrastructure that tracks your average I.O. completion times
and starts polling when we're halfway there in the first version.
And that was pretty nice because it drastically reduces the CPU usage.
So we went from 100% CPU usage,
if we go to the adaptive hybrid polling scheme,
which is the one we ended up with,
down to 58%
of the CPU used for polling,
because we only start polling a little later.
If you look at the latency graphs,
we're actually getting exactly the same performance as before.
It turns out this is still not very optimal, but it's a very safe first guess because you're very unlikely to miss any of that I.O.
unless your I.O. sizes are drastically different.
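The sketch below is just the hybrid polling idea boiled down, outside the kernel: keep a running estimate of the completion time and delay the start of polling to roughly half of it. The real implementation lives in the block layer's poll statistics code and is more careful than this.

```c
/* Conceptual sketch of hybrid polling: track a running mean of completion
 * times, and only start busy-polling after sleeping for about half of it. */
#include <stdint.h>
#include <stdio.h>

struct hybrid_poll_stats {
        uint64_t mean_ns;               /* running mean completion time */
};

static void update_mean(struct hybrid_poll_stats *s, uint64_t sample_ns)
{
        /* simple exponential moving average as a stand-in for the kernel's
         * per-bucket statistics */
        s->mean_ns = s->mean_ns ? (7 * s->mean_ns + sample_ns) / 8 : sample_ns;
}

static uint64_t poll_delay_ns(const struct hybrid_poll_stats *s)
{
        /* first version of the heuristic: start polling halfway to the
         * expected completion; 0 means "poll immediately" until we have data */
        return s->mean_ns / 2;
}

int main(void)
{
        struct hybrid_poll_stats s = { 0 };

        update_mean(&s, 6000);          /* pretend a 6 us completion */
        update_mean(&s, 5000);
        printf("sleep %llu ns before polling\n",
               (unsigned long long)poll_delay_ns(&s));
        return 0;
}
```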
So the next thing is that Stephen, who gave the talk before this one, added a new mode where we start polling just before the expected I/O completion time.
And who would have guessed that that provides a lot more savings on CPU cycles,
almost down to the same amount of CPU usage as the interrupt-based one.
So I'm really excited about that.
But at the same time, we'll probably have to improve our I.O. completion time estimation
to be bucket-based for different I.O. sizes and so on and so on
for this to become production-ready.
And the other interesting polling thing that we don't have in the kernel yet,
which Sagi, one of our co-maintainers, has been working on,
is an optimistic polling.
So this polling is interesting if you've got someone synchronously waiting for your I.O.
So your user space database is waiting for exactly that I.O.
But if we're using Linux as, say, an NVMe or SCSI target device
where we get RDMA ops in and NVMe ops out,
we don't really want to poll for a specific I.O.
because there's lots of them.
So we came up with a scheme where we have a thread
that polls the completion queues of both the RDMA device
and the NVMe device and just reaps I/Os as long as it can
and then sleeps for a while and goes back to it,
which still allows us to avoid all the interrupts
and do pretty efficient processing of the I.O.s.
It just turns out fine-tuning this is pretty hard,
so we don't have it in yet,
but we're looking at that for one of
the next kernel releases.
Well, in the meantime,
over the last year the driver's been getting older,
getting fatter, lots of exciting
new features.
The one I really liked was the
range deallocate support. So if you look
at the NVMe deallocate command
or the ATA trim command or the SCSI UNMAP command,
they all don't just contain a single LBA range
that is unused now, but they contain a few ranges.
Each of them has a slightly different format,
but the concept is the same.
And it turns out the Linux bio structure
that the file systems use to submit I/O
is very much built around the fact that it contains one start offset and one length.
So that concept didn't fit in very well.
Well, our blk-mq merging layer came to help.
So I did a little hack so that such a request can have multiple bios,
and in that case we can use each bio for one range, which provided really big speed-ups to applications that do a lot of
deallocate operations.
So that was really nice.
On well-built devices, there's basically no performance
overhead of that anymore.
On other devices, not looking at Intel, it's still pretty
severe because their de-allocate performance is so
bad.
So it still depends if you want to use it or not
because it's device dependent.
Then, now that NVMe is moving into the client space,
so how many people here have already got
an NVMe device in their laptop?
How many of you are running Linux on it?
Awesome.
Yeah, once you've got an NVMe device in your laptop,
you start caring about power management.
You probably don't in your server,
but the one exciting feature that's in NVMe
is the autonomous power state transition support.
So you basically give the device a little table
about which power state to enter
after it's been idle for how long,
and then the device is going to do automatic transitions
between those to make sure it's always using as little power as possible,
but it's still available with not too much latency.
We added that support in early 2017. It turns out
a lot of devices broke because of that.
People were only testing it with the Intel Windows RSTe driver, which had really
weird policies for it. Once we used Linux,
things started falling apart.
So the worst combination was Samsung devices and Dell laptops,
which was basically a guarantee for breakage. It might have been part of their PCIe root setup as well.
And, well, now we have a blacklist of a couple devices
where we can't use that,
but the other ones are really happy about it.
And then we got another feature I did, the host memory buffer support.
So after we had the controller memory buffer two years earlier, we now have a host memory
buffer.
And that name sounds very fancy for having a really cheap device that doesn't have any
or very little DRAM and wants to steal it from the host.
So what the driver basically does is it allocates a big chunk of memory, gives it to the device, gives it a DMA
address for it, and says, here, you can
use that poor device.
Your parents didn't even give you any DRAM.
Well, let's have some from me.
And this was just the catching up on the old features.
And once we got NVMe 1.3, we started adding all the exciting new features.
And some of them we even added before NVMe 1.3 went public, because the way NVMe works,
once a technical proposal, which is the individual bit that gets into the new standards version, is
ratified, we're allowed to implement it. So the first one we got was the doorbell buffer config feature,
which sounds a little weird and might make more sense
if you have the other name that was proposed for it,
which was para-virtualized NVMe.
So if you've got a virtual NVMe device in your hypervisor
that's exposed to the guests,
it's a very nice idea because every modern OS has an NVMe driver
and things will just work.
It just turns out that any sort of traditional device that does a lot of MMIO operations,
so memory-mapped I/O operations,
leads to pretty bad performance, because every MMIO write is a trap from the guest into the hypervisor.
So to get good performance, you want to avoid that by just using a piece of memory instead.
And the one scheme that really works well with that is virtio, the Linux transport for virtualized drivers.
And Linux people love it.
It just turns out other people don't.
So for Windows, you will always need a third-party driver and it's a pain or whatever. So some people at Google came up with this really cool idea
that they're going to take the event buffer mechanism from virtio and retrofit it into NVMe.
And that gives really nice performance numbers.
It's just they sent a patch for it at a point where it wasn't standardized.
We were like, uh-uh, go to the NVMe working group.
We want to have a standardized version of that.
It's a cool feature, but as a random, underspecified vendor extension, I don't think we can have
that in the driver.
So it took about another year, and we got a technical proposal in NVMe that standardized
it, and we've got it supported now.
So that's pretty cool.
The next one is we've got UUID identifiers, so this is the
fourth generation of identifiers in NVMe. NVMe 1.0 did not have any sort of global unique identifier
for the namespace, so if you're coming from the SCSI world, that's the device identification
VPD page, and it just wasn't there. So NVMe 1.1 added a 64-bit IEEE-assigned identifier,
which solved the problem, but it's generally still way too small
for the typical use cases.
So NVMe 1.2 added a 128-bit version of it.
It just turns out that the IEEE identifier is not very good
for software-defined storage
because it requires a vendor prefix
and then a statically handed-out number inside of that.
And if you're just shipping a Linux-based controller
that random people can then configure,
it's A, very hard to get a vendor prefix,
and B, very hard to make sure
people don't accidentally use the same one.
So what SCSI did a couple years ago
is they added a new UUID identifier,
which uses the RFC4122 randomly generated UUID
as an identifier,
and we added that to NVMe as well
and built a little bit of infrastructure
that when we have to add another identifier type,
we can reuse it for that.
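For illustration, generating such an RFC 4122 identifier takes only a few lines; this standalone sketch makes a random version-4 UUID of the kind you could hand to a software-defined target as a namespace identifier.

```c
/* Minimal sketch: generate an RFC 4122 version-4 (random) UUID, the kind of
 * namespace identifier a software-defined NVMe target can use.
 * Reads /dev/urandom directly to stay dependency-free. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint8_t u[16];
        FILE *rnd = fopen("/dev/urandom", "rb");

        if (!rnd || fread(u, 1, sizeof(u), rnd) != sizeof(u))
                return 1;
        fclose(rnd);

        u[6] = (u[6] & 0x0f) | 0x40;    /* version 4: random */
        u[8] = (u[8] & 0x3f) | 0x80;    /* RFC 4122 variant */

        printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
               "%02x%02x%02x%02x%02x%02x\n",
               u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
               u[8], u[9], u[10], u[11], u[12], u[13], u[14], u[15]);
        return 0;
}
```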
And then we got support for the streams feature,
which we're actually not really using as streams.
So it turns out we build an infrastructure
where the application can tell the kernel
if it's hot data, cold data, medium cold data,
really cold data.
And we're passing that down the block I/O stack
and then into the NVMe driver all the way down,
where we'll try to allocate a few streams and map the hints onto them,
which seems to work okay for some people,
but it would have been much easier
to just pass these hints on to the device.
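A minimal sketch of that interface from the application side: mark one file as short-lived and another as very long-lived with F_SET_RW_HINT, and let the kernel map the hints onto streams if the device has them. The file names are placeholders, and it needs a 4.13+ kernel plus a libc that exposes the constants.

```c
/* Minimal sketch of write-lifetime hints: short-lived data for a log file,
 * very long-lived data for a backup image.  Needs a 4.13+ kernel and a libc
 * exposing F_SET_RW_HINT / RWH_* (otherwise take them from <linux/fcntl.h>). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int set_hint(const char *path, uint64_t hint)
{
        int fd = open(path, O_WRONLY | O_CREAT, 0644);

        if (fd < 0 || fcntl(fd, F_SET_RW_HINT, &hint) < 0) {
                perror(path);
                return -1;
        }
        return fd;
}

int main(void)
{
        int log    = set_hint("journal.log", RWH_WRITE_LIFE_SHORT);   /* hot */
        int backup = set_hint("backup.img",  RWH_WRITE_LIFE_EXTREME); /* cold */

        if (log < 0 || backup < 0)
                return 1;
        close(log);
        close(backup);
        return 0;
}
```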
Well, I mean, we had a bit of a big fight in the committee,
and the thing that we're doing in the Linux API is something we Linux people would have loved to just pass to the device,
while the people that argued for streams were like,
well, it's not just about hot and cold.
We want to separate different data streams.
But we can't really use that easily
because it's very hard to pass that
sort of stream ID all the way through the I.O. stack, through all the remappers and
file systems.
So we ended up basically implementing the scheme we wanted anyway and translating between
the schemes at the lowest level of the stack.
And, that's the wrong way on the slides, and last but not least is my favorite current project,
multipathing.
Starting in NVMe 1.1,
NVMe gained the feature that
your controller is not your whole device anymore;
the whole device is your subsystem, and then the subsystem has controllers.
So basically you can now build one NVMe device
that has two PCIe connectors.
Both of them are called a controller, just to confuse everyone who thinks of that PCIe
device as the controller, but the whole thing is the subsystem.
And you can access both of these controllers independently, either from two systems, which
would be the traditional shared storage use case, or you can access it in two different ways from the same system.
And the typical use cases for multipath access are: aggregating the bandwidth over multiple connections,
while that doesn't really make sense on PCIe, where you can just add a couple more lanes;
or you are in there for redundancy, which you're typically doing for network-attached storage,
and which doesn't really make sense if you've got a single little device that sits inside of your computer and just has two different PCIe connections;
or you can do it for locality of access, which makes a little more sense.
Let's say you have a dual-socket system and you want a PCIe connection to each of them
and make sure your I.O. stays local and doesn't go over the interconnect.
But in general, the PCIe NVMe multipath thing
didn't get all that much use.
It's a bit of a fringe feature that's supported
by some high-end enterprise controllers,
but has very few users.
It became interesting with NVMe over Fabrics,
where we've now got network storage
and the whole redundancy and load balancing
becomes interesting. And that
always makes me think of driving from San Francisco to San Jose, when you're at that split of 101
and 280, and it's like, which one do you take? You'll need the most up-to-date information about which of
your paths is actually going to be the good one right now. And that's what a TP that's currently in progress in NVMe is trying to solve, for NVMe, not for
101. And it's called ANA, asymmetric namespace access, which, as the name suggests, is the
logical equivalent of ALUA in SCSI, for all these people that love dealing with SCSI arrays.
And the idea is that for each controller and namespace pair, the controller is telling the host:
is this an optimized path?
Is this a path that works but is not optimized?
Is this a path that's currently offline and can't be used at all?
And we're getting notifications for that,
and then the host will decide,
am I going to route it through my active path
right to the storage device?
Am I going through the other path where we might have to do some I.O. shipping on the other end,
but it will still work, or do I better not send that I.O. at all?
And if we look at the existing SCSI multipathing in Linux,
we've basically got a layer on top of SCSI that's called Device Mapper Multipath.
It's a little module, stackable block driver module,
that decides which path are we going to send things down.
And it's a little bit messy because the information
about the state comes from the SCSI layer,
so we have to communicate that up to the other layer.
And to make it even more complicated,
a lot of the decision making and path probing
is actually in a user space daemon,
so there's a lot of interaction
between different components that should be separate.
And the device handlers are a modular piece
of the SCSI layer, yeah,
so I could have made this even more complicated.
What I'm trying to do in NVMe
is to pretty much cut all of that out. So once we've
got the NVMe driver, we know which namespace IDs on the controllers refer to the same thing,
because NVMe is relatively strict about the identifiers, not perfect, but way better than
SCSI. And the way that ANA is built, it builds on the typical NVMe concepts for that, like asynchronous event
notifications, log pages that are all really nicely handled by the driver. So the driver always has
the full picture of what's going on. And the other nice thing in NVMe is NVMe doesn't do
partial completions of IOs, unlike SCSI. So a command is either done or not. And because of that, we can actually literally bounce I.O.
between the different NVMe controllers with no measurable overhead.
It's just another 60 x86 instructions and two more cache lines that we're touching.
So it's way better than using device mapper multipath on top of NVMe,
which some people are trying to do.
And this adds up to 5 to 6 microseconds.
And if you looked at my previous numbers,
if you're polling, that's the time we need
for the whole I.O. in NVMe right now.
So it's basically doubling your excess latency.
And the other really nice bit is
you don't need to do a setup in user space
with a config file and udev rules.
The kernel driver just creates another node
that gives you access to that namespace,
by its unique identifier, through any of the available paths.
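To give a feel for how little code the path selection itself is, here is a rough standalone sketch, not the in-kernel data structures: prefer an ANA-optimized path, fall back to a non-optimized one, and never pick an inaccessible one.

```c
/* Rough sketch of the native NVMe multipath path-selection idea: for a
 * namespace reachable through several controllers, prefer an "optimized"
 * ANA state, fall back to "non-optimized", skip inaccessible paths. */
#include <stddef.h>
#include <stdio.h>

enum ana_state { ANA_OPTIMIZED, ANA_NONOPTIMIZED, ANA_INACCESSIBLE };

struct path {
        const char    *ctrl;    /* e.g. "nvme0", "nvme1" */
        enum ana_state state;
};

static const struct path *select_path(const struct path *paths, size_t n)
{
        const struct path *fallback = NULL;

        for (size_t i = 0; i < n; i++) {
                if (paths[i].state == ANA_OPTIMIZED)
                        return &paths[i];       /* best case: use it */
                if (paths[i].state == ANA_NONOPTIMIZED && !fallback)
                        fallback = &paths[i];   /* works, may need I/O shipping */
        }
        return fallback;                        /* NULL if all paths are down */
}

int main(void)
{
        struct path paths[] = {
                { "nvme0", ANA_INACCESSIBLE },
                { "nvme1", ANA_NONOPTIMIZED },
        };
        const struct path *p = select_path(paths, 2);

        printf("routing I/O via %s\n", p ? p->ctrl : "nowhere (queue or fail)");
        return 0;
}
```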
And I'm pretty much done now.
There's a couple nice references, but...
And would you use that only for fabrics?
It's pretty much only used for fabrics.
So it will work for PCIe too, but the main use case is fabrics.
Well, you should not use both at the same time,
because your upper layer will never see the other paths then.
But, I mean... Yeah, and we had a lot of arguments about that.
It just turns out there's basically no code
to the multipathing, right?
So the big part about multipathing
is just discovery of the topology
and reacting to the events for topology changes.
And that is completely transport-specific.
That's what we have, say, in the device handlers in SCSI right now.
And the part that device mapper multipath does right now is basically it clones the request.
It clones the BIOS.
It does a lot of memory allocations to deal with the fact that SCSI can do partial completions,
and we have to partially retry it.
But the actual implementation of doing a path selection
and a failover, it's less than 100 lines of code,
so I have no problem with having that duplicated.
And a couple of helpers to deal with that go in the block layer.
So there is common code, but in general,
there just isn't much code.
The whole multipath implementation is less than 2,000 lines of code, out of which most
is discovery.
We'd still like to get to the plot version.
Do you guys think the same way?
Yeah.
And I mean, SCSI, as I said, is a little more complicated, mostly because of partial completions,
also because historically people did vendor-specific multipathing implementations.
So before ALUA came out, there was an EMC way of doing it, an HP way of doing it, an LSI Logic way of doing it,
and we had to support all of that.
Well, for NVMe, we've already told Huawei, it's like, get lost, join the NVMe working group
and work on ANA. We don't want a vendor-specific way of doing it.
Well, it's not going away, right? People have their existing SCSI setups, but I don't want
people to use it on NVMe, and I'll make sure it won't be usable with ANA because we're not going to expose the information to upper layers.
And for, especially, you know, assuming we find some big insights, we can get to work on this.
We'd like to do it just for a little bit, and we use all these things for some abstract.
Yeah.
And we can't break people's setups on an upgrade, right? Yeah.
Yeah.
Just for the shiny, cool new stuff,
use the shiny new cool code,
and for your legacy setup, you're stuck with what's there. How do you see that flags field in the new pread and pwrite developing over time?
How would you like to see it develop?
Well, so, I mean, it's a flags field,
so we have a couple of other flags in there that are pretty exciting
and could be a talk on their own.
But the whole idea is, in a typical application,
you have foreground and background I.O., right?
So think of a simple log-structured database
where your main workload is writing to the log.
That's high priority.
You really care about your transaction commit
because you're going to reply to the network
and your records need to be on disk.
And then you go into the cleaning phase,
where you can just do background operations.
So your active main logging thread
will have the high-priority bit set
and poll, and the other one will just run in the background, interrupt-driven.
I was thinking, like, for caching and streams and other stuff,
do you think that same thing would be able to be used for that?
So an initial version of our lifetime hints actually abused fields in that bit.
We decided on a different interface in the end
that is a little more general.
But yeah, we could use more fields for that.
We just want to have reasonable semantics before.
We don't want to throw random special cases in there
that keep accumulating.
How do you manage all the quirks and bugs?
Let me open the source file.
Okay.
The question is, will any vendor that comes along with every small bug
be putting the fix here?
Well...
I was thinking you have this register over there
instead of over here, sorry.
Well, I mean, it depends, right?
I mean, right now, the quirks aren't all that bad.
So we've got the Intel quirks.
They're actually not really quirks.
They're just behavior that's beyond the standard
that they really want to have.
So they want I/Os not spanning a certain boundary; we finally actually got that standardized in NVMe
1.3, it's a trivial little 16-bit field in the identify data. They guarantee they're zeroing
out everything if you do a deallocate, which is not guaranteed by the standard. Then, well,
we've got a couple devices that can't enter the deepest power state,
and we've actually got another quirk list
that's not based on PCI IDs,
but the identify string for that
because all the Samsung devices,
a lot of Samsung devices have the same PCI ID
but actually behave differently.
We have a QEMU virtual thing that crashes
if you use a new identify command that was only added in NVMe 1.1.
We've got a few adapters that need a little delay after reset before you're checking they're ready.
We've got the completely fucked up Apple devices.
So we've got a new Samsung device that you can't drive with a high queue depth.
So it keeps accumulating, but it's nothing new in Linux, right?
The AHCI driver looks much worse,
and so do the USB storage drivers,
or SCSI for that matter.
So we've been doing that for a long time.
And I mean, at some point,
we might say this drive is too broken.
We're not going to support it
or we'll have a copy of the driver.
But so far, it's a little bug here and there
that can be worked around.
But then there are also things that they can't fix, right?
So we'll obviously implement support where there is a workaround. Yeah.
Yeah, but I mean this is... For the disks and tapes and CD-ROMs.
Yeah.
And I mean the same, I mean, look at AHCI.
AHCI is the same model as NVMe.
It's a common spec with a common PCIe class code
and, like, five gazillion vendors implementing it.
And the quirk list in AHCI is two pages long.
And the NVMe one will grow to that, too.
And does it impose more conditionals in the code?
It does, but we try to keep them out of the fast path in general.
Yeah.
Yeah.
Well, if we had to do something for the fast path, life wouldn't be so fun.
We'd probably have a different version of the fast path routine that we select at setup time,
but I hope we don't have to go there. For AHCI, we didn't. The AHCI quirks are still all in the setup path. There's tons of them, but they are out of the fast path.
Do you respect that?
Well, a lot of the time it's not PCI spec, but buying an EMA and getting the IP from
somebody that didn't know what they were doing.
Yep.
So, thanks everyone.
Thanks for listening.
If you have questions about the material
presented in this podcast,
be sure and join
our developers mailing list by sending an email to
developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with
your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.