Storage Developer Conference - #118: Linux NVMe and Block Layer Status Update
Episode Date: January 28, 2020...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual
Storage Developer Conference.
The link to the slides is available in the show notes
at snia.org/podcasts.
You are listening to SDC Podcast
Episode 118.
Thanks for joining me.
I'm talking about the Linux NVMe driver and block layer status update.
I gave a pretty similar talk two years ago here at SDC,
so I'm trying to mostly not get into the older history that we were looking at there.
There's a few references.
Back in September 2017, we were talking about Linux 4.13, which is a couple of releases old, obviously.
Just this month, we got a new Linux 5.3 release.
That's about where we are here.
A couple more recent developments as well.
And going back to that old talk, this is kind of what we mean when we say the Linux NVMe driver.
It's not really a single driver anymore.
It's really a driver stack.
Got a core module down in the lower left corner.
This is like completely generic NVMe code that you use over PCIe, over the various network transports. And then for PCIe, we have the driver that's just called NVMe for historic reasons,
because that's how NVMe started out.
It was all just PCIe, so it's not called NVMe-PCI or NVMe-PCIe,
because we had to keep compatibility.
And then two years ago, we had two transport drivers for the so-called NVMe over Fabric spec,
RDMA and Fiber Channel sitting on top of a shared NVMe over Fabrics module.
And if we really want to put this into the bigger picture,
there is an amazing graph that doesn't even fit on my slide.
So I've taken their diagram directly.
So the props go to my friends and engineers at Thomas-Krenn,
who are a German-slash-Austrian systems integrator
who've been doing great work on these diagrams
and updating them regularly.
So we're talking about the I-O stack. This is kind
of where the I-O originates. It's like the various file systems, block-based, where you
really think of file system, network file systems, or I-O directly on the block device
nodes. And then we go through a couple remapping layers in Linux, like software RAID, volume
management, yada, yada. I'm not going to get into the details here. And then we enter what's really the core of the block layer. And the interesting
part, and both me and other people used to talk a lot about it in the last couple of
years, was the blk-mq layer, the multi-queue block layer, which is really essential to
the functioning of the NVMe driver as we have it today.
And the big news, and I think it was Linux 5.1, a couple of releases ago, is that we
finally got rid of the legacy code base that you see here on the left, like the classic
IO schedulers, the classic request structure, which was old code with a lineage back to
the early 90s, which had huge, huge scalability problems.
That's why we never used it with NVMe
and finally moved SCSI and other protocols over.
And our block layer maintainer, Jens Axboe,
did a lot of work converting random weird drivers
very few people use over to the new infrastructure,
like the Amiga floppy driver and stuff like that.
So that we could finally get rid of the old stuff.
And then once you're down there,
this is where all the drivers get into.
And because this diagram is still a little older,
it's just a single NVMe driver,
but just used it as a cross-reference
to the other diagram I did.
And what's kind of important and interesting, even if it's off topic here, and what people
coming from, say, the Windows or VMware world find interesting, is that NVMe is in no way
under the SCSI layer. So we have a SCSI layer for weird historic reasons. ATA actually sits
underneath it with a translation layer, despite us hating it. It wasn't even always like that.
But that's a totally separate discussion to be had over hard
liquor. But NVMe sits beside it, with no SCSI anywhere,
and we're really glad about that.
And in this code structure, what's new is basically that we got two major new, or three-ish major
new pieces of code since then.
One is a new NVMe over Fabrics TCP transport.
It's just another of these network-based NVMe transports.
One that a lot of people find very exciting. Not really because it's the fastest or the
most advanced or anything, but it doesn't really require
any special hardware. It runs on every network card
you can imagine, because TCP IP is just everywhere.
So it's really exciting to drive NVMe into
spaces where before people would have to deal with
expensive RDMA or Fibre Channel hardware and
associated switches and new knowledge and so on and
so on. And there are a couple additions to
the core code, and there's actually a third one
that's pretty small but useful, I forgot here.
The big one is multipathing, the support to access
an NVMe namespace through multiple controllers in a, well, fast and well-defined way.
We'll have a couple slides on that soon.
The other is that we now added NVMe-specific tracing support.
And the one that's missing here is that we actually gained support for failure injection
so that we have better ways to test all kinds of error handling paths
that sane hardware should never trigger, but insane hardware,
unfortunately, does,
and not everyone can have every piece of insane hardware.
And with those additions, with the TCP one,
we just got another new driver, basically,
that is equal to RDMA and fiber channel
in that it uses the shared fabrics helper
in addition to the
core NVMe code and this is kind of how this looks in the actual code base.
So we've got our core.c, which is really most of the core code.
It's a pretty big chunk of code.
The fabrics library, including its header, is pretty tiny,
and the fault injection code the same.
Fiber channel is actually pretty gigantic,
and that's despite sucking in a lot of other layers of code, too,
and having some of the biggest drivers in the kernel
only dwarfed by the GPU drivers,
which are basically each of them two OSs and a couple
frameworks. And we've got the LightNVM stuff, which is basically early-ish open channel
hardware support, which we're actually kind of hoping to maybe eventually get rid of because
it didn't work out too well. The multipath code, a common header,
PCI, which is actually pretty big,
but the PCI NVMe driver actually drives hardware,
so it's really a low-level hardware driver,
unlike the Fabrics drivers, which are just protocol drivers,
which, after a couple layers of abstraction,
eventually end up with a real hardware driver underneath
that we're not counting here, like the RDMA HCA drivers, the fiber channel HBA drivers, and so on.
RDMA is next: again, not tiny, but actually pretty small for a transport. TCP is
a little larger, then the tracing code, and altogether it's not even 20,000 lines of code. So we're still pretty lean compared to some others.
I think last time I compared it to the Windows OFA driver,
which even just for basic NVMe PCIe functionality
was about twice the size of what we do for all protocols.
Okay, so multipathing, the big new thing,
except for TCP, but TCP, while interesting,
I can't really explain in much detail here.
So as I mentioned before, in NVMe, just like in SCSI,
unlike some simpler protocol like ATA,
you have the concept that there are multiple paths, multiple ways to access
a namespace in NVMe terms, in SCSI it would be logical unit, the concepts are pretty similar. And the interesting thing in NVMe is that NVMe actually had a very tight architecture
of what really is valid as another access path.
So in SCSI, a lot of it is just,
yeah, maybe it's going to work.
But NVMe has this concept of a subsystem.
Controllers are part of a subsystem.
We can actually check for a subsystem NQN identifier
to make sure it really is the same thing.
And when doing the ANA protocol later,
we even further tighten up some of these requirements
so the host can actually be relatively safe, or actually is safe, binding to controllers
that have the same namespace, that it really is the same one and not different ones, which
used to be a problem in SCSI in earlier days. And as I mentioned, the thing that actually makes the multipathing interesting, at least
for fabrics or more complicated setups, is a protocol extension called ANA, asymmetric
namespace access.
And those who have been around in the SCSI world, the naming similarity to ALUA and SCSI
is not coincidental.
It's modeled after that with a lot of lessons learned,
and simplified in many ways
to only allow one way to do it
instead of many ways to do it,
where we finally had to agree on one.
And we allow multipath access not only with ANA,
because, for example, if you're having a PCI NVMe controller, a lot of the more expensive
enterprise controllers actually are dual ported. So you have two different PCI ports to access
them and sometimes people use it for failover. But one interesting use case that we specifically
had in mind here is to connect it to NUMA sockets. You have a dual socket system with very high cost
of going across the nodes to the other socket.
And in that case, you can just have your NVMe device
attached to both of them, and you're actually
doing local I/O access instead of going over
the interconnect that might slow you down.
And one of the things that's really interesting,
especially in this PCIe NUMA case
but also in highly optimized RDMA environments
is that we actually really, really do care
about IOPS, about latency, about these
modern performance characteristics
and not just maxing out the bandwidth as the old array people
really like to do with their multipathing implementations
where people were happy as soon as you could fill up the wire.
And while we're talking about these older multipathing architectures, so what Linux
does in the SCSI world is the SCSI layer itself has almost no logic related to multipathing.
I say almost because there's actually some logic
to detect what we're dealing with,
and something we call a device handler,
but it's all actually driven by a middleman
called the device mapper multipath code,
which is a kernel component in the actual IOPath
and a daemon in user space that manages it.
And it's, some people said it's the biggest pain of enterprise Linux deployments.
I'm not sure it is the biggest, but it's definitely up there.
And because your device nodes change if you're on a single path or multiple paths,
you've got that daemon to deal with.
It's a lot of hassle.
So for NVMe, both because of the performance reasons
and the manageability reasons,
and the fact that we have a very tightly integrated spec,
decided that we're actually going to do
a new multipathing implementation
that's part of the NVMe driver.
And as you've seen before,
I mean, the actual .c file is 700-something lines of code.
There's little stuff in headers and other files.
But it's a very, very small addition.
And it just shows up transparently.
So your block devices referring to the namespaces
will just use all the available paths
without you doing anything.
And the pathing decisions are based on the ANA state.
So if the controller, well, the subsystem through the controller,
tells you a path is not available or not optimized,
we're not going to try to use that when we've got an actual optimized path.
Then, we try to use the NUMA proximity,
and then, for people coming in kind of from the Fibre Channel-ish background,
we later added a round-robin mode.
And in many ways, round-robin is not a very smart idea
if you have high IOPS storage where you have cache line misses
every time you hit another path.
But they still wanted to use up the bandwidth of their dual-ported card
in a single PCIe socket, and that's kind of the only way to do it.
So I'm not happy about the round-robin,
but there are use cases, and people like it.
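To make the path-selection logic a bit more concrete, here is a minimal sketch, not the actual kernel code: the ANA group state names follow the NVMe specification, while the path structure, the NUMA distance field, and the ranking are made up for illustration.

```c
/* Illustrative sketch only, not the in-kernel NVMe multipath code. The ANA
 * group states follow the NVMe spec; the path struct, the NUMA distance
 * field and the ranking are made up for illustration. */
#include <limits.h>
#include <stddef.h>

enum ana_state {
	ANA_OPTIMIZED,
	ANA_NONOPTIMIZED,
	ANA_INACCESSIBLE,
	ANA_PERSISTENT_LOSS,
	ANA_CHANGE,
};

struct path {
	enum ana_state state;
	int numa_distance;	/* distance from the submitting CPU's node */
};

/* Prefer optimized paths, break ties by NUMA proximity, and only fall back
 * to non-optimized paths when no optimized one exists. */
static struct path *pick_path(struct path *paths, int npaths)
{
	struct path *best = NULL;
	int best_rank = INT_MAX;

	for (int i = 0; i < npaths; i++) {
		int rank;

		if (paths[i].state == ANA_OPTIMIZED)
			rank = paths[i].numa_distance;
		else if (paths[i].state == ANA_NONOPTIMIZED)
			rank = 1000 + paths[i].numa_distance;
		else
			continue;	/* inaccessible, lost, or changing */

		if (rank < best_rank) {
			best_rank = rank;
			best = &paths[i];
		}
	}
	return best;
}
```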
And I've seen some numbers from people I've worked with
at Western Digital for NVMe over Fabrics
where we're seeing six times better IOPS
with the NVMe multipath
compared to using device-mapper multipath, even on NVMe,
while actually using less CPU cycles,
because device-mapper multipath was CPU cycle-bound
in that use case.
So the other interesting bit,
and I think I'm just going to jump ahead there,
is the tracing.
So Linux has a pretty nice trace event framework
where you have static trace points
inserted in functions that are pre-formatted,
a couple tools to parse it.
Also, I personally prefer the text output like here,
but there's graphic versions too.
And on the block layer, we used to have, well, we still have something called blktrace,
which is a really, really amazingly useful tracer
for any block layer interaction
because it traces the block IO requests
from the file system that issues it
through remapping, through the actual issue in the driver.
But by being generic,
it works on generic block layer concepts,
and now we have some low-level NVMe tracing
where we can trace things like the
actual hardware queue we go out to, the command IDs, like, additional command fields if we're
using T10 DIF, or the DSM bits. And for anyone who's trying to figure out what
actually goes out on the wire, what goes on at a very low level, it's a really nice tool.
But remember, it's in no way a
replacement for blktrace. It doesn't have anywhere near
the functionality. It's just another tidbit for very
low level tracing.
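If you want to play with it, a minimal user-space sketch of turning the NVMe trace events on and streaming them looks roughly like this; it assumes tracefs is mounted at /sys/kernel/tracing (older kernels expose it under /sys/kernel/debug/tracing), that the event group is named nvme, and that it runs as root.

```c
/* Minimal sketch: turn on the NVMe trace events and stream the text output.
 * Assumes tracefs is mounted at /sys/kernel/tracing (older kernels:
 * /sys/kernel/debug/tracing), that the event group is named "nvme", and
 * that this runs as root. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/kernel/tracing/events/nvme/enable", O_WRONLY);
	if (fd < 0) {
		perror("enable nvme events");
		return 1;
	}
	if (write(fd, "1", 1) != 1)
		perror("write");
	close(fd);

	/* trace_pipe blocks and returns formatted records as they arrive. */
	FILE *tp = fopen("/sys/kernel/tracing/trace_pipe", "r");
	if (!tp) {
		perror("trace_pipe");
		return 1;
	}
	char line[512];
	while (fgets(line, sizeof(line), tp))
		fputs(line, stdout);
	fclose(tp);
	return 0;
}
```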
So yeah, these were the completely new code blocks,
and we've also redone a lot of stuff. And
I guess the most exciting part is really the
I/O polling rework for people
who want extremely high-performance I/O. Back a couple years ago, we did the first
version of the polling support in Linux, which landed in Linux 4.4, where an application
that has a high-priority I/O, and which we had to specifically enable for the driver, can
request that, before returning
that synchronous read or write request to the application, we actually poll the
completion queue instead of waiting for an interrupt. And this actually gives pretty
nice performance already but has the big, big limitation that it's limited to a queue
depth of one. And in the classic polling version,
kind of like what you think of as polling,
it would literally spend all the CPU cycles you have on that core just polling
because it doesn't do anything else.
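On current kernels, the per-I/O way to ask for this from user space is the RWF_HIPRI flag on a preadv2() call; a minimal sketch, with a made-up device path and size, and with the assumption that the block device actually has polling enabled:

```c
/* Sketch of a polled synchronous read from user space: RWF_HIPRI on a
 * preadv2() call against an O_DIRECT file descriptor asks the kernel to
 * poll for the completion instead of sleeping on an interrupt. The device
 * path and size are made up, and the block device needs polling enabled
 * for the flag to have any effect. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, 4096, 4096))	/* O_DIRECT needs alignment */
		return 1;
	struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

	/* RWF_HIPRI: spin on the completion queue for this one request. */
	ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
	if (ret < 0)
		perror("preadv2");
	else
		printf("read %zd bytes (polled)\n", ret);

	close(fd);
	return 0;
}
```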
A little later, Damien, who sits there,
came up with the concept of a hybrid polling
where you keep statistics
about average IO completion latencies,
and we only start polling
when we actually start to expect that completion to happen
instead of all the time,
which gives gigantic reductions in CPU cycle usage
while being almost as fast.
But all that was still pretty limited.
And then, mostly driven by Jens Axboe,
our block layer maintainer,
we came up with a new interface called io_uring.
And if you're interested in that,
there's actually a talk just on that here at SDC on Thursday
called Improved Storage Performance
Using the New Linux Kernel I/O Interface.
And if this sounds interesting to you, go there.
It's cool.
And the idea behind that interface
is that we have a ring-based asynchronous IO interface.
You have a submission ring, a completion ring.
People familiar with NVMe might understand that concept.
And part of that, besides a couple other improvements, is the idea of a dedicated polling thread,
similar to what some user space drivers like SPDK have already been doing,
where instead of your application process that submits the
I.O., we have some other thread hogging a core to do polling.
We can actually combine that to some extent with hybrid polling, but basically you have
your application just submitting I/O through one ring, and it has another ring where it can get
notifications for completions, and then the actual polling of the hardware is done by another thread.
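A minimal liburing sketch of that model: IORING_SETUP_IOPOLL asks the kernel to poll the device's completion queue for this ring (which needs O_DIRECT I/O to a device with poll queues), and the dedicated submission thread mentioned above would be the IORING_SETUP_SQPOLL variant. The device path and sizes are made up.

```c
/* Minimal liburing sketch of a polled read, matching the model described
 * above as far as a small user-space demo can. Assumes liburing is
 * installed; device path and sizes are made up, and IORING_SETUP_IOPOLL
 * requires O_DIRECT I/O to a device whose poll queues are enabled.
 * Build with something like: gcc demo.c -o demo -luring */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;

	/* One submission/completion ring pair; completions are polled, not
	 * interrupt driven. IORING_SETUP_SQPOLL would add the dedicated
	 * kernel submission thread on top of this. */
	if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL) < 0) {
		fprintf(stderr, "io_uring_queue_init failed\n");
		return 1;
	}

	int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, 4096, 4096))	/* O_DIRECT alignment */
		return 1;
	struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
	io_uring_prep_readv(sqe, fd, &iov, 1, 0);
	io_uring_submit(&ring);

	struct io_uring_cqe *cqe;
	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		printf("read returned %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}
```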
And initially, we've just done that for PCIe.
By now, we also have it for RDMA and TCP.
And this is actually a really nice graph
from benchmarks from Jens that he recently tweeted.
I think the numbers are actually not that recent
because we've improved a little since then.
But we're at 1.6 million IOPS here without any real work,
just a little user space application using the io_uring polling thread.
And even without polling, io_uring gives a pretty nice performance advantage
where the classic asynchronous I/O interface just leveled out.
But again, the NVMe changes were actually relatively small
compared to the actual user space interface
and the block layer changes.
First off, yeah, I want to mention that at that point,
we had support for interrupt-less polls.
Oh, yeah.
That was pretty significant, both in performance and in the effort to make it.
Yeah, that's actually because we have those separate polling queues
and don't share the queues with the normal ones, which was part of that.
That was Keith who, by the way, is one of the other Linux NVMe driver maintainers
out of the three of us, so he really knows what he's talking about.
And the big advantage of having those separate queues
in addition to avoiding false cache line contention
is that we don't actually have to enable interrupts
in hardware for that queue.
So NVMe, when you create the queue,
you can tell it's like, do I want interrupts enabled or not?
And if you don't enable them, the hardware
doesn't even have to generate the interrupts
and no one has to handle them
and nothing actually happens.
And I'll actually get into a few other optimizations
that kind of came out of that later
and I think this is actually mentioned as a bullet point too
but it's good that we covered it here.
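Conceptually, the Create I/O Completion Queue admin command has an Interrupts Enabled bit, and a poll queue is simply created with that bit cleared. Here is a rough sketch of building such a command, with a simplified SQE struct and helper that are illustrative rather than driver code, and the field layout as I read it from the NVMe spec:

```c
/* Illustrative sketch of building a Create I/O Completion Queue command with
 * interrupts disabled. The field layout follows the NVMe spec as I read it;
 * the simplified struct and the helper are made up, not driver code. */
#include <stdint.h>

struct nvme_sqe {			/* simplified 64-byte submission entry */
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t command_id;
	uint32_t nsid;
	uint64_t rsvd;
	uint64_t mptr;
	uint64_t prp1;			/* physical address of the CQ memory */
	uint64_t prp2;
	uint32_t cdw10;
	uint32_t cdw11;
	uint32_t cdw12;
	uint32_t cdw13;
	uint32_t cdw14;
	uint32_t cdw15;
};

#define NVME_ADMIN_CREATE_CQ	0x05
#define NVME_CQ_PHYS_CONTIG	(1u << 0)	/* CDW11: PC bit */
#define NVME_CQ_IRQ_ENABLED	(1u << 1)	/* CDW11: IEN bit */

static void build_create_cq(struct nvme_sqe *cmd, uint16_t qid,
			    uint16_t qsize, uint64_t dma_addr, int polled)
{
	cmd->opcode = NVME_ADMIN_CREATE_CQ;
	cmd->prp1   = dma_addr;
	cmd->cdw10  = qid | ((uint32_t)(qsize - 1) << 16);  /* 0-based size */
	cmd->cdw11  = NVME_CQ_PHYS_CONTIG;
	if (!polled)
		cmd->cdw11 |= NVME_CQ_IRQ_ENABLED;  /* poll queues skip this */
	/* An interrupt-driven queue would also put its interrupt vector into
	 * CDW11 bits 31:16; a polled queue never programs one. */
}
```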
And yeah, so we have pretty nice performance numbers.
The other thing I'm pretty excited about,
work that Chaitanya did about two years ago,
the scatter-gather list support.
So scatter-gather lists are kind of what
just about every storage interface
but AHCI and the initial NVMe uses.
And it's just a way to efficiently describe
variable length data for transfers.
And NVMe 1.0 did not support these SGLs,
just something called PRPs.
NVMe 1.1, I think, finally added SGL support to PCIe.
The various fabrics transports always used SGLs
in a slightly different form.
And Linux 4.15 finally gained the SGL support for PCIe.
And this is kind of, so if I'm doing a large data transfer,
so with a PRP, which always just describes a page,
and a page could be a couple different sizes,
but for practical purposes is 4K most of the time.
With a PRP, you really need yet another entry
for every 4K chunk. So if you're doing,
say, I.O. on a single x86 huge page, which is two megabytes in size, you have a giant PRP list,
because every little 4K chunk needs an entry. With scatter-gather lists, on the other hand,
you have exactly one entry. So there's one entry: start here, and there's a length field,
and that really helps generating much more
efficient I/O as soon as transfers get larger
than a page or two. And I think that's
about the threshold we have, right? Sixteen K
or thirty-two K.
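The difference is easy to put into numbers: with PRPs, every 4K page of a transfer needs its own 8-byte entry, while one physically contiguous buffer needs exactly one 16-byte SGL data block descriptor. A little back-of-the-envelope sketch:

```c
/* Back-of-the-envelope comparison of PRP entries vs. SGL descriptors for a
 * physically contiguous transfer: 4K pages, and the 2 MB huge page from the
 * example above. */
#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;
	unsigned long xfer = 2UL * 1024 * 1024;		/* one x86 huge page */

	unsigned long prp_entries = xfer / page_size;	/* one 8-byte PRP per 4K */
	unsigned long sgl_entries = 1;			/* one 16-byte addr+len descriptor */

	printf("2 MB contiguous transfer: %lu PRP entries (%lu bytes of list) "
	       "vs. %lu SGL descriptor (16 bytes)\n",
	       prp_entries, prp_entries * 8, sgl_entries);
	return 0;
}
```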
Yeah. And something that kind of goes along
with this, even if it was done at a
different time separately, but really helps with the same
sort of problems
is a block layer thing.
And it's something called the multi-page bio_vec structure.
And the bio_vec structure is a very central structure
in the Linux block I/O subsystem.
It's basically this tuple of a page struct,
which is a Linux-internal abstraction,
but for our purpose here,
it's a placeholder
for a physical address. That's the interesting part in this case. So we can generate the
physical address of the memory from here. We have a length field and an offset field
into the page because the page is always aligned. So the physical address is split into the
page frame number, as some low-level hardware people might know, and an offset into it.
And the structure as is would already be pretty flexible
because it's got these 32-bit lengths and offset fields.
But in practice, at least at the upper layer of the stacks,
we always just used it to describe transfers inside a single page,
kind of like the PRPs I was just complaining about.
We did the same thing.
And Ming Lei from Red Hat, based on patches from Kent Overstreet from a long time ago,
finally got around to fixing up a lot of places in the block layer
to get rid of these assumptions,
which now means that, again, if we're doing I.O.
on, say, a single huge page of two megs,
we can store that in one of those structures
instead of a lot of them.
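For reference, the structure being described really is just that tuple; this is essentially its shape as in include/linux/bvec.h, and the multi-page change means bv_len may now span many physically contiguous pages:

```c
/* The bio_vec as described above: a (page, length, offset) tuple, shaped
 * essentially as in include/linux/bvec.h. After the multi-page bio_vec work,
 * bv_len is no longer limited to what fits inside bv_page itself and can
 * describe a physically contiguous range spanning many pages, e.g. a whole
 * 2 MB huge page. */
struct page;				/* kernel-internal page abstraction */

struct bio_vec {
	struct page	*bv_page;	/* first page of the segment */
	unsigned int	bv_len;		/* length of the segment in bytes */
	unsigned int	bv_offset;	/* byte offset into bv_page */
};
```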
Now, we actually used to merge these together before
or slightly after they hit the driver, kind of
depending on where you see it. So it, it
never went out to the wire like this, unless
a driver like NVMe forced it. But there
was still a lot of, like, merging, splitting, thrashing
a lot of cache lines for no good reason before
we got this in. So in Linux 5.0, this got fixed, and the other interesting thing is
just in Linux 5.4 merge window, which is going on, like, right now, we finally got a serious
merge that switches to networking stack to use the same structure. They basically had
their own version of it before, so it really helps to make, again, a couple of these data transfers that fly from the
block layer to the network stack, like NVMe over TCP, a little nicer to handle.
Another thing that kind of came out of this is optimizations for single segments. So if we just have a single one of these bio_vecs, we can actually handle it a little more efficiently than the normal I/O path.
So the normal I/O path from the bio_vecs creates a struct that we call a scatterlist,
which is a really weirdly designed structure because it basically contains the same fields that we already have
in the bio_vec and then contains another two fields
looking like a scatterlist entry, which contain a
DMA address, as we call it on Linux.
I mean, this is the address as seen by the I/O device,
which for many cases might be the same as a physical
address but it could be the physical address with an offset
or it could be something entirely different if you're using an IOMMU.
And what the scatter list structure kind of helps with is the concept of merging segments
in the IOMMU.
So most IOMMUs have the idea that you give them multiple discontinuous scatter gather
elements, and they actually merge them into a single continuous range out on the wire.
And for that, the scatterlist kind of makes sense, but it's just a very
cache-line-inefficient way of doing that. And I've
actually been working for a while, including taking over maintainership of the
DMA mapping subsystem, to come up with a better way of doing it, but I've already spent
two years on it, so I decided to take a little shortcut just for the small I/Os instead.
And the idea is, if we only have a single
bio_vec, it's very easy to not bother with all of that, because obviously the IOMMU is not going
to merge anything. There's really only a single segment. There's no reason to optimize it.
And then we just map it directly and store the DMA address in our NVMe private request structure.
We don't even need a separate length field, because we've already got one.
Nothing got merged.
And with a PRP, that means we can only easily do it
for a very small size of IO,
but with a scatter gather list,
this can actually, as long as it's one
physically contiguous entry, it can actually be gigantic.
Again, the prototype example of a huge page
when people are doing databases or HPC
really helps for that.
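A rough sketch of that shortcut, in the spirit of the driver but not its actual code: when a request has exactly one bio_vec, skip the scatterlist entirely, map the segment directly, and stash the DMA address and length in some per-request private data (the struct here is hypothetical):

```c
/* Sketch of the single-segment shortcut (kernel-flavored pseudocode of the
 * idea, not the nvme driver's actual implementation): one bio_vec in, one
 * DMA address out, no scatterlist and no merging. The per-request struct is
 * hypothetical. */
#include <linux/bvec.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>

struct my_iod {				/* hypothetical per-request private data */
	dma_addr_t dma_addr;
	unsigned int dma_len;
};

static int map_single_segment(struct device *dev, struct bio_vec *bv,
			      struct my_iod *iod, enum dma_data_direction dir)
{
	iod->dma_addr = dma_map_page(dev, bv->bv_page, bv->bv_offset,
				     bv->bv_len, dir);
	if (dma_mapping_error(dev, iod->dma_addr))
		return -EIO;
	iod->dma_len = bv->bv_len;	/* the bio_vec already has the length */
	return 0;
}
```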
And I saw on my test setups, I actually saw
some pretty small speedups with that,
and then Jens tested it on his io_uring benchmarking rig,
and he actually saw a 4% speedup,
which some people would kill for, and which surprised me.
And this just means, again, I mean, with modern NVMe devices, modern I.O. interfaces, we're
now living in a world where every single cache line counts.
It's like, every time you touch another cache line you don't have to do, it's gonna show
up in benchmarks.
So there, we're getting to the point where we're really just, there's not much fat to
trim anymore.
But it's exciting.
And because of that, there's all these little
performance optimizations that aren't really huge changes
to the code, but kind of nice and useful.
So one interesting thing that also sort of fell
out of the explicit poll queues in the io_uring work,
even if it's not directly related, just using the infrastructure,
is the idea of dedicated read queues.
So instead of having, like, writes and deallocate commands and reads all on the same queue,
this directs the reads to a dedicated set of NVMe queues, and Jens, who works for Facebook,
apparently has a couple sensitive critical read workloads where he'd rather not have
big writes disturb the
reads in the same queues so that they are not blocked. I'm not sure if anyone else
has ever been interested in using it actually. But yeah.
Lockless completion queues. So actually I didn't have the interrupt disabling in there,
but that's kind of coming from the same thing. Now that we don't have an interrupt and a
user process both banging on the same CQ, but only the interrupt,
which has exclusion by definition,
we could actually get rid of the spinlocks
and interrupt disabling on reading the completion queues,
optimizing away a few more atomic instructions,
getting rid of a few fields.
I think Keith actually has a couple more patches
to shrink the queue structure so that it fits in
a single cache line, getting back to our old topic.
We might finally find some time to get back to that.
We have a concept of batched doorbell writes.
So the NVMe interface, when you submit to a submission queue, works by placing the submission queue entry in the queue,
it works by placing the submission queue entry in the queue
and then you eventually ring the doorbell
after you place one or more of these entries
to actually tell the hardware,
okay, there's something to read now.
And traditional Linux driver, like most drivers,
was always putting in one entry
and then ringing the doorbell, which works okay.
It's good for latency, but if you actually have
a lot of SQEs that you know are batched up already, it makes sense to defer that a little bit and batch them up.
And this is especially true for virtualized NVMe controllers
where the MMIO writes that a doorbell ring does
are really, really expensive.
So if you can reduce that a little bit,
it gives it a little speed up.
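The idea in a nutshell, as an illustrative sketch rather than the driver's code: copy the SQEs into the queue and advance the tail locally, and only do the expensive MMIO doorbell write once per batch:

```c
/* Illustrative sketch of batched doorbell writes, not the driver's code:
 * post several SQEs locally, then do one MMIO write of the tail pointer. */
#include <linux/io.h>
#include <linux/string.h>
#include <linux/types.h>

struct my_sq {				/* hypothetical submission queue state */
	void *sqes;			/* queue memory, entries of sqe_size bytes */
	u32 sqe_size;			/* 64 bytes for standard NVMe */
	u16 tail, depth;
	void __iomem *doorbell;		/* SQ tail doorbell register */
};

static void sq_post(struct my_sq *sq, const void *sqe)
{
	memcpy(sq->sqes + (size_t)sq->tail * sq->sqe_size, sqe, sq->sqe_size);
	if (++sq->tail == sq->depth)
		sq->tail = 0;
	/* Deliberately no doorbell write here. */
}

static void sq_ring_doorbell(struct my_sq *sq)
{
	/* One MMIO write covers everything posted since the last ring; this
	 * is the part that gets really expensive when it traps into a
	 * hypervisor for an emulated controller. */
	writel(sq->tail, sq->doorbell);
}
```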
And because we gave so much love to PCIe here,
there's another nice
little thingy for RDMA. And RDMA has this concept of an inline segment. So you send
your RDMA packet that contains the submission queue entry over to the controller,
and normally the controller would then use RDMA primitives to actually read the data
that the other side writes to itself.
It's kind of like two protocol round trips.
And it also has the concept of an inline data
where you can just add a segment of data
right at the end of your initial write
that contains the SQE.
And that is nice because it avoids a round trip,
reduces latency, the downside is that you need
to pre-allocate that space in every buffer,
so it bloats the amount of space used by the queues.
And what we did initially is to only support that
for a single segment, like a single bio_vec.
And Steve Wise figured out that even if we have the same size,
it's actually not very costly to allow multiple segments.
So if we have, say, an 8K buffer,
there's no reason not to allow up to four segments, even if they're 512-byte non-contiguous ones, because you can get that
pretty much for free and it actually helps to speed up some workloads that, say, always
do 8K writes, which often might be two segments because you're not using contiguous memory.
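The relaxation can be pictured like this sketch, which is illustrative rather than the actual nvme-rdma code: instead of bailing out on anything that isn't a single segment, accept any I/O whose total length still fits the pre-allocated inline space, up to a small segment budget (the constants and the struct here are made up):

```c
/* Illustrative sketch of the relaxed inline-data check, not the actual
 * nvme-rdma code: accept multi-segment I/O as long as the total length fits
 * the pre-allocated inline space and a small SGE budget. Constants and the
 * segment struct are made up. */
#include <stdbool.h>
#include <stdint.h>

#define INLINE_DATA_SIZE	8192	/* e.g. 8K of inline space per queue entry */
#define MAX_INLINE_SEGMENTS	4

struct segment {
	uint64_t addr;
	uint32_t len;
};

static bool fits_inline(const struct segment *segs, int nsegs)
{
	uint32_t total = 0;

	if (nsegs > MAX_INLINE_SEGMENTS)
		return false;		/* the old check bailed out on nsegs > 1 */
	for (int i = 0; i < nsegs; i++)
		total += segs[i].len;
	return total <= INLINE_DATA_SIZE;
}
```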
All right. So the other interesting, really, really interesting, in every sense of the
word, enterprise feature, is the PCIe peer-to-peer support.
Some people might know, others not.
So PCI Express, while for the host kind of looking like good old PCI with, like, the physical
signaling, is really
a network protocol underneath.
It's pretty well hidden, but for anyone
reading the spec, it's like that. And one thing that
PCI allows is to not just have
your host CPU
through the root port talk to
a device in pack, but the devices can
actually under circumstances talk
to each other. This is especially
interesting if you've got a PCI switch in there
so that your uplink to the host isn't touched,
but there's a couple other use cases for it.
And we finally grew support in Linux
to have a limited version of that PCIe peer-to-peer support
in Linux 4.19.
And that support is basically a generic support
in the PCI layer, which handles PCIe as well,
to register BARs,
like the big memory windows PCIe cards have,
with that layer, so they can hand out allocations,
and then a way to figure out
if two devices can actually talk to each other,
depending where they sit in the hierarchy
and what kind of settings, like ACS. What's ACS? Advanced control?
Access control.
Access control services, yeah. The access control services allow for it. So basically
it's like, hello, can we actually talk to each other or do we always have to go through
mom? And if we can, we can then initiate the direct transfers. And right now there's
basically just two entities that can talk to each other
because nothing else has been wired up.
And one is the NVMe PCIe driver, which can export the CMB, the controller memory buffer,
which is basically just a giant memory BAR that has no real functionality,
but which the NVMe device can read from and write to internally, as this peer-to-peer memory,
and then RDMA cards can address that
so that if you're doing an NVMe or Fabrics target,
that can directly access the PCIe bar
on the NVMe controller from the RDMA HCA.
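The "can we talk directly or do we go through mom" check essentially boils down to whether the two devices sit behind a common upstream switch port, plus the ACS settings; here is a very rough sketch of just the hierarchy-walking part, using the kernel's pci_upstream_bridge() helper and ignoring everything the real pci_p2pdma code additionally checks:

```c
/* Very rough sketch of the peer-to-peer eligibility idea: walk up from both
 * devices and see whether they share an upstream bridge, i.e. sit behind the
 * same switch, before reaching the root. The real pci_p2pdma code also
 * checks ACS settings, whitelisted root complexes, distance, and more. */
#include <linux/pci.h>

static bool share_upstream_bridge(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *up_a, *up_b;

	for (up_a = pci_upstream_bridge(a); up_a;
	     up_a = pci_upstream_bridge(up_a))
		for (up_b = pci_upstream_bridge(b); up_b;
		     up_b = pci_upstream_bridge(up_b))
			if (up_a == up_b)
				return true;	/* common upstream port */
	return false;
}
```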
And there's all kinds of work going on,
like actually supporting setups that have IOMMUs in use instead
of directly mapped DMA addresses.
And there's one really big warning. So CMB, as
specified in NVMe 1.3 and earlier, is really
gravely broken for typical virtualization setups. NVMe 1.4 has
kind of a shoehorn fix that solves some of that.
Some of this might be dangerous if you're using interesting remappings with IOMMUs.
And it's not actually Linux's fault.
Is that Linux now dedicating the queue for reads?
Is there any reason why we're not telling the NVMe devices that reads are on a specific read queue
and they can take advantage of that?
Okay, so the question was,
now that Linux has a dedicated read queue,
can't we tell the device that it is a dedicated read queue?
So one correction first is,
Linux, by default, it doesn't do that.
There is an option to do it,
which Jens, I guess, at Facebook uses, which I don't think many people
use, just as a spoiler. But if we do this and we had
a way in the standard to tell that device, we'd happily do it. I don't think there is a way right now.
Yeah, but if
there was, we would happily set that bit or two.
Oh, yeah.
We can do this for FICON today, okay?
Yeah. Over Fibre Channel?
Yeah.
All right.
It's part of mainframe.
Yeah.
Will you be able to do this not just for devices that are in the same PCI bus,
PCIe bus, which is what you're talking about here,
but for devices that are on, let's say,
point-to-point fiber channel fabric
that are in two different PCIe buses.
So if you've got, like, two different PCIe root ports
on the CPU complex, it'd be...
You've got a system here and a system here,
and I want to do the peer-to-peer transfer
without dragging it through the host
and exposing it to the network.
Well, not as a single hop thing.
I mean, you could probably build something, but it's not like a single entity that, I
mean, not at this level.
So PCIe, this is just the peer-to-peer just at the PCIe level, and you are either behind
the same root port with a switch, or you have some CPUs that actually allow that routing
from one root port to another,
but it's basically PCIe-specific.
Now, you could do things like, for example,
Mellanox has a NIC that can actually directly do peer-to-peer
and do NVMe transfers without going to the host.
I think it's kind of sketchy, but I think it's cool.
And if you have that on both sides,
you could actually do multiple hops without ever touching a CPU, but that think it's cool. And if you have that on both sides, you could actually do multiple hops
without ever touching a CPU,
but that's not directly solved by this.
The reason we do that on FICON
is because in analyzing over many
years the amount of data
that's moved every day
in a typical setup,
the majority of the data moves device to device.
Well, there is another
thing, sort of in that area,
which we're not doing, which I know SCSI does and NFS does.
They have this idea of a target-to-target in SCSI
or server-to-server in NFS copy,
where the client just says, okay, you talk to you now over the network.
But that's not really related to this.
And then we'll announce this next month for heterogeneous devices.
It'll go from disk to tape. Tape.
Yeah, but that's, I mean, that's higher level storage protocols than this.
This is like super low level, bus level.
Yeah, so after all this interesting performance and enterprise stuff, there's another big
angle, and that is consumer-grade NVMe has finally fully arrived.
If you buy a decently high-end laptop these days, you will get it with an NVMe device.
Only the shitty stuff will still come with SATA.
And the really shitty stuff will come with eMMC and UFS. Don't buy those.
So all these beautiful
little M.2 devices, or
BGA, which is the soldered-on version
of it, or Toshiba at FMS
has announced another weird small form
factor, and there's like CF
Express and SD Express, which is
NVMe-based compact
flash and SD card.
So it's getting tinier and tinier.
It's getting everywhere, and it's getting consumer-grade.
And consumer-grade means a lot more buggy devices.
I mean, the enterprise ones are buggy too, but usually, except for a few ones,
not in a way that we really have to work around them in the driver. For consumer stuff,
we've now used up 13 of our quirk bits,
so we have 13 different misbehaviors we need to work around
for specific devices, and that doesn't even count the ones
we work around for all devices, even the ones that don't need it.
What's really interesting in that thing
is Linux 5.4, which is developed right now,
will contain support for these so-called Apple NVMe controllers
that you find in recent MacBooks, which look a lot like NVMe, except that it turns out
the submission queue entry, which in NVMe has to be 64 bytes, it's actually 128 bytes.
They apparently have something in there, but you don't actually need it to work. And they
won't work if you use more than a single interrupt vector. And
they actually use a shared tag space
between all the queues, so if you have a command
ID of 1 on your admin queue, you'd better
not use that on an I/O queue, because it's going to
blow up. But
fortunately, Linux kind of has support
for all this in the block layer, so it's just setting
a few bits here and there, and things will actually
work. And the people who reverse-engineered it were
pretty happy that Linux now runs nicely on these MacBooks. But it's weird stuff, and
I hope people actually read the NVMe spec next time before they implement it.
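In driver terms such workarounds end up as quirk flags attached to the PCI ID table entry; the flag names below match the ones added for this class of controller as far as I can tell, they live in the driver-internal nvme.h header, and the device ID shown is a placeholder rather than the real one:

```c
/* Sketch of how such workarounds are expressed: a quirk bitmask on the PCI
 * ID table entry. The flag names mirror the ones added for these controllers
 * (defined in the driver-internal nvme.h header); the device ID below is a
 * placeholder, not necessarily the real one. */
#include <linux/pci.h>

static const struct pci_device_id example_id_table[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0xffff /* placeholder ID */),
		.driver_data = NVME_QUIRK_SINGLE_VECTOR |   /* one IRQ vector only */
			       NVME_QUIRK_128_BYTES_SQES |  /* oversized SQEs */
			       NVME_QUIRK_SHARED_TAGS },    /* one tag space for all queues */
	{ 0, }
};
```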
The other thing that is really not 100% specific to consumer devices but 98% is power management.
So NVMe has this concept of power states.
You can either manually manage them
where the operating system always says,
okay, go into power state N.
And we think that's kind of a nanny approach
and the device should really know better.
So the Microsoft Windows driver
actually does the explicit management. And Linux
and some other drivers tell the device, it's just, okay, here's a couple
parameters. Please just pick your power state that you think is best.
And in a lot of cases, this works really well and gives really nice power
savings. In a few cases, it spectacularly blows up.
And the interesting thing is that
very often, it's not even specific to a particular NVMe M.2 device. It's particular to a specific
firmware version on a specific device combined with a specific BIOS version of a specific
laptop. And that leads to some interesting workarounds.
Since Linux 5.3, when Keith did some cool work after Dell tried to do some
not so cool work first, we can also use APST for a system-level suspend. So not just for
runtime power management, but when you close the lid or put the thing away. And before
that, we basically fully shut down the thing, and you'd think that would save the most
power, right? It's off. It turns out, in many cases, not all of them,
it's not. And that's because
Microsoft has pushed the concept
of a modern standby on
people, where they'd rather do that,
and if you don't do that, the
laptop does weird things.
So for a couple platforms,
this, again, helped with power saving.
A couple others got nicely
broken, so
it's a bit of a pain.
Well, and
last but not least, we have the sad puppy
thanks to Intel. So one of our
biggest, biggest issues
in consumer
NVMe is
Intel's chipset division, who's really
out there to screw up our life.
And Intel has something called the RAID mode.
Don't think of RAID because it has nothing to do with that.
So the great idea is you have an AHCI controller on your device, which then hides one or more,
I think up to three, NVMe controllers inside its PCIe BAR, behind the AHCI BAR.
There is no sane way to enumerate it
because it's not documented.
All our quirks for the buggy devices I just mentioned
based on the PCI IDs couldn't ever work
because they're hiding the PCI ID for the real device.
And there's no way to do things like SR-IOV,
creating new functions, assigning functions to guests.
It's like it's all broken because Intel's BIOS and chipset people decided, oh, there's
an NVMe device.
We're not actually going to show it to you.
Your NVMe device is not going to show up in many laptops with Intel chipsets if your BIOS
is in the wrong mode and some of them are hardwired to that.
Instead, there will be some magic in an AHCI device
that isn't even otherwise used.
And it very much seems like an intentional sabotage.
So we had one Intel guy, kernel developer,
trying to post draft patches for it,
and ever since he couldn't even talk about that anymore
because he got some gag order.
And, yeah, this is our biggest problem, in addition to
power management. Everything else is cool to me. Thank you
very much.
Questions?
Yeah, do you have the same automated setup on the
back end?
Yeah, automated setup basically means as soon as we discover multiple controllers
on the same subsystem,
we just link them together in the kernel,
and you will just get multipath access.
You don't do anything.
It's just there.
So that's about as automated
as it gets.
What is the timeline for 5.4?
The timeline for Linux 5.4? So we don't
even have the first release candidate yet. So it's another seven weeks from now-ish.
Eight, maybe.
Seven to eight weeks from now.
Okay, so the question was if it's possible for kernel users
to use io_uring.
And the direct answer is no.
But you get,
so you don't get the ring,
but you can use things
like the dedicated polling thread
from kernel space too, and we've
actually looked at that for the NVMe over
fabrics target. So you'll get, you'll get a
similar use case, it's just not exactly the same
interface, because kernel interfaces just look very different.
Or is it because it needs a rewrite?
So, I mean, first, the question
was, the Fibre Channel code base is so big, is that
because it's so complicated or it's junk and needs a rewrite?
So it's not really 20% of the kernel.
It's a huge part of the NVMe driver.
It's also a huge part of the SCSI layer, but overall, it's not anywhere near as much.
And I'm not sure.
I mean, one problem I know with Fiber Channel is just that a lot of stuff that really is
protocol generic isn't really in a generic layer,
but either in the drivers or in firmware.
And a lot of the drivers support
very different hardware revisions.
They're very different.
No one's really done NVMe for Fiber Channel.
I mean, not NVMe over Fiber Channel,
but an NVMe-like hardware interface for Fiber Channel.
So there's a couple drivers from different companies.
Each of them supports very different hardware generations,
and each of them duplicates a lot of code.
And I don't think there's one single clear answer,
but you kind of get where this is heading to.
So the canonical example,
which I guess might even be the most interesting one,
is to inject the timeout and figure out how the timeout handling works.
Something like that.
Asymmetric, yeah. You're right. Sorry.
No, you're right. My name is on the TP and I still can't spell it.
Damn it. Yeah, well, it's, anyway, I have a hard time getting that over, but I'll try to
explain it. No, so actually for a long time there was no I.O. scheduling for NVMe at all. I don't
think I said that earlier, but it definitely is true. And with the move to blk-mq, we then later
gained support for I/O schedulers for that,
but they've always been shared by all drivers
using that infrastructure, which includes SCSI.
So they're not NVMe-specific,
but there are I.O. schedulers now
that didn't used to be there,
and they're not NVMe-specific.
Okay.
Okay, remaining questions are to be asked over coffee.
Thank you very much.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe@snia.org. Here you can
ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.