Storage Developer Conference - #140: Introduction to libnvme
Episode Date: February 11, 2021...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNEA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 140.
Hi, everyone. Thank you all for joining this presentation today. I hope everyone is having a safe and informative time here at our virtual SDC this year.
Please let me introduce myself. My name is Keith Busch. I am a researcher at Western Digital and have been here for a little over a year now.
I have been working on NVM Express for a little over 10 years now, though, and I occasionally contribute to the
specification and committee processes, but I am mainly focused on the host software enabling for
this protocol, specifically for the Linux operating system. I am one of the co-maintainers for the
Linux kernel's NVMe driver, and I also maintain and contribute to various other Linux NVMe-related
projects. The most frequently used of those include the QEMU-emulated NVMe controller,
and that is a really nice way to test NVMe over PCIe without any actual hardware.
The other project I contribute to is called NVMe CLI,
and that is a shell utility for managing and querying NVMe devices. Today I'd like to introduce
one of the other projects I've been working on. It's a Linux library for NVM Express, aptly named libnvme. Now, a library for Linux NVMe is probably a bit overdue. This would have been a much smaller project had we done it in the earlier days, back when NVMe 1.0 was introduced and the Linux kernel was still on version 3. If we had, it would have been a very minor effort to release it back then, and it would just have been a maintenance task as things progressed throughout the years. But we didn't develop a common library then, and as the standard grew and became more complex, so did the Linux kernel, and the burden to create a library has grown with it. So just to set up what this library aims to help out with, we'll do a quick comparison of where we started and where we are now with both the standard and the Linux driver that implements it.
Then we can do a quick tour of the library itself and see how it may be useful for developers. NVM Express was publicly released in 2011,
and the Linux driver supporting it was provided at about the same time,
and it was first integrated in the 3.3 kernel release.
Back then, NVMe was often described as being the streamlined
and lightweight storage interface,
and it didn't have any of the baggage of the older
and more established protocols that came before it. And with this protocol being freely available
in an open specification that any vendor could implement, this provided a great opportunity to
develop common software solutions in the open source world. And driver development has definitely
converged on two common sources.
The Linux kernel provides just a single NVMe driver used by all vendors, and it works with all types of devices from direct-attached clients and enterprise server devices to the external NVMe over Fabrics arrays.
And just to keep everything working under one driver, the Linux kernel does provide several ways to do vendor-specific things.
But those are typically reserved for quirky devices that are not spec-compliant.
So relying on these is strongly discouraged by the community in favor of actual compliance with the spec.
But having the single driver integrated well within the kernel has actually improved the Linux kernel.
So it's been a very symbiotic relationship.
A couple of examples: NVMe was one of the biggest drivers in making PCIe hotplug as mature as it is today.
Prior to that, it was not very common to see such a feature, and NVMe made it a normal thing.
The block layer improved as well due to NVMe
integration. It introduced several high-performance features like IO polling. We also see the new io_uring interface that was largely motivated by NVMe performance. So while we do have common
software in the kernel space for so many NVMe devices, pushing the boundaries of other software components, we have not really seen that kind of convergence happen in user space.
The software occupying user space often re-implements various subsets of the specification, and those implementations are fragmented.
It's not just the specifications either. The Linux driver has also evolved to become more complex,
and there's been quite a bit of duplication in the software user space
to reach all those features.
So the primary goal for libnvme is to provide a common place
to expose all the NVMe features provided by Linux
and on the devices running in this operating system.
So there are certainly a lot of pieces in the NVMe protocol,
but libnvme is intended to run in user space on Linux.
So it's not really concerned with all the features
that a device maker needs to worry about.
The library's concern is more about what a Linux user can observe.
So we'll focus on those parts of the specification.
Back in the 1.0 days,
NVMe was indeed a very slim minimalist interface. We had exactly one document to refer to for everything you need to know about NVMe, and it targeted exactly one transport interface,
and that was PCI Express. It also had a very small command set to consider, as depicted in
these two columns here.
We had a mere 15 administrative commands and six IO commands to consider.
And many of these admin commands do not even need to be part of a user space library.
And I will get into more details on that in a few more slides.
But in addition to what's shown here, some of these admin commands had subtypes depending on various command parameters that you can give to them. So it's just a little bit more complicated than this.
But it's not a whole lot more to consider either.
We actually had just two types of identifications. We had namespace and controller. We had these three different log pages that were defined by the specification and 12 tunable feature settings. And even then, a lot of these
features were optional. The point though is that if we were to have developed a library back in
these early days, the entire admin command set that we'd be interested in providing fits all on this one slide.
And now we fast forward today where we're beyond NVMe 1.4. And I say beyond 1.4 because while 1.4 is the most recently released version of the NVMe specification as of today, there have been
several large technical proposals that have been published
and that have pushed the specification beyond that, but there has not been a new version
released since then.
But instead of just a single document like we had before that was needed to understand
the protocol, we now have more than five different documents managed by the NVM Express
committees.
This is also going to get even more spread out through some of the NVMe refactoring efforts that we anticipate we'll see when NVMe 2.0 is released in the near future.
That effort aims to group common functionality together, but also split unrelated features
away into their own documents.
So we should probably see even more specifications released around the NVM protocol
when that major version comes out.
But one of the more recent enhancements to NVM Express is the multiple command sets.
These are beyond the more traditional IO commands that we have in the base spec.
One of those is the zoned namespaces, or ZNS. And ZNS, it's pretty
cool stuff and support for these devices has very recently been integrated in the Linux kernel.
And since the Linux kernel supports it, there is support for its unique commands and characteristics
in libnvme as well. And while I would love to talk more about it, there are other talks here at SDC focused on ZNS, so I would recommend checking those out if this topic is of interest to you.
And since I mentioned ZNS, I will just mention here that the key value command set, while published by the NVMe committee, is not currently supported in the Linux kernel, so that particular feature set does not exist in the library either. So that may have to be a topic for another day.
But moving over to the transport, we've also added four more in addition to the existing PCI Express,
and those include RDMA, Fibre Channel, and TCP. And for testing purposes, the other category was provided by the NVM Express committee, and Linux uses that as a loopback target, and that's simply a software-defined local NVMe transport.
The number of commands we're concerned with have also grown.
We've gone from 15 to 27 admin commands, and 6 to 13 IO commands. And just to make it even more difficult,
the individual command sets provide their own operations that we need to consider. So these
include ZNS and key value, as I previously mentioned. So not only has the number of commands
grown, the subtypes of these commands and variants have also grown
significantly. So we had just two possible identifications before, and now we have 18.
Our three log pages have grown to 19, and we now have 30 tunable features up from 12.
We also have these weird fabrics commands that can both send and receive data. It's rather unique to that operation.
And we also have several other Admin commands with their various subtypes,
which include directives, namespace management, and attachment.
And these are just some of the more commonly used capabilities.
There are still even more that are not shown here.
But the point of why I'm bothering to show all this is just to provide a visualization that our once tiny and elegant interface has grown into something so much more, and it's getting bigger all the time.
The one thing we'd like a library to do is provide a common location to define all of these features so that they don't need to be implemented for every NVMe-specific component of software.
Now, it's not just the NVMe protocol and programming interfaces that's gotten so much more complicated either.
The Linux kernel supporting this has also grown in complexity.
So let's just take a little dive to see where we started and where we are at the moment. The NVMe driver was initially very simple, just like the programming interface that we
had.
The namespace for each NVMe device was surfaced up to user land, and access went through the block layer and virtual file system stack.
And we provided a very straightforward handle name called nvmeXnY, where X is the controller instance and Y is the namespace instance. Beyond that simple block interface, the driver also provided
a special ioctl interface for device management through each namespace. The ioctls provided could tell us the unique namespace ID, or you could submit arbitrary admin commands through it.
It also had a special interface to submit IO commands.
But this initial implementation was a bit misguided, and I'll explain a bit more on that in a minute.
But those are really the only entry points and exported handles the initial driver provided. So it's a very fast but lean driver for our NVM interface.
Now the Linux kernel today is exposing quite a bit more to the user and even more entry points.
Now as before, we still have the same interfaces through the block layer and virtual file system, but we also have added more.
For one, we've added six more ioctls in addition to the previous three.
We have this generic IO command interface, and we have additional control ioctls.
I have been occasionally asked what the difference is between the IO command ioctl and the submit IO ioctl.
The submit IO ioctl, as I said, was a bit misguided.
It was parameterized way too specifically to the read and write commands.
But a lot of IO commands just don't align with those parameters.
For example, some commands like dataset management, the flush command, and the persistent reservations just don't have the sort of parameters that read and write require.
So we needed a much more generic and flexible interface.
And this was provided through this new IO command ioctl.
And this was modeled very much after the existing admin command ioctl because of its flexibility. Later though, the NVMe standard
defined additional bits for return data. Initially, the first four bytes of the completion queue entry
was defined for a command to return specific data about that command, but that was later expanded
to the first eight bytes for some commands to use. So we had to provide a new version of both admin and IO ioctls that were capable of returning that data for commands that require it. So now we have these 64-bit versions of admin and IO command ioctls. One example of an IO command that requires the 64-bit version is the ZNS append command.
With that command, the user requests to append data to a zone, and the drive replies with the LBA it was written to.
And the 32-bit version of this command just wouldn't be able to report the LBA for some of the larger capacity ZNS drives.
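For illustration, here is a minimal sketch of what that raw kernel pass-through looks like without any library help, sending an Identify Controller command through NVME_IOCTL_ADMIN_CMD. The device path /dev/nvme0 is just an example, and you typically need root privileges to issue it.

/*
 * A sketch of the kernel's raw admin pass-through ioctl, which is what
 * libnvme wraps for you.  This sends Identify Controller (opcode 0x06,
 * CNS 1) to /dev/nvme0; the device path is only an example.
 *
 * Build: gcc identify_raw.c -o identify_raw
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	static unsigned char data[4096];	/* Identify data is 4096 bytes */
	int fd, ret;

	fd = open("/dev/nvme0", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct nvme_admin_cmd cmd = {
		.opcode   = 0x06,			/* Identify */
		.addr     = (__u64)(uintptr_t)data,
		.data_len = sizeof(data),
		.cdw10    = 1,				/* CNS 1: Identify Controller */
	};

	ret = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
	if (ret < 0)
		perror("NVME_IOCTL_ADMIN_CMD");
	else if (ret)
		fprintf(stderr, "NVMe status: 0x%x\n", ret);
	else
		printf("Model: %.40s\n", (char *)(data + 24));	/* MN lives at byte offset 24 */

	close(fd);
	return 0;
}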
Beyond the ioctl interface, though,
the driver now exports quite a few more user-visible handles.
First, we had very early on discovered
that we needed to expose more than just namespace handles to user space.
We just can't always count on a controller
having a viable namespace for the driver to attach to.
For example, NVMe had introduced namespace management capabilities,
and when that happens, there may not even be a namespace in existence.
So the driver started exporting these special controller handles as these character devices,
and those take on the form of just nvmeX, where X is the controller instance.
Later, we had multi-ported controllers,
and we needed a way to report the relationship
among the different ports of the controller.
So we have these NVMe SysFS interfaces,
and in addition to showing relationships
among different controllers,
they have various different attributes
that can be exported through them.
Later, NVMe over Fabrics became standardized,
and we needed a way to initiate connections to remote targets.
So unlike PCIe targets, which are direct attached to the host processor
and don't need any information for the operating system to bind to them,
the Fabrics targets require the user to tell the driver how to connect to them.
So the driver provides a special NVMe-fabrics handle, and that is what provides the NVMe
discover and initiator capabilities. One last thing I'll mention is related to the multi-pathing.
More recent versions of the NVMe driver provide a very efficient way to handle NVMe subsystems with multiple controllers.
The user-visible result is that you don't see duplicate namespaces for each path.
There are lots of benefits to having that, like not having to stack device mappers to manage those paths,
and you get automatic failover and optimizations.
But a bit of a misstep in our initial implementation was that we broke user expectations with respect
to device names.
With native multipathing, the namespace handles are provided through the subsystems rather
than the controller.
So the name of a namespace inherits the subsystem identifier rather than the controller's.
Users had become accustomed to the namespaces
being related to their controllers.
So for example, if you had a namespace node named nvme0n1,
a user would assume that the parent controller was nvme0.
With native multipathing, there is no such relationship.
So if they do align, it's just purely coincidence.
And unfortunately, many users ended up performing destructive actions to the wrong namespace.
So this all happened despite the topology information being provided in SysFS.
It's just really not the most friendly interface for users to examine, so many of them don't.
So I do feel maybe if we had a library to make sense of all the topology, maybe we could have avoided some of those accidents.
More recently, though, the kernel has developed a more sane approach to the exposed handle names, but the relationship among the paths and the namespaces is not immediately obvious still.
So you need to examine the sysfs hierarchy. So making more sense of that is something that we'd like out of libnvme as well. And those are just some of the pain points. So now I think it's nice to look into libnvme and what it can do for developers. The remainder of this presentation is going to take a fairly high-level look at where libnvme fits in the software stack and a few simple examples. And
we can examine the various layers of the library
with some other concrete examples to help guide the presentation.
So in the software stack, libnvme lives in the user space
in support of nvme features using the Linux kernel driver,
and it sits alongside applications that use the storage.
The library works with the kernel. It doesn't work around it. This is not a user space driver,
so it's not intended to replace normal IO access. And IO should continue to work just as it does
today through the file systems and block stack with your various generic IO interfaces like pread/pwrite, libaio, or io_uring
if you're a little more adventurous.
The NVMe driver does provide a pass-through interface
for arbitrary IO, as I mentioned earlier,
and LibNVMe can make use of that.
So you could use it to directly insert NVMe read
or write commands if you really wanted to.
But the interface is entirely synchronous and it's not necessarily optimized.
So it's mainly provided for device testing and debug purposes
or for when you really want to bypass some of the block optimization paths.
But rather than being intended specifically for IO, the library is mainly for
finding all the devices enumerated by the kernel and figuring out how they're related to each other.
It's also there to communicate with the Fabrics initiator to discover and connect to new targets.
And the biggest part is that it's there to utilize the pass-through interface to send arbitrary admin commands.
It also provides utilities to set up and decode payloads for those commands.
The kernel driver continues to own all the low-level details.
So this is outside LibNVMe now. The driver still owns initializing the controller, setting up DMAs for commands,
queuing commands, and handling any low-level errors associated with the device.
As previously mentioned, the Linux NVMe driver exports many artifacts,
and LibNVMe interfaces with all of them.
First on the left, all the NVMe controllers can be managed through
this library. It just connects through those different device nodes, and that's where you
can submit your administrative commands. libnvme can also match up those device handles with their
sysfs interface side, and from there it can report information about the subsystem and other
controllers in the subsystem, what namespaces are accessible through any of those controllers and whether those paths are optimal or failover and various other attributes about that.
Moving over to the next handle, the NVMe Fabrics interface.
This can be a bit tricky to work with, which we'll go into a little bit more later. But this is used for initiating new connections,
and this library provides a simple interface for configuring that.
And finally, we have the special directory /etc/nvme.
That's actually not something the driver provides.
It's instead an artifact installed by other applications
for configuring NVMe host and remote targets
that have been saved. And libNVMe can decode that special directory and help set up connections
from what's in there. Much of everything I've described so far, there's actually quite a bit
of overlap with another Linux user space program that I mentioned earlier, NVMe CLI.
So that's already provided there.
What is libnvme doing?
Well, NVMe CLI, it's strictly a command line utility, and it performs just a single user requested action per invocation.
So that utility is more about taking user input and formatting output.
It's mainly concerned with interacting with the users, and it's not so much about interacting with other applications.
So if you have your own NVMe-centric application, NVMe-CLI is probably not going to be very useful.
But libnvme can easily be a drop-in replacement for NVMe CLI's backend.
And it's actually something I'd very much like to complete in the near future.
And that way, NVMe CLI can focus mainly on the user-facing side, and libnvme can take on
all the responsibilities for interacting with the kernel and providing a coder-friendly API to other applications.
So let's just take a look at the current snapshot of this project. The repository for this library, it is open source on GitHub, and it's provided at this link here.
It is written in C, but it also integrates well with C++.
It is provided with an LGPL license.
So any changes to this library
will continue to be open source,
but proprietary software may link with it
without concern of any license contamination.
It's currently just a bit over 20,000 lines of code.
I would just say over half of that is documentation, though.
It's embedded directly in the code comments.
So it's quite a heavily documented repository.
There are over 250 exported functions from the library,
and most of the specification-defined structures
that can be returned from the controller
or sent to it are provided,
as well as all the enumerations for the constant values
and command parameters defined by the spec and various functions for decoding all the fields
defined by the spec. So if you install the library, its header files and linkable objects
will be installed to your system's library path. Both a shared object and a statically linkable archive are provided.
So you have either option depending on how you want your application
to link with it.
And to link with it, you just add the -lnvme compiler option.
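As a quick sanity check that the headers and library landed where the compiler can find them, a minimal program like the following should build; the umbrella header name libnvme.h comes from the project repository, and install locations and packaging details can vary by distribution.

/*
 * build_check.c — a minimal sketch to confirm the installed header and link
 * flags.  The header name (libnvme.h) is taken from the project repository.
 *
 *   gcc build_check.c -o build_check -lnvme          # link the shared object
 *   gcc build_check.c -o build_check -l:libnvme.a    # or the static archive
 */
#include <stdio.h>
#include <libnvme.h>

int main(void)
{
	printf("libnvme header found and library resolved at link time\n");
	return 0;
}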
There are so many functions exported from this.
So the documentation is treated as a pretty high priority.
It's intended to provide a useful way to navigate what's available and how to invoke the functions from your application.
The documentation is provided in both man pages and HTML formats.
And the documentation format leverages the Linux kernel doc.
So it should be familiar if you have prior experience
working with the kernel documentation.
And since the NVMe specification
is an ever-moving target,
sometimes we expect an exported function
may need to add parameters
or change parameters
to match the specification.
And if that does happen,
the maintenance goal for libnvme is to provide symbol versioning so that these exported functions
will continue to work in an application
developed on an older version of the library
if you happen to install a newer library later.
Okay, so now let's look into some of the various layers
provided by the library.
At the lowest layer of the library, we have the base types.
This is where the specification-defined structures, constant values, and decodings for specific
fields are provided.
Whenever the NVMe committee publishes an update to the specification or with new technical
proposals, libnvme intends to be updated to match.
The documentation will also provide cross-links to related functions that use or return a
structure and to other structures and enumerations that reference it.
And this is provided in the hope that it's useful for programmers to navigate related information without having to have a copy of the specification open at all times.
The example here on the right is a screenshot of the documentation's telemetry log.
This is just one of many NVMe-defined log structures.
The telemetry log was defined by the NVMe committee.
It's a generic way to retrieve vendor-specific information about your controller, and it's vital to debugging issues in a production environment.
So it may happen that if you experience a problem with your device, a vendor may request that you send back this log, and they give it to their firmware developers to analyze as part of the process to fix it.
And I'll continue to refer back to this log in later examples to help drive the story for this library.
So now that we have these NVMe specification defined types, it would be great to provide a convenient way to retrieve them.
The driver provides this capability with those various ioctls we talked about earlier through their NVMe pass-through commands.
The library provides parameterized functions for each possible ioctl type, and these are there to help set up the kernel's ioctl-specific structures.
The one driver ioctl that the library does not support is that old submit IO ioctl that we talked about earlier.
This ioctl has largely been replaced by the more generic IO command ioctl, and I expect the kernel will eventually deprecate submit IO.
So we're not going to implement it in this library.
The pass-through interfaces that we do use, they're quite flexible. These can
be used to send just about any possible NVMe command and arbitrary payload. So it's generic
enough at this level that it should continue to be forward compatible for any future additions
or any vendor-specific command. But using this particular interface is not very coder-friendly.
The example below is the prototype for the base admin command library function.
And as you can see, it has a lot of rather opaque-looking parameters.
So a developer would pretty much be required to have the specs open in front of them
in order to decode which bits and bytes they need to set and which dwords.
Otherwise, they wouldn't know how to craft the desired command.
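For reference, the base admin pass-through function has roughly the following shape. This is reconstructed from memory of the library headers rather than copied from the slide, so treat the exact parameter names and ordering as approximate.

/*
 * Approximate shape of the base admin pass-through wrapper.  Check the
 * installed libnvme headers for the authoritative signature.
 */
int nvme_admin_passthru(int fd, __u8 opcode, __u8 flags, __u16 rsvd,
			__u32 nsid, __u32 cdw2, __u32 cdw3,
			__u32 cdw10, __u32 cdw11, __u32 cdw12,
			__u32 cdw13, __u32 cdw14, __u32 cdw15,
			__u32 data_len, void *data,
			__u32 metadata_len, void *metadata,
			__u32 timeout_ms, __u32 *result);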
So since parameters at this level
don't really provide a convenient programming interface
to know how to use it,
another layer specific to command opcodes
is going to make it just a bit more convenient.
So for every NVMe opcode the specification has defined, or at least most of them, the library exports more functions specific to those opcodes.
This should hopefully save the developer some round trips to the specs so they can instead focus on their application.
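As a rough example of that layer in use, an Identify Controller call reduces to something like the following. The helper name nvme_identify_ctrl() and the struct nvme_id_ctrl type are recalled from the library headers, so verify them against your installed version; /dev/nvme0 is only an example handle.

/*
 * A sketch of the opcode-specific layer in use.  nvme_identify_ctrl() and
 * struct nvme_id_ctrl are assumed from libnvme's headers; verify against
 * your installed version.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <libnvme.h>

int main(void)
{
	struct nvme_id_ctrl ctrl;
	int fd, err;

	fd = open("/dev/nvme0", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* One call replaces the opcode/CNS/buffer plumbing shown earlier */
	err = nvme_identify_ctrl(fd, &ctrl);
	if (err)
		fprintf(stderr, "identify failed: %d\n", err);
	else
		printf("Model: %.40s  Firmware: %.8s\n", ctrl.mn, ctrl.fr);

	close(fd);
	return 0;
}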
There are some NVMe operations, though, for which the library does not provide these sorts of convenient functions. I had mentioned earlier, this library
is intended to work with the driver, and we don't want to provide actions that are harmful to the
driver's operation. For example, we don't want to provide a way to tear down resources the driver
is actively using. So an action like deleting a queue will most likely just result in confusing the driver or induce errors on the device side.
The abort command is another one.
It's an odd one, and I'm frequently asked why the library doesn't provide it. So I'll just mention it here.
We can't really support it here because the driver is responsible for assigning the command identifiers as well as the queue that a command is dispatched to. Both of those components are required in order to submit a successful abort command, and neither of those is communicated back to user space. So since that's the case, there really isn't a good way to craft an abort command from there. Some of the other commands that are owned by the driver include asynchronous events,
keep alive timeouts, and the connection commands.
Those are all owned within the driver.
So we're not providing convenient functions for them in the library.
But at the same time, the library is not going to police what you send.
So while there's no exported convenience functions for these,
you could still submit whatever you want through the generic pass-through interface
that I showed on the previous slide.
But in many cases, if you were to do something like that,
it's just going to confuse some other part of the system,
and a controller reset is likely going to happen when those errors occur.
But you could still use it for error injection or testing
other broken scenarios or maybe triggering an analyzer snapshot if that's something you wanted
to do. Coming back to the functions this library does export, this is just another example
of a spec-defined command for retrieving log pages. The following is the library's provided API
for the generic NVMe log command.
It's a little bit more coder-friendly
than what we had before.
There are fewer parameters,
and the types are now specific to this command,
so we have a little more type safety.
And for some of the commands, though,
the opcode-specific function is about as simple as we can make it from a developer's perspective.
And we don't need to go any further in the specifications to know how to use it.
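Just to underline what that helper saves you, here is roughly what you would otherwise assemble by hand through the raw admin pass-through, following the Get Log Page dword layout from the spec; the helper hides all of this packing. The function name get_log_raw() is just a label for this sketch.

/*
 * What the log helper hides: building a Get Log Page command (opcode 0x02)
 * by hand from the spec-defined dword layout and pushing it through the raw
 * admin pass-through ioctl.  Field packing follows NVMe 1.4; fd is an open
 * controller handle.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

static int get_log_raw(int fd, __u8 lid, __u32 nsid, void *buf, __u32 len)
{
	__u32 numd = (len >> 2) - 1;		/* dword count, zero-based */

	struct nvme_admin_cmd cmd = {
		.opcode   = 0x02,			/* Get Log Page */
		.nsid     = nsid,
		.addr     = (__u64)(uintptr_t)buf,
		.data_len = len,
		/* CDW10: LID [7:0], LSP, RAE [15], NUMDL [31:16] */
		.cdw10    = lid | ((numd & 0xffff) << 16),
		/* CDW11: NUMDU [15:0], log specific identifier [31:16] */
		.cdw11    = numd >> 16,
		/* CDW12/CDW13: log page offset, for reading in pieces */
	};

	return ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
}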
But as we saw earlier, many types of commands have subtypes, and that includes this log page.
So another layer of helpers is going to be most helpful. So with that in mind, the library exports even more functions for all the command subtype variations.
These include all the log page types that I just mentioned, as well as the various identifications, which we said there are 18 of them, and all the tunable features and the directives.
ZNS provides various zone management commands as well.
So that's one of the newer additions.
So there are many functions at this level.
Currently a little over 150 are provided.
And continuing with our controller's telemetry log page, this is an example of one of the library functions for retrieving that specific
type of log. And we now have even fewer parameters to consider. And this is looking pretty easy now
for a developer to interact with. And while this looks pretty simple, this isn't quite the end of
it. It turns out that some commands require a more complex sequence than just a single command to successfully complete.
So we have another layer that we can consider using if needed.
So for these NVMe actions that can't necessarily be completed with a single command,
libNVMe is going to provide even more convenient functions for those transactions.
There are several types of NVMe commands
that need multiple steps to successfully complete.
For example, a firmware download may be required in multiple steps.
Another example with our controller telemetry log,
this is also going to need multiple pieces to complete.
And the reason this level of convenience is provided
is because in some cases, an incorrect
sequence may result in a torn or incomplete transfer. And this can be frustrating if you
experience something like that. So for example, if you're debugging one of those very difficult
to reproduce problems in your production environment, you may send the requested logs
back to your vendor. And then you'll be unhappy to hear that the log you sent wasn't very useful
because it's either incomplete or corrupted.
So for some of these types of sequences,
libnvme provides some utilities to help manage that.
So finishing up with our controller telemetry log,
we finally arrived at the highest level function
the library provides
for getting one. It's now just down to the two parameters, and it really couldn't be any easier
to use. But just to go into the sequence for how it works, the telemetry log starts out with waiting
for the event that indicates a controller's log is available. The specification tells us that the
log data should be latched and unchanging until the host releases that latch by rearming the event.
So upon seeing the event, the API will first figure out how large the log is and then read each section in sequence into a buffer it allocates for you.
So the partial reads and a loop are required because the total log size typically exceeds the driver's single command maximum transfer size.
So this loop is pretty crucial.
And then once it reaches the end of the log,
libnvme verifies the generation sequence to ensure that nothing interrupted our transfer
and that the log is indeed complete.
And after that, the function will rearm the event
so that a new log sequence can be created by the controller in the future if needed.
The log is then returned by this function,
and your application can then save it into a file
or send it off to the vendor for their analysis.
And this pretty much concludes our look at the pass-through command support
provided by libnvme, so let's move on to other components. So the NVMe over Fabrics, it's gotten a lot of interest in recent years and Linux has
been leading the way in open source development for it. This driver provides both a host and a
target driver for all the supported transports, but this library is really only concerned with the host side.
The fabrics component of the NVMe host driver provides a single entry point to the user
to discover and initiate connections to targets,
and that entry point is provided by this special device handle called /dev/nvme-fabrics.
And rather than providing an ioctl interface for programmers,
the user API for this special handle
takes on the form of writing magic strings into it
and then reading back the result of the action you issued.
Those magic strings have this special form
of this key value pair for all the options
that a connection might
use. And these options are not particularly well documented by the kernel. And occasionally new
options are added by the driver. And so this can be a bit difficult for a user to keep up with. But if you do happen to know the options you want to use, it's just
something you can easily invoke with an echo command from the shell, but that's not particularly
coder-friendly to another application. So libnvme provides parameterized functions that generate and
submit these magic strings for you. And if you use the library to connect to a discovery controller,
the library can then recursively discover and
connect targets. If the driver ever needs to add new options, the maintenance goal of libnvme
is to be updated to match what the Linux kernel provides at all times.
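For context, here is a rough sketch of the raw interface those functions wrap: writing a key=value option string to /dev/nvme-fabrics and reading back the result. It assumes a loopback target has already been configured on the target side through the kernel's nvmet facility, otherwise the write will simply fail, and the exact reply format may vary by kernel version.

/*
 * A sketch of the raw fabrics interface that libnvme wraps: write a
 * key=value option string to /dev/nvme-fabrics to create a connection to
 * the well-known discovery subsystem over the loop transport, then read
 * back the result.  Assumes a loop target is configured via nvmet.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *opts = "transport=loop,"
			   "nqn=nqn.2014-08.org.nvmexpress.discovery";
	char reply[256] = "";
	int fd;

	fd = open("/dev/nvme-fabrics", O_RDWR);
	if (fd < 0) {
		perror("open /dev/nvme-fabrics");
		return 1;
	}

	if (write(fd, opts, strlen(opts)) < 0) {
		perror("connect");
		close(fd);
		return 1;
	}

	/* The driver reports the controller it created, e.g. "instance=0,cntlid=..." */
	if (read(fd, reply, sizeof(reply) - 1) > 0)
		printf("created: %s\n", reply);

	close(fd);
	return 0;
}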
And the following here is just a simple example, a little C program for getting the entire
discovery log for the
local loopback targets.
You might recall, as I mentioned before, the
loop interface is purely a
locally defined software fabrics target. So the parameters are
quite simple compared to, say,
a remote one. Those would require
the target name and address.
The loopback one is a bit
simpler. So I'm just including it here because it fits on one slide.
And the code is just super simple.
The libnvme parts are all highlighted in the blue text,
and essentially all we're doing is specifying
the well-known NVMe qualified name of a discovery controller
and then specifying to the initiator
that we're looking for the local loop transport.
If the local host has defined a host qualified name, we'll use that, or it'll just be null.
And then we'll request the library add the discovery controller based on these configuration
parameters.
And once we have that controller, we can read its discovery log page.
From there, if we wanted to, we could connect to each of those individual targets
defined in that log. But this is just a really simple, silly example. So we don't actually do
anything with it. We've just retrieved it and then we release it all and clean up.
The last interface that libnvme interacts with is the driver's exported sysfs attributes. The kernel
uses sysfs for drivers and modules to report all sorts of characteristics about the current state
of things within the operating system. For nvme, this is just used to report information about
subsystems, the controllers in those subsystems, and the namespaces attached to those controllers,
and finally information about individual paths to those namespaces.
So as I mentioned earlier, the complexity of what NVMe exports here
has gotten just a little bit more confusing when we introduced the multipathing.
So libNVMe provides methods to scan the NVMe hierarchy,
search and filter the topology, retrieve attributes, and link the sysfs entries to the device nodes.
And if you wanted to, you can use those links to submit commands through.
I mentioned filtering, and some examples of that,
they can include, say, like you only want to see devices from one particular vendor,
or maybe you want to see only targets on a specific
transport, like you want to see your RDMA transports or maybe only your local PCIe.
So those are just some of the examples that the sysfs part of this library provides.
The driver sysfs interface doesn't change that often, so it's probably not going to be too much
maintenance in libnvme to traverse it, but when it does change, libnvme will provide updates as needed
to match the kernel side.
And the output snippet here,
it's just a tree representation of some of the information
that this component of LibNVME retrieves.
It's just showing the hierarchy of NVMe subsystems
and devices present in this particular system.
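To make that concrete, here is a small sketch of the kind of sysfs walk this part of the library automates, reading a couple of attributes for each controller the driver registered under /sys/class/nvme; libnvme layers the subsystem, namespace, and path relationships on top of plumbing like this.

/*
 * A sketch of the sysfs walk this part of the library automates: list each
 * controller registered under /sys/class/nvme and read a couple of its
 * attributes directly.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static void read_attr(const char *ctrl, const char *attr, char *out, size_t len)
{
	char path[512];
	FILE *f;

	out[0] = '\0';
	snprintf(path, sizeof(path), "/sys/class/nvme/%s/%s", ctrl, attr);
	f = fopen(path, "r");
	if (!f)
		return;
	if (fgets(out, len, f))
		out[strcspn(out, "\n")] = '\0';	/* strip trailing newline */
	fclose(f);
}

int main(void)
{
	DIR *dir = opendir("/sys/class/nvme");
	struct dirent *d;

	if (!dir) {
		perror("/sys/class/nvme");
		return 1;
	}

	while ((d = readdir(dir))) {
		char model[128], transport[32];

		if (strncmp(d->d_name, "nvme", 4))
			continue;	/* skip ".", "..", and anything else */

		read_attr(d->d_name, "model", model, sizeof(model));
		read_attr(d->d_name, "transport", transport, sizeof(transport));
		printf("%s: %s (%s)\n", d->d_name, model, transport);
	}

	closedir(dir);
	return 0;
}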
Finally, just for fun, I mentioned earlier
that libnvme can be used to dispatch pass-through IO
directly into the driver,
and it bypasses the file systems and block stack
in that case.
And while I don't recommend bypassing them in general,
I thought it was a bit fun to see what it would take
to add an IO engine to the flexible IO tester.
If you're not familiar with FIO, this
is one of the best tools to exercise and benchmark storage on the Linux operating system, and it
supports many various IO engines. Most of the IO engine implementations, they're not too
complicated to implement, but libnvme's is especially simple. It's so simple that almost
the entirety of it can fit on a single slide.
It's right here.
And the only parts missing from this
are just registering these functions.
It's just a few more lines of code that didn't quite fit.
The high-level gist of this is that it just finds
and opens the NVMe device,
and then based on the requested action,
it will construct read or write commands
at the requested offsets and sizes.
You don't even need to know about the low-level device formats in order to use it.
It's a pretty simple interface to use.
And I thought it was kind of a cute example to show how simple the library makes IO access.
So while there is a whole lot of functionality packed into this library,
there's still a bit more work to do.
There are several moderately sized features from the spec
that have not been implemented yet.
Those include the persistent event logs
and some of the management interface, NVMe-MI, features.
I mentioned earlier the key value command sets.
As of now, it still doesn't have kernel support,
but perhaps we can add support for it in libnvme anyway.
There are also still a few features that are implemented in the library
that are actually not quite fully tested.
Hardware that supports some of the less common features from the spec can be difficult to source.
So any assistance on that aspect would actually be much appreciated.
We still try to do a lot of testing under emulation, and QEMU provides that.
But we definitely prefer to verify the features on real hardware.
And as always, the NVMe committee is pushing new features at a pretty rapid cadence.
So LibNVMe will always need to be following that moving target.
So it's going to be an ongoing maintenance goal.
And finally, my next goal for this library is to complete integration with NVMe CLI.
And once that's completed, we should be able to tag the release and request package support with all the major Linux distributions.
And then from there, it can be conveniently installed through their package management
systems.
And I expect that to probably happen over the next few months, maybe the end of the
year.
And then it should be downloadable through apt-get or yum or dnf, whatever your preferred
distribution uses.
And that is really all I have for this discussion right now.
Thank you all for watching.
Please feel free to shoot me a message
or start a discussion on the GitHub source repository
if you're interested.
Thank you.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing
list by sending an email to developers-subscribe at snia.org. Here you can ask questions and
discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit