Storage Developer Conference - #140: Introduction to libnvme
Episode Date: February 11, 2021...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNEA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 140.
Hi, everyone. Thank you all for joining this presentation today. I hope everyone is having a safe and informative time here at our virtual SDC this year.
Please let me introduce myself. My name is Keith Busch. I am a researcher at Western Digital and have been here for a little over a year now.
I have been working on NVM Express for a little over 10 years now, though, and I occasionally contribute to the
specification and committee processes, but I am mainly focused on the host software enabling for
this protocol, specifically for the Linux operating system. I am one of the co-maintainers for the
Linux kernel's NVMe driver, and I also maintain and contribute to various other Linux NVMe-related
projects. The most frequently used of those include the QEMU-emulated NVMe controller,
and that is a really nice way to test NVMe over PCIe without any actual hardware.
The other project I contribute to is called NVMe CLI,
and that is a shell utility for managing and querying NVMe devices. Today I'd like to introduce
one of the other projects I've been working on. It's a Linux library for NVM Express, aptly named libnvme. Now, a library for Linux NVMe is probably a bit overdue. This would have been a much smaller project had we done it in the earlier days, back when NVMe 1.0 was introduced and the Linux kernel was still on version 3. If we had, it would have been a very minor effort to release it back then, and it would just have been a maintenance task as things progressed throughout the years. But we didn't develop a common library then, and as the standard grew and became more complex, so did the Linux kernel, and the burden to create a library has grown with it. So just to set up what this library aims to help out with, we'll do a quick comparison of where we started and where we are now with both the standard and the Linux driver that implements it.
Then we can do a quick tour of the library itself and see how it may be useful for developers. NVM Express was publicly released in 2011,
and the Linux driver supporting it was provided at about the same time,
and it was first integrated in the 3.3 kernel release.
Back then, NVMe was often described as being the streamlined
and lightweight storage interface,
and it didn't have any of the baggage of the older
and more established protocols that came before it. And with this protocol being freely available
in an open specification that any vendor could implement, this provided a great opportunity to
develop common software solutions in the open source world. And driver development has definitely
converged on two common sources.
The Linux kernel provides just a single NVMe driver used by all vendors, and it works with all types of devices from direct-attached clients and enterprise server devices to the external NVMe over Fabrics arrays.
And just to keep everything working under one driver, the Linux kernel does provide several ways to do vendor-specific things.
But those are typically reserved for quirky devices that are not spec-compliant.
So relying on these is strongly discouraged by the community in favor of actual compliance with the spec.
But having the single driver integrated well within the kernel has actually improved the Linux kernel.
So it's been a very symbiotic relationship.
A couple of examples: NVMe was one of the biggest drivers in making PCIe hotplug as mature as it is today.
Prior to that, it was not very common to see such a feature, and NVMe made it a normal thing.
The block layer improved as well due to NVMe
integration. It introduced several high-performance features like IO polling. We also see the new io_uring interface that was largely motivated by NVMe performance. So while we do have common
software in the kernel space for so many NVMe devices, pushing the boundaries of other software components, we have not really seen that kind of convergence happen in user space.
The software occupying user space often re-implements various subsets of the specification, and those implementations are fragmented.
It's not just the specifications either. The Linux driver has also evolved to become more complex,
and there's been quite a bit of duplication in the software user space
to reach all those features.
So the primary goal for libnvme is to provide a common place
to expose all the NVMe features provided by Linux
and on the devices running in this operating system.
So there are certainly a lot of pieces in the NVMe protocol,
but libnvme is intended to run in user space on Linux.
So it's not really concerned with all the features
that a device maker needs to worry about.
The library's concern is more about what a Linux user can observe.
So we'll focus on those parts of the specification.
Back in the 1.0 days,
NVMe was indeed a very slim minimalist interface. We had exactly one document to refer to for everything you need to know about NVMe, and it targeted exactly one transport interface,
and that was PCI Express. It also had a very small command set to consider, as depicted in
these two columns here.
We had a mere 15 administrative commands and six IO commands to consider.
And many of these admin commands do not even need to be part of a user space library.
And I will get into more details on that in a few more slides.
But in addition to what's shown here, some of these admin commands had subtypes depending on various command parameters that you can give to them. So it's just a little bit more complicated than this.
But it's not a whole lot more to consider either.
We actually had just two types of identifications. We had namespace and controller. We had these three different log pages that were defined by the specification and 12 tunable feature settings. And even then, a lot of these
features were optional. The point though is that if we were to have developed a library back in
these early days, the entire admin command set that we'd be interested in providing fits all on this one slide.
And now we fast forward today where we're beyond NVMe 1.4. And I say beyond 1.4 because while 1.4 is the most recently released version of the NVMe specification as of today, there have been
several large technical proposals that have been published
and that have pushed the specification beyond that, but there has not been a new version
released since then.
But instead of just a single document like we had before that was needed to understand
the protocol, we now have more than five different documents managed by the NVM Express
committees.
This is also going to get even more spread out through some of the NVMe refactoring efforts that we anticipate we'll see when NVMe 2.0 is released in the near future.
That effort aims to group common functionality together, but also split unrelated features
away into their own documents.
So we should probably see even more specifications released around the NVM protocol
when that major version comes out.
But one of the more recent enhancements to NVM Express is the multiple command sets.
These are beyond the more traditional IO commands that we have in the base spec.
One of those is the zoned namespaces, or ZNS. And ZNS, it's pretty
cool stuff and support for these devices has very recently been integrated in the Linux kernel.
And since the Linux kernel supports it, there is support for its unique commands and characteristics
in libnvme as well. And while I would love to talk more about it, there are other talks here at SDC focused on ZNS, so I would recommend checking those out if this topic is of interest to you.
And since I mentioned ZNS, I will just mention here that the key value command set, while published by the NVMe committee, is not currently supported in the Linux kernel, so that particular feature set does not exist in the library either. So that may have to be a topic for another day.
But moving over to the transport, we've also added four more in addition to the existing PCI Express,
and those include RDMA, Fibre Channel, and TCP. And for testing purposes, the other category was provided by the NVM Express committee, and Linux uses that as a loopback target, and that's simply a software-defined local NVMe transport.
The number of commands we're concerned with have also grown.
We've gone from 15 to 27 admin commands, and 6 to 13 IO commands. And just to make it even more difficult,
the individual command sets provide their own operations that we need to consider. So these
include ZNS and key value, as I previously mentioned. So not only has the number of commands
grown, the subtypes of these commands and variants have also grown
significantly. So we had just two possible identifications before, and now we have 18.
Our three log pages have grown to 19, and we now have 30 tunable features up from 12.
We also have these weird fabrics commands that can both send and receive data. It's rather unique to that operation.
And we also have several other Admin commands with their various subtypes,
which include directives, namespace management, and attachment.
And these are just some of the more commonly used capabilities.
There are still even more that are not shown here.
But the point of why I'm bothering to show all this is just to provide a visualization that our once tiny and elegant interface has grown into something so much more, and it's getting bigger all the time.
The one thing we'd like a library to do is provide a common location to define all of these features so that they don't need to be implemented for every NVMe-specific component of software.
Now, it's not just the NVMe protocol and programming interfaces that's gotten so much more complicated either.
The Linux kernel supporting this has also grown in complexity.
So let's just take a little dive to see where we started and where we are at the moment. The NVMe driver was initially very simple, just like the programming interface that we
had.
The namespace for each NVMe device was surfaced up to user land, and access went through the block layer and virtual file system stack.
And we provided a very straightforward handle name called nvmeXnY, where X is the controller instance and Y is the namespace instance. Beyond that simple block interface, the driver also provided
a special ioctl interface for device management through each namespace. The ioctls provided could tell us the unique namespace ID, or you could submit arbitrary admin commands through it.
It also had a special interface to submit IO commands.
But this initial implementation was a bit misguided, and I'll explain a bit more on that in a minute.
But those are really the only entry points and exported handles the initial driver provided. So it's a very fast but lean driver for our NVM interface.
Now the Linux kernel today is exposing quite a bit more to the user and even more entry points.
Now as before, we still have the same interfaces through the block layer and virtual file system, but we also have added more.
For one, we've added six more ioctls in addition to the previous three.
We have this generic IO command interface, and we have additional control ioctls.
I have been occasionally asked what the difference is between the IO command ioctl and the submit IO ioctl.
The submit IO ioctl, as I said, was a bit misguided.
It was parameterized way too specifically to the read and write commands.
But a lot of IO commands just don't align with those parameters.
For example, some commands like dataset management, the flush command, and the persistent reservations just don't have the sort of parameters that read and write require.
So we needed a much more generic and flexible interface.
And this was provided through this new IO command ioctl.
And this was modeled very much after the existing admin command ioctl because of its flexibility. Later though, the NVMe standard
defined additional bits for return data. Initially, the first four bytes of the completion queue entry
was defined for a command to return specific data about that command, but that was later expanded
to the first eight bytes for some commands to use. So we had to provide a new version of both admin and IO ioctls that were capable of returning that data for commands that require it. So now we have these 64-bit versions of admin and IO command ioctls. One example of an IO command that requires the 64-bit version is the ZNS append command.
With that command, the user requests to append data to a zone, and the drive replies with the LBA it was written to.
And the 32-bit version of this command just wouldn't be able to report the LBA for some of the larger capacity ZNS drives.
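For illustration, here is a minimal sketch of what that raw kernel pass-through looks like without any library help, sending an Identify Controller command through NVME_IOCTL_ADMIN_CMD. The device path /dev/nvme0 is just an example, and you typically need root privileges to issue it.

/*
 * A sketch of the kernel's raw admin pass-through ioctl, which is what
 * libnvme wraps for you.  This sends Identify Controller (opcode 0x06,
 * CNS 1) to /dev/nvme0; the device path is only an example.
 *
 * Build: gcc identify_raw.c -o identify_raw
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	static unsigned char data[4096];	/* Identify data is 4096 bytes */
	int fd, ret;

	fd = open("/dev/nvme0", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct nvme_admin_cmd cmd = {
		.opcode   = 0x06,			/* Identify */
		.addr     = (__u64)(uintptr_t)data,
		.data_len = sizeof(data),
		.cdw10    = 1,				/* CNS 1: Identify Controller */
	};

	ret = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
	if (ret < 0)
		perror("NVME_IOCTL_ADMIN_CMD");
	else if (ret)
		fprintf(stderr, "NVMe status: 0x%x\n", ret);
	else
		printf("Model: %.40s\n", (char *)(data + 24));	/* MN lives at byte offset 24 */

	close(fd);
	return 0;
}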
Beyond the ioctl interface, though,
the driver now exports quite a few more user-visible handles.
First, we had very early on discovered
that we needed to expose more than just namespace handles to user space.
We just can't always count on a controller
having a viable namespace for the driver to attach to.
For example, NVMe had introduced namespace management capabilities,
and when that happens, there may not even be a namespace in existence.
So the driver started exporting these special controller handles as these character devices,
and those take on the form of just nvmeX, where X is the controller instance.
Later, we had multi-ported controllers,
and we needed a way to report the relationship
among the different ports of the controller.
So we have these NVMe SysFS interfaces,
and in addition to showing relationships
among different controllers,
they have various different attributes
that can be exported through them.
Later, NVMe over Fabrics became standardized,
and we needed a way to initiate connections to remote targets.
So unlike PCIe targets, which are direct attached to the host processor
and don't need any information for the operating system to bind to them,
the Fabrics targets require the user to tell the driver how to connect to them.
So the driver provides a special NVMe-fabrics handle, and that is what provides the NVMe
discover and initiator capabilities. One last thing I'll mention is related to the multi-pathing.
More recent versions of the NVMe driver provide a very efficient way to handle NVMe subsystems with multiple controllers.
The user-visible result is that you don't see duplicate namespaces for each path.
There are lots of benefits to having that, like not having to stack device mappers to manage those paths,
and you get automatic failover and optimizations.
But a bit of a misstep in our initial implementation was that we broke user expectations with respect
to device names.
With native multipathing, the namespace handles are provided through the subsystems rather
than the controller.
So the name of a namespace inherits the subsystem identifier rather than the controller's.
Users had become accustomed to the namespaces
being related to their controllers.
So for example, if you had a namespace node named nvme0n1,
a user would assume that the parent controller was nvme0.
With native multipathing, there is no such relationship.
So if they do align, it's just purely coincidence.
And unfortunately, many users ended up performing destructive actions to the wrong namespace.
So this all happened despite the topology information being provided in SysFS.
It's just really not the most friendly interface for users to examine, so many of them don't.
So I do feel maybe if we had a library to make sense of all the topology, maybe we could have avoided some of those accidents.
More recently, though, the kernel has developed a more sane approach to the exposed handle names, but the relationship among the paths and the namespaces is not immediately obvious still.
So you need to examine the sysfs hierarchy. So making more sense of that is something that we'd like out of libnvme as well. And those are just some of the pain points. So now I think it's nice to look into libnvme and what it can do for developers. The remainder of this presentation is going to take a fairly high-level look at where libnvme fits in the software stack and a few simple examples. And
we can examine the various layers of the library
with some other concrete examples to help guide the presentation.
So in the software stack, libnvme lives in the user space
in support of nvme features using the Linux kernel driver,
and it sits alongside applications that use the storage.
The library works with the kernel. It doesn't work around it. This is not a user space driver,
so it's not intended to replace normal IO access. And IO should continue to work just as it does
today through the file systems and block stack with your various generic IO interfaces like pread/pwrite, libaio, or io_uring
if you're a little more adventurous.
The NVMe driver does provide a pass-through interface
for arbitrary IO, as I mentioned earlier,
and LibNVMe can make use of that.
So you could use it to directly insert NVMe read
or write commands if you really wanted to.
But the interface is entirely synchronous and it's not necessarily optimized.
So it's mainly provided for device testing and debug purposes
or for when you really want to bypass some of the block optimization paths.
But rather than being intended specifically for IO, the library is mainly for
finding all the devices enumerated by the kernel and figuring out how they're related to each other.
It's also there to communicate with the Fabrics initiator to discover and connect to new targets.
And the biggest part is that it's there to utilize the pass-through interface to send arbitrary admin commands.
It also provides utilities to set up and decode payloads for those commands.
The kernel driver continues to own all the low-level details.
So this is outside LibNVMe now. The driver still owns initializing the controller, setting up DMAs for commands,
queuing commands, and handling any low-level errors associated with the device.
As previously mentioned, the Linux NVMe driver exports many artifacts,
and LibNVMe interfaces with all of them.
First on the left, all the NVMe controllers can be managed through
this library. It just connects through those different device nodes, and that's where you
can submit your administrative commands. libnvme can also match up those device handles with their
sysfs interface side, and from there it can report information about the subsystem and other
controllers in the subsystem, what namespaces are accessible through any of those controllers and whether those paths are optimal or failover and various other attributes about that.
Moving over to the next handle, the NVMe Fabrics interface.
This can be a bit tricky to work with, which we'll go into a little bit more later. But this is used for initiating new connections,
and this library provides a simple interface for configuring that.
And finally, we have the special directory /etc/nvme.
That's actually not something the driver provides.
It's instead an artifact installed by other applications
for configuring NVMe host and remote targets
that have been saved. And libNVMe can decode that special directory and help set up connections
from what's in there. Much of everything I've described so far, there's actually quite a bit
of overlap with another Linux user space program that I mentioned earlier, NVMe CLI.
So that's already provided there.
What is libnvme doing?
Well, NVMe CLI, it's strictly a command line utility, and it performs just a single user requested action per invocation.
So that utility is more about taking user input and formatting output.
It's mainly concerned with interacting with the users, and it's not so much about interacting with other applications.
So if you have your own NVMe-centric application, NVMe-CLI is probably not going to be very useful.
But libnvme can easily be a drop-in replacement for NVMe CLI's backend.
And it's actually something I'd very much like to complete in the near future.
And that way, NVMe CLI can focus mainly on the user-facing side, and libnvme can take on
all the responsibilities for interacting with the kernel and providing a coder-friendly API to other applications.
So let's just take a look at the current snapshot of this project. The repository for this library, it is open source on GitHub, and it's provided at this link here.
It is written in C, but it also integrates well with C++.
It is provided with an LGPL license.
So any changes to this library
will continue to be open source,
but proprietary software may link with it
without concern of any license contamination.
It's currently just a bit over 20,000 lines of code.
I would just say over half of that is documentation, though.
It's embedded directly in the code comments.
So it's quite a heavily documented repository.
There are over 250 exported functions from the library,
and most of the specification-defined structures
that can be returned from the controller
or sent to it are provided,
as well as all the enumerations for the constant values
and command parameters defined by the spec and various functions for decoding all the fields
defined by the spec. So if you install the library, its header files and linkable objects
will be installed to your system's library path. Both a shared object and a statically linkable archive are provided.
So you have either option depending on how you want your application
to link with it.
And to link with it, you just add the -lnvme compiler option.
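As a quick sanity check that the headers and library landed where the compiler can find them, a minimal program like the following should build; the umbrella header name libnvme.h comes from the project repository, and install locations and packaging details can vary by distribution.

/*
 * build_check.c — a minimal sketch to confirm the installed header and link
 * flags.  The header name (libnvme.h) is taken from the project repository.
 *
 *   gcc build_check.c -o build_check -lnvme          # link the shared object
 *   gcc build_check.c -o build_check -l:libnvme.a    # or the static archive
 */
#include <stdio.h>
#include <libnvme.h>

int main(void)
{
	printf("libnvme header found and library resolved at link time\n");
	return 0;
}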
There are so many functions exported from this.
So the documentation is treated as a pretty high priority.
It's intended to provide a useful way to navigate what's available and how to invoke the functions from your application.
The documentation is provided in both man pages and HTML formats.
And the documentation format leverages the Linux kernel doc.
So it should be familiar if you have prior experience
working with the kernel documentation.
And since the NVMe specification
is an ever-moving target,
sometimes we expect an exported function
may need to add parameters
or change parameters
to match the specification.
And if that does happen,
the maintenance goal for libnvme is to provide symbol versioning so that these exported functions
will continue to work in an application
developed on an older version of the library
if you happen to install a newer library later.
Okay, so now let's look into some of the various layers
provided by the library.
At the lowest layer of the library, we have the base types.
This is where the specification-defined structures, constant values, and decodings for specific
fields are provided.
Whenever the NVMe committee publishes an update to the specification or with new technical
proposals, libnvme intends to be updated to match.
The documentation will also provide cross-links to related functions that use or return a
structure and to other structures and enumerations that reference it.
And this is provided in the hope that it's useful for programmers to navigate related information without having to have a copy of the specification open at all times.
The example here on the right is a screenshot of the documentation's telemetry log.
This is just one of many NVMe-defined log structures.
The telemetry log was defined by the NVMe committee.
It's a generic way to retrieve vendor-specific information about your controller, and it's vital to debugging issues in a production environment.
So it may happen that if you experience a problem with your device, a vendor may request that you send back this log, and they give it to their firmware developers to analyze as part of the process to fix it.
And I'll continue to refer back to this log in later examples to help drive the story for this library.
So now that we have these NVMe specification defined types, it would be great to provide a convenient way to retrieve them.
The driver provides this capability with those various ioctls we talked about earlier through their NVMe pass-through commands.
The library provides parameterized functions for each possible ioctl type, and these are there to help set up the kernel's ioctl-specific structures.
The one driver ioctl that the library does not support is that old submit IO ioctl that we talked about earlier.
This ioctl has largely been replaced by the more generic IO command ioctl, and I expect the kernel will eventually deprecate submit IO.
So we're not going to implement it in this library.
The pass-through interfaces that we do use, they're quite flexible. These can
be used to send just about any possible NVMe command and arbitrary payload. So it's generic
enough at this level that it should continue to be forward compatible for any future additions
or any vendor-specific command. But using this particular interface is not very coder-friendly.
The example below is the prototype for the base admin command library function.
And as you can see, it has a lot of rather opaque-looking parameters.
So a developer would pretty much be required to have the specs open in front of them
in order to decode which bits and bytes they need to set and which dwords.
Otherwise, they wouldn't know how to craft the desired command.
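For reference, the base admin pass-through function has roughly the following shape. This is reconstructed from memory of the library headers rather than copied from the slide, so treat the exact parameter names and ordering as approximate.

/*
 * Approximate shape of the base admin pass-through wrapper.  Check the
 * installed libnvme headers for the authoritative signature.
 */
int nvme_admin_passthru(int fd, __u8 opcode, __u8 flags, __u16 rsvd,
			__u32 nsid, __u32 cdw2, __u32 cdw3,
			__u32 cdw10, __u32 cdw11, __u32 cdw12,
			__u32 cdw13, __u32 cdw14, __u32 cdw15,
			__u32 data_len, void *data,
			__u32 metadata_len, void *metadata,
			__u32 timeout_ms, __u32 *result);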
So since parameters at this level
don't really provide a convenient programming interface
to know how to use it,
another layer specific to command opcodes
is going to make it just a bit more convenient.
So for every NVMe opcode the specification has defined, or at least most of them, the library exports more functions specific to those opcodes.
This should hopefully save the developer some round trips to the specs so they can instead focus on their application.
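As a rough example of that layer in use, an Identify Controller call reduces to something like the following. The helper name nvme_identify_ctrl() and the struct nvme_id_ctrl type are recalled from the library headers, so verify them against your installed version; /dev/nvme0 is only an example handle.

/*
 * A sketch of the opcode-specific layer in use.  nvme_identify_ctrl() and
 * struct nvme_id_ctrl are assumed from libnvme's headers; verify against
 * your installed version.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <libnvme.h>

int main(void)
{
	struct nvme_id_ctrl ctrl;
	int fd, err;

	fd = open("/dev/nvme0", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* One call replaces the opcode/CNS/buffer plumbing shown earlier */
	err = nvme_identify_ctrl(fd, &ctrl);
	if (err)
		fprintf(stderr, "identify failed: %d\n", err);
	else
		printf("Model: %.40s  Firmware: %.8s\n", ctrl.mn, ctrl.fr);

	close(fd);
	return 0;
}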
There are some NVMe operations, though, for which the library does not provide these sorts of convenient functions. I had mentioned earlier, this library
is intended to work with the driver, and we don't want to provide actions that are harmful to the
driver's operation. For example, we don't want to provide a way to tear down resources the driver
is actively using. So an action like deleting a queue will most likely just result in confusing the driver or induce errors on the device side.
The abort command is another one.
It's an odd one, and I'm frequently asked why the library doesn't provide it. So I'll just mention it here.
We can't really support it here because the driver is responsible for assigning the command identifiers as well as the queue that a command is dispatched to. Both of those components are required in order to submit a successful abort command, and neither of those is communicated back to user space. So since that's the case, there really isn't a good way to craft an abort command from there. Some of the other commands that are owned by the driver include asynchronous events,
keep alive timeouts, and the connection commands.
Those are all owned within the driver.
So we're not providing convenient functions for them in the library.
But at the same time, the library is not going to police what you send.
So while there's no exported convenience functions for these,
you could still submit whatever you want through the generic pass-through interface
that I showed on the previous slide.
But in many cases, if you were to do something like that,
it's just going to confuse some other part of the system,
and a controller reset is likely going to happen when those errors occur.
But you could still use it for error injection or testing
other broken scenarios or maybe triggering an analyzer snapshot if that's something you wanted
to do. Coming back to the functions this library does export, this is just another example
of a spec-defined command for retrieving log pages. The following is the library's provided API
for the generic NVMe log command.
It's a little bit more coder-friendly
than what we had before.
There are fewer parameters,
and the types are now specific to this command,
so we have a little more type safety.
And for some of the commands, though,
the opcode-specific function is about as simple as we can make it from a developer's perspective.
And we don't need to go any further in the specifications to know how to use it.
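Just to underline what that helper saves you, here is roughly what you would otherwise assemble by hand through the raw admin pass-through, following the Get Log Page dword layout from the spec; the helper hides all of this packing. The function name get_log_raw() is just a label for this sketch.

/*
 * What the log helper hides: building a Get Log Page command (opcode 0x02)
 * by hand from the spec-defined dword layout and pushing it through the raw
 * admin pass-through ioctl.  Field packing follows NVMe 1.4; fd is an open
 * controller handle.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

static int get_log_raw(int fd, __u8 lid, __u32 nsid, void *buf, __u32 len)
{
	__u32 numd = (len >> 2) - 1;		/* dword count, zero-based */

	struct nvme_admin_cmd cmd = {
		.opcode   = 0x02,			/* Get Log Page */
		.nsid     = nsid,
		.addr     = (__u64)(uintptr_t)buf,
		.data_len = len,
		/* CDW10: LID [7:0], LSP, RAE [15], NUMDL [31:16] */
		.cdw10    = lid | ((numd & 0xffff) << 16),
		/* CDW11: NUMDU [15:0], log specific identifier [31:16] */
		.cdw11    = numd >> 16,
		/* CDW12/CDW13: log page offset, for reading in pieces */
	};

	return ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
}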
But as we saw earlier, many types of commands have subtypes, and that includes this log page.
So another layer of helpers is going to be most helpful. So with that in mind, the library exports even more functions for all the command subtype variations.
These include all the log page types that I just mentioned, as well as the various identifications, which we said there are 18 of them, and all the tunable features and the directives.
ZNS provides various zone management commands as well.
So that's one of the newer additions.
So there are many functions at this level.
Currently a little over 150 are provided.
And continuing with our controller's telemetry log page, this is an example of one of the library functions for retrieving that specific
type of log. And we now have even fewer parameters to consider. And this is looking pretty easy now
for a developer to interact with. And while this looks pretty simple, this isn't quite the end of
it. It turns out that some commands require a more complex sequence than just a single command to successfully complete.
So we have another layer that we can consider using if needed.
So for these NVMe actions that can't necessarily be completed with a single command,
libNVMe is going to provide even more convenient functions for those transactions.
There are several types of NVMe commands
that need multiple steps to successfully complete.
For example, a firmware download may be required in multiple steps.
Another example with our controller telemetry log,
this is also going to need multiple pieces to complete.
And the reason this level of convenience is provided
is because in some cases, an incorrect
sequence may result in a torn or incomplete transfer. And this can be frustrating if you
experience something like that. So for example, if you're debugging one of those very difficult
to reproduce problems in your production environment, you may send the requested logs
back to your vendor. And then you'll be unhappy to hear that the log you sent wasn't very useful
because it's either incomplete or corrupted.
So for some of these types of sequences,
libnvme provides some utilities to help manage that.
So finishing up with our controller telemetry log,
we finally arrived at the highest level function
the library provides
for getting one. It's now just down to the two parameters, and it really couldn't be any easier
to use. But just to go into the sequence for how it works, the telemetry log starts out with waiting
for the event that indicates a controller's log is available. The specification tells us that the
log data should be latched and unchanging until the host releases that latch by rearming the event.
So upon seeing the event, the API will first figure out how large the log is and then read each section in sequence into a buffer it allocates for you.
So the partial reads and a loop are required because the total log size typically exceeds the driver's single command maximum transfer size.
So this loop is pretty crucial.
And then once it reaches the end of the log,
libnvme verifies the generation sequence to ensure that nothing interrupted our transfer
and that the log is indeed complete.
And after that, the function will rearm the event
so that a new log sequence can be created by the controller in the future if needed.
The log is then returned by this function,
and your application can then save it into a file
or send it off to the vendor for their analysis.
And this pretty much concludes our look at the pass-through command support
provided by libnvme, so let's move on to other components. So the NVMe over Fabrics, it's gotten a lot of interest in recent years and Linux has
been leading the way in open source development for it. This driver provides both a host and a
target driver for all the supported transports, but this library is really only concerned with the host side.
The fabrics component of the NVMe host driver provides a single entry point to the user
to discover and initiate connections to targets,
and that entry point is provided by this special device handle called /dev/nvme-fabrics.
And rather than providing an ioctl interface for programmers,
the user API for this special handle
takes on the form of writing magic strings into it
and then reading back the result of the action you issued.
Those magic strings have this special form
of this key value pair for all the options
that a connection might
use. And these options are not particularly well documented by the kernel. And occasionally new
options are added by the driver. And so this can be a bit difficult for a user to keep up with. But if you do happen to know the options you want to use, it's just
something you can easily invoke with an echo command from the shell, but that's not particularly
coder-friendly to another application. So libnvme provides parameterized functions that generate and
submit these magic strings for you. And if you use the library to connect to a discovery controller,
the library can then recursively discover and
connect targets. If the driver ever needs to add new options, the maintenance goal of libnvme
is to be updated to match what the Linux kernel provides at all times.
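For context, here is a rough sketch of the raw interface those functions wrap: writing a key=value option string to /dev/nvme-fabrics and reading back the result. It assumes a loopback target has already been configured on the target side through the kernel's nvmet facility, otherwise the write will simply fail, and the exact reply format may vary by kernel version.

/*
 * A sketch of the raw fabrics interface that libnvme wraps: write a
 * key=value option string to /dev/nvme-fabrics to create a connection to
 * the well-known discovery subsystem over the loop transport, then read
 * back the result.  Assumes a loop target is configured via nvmet.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *opts = "transport=loop,"
			   "nqn=nqn.2014-08.org.nvmexpress.discovery";
	char reply[256] = "";
	int fd;

	fd = open("/dev/nvme-fabrics", O_RDWR);
	if (fd < 0) {
		perror("open /dev/nvme-fabrics");
		return 1;
	}

	if (write(fd, opts, strlen(opts)) < 0) {
		perror("connect");
		close(fd);
		return 1;
	}

	/* The driver reports the controller it created, e.g. "instance=0,cntlid=..." */
	if (read(fd, reply, sizeof(reply) - 1) > 0)
		printf("created: %s\n", reply);

	close(fd);
	return 0;
}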
And the following here is just a simple example, a little C program for getting the entire
discovery log for the
local loopback targets.
You might recall, as I mentioned before, the
loop interface is purely a
locally defined software fabrics target. So the parameters are
quite simple compared to, say,
a remote one. Those would require
the target name and address.
The loopback one is a bit
simpler. So I'm just including it here because it fits on one slide.
And the code is just super simple.
The libnvme parts are all highlighted in the blue text,
and essentially all we're doing is specifying
the well-known NVMe qualified name of a discovery controller
and then specifying to the initiator
that we're looking for the local loop transport.
If the local host has defined a host qualified name, we'll use that, or it'll just be null.
And then we'll request the library add the discovery controller based on these configuration
parameters.
And once we have that controller, we can read its discovery log page.
From there, if we wanted to, we could connect to each of those individual targets
defined in that log. But this is just a really simple, silly example. So we don't actually do
anything with it. We've just retrieved it and then we release it all and clean up.
The last interface that libnvme interacts with is the driver's exported sysfs attributes. The kernel
uses sysfs for drivers and modules to report all sorts of characteristics about the current state
of things within the operating system. For nvme, this is just used to report information about
subsystems, the controllers in those subsystems, and the namespaces attached to those controllers,
and finally information about individual paths to those namespaces.
So as I mentioned earlier, the complexity of what NVMe exports here
has gotten just a little bit more confusing when we introduced the multipathing.
So libNVMe provides methods to scan the NVMe hierarchy,
search and filter the topology, retrieve attributes, and link the sysfs entries to the device nodes.
And if you wanted to, you can use those links to submit commands through.
I mentioned filtering, and some examples of that,
they can include, say, like you only want to see devices from one particular vendor,
or maybe you want to see only targets on a specific
transport, like you want to see your RDMA transports or maybe only your local PCIe.
So those are just some of the examples that the sysfs part of this library provides.
The driver sysfs interface doesn't change that often, so it's probably not going to be too much
maintenance in libnvme to traverse it, but when it does change, libnvme will provide updates as needed
to match the kernel side.
And the output snippet here,
it's just a tree representation of some of the information
that this component of LibNVME retrieves.
It's just showing the hierarchy of NVMe subsystems
and devices present in this particular system.
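To make that concrete, here is a small sketch of the kind of sysfs walk this part of the library automates, reading a couple of attributes for each controller the driver registered under /sys/class/nvme; libnvme layers the subsystem, namespace, and path relationships on top of plumbing like this.

/*
 * A sketch of the sysfs walk this part of the library automates: list each
 * controller registered under /sys/class/nvme and read a couple of its
 * attributes directly.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static void read_attr(const char *ctrl, const char *attr, char *out, size_t len)
{
	char path[512];
	FILE *f;

	out[0] = '\0';
	snprintf(path, sizeof(path), "/sys/class/nvme/%s/%s", ctrl, attr);
	f = fopen(path, "r");
	if (!f)
		return;
	if (fgets(out, len, f))
		out[strcspn(out, "\n")] = '\0';	/* strip trailing newline */
	fclose(f);
}

int main(void)
{
	DIR *dir = opendir("/sys/class/nvme");
	struct dirent *d;

	if (!dir) {
		perror("/sys/class/nvme");
		return 1;
	}

	while ((d = readdir(dir))) {
		char model[128], transport[32];

		if (strncmp(d->d_name, "nvme", 4))
			continue;	/* skip ".", "..", and anything else */

		read_attr(d->d_name, "model", model, sizeof(model));
		read_attr(d->d_name, "transport", transport, sizeof(transport));
		printf("%s: %s (%s)\n", d->d_name, model, transport);
	}

	closedir(dir);
	return 0;
}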
Finally, just for fun, I mentioned earlier
that libnvme can be used to dispatch pass-through IO
directly into the driver,
and it bypasses the file systems and block stack
in that case.
And while I don't recommend bypassing them in general,
I thought it was a bit fun to see what it would take
to add an IO engine to the flexible IO tester.
If you're not familiar with FIO, this
is one of the best tools to exercise and benchmark storage on the Linux operating system, and it
supports many various IO engines. Most of the IO engine implementations, they're not too
complicated to implement, but libnvme's is especially simple. It's so simple that almost
the entirety of it can fit on a single slide.
It's right here.
And the only parts missing from this
are just registering these functions.
It's just a few more lines of code that didn't quite fit.
The high-level gist of this is that it just finds
and opens the NVMe device,
and then based on the requested action,
it will construct read or write commands
at the requested offsets and sizes.
You don't even need to know about the low-level device formats in order to use it.
It's a pretty simple interface to use.
And I thought it was kind of a cute example to show how simple the library makes IO access.
So while there is a whole lot of functionality packed into this library,
there's still a bit more work to do.
There are several moderately sized features from the spec
that have not been implemented yet.
Those include the persistent event logs
and some of the management interface, NVMe-MI, features.
I mentioned earlier the key value command sets.
As of now, it still doesn't have kernel support,
but perhaps we can add support for it in libnvme anyway.
There are also still a few features that are implemented in the library
that are actually not quite fully tested.
Hardware that supports some of the less common features from the spec can be difficult to source.
So any assistance on that aspect would actually be much appreciated.
We still try to do a lot of testing under emulation, and QEMU provides that.
But we definitely prefer to verify the features on real hardware.
And as always, the NVMe committee is pushing new features at a pretty rapid cadence.
So LibNVMe will always need to be following that moving target.
So it's going to be an ongoing maintenance goal.
And finally, my next goal for this library is to complete integration with NVMe CLI.
And once that's completed, we should be able to tag the release and request package support with all the major Linux distributions.
And then from there, it can be conveniently installed through their package management
systems.
And I expect that to probably happen over the next few months, maybe the end of the
year.
And then it should be downloadable through apt-get or yum or dnf, whatever your preferred
distribution uses.
And that is really all I have for this discussion right now.
Thank you all for watching.
Please feel free to shoot me a message
or start a discussion on the GitHub source repository
if you're interested.
Thank you.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing
list by sending an email to developers-subscribe at snia.org. Here you can ask questions and
discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit