Storage Developer Conference - #158: NVMe 2.0 Specifications: The Next Generation of NVMe Technology

Episode Date: December 13, 2021

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast, Episode 158. Hello, my name is Peter Onufryk, and I'm the NVMe Technical Workgroup Chair, and I'm going to describe the NVMe 2.0 specifications, which form the next generation of NVMe technology. Before we dive into details of NVMe 2.0, let's first take a look at NVMe today. Unlike previous storage technologies, where there was one technology for servers, another for client, and yet another for mobile, NVMe technology is used across storage applications. We see NVMe
Starting point is 00:01:10 technology used today in everything from cell phones and tablets to client laptops and desktops, all the way to storage arrays and data centers. The table on the right shows the unit growth for NVMe, and you can see that NVMe has grown from an emerging technology in 2016 to a mainstream technology in 2021. In 2021, we expect over 7 million units shipped into enterprise, over 20 million units shipped into cloud applications, and over 350 million units shipped into client applications. The NVMe architecture has been called the new language of storage since we are now seeing many people design their systems around NVMe technology.
Starting point is 00:01:54 Although NVMe has evolved into much more than we ever imagined when we started, our initial goal in creating NVMe was to create a standard interface for PCIe SSDs. We wanted the same experience as one had with hard drives, where you just plug it in and it worked. As you can see from the chart on this slide, in 2021, more NVMe SSDs will ship than all other SSD interfaces combined. And the dominance of NVMe is projected to continue to grow. So with that introduction to where NVMe is today, let's take a look at how we got here.
Starting point is 00:02:32 We started work on NVMe in 2009. The first spec that was released was the NVMe 1.0 spec, which defined the NVMe storage architecture, command set, and interface for PCIe SSDs. As NVMe adoption grew, we started work in 2014 on extending the NVMe architecture and command set to fabrics beyond PCIe. We published the first NVMe over Fabrics specification in 2016. When we published the NVMe over Fabrics spec, we didn't want to disrupt a growing NVMe PCIe ecosystem, so we made minor changes to the NVMe spec, which continued to largely describe PCIe SSDs, and described the architectural extensions necessary for fabrics in the NVMe over Fabrics specification. Around this time, we also started referring to the NVMe spec as the NVMe base
Starting point is 00:03:25 specification. This achieved our goal of not disrupting the NVMe specification, but it created an unnatural partitioning of functionality across specifications. In 2013, we started work on an NVMe management specification, since there was no good storage standard for out-of-band management. This resulted in the NVMe management interface specification, published in 2015. So as a result of all this, we had three specifications, the NVMe base specification, the NVMe over Fabrics specification, and the NVMe management interface specification, that were all undergoing active development, with new revisions being released.
Starting point is 00:04:08 So with that introduction into NVMe history, let's take a look at where NVMe is headed. As I previously mentioned, our focus was on creating a standard interface for PCIe SSDs. We were focused on the base NVMe architecture and command set. Our goal was to unify PCIe SSDs around a common interface and get the same behavior as one
Starting point is 00:04:31 had with hard drives, where you just plug it in and it works. This required getting an inbox driver into all major operating systems. This work ultimately resulted in the NVMe base specification. As NVMe SSD adoption grew, our focus shifted to scaling the NVMe architecture to fabrics. This resulted in the NVMe over Fabrics specification. With the NVMe over Fabrics specification done and having achieved adoption, our focus is now shifting to two areas. The first area is leveraging the unique capabilities of NVM to enable new storage innovations, such as new command sets and various spec enhancements. The second area is leveraging NVMe into new use cases, such as automotive,
Starting point is 00:05:16 warehouse-scale storage, and computational storage. As we expanded into these new areas, a challenge emerged. How do we maintain stable specifications for mature volume applications while still enabling the rapid innovation that NVMe is known for? As a result of this, we decided to refactor the NVMe specifications. So the big thing in NVMe 2.0 is specification refactoring. A lot of people may expect that NVMe 2.0 means some major new feature. And while we have many new features in NVMe 2.0 that I'll describe later, the big change is a complete restructuring of the way the NVMe specifications are organized. We did this for three reasons.
Starting point is 00:06:00 The first is that we wanted to simplify development of NVMe-based technology. So, for example, if you're designing an NVMe over Fabrics storage device, you now don't need to comb through a bunch of PCIe transport spec information to determine what you need to do. The second reason is that we want to enable rapid innovation while minimizing the impact on broadly deployed solutions. To enable innovation, we understand that some of the new things that we may standardize may not gain widespread adoption. Refactoring the specifications allows us to contain these innovations in their own specifications. If they don't gain adoption, then they're simply specifications that no one reads, and we're not complicating the specifications
Starting point is 00:06:41 for mature volume applications that people rely on. This allows us to take more risk in innovation. Finally, refactoring allows us to create a more maintainable structure where we are only updating the specifications that need to be updated. So, for example, if we need to make an enhancement for PCIe SSDs, we can contain it to the PCIe transport specification. This slide shows the restructuring of the specifications. We took the NVMe base specification, and we broke it into three pieces: the base specification, which only describes the base NVMe architecture and command set; an NVM command set specification, which describes the I/O command set for block storage;
Starting point is 00:07:25 and the PCIe transport specification, which describes all things related to PCIe. We also took the NVMe over Fabrics spec, moved the architectural components into the NVMe base specification, and broke out the individual transports into their own specifications. So there's no longer an NVMe over Fabrics specification. The NVMe management interface specification was largely self-contained, and while we moved a few minor architectural elements around, it remains largely unchanged. This slide shows the NVMe 2.0 family of specifications. We have the base specification. We have a collection of command
Starting point is 00:08:05 set specifications. The NVM command set is what people think of when they think of NVMe. It describes traditional block storage. We also have two new command sets, which I'll briefly describe later: the zoned namespace command set and the key value command set. We also have a collection of transport specifications. We have the NVMe over PCIe transport specification, the NVMe over RDMA transport specification, and the NVMe over TCP transport specification. The NVMe over Fibre Channel transport is defined by T11 and not by the NVMe organization, so it's not listed here. So to summarize, NVMe 2.0 is all about refactoring the NVMe specifications, and the new NVMe 2.0 family of specifications is shown here.
Starting point is 00:08:54 With that introduction to NVMe 2.0 spec refactoring, let's turn our attention to what is technically new in NVMe 2.0. The first big thing that we did in NVMe 2.0 was redefine how multiple I/O command sets are handled in NVMe. NVMe had support for multiple command sets from the beginning. NVMe 1.0 had support for four I/O command sets,
Starting point is 00:09:20 although there was only one command set defined, the NVM command set. If you notice, the controller capabilities register has support for four I/O command sets, but the controller configuration register has a three-bit field, which allows eight command sets to be selected. Zero selects the NVM command set. We didn't like this inconsistency, so in NVMe 1.1, we added support for eight command sets, although again, only one was defined, the NVM command set. Things stayed that way for a while, and then in NVMe 1.4, we introduced the top encoding of the CAP.CSS field
Starting point is 00:10:05 to indicate that only the admin command set was supported or, alternatively, that no I/O command sets were supported. This reduced the number of allowed I/O command sets to seven. Again, up to this point, the only I/O command set that we had was the NVM command set. It's important to note that while we had the foundation for multiple command sets from the beginning in NVMe, we were missing a lot of the key pieces required to actually
Starting point is 00:10:30 implement multiple command sets. In NVMe 2.0, we defined a new mechanism for supporting up to 64 I/O command sets, defined all the pieces necessary to actually support multiple command sets, and defined two new command sets, the zoned namespace command set and the key value command set. Let's take a look at the new command set mechanism in a bit more detail. Associated with each namespace is now a command set. An NVM subsystem may support namespaces associated with different command sets at the same time, and you can mix and match attaching them to controllers in an arbitrary manner. On this slide, we have an NVM subsystem with four controllers and nine namespaces.
Starting point is 00:11:15 Four of the namespaces support command set number one, shown in green. Four of the namespaces support command set number two, shown in blue. And one namespace supports command set number three, shown in purple. As this figure shows, you can have a controller that has all of its namespaces associated with one command set, which effectively means that controller can only use that one command set at that point in time. This is true for controllers zero and one on this slide. Controller zero only supports command set number two, and controller one only supports command set number one. Or you can have a controller that has attached namespaces associated with different
Starting point is 00:11:57 command sets. This is true for controllers two and three. When this occurs, the host can issue commands from different command sets that target these namespaces, and the commands can be interleaved in an arbitrary manner. So there's no notion of a controller being in a command set mode. You can issue a command from command set number one to one namespace and immediately follow it by a command associated with command set number two to another namespace. If you're building an NVM subsystem, you may want to support multiple command sets, but you may not want to allow multiple command sets
Starting point is 00:12:33 at the same time associated with a controller, with commands being interleaved in an arbitrary manner. Since your use case may not require this, and supporting arbitrary interleaving increases implementation and validation complexity, you may want to prohibit this. To allow this, we added a restriction mechanism using the identify I/O command set data structure, which contains multiple command set combinations. Each command set combination is a bit vector with a bit per command set,
Starting point is 00:13:05 which allows you to select the subset of command sets that may be enabled by a controller at a point in time. This restricts the interleaving of commands associated with different command sets by the host. Using the I/O command set profile feature, the host selects the command set combination that it wants to enable on a controller. So this allows you to control implementation complexity. It allows you to do things like the following. At one extreme, you can say, I want to support I/O command sets one, two, and three on my controller, but I only want to have one enabled at a time. At the other extreme, you can say, I want to support all three command sets enabled at the same time, with commands interleaved in an arbitrary manner. You can also do things in between. For example, you can express that you want to allow command sets one and three at the same time, or command set two by itself, but not all three command sets at the same time.
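To make the bit-vector idea concrete, here is a minimal C sketch of how a host might check whether the command sets it wants enabled together are covered by one of the combinations a controller reports. The bit positions, combination values, and function names are illustrative assumptions for this sketch, not the exact identify I/O command set data structure or feature encoding from the specification.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative only: one bit per I/O command set. */
#define CS_NVM (1ULL << 0)
#define CS_KV  (1ULL << 1)
#define CS_ZNS (1ULL << 2)

/* Command set combinations this hypothetical controller reports. */
static const uint64_t supported_combos[] = {
    CS_NVM,            /* NVM command set only             */
    CS_NVM | CS_ZNS,   /* NVM and zoned namespace together */
    CS_KV,             /* key value command set only       */
};

/* Return the index of the first combination that covers every command
 * set the host wants enabled at the same time, or -1 if none does.    */
static int pick_combo(uint64_t wanted)
{
    for (size_t i = 0; i < sizeof(supported_combos) / sizeof(supported_combos[0]); i++)
        if ((supported_combos[i] & wanted) == wanted)
            return (int)i;
    return -1;
}

int main(void)
{
    int idx = pick_combo(CS_NVM | CS_ZNS);
    if (idx >= 0)
        printf("enable combination %d via the I/O command set profile feature\n", idx);
    else
        printf("requested command sets cannot be enabled together\n");
    return 0;
}
```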
Starting point is 00:14:08 This slide shows how a command set is selected when you have a controller that is enabled for simultaneously supporting multiple command sets. Recall that each namespace is associated with exactly one command set. So the way it works is that when a controller decodes a command, it takes the namespace identifier field that selects the namespace operated on by that command and determines the command set with which that namespace is associated. Now that it knows the command set on which the command is operating as determined by the namespace, it uses that information to parse the actual command.
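The decode flow just described can be sketched in a few lines of C. This is a simplified model of a controller's dispatch path with a hypothetical namespace table; it is not taken from any real implementation.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

enum command_set { CS_NVM_SET, CS_KV_SET, CS_ZNS_SET };

/* Hypothetical per-namespace record: each namespace is associated with
 * exactly one I/O command set.                                         */
struct namespace_info {
    uint32_t nsid;
    enum command_set cset;
};

static const struct namespace_info ns_table[] = {
    { 1, CS_NVM_SET }, { 2, CS_ZNS_SET }, { 3, CS_KV_SET },
};

/* Controller-side sketch: use the NSID carried in the command to find
 * the namespace, then parse the opcode in the context of that
 * namespace's command set.                                            */
static void dispatch(uint32_t nsid, uint8_t opcode)
{
    for (size_t i = 0; i < sizeof(ns_table) / sizeof(ns_table[0]); i++) {
        if (ns_table[i].nsid == nsid) {
            switch (ns_table[i].cset) {
            case CS_NVM_SET: printf("parse opcode 0x%02x as an NVM command\n", opcode); return;
            case CS_KV_SET:  printf("parse opcode 0x%02x as a key value command\n", opcode); return;
            case CS_ZNS_SET: printf("parse opcode 0x%02x as a zoned namespace command\n", opcode); return;
            }
        }
    }
    printf("invalid namespace %u\n", nsid);
}

int main(void)
{
    dispatch(2, 0x7d); /* interpreted per the command set of namespace 2 */
    dispatch(1, 0x01); /* same opcode space, different namespace         */
    return 0;
}
```

Note how there is no per-controller "mode" anywhere in this sketch; the namespace alone determines how each command is parsed, which mirrors the arbitrary interleaving described above.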
Starting point is 00:14:40 With that introduction to multiple command sets, let's take a look at the two new command sets that we've defined. The first new command set is the zoned namespace command set. It's similar in structure to what is done in other storage architectures for SMR drives, but it is optimized for NVM such as NAND. Logical blocks are grouped into zones, and the logical blocks of a zone must be written sequentially. There's a state machine associated with each zone, and in order to read or write logical blocks in a zone, the zone needs to be in a particular state. State transitions may be triggered explicitly by a host or implicitly as a result of a host action. There are three benefits of the sequential writes in the zoned namespace command set.
Starting point is 00:15:27 The first is that they reduce write amplification. The second is that they reduce the amount of required over-provisioning. Finally, sequential writes reduce the size of the mapping table compared to what is required for a normal NAND FTL associated with the NVM command set.
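A tiny C sketch may help illustrate the write rule: each zone tracks a write pointer, and a write is accepted only if it starts exactly at that pointer in a writable zone. The states and fields here are a simplified subset chosen for illustration, not the full state machine defined by the zoned namespace command set.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified zone states (the real state machine has more states,
 * such as implicitly/explicitly opened, closed, read only, offline). */
enum zone_state { ZS_EMPTY, ZS_OPEN, ZS_FULL };

struct zone {
    uint64_t start_lba;   /* first LBA of the zone         */
    uint64_t capacity;    /* writable LBAs in the zone     */
    uint64_t write_ptr;   /* next LBA that must be written */
    enum zone_state state;
};

/* Accept a write only if it lands exactly on the write pointer. */
static bool zone_write(struct zone *z, uint64_t slba, uint64_t nlb)
{
    if (z->state == ZS_FULL || slba != z->write_ptr ||
        z->write_ptr + nlb > z->start_lba + z->capacity)
        return false;                   /* out-of-order or overflow     */
    z->state = ZS_OPEN;                 /* implicit transition on write */
    z->write_ptr += nlb;
    if (z->write_ptr == z->start_lba + z->capacity)
        z->state = ZS_FULL;
    return true;
}

int main(void)
{
    struct zone z = { .start_lba = 0, .capacity = 8, .write_ptr = 0, .state = ZS_EMPTY };
    printf("sequential write accepted: %d\n", zone_write(&z, 0, 4));   /* prints 1 */
    printf("out-of-order write accepted: %d\n", zone_write(&z, 6, 1)); /* prints 0 */
    return 0;
}
```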
Starting point is 00:16:05 The second new command set is the key value command set. This command set is optimized for unstructured data. Instead of reading and writing logical blocks, the host can access data using a key value pair. You can think of the key as an identifier for an object and the value as the data for an object. A key can be 1 to 16 bytes in size, and the value can be 0 to 4 gigabytes in size. With this command set, you don't read and write data. You retrieve a key value and you store a key value. So it's very different from the NVM command set we have today.
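As a rough illustration of the host-visible model, the following C sketch stores and retrieves data by a small byte key rather than by LBA. The structures and function names are hypothetical; the real command set defines store and retrieve as NVMe commands, not a C API, and real values can be far larger than this toy buffer.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define MAX_KEY_LEN 16              /* keys are 1 to 16 bytes */

/* Hypothetical in-memory stand-in for a key value namespace. */
struct kv_entry {
    uint8_t  key[MAX_KEY_LEN];
    uint8_t  key_len;
    uint8_t  value[64];             /* toy buffer; device values can be much larger */
    uint32_t value_len;
};

static struct kv_entry store_slot;  /* single-slot "namespace" for the demo */

static void kv_store(const void *key, uint8_t key_len, const void *val, uint32_t val_len)
{
    memcpy(store_slot.key, key, key_len);
    store_slot.key_len = key_len;
    memcpy(store_slot.value, val, val_len);
    store_slot.value_len = val_len;
}

static int kv_retrieve(const void *key, uint8_t key_len, void *out, uint32_t out_len)
{
    if (key_len != store_slot.key_len || memcmp(key, store_slot.key, key_len) != 0)
        return -1;                  /* key does not exist */
    uint32_t n = store_slot.value_len < out_len ? store_slot.value_len : out_len;
    memcpy(out, store_slot.value, n);
    return (int)n;
}

int main(void)
{
    char buf[64];
    kv_store("sensor-42", 9, "temperature=21C", 15);
    int n = kv_retrieve("sensor-42", 9, buf, sizeof(buf));
    printf("retrieved %d bytes: %.*s\n", n, n, buf);
    return 0;
}
```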
Starting point is 00:16:45 So with this brief introduction to the new NVMe command sets, let's move on to some other new features in NVMe 2.0. When many of you think of NVMe, you think of a PCIe SSD. However, the NVMe architecture is now the new language of storage and is being used to construct warehouse-scale storage systems. Many of these warehouse-scale storage systems are represented by a single NVM subsystem. Up to this point, an NVM subsystem was a monolithic thing with a controller and namespaces. In an NVMe SSD, it's typically implemented by an ASIC with some NAND. In a warehouse-scale storage system, it could consist of many refrigerator-sized racks. These racks fail and need to be replaced. Sometimes racks need to be added to expand capacity. In other cases, firmware associated with a rack needs to be updated. What we've done in NVMe 2.0 is define how an NVM subsystem may be partitioned into multiple domains. Domains are a new architectural element in the NVMe architecture. A domain may contain capacity, controllers, and ports, and may communicate with other domains. In the NVMe 2.0 base specification, we define how domains may be added, removed, reconfigured, or partitioned.
Starting point is 00:17:40 This includes things like how reservations are handled when an NVM subsystem that was partitioned is unified. We also define how an NVM subsystem signals a domain change to a host. Another new feature in NVMe 2.0 is the addition of a copy command to the NVM and zoned namespace I/O command sets. This command is pretty simple. It allows you to copy one or more non-contiguous LBA ranges to a contiguous LBA range in the same namespace without transferring data through the host. This operation is useful for things like host-based FTLs. Before this command, in order to perform a copy, the host had to read data into host memory with an I/O read command and then write the data back to the namespace using an I/O write command. The new copy command allows you to do that in one operation without transferring data through the host. This saves host memory bandwidth and interconnect bandwidth, and it reduces latency.
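The shape of the operation can be sketched as follows: the host supplies a list of source LBA ranges and a single destination starting LBA within the same namespace, and the device moves the data internally. The structures below are an illustrative model, not the descriptor format defined by the specification.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Illustrative source-range descriptor: a starting LBA and a length. */
struct copy_range {
    uint64_t slba;   /* starting LBA of this source range */
    uint32_t nlb;    /* number of logical blocks to copy  */
};

#define BLOCK 8      /* toy logical block size for the demo */

/* Device-side sketch: gather the (possibly non-contiguous) source ranges
 * and write them back contiguously starting at dest_slba, without any
 * data crossing the host interface.                                     */
static void copy_command(uint8_t *media, const struct copy_range *ranges,
                         unsigned nranges, uint64_t dest_slba)
{
    uint64_t out = dest_slba;
    for (unsigned i = 0; i < nranges; i++) {
        memmove(media + out * BLOCK,
                media + ranges[i].slba * BLOCK,
                (size_t)ranges[i].nlb * BLOCK);
        out += ranges[i].nlb;
    }
}

int main(void)
{
    uint8_t media[32 * BLOCK];
    for (unsigned i = 0; i < sizeof(media); i++)
        media[i] = (uint8_t)i;

    /* Copy LBAs 0-1 and 10-11 to a contiguous run starting at LBA 20. */
    struct copy_range src[] = { { 0, 2 }, { 10, 2 } };
    copy_command(media, src, 2, 20);
    printf("LBA 22 now holds the data from LBA 10: %s\n",
           memcmp(media + 22 * BLOCK, media + 10 * BLOCK, BLOCK) == 0 ? "yes" : "no");
    return 0;
}
```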
Starting point is 00:18:22 Next, we have command group control. We added a new lockdown admin command to NVMe that may be used to prohibit the execution of a command or modification of a feature. The lockdown command can be used to prohibit execution
Starting point is 00:18:59 of an admin command, a set features command that modifies a specific feature identifier, or an NVMe management interface command. With the lockdown command, one can also control the interface on which the command is prohibited. The command could be prohibited in-band on an admin queue, out-of-band through a management interface, or both in-band and out-of-band. When a command is locked down, it is prohibited until the next power cycle or until it is explicitly re-enabled using the lockdown command.
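One way to picture the behavior is a small table of prohibited operations keyed by opcode and by the interface on which they are prohibited, as in this C sketch. The enums, table, and opcode value are illustrative assumptions rather than the command's actual encoding.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Interfaces on which an operation may be prohibited. */
enum lock_scope { SCOPE_INBAND = 1, SCOPE_OOB = 2, SCOPE_BOTH = 3 };

/* Illustrative lockdown table: one entry per admin opcode. */
static uint8_t lockdown_table[256];

static void lockdown(uint8_t opcode, enum lock_scope scope, bool prohibit)
{
    if (prohibit)
        lockdown_table[opcode] |= scope;    /* prohibit on this interface */
    else
        lockdown_table[opcode] &= (uint8_t)~scope; /* explicitly re-enable */
}

/* Check performed when a command arrives on a given interface. */
static bool is_prohibited(uint8_t opcode, enum lock_scope iface)
{
    return (lockdown_table[opcode] & iface) != 0;
}

int main(void)
{
    const uint8_t example_opcode = 0x10;    /* hypothetical admin opcode for the demo */
    lockdown(example_opcode, SCOPE_OOB, true);
    printf("in-band allowed: %d\n", !is_prohibited(example_opcode, SCOPE_INBAND));
    printf("out-of-band allowed: %d\n", !is_prohibited(example_opcode, SCOPE_OOB));
    return 0;
}
```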
Starting point is 00:19:35 Another new feature in NVMe 2.0 is a set of enhancements to the way protection information works. Since we wanted NVMe to be used across storage applications, and protection information was required in many enterprise applications, we supported protection information from the beginning in NVMe 1.0. To maintain compatibility with other storage architectures, we use the same format as T10 protection information. As shown in the upper left of this slide, there's a 16-bit guard, a 16-bit application tag, and a 32-bit reference tag. Since a 16-bit guard is viewed as insufficient for large logical blocks, the first change we made to protection information in NVMe 2.0 was to add support for a 32-bit guard and a 64-bit guard. You can see the format of the 32-bit guard protection information in the middle and the 64-bit guard on the right. The second change that we made was to add support for a storage tag in addition to a reference tag. The storage tag is an opaque data field not interpreted by the
Starting point is 00:20:39 controller. During protection information checking, the value of the storage tag is compared to a value provided in the command. Since the use and size of the storage tag and reference tag vary per application, we added a storage and reference space field in the protection information so that one could configure how many bits one wanted to allocate from this field for the storage tag and how many bits one wanted to allocate for the reference tag. Some applications may want a large reference tag. Other applications may want a large storage tag. Yet other applications may want to split this field into a storage and reference tag. The figure in the lower middle of this slide shows how this works. When you split the storage and reference space field, the upper bits become the storage tag and the lower bits become the reference tag.
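To make the tag-splitting idea concrete, here is a small C sketch that packs a storage tag and a reference tag into one shared field with a configurable split point. The field widths and helper names are simplified illustrations and do not reproduce the exact 16-, 32-, or 64-bit guard formats from the specification.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative: pack a storage tag and a reference tag into one shared
 * "storage and reference space" field. The upper bits hold the storage
 * tag and the lower bits hold the reference tag; the split point is
 * configurable.                                                        */
static uint64_t pack_tags(unsigned reference_bits, uint64_t storage_tag, uint64_t reference_tag)
{
    uint64_t ref_mask = (1ULL << reference_bits) - 1;
    return (storage_tag << reference_bits) | (reference_tag & ref_mask);
}

static uint64_t extract_storage_tag(unsigned reference_bits, uint64_t field)
{
    return field >> reference_bits;
}

static uint64_t extract_reference_tag(unsigned reference_bits, uint64_t field)
{
    return field & ((1ULL << reference_bits) - 1);
}

int main(void)
{
    /* Example split: upper bits as a 16-bit storage tag, lower 32 bits
     * as the reference tag.                                            */
    unsigned reference_bits = 32;
    uint64_t field = pack_tags(reference_bits, 0xBEEF, 0x12345678);
    printf("storage tag   = 0x%llx\n", (unsigned long long)extract_storage_tag(reference_bits, field));
    printf("reference tag = 0x%llx\n", (unsigned long long)extract_reference_tag(reference_bits, field));
    return 0;
}
```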
Starting point is 00:21:23 In NVMe 1.4, we added endurance groups and media units as architectural elements to the NVMe architecture. Before NVMe 1.4, we had namespaces that could be used in an NVM set. In NVMe 1.4, we allowed NVM sets to be in an endurance group. An endurance group is a portion of NVM in an NVM subsystem whose endurance can be managed as a collection. In NVMe 1.4, we also added media units that can be used to construct endurance groups. A media unit represents a component of the underlying media in an NVM subsystem. All this was great, but we didn't have a mechanism to configure any of this. The idea was that it came pre-configured from the factory. This is common in the way we do
Starting point is 00:22:16 things in NVMe. We start simple and then we add things as things are needed. What we added in NVMe 2.0 was the ability to configure all this. We added a new capacity management command to the admin command set. And with this command, you can create and delete NVM sets, create and delete endurance groups, allocate media units to an endurance group, and allocate media units to NVM sets.
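As a rough model of what the new capacity management command enables, this C sketch builds an endurance group out of a pool of media units; an NVM set would then be carved out of that group's capacity in the same spirit. The types and function are an illustrative host-side model with hypothetical names, not the admin command's actual fields.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_MEDIA_UNITS 8

/* Illustrative model: each media unit is either unallocated or owned
 * by exactly one endurance group.                                     */
struct media_unit {
    uint64_t capacity_bytes;
    int      endurance_group; /* 0 = unallocated */
};

static struct media_unit pool[NUM_MEDIA_UNITS];

/* "Create endurance group": claim count unallocated media units and
 * return the total capacity that is now managed as one collection.   */
static uint64_t create_endurance_group(int group_id, int count)
{
    uint64_t total = 0;
    for (int i = 0; i < NUM_MEDIA_UNITS && count > 0; i++) {
        if (pool[i].endurance_group == 0) {
            pool[i].endurance_group = group_id;
            total += pool[i].capacity_bytes;
            count--;
        }
    }
    return total;
}

int main(void)
{
    for (int i = 0; i < NUM_MEDIA_UNITS; i++)
        pool[i] = (struct media_unit){ .capacity_bytes = 1ULL << 30, .endurance_group = 0 };

    /* Build endurance group 1 from four media units. */
    uint64_t cap = create_endurance_group(1, 4);
    printf("endurance group 1 capacity: %llu bytes\n", (unsigned long long)cap);
    return 0;
}
```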
Starting point is 00:22:44 Turning our attention to fabrics, NVMe 2.0 adds new NVMe over Fabrics security features. NVMe over TCP previously supported Transport Layer Security, or TLS, version 1.2. TLS 1.2 was released in 2008 and was considered fairly secure, but vulnerabilities have been discovered that call the security of TLS 1.2 into question. For this reason, we are moving NVMe to TLS 1.3. Starting with NVMe 2.0, all NVMe over TCP implementations that implement TLS are now required to support TLS 1.3. NVMe 2.0 still allows TLS 1.2 as a legacy option, but everyone is strongly encouraged to move to TLS 1.3.
Starting point is 00:23:39 using a Diffie-Hellman HMAC CHAP protocol. This allows a host and an NVM subsystem to authenticate each other and make sure they're actually talking to who they think they're talking to. The final major addition to NVMe 2.0 that I'm going to describe is support for rotational media, such as hard drives. I never thought I would see this day, but as NVMe has become the new language of storage, people are architecting their systems around NVMe natively and want to
Starting point is 00:24:11 be able to plug hard drives into these systems. So we added support for hard drives to NVMe 2.0. Enhancements for rotational media include an indication that an endurance group and its associated namespaces store data on rotational media, and a log page that describes the rotational media, including things such as the number of actuators, the rotational speed, and so on. Finally, since spinning up a hard drive can consume a lot of power, we added spin-up control. In summary, NVMe has become the new language of storage and is now being used in everything from mobile devices and simple PCIe SSDs to warehouse-scale storage arrays. We now support every major transport
Starting point is 00:24:56 and are shifting our focus to new innovations enabled by NVM and to new use cases. The NVMe 2.0 specifications are all about refactoring. We refactored the specifications to ease development of NVMe-based technology, to enable rapid innovation by minimizing change to the specifications associated with broadly deployed solutions, and to create an extensible specification structure
Starting point is 00:25:20 that allows the next phase of growth for NVMe. The NVMe technical community continues to accelerate technical development by maintaining and extending existing specifications while at the same time delivering new innovations. In this presentation, I've given you an overview of NVMe 2.0. I encourage you to go to the NVM Express website and download the specifications for more details. Thank you.
Starting point is 00:25:44 Thanks for listening. If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the Storage Developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
