Storage Developer Conference - #158: NVMe 2.0 Specifications: The Next Generation of NVMe Technology
Episode Date: December 13, 2021...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 158.
Hello, my name is Peter Onufryk, and I'm the NVMe Technical Workgroup Chair. I'm going to describe the NVMe 2.0 specifications, which form the next generation of NVMe technology.
Before we dive into details of NVMe 2.0, let's first take a look at NVMe today.
Unlike previous storage technologies, where there was one technology for servers, another for client, and yet another for mobile, NVMe technology is used across storage applications. We see NVMe
technology used today in everything from cell phones and tablets to client laptops and desktops,
all the way to storage arrays and data centers. The table on the right shows the unit growth for
NVMe, and you can see that NVMe has grown from an emerging technology in 2016 to a mainstream technology in 2021. In 2021, we expect over 7 million units
shipped into enterprise, over 20 million units shipped into cloud applications, and over 350
million units shipped into client applications. The NVMe architecture has been called the new language of storage since we are now seeing many people design their systems around NVMe technology.
Although NVMe has evolved into much more
than we ever imagined when we started,
our initial goal in creating NVMe
was to create a standard interface for PCIe SSDs.
We wanted the same experience as
one had with hard drives, where you just plug it in and it worked. As you can see from the chart
on this slide, in 2021, more NVMe SSDs will ship than all other SSD interfaces combined.
And the dominance of NVMe is projected to continue to grow. So with that introduction to where NVMe is today, let's take a look at how we got here.
We started work on NVMe in 2009.
The first spec that was released was the NVMe 1.0 spec, which defined the NVMe storage architecture, command set, and interface for PCIe SSDs.
As NVMe adoption grew, we started work in 2014 on extending the NVMe architecture and command
set to fabrics beyond PCIe. We published the first NVMe over Fabrics specification in 2016. When we published the NVMe over Fabrics spec, we didn't want to disrupt the growing NVMe PCIe ecosystem, so we made only minor changes to the NVMe spec, which continued to largely describe PCIe SSDs, and described the architectural extensions necessary for fabrics in the NVMe over Fabrics specification. Around this time, we also started referring to the NVMe spec as the NVMe Base specification. This achieved our goal of not disrupting the NVMe specification, but created
an unnatural partitioning of functionality across specifications. In 2013, we started work on an
NVMe management specification since there was no good storage standard for out-of-band management.
This resulted in the NVMe management interface specification published in 2015.
So as a result of all this, we had three specifications, the NVMe Base specification, the NVMe over Fabrics specification, and the NVMe Management Interface specification, that were all undergoing active development, with new revisions being released.
So with that introduction into NVMe history,
let's take a look at where NVMe is headed.
As I previously mentioned,
our focus was on creating a standard interface
for PCIe SSDs.
We were focused on the base NVMe architecture
and command set.
Our goal was to unify PCIe SSDs around a common interface and get the same behavior as one
had with hard drives, where you just plug it in and it works.
This required getting an inbox driver into all major operating systems.
This work ultimately resulted in the NVMe base specification. As NVMe SSD adoption grew, our focus shifted to scaling
the NVMe architecture to fabrics. This resulted in the NVMe over fabric specification. With the
NVMe over fabric specification done and having achieved adoption, our focus is now shifting
to two areas. The first area is leveraging the unique capabilities of NVM to enable new storage innovations, such as new command sets and various spec enhancements. The second area is leveraging NVMe into new use cases, such as automotive,
warehouse scale storage, and computational storage. As we expand into these new areas,
a challenge emerged. How do we maintain stable specifications for mature volume applications while still enabling the rapid innovation that NVMe is known for?
As a result of this, we decided to refactor the NVMe specifications.
So the big thing in NVMe 2.0 is specification refactoring.
A lot of people may expect that NVMe 2.0 means some major new feature.
And while we have many new features in NVMe 2.0 that I'll describe later,
the big change is a complete restructuring of the way the NVMe specifications are organized.
We did this for three reasons.
The first is that we wanted to simplify development of NVMe-based technology. So, for
example, if you're designing an NVMe over fabric storage device, you now don't need to comb through
a bunch of PCIe transport spec information to determine what you need to do. The second reason
is that we want to enable rapid innovation while minimizing the impact on broadly deployed solutions.
To enable innovation, we understand that some of the new
things that we may standardize may not gain widespread adoption. So refactoring the specifications
allows us to contain these innovations in their own specifications. If they don't gain adoption,
then they're simply specifications that no one reads, and we're not complicating the specifications
for mature volume applications that people rely on. This allows us to take more risk and innovate.
Finally, refactoring allows us to create a more maintainable structure
where we are only updating the specifications that need to be updated.
So, for example, if we need to make an enhancement for PCIe SSDs,
we can contain it to the PCIe transport specification.
This slide shows the restructuring of the specifications. We took the NVMe base specification, and we broke it into
three pieces: the base specification, which only describes the base NVMe architecture and command set; an NVM command set specification, which describes the I/O command set for block storage; and the PCIe transport specification, which describes all things related to PCIe.
We also took the NVMe over Fabrics spec, and we moved architectural components into the NVMe Base specification
and broke out the individual transports into their own specifications.
So there's no longer an NVMe over Fabrics specification.
The NVMe management interface was largely self-contained, and while we moved a few
minor architectural elements around, it remains largely unchanged.
This slide shows the NVMe 2.0 family of specifications. We have the base specification.
We have a collection of command
set specifications. The NVM command set is what people think of when they think of NVMe. It
describes traditional block storage. We also have two new command sets, which I'll briefly describe
later: the Zoned Namespace command set and the Key Value command set. We also have a collection of transport specifications. We have the NVMe over PCIe transport specification, the NVMe over RDMA transport specification, and the NVMe over TCP transport specification. The NVMe over Fibre Channel transport is defined by T11 and not by
the NVMe organization, so it's not listed here. So to summarize, NVMe 2.0 is all about refactoring the NVMe specifications,
and the new NVMe 2.0 family of specifications is shown here.
With that introduction to NVMe 2.0 spec refactoring,
let's turn our attention to what is technically new in NVMe 2.0.
The first big thing that we did in NVMe 2.0
was redefine how multiple I/O command sets
are handled in NVMe.
NVMe had support for multiple command sets
from the beginning.
NVMe 1.0 had support for four I/O command sets,
although there was only one command set defined,
the NVM command set.
If you notice, the controller capabilities register has support for four I/O command sets,
but the controller configuration register has a three-bit field,
which allows eight command sets to be selected.
Zero selects the NVM command set.
We didn't like this inconsistency, so in NVMe 1.1, we added support for eight command sets, although again, only one was defined, the NVM command set.
Things stayed that way for a while, and then in NVMe 1.4, we introduced the top encoding of the CAP.CSS field to indicate that only the admin command set was supported or, alternatively, that no I/O command sets were supported. This reduced the number of allowed I/O command sets to seven.
Again, up to this point,
the only I/O command set that we had was the NVM command set.
It's important to note that while we had the foundation
for multiple command
sets from the beginning in NVMe, we were missing a lot of the key pieces required to actually
implement multiple command sets. In NVMe 2.0, we defined a new mechanism for supporting up to 64
I/O command sets, defined all the pieces necessary to actually support multiple command sets,
and defined two new command sets,
the Zoned Namespace command set and the Key Value command set.
Let's take a look at the new command set mechanism in a bit more detail.
Associated with each namespace is now a command set. An NVM subsystem may support namespaces
associated with different command sets at the same time, and you can mix and match attaching them to controllers in an arbitrary manner.
On this slide, we have an NVM subsystem with four controllers and nine namespaces.
Four of the namespaces support command set number one, shown in green.
Four of the namespaces support command set number two, shown in blue.
And one namespace supports command set number three, shown in purple. As this figure shows, you can have a controller that has all
namespaces associated with one command set which effectively means that controller can only use
that one command set at that point in time. This is true for controllers zero and one on this slide.
Controller zero only supports command set number two, and controller one only supports command set
number one. Or you can have a controller that has attached namespaces associated with different
command sets. This is true for controllers two and three. When this occurs, the host can issue
commands from different command sets that target
these namespaces, and the commands can be interleaved in an arbitrary manner. So there's
no notion of a controller being in a command set mode. You can issue a command from command set
number one to one namespace and immediately follow it by a command associated with command
set number two to another namespace. If you're building an NVM subsystem,
you may want to support multiple command sets,
but you may not want to allow multiple command sets
at the same time associated with a controller
with commands being interleaved in an arbitrary manner. Since your use case may not require this, and supporting arbitrary interleaving increases implementation and validation complexity, you may want to prohibit this. To allow this, we added a restriction mechanism using the Identify I/O command set data structure, which contains multiple command set combinations.
Each command set combination is a bit vector with a bit per command set,
which allows you to select the subset of command sets that may be enabled by a controller at a
point in time. This restricts the interleaving of commands associated with different command sets by the host. Using the I/O Command Set Profile feature, the host selects the command
set combination that it wants to enable on a controller.
So this allows you to control implementation complexity. It allows you to do things like the following. At one extreme, you can say, I want to support I/O command sets one, two,
and three on my controller, but I only want to have one enabled at a time. At the other extreme,
you can say, I want to support all three command sets enabled at the same time and with commands interleaved in an arbitrary manner.
You can also do things in between. For example, you can express that you want to allow command sets one and three at the same time, or you can enable command set two, but you can't enable all three command sets at the same time.
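To make the combination idea concrete, here is a minimal C sketch that models command set combinations as bit vectors and checks whether a host-requested profile matches one of them. The bit positions and the particular combinations are assumptions taken from the example above; the real data structure is the Identify I/O command set data structure defined in the specifications.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define CS1 (1u << 0)  /* I/O command set one; bit positions are assumptions for the example */
    #define CS2 (1u << 1)  /* I/O command set two */
    #define CS3 (1u << 2)  /* I/O command set three */

    /* Combinations this hypothetical controller allows to be enabled at one time:
     * command sets one and three together, or command set two on its own,
     * but never all three at once. */
    static const uint32_t supported_combinations[] = {
        CS1 | CS3,
        CS2,
    };

    static bool combination_supported(uint32_t requested)
    {
        for (size_t i = 0; i < sizeof(supported_combinations) / sizeof(supported_combinations[0]); i++)
            if (supported_combinations[i] == requested)
                return true;
        return false;
    }

    int main(void)
    {
        printf("command sets one and three: %s\n", combination_supported(CS1 | CS3) ? "allowed" : "not allowed");
        printf("all three command sets:     %s\n", combination_supported(CS1 | CS2 | CS3) ? "allowed" : "not allowed");
        return 0;
    }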
This slide shows how a command set is selected when you have a controller that is enabled
for simultaneously supporting multiple command sets.
Recall that each namespace is associated
with exactly one command set.
So the way it works is that when a controller decodes
a command, it takes the namespace identifier field
that selects the namespace operated on by that command and determines the command set with which that namespace is associated.
Now that it knows the command set on which the command is operating as determined by the namespace, it uses that information to parse the actual command.
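Here is a minimal C sketch of that decode flow, with the namespace table and dispatch function invented for illustration: the NSID selects the namespace, the namespace supplies the command set, and only then is the opcode interpreted.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    enum command_set { CS_NVM, CS_KV, CS_ZNS };

    struct namespace_info {
        uint32_t nsid;           /* namespace identifier */
        enum command_set cs;     /* each namespace is associated with exactly one command set */
    };

    static const struct namespace_info namespaces[] = {
        { .nsid = 1, .cs = CS_NVM },
        { .nsid = 2, .cs = CS_ZNS },
    };

    /* Decode: NSID -> namespace -> command set, then parse the opcode per that command set. */
    static void dispatch(uint32_t nsid, uint8_t opcode)
    {
        for (size_t i = 0; i < sizeof(namespaces) / sizeof(namespaces[0]); i++) {
            if (namespaces[i].nsid != nsid)
                continue;
            switch (namespaces[i].cs) {
            case CS_NVM: printf("NSID %u: NVM command set, opcode 0x%02x\n", nsid, opcode); return;
            case CS_KV:  printf("NSID %u: Key Value command set, opcode 0x%02x\n", nsid, opcode); return;
            case CS_ZNS: printf("NSID %u: Zoned Namespace command set, opcode 0x%02x\n", nsid, opcode); return;
            }
        }
        printf("NSID %u: invalid namespace\n", nsid);
    }

    int main(void)
    {
        dispatch(1, 0x02);  /* interpreted per the NVM command set */
        dispatch(2, 0x02);  /* the same opcode, interpreted per the Zoned Namespace command set */
        return 0;
    }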
With that introduction to multiple command sets, let's take a look at the two new command sets that we've defined.
The first new command set is the Zoned Namespace command set.
It's similar in structure to what is done in other storage architectures for SMR drives, but is optimized for NVM such as NAND.
Logical blocks are grouped into zones, and the logical blocks of a zone must be written sequentially.
There's a state machine associated with each zone, and in order to read or write logical blocks in a zone, the zone needs to be in a particular state.
State transitions may be triggered explicitly by a host or implicitly as a result of a host action.
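As an illustration of the sequential-write rule, here is a minimal C sketch of a per-zone write pointer check. The zone states and field names are simplified assumptions rather than the exact ZNS state machine.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    enum zone_state { ZONE_EMPTY, ZONE_OPEN, ZONE_FULL };

    struct zone {
        uint64_t start_lba;
        uint64_t capacity;        /* writable logical blocks in the zone */
        uint64_t write_pointer;   /* next LBA that must be written */
        enum zone_state state;
    };

    static bool zone_write_ok(struct zone *z, uint64_t slba, uint64_t nlb)
    {
        if (z->state == ZONE_FULL)
            return false;                          /* zone must be in a writable state */
        if (slba != z->write_pointer)
            return false;                          /* writes must be sequential */
        if (slba + nlb > z->start_lba + z->capacity)
            return false;                          /* write must fit within the zone */
        z->write_pointer = slba + nlb;             /* implicit state transition on success */
        z->state = (z->write_pointer == z->start_lba + z->capacity) ? ZONE_FULL : ZONE_OPEN;
        return true;
    }

    int main(void)
    {
        struct zone z = { .start_lba = 0, .capacity = 8, .write_pointer = 0, .state = ZONE_EMPTY };
        printf("write at LBA 0: %s\n", zone_write_ok(&z, 0, 4) ? "ok" : "rejected");
        printf("write at LBA 6: %s\n", zone_write_ok(&z, 6, 2) ? "ok" : "rejected");  /* not at the write pointer */
        return 0;
    }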
There are three benefits of the sequential writes
in the Zoned Namespace command set.
The first is that they reduce write amplification.
The second is that they reduce the amount of required
over provisioning.
Finally, sequential writes reduce the size
of the mapping table required over what is required
for a normal NAND FTL associated with the NVM command set.
The second new command set is the key value command set. This command set is optimized for unstructured data. Instead of reading and writing logical blocks, the host can access data using a
key value pair. You can think of the key as an identifier for an object and the value as the data for an
object. A key can be 1 to 16 bytes in size and the value can be 0 to 4 gigabytes in size.
With this command set, you don't read and write data. You retrieve a key value and you store a key value. So it's very different from the NVM command set we have today.
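To give a feel for the model, here is a minimal in-memory C sketch of store and retrieve keyed by a 1-to-16-byte key. It illustrates the host-visible behavior only; it is not the Key Value command set's actual command format.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_KEY_LEN 16   /* keys are 1 to 16 bytes */
    #define MAX_PAIRS   8

    struct kv_pair {
        uint8_t     key[MAX_KEY_LEN];
        uint8_t     key_len;
        const void *value;       /* in the command set, a value can be 0 bytes to 4 gigabytes */
        uint32_t    value_len;
    };

    static struct kv_pair kv_table[MAX_PAIRS];
    static int npairs;

    /* Store a value under a key (overwrite handling omitted for brevity). */
    static void kv_store(const void *key, uint8_t key_len, const void *value, uint32_t value_len)
    {
        if (npairs >= MAX_PAIRS || key_len == 0 || key_len > MAX_KEY_LEN)
            return;
        memcpy(kv_table[npairs].key, key, key_len);
        kv_table[npairs].key_len = key_len;
        kv_table[npairs].value = value;
        kv_table[npairs].value_len = value_len;
        npairs++;
    }

    /* Retrieve the value stored under a key, or NULL if the key does not exist. */
    static const void *kv_retrieve(const void *key, uint8_t key_len, uint32_t *value_len)
    {
        for (int i = 0; i < npairs; i++) {
            if (kv_table[i].key_len == key_len && memcmp(kv_table[i].key, key, key_len) == 0) {
                *value_len = kv_table[i].value_len;
                return kv_table[i].value;
            }
        }
        return NULL;
    }

    int main(void)
    {
        uint32_t len = 0;
        kv_store("sensor-42", 9, "temperature=21C", 15);
        const char *v = kv_retrieve("sensor-42", 9, &len);
        if (v != NULL)
            printf("retrieved %u bytes: %.*s\n", len, (int)len, v);
        return 0;
    }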
So with this brief introduction to the new NVMe command sets, let's move on to some other new features in NVMe 2.0.
When many of you think of NVMe, you think of a PCIe SSD. However, the NVMe architecture is now
the new language of storage and is being used to construct warehouse-scale storage systems.
Many of these warehouse-scale storage systems are represented by a single NVM subsystem. Up to this point, an NVM subsystem was a monolithic
thing with a controller and namespaces. In an NVMe SSD, it's typically implemented by an ASIC
with some NAND. In a warehouse-scale storage system, it could consist of many refrigerator-sized racks. These racks fail
and need to be replaced. Sometimes racks need to be added to expand capacity. In other cases,
firmware associated with a rack needs to be updated. What we've done in NVMe 2.0 is define
how an NVM subsystem may be partitioned into multiple domains. Domains are a new architectural element in the NVMe architecture. A domain may
contain capacity, controllers, and ports, and may communicate with other domains. In the NVMe 2.0
base specification, we define how domains may be added, removed, reconfigured, or partitioned.
This includes things like how reservations are handled when an NVM subsystem that was partitioned is unified.
We also define how an NVM subsystem signals a domain change to a host.
Another new feature in NVMe 2.0 is the addition of a copy command to the NVM and Zoned Namespace I/O command sets.
This command is pretty simple.
It allows you to copy one or more non-contiguous LBA ranges to a contiguous LBA range in the same namespace
without transferring data through the host.
This operation is useful for things like host-based FTLs.
So before this command, in order to perform a copy,
the host had to read data into host memory with an I/O read command and then write data back to a namespace using an I/O write command.
The new copy command allows you to do that in one operation without transferring data through the host.
This saves host memory bandwidth, interconnect bandwidth, and reduces latency.
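Here is a minimal C sketch of what a single copy operation expresses: several non-contiguous source LBA ranges gathered into one contiguous destination range in the same namespace. The descriptor structure is invented for illustration and is not the spec's copy descriptor format.

    #include <stdint.h>
    #include <stdio.h>

    struct source_range {
        uint64_t slba;      /* starting LBA of this source range */
        uint16_t nlb;       /* number of logical blocks in the range */
    };

    static void issue_copy(uint32_t nsid, const struct source_range *ranges, int nranges, uint64_t dest_slba)
    {
        /* A real implementation would build copy descriptors and submit one Copy
         * command to the controller; here we just show which blocks land where. */
        uint64_t dst = dest_slba;
        for (int i = 0; i < nranges; i++) {
            printf("NSID %u: copy LBAs %llu..%llu -> %llu..%llu\n", nsid,
                   (unsigned long long)ranges[i].slba,
                   (unsigned long long)(ranges[i].slba + ranges[i].nlb - 1),
                   (unsigned long long)dst,
                   (unsigned long long)(dst + ranges[i].nlb - 1));
            dst += ranges[i].nlb;
        }
    }

    int main(void)
    {
        /* Two non-contiguous source ranges gathered into one contiguous destination. */
        struct source_range ranges[] = { { .slba = 100, .nlb = 8 }, { .slba = 512, .nlb = 4 } };
        issue_copy(1, ranges, 2, 2048);
        return 0;
    }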
Next, we have command group control.
We added a new lockdown admin command to NVMe
that may be used to prohibit the execution of a command
or modification of a feature.
The lockdown command can be used to prohibit execution
of an admin command, a Set Features command
that modifies a specific feature identifier,
or an NVMe management
interface command. With the lockdown command, one can also control the interface on which the
command is prohibited. The command could be prohibited in-band on an admin queue,
out-of-band through a management interface, or both in-band and out-of-band. When a command is
locked down or prohibited, it is prohibited until the
next power cycle or explicitly re-enabled using the lockdown command.
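As a rough sketch of what a lockdown request conveys, here is an illustrative C structure with a scope, an interface selector, and a prohibit flag. The field names and values are assumptions; the actual command format is defined in the NVMe 2.0 base specification.

    #include <stdint.h>
    #include <stdio.h>

    enum lockdown_scope { SCOPE_ADMIN_COMMAND, SCOPE_FEATURE_ID, SCOPE_MI_COMMAND };
    enum lockdown_interface { IFC_IN_BAND, IFC_OUT_OF_BAND, IFC_BOTH };

    struct lockdown_request {
        enum lockdown_scope scope;
        enum lockdown_interface interface;
        uint8_t identifier;     /* opcode or feature identifier being prohibited */
        uint8_t prohibit;       /* 1 = prohibit, 0 = re-enable */
    };

    static void apply_lockdown(const struct lockdown_request *req)
    {
        static const char *scope_names[] = { "admin command", "feature identifier", "NVMe-MI command" };
        static const char *ifc_names[]   = { "in-band", "out-of-band", "in-band and out-of-band" };
        printf("%s %s 0x%02x (%s), until power cycle or re-enabled\n",
               req->prohibit ? "Prohibiting" : "Re-enabling",
               scope_names[req->scope], req->identifier, ifc_names[req->interface]);
    }

    int main(void)
    {
        /* Example: prohibit a hypothetical feature identifier from being modified in-band. */
        struct lockdown_request req = {
            .scope = SCOPE_FEATURE_ID,
            .interface = IFC_IN_BAND,
            .identifier = 0x0B,
            .prohibit = 1,
        };
        apply_lockdown(&req);
        return 0;
    }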
Another new feature in NVMe 2.0 is a set of enhancements to the way protection information works. Since we wanted NVMe to be used across storage applications, and protection information is required in many enterprise applications, we have supported protection information from the beginning in NVMe 1.0.
To maintain compatibility with other storage architectures, we use the same format as
protection information in T10. As shown in the upper left of this slide, there's a 16-bit guard, a 16-bit application tag, and a 32-bit reference tag.
Since a 16-bit guard is viewed as insufficient for large logical blocks, the first change we made to protection information in NVMe 2.0 was to add support for a 32-bit guard and a 64-bit guard.
You can see the format of the 32-bit guard protection information in the middle and the
64-bit guard on the right. The second change that we made was to add support for a storage tag in
addition to a reference tag. The storage tag is an opaque data field not interpreted by the
controller. During protection information checking, the value of the storage tag is compared to a value
provided in the command. Since the use and size of the storage tag and reference tag vary per
application, we added a storage and reference space field in the protection information so that one
could configure how many bits one wanted to allocate from this field for the storage tag and how many bits one wanted to allocate for a reference tag.
Some applications may want a large reference tag.
Other applications may want a large storage tag.
Yet other applications may want to split this field into a storage and reference tag.
The figure in the lower middle of this slide shows how this works.
When you split the storage and reference space field, the upper bits become the storage tag and the lower bits become the reference tag.
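Here is a minimal C sketch of that split: a storage tag in the upper bits and a reference tag in the lower bits of one combined field, with a configurable split point. The field widths are example values, not the exact protection information formats.

    #include <stdint.h>
    #include <stdio.h>

    /* Pack a storage tag and a reference tag into one space of 'total_bits' bits,
     * giving 'storage_bits' to the storage tag (the upper bits). */
    static uint64_t pack_tags(uint64_t storage_tag, uint64_t reference_tag,
                              unsigned total_bits, unsigned storage_bits)
    {
        unsigned ref_bits = total_bits - storage_bits;
        uint64_t ref_mask = (ref_bits == 64) ? ~0ull : ((1ull << ref_bits) - 1);
        return (storage_tag << ref_bits) | (reference_tag & ref_mask);
    }

    int main(void)
    {
        /* Example: a 48-bit combined space, 16 bits of storage tag, 32 bits of reference tag. */
        uint64_t packed = pack_tags(0xBEEF, 0x12345678, 48, 16);
        printf("packed storage+reference space: 0x%012llx\n", (unsigned long long)packed);
        return 0;
    }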
In NVMe 1.4, we added endurance groups and media units as architectural elements to the NVMe
architecture. Before NVMe 1.4, we had namespaces that could be used in an NVM set. In NVMe 1.4, we allowed NVM sets to be
in an endurance group. An endurance group is a portion of NVM in an NVM subsystem whose endurance
can be managed as a collection. In NVMe 1.4, we also added media units that can be used to
construct endurance groups. A media unit represents a component of the underlying media in an NVM
subsystem. All this was great, but we didn't have a mechanism to configure any of this.
The idea was that it came pre-configured from the factory. This is common in the way we do
things in NVMe. We start simple and then we add things as things are needed. What we added in
NVMe 2.0 was the ability to configure all this.
We added a new capacity management command
to the admin command set.
And with this command, you can create and delete NVM sets,
create and delete endurance groups,
allocate media units to an endurance group,
and allocate media units to NVM sets.
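As an illustration, here is a minimal C sketch that names the operations the capacity management command exposes. The enum and wrapper function are invented for the example and are not the actual admin command encoding.

    #include <stdint.h>
    #include <stdio.h>

    enum capacity_mgmt_op {
        CREATE_ENDURANCE_GROUP,
        DELETE_ENDURANCE_GROUP,
        CREATE_NVM_SET,
        DELETE_NVM_SET,
    };

    static void capacity_management(enum capacity_mgmt_op op, uint16_t element_id, uint64_t capacity_bytes)
    {
        static const char *names[] = {
            "create endurance group", "delete endurance group",
            "create NVM set", "delete NVM set",
        };
        printf("%s: element %u, capacity %llu bytes\n",
               names[op], element_id, (unsigned long long)capacity_bytes);
    }

    int main(void)
    {
        /* Example flow: carve out an endurance group, then build an NVM set inside it. */
        capacity_management(CREATE_ENDURANCE_GROUP, 1, 2ull << 40);   /* 2 TiB from selected media units */
        capacity_management(CREATE_NVM_SET,         1, 1ull << 40);   /* 1 TiB NVM set in that group */
        return 0;
    }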
Turning our attention to fabrics, NVMe 2.0 adds new NVMe over Fabrics security features.
NVMe over TCP previously supported Transport Layer Security, or TLS version 1.2.
TLS 1.2 was released in 2008 and was considered fairly secure, but vulnerabilities have been discovered that call the security of TLS 1.2 into question.
For this reason, we are moving NVMe to TLS 1.3.
Starting with NVMe 2.0, all NVMe over TCP implementations that implement TLS are now required to support TLS 1.3. NVMe 2.0 still allows TLS 1.2 as legacy,
but everyone is strongly encouraged to move to TLS 1.3.
Another security feature that we added
was mutual host and NVM subsystem in-band authentication
using a Diffie-Hellman HMAC-CHAP (DH-HMAC-CHAP) protocol.
This allows a host and an NVM subsystem to authenticate each other
and make sure they're actually talking to who they think they're talking to.
The final major addition to NVMe 2.0 that I'm going to describe
is support for rotational media, such as hard drives.
I never thought I would see this day,
but as NVMe has become the
new language of storage, people are architecting their systems around NVMe natively and want to
be able to plug in hard drives into these systems. So we added support for hard drives to NVMe 2.0.
Enhancements for rotational media include an indication that an endurance group and associated namespace store
data on rotational media, and a log page that describes rotational media, which includes things such as the number of actuators, rotational speed, and so on. Finally, since spinning up a hard drive can consume
a lot of power, we added spin-up control.
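As a rough illustration of the kind of information involved, here is a hypothetical C structure for rotational media attributes such as actuator count and rotational speed. The layout and field names are assumptions, not the log page format from the specification.

    #include <stdint.h>
    #include <stdio.h>

    struct rotational_media_info {
        uint16_t endurance_group_id;            /* endurance group backed by rotational media */
        uint16_t number_of_actuators;
        uint32_t nominal_rotational_speed_rpm;
        uint8_t  spinup_control_supported;
    };

    int main(void)
    {
        struct rotational_media_info info = {
            .endurance_group_id = 1,
            .number_of_actuators = 2,
            .nominal_rotational_speed_rpm = 7200,
            .spinup_control_supported = 1,
        };
        printf("endurance group %u: %u actuator(s), %u RPM, spin-up control %s\n",
               info.endurance_group_id, info.number_of_actuators,
               info.nominal_rotational_speed_rpm,
               info.spinup_control_supported ? "supported" : "not supported");
        return 0;
    }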
In summary, NVMe has become the new language of storage and is now being used in everything from mobile devices and simple PCIe SSDs to warehouse-scale storage arrays.
We now support every major transport
and are shifting our focus to new innovations
enabled by NVM and new use cases.
The NVMe 2.0 specifications are all about refactoring.
We refactored the specifications to ease development of NVMe-based technology, enable rapid innovation by minimizing changes to specifications associated with broadly deployed solutions, and create an extensible specification structure that allows the next phase of growth for NVMe.
The NVMe technical community continues to accelerate technical development
by maintaining and extending existing specifications
while at the same time delivering new innovations.
In this presentation, I've given you an overview of NVMe 2.0.
I encourage you to go to the NVMe Express website
and download the specifications for more details.
Thank you.
Thanks for listening. If you
have questions about the material presented in this podcast, be sure and join our developers
mailing list by sending an email to developers-subscribe@snia.org. Here you can ask
questions and discuss this topic further
with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.