Storage Developer Conference - #163: Automating the Discovery of NVMe-oF Subsystems over an IP Network
Episode Date: March 2, 2022...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 163.
Hi, I'm Eric Smith. I'm a distinguished member of technical staff working for Dell's Integrated Products and Solutions CTIO team.
Today, I'm going to provide you with an overview of the work Dell Technologies and many other companies have been doing over the past several years to create a new NVMe IP-based SAN ecosystem that supports automated discovery from end to end.
During this session, we'll discuss NVMe-oF's discovery problem, provide an overview of the network topologies that support automated discovery, explore the differences between a Fibre Channel SAN and an NVMe IP SAN used to transport NVMe TCP traffic, and then get into an in-depth explanation of the discovery protocol along with a use case example. Now, how did we get here? Well, NVMe-oF's IP-based discovery problem is well documented
and was even acknowledged in the standard itself. For example, the method that a host uses to obtain
the information necessary to connect to the initial discovery service is implementation specific. This information may be determined using a host configuration file, a hypervisor or OS property, or some other mechanism. Again, that quote was pulled right out of NVMe over
Fabrics 1.1. So what's the problem with this? Well, the methods described above all limit the scale and interoperability of any IP-based NVMe-oF SAN solution.
And the reason is, in order to get discovery to work, the host admin needs to go to every host and use a command line interface to enter the command to discover the subsystems manually.
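Just as a hedged sketch of that manual step, assuming a Linux host with nvme-cli and a placeholder address for one subsystem's discovery controller:

    # Run by hand on every host, for every subsystem (placeholder address; 8009 is the NVMe/TCP discovery port)
    nvme discover -t tcp -a 192.168.10.20 -s 8009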
Now, to address this limitation, in late 2019, a group of companies got together to see if we could agree on a standardized automated discovery process.
And we actually decided to base our approach on Fibre Channel's fabric services.
Now, you might be wondering why. Well, Fibre Channel already provides a very robust automated discovery protocol.
And at the time, almost everyone involved in the project had some amount of FC expertise, so it just seemed like a natural thing to do. It turned out to be a bit more complicated
than we had hoped, though, and required two separate technical proposals to get it done.
They were TP-8009, which is the automated discovery of NVMe-oF discovery controllers,
and TP-8010, which is the centralized discovery controller definition.
Now, in addition to those two technical proposals, we have also done a bunch of work on things like
security with TP-8006 and TP-8011, covering authentication and encrypted channels,
respectively. And we think that's going to be really important as customers definitely care
about security. And we've also worked on boot from SAN with TP-8012 and TP-4126.
And again, both of those are really important from an overall ecosystem perspective.
Now, if you want more information about any of these technical proposals, I'd encourage you to join NVM Express and participate.
Now, when I think about discovery automation between hosts and storage across a fabric, I think of the configuration steps in two high-level categories.
The first, underlay fabric configuration, enables the switches to form a fabric and ensures end-to-end reachability is possible.
The second is automated end-to-end discovery, and this enables the ports on the fabric to automatically discover one another. With Fibre Channel, automated fabric formation is accomplished via the build fabric process and is rock solid today.
Now, although it's based on the FC-SW standard, the solutions here are generally not interoperable between vendors.
Fibre Channel's discovery services are provided by the switch vendors and run on their switch hardware.
Discovery clients are HBA-specific.
With NVMe TCP, automated fabric formation can be accomplished by a number of solutions.
Most of them, like Dell's Smart Fabric Services, configure specific types of switches.
NVMe TCP's discovery services, such as the Centralized Discovery Controller or CDC, can and do run anywhere, for example, on a virtual machine or on a switch.
The discovery clients are all software-based and, as a result, are OS-specific.
Now, the work we've done with TP-8009 and TP-8010 is specific to the automation of end-to-end discovery, as highlighted here.
As we started work on TP-8009 and 8010, we agreed that we wanted discovery automation to work seamlessly with three basic deployment types or topologies.
The first, direct connect, involves a host being directly connected to a subsystem. The second involves multiple hosts and subsystems connected to an IP fabric that does not contain a centralized discovery controller or a CDC.
The third involves multiple hosts and subsystems connected to an IP fabric that does contain a CDC.
The terms CDC and DDC, by the way, are defined in TP-8010. But essentially, a CDC supports registration and zoning, and it typically runs either standalone as a virtual machine or embedded when the CDC is running on a switch.
DDCs can be thought of as a legacy NVMe discovery controller and are typically associated with the storage subsystem.
Now, these deployment types will come up again when I describe how the discovery protocol works later on in the
presentation. Before I describe how centralized discovery works and get into the benefits of
using a CDC, I'll describe how the existing direct discovery process works with NVMe TCP today.
It starts with an administrator determining the IP address of the discovery controller
associated with a specific NVM subsystem and using it as a part of the NVMe discover command.
Now, this step could be skipped if the admin knows the NVMe qualified name or NQN of the host
and they are prepared to manually provide this NQN to the subsystem's user interface
during the storage provisioning step. Once this command has been completed, the NQN of the host
should be visible to the storage admin via the subsystem user interface, and storage can be
provisioned to it. I typically refer to this process as mapping and masking, or in other words, mapping namespaces to specific subsystem
interfaces and masking hosts to access those namespaces. Next, the host admin could use nvme connect-all to discover and connect to all the subsystem interfaces that the host has been mapped
to. These steps would then need to be repeated for each additional subsystem, and then all of the steps would need to be repeated on each host.
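Put together, a minimal sketch of that direct discovery sequence on a Linux host with nvme-cli, with a placeholder address standing in for one subsystem's discovery controller, might look like this:

    # 1. Discover the subsystem's discovery controller at an admin-supplied address
    nvme discover -t tcp -a 192.168.10.20 -s 8009
    # 2. Storage admin maps/masks namespaces to this host's NQN (found in /etc/nvme/hostnqn)
    # 3. Connect to all subsystem interfaces this host has been mapped to
    nvme connect-all -t tcp -a 192.168.10.20 -s 8009
    # ...then repeat for each additional subsystem, and again on every host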
Centralized discovery, such as the type we've defined in TP-8009 and 8010, is a bit different.
It starts with the host and subsystem interfaces automatically discovering the IP address of the CDC using MDNS, then connecting to it and registering discovery information with it.
This is analogous to Fibre Channel's FLOGI and name server registration process.
Next, the CDC admin could choose to perform zoning on the CDC,
but this step is optional because CDCs are required to support subsystem-driven zoning, or SDZ,
which is very similar to target-driven zoning, or TDZ, in Fibre Channel.
As a result, when the storage admin performs mapping and masking, the subsystem can send the appropriate zoning configuration directly to the CDC. Once that zoning has been applied, the CDC will send an asynchronous event notification, or AEN, to the impacted hosts.
The hosts will retrieve a discovery log page from the CDC in response to that AEN, and then connect to all of the I/O controllers contained in the log page.
These mapping and masking steps can then be performed for other hosts, and in fact entire groups of hosts can be mapped and masked in one shot.
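For comparison, the host side of the centralized flow collapses to something like the following sketch with nvme-cli, where the placeholder address is the CDC's (in practice learned over MDNS rather than typed in), and zoning limits what the log page returns:

    # Pull the zoned discovery log page from the CDC and connect to everything in it
    nvme connect-all -t tcp -a 192.168.20.5 -s 8009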
To summarize the difference between the direct and centralized discovery methods, I'll use this chart.
The direct discovery method requires three steps per host per subsystem.
The centralized method requires one step per host per subsystem,
and this will impact users as the number of hosts begins to scale, as shown on the chart.
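To put rough, purely illustrative numbers on that chart, using the per-host, per-subsystem step counts above and a placeholder fleet of 64 hosts and 4 subsystems:

    # Direct:      64 hosts x 4 subsystems x 3 steps
    echo $((64 * 4 * 3))   # 768 admin interactions
    # Centralized: 64 hosts x 4 subsystems x 1 step
    echo $((64 * 4 * 1))   # 256 admin interactions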
But what the chart doesn't really show is that direct discovery starts to become impractical
at greater than 64 hosts. And this is primarily because direct discovery requires interaction with each host
every time a storage subsystem is added or removed. And also, direct discovery may lead to
extended discovery times if many subsystem interfaces are present, and the host has to
iterate through all of them to discover which I/O controllers provide namespaces for it. So it's for
all those reasons that we think greater than 64 hosts is not
really a great match for direct discovery. Some additional points about discovery automation.
Discovery automation does not depend entirely upon the concept of a centralized discovery
controller. Smaller scale environments could make use of MDNS as described in TP-8009
to automatically discover NVMe discovery controllers without a CDC. The problem is this
approach does not allow for centralized control, and this means access control at the network is
much more complicated, bordering on impractical. Also, hosts will not be notified when a new
storage subsystem is added to the environment.
And one final point, MDNS can become excessively chatty in larger configurations, especially when there are more than 1,000 ports in a single broadcast domain, although this is a pretty large configuration.
In this section, I'm going to provide you with an overview of the discovery protocol itself. The purpose of the overview is to help you understand the concepts involved and some of their implications, and not necessarily share all of the details from 8009 and 8010.
For this overview, I've broken the centralized discovery protocol down into eight different phases.
Three of these phases represent the new functionality we've added to support automated discovery, and we'll spend the majority of this section discussing them.
The other phases already exist in one form or another, and we won't talk about them as much.
At the top of the diagram, from the left to the right, we have a host, a network with a CDC in it, as well as a subsystem with a DDC. The first phase, network authentication and configuration,
represents the configuration of the underlay I briefly mentioned earlier on slide four,
and I also typically point out at this point that network level authentication,
such as what's possible with 802.1X, could be used here if you want it.
Now the first new functionality I'll describe in any level of detail is related to the automated discovery of NVMe-oF discovery controllers in an IP network.
We call this TP-8009 for short.
And in it, we define how MDNS or DNS can be used to discover discovery controllers.
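For the (unicast) DNS case, a hedged sketch of what such a lookup could look like, assuming the _nvme-disc._tcp DNS-SD service type and a placeholder domain:

    # DNS-SD: enumerate advertised NVMe-oF discovery service instances in a domain
    dig +short _nvme-disc._tcp.san.example.com PTR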
Now you may recall back on slide 5, I described the three configurations we wanted to ensure
would support automated discovery.
I'm just going to quickly show them here again for reference because they'll be important
over the next few slides.
Again, there are configurations one, two, and three.
Now I'll start by describing what happens when a new host comes online in a direct connect
configuration.
Again, that was configuration number one. It starts with the host sending out an MDNS query that requests a response from any network
entities that support the NVMe-disc service. When the DDC associated with the storage receives this
query, it will transmit an MDNS response that contains a number of DNS-SD records.
The TXT record contains some information about the subsystem, including the
subsystem NQN or sub-NQN, as well as the protocol supported by it. The A record contains the IPv4
address of the DDC. Now, in addition to the other records that are typically included but not shown
here, there could also be an AAAA record that contains the IPv6 address of the DDC if one
is so configured. Now after the host has learned the IP address of the DDC, it will establish a
transport connection to it and then connect to the DDC, initializing it and sending a Get Log Page to it to continue with discovery. When the subsystem comes online, either with or without a host being present,
because the DDC is technically an MDNS responder, it first has to transmit a probe to ensure the
discovery records it is about to announce will not conflict with an existing service discovery record.
Once the probe completes, the DDC can transmit the announcement containing the service discovery information.
If there is a host present, it can use the information in the announce to establish a transport connection with the DDC and then connect to it.
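If you want to see those announcements from a Linux host, here's a hedged sketch using Avahi's standard browser, again assuming the _nvme-disc._tcp service type:

    # Browse and resolve NVMe-oF discovery service advertisements on the local broadcast domain, then exit
    avahi-browse --resolve --terminate _nvme-disc._tcp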
With multiple hosts and subsystems, things get a bit more complicated.
Note that traffic from or to host 1 and storage 1 is represented by a dashed line,
and traffic from or to host 2 and storage 2 is represented by a solid line.
The example in this case starts by both host 1 and host 2 transmitting an MDNS request
for the NVMe-disc service. The switch receives these packets, and since they are multicast,
the storage interfaces receive the MDNS query from both hosts. The MDNS response from the storage interface is then transmitted and then forwarded
to each host. Now please keep in mind, I am showing the response is multicast, but it could be unicast.
The host will then establish transport connections to both DDCs and connect to them.
In an environment with a CDC, the storage interfaces may or may not respond
depending on their configuration or whether or not they've detected the presence of a CDC.
For example, before the MDNS responder functionality on the DDC is enabled, it can
transmit an MDNS query that requests any network entities that support the CDC subtype of the NVMe-disc
service to respond. The CDC will respond to this query because it's the CDC, and when the DDC
receives the response, it will leave its MDNS responder functionality disabled. Now, the reason
for doing this is to ensure the hosts only receive MDNS responses from the CDC and not from every DDC that happens
to be present in the broadcast domain and may have already registered information with the CDC.
If the host were to receive responses from every DDC, including those that have already
registered with the CDC, it would defeat the purpose of having a CDC in the first place. Now, it's important to point out that not every DDC that supports MDNS
will also support interacting with a CDC.
As a result, there may and probably will be configurations
in which the host interfaces get MDNS responses from both CDCs and DDCs,
and this is completely fine.
Now, before I go too much further describing how discovery using MDNS works,
I need to introduce a bit of the functionality we've defined in TP-8010,
especially as it relates to how discovery information is registered.
So with regards to discovery information registration techniques,
we've defined two of them, push and pull.
Push registration uses the same basic approach as Fiber Channel does.
Each host or storage interface sends a registration command to the Fabric services.
The Fabric services store the registration information in a database such as a name server.
Pull registration, though, is pretty much a new concept.
A couple of caveats about it: it's only allowed to be used by subsystem interfaces, and each subsystem interface informs the CDC that it has information that it would like to register. The subsystem does this either by using MDNS or by sending something called a Kickstart Discovery Request to the CDC. We'll discuss both
of those in much more detail later. The CDC will then connect to the subsystem interface and
retrieve the discovery log page, and the discovery log page entries are then added to the CDC's name
server database. Now, we created pull registration originally to allow legacy implementations of
subsystems to take advantage of centralized discovery without having to make changes.
However, some new subsystem implementations have chosen to use it because it is simpler for them to implement. When a DDC that wants a pull registration comes online, it can send out a probe and announce that indicates it provides the DDC pull subtype of the NVMe-disc service.
Now, when the CDC receives this, it will establish a transport connection to the DDC, connect to it, and then perform a pull registration as described in the previous slide.
If the CDC were to come online after the DDC, the CDC can discover which DDCs require a pull registration by transmitting an MDNS query for the DDC pull subtype of the NVMe-disc service.
The DDC's response would trigger the CDC to establish a transport connection to the DDC and connect to it.
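As a sketch of those subtype queries with Avahi (the subtype labels _cdc and _ddc-pull here are illustrative placeholders; TP-8009 defines the registered names):

    # A DDC looking specifically for a CDC
    avahi-browse --resolve --terminate _cdc._sub._nvme-disc._tcp
    # A CDC looking specifically for DDCs that want a pull registration
    avahi-browse --resolve --terminate _ddc-pull._sub._nvme-disc._tcp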
Now, the problem with MDNS is, by default, it is constrained to a single broadcast domain.
As a result, if you want to deploy an NVMe IP SAN that spans multiple subnets, you'll need to find a way to address this. The way we have been thinking about this problem was to create the concept of an MDNS
beacon, which can be thought of as an MDNS responder that resides on each network.
The beacon handles the MDNS requests for the NVMe-disc service by sending an MDNS response,
pointing them to the IP address of the CDC, which happens to reside on a different network.
The DDC is then free to send a Connect to the CDC, as it is routable.
If the DDC would like to request a pull registration and plans to do so via MDNS,
things get a little bit more complicated. The problem is, even if the MDNS beacon was
configured to respond with the IP address of the CDC, there is no way for the DDC to notify the CDC that a pull registration is needed using MDNS.
This is because MDNS, again, is constrained to the local broadcast domain by default.
Now, the easiest way for us to solve this problem turned out to be adding what we refer to as a Kickstart Discovery Request, or KDReq.
Now, why is a kickstart necessary?
Well, after the CDC IP address has been discovered via an MDNS beacon,
how does the CDC know that the subsystem needs to have information pulled from it?
What actually causes the CDC to send Connect to
the subsystem in the first place? And also, we need to be able to differentiate between a DDC
that hasn't registered yet and a DDC that wants pull registration. We couldn't just assume one
way or the other. It turns out if we use a Kickstart Discovery Request, the MDNS beacon functionality can be a simple MDNS responder.
Without KDReq, the MDNS beacon would need to be an MDNS proxy, which is much more complicated for us to implement and for end users to configure.
So for all these reasons, we were interested in using KDReq.
Now we'll talk about the discovery controller initialization process from both the host and the subsystem as it relates to push and pull.
And we need this to get into a little bit more detail about KDReq.
So after the discovery controller IP addresses have been either discovered via MDNS or configured via CLI,
an NVMe host will establish a connection to the discovery controllers and initialize them.
This process is defined in the NVMe 2.0 base spec, and there's nothing new here. This has been around forever. With a centralized discovery controller as shown, there will typically only be one CDC
per VLAN. Without a centralized discovery controller, there will be multiple discovery
controllers per VLAN, at least one per NVM subsystem interface. Now at the end of this step,
the NVMe host has established a connection to the CDC.
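On a Linux host, a rough sketch of that initialization step with nvme-cli looks like this, with a placeholder CDC address; keeping the discovery connection persistent is what lets the host receive AENs later:

    # Establish and keep a connection to the discovery controller at the CDC's address
    nvme discover -t tcp -a 192.168.20.5 -s 8009 --persistent
    # Check which subsystems, including discovery controllers, this host is now connected to
    nvme list-subsys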
With the subsystem push registration, it's actually very similar to what happens with a host.
After the CDC IP address has either been discovered or configured, the NVM subsystem will establish a
connection to the CDC and initialize it.
And with the CDC shown, there will usually only be one CDC per VLAN.
And at the end of this step, the NVM subsystem has established a connection to the CDC.
In fact, the DDC on that NVM subsystem is acting like a host and the CDC is acting like a discovery controller. Now, pull registration is slightly
different. So after this, again, the CDC IP address has been discovered or configured,
the NVM subsystem or the DDC on it will send the CDC a Kickstart Discovery Request, or KDReq. The CDC will respond by sending a Connect to the subsystem and initializing the DDC on it.
Again, this is all previously defined, nothing new here.
At the end of this step, the CDC has established a connection to the DDC on the NVM subsystem.
And in this case, the CDC is acting like the host and the DDC on the subsystem is acting like a discovery controller.
This would be like a traditional host connecting to the DDC on the subsystem.
Okay, so we've just discussed discovery controller initialization. Now we're going to get into the process of registration. We'll start with host registration and then get into subsystem registration using both the push and pull methods.
Okay, so we covered this during step three, the discovery controller initialization process. During that process, the type of discovery controller will have been discovered; it'll either be a CDC or a DDC. If it's a CDC, the host will register with it using the Discovery Information Management command. The information registered is effectively an enhanced NVMe discovery log page as defined in the 2.0 base specification. It's not exactly the
same, but it's close. The extra information registered could include a symbolic name,
which is like a user-friendly name, a group name, or both. It's just really intended to help the
user understand what's connected. Now, whether or not the discovery controller is a CDC or a DDC, the host will register
for asynchronous event notifications or AENs by using the asynchronous event request.
Now at the end of the step, each interface on the host that discovered a centralized
discovery controller will have registered effectively a log page and transmitted one
or more AERs.
In the CDC case, the host is now discoverable by the subsystem.
So what happens when the subsystem registers with the CDC using push registration?
Again, during step three, we learned the type of discovery controller that's available, either CDC or DDC.
If it's a CDC, the subsystem will register with it using the Discovery Information Management command, as was the case with the host.
Information to be registered is effectively an enhanced NVM discovery log page, as it was previously for the host.
And same thing here with user-friendly names. It's a
really good idea for subsystems to provide this information just to, again, to make the lives of
the end users easier. The subsystem will register for asynchronous event requests, probably more
than one. End of this step, each interface on the subsystem that discovered a CDC will have
registered a log page and transmitted one or more asynchronous event requests. The subsystem is now discoverable by the host.
To describe how a subsystem would use KD-REC to request a pull registration,
I think it's probably best just to review the entire process starting with MDNS.
The process could start by the subsystem transmitting an MDNS
query for the network entities that support the CDC subtype of the NVMe-disc service.
The MDNS response from the CDC will contain the CDC's NQN and the IP address it can be reached at.
The CDC NQN can be used to help determine whether the MDNS response being received by the subsystem is for a CDC that it had already discovered or not. Next, the DDC establishes a transport connection
to the CDC and transmits an updated version of ICReq to the CDC. This new version of ICReq
indicates that the connection being established from the DDC to the CDC will only be used to transport Kickstart Discovery Requests, or KDReq, and Kickstart Discovery Responses, or KDResp.
KDReq provides some information to the CDC about the DDC and indicates that the DDC would like the CDC to perform a pull registration against it. Once the CDC responds with KDResp,
the connection established from the DDC to the CDC is torn down, and then the CDC establishes
a connection to the DDC. It would then use a traditional ICReq/ICResp exchange and then
connect to the DDC. After the connect is completed and the DDC has been
initialized by the CDC, the CDC will transmit GetLogPage to the DDC. Now this GetLogPage
command should only request what we call port local discovery log page entries.
By requesting these port local pages only, each interface on the subsystem will only be registered with the correct CDC instance.
So now we'll talk a little bit about subsystem driven zoning.
So subsystem-driven zoning, again, SDZ, is very similar to target-driven zoning, or TDZ, in Fibre Channel.
As part of the storage provisioning process, as I mentioned earlier in the mapping and masking,
the subsystem, after that mapping and masking process, may send a zone group to the CDC.
Again, the zone group describes which hosts are allowed to access each subsystem interface.
In the context of a CDC, the zone group is the unit of activation, like a Fibre Channel zone set or zone configuration.
A CDC instance may have multiple zone groups
active at the same time
to avoid potential configuration clashes
between multiple administrators.
That's a big reason why we decided to define it that way.
The process starts by the subsystem retrieving a zoning data key,
and the subsystem can then send the zone group definition using Fabric Zoning Send.
The zone group definition should match the namespace masking definition. In fact, it really
has to. I mean, otherwise it wouldn't make sense. And this allows for a single pane of glass management experience for end users.
And it's a big part of the reason why we think this is such a great idea and hope that everybody implements it.
To give you an idea of how these new features all relate to one another, I'd like to walk you through the discovery process by using an example.
It starts with some kind of automated underlay configuration service, configuring the switches and enabling IP reachability between the appropriate switch interfaces to create an NVMe
IP SAN. By using an automated underlay config service for this purpose, you can reduce the
number of initial network configuration steps from the thousands down to less than 10.
Next, the server and storage interfaces use MDNS
to discover the CDC's IP address automatically.
Once the IP address of the CDC instance
has been discovered, the server and storage interfaces will all register
discovery information with the CDC, which will then store this information in its name server
database.
At this point, the operator or an orchestrator could read the contents of the name server
database and use this information to configure zoning that allows servers to access the appropriate
storage interfaces.
Alternatively, this zoning information could also be sent to the CDC automatically as part of the next configuration step of mapping and masking.
In either case, the concept of zoning
is a critical component that enables discovery automation
without causing scalability problems
that would result from every server attempting to connect
to every storage port in the SAN and determine if the storage port provides storage volumes for that particular server or not.
Once the zoning has been configured, the CDC will use asynchronous event notifications
to notify the servers and storage interfaces that a change in discovery information has occurred.
In response to this, the server will query the CDC name server using get log page
to determine what storage interfaces are available to it.
And then, finally, each server interface will connect to the storage interfaces returned to it
in the get log page response, perform NVMe discovery to discover the storage volumes
or namespaces that have been allocated to it, and then start transferring data.
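To tie that back to the host one more time, a hedged sketch of how you might verify the end result on a Linux box once an automated discovery client (such as nvme-stas, mentioned below) has done its job:

    # Subsystems and controllers this host ended up connected to
    nvme list-subsys
    # Namespaces (volumes) that were discovered and attached
    nvme list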
So some key takeaways from this session.
Again, discovery automation does not entirely depend upon the presence of a centralized discovery controller,
but we do think it enables an NVMe TCP environment to scale. Smaller scale environments can make use of MDNS as
described in TP-8009 to automatically discover NVMe discovery controllers.
A couple of other points: CDCs, and subsystems that support interacting with them, should use port-local log pages. This provides a much better user experience and prevents leaking information between tenants.
Should also make use of subsystem-driven zoning.
Storage admins will only need to interact with one UI for storage provisioning.
We think this is a big advantage.
Should also make use of extended attributes and register symbolic names that are meaningful to end users.
It's just going to make their lives so much easier rather than having to deal with NQNs. And also, contribute to the open source NVMe-oF discovery client called nvme-stas, being led by Dell. This will be available for review after 8009 and 8010 are ratified at the
end of the year. Thank you very much for taking the time to listen to this session about our new NVMe IP SAN ecosystem.
Please do take a few moments to rate this session.
Thank you.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask
questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.