Storage Developer Conference - #149: Enabling Ethernet Drives
Episode Date: July 13, 2021...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 149. Welcome to the 2020 Storage Developer Conference.
My talk is on enabling Ethernet drives.
I'm Mark Carlson with Kioxia.
I co-chair the Object Drive Technical Work Group along with the SNIA Technical Council.
And I want to talk today about Ethernet drives.
Direct attached storage in the original sense was a drive that plugs into a host, perhaps using SCSI,
and then along came storage area networks where multiple hosts can share the storage.
This avoided those silos of storage and enabled storage efficiencies.
Examples include Fibre Channel and, today, iSCSI storage networks.
These have a controller in front of the actual drives,
so they're handling all the networking in a controller.
Hyperscale folks went backwards to direct attached storage again
on commodity systems, but then they created special software,
software-defined storage that manages those hyperscale nodes in a solution.
And now the industry is moving to NVMe over fabrics, systems, and devices on native Ethernet as a storage network.
Ethernet was initially just a transport; endpoints performed all the storage services, such as iSCSI. The use of Ethernet did mature, however, with specialized protocols. There are key-value protocols to access data in the mainframe context. There are object protocols to access massive amounts of unstructured data. And now we have NVMe over Ethernet, where storage is in a queuing paradigm. This enables high performance and low latency, with few or no processing blockages to the traffic, because it is no longer gated by a transaction paradigm, in other words, waiting for an acknowledgement.
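To make that contrast concrete, here is a small conceptual sketch in Python, not an actual NVMe implementation: the do_io, submit, and poll_completions callables are hypothetical placeholders, and the point is only the shape of the two models, one blocking request at a time versus many commands posted to a submission queue with completions reaped asynchronously.

```python
# Conceptual sketch only: transaction-style I/O versus a queuing paradigm.
# do_io, submit, and poll_completions are hypothetical placeholders.
from collections import deque

def transactional_io(requests, do_io):
    # One outstanding request at a time: issue it, then wait for the acknowledgement.
    for req in requests:
        yield do_io(req)            # blocks until the ack comes back

def queued_io(requests, submit, poll_completions):
    # Post all commands to a submission queue without waiting,
    # then reap completion entries as the device finishes them.
    sq = deque(requests)
    while sq:
        submit(sq.popleft())        # non-blocking post
    yield from poll_completions()   # completions arrive in whatever order they finish
```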
The next step, the next logical step, is to have NVMe over Ethernet to the actual drive
itself.
And this removes that storage controller that was necessary in front of the drive and that
processing bottleneck that existed.
So, NVMe over Fabrics: there are several talks this week on this topic.
It enables the sharing of NVMe-based storage across the network.
So you have better utilization, you have capacity, you have rack space, you have power,
better scalability, management, and fault isolation. Now, the NVMe-oF standard comes from NVMe.org. There are 50-plus contributors. Version 1 was released in 2016. The fabrics it supports include Ethernet, InfiniBand, Fibre Channel, and now TCP/IP as well.
Now products are starting to reach the market from most major storage system vendors, so look for them.
So when you look at those products that are coming out that support NVMeOF, most of them have a controller of some sort.
That controller terminates the NVMeOF connections and uses PCIe-based SSDs internally or on the back end.
And those SSDs are behind an array or a controller of some sort.
But there are performance limits to this.
SSD performance is increasing faster than CPU performance can keep up with.
The NIC itself can become a bottleneck.
And then this whole idea of store-and-forward increases the latency.
And then if you look at the cost, you've got CPUs in the mix,
systems-on-a-chip, RNICs, switches, memory,
and this doesn't scale well to match this increased SSD performance.
So along come NVMe-oF Ethernet SSDs.
With the NVMeOF termination on the drive itself, the controller functionality is now distributed
across all the drives.
The scaling point then becomes a single drive in an inexpensive controller.
It enables this idea of EBOFs, an Ethernet-attached bunch of flash: power, cooling, SSDs, and an Ethernet switch. That's pretty simple. I don't think you could make it any simpler. But doesn't this make each drive more expensive?
Maybe initially, but now the customer buys controller functionality incrementally, as needed, with new capacity. So economies of scale are now applied to this controller functionality in the manufacture of these drives: lower cost per unit of bandwidth and lower cost per IOPS.
The scaling unit, as you can see over on the right, is a single drive.
You buy a bunch of empty enclosures, and you add a drive at a time as your capacity needs grow.
But keep in mind, over time, the cost of the storage is going down.
So the last drive you add to that enclosure may be a lot cheaper than the first drive you add to that enclosure.
So there is some complexity we want to talk about here in the existing controller device.
The SSD transport, such as PCIe, is increasing faster than the network bandwidth.
SSD throughput will triple because every time you double the number of bits
in a cell, you're doubling the speed at which you can pull the data out of that media.
Network speeds double every few years; now that's accelerating because the hyperscalers are making the next generation of Ethernet a commodity a lot faster. So whenever you're taking an SSD and putting a CPU, NIC, DRAM, software, and PCIe in front of it, there's a lot of cost here, but it also becomes a bottleneck: the processing power of the CPU, or the NIC staying on the last generation of Ethernet, etc.
But an Ethernet JBOF can actually just have an Ethernet switch in there.
Simple and scalable performance from that Ethernet switch will accommodate the SSDs.
So there are different ESSD designs today. Some will support multiple interfaces and protocols; others are just a simple RoCE or TCP drive. You can put a sort of interposer or bridge in front of a standard PCIe drive, or you can start to pull that bridge onto the same board, or eventually integrate it into the drive's system-on-a-chip controller. Now, what you want is not to plug RJ45s into these drives, but rather have them plug into the standard SFF-8639 connector that's already on the mid-planes of a lot of these boards.
So the first use case we'll talk about is: why not put these Ethernet drives behind the controller in existing systems? This allows them to scale a lot further than a PCIe switch would allow them to. So perhaps what might only scale to a tray or two with PCIe can now go maybe to the whole rack. So using ESSDs allows higher scaling,
still hiding the individual SSD management from users,
still a system, a storage system with perhaps data services running on it as well.
The data services add value to those Ethernet drives, but the connection and orchestration is still done by that controller.
So then you can do robust data protection schemes and distributed controllers, etc.
Now, the second use case is just completely disaggregated SSD storage, with no controllers anywhere. Instead of converting from the network into PCIe, why not have these native EBOF devices, which are able to respond to hosts directly?
What's needed for this is management.
In order to do this, you have to go in and manage each NVMeOF SSD separately.
You want to set them up with namespaces.
You want to set them up with a host-to-namespace mapping so that the host only sees the namespaces it's allowed to.
And then the host can discover through standard NVMe means those namespaces
and start using them, start mounting them, et cetera.
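As a rough illustration of that host-side flow, here is a minimal sketch that shells out to nvme-cli, assuming it is installed and the drive has already been provisioned with a namespace mapped to this host; the address and NQN below are placeholders, not values from the talk.

```python
# Minimal host-side sketch using nvme-cli over NVMe/TCP.
# The address and subsystem NQN below are placeholders.
import subprocess

TARGET_ADDR = "192.0.2.10"                           # hypothetical drive IP
TARGET_NQN = "nqn.2020-01.com.example:essd-0001"     # hypothetical subsystem NQN

# Standard NVMe-oF discovery: retrieve the drive's discovery log page.
subprocess.run(["nvme", "discover", "-t", "tcp",
                "-a", TARGET_ADDR, "-s", "4420"], check=True)

# Connect to the subsystem; the namespaces this host is allowed to see
# then show up as /dev/nvmeXnY block devices and can be mounted.
subprocess.run(["nvme", "connect", "-t", "tcp",
                "-a", TARGET_ADDR, "-s", "4420",
                "-n", TARGET_NQN], check=True)
```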
So what SNIA has done is it's created a native NVMeOF drive specification.
Version 1.0 is out now, and it allows you to discover and configure the drives,
the interfaces, the speeds, and the management capabilities.
We actually went and defined the connectors for this.
Some connectors may need to configure the PHY signals based on the type of drive interface. Survivability and mutual detection are important. We've defined the pinouts for these common connectors, not just SFF-8639, but also SFF-TA-1002 form factor devices.
We are going to continue to integrate with NVMeOF.
There's a lot of work happening on discovery, new commands being added to the admin controllers.
And for management, instead of in-band over NVMe-oF, or in addition to in-band NVMe-oF admin commands, this enables the management to go over Ethernet and TCP: in-band of the Ethernet interface, but out-of-band of the NVMe-oF protocol.
And this allows all the drives in a data center-wide management to be directly
contacted by the orchestration software that, for example,
creates pools of drives.
Maybe initially the drives, when they come in the back of the building,
enter a spare pool.
Then they get configured with namespaces.
Then they get put in a pool for the types of drives they are.
They can also now be configured to deploy their namespaces, in a host mapping, to specific hosts on demand as needed, and return to that pool when they're no longer needed.
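As a conceptual sketch of that lifecycle, and not any particular orchestrator's API, the following Python class just tracks drives moving from a spare pool, through provisioning, into typed pools, out to hosts on demand, and back again.

```python
# Conceptual sketch of the drive-pool lifecycle described above.
class DrivePools:
    def __init__(self):
        self.spare = []        # drives that just arrived at the data center
        self.available = {}    # typed pools: drive_type -> list of (drive_id, namespaces)
        self.in_use = {}       # drive_id -> (host_nqn, namespaces)

    def intake(self, drive_id):
        # A new drive enters the building and lands in the spare pool.
        self.spare.append(drive_id)

    def provision(self, drive_id, drive_type, namespaces):
        # Configure namespaces, then file the drive into the pool for its type.
        self.spare.remove(drive_id)
        self.available.setdefault(drive_type, []).append((drive_id, namespaces))

    def allocate(self, drive_type, host_nqn):
        # Map a drive's namespaces to a specific host on demand.
        drive_id, namespaces = self.available[drive_type].pop()
        self.in_use[drive_id] = (host_nqn, namespaces)
        return drive_id

    def release(self, drive_id, drive_type):
        # The host is done: return the drive to its typed pool.
        _host, namespaces = self.in_use.pop(drive_id)
        self.available.setdefault(drive_type, []).append((drive_id, namespaces))
```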
So management is the work that we're doing right now.
We want to be able to scale out the orchestration of tens of thousands of drives.
And we make this possible by using a RESTful API,
such as DMTF Redfish.
Redfish and SNIA Swordfish follow a principle that each element reports its own management interface information. So instead of the controller reporting the management interface information,
now the drive provides that information directly.
The higher-level management application can follow links directly to the drive's management endpoint.
Everything is HTTP, TCP, Ethernet based.
All the abilities to secure those interfaces are still enabled.
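A minimal sketch of what that looks like from the management application's side, assuming a reachable HTTPS management address and credentials on the drive; the address and credentials are placeholders, and the collections queried are just the ones a drive might expose.

```python
# Minimal sketch: talk directly to a drive's Redfish/Swordfish endpoint.
# The management address and credentials are placeholders.
import requests

DRIVE = "https://192.0.2.10"
AUTH = ("admin", "password")

# The Redfish service root is always at /redfish/v1/.
root = requests.get(f"{DRIVE}/redfish/v1/", auth=AUTH, verify=False).json()

# Follow the links the drive itself reports, rather than hard-coding a
# controller-centric path; only collections the drive actually exposes are read.
for collection in ("Systems", "Chassis", "Storage"):
    if collection in root:
        url = DRIVE + root[collection]["@odata.id"]
        print(collection, requests.get(url, auth=AUTH, verify=False).json())
```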
And what will make this interoperable over time is what's called an interoperability profile.
So we will be producing an interoperability profile that sort of sets a minimum bar for how much of this management interface every drive should implement.
Hopefully, we'll have some ability to check that and get certified for that.
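Purely as an illustration of the shape such a profile takes (the real profile the group publishes at the DMTF will define the actual resources and requirement levels), a Redfish interoperability profile is a JSON document along these lines, shown here as a Python dict with hypothetical names and requirements.

```python
# Illustrative only: the rough shape of a Redfish interoperability profile.
# Profile name, version, resources, and requirement levels are hypothetical.
essd_profile = {
    "SchemaDefinition": "RedfishInteroperabilityProfile.v1_0_0",
    "ProfileName": "ExampleEthernetSSD",
    "ProfileVersion": "0.1.0",
    "Purpose": "Minimum management surface an Ethernet SSD should implement",
    "Resources": {
        "Drive": {
            "PropertyRequirements": {
                "CapacityBytes": {"ReadRequirement": "Mandatory"},
                "Status": {"ReadRequirement": "Mandatory"},
            },
        },
        "Volume": {
            "PropertyRequirements": {
                "CapacityBytes": {"ReadRequirement": "Mandatory"},
            },
        },
    },
}
```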
So what do we have?
We have a mock-up to start.
I'll show a little bit of that.
We're going to continue to push new models through Swordfish contributions to Redfish. And then we're going to publish this interoperability profile at the DMTF for everybody to use.
So we've been mapping the profile, the Redfish elements, to NVMe and NVMe-MI properties and actions.
And there's more detail on this at other SDC talks, so seek those out.
But what we have is sort of a three-way effort.
It's hosted by the SNIA SSM TWG, which is developing Swordfish, in a special task force. And we've got chartered work from Redfish, Swordfish, and NVMe discussions we had last year.
So we really are investigating how NVMe and NVMe-oF will be managed in large-scale environments.
And this is where this work is targeting. We also want to come up with a common way that all organizations agree with to represent NVMe and NVMe-OF in Redfish and Swordfish.
Redfish did have some NVMe drive properties added, but not really a comprehensive view.
We want to provide a clear map for NVMe folks who don't know Redfish or Swordfish, so that someone coming from NVMe will better understand the Redfish and Swordfish environment, and provide commonality where possible between the models and
the solutions. So we use the available low-level transports to get the device transport-specific information into the common models.
We use commands that are provided in the NVMe, NVMe-oF, or NVMe-MI specs. NVMe-MI can be used as a low-level way to get the information into the high-level management environment, as an out-of-band access mechanism, when appropriate.
So, our scope is the NVMe subsystem, the NVMe-oF, and the NVMe domain models. The overall NVMe subsystem model reflects a unified view of all NVMe device types. Devices will instantiate an appropriate subset of the model depending on what the device does. The model diagrams do not reflect all available schema elements, but the model does leverage and coarsely map to existing Redfish and Swordfish storage models. So here are the NVMe objects that we map to Redfish and Swordfish so far. We have a map for the NVMe subsystem. An NVMe subsystem includes one or more controllers, zero or more namespaces, and one or more ports.
And the NVMe controller is used for I/O, admin, and discovery.
This is the thing that actually processes the NVMe commands.
So the admin controller is a controller that exposes capabilities that allow a host to manage an NVMe subsystem. We've modeled all these admin aspects of an NVMe controller into Redfish and Swordfish.
Discovery in the NVMe sense is a controller that exposes capabilities to allow
hosts to retrieve a discovery log page. Now, you may discover an ESSD in your discovery log page, then go contact it in-band using admin commands, fetch the Redfish URL, and manage it out-of-band from that point on.
The I/O controller is a controller that implements I/O queues and is intended to be used to access the non-volatile memory storage medium.
A namespace is essentially a volume in Redfish. So it's a quantity of non-volatile memory. The endurance group is a portion of NVM that is managed as a group by the drive side, by the controller side. An NVM Set is a way to logically partition the media such that it can achieve predictable latency and other benefits. And the NVM domain is the smallest indivisible unit that shares state, such as capacity or power.
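To summarize the mapping just described in one place, here it is as a Python dict; it is an approximation of the work in progress, not the final published model, and the objects not called out explicitly in the talk are marked as still being settled.

```python
# Approximate NVMe-to-Redfish/Swordfish mapping, as described in the talk.
nvme_to_swordfish = {
    "NVMe subsystem": "Storage",             # the subsystem as a whole
    "NVMe controller": "StorageController",  # I/O, admin, and discovery controllers
    "Namespace": "Volume",                   # "a namespace is essentially a volume"
    "Physical drive": "Drive and Chassis",   # location, power, thermal
}
# Endurance groups, NVM Sets, and NVM domains are also being modeled, but the
# exact Redfish/Swordfish elements for them are still being worked out.
```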
So this is a model.
On the left, we have the Redfish Swordfish model,
which uses existing or newly defined Redfish concepts.
On the right is the NVMe model, and it uses those concepts I just described that are coming from the NVMe specs.
So we've defined several different ways to do the device model.
A simple SSD model could be used for a PCIe-attached drive, for example. This could be retrieved from the drive by the BMC, perhaps using NVMe-MI, and then the BMC may present this information in the model that we're defining right now. In a JBOF, you have a PCIe front end attached to a set of drives. And then EBOFs are what we're really concentrating on: an Ethernet switch front end to a set of drives. We're also dealing with a fabric-attached model where we have a simple RAID front end, a simple controller attached to a set of drives. It's able to create and export namespaces, attaching them to specific hosts. There are other things we'll get to eventually, but these two highlighted ones are what we have mock-ups for today.
So a simple SSD implementation
would basically have one endurance group
by default
and one NVM set by default and perhaps even
one NVMe namespace by default. There is some physical element representation
info up above such as a drive, a chassis, temperature elements, et cetera.
We have a simple NVMe drive sort of bubble diagram, we call this,
which gives names to some of the elements that we're implementing.
We really want a chassis element so you can locate the drive within the overall tray, overall rack.
Power and thermal are something that the BMC wants to get hold of, but that the host also wants to know about. Drives, volumes, storage controllers, systems, and storage are the other elements we've chosen.
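As a rough sketch of how those bubble-diagram elements could appear on a drive's Redfish service, here are illustrative resource paths; the exact paths and IDs are placeholders, not requirements of the model.

```python
# Illustrative resource paths for the elements named above (placeholders only).
drive_resources = [
    "/redfish/v1/Chassis/ESSD1",                      # locate the drive in the tray/rack
    "/redfish/v1/Chassis/ESSD1/Power",                # power, for the BMC and the host
    "/redfish/v1/Chassis/ESSD1/Thermal",              # thermal, for the BMC and the host
    "/redfish/v1/Chassis/ESSD1/Drives/0",             # the physical drive element
    "/redfish/v1/Systems/ESSD1/Storage/1",            # the storage (NVMe subsystem)
    "/redfish/v1/Systems/ESSD1/Storage/1/Volumes/1",  # a namespace, modeled as a volume
]
```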
So who is developing Redfish and Swordfish? These are the companies that are participating
in each group, and some of the companies participate in both groups. And we expect the momentum behind this effort to continue and to develop quite a bit of management interface over the next couple of years. So join us in SNIA, join us in DMTF, join us in NVMe, and let's get this work done. Thank you very much. Thanks for listening. If you have
questions about the material presented in this podcast, be sure and join our developers mailing
list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers
in the Storage Developer community.
For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.