Storage Developer Conference - #167: NVMe-oF: Protocols & Transports Deep Dive
Episode Date: April 27, 2022...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast Episode 169.
Hello everyone and welcome to my session at SDC 2021, NVMe over Fabrics Protocols and Transports: Not Quite So Deep Dive.
I'm Joseph White, and I revised my abstract a little bit after I started to produce the session. Originally I was going to be much more expansive, and this led to an
enormous deck. I thought it would be better and more concise if I looked at just a little bit of the history of block storage
and at what the protocols and transports have in common, then look at some of the transport
properties and behavior trade-offs between NVMe over TCP and NVMe over RoCE, and then use
some simplified examples to discuss offload processing in a data processing unit, also known as a SmartNIC.
So a little bit of background is interesting here. I mean, I've been doing this work since about 1998, so 23 years now. And if you look at the history of the storage networking
protocols and industry, you know, for a long time, all you had was parallel SCSI.
And, you know, that served fairly well for the systems in the late 80s and early 90s.
And, you know, NAS was around then, but I'll focus on block storage, storage area networks.
In the late 90s, you saw the rise of Fibre Channel.
And Fibre Channel is interesting because it's gone through about eight major upgrades with
many shelf-feet of specification. And it's still around today and still quite popular
for storage networking. In the late 90s and early 2000s, you also had a couple of other
phenomena come up. You had several switched-network parallel SCSI approaches: there was
SCSI over Ethernet, there were mFCP and iFCP that were proposed and produced, and a few other early IP/Ethernet-based transports.
You also saw the rise of iSCSI, and then finally FCoE.
iSCSI, again, is still in use today, as is FCoE.
And what all of these things have in common is that they're basically carrying
SCSI command formats in one form or another, and Fibre Channel still can do that with FCP. What happened in the mid-2010s is that NVMe over Fabrics arose out of the PCIe definition for NVMe.
Local flash drives started appearing.
They started to become cost-effective, and a much cleaner, more straightforward, more performant protocol was needed.
NVMe was created for that.
And pretty soon people said, well, we should network this, so the NVMe over Fabrics standards were created.
And that has been mapped to a variety of transports, including RoCE, TCP, Fibre Channel,
InfiniBand, and iWARP. And it's picking up steam and popularity.
There's a lot of work going on in the NVM Express standards organization,
and we should see a fairly wide expansion of adoption of various forms of NVMe over Fabrics accelerating over the next few years.
So analyst reports are showing considerable growth
across all of the protocols. And in particular, NVMe over TCP is expected to really pick up.
So what do all these protocols have in common? Well, their core purpose is to transfer data from one system to another across a network or fabric or issue commands to a system.
So I want to basically put the system in a certain state or to tell it to do something related to block storage,
or I want to do a read or write command and move data one direction or the other between systems. Typically, this is a server to a storage device,
but it doesn't strictly have to be that.
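To make that read/write picture a bit more concrete, here is a minimal Python sketch (not from the talk) that packs the key fields of an NVMe Read submission queue entry: opcode, command identifier, namespace ID, starting LBA, and block count. The specific values and the omission of the data pointer are hypothetical simplifications.

```python
import struct

def build_nvme_read_sqe(cid, nsid, slba, nlb):
    """Pack a minimal 64-byte NVMe Read submission queue entry (sketch only).

    Only the fields discussed here are filled in; the data pointer (PRP/SGL)
    is left zero, which a real host would never do.
    """
    OPC_READ = 0x02                      # NVM command set: Read
    cdw0 = OPC_READ | (cid << 16)        # opcode + command identifier
    cdw10 = slba & 0xFFFFFFFF            # starting LBA, low 32 bits
    cdw11 = (slba >> 32) & 0xFFFFFFFF    # starting LBA, high 32 bits
    cdw12 = (nlb - 1) & 0xFFFF           # number of logical blocks, 0-based
    # 16 little-endian dwords = 64 bytes; unused dwords stay zero
    dwords = [cdw0, nsid, 0, 0, 0, 0, 0, 0, 0, 0, cdw10, cdw11, cdw12, 0, 0, 0]
    return struct.pack("<16I", *dwords)

# Hypothetical example: read 8 blocks starting at LBA 4096 from namespace 1
sqe = build_nvme_read_sqe(cid=7, nsid=1, slba=4096, nlb=8)
print(len(sqe), sqe[:4].hex())
```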
Now, NVMe over Fabrics says,
hey, we've got hosts and subsystems,
and the host transfers to and from the subsystems,
and that cleans up some of the protocol things.
It means if you want bidirectional communication, you have to be both a host and a subsystem. And all of these protocols,
in the way they transfer data, have the commands and transfers broken into packets,
with an upper-level protocol layered on top of a transport protocol. So there's a protocol by which you exchange the commands and the data,
and then there's a transport protocol that moves the bytes across the network from one system to another.
They all have a method for sequencing commands and allowing multiple outstanding commands.
Even parallel SCSI had that, you know, different kinds of queue depths or queues or parallel paths; there's a whole variety of
things there. Within the communicating systems, the actual endpoints are typically logical or
virtualized on top of the hardware. You don't talk about this Ethernet port talking to that Ethernet port in terms of the upper-level protocol.
It's usually a host initiator talking through an interface that's likely virtualized to a controller on a subsystem that's hiding the details of the actual physical layout of the block storage.
This is valuable because it lets you abstract and virtualize a number of layers
and present services and thin provisioning and logical volumes
and all the good things that you need out of, you know, enterprise quality storage systems.
And then most of these protocols have some notion of a set of services, controllers,
or orchestrators that are used to manage the systems and their connections.
And then finally, there's a set of admin or control commands that are in-band
to assist with discovery, connection setup, and operations.
So what we're going to do is focus on how things are transported.
I'm going to go over the commonality and then talk about two of the predominant flavors of actually transporting the data. If you look at the packets themselves, they have headers for forwarding, control, QoS, and some way to do tunneling or encapsulation so you can jump fabrics or isolate an overlay from an underlay.
There are codes that identify the type of packet or the next header, and there are a number of names for these things.
And then there's the payload, which will carry the commands, the data, some status information, and some other control bits.
There's typically a form of data integrity, so, you know, CRCs or checksums, so you know that
nothing's been accidentally altered in the packet. And then there's typically a data
confidentiality and tamper-resistance capability, so that you know no one's messed with your packet deliberately
and so that no one can see the contents of your packet if you set that up.
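As a rough illustration of the data-integrity piece (this example is mine, not the speaker's), a CRC appended by the sender lets the receiver detect accidental corruption. Python's zlib.crc32 stands in here for the protocol-specific CRCs, such as Fibre Channel's frame CRC or NVMe/TCP's header and data digests.

```python
import zlib

payload = b"block storage data..."          # hypothetical payload bytes
crc = zlib.crc32(payload)                   # sender computes and appends a CRC
frame = payload + crc.to_bytes(4, "little")

# Receiver recomputes the CRC over the payload and compares it to the trailer
received_payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "little")
assert zlib.crc32(received_payload) == received_crc, "accidental corruption detected"
```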
And you can see in the Fibre Channel example, you know, I've got the ability to put in a virtual fabric tag or
some encapsulation header or inter-fabric routing, which never really got used much, but it was an
interesting protocol. You know, I have OX_IDs and RX_IDs that tell me which exchange these things belong to, which communication they belong to.
And then you have a payload. If we look at a couple of other examples, again,
you can see the commonality. FCoE moved essentially Fibre Channel frames with
Ethernet encapsulation, so it's an inherently layer-two protocol for the Ethernet
segment. And so you have an Ethernet header, an FCoE header to tell you, hey, I've got a Fibre
Channel header immediately coming, so that the forwarding pipelines can do their job and send
the packets to the correct place. If you look at iSCSI, you know, I show the IPv4 example, or it could be IPv6.
This is a traditional Ethernet IPv4 TCP frame, and then there's typically a payload that has
a number of commands. iSCSI is a stream protocol, so not all frames will line up exactly like this; I'll explain that
more in the context of NVMe over TCP. So finally we come to NVMe over Fabrics. As I said earlier,
we've got a variety of protocol encapsulations, and the possible stacks are quite complex. But there are common properties for NVMe-oF, right? Common command formats, multiple command sets, multiple transport mappings, rich discovery and connection setup, efficient read and write command formats. And then there's a notion of,
or the capability for, lots of queue pairs for concurrency and multiple outstanding commands.
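As a toy model of queue pairs and multiple outstanding commands (a sketch I'm adding, not the NVMe driver interface), think of a submission queue and a completion queue where completions are matched back to commands by command identifier:

```python
from collections import deque

class QueuePair:
    """Toy model of an NVMe submission/completion queue pair (sketch only)."""
    def __init__(self):
        self.sq = deque()          # submission queue: commands toward the subsystem
        self.cq = deque()          # completion queue: status back to the host
        self.outstanding = {}      # commands in flight, keyed by command ID
        self.next_cid = 0

    def submit(self, command):
        cid = self.next_cid
        self.next_cid += 1
        self.outstanding[cid] = command
        self.sq.append((cid, command))      # many commands can be in flight at once
        return cid

    def complete(self, cid, status=0):
        self.cq.append((cid, status))
        return self.outstanding.pop(cid)    # match the completion back to its command

qp = QueuePair()
ids = [qp.submit(f"read data{i}") for i in (1, 2, 3)]   # three outstanding reads
qp.complete(ids[1])                                     # completions may arrive in any order
```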
So let's look at two of the more prevalent versions of encapsulation. The first one is
NVMe over RoCEv2. Here we have an Ethernet header, an IP header, then a UDP header.
And I'll show more details on that part of the protocol coming up here.
And then we have a full InfiniBand header and payload inside.
So I'm doing RDMA using RDMA verbs, and that encapsulates the NVMe
inside the InfiniBand payload. So what this is, is a tunnel or a wrapper for
InfiniBand RDMA commands. And that has some nice advantages and some disadvantages,
some interesting trade-offs, especially with using UDP as a transport. Because UDP as a
transport basically requires the Ethernet network to be lossless, and that's typically done with either
link-level flow control or priority flow control.
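For a sense of what that stacking looks like on the wire, here is a hedged Python sketch that classifies a frame as RoCEv2 by the well-known UDP destination port 4791 and pulls two fields out of the InfiniBand Base Transport Header. It assumes untagged Ethernet and IPv4 without options, and the offsets reflect my reading of the header layouts rather than anything shown in the talk.

```python
ROCEV2_UDP_PORT = 4791   # IANA-assigned UDP destination port for RoCEv2

def classify_rocev2(frame: bytes):
    """Return (opcode, dest_qp) if the frame looks like RoCEv2, else None.

    Sketch only: assumes untagged Ethernet and IPv4 with no options, and does
    no checksum or ICRC validation.
    """
    if len(frame) < 54 or frame[12:14] != b"\x08\x00":   # EtherType 0x0800 = IPv4
        return None
    if frame[23] != 17:                                  # IP protocol 17 = UDP
        return None
    udp = frame[34:42]
    dport = int.from_bytes(udp[2:4], "big")
    if dport != ROCEV2_UDP_PORT:
        return None
    bth = frame[42:54]                                   # InfiniBand Base Transport Header
    opcode = bth[0]                                      # RDMA opcode (e.g. RDMA WRITE, SEND)
    dest_qp = int.from_bytes(bth[5:8], "big")            # destination queue pair number
    return opcode, dest_qp
```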
Okay, so now let's look at a little bit about NVMe over TCP.
And then we'll get into, you know, an example and do the contrast between them.
So NVMe over TCP has an encapsulation shown on the left to form a protocol data unit header, a PDU header.
And then there's a PDU payload, which is the contents or the data that's being transported. And if I have a stream of those, they can be placed in the TCP packets in arbitrary fashion.
So think of a set of commands and their data as just a byte stream,
and it gets chopped up arbitrarily into TCP segments,
which have IP headers on them.
And so there's no particular alignment relationship between the carried PDUs and the packets you would see on the wire.
But because it's TCP, it can run over either lossy or lossless networks equally well.
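To illustrate that re-framing job, here is a small sketch of pulling PDUs back out of a reassembled TCP byte stream. The 8-byte common header layout (type, flags, header length, data offset, total length) follows my reading of the NVMe/TCP transport binding and should be treated as an assumption.

```python
import struct

def extract_pdus(stream: bytes):
    """Split a reassembled NVMe/TCP byte stream into PDUs (sketch only).

    Each PDU starts with an 8-byte common header: type, flags, header length,
    PDU data offset, and a 32-bit total PDU length (PLEN). Because TCP is a
    byte stream, a PDU may be split across any number of TCP segments, so a
    real receiver keeps the trailing partial PDU and waits for more bytes.
    """
    pdus, offset = [], 0
    while offset + 8 <= len(stream):
        pdu_type, flags, hlen, pdo, plen = struct.unpack_from("<BBBBI", stream, offset)
        if plen < 8:
            raise ValueError("malformed PDU: PLEN smaller than the common header")
        if offset + plen > len(stream):
            break                       # partial PDU: wait for the next TCP segment
        pdus.append(stream[offset:offset + plen])
        offset += plen
    return pdus, stream[offset:]        # complete PDUs plus any leftover bytes
```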
Okay, so let's peel back one more.
And I'm not going to go do a deep dive on either UDP or TCP, but I do want to point out sort of the top level differences.
So UDP is a straightforward datagram protocol. I put my data in a UDP header, I push it onto the network, and the receiving side passes the packet up to the application that's registered to handle that port, or the thread of execution or the kernel area that's registered to handle that port.
And you have your datagram and you work on the data.
With TCP, it's a little more complex in that it's a byte stream protocol that's connection oriented with guaranteed delivery.
And flow control is built into the protocol, so you have to handle the windowing.
You know, there's a window you can send into, there's a
sender's congestion window, there are sequence numbers and acknowledgement numbers. The
guaranteed delivery means you have to get an acknowledgement back before you can forget about
the data on the transmitting side, and the receiver has to actually send that acknowledgement back.
There's a whole variety of retransmission protocols
and ways in which you slow down, you know,
fast retransmit combined with congestion avoidance,
and retransmission timeout followed by slow start.
So there's a lot of stuff going on in TCP.
But again, fundamentally, I've got
a byte stream that I chop into packets and put on the wire.
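A minimal socket sketch of that difference (my example, with arbitrary loopback addresses): UDP hands the receiver one datagram per send, while TCP hands back a single ordered byte stream that the receiver has to frame itself.

```python
import socket

# UDP: each sendto() arrives (if it arrives) as one self-contained datagram.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for msg in (b"data1", b"data2", b"data3"):
    tx.sendto(msg, rx.getsockname())
print([rx.recvfrom(2048)[0] for _ in range(3)])   # three separate datagrams

# TCP: the same three sends arrive as one ordered byte stream; message
# boundaries are gone and the receiver must re-frame (as NVMe/TCP PDUs do).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()
for msg in (b"data1", b"data2", b"data3"):
    cli.sendall(msg)
cli.close()
print(conn.recv(2048))    # likely b"data1data2data3" in a single read
```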
Okay, so let's discuss the consequences of these two approaches with a simplified example. In the datagram example, let's suppose a host wants to read three chunks of data, data one, data two, and data three, into buffers one, two, and three.
So this would be an example of using NVMe over RoCE. And the packets that go out have network headers, which are green, and protocol headers, which are brown. And if things have to be broken up,
no big deal; you just get additional protocol headers. It's fairly straightforward.
The RDMA layer has set up the buffering such that the data can be DMA'd or
placed directly into the buffers.
And, you know, this protocol works quite well, but any drops have to be recovered at the command and upper-level protocol layer.
So, again, as I said before, you don't want drops on this one.
The other example would be using, you know, NVMe over TCP. And there I might have one packet with
three commands in it. So I've got three protocol headers and three commands that say read data one,
read data two, read data three. The subsystem returns the data as a byte stream, so there
are protocol headers that get spread across the packets, and data can be broken up
in arbitrary ways.
And the subsystem has to walk
through the command packets to get each read data command.
And then the host has to wait for the second packet before it can deal with
all of data two.
With zero-copy TCP, you can still DMA the data into a buffer,
so you can get sort of equivalent behavior there.
There's a consequence, though: if I dropped the first packet, the one that has data one and the
first part of data two in it, I wouldn't know what to do with the second packet, because it just starts
with data. There's no header, there's no nothing, and I've lost track of where I am in the sequence. So
I have to wait for that first packet to arrive before I can handle the second packet.
Whereas in the datagram example,
I can handle each packet as it arrives, because I know where it's supposed to go
based on the header in that packet.
So that's sort of the trade-off between the two.
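One way to picture that trade-off (a toy sketch of mine, with made-up packet contents): the datagram receiver can place every packet it does get immediately, because each one is self-describing, while the stream receiver can only consume bytes that are contiguous from the start and stalls behind any gap.

```python
# Datagram-style receive: every arriving packet is self-describing, so it can
# be placed into its buffer immediately, even if an earlier packet was dropped.
def place_datagrams(packets, buffers):
    for (buf, off), data in packets:          # each header names a target buffer + offset
        buffers[buf][off:off + len(data)] = data

# Stream-style receive: bytes are only usable once they are contiguous from the
# left, so a dropped segment stalls everything behind it until retransmission.
def deliver_stream(segments, expected_seq=0):
    delivered = b""
    pending = dict(segments)                  # seq -> payload, possibly out of order
    while expected_seq in pending:
        data = pending.pop(expected_seq)
        delivered += data
        expected_seq += len(data)
    return delivered, pending                 # pending holds bytes stuck behind a gap

# Hypothetical: the packet for buffer 1 was lost, but buffer 2's packet still lands.
bufs = {1: bytearray(8), 2: bytearray(8)}
place_datagrams([((2, 0), b"data2")], bufs)
# Hypothetical: the stream segment at seq 0 was dropped; seq 100 arrived but can't be used yet.
print(deliver_stream([(100, b"data2 tail")]))   # -> (b'', {100: b'data2 tail'})
```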
Now, what I want to point out is the commonality again.
In each case, the commands and the data transferred are the same. So there's no meaningful difference in number of packets. They can, you know, both protocols can use jumbo frames.
Both protocols have to, you know, chop the data up if it exceeds the size of a single frame.
So there's not really an inherent advantage or disadvantage
from the point of view of the commands and data
themselves. So finally, I want to talk about the implications of this for data processing units or
smart NICs and protocol acceleration. So a data processing unit has five functional blocks at its base.
It's got a PCI interface to the host.
It's got ARM cores with DRAM or high bandwidth memory.
It's got protocol acceleration pipelines. So it's got embedded cores or P4 pipelines or hard-coded blocks that can deal with various kinds of offload.
It has security and support accelerators.
So if I want TLS encryption or IPsec encryption, I can get that in band, in line in the hardware.
And then finally, it's got a network interface, typically two to eight ports,
and those are increasing over time. That can do embedded switching and traditional network packet processing, you know, filters and policers and those sorts of things.
It does the forwarding lookup. And then it also has NIC functions, so that the host says,
oh, I have a NIC at this SR-IOV address, I know how to deal with that.
And if you look at where it sits within a server,
it sits in the position a NIC would sit, but it's much more capable.
It can do direct access to other PCI devices, persistent memory,
GPUs, NVMe over fabric.
It can also do DMAs into and out of the x86 CPU's DRAM.
And again, strictly, it doesn't have to be an x86 and the DPU doesn't strictly have to have ARM cores,
but this is just what we see typically.
And then it's allowed to have a sideband interface
to a BMC, a baseboard management controller.
So what can you do with a DPU in terms of protocol acceleration?
Well, you can support dedicated NVMe fabric processing in hardware pipelines. This means
that you're offloading your x86 cores and not requiring your kernel to do that work.
Typically, offload processing can be done much more efficiently, both in terms of power
and latency and throughput. The DPU can directly place or send data to or from the host DRAM
via DMA, and it can also operate across PCIe to local devices.
So I could expose my NVMe drive out to the network through the DPU, for example.
Both the TCP and RDMA/RoCE transports can be offloaded.
The accelerators and pipelines
must be constructed correctly to match the transport. So as we saw in the diagram before,
there are differences in the way the transport is executed and the requirements on the network,
but there's no core limitation or core problem with doing the acceleration.
And you just have to take into account the way the transports move the commands and data.
And that's really my core conclusion here, is that this set of differences in the transports matters in terms of how you do the implementation,
but you can get effective performance out of both transports depending on how your system and your offloads are built.
That concludes my shortened presentation,
and hopefully it was interesting,
entertaining, and to the point.
And so thank you very much.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to
developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer
community. For additional information about the storage developer conference, visit www.storagedeveloper.org.