Storage Developer Conference - #167: NVMe-oF: Protocols & Transports Deep Dive

Episode Date: April 27, 2022

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 169. Hello everyone, and welcome to my session at SDC 2021 on NVMe over Fabrics Protocols and Transports, a Not Quite So Deep Dive. I'm Joseph White. I revised my abstract a little bit after I started to produce the session. Originally I was going to be much more expansive, and this led to an
Starting point is 00:01:16 enormous deck. And I thought it would be better and more concise if I looked at just a little bit of the history of block storage and at what the protocols and transports have in common, and then looked at some of the transport properties and behavior trade-offs between NVMe over TCP and NVMe over RoCE. And then I'll use some simplified examples to discuss offload processing in a data processing unit, also known as a SmartNIC. So a little bit of background is interesting here. I mean, I've been doing this work since about 1998, so 23 years now. And if you look at the history of the storage networking protocols and industry, you know, for a long time all you had was parallel SCSI. And that served fairly well for the systems in the late 80s and early 90s. NAS was around then too, but I'll focus on block storage and storage area networks.
Starting point is 00:02:24 In the late 90s, you saw the rise of Fibre Channel. And Fibre Channel is interesting because it's gone through about eight major upgrades, with many shelf-feet of specification. And it's still around today and still quite popular for storage networking. In the late 90s and early 2000s, you also had a couple of other phenomena come up. You had several attempts to network parallel SCSI over switches. There was SCSI over Ethernet, and there were mFCP and iFCP that were proposed and produced, and a few other early IP and Ethernet-based transports. You also saw the rise of iSCSI, and then finally FCoE. iSCSI, again, is still in use today, as is FCoE.
Starting point is 00:03:22 And what all of these things have in common is that they're basically carrying SCSI command formats in one form or another, and Fibre Channel can still do that with FCP. What happened in the mid-2010s is that NVMe over Fabrics arose out of the PCIe definition for NVMe. Local flash drives started appearing, they started to become cost-effective, and a much cleaner, more straightforward, more performant protocol was needed. NVMe was created for that. And pretty soon people said, well, we should network this, so the NVMe over Fabrics standards were created. And that has been mapped to a variety of transports, including RoCE, TCP, Fibre Channel, InfiniBand, and iWARP. And it's picking up steam and popularity.
Starting point is 00:04:32 There's a lot of work going on in the NVM Express standards organization, and we should see a fairly wide expansion of adoption of various forms of NVMe over Fabrics accelerating over the next few years. Analyst reports are showing considerable growth across all of the protocols, and in particular, NVMe over TCP is expected to really pick up. So what do all these protocols have in common? Well, their core purpose is to transfer data from one system to another across a network or fabric, or to issue commands to a system. So I want to put the system in a certain state, or tell it to do something related to block storage, or I want to do a read or write command and move data one direction or the other between systems. Typically, this is a server to a storage device, but it doesn't strictly have to be that.
Starting point is 00:05:32 Now, NVMe over Fabrics says, hey, we've got hosts and subsystems, and the host transfers to and from the subsystems, and that cleans up some of the protocol things. It means if you want bidirectional communication, you have to be both a host and a subsystem. And all of these protocols, in the way they transfer data, have the commands and transfers broken into packets, with a command and data protocol layered on top of a transport protocol. So there's a protocol by which you exchange the commands and the data, and then there's a transport protocol that moves the bytes across the network from one system to another.
Starting point is 00:06:15 They all have a method for sequencing commands and allowing multiple outstanding commands. Even parallel SCSI had that, you know, different kinds of queue depths, or queues, or parallel paths; there's a whole variety of things there. Within the communicating systems, the actual endpoints are typically logical or virtualized on top of the hardware. You don't talk about this Ethernet port talking to that Ethernet port in terms of the upper-level protocol. It's usually a host initiator talking through an interface that's likely virtualized to a controller on a subsystem that's hiding the details of the actual physical layout of the block storage. This is valuable because it lets you abstract and virtualize a number of layers and present services and thin provisioning and logical volumes and all the good things that you need out of enterprise-quality storage systems.
Starting point is 00:07:27 And then most of these protocols have some notion of a set of services, controllers, or orchestrators that are used to manage the systems and their connections. And then finally, there's a set of admin or control commands that are in-band to assist with discovery, connection setup, and operations. So what we're going to do now is focus on how things are transported: first the overview and the commonality, and then two of the predominant flavors of actually transporting the data. The packet formats all have headers that carry addressing, control, QoS, and some way to do tunneling or encapsulation so you can jump fabrics or isolate an overlay from an underlay. There are codes that identify the type of packet or the next header, and there are a number of names for these things. And then there's the payload, which will carry the commands, the data, some status information, some other control bits.
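As a rough illustration of that common shape (a toy format in Python, not any real protocol's wire layout; the field names and type codes here are made up for the example), a packet is just a network header that names what comes next, a protocol header that identifies the exchange and command, and a payload:

```python
import struct

# Hypothetical codes, purely for illustration -- not real EtherType or opcode values.
NEXT_HDR_STORAGE = 0x01
CMD_READ, CMD_WRITE, CMD_STATUS = 0x01, 0x02, 0x03

def build_packet(src, dst, exchange_id, command, payload=b""):
    """Wrap a command and payload in a simplified network header + protocol header."""
    net_hdr = struct.pack("!HHB", src, dst, NEXT_HDR_STORAGE)            # who it's from/to, what comes next
    proto_hdr = struct.pack("!IBH", exchange_id, command, len(payload))  # which exchange, which command
    return net_hdr + proto_hdr + payload

def parse_packet(pkt):
    src, dst, _next_hdr = struct.unpack_from("!HHB", pkt, 0)
    exchange_id, command, length = struct.unpack_from("!IBH", pkt, 5)
    return {"src": src, "dst": dst, "exchange": exchange_id,
            "command": command, "payload": pkt[12:12 + length]}

if __name__ == "__main__":
    pkt = build_packet(src=1, dst=2, exchange_id=0x10, command=CMD_READ)
    print(parse_packet(pkt))
```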
Starting point is 00:08:46 There's typically a form of data integrity, so, you know, CRCs or checksums, so you know that nothing's been accidentally altered in the packet. And then there's typically a data confidentiality and tamper-resistance capability. That's so you know no one's messed with your packet deliberately, and so that no one can see the contents of your packet, if you set that up. And you can see in the Fibre Channel example, I've got the ability to put in a virtual fabric tag, or some encapsulation header, or inter-fabric routing, which never really got used much, but it was an interesting protocol. I have OX_IDs and RX_IDs that tell me which exchange these things belong to, which communication they belong to. And then you have a payload. If we look at a couple of other examples, again,
Starting point is 00:09:52 you can see the commonality. FCoE essentially moved Fibre Channel frames with Ethernet encapsulation, so it's an inherently layer 2 protocol for the Ethernet segment. You have an Ethernet header, then an FCoE header to tell you, hey, I've got a Fibre Channel header immediately coming, so that the forwarding pipelines can do their job and send the packets to the correct place. If you look at iSCSI, I show the IPv4 example, but it could be IPv6. This is traditional Ethernet, IPv4, TCP, and then typically a payload that has a number of commands. And iSCSI is a stream protocol, so not all frames will line up exactly like this. I'll explain that more in the context of NVMe over TCP. So finally we come to NVMe over Fabrics. As I said earlier,
Starting point is 00:10:54 we've got a variety of protocol encapsulations, and the possible stacks are quite complex. But there are common properties for NVMe-oF, right? Common command formats, multiple command sets, multiple transport mappings, rich discovery and connection setup, and efficient read and write command formats. And then there's the capability for lots of queue pairs, for concurrency and multiple outstanding commands. So let's look at two of the more prevalent versions of encapsulation. The first one is NVMe over RoCEv2. Here we have an Ethernet header, an IP header, then a UDP header, and I'll show more details on that part of the protocol coming up here. And then we have a full InfiniBand header and payload inside. So I'm doing RDMA using RDMA verbs, and that encapsulates the NVMe traffic inside the InfiniBand payload. So what this is, really, is a tunnel or a wrapper for InfiniBand RDMA commands.
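As a rough sketch of that layering (nominal header sizes only; this is illustrative bookkeeping, not a working RoCEv2 implementation), the per-packet overhead stacks up roughly like this:

```python
# Rough sketch of the NVMe over RoCEv2 encapsulation layering.
# Sizes are nominal header lengths; this is bookkeeping for illustration,
# not a functional RoCEv2 stack.
LAYERS = [
    ("Ethernet header",         14),  # destination MAC, source MAC, EtherType
    ("IPv4 header",             20),  # could equally be IPv6 (40 bytes)
    ("UDP header",               8),  # RoCEv2 uses UDP destination port 4791
    ("InfiniBand BTH",          12),  # base transport header carrying the verbs transport info
    ("NVMe-oF command capsule", 64),  # 64-byte submission queue entry (plus optional in-capsule data)
]

if __name__ == "__main__":
    for name, size in LAYERS:
        print(f"{name:26s} {size:3d} bytes")
    total = sum(size for _, size in LAYERS)
    print(f"{'Total before bulk data':26s} {total:3d} bytes")
    # RDMA extended headers (e.g. RETH) and the trailing ICRC add more bytes
    # on data-carrying packets; omitted here to keep the sketch minimal.
```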
Starting point is 00:12:18 And that has some nice advantages and some disadvantages, some interesting trade-offs, especially with using UDP as a transport, because UDP as a transport basically requires Ethernet to be lossless. And that's typically done with either link-level flow control or priority flow control. Okay, so now let's look a little bit at NVMe over TCP, and then we'll get into an example and do the contrast between them. So NVMe over TCP has an encapsulation, shown on the left, that forms a protocol data unit header, a PDU header. And then there's a PDU payload, which is the contents or the data that's being transported. And if I have a stream of those, they can be placed into the TCP packets in arbitrary fashion.
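As a concrete illustration of that arbitrary placement (a toy PDU framing, not the real NVMe/TCP PDU header layout), here's a sketch that serializes a few PDUs back to back into one byte stream and then chops the stream into fixed-size segments; the PDU boundaries and segment boundaries don't line up:

```python
import struct

def make_pdu(tag, data):
    """Toy PDU: a small header (tag + payload length) followed by the payload.

    Illustrates PDU-in-stream framing; this is not the actual NVMe/TCP PDU format.
    """
    return struct.pack("!HI", tag, len(data)) + data

def segment_stream(stream, mss):
    """Chop a byte stream into TCP-style segments of at most `mss` bytes."""
    return [stream[i:i + mss] for i in range(0, len(stream), mss)]

if __name__ == "__main__":
    stream = b"".join([
        make_pdu(1, b"A" * 700),   # say, the response data for "read data 1"
        make_pdu(2, b"B" * 700),
        make_pdu(3, b"C" * 700),
    ])
    segments = segment_stream(stream, mss=1460)
    # Three PDUs become two segments, and the first segment ends partway
    # through the third PDU -- boundaries fall wherever the byte count says.
    print([len(s) for s in segments])   # -> [1460, 658]
```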
Starting point is 00:13:26 So think of a set of commands and their data as just a byte stream, and it gets chopped up arbitrarily into TCP segments, which have IP headers on them. And so there's no particular alignment relationship between the carried PDUs and the packets you would see on the wire. But because it's TCP, it can run over either lossy or lossless networks equally. Okay, so let's peel back one more layer. I'm not going to do a deep dive on either UDP or TCP, but I do want to point out the top-level differences. So UDP is a straightforward datagram protocol. I send a datagram with a UDP header, I push it onto the network, and the receiving side passes the packet up to the application that's registered to handle that port, or the thread of execution or the kernel area that's registered to handle that port.
Starting point is 00:14:51 And you have your datagram and you work on the data. With TCP, it's a little more complex in that it's a byte-stream protocol that's connection-oriented with guaranteed delivery. Flow control is built into the protocol, so you have to handle the windowing: there's a window you can send into, there's a sender's congestion window, there are sequence numbers and acknowledgment numbers. The guaranteed delivery means you have to get an acknowledgment back before you can forget about the data on the transmitting side, and the receiver has to actually send that acknowledgment back. There's a whole variety of retransmission protocols
Starting point is 00:15:29 and ways in which you slow down, you know, fast retransmit combined with congestion avoidance, retransmission timeout, and slow start. So there's a lot of stuff going on in TCP. But again, fundamentally, I've got a byte stream that I chop into packets and put on the wire. Okay, so let's discuss the consequences of these two approaches with a simplified example. In the datagram example, let's suppose a host wants to read three chunks of data, data one, data two, and data three, into buffers one, two, and three. So this would be an example of using NVMe over RoCE. The packets that go out have network headers, which are green, and protocol headers, which are brown. And if things have to be broken up,
Starting point is 00:16:47 it's no big deal, you just get additional protocol headers. It's fairly straightforward. RDMA has set up the buffering such that the data can be DMAed, or placed directly, into the buffers. And, you know, this protocol works quite well, but any drops have to be recovered at the command and upper-level protocol layer. So, again, as I said before, you don't want drops on this one.
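A minimal sketch of why that direct placement works (purely illustrative; real RDMA uses registered memory regions and verbs, not Python byte arrays): every datagram carries a protocol header naming the buffer and offset it belongs to, so each packet can be placed as it arrives, independently of the others.

```python
import struct

HDR = struct.Struct("!HI")  # toy header: (buffer_id, offset) -- a stand-in for the RDMA placement info

def make_datagram(buffer_id, offset, data):
    """Each datagram is self-describing: the header says exactly where its data goes."""
    return HDR.pack(buffer_id, offset) + data

def place(datagram, buffers):
    """Direct placement: no reassembly state needed, just follow the header."""
    buffer_id, offset = HDR.unpack_from(datagram, 0)
    data = datagram[HDR.size:]
    buffers[buffer_id][offset:offset + len(data)] = data

if __name__ == "__main__":
    buffers = {1: bytearray(8), 2: bytearray(8), 3: bytearray(8)}
    datagrams = [
        make_datagram(1, 0, b"DATA-ONE"),
        make_datagram(2, 0, b"DATA-TWO"),
        make_datagram(3, 0, b"DATA-3.."),
    ]
    # Arrival order doesn't matter: each packet stands on its own.
    for d in reversed(datagrams):
        place(d, buffers)
    print(buffers)
```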
Starting point is 00:17:21 The other example would be using NVMe over TCP. There I might have one packet with three commands in it, so I've got three protocol headers and three commands that say read data one, read data two, read data three. The subsystem returns the data as a byte stream, so there are protocol headers that get spread across the packets, and data can be broken up in arbitrary ways. The subsystem has to walk through the command packets to get each read data command, and then the host has to wait for the second packet before it can deal with
Starting point is 00:18:09 all of data two. With zero-copy TCP, you can still DMA the data into a buffer, so you can get sort of equivalent behavior there. There's a consequence, though: if I dropped the first packet, the one that has data one and the first part of data two in it, I wouldn't know what to do with the second packet, because it just starts with data. There's no header, there's nothing, and I've lost track of where I am in the sequence. So I have to wait for that first packet to arrive before I can handle the second packet. Whereas in the datagram example,
Starting point is 00:18:48 I can handle each packet as it arrives, because I know where it's supposed to go based on the header in that packet. So that's sort of the trade-off between the two. Now, what I want to point out is the commonality again. In each case, the commands and the data transferred are the same, so there's no meaningful difference in the number of packets. Both protocols can use jumbo frames, and both protocols have to chop the data up if it exceeds the size of a single frame. So there's not really an inherent advantage or disadvantage from the point of view of the commands and data themselves.
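To see that head-of-line dependency in code (again a toy framing, not the real NVMe/TCP PDU layout), here's a sketch of a stream-side parser: it can only pull complete PDUs off the front of the in-order byte stream, so nothing that arrives after a missing segment can be interpreted until TCP fills the gap:

```python
import struct

HDR = struct.Struct("!HI")  # toy PDU header: (tag, payload_length)

def parse_pdus(stream):
    """Pull complete PDUs off the front of an in-order byte stream.

    Returns (pdus, leftover). Bytes after a gap simply haven't been delivered
    by TCP yet, so the parser stalls at the gap rather than skipping ahead.
    """
    pdus, offset = [], 0
    while offset + HDR.size <= len(stream):
        tag, length = HDR.unpack_from(stream, offset)
        if offset + HDR.size + length > len(stream):
            break  # partial PDU at the tail: wait for more bytes
        start = offset + HDR.size
        pdus.append((tag, bytes(stream[start:start + length])))
        offset = start + length
    return pdus, stream[offset:]

if __name__ == "__main__":
    stream = (struct.pack("!HI", 1, 4) + b"AAAA" +
              struct.pack("!HI", 2, 4) + b"BBBB")
    # If the segment carrying the first header were lost, none of the later
    # bytes could be interpreted until the retransmission arrives.
    print(parse_pdus(stream))   # -> ([(1, b'AAAA'), (2, b'BBBB')], b'')
```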
Starting point is 00:19:45 So finally, I want to talk about the implications of this for data processing units, or SmartNICs, and protocol acceleration. A data processing unit has five functional blocks at its base. It's got a PCIe interface to the host. It's got ARM cores with DRAM or high-bandwidth memory. It's got protocol acceleration pipelines, so embedded cores or P4 pipelines or hard-coded blocks that can deal with various kinds of offload. It has security and support accelerators, so if I want TLS encryption or IPsec encryption, I can get that inline in the hardware. And then finally, it's got a network interface. Typically, you see two to eight ports.
Starting point is 00:20:55 Those are increasing over time. That network interface can do embedded switching and traditional network packet processing, you know, filters and policers and those sorts of things. It does the forwarding lookup. And then it also has NIC functions, so that the host says, oh, I have a NIC at this SR-IOV address, I know how to deal with that. If you look at where it sits within a server, it sits in the position a NIC would sit, but it's much more capable. It can do direct access to other PCIe devices, persistent memory, GPUs, and NVMe over Fabrics. It can also do DMAs into and out of the x86 CPU's DRAM.
Starting point is 00:21:47 And again, strictly, it doesn't have to be an x86, and the DPU doesn't strictly have to have ARM cores, but this is just what we see typically. And then it's allowed to have a sideband interface to a BMC, a baseboard management controller. So what can you do with a DPU in terms of protocol acceleration? Well, you can support dedicated NVMe over Fabrics processing in hardware pipelines. This means that you're offloading your x86 cores and not requiring your kernel to do that work. Typically, offload processing can be done much more efficiently, both in terms of power
Starting point is 00:22:28 and latency and throughput. The DPU can directly place or send data to or from the host DRAM via DMA, and it can also operate across PCIe to local devices. So I could expose my NVMe drive out to the network through the DPU, for example. Both the TCP and the RDMA RoCE transports can be offloaded. The accelerators and pipelines must be constructed correctly to match the transport. As we saw in the diagram before, there are differences in the way the transport is executed and in the requirements on the network, but there's no core limitation or core problem with doing the acceleration.
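As a highly simplified, hypothetical sketch of that split (illustrative structure only, not any vendor's DPU firmware or API): the transport-specific front end that terminates RoCE packets or reassembles the TCP stream differs per transport, while the NVMe command handling and the DMA placement into host memory behind it look the same either way.

```python
# Hypothetical sketch of a DPU-side NVMe-oF offload flow. Names and structure
# are illustrative only; real DPUs do this in hardware pipelines and firmware.

class DmaEngine:
    """Stand-in for the DPU's DMA block that places data into host DRAM."""
    def __init__(self):
        self.host_memory = {}

    def write_host_buffer(self, buffer_id, data):
        self.host_memory[buffer_id] = data

def transport_terminate(transport, raw):
    """Transport-specific step: recover command capsules from the wire format.

    For the datagram transport each packet already holds a self-contained
    capsule; for the stream transport the capsules come out of the reassembled,
    in-order byte stream. Both paths feed identical capsules to the common handler.
    """
    if transport == "roce":
        return [raw]            # one capsule per packet
    if transport == "tcp":
        return list(raw)        # capsules parsed out of the reassembled stream
    raise ValueError(f"unknown transport: {transport}")

def handle_capsule(capsule, dma, backing_store):
    """Transport-agnostic step: decode the NVMe command and place data via DMA."""
    if capsule["opcode"] == "read":
        data = backing_store[capsule["lba"]]
        dma.write_host_buffer(capsule["host_buffer"], data)
    # write, admin, and error paths omitted in this sketch

if __name__ == "__main__":
    dma = DmaEngine()
    backing_store = {0: b"block-zero", 8: b"block-eight"}
    read_capsule = {"opcode": "read", "lba": 8, "host_buffer": 1}
    for c in transport_terminate("roce", read_capsule):
        handle_capsule(c, dma, backing_store)
    print(dma.host_memory)      # -> {1: b'block-eight'}
```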
Starting point is 00:23:29 And you just have to take into account the way the transports move the commands and data. That's really my core conclusion here: this set of differences in the transports matters in terms of how you do the implementation, but you can get effective performance out of both transports, depending on how your system and your offloads are built. That concludes my shortened presentation, and hopefully it was interesting, entertaining, and to the point. So thank you very much. Thanks for listening.
Starting point is 00:24:16 If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
