Storage Developer Conference - #59: Introducing Fibre Channel NVMe

Episode Date: January 2, 2018

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast, Episode 59. So welcome to the introduction to FC-NVMe, which, if you don't know the acronym, is Fibre Channel over NVM Express. I'm sorry, the other way around, NVM Express over Fibre Channel. So a little bit about myself. I'm Craig Carlson. I'm with Cavium.
Starting point is 00:01:00 I chair the FC-NVMe working group within T11. I also do some other things in the industry. I also want to thank J Metz of Cisco for contributions to the slide deck, as well as for taking one of the only pictures of myself I actually like. So the agenda for this talk is, we're going to do some background refreshers on Fibre Channel and on NVMe.
Starting point is 00:01:29 And then we're going to go into what FC-NVMe is itself, and then also a short section on why you might want to use FC-NVMe. So just some groundwork here. This presentation is a reminder of how Fibre Channel works, how NVMe over Fabrics works, and a high-level overview of Fibre Channel NVMe and how they work together. What we're not going to do today is a technical deep dive, no boiling the ocean, and we're not going to do a comparison of FC-NVMe against other methods of doing NVMe over fabrics. If you have any further questions, if you want a deeper dive, you can come and ask me, and I can bore you as much as you want and give you all the details you want. Just come find me in the hallway. So, Fibre Channel.
Starting point is 00:02:29 So what is Fibre Channel? I'm sure a lot of you here already know or have some experience with Fibre Channel. But Fibre Channel is purpose-built for storage. It's a high-speed connection between a host and storage, and it's also a logical protocol between the host and storage. So what were some of the design requirements that went into making Fibre Channel in the first place? One of the primary design requirements was one-to-one connectivity. Even though you're on a network, a distributed network or a fabric, the hosts and storage devices really act like they're connected one-to-one.
Starting point is 00:03:10 Also, transport and services are on the same layer. We don't have a stack of different protocols on top, like TCP/IP, where you have many different layers. We do have layers, but not as many as that. So most stuff is on the same layer. There's also a well-defined end-device relationship, initiators and targets, which comes from SCSI. And there is no built-in packet drop. Now, of course, things happen, packets can be corrupted and things can drop, but there are no congestion management procedures where packets get dropped like that. There's really only north-south traffic, meaning traffic between hosts and the storage devices. And the Fibre Channel network is also optimized for high availability. You can have multiple paths through a fabric, and there are also services built in.
Starting point is 00:04:09 And I'll go into that in a second. So a basic Fibre Channel configuration looks like this. You have an initiator, which is your host. You have your fabric in between, and of course you have your storage device. For Fibre Channel, the initiator contains something called the HBA, the host bus adapter; in other network technologies, in the Ethernet world, this is called a NIC. This is where the protocols get encapsulated into Fibre Channel frames.
Starting point is 00:04:58 And then of course there's the fabric, which is a set of switches, one or more switches. And the fabric, and Fibre Channel, has a lot of intelligence. It has something called the name server, which is the repository of all the information. It's implemented in the fabric as a redundant, distributed service, so there's no single point of failure. And every single device that logs into the fabric is
Starting point is 00:05:30 registered in the name server. So Fibre Channel typically uses an unacknowledged datagram class of service. This is known as Class 3. It's defined as a reliable datagram, meaning it won't be dropped; it won't be dropped for congestion reasons or anything like that. If you get a bit error, frames can get dropped.
Starting point is 00:06:00 Stuff happens, but it won't be dropped as a matter of the protocol. And within the Fibre Channel data transfer, there are three fundamental constructs. There are frames, which are packets of data; sequences, which are sets of frames collected together; and exchanges, which, depending on the protocol, are many times associated with a command and response tied together. So for a frame, each unit carries up to 2,112 bytes of payload, and it consists of a Fibre Channel header, the payload, and a CRC. And then multiple sequences can be bundled together.
Starting point is 00:06:51 I'm sorry, multiple frames can be bundled together into a sequence, and this allows you to transfer large amounts of data, megabytes, gigabytes, what have you. And then an interaction between two Fibre Channel endpoints is bundled into something called an exchange. And for protocols like SCSI and FC-NVMe, an exchange is mapped to a single command and response. So the important thing about exchanges is that in Fibre Channel, the frames within an individual exchange are guaranteed to be delivered in order. Exchanges themselves, individual exchanges, may be delivered out of order in the fabric, so that the fabric can take advantage of any optimizations in paths that may exist between switches. So what that means is that different commands can take different paths through the fabric, and so they can be delivered in a different order than they were sent.
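To make that hierarchy concrete, here is a minimal C sketch of frames, sequences, and exchanges as just described. The struct layouts are illustrative simplifications rather than the exact FC-FS wire format; the 2,112-byte maximum payload and the one-command-per-exchange idea come from the talk, and the 24-byte header size is the standard Fibre Channel frame header size, added here for context.

```c
#include <stdint.h>
#include <stddef.h>

#define FC_MAX_PAYLOAD 2112              /* maximum data field per frame, in bytes */

/* One frame: header + payload + CRC (header fields collapsed into a blob here). */
struct fc_frame {
    uint8_t  header[24];                 /* 24-byte Fibre Channel frame header */
    uint8_t  payload[FC_MAX_PAYLOAD];    /* up to 2,112 bytes of upper-layer data */
    uint32_t crc;                        /* frame CRC */
};

/* A sequence is a set of frames; frames within it are delivered in order. */
struct fc_sequence {
    uint8_t          seq_id;
    size_t           frame_count;
    struct fc_frame *frames;
};

/* An exchange ties an interaction together; for SCSI and FC-NVMe it maps to
 * one command/response.  Different exchanges may be delivered out of order,
 * but the frames inside any one exchange are ordered. */
struct fc_exchange {
    uint16_t            ox_id;           /* originator exchange ID */
    uint16_t            rx_id;           /* responder exchange ID  */
    size_t              sequence_count;
    struct fc_sequence *sequences;
};
```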
Starting point is 00:07:43 And as I mentioned before, the other thing that Fibre Channel has is a discovery layer, which is handled by the name server. The name server contains information such as the worldwide names of all devices, the port
Starting point is 00:08:12 IDs that they exist on, what type of device they are, and so on and so forth. The fabric also provides a service called zoning, which allows ports to be separated from each other. It's a security method. It's also a data integrity method, so that you don't have devices or hosts messing with storage that you don't want them to touch. And zoning is implemented in each switch in the fabric in a distributed fashion as well, so it's also highly available. So if you look at the layers that exist in Fibre Channel, at the lowest level you have FC-0, which is the physical layer, the bits and photons.
Starting point is 00:09:09 And then you have the FC-1 layer, which is the encoding. In any high-speed network, if anybody's looked at how networks work these days, because we're pushing so much data through copper that sometimes can't really handle it, there's a lot of encoding that goes into it for error recovery and correcting bit errors. Then above the encoding layer you have the framing layer, FC-2, and then you have the services.
Starting point is 00:09:40 FC-3 is the services, which is the name server and the zoning server. And then FC-4 is the upper layer, which is the protocols, which would be SCSI, FC-NVMe, FICON, what have you. So there's one term that keeps on getting brought up in Fibre Channel discussions, and that's FCP. And so what is FCP? FCP was traditionally, or historically, defined to carry one storage protocol, and that was SCSI.
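As a quick, illustrative summary of that layer stack (paraphrased from the talk, not spec text), the levels and what sits in them can be written down like this; the FC-4 comment also previews the FCP point picked up next.

```c
/* Illustrative summary of the Fibre Channel layering described above. */
enum fc_layer {
    FC_0,    /* physical layer: the bits and photons                        */
    FC_1,    /* encoding layer: error recovery and bit-error correction     */
    FC_2,    /* framing layer: frames, sequences, exchanges                 */
    FC_3,    /* common services: name server, zoning server                 */
    FC_4     /* upper-layer protocols: SCSI over FCP, FICON, and FC-NVMe,
                which also rides on FCP                                     */
};
```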
Starting point is 00:10:21 And since that time, it's been adapted to carry other storage protocols. So FCP really has evolved into a data transfer protocol, which can carry SCSI, can carry FICON, and now we've been using it for FC-NVMe as well. And the reason that we do that is because the fabric and the HBAs have optimized paths for FCP. So it allows us to take advantage of existing optimizations. So, on to a quick NVMe refresher. NVMe stands for Non-Volatile Memory Express. It began as a PCIe-attached storage protocol.
Starting point is 00:11:25 Many of you in this room probably already have it, if you have one of these laptops; the newest Apples have NVMe drives in them, and I know a lot of other laptops do now. And about two or three years ago, there was a project within the NVMe group to define a fabric method of sending the data. And so NVMe over Fabrics was born. And NVMe over Fabrics itself is a generic set of tools to extend NVMe over a fabric.
Starting point is 00:11:58 The initial fabrics that were defined are the RDMA-based fabric protocols and Fibre Channel. So, some basics on NVMe. There are different layers. You have, of course, drivers. And, you know, for inbox NVMe devices, you have one set of drivers, and of course the fabric NVMe devices will require new drivers, which the group is working on right now; a lot of them have been pushed into the upstream Linux kernel and are also being ported to other operating systems.
Starting point is 00:12:44 The next layer is the subsystem. And a subsystem is really what's contained in a storage device. It contains the controllers, the media, the namespaces, and the interfaces. The controller is, of course, what you would expect it to be. It's the actual entity that executes the commands, returns the responses, and manages the storage. And within the purview of a controller are the namespaces, which are the actual storage extents, the actual disk, equivalent to a disk image or a disk.
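A minimal C sketch of that containment hierarchy, subsystem, controller, and namespace, plus the queue pairs that come up next, might look like the following. The field names and layouts are illustrative, not the NVMe specification's data structures; the 64-byte SQE size is from the talk, and the 16-byte CQE size is the size defined by the NVMe base specification.

```c
#include <stdint.h>
#include <stddef.h>

struct nvme_sqe { uint8_t bytes[64]; };   /* submission queue entry: a command  */
struct nvme_cqe { uint8_t bytes[16]; };   /* completion queue entry: a response */

/* A queue pair: a submission queue and its associated completion queue. */
struct nvme_queue_pair {
    uint16_t         qid;                 /* queue ID (0 is the admin queue)    */
    uint32_t         depth;               /* entries per queue                  */
    struct nvme_sqe *sq;
    struct nvme_cqe *cq;
};

/* A namespace is a storage extent: roughly, "a disk" behind the controller. */
struct nvme_namespace {
    uint32_t nsid;
    uint64_t capacity_blocks;
};

/* A controller executes commands, returns responses, and manages the media. */
struct nvme_controller {
    uint16_t                cntlid;
    size_t                  nr_namespaces;
    struct nvme_namespace  *namespaces;
    size_t                  nr_queue_pairs;
    struct nvme_queue_pair *queue_pairs;
};

/* A subsystem is what the storage device presents: controllers plus media. */
struct nvme_subsystem {
    char                    nqn[256];     /* NVMe Qualified Name (size illustrative) */
    size_t                  nr_controllers;
    struct nvme_controller *controllers;
};
```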
Starting point is 00:13:39 And the one thing that NVMe has over other storage protocols is very deep queuing and a large set of possible queues that can be associated with any particular controller. And so the important thing about NVMe is maintaining this queuing model, so that you have large numbers of queues associated with any particular controller. So if you look at a taxonomy of the transports, NVMe itself was defined as a memory-based model. Now, of course, even with RDMA,
Starting point is 00:14:27 you can't use memory-based models in a fabric, because the data still has to be moved over the fabric. So fabrics evolved into a message-based model. And Fibre Channel maps the NVMe data, of course, onto Fibre Channel frames. And as I mentioned, the queue pairs are important in NVMe. So in order to port NVMe onto a fabric, you have to have a method for porting the queue pairs
Starting point is 00:15:05 across the fabric. And so the NVMe over Fabrics definition has a set of commands and a set of methods for maintaining the queues across the fabric, which were before in local memory and are now spread across a network. So for FC-NVMe, in this section we're going to look at a high-level understanding of how it works and understand how FCP can be used to map NVMe onto Fibre Channel. So when we were designing FC-NVMe, we had some goals.
Starting point is 00:15:53 First off, we wanted to comply with the NVMe over Fabrics spec. And of course, we wanted to maintain high performance and low latency. NVMe, of course, is a low-latency protocol, and so we wanted to maintain that. So in order to do this, we also wanted to use existing HBA and switch hardware. We didn't want to require ASICs to be respun to implement FC-NVMe.
Starting point is 00:16:23 And we wanted to fit into the existing Fibre Channel infrastructure, the management, the name server, and all those other things that exist for other Fibre Channel protocols. We wanted to be able to pass the NVMe commands with little or no interaction from the Fibre Channel layer. And of course, the name server, zoning, and management come with it. So, high performance and low latency.
Starting point is 00:17:02 In order to maintain parity with existing protocols, and improve on existing protocols too, we need to use the same tools. So we wanted to keep the same hardware acceleration that exists, say, for SCSI. And Fibre Channel does not have an RDMA protocol, so we use FCP as the data transfer protocol. Currently, both SCSI and FICON use FCP. And FCP is deployed in many, if not all,
Starting point is 00:17:39 implementations as a hardware-accelerated layer. So to map to FCP, we have to map the NVMe command, response, and data onto Fibre Channel FCP frames. And an NVMe I/O operation is mapped directly onto a Fibre Channel exchange. So that means that a single read operation would be one exchange. So, for example, if you take an SQE, which stands for submission queue entry: a submission queue entry in NVMe is basically a command. It's 64 bytes long, and it's the entry that's put on the queue
Starting point is 00:18:27 that tells the controller what to do. So the first step is to map one of these SQEs into an FCP command IU. And then if that command results in any data transfer, of course, the data has to be mapped into FCP data IUs. And the data IU portion is what is accelerated by hardware. The hardware engines will automatically transfer the data across the Fibre Channel network and place it into memory, so that there's no software handling of the data while it's in flight.
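Pulling that mapping together, here is a hedged C sketch of one FC-NVMe I/O carried as a single Fibre Channel exchange: the SQE goes in an FCP command IU, any data goes in FCP data IUs, and the CQE (described next) comes back in an FCP response IU. The struct layouts are illustrative only, not the wire format defined by the FC-NVMe standard.

```c
#include <stdint.h>
#include <stddef.h>

struct nvme_sqe { uint8_t bytes[64]; };   /* 64-byte submission queue entry      */
struct nvme_cqe { uint8_t bytes[16]; };   /* completion queue entry              */

struct fcp_cmd_iu {                       /* carries the SQE to the target       */
    struct nvme_sqe sqe;
};

struct fcp_data_iu {                      /* data IUs: the part the HBA hardware */
    size_t   len;                         /* moves with zero copy, no software   */
    uint8_t *payload;                     /* touching the data in flight         */
};

struct fcp_rsp_iu {                       /* carries the CQE back                */
    struct nvme_cqe cqe;
};

/* One NVMe I/O == one Fibre Channel exchange: a command IU, zero or more
 * data IUs, and a response IU. */
struct fc_nvme_io {
    uint16_t            ox_id;            /* exchange identifier                 */
    struct fcp_cmd_iu   cmd;
    size_t              nr_data_ius;
    struct fcp_data_iu *data;
    struct fcp_rsp_iu   rsp;
};
```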
Starting point is 00:19:10 And then, of course, the response, which is the CQE, which stands for completion queue entry, is mapped into the FCP response IU. And then the transactions for a particular I/O are bundled into an exchange. So in this example, you have a read operation, which is a single exchange, and a write operation, which is a single exchange. And one thing I keep on hearing about FCP is, well, how do you do zero copy? RDMA was designed to allow network protocols to do zero-copy implementations easily. The fact is that FCP has been doing that for 20 years. It just wasn't called RDMA.
Starting point is 00:20:09 FCP, in current implementations going back 20 years, has been doing zero copy. So you don't need to have RDMA in order to do zero copy. RDMA is a set of tools, a set of semantics, that makes it easier to do zero copy, but it's not required. And so really the difference is the APIs. So for RDMA, you have a defined set of APIs that enforce, or make it easier to do, zero copy, whereas with FCP, it's the implementation which does it. So for example, if you look at the transactions that take place on the wire,
Starting point is 00:20:47 the FCP transactions, in the ladder diagram here, we have a read and a write operation taking place at the bottom. And if you look at the same operations done with RDMA, you have basically the same flow of commands. No matter what you're doing on the wire, you still have to send a command and get the data, or send the command and wait for the other side to say, I'm ready for the data. So it's basically the same operation at the lower level.
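As a plain-C rendering of that ladder comparison, the two FCP flows can be written out like this, assuming the usual FCP information unit names (FCP_CMND, FCP_XFER_RDY, FCP_DATA, FCP_RSP). The point is only that the command/data/response pattern on the wire is the same whether or not the transport calls itself RDMA.

```c
#include <stdio.h>

/* The read ladder: command out, data back, completion back. */
static void fcp_read_ladder(void)
{
    puts("initiator -> target   : FCP_CMND     (read command)");
    puts("target    -> initiator: FCP_DATA     (one or more data IUs)");
    puts("target    -> initiator: FCP_RSP      (completion)");
}

/* The write ladder: command out, target says it is ready, data out, completion back. */
static void fcp_write_ladder(void)
{
    puts("initiator -> target   : FCP_CMND     (write command)");
    puts("target    -> initiator: FCP_XFER_RDY (ready for data)");
    puts("initiator -> target   : FCP_DATA     (one or more data IUs)");
    puts("target    -> initiator: FCP_RSP      (completion)");
}

int main(void)
{
    fcp_read_ladder();
    putchar('\n');
    fcp_write_ladder();
    return 0;
}
```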
Starting point is 00:21:48 Of course, the other thing that is important is to maintain the discovery mechanisms with FC-NVMe. And in order to facilitate that, we use pieces from both layers. We use the Fibre Channel name server to do the discovery of the ports. So we can go out and say, what are the NVMe ports out here? And once we find them, then we can use the NVMe discovery controller, which knows the details about the subsystems that exist in the NVMe devices. This allows each component to manage the data that it has the most knowledge about. So for example, for an FC-NVMe initiator to connect, the first thing it will do is go to the name server, the Fibre Channel name server, and it will identify where the discovery controllers are.
Starting point is 00:22:46 And then it will talk to the discovery controller, which identifies the subsystems that it can talk to, and once it has that, it can start talking directly to the storage devices. Is that for every exchange, or just for the... That happens during initialization. That would happen once for each initiator-target pair. And then once you have that, you don't have to do it again unless something changes in the fabric. If something changes in the fabric, then you'll get a notification and you have to do it again.
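Here is a hypothetical, stubbed C sketch of that initialization flow. None of these functions are real driver or library APIs, and the port ID and NQN values are made up; it simply walks the three steps described above: ask the Fibre Channel name server for NVMe-capable ports, ask each discovery controller which subsystems it offers, then connect to those subsystems directly.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct fc_port { uint32_t port_id; };

/* Step 1: name server lookup (stubbed: pretend one NVMe-capable port exists). */
static size_t nameserver_find_nvme_ports(struct fc_port *out, size_t max)
{
    if (max == 0)
        return 0;
    out[0].port_id = 0x010200;             /* illustrative 24-bit FC port ID */
    return 1;
}

/* Step 2: discovery controller query (stubbed: one subsystem NQN). */
static size_t discovery_list_subsystems(const struct fc_port *port,
                                        const char **out, size_t max)
{
    (void)port;
    if (max == 0)
        return 0;
    out[0] = "nqn.2017-01.example:subsystem1";    /* illustrative NQN */
    return 1;
}

/* Step 3: connect to the subsystem and start issuing I/O (stubbed). */
static int nvme_connect_subsystem(const struct fc_port *port, const char *nqn)
{
    printf("connecting to %s via FC port 0x%06x\n",
           nqn, (unsigned)port->port_id);
    return 0;
}

/* Done once per initiator/target pair, and again after a fabric change
 * notification, as noted above. */
int main(void)
{
    struct fc_port ports[8];
    const char *subsystems[8];

    size_t np = nameserver_find_nvme_ports(ports, 8);
    for (size_t i = 0; i < np; i++) {
        size_t ns = discovery_list_subsystems(&ports[i], subsystems, 8);
        for (size_t j = 0; j < ns; j++)
            nvme_connect_subsystem(&ports[i], subsystems[j]);
    }
    return 0;
}
```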
Starting point is 00:23:11 And of course, the zoning and the management server and the other service mechanisms continue to work with FC-NVMe. So why do you want to use it? I think the key thing about Fibre Channel is that it's a dedicated storage network. It's not sharing resources with anybody else.
Starting point is 00:23:46 It's not sharing administrators with anybody else. It's a dedicated storage network. You can also run NVMe and SCSI side by side on the same wire. A lot of the implementations that exist right now can run both protocols simultaneously on the same port, from an HBA into the switch. Fibre Channel's been around for a while. All the testing that has gone into making Fibre Channel
Starting point is 00:24:26 an enterprise storage solution over the last 20 years, 25 years, 30 years, continues to be there for FC-NVMe, because we're keeping the same data transfer layer, we're keeping the same fabric, and we're touching as little as we can in the path. Question? When you say NVMe and SCSI, not only the side-by-side, you mean the Fibre Channel?
Starting point is 00:24:53 The Fibre Channel is more common than the side-by-side. So here's where… You have a Fibre Channel card, and you're running NVMe on top of FCP. That's right, yeah. Okay, so when you say it's SCSI side-by-side, you mean I can run NVMe on FCP and Fibre Channel on the same card? Well, Fibre Channel is not SCSI.
Starting point is 00:25:20 Fibre Channel is a protocol. Fibre Channel is Fibre Channel. Yeah, if you're taking the traditional definition of FCP, which is that it's SCSI, you're right. So what I'm saying is you can run NVMe, which is running over FCP, side by side with SCSI, which is also running over FCP, on the same wire. Does that make sense? Okay. And then, of course, the built-in zoning security for Fibre Channel remains in place.
Starting point is 00:25:55 Of course, that picture is not a good example of security. And there has been a lot of qualification of Fibre Channel devices, and the idea for deployment is that while a change is required, it's not a hardware change, it's a firmware change. So the change is adding the queues to the HBA and stuff like that? It's a bit more complicated than that. It is a different protocol, so it does require a change to the firmware
Starting point is 00:26:34 to understand the commands and responses. The queues, yes. And it depends on the implementation. A lot of times the queues are stored in the host. Sometimes they may be in the controller or adapter itself. It really depends on the implementation. Okay, so you can tell this slide was written by J, who's the marketing guy. So, FC-NVMe: wicked fast.
Starting point is 00:27:12 It builds on top of Fibre Channel, which is fast, too. We're looking at 64-gigabit Fibre Channel right now. Currently, 32-gigabit Fibre Channel is in the field. So Fibre Channel remains one of the fastest storage networks that you can deploy. It builds on 20 years of storage network experience. It can be run side by side with existing SCSI-based Fibre Channel storage. And actually, you can run it side by side with FICON too, because they're all layers on top of the existing protocols.
Starting point is 00:27:53 It inherits the benefits of the devices that exist out there. So, for more info, you can go to fibrechannel.org. My email is up there if you want to talk to me. You can talk to me in the hallway. Any questions? Can you describe why you need to be compatible with NVMe over Fabrics for this? Because we want to work with the existing infrastructure. People are designing devices, in particular large all-flash arrays, which do NVMe over Fabrics, and the interface then becomes possibly a pluggable module; you
Starting point is 00:28:53 can put RDMA in there, you can put Fibre Channel in there, but the protocol that the device is talking is NVMe over Fabrics, so we want to maintain that compatibility with existing and emerging devices. Does that make sense? Any questions? A significant number of organizations, maybe 50% or better of the Fortune 100, are running FICON and SCSI over FCP on the same directors. Would the fact that we already have two protocols running on top of FCP preclude adding a third, or not?
Starting point is 00:29:39 No. With compatible storage and an update to the HBAs, you can run NVMe side by side with SCSI, and FICON if you have it, in a data center. And considering that those directors are on the ground in so many organizations, whether they're running FICON or just SCSI or some combination, is this a protocol update, and is this a code update to that hardware? So when you talk about the directors, are you talking about the switches themselves, the fabric itself?
Starting point is 00:30:17 Well, let's say we have Cisco 9509s, and we're running, let's say, just SCSI. How do we upgrade that? So for a switch, there really is no change. It's data, it's frames. Now, some switches do have additional layers that do some traffic analysis and things like that. You may have to update those to get the new functionality, but in order to pass the data, to send the data,
Starting point is 00:30:48 the switch doesn't have to change at all. So the directors don't have to change at all in order to send NVMe. It's really a heavier lift on the HBA and storage side to make that work. And another question I have is, now we have the appearance of companies like Vexata here in the Valley with super-fast NVMe storage arrays, very unique architectures. They're running 6 million IOPS with good workloads at 8K block sizes, like unbelievable performance. If we connect two Vexata-type arrays or equivalents over distance, what kind of performance, do you have some comparison?
Starting point is 00:31:30 Because today we're using Fibre Channel for that. Okay, so I did not bring any, this is meant to be a tutorial, so I didn't bring any specific vendor performance numbers. One thing I can say is, you know, you're not going to speed up the storage network performance by running a different protocol. The latency in the network is going to be the same no matter what you're running. What you gain is on the endpoints, and you gain lower latency in the NVMe stack. And NVMe has the possibility of much deeper queues. Deeper queues. You can have up to 64,000 queues, each with up to 64,000 entries.
Starting point is 00:32:17 And that's where, in some cases, we're seeing some big performance improvements, because you're able to use the aggregate bandwidth of the fabric much more effectively. So you're picking up parallelism? Exactly, yeah. And do you have any benchmarks or anything? I don't have any I can show you right now, sorry.
Starting point is 00:32:36 If you want to talk to me offline, I can provide you with some pointers, and that includes this slide deck. Thank you. Any other questions? Yes. So I guess what's not making sense in my head is, RDMA pretty much reverses the role of the initiator and target in the data transfer.
Starting point is 00:33:14 Because when you're using SCSI, you issue a transfer ready, and then the data gets written out. But in RDMA, it's a read from the target back to the queue. Doesn't that, I don't know, it seems like it's... So I'm not sure what your question is. Well, doesn't that conflict with what's existing in FCP? It seems to me that... Isn't that a difficult thing to do without changing your lower layers? Well, we're not trying to work like RDMA,
Starting point is 00:33:39 but if you look at the lower layers, yes, the semantics of setting something up, where you have to set up your memory regions, are very different. But once the data starts going across the wire, it looks the same. So it's just a semantic change, and then basically the frames look the same? Yeah, yeah. Of course, between different protocols you obviously have different frames, but the data transactions look the same on the wire.
Starting point is 00:34:07 Does that make sense to you? I'd have to look at it in more detail. Any other questions? All right, thank you. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Starting point is 00:34:41 Here you can ask questions and discuss this topic further with your peers in the Storage Developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
