Storage Developer Conference - #92: Fibre Channel – The Most Trusted Fabric Delivers NVMe

Episode Date: April 16, 2019

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 92. I'm going to be speaking on FC-NVMe today. A little bit about myself. My name is Craig Carlson. I'm with Marvell.
Starting point is 00:00:48 A bunch of other qualifications, which really mean that I have tested lots of airline seats. And if you have any questions about what airlines or planes have the best seats, you can come see me afterwards. I'm sure I can tell you. I also want to thank Jay Metz from Cisco for providing some of the material for this deck. So, the agenda for my talk: we start with a refresher on FC and on NVMe. I know some of you may already have some experience with NVMe and Fibre Channel, but I just wanted to bring everybody up to the same level.
Starting point is 00:01:22 Then we'll talk about FC-NVMe and do an FC-NVMe update, because FC-NVMe has been out for at least a year or so. Then we're going to talk about the next generation, FC-NVMe-2, and go on to some reasons why you might want to use FC-NVMe. So just some watermarks here. We're going to talk about how NVMe works, how fabrics work, how NVMe over Fabrics works,
Starting point is 00:01:55 give a review of FC-NVMe, and give an update on FC-NVMe-2. We're not going to do a real deep dive, and we're not going to be comprehensive and mention every single feature that's out there for these technologies. We're also not going to try to do a comparison between different fabric technologies. That's not my goal here.
Starting point is 00:02:14 Basics on Fibre Channel. For those of you who may not be familiar with it, Fibre Channel is a purpose-built network for storage. There is a physical connection between each host and its storage, in the sense that there's a one-to-one connection to the switches for each device. And there's also a logical connection between a host and storage, which can go from one host to multiple storage devices. Some of the design requirements for Fibre Channel provide that one-to-one connectivity. The transport and the services are on the same layer.
Starting point is 00:02:57 There's a well-defined role for each device. You have initiators and targets. It does not tolerate packet drop, meaning that the fabric does not intentionally drop packets if it gets into a congestion condition. And there's really only north-south traffic, meaning that there's really only traffic between the initiators and targets.
Starting point is 00:03:18 The targets don't usually talk to each other, and the initiators don't usually talk to each other. The network is designed for scale and availability. You can have multiple fabrics connected to one host, which means that if one fabric goes down, you have another fabric there to keep things running. Some of the elements in a Fibre Channel configuration: you have the initiator, the switch fabric, and the target.
Starting point is 00:03:50 And this may be familiar for those of you who are familiar with SCSI; these are some of the same names that old SCSI uses. In Fibre Channel, the adapter is called an HBA. This plays the role of a NIC in an Ethernet network, basically: it encapsulates the data into a Fibre Channel frame
Starting point is 00:04:16 and puts it on the network. For our purposes, that's NVMe, SCSI, FICON. So the switch is really the heavy lifter in a Fibre Channel network. The switch, or the fabric, manages basically the entire configuration. The biggest component of the switch is the name server,
Starting point is 00:04:45 which contains all the information about what exists in the fabric. The name server is implemented as a redundant database, which means that each switch has a copy of all the data, so that if one switch goes down, the name server can continue to run. There's no single point of failure. Essentially, the name server knows about everything that's going on in the fabric. So for the actual data transfer, Fibre Channel typically uses an unacknowledged datagram service. Now, there are other classes of service which are not used as often, but typically what
Starting point is 00:05:21 is used is a service known as Class 3. And Class 3 is a datagram service, which means it's connectionless. It also has the property that the fabric will not drop it if there's a congestion situation. With some other protocols, you could have a drop if there's congestion. Usually it doesn't happen, but you still theoretically could have a drop. You won't have a drop of a Fibre Channel frame unless there's some unrecoverable error such as a bit error or a frame corruption or something like that. There are three fundamental constructs. There's the frame, which is the packet. That's where you send the data. There are sequences, which are collections of frames, which allow you to send more than a single frame's worth of data.
Starting point is 00:06:12 And there are exchanges, which are groupings of sequences into a logical entity. And in most protocols, an exchange equates to a command and its response bundled into one exchange. So if you look at a frame, a single frame can carry up to 2112 bytes of data, and each frame consists of a header, which has the routing information, the payload, of course, which is the data, and the CRC. And of course, there are other headers and things that you can put on there
Starting point is 00:06:45 if security is turned on or if you have a virtual VMID type of fabric. There are other headers that can be put in there to accommodate those. So, as I mentioned, multiple frames can be bundled into a sequence. So a sequence can be used to transfer large amounts of data up to megabytes. It's really up to what the use is for a particular protocol.
Starting point is 00:07:14 And then the sequences are grouped into exchanges. An exchange is a set of sequences which form a particular job. And in most protocols, as I mentioned, the exchange is a single command and response.
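To recap the three constructs just described before moving on to discovery, here is a minimal sketch in Python. The class names and the size check are purely illustrative, not taken from any Fibre Channel library; they just show how frames nest into sequences and sequences into exchanges.

```python
from dataclasses import dataclass, field
from typing import List

MAX_FC_PAYLOAD = 2112  # maximum data carried by a single frame, in bytes

@dataclass
class Frame:
    header: bytes   # routing information (source/destination IDs, etc.)
    payload: bytes  # up to MAX_FC_PAYLOAD bytes of data
    crc: int        # integrity check over the frame

    def __post_init__(self):
        if len(self.payload) > MAX_FC_PAYLOAD:
            raise ValueError("payload exceeds single-frame limit")

@dataclass
class Sequence:
    # A larger transfer is split across multiple frames.
    frames: List[Frame] = field(default_factory=list)

@dataclass
class Exchange:
    # In most protocols: one exchange per command/response (e.g. command, data, response).
    sequences: List[Sequence] = field(default_factory=list)
```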
Starting point is 00:07:53 The other big part of, of course, any network is discovery. Fibre Channel has a name server which lives in the switch, and the name server basically collects all the information on each device that logs in. And it automatically collects World Wide Names. Each device has two World Wide Names: a World Wide Name for the individual port, and a World Wide Name for the node, which may be a collection of ports. The World Wide Name is a unique number. Usually it's an assigned number that's assigned to the device
Starting point is 00:08:09 that uniquely identifies the device. The other thing that the fabric does is provide and enforce zoning. Zoning is there to provide security. You can separate ports and decide which ports can talk to which ones. And zoning can be dynamic: you can change zoning at night when you do backups
Starting point is 00:08:36 and then put it back to the normal running configuration in the morning. Zoning is similar to ACLs in Ethernet, the difference being that in Fibre Channel there's a central point of authority. The fabric maintains it, each switch has a copy. This is the name server.
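As a rough illustration of that access control, the sketch below models zones as named sets of port World Wide Names and checks whether two ports are allowed to talk. The data layout is hypothetical and much simpler than the standardized zoning format discussed next.

```python
# Hypothetical, simplified zoning model: a zone is just a named set of port WWNs.
zones = {
    "backup_zone": {"10:00:00:00:c9:12:34:56", "21:00:00:24:ff:aa:bb:cc"},
    "prod_zone":   {"10:00:00:00:c9:12:34:56", "21:00:00:24:ff:11:22:33"},
}

def can_communicate(wwn_a: str, wwn_b: str) -> bool:
    """Two ports may talk only if at least one zone contains both of them."""
    return any(wwn_a in members and wwn_b in members for members in zones.values())

# Example: the same check before and after swapping in a nighttime backup zone set.
print(can_communicate("10:00:00:00:c9:12:34:56", "21:00:00:24:ff:aa:bb:cc"))  # True
print(can_communicate("21:00:00:24:ff:aa:bb:cc", "21:00:00:24:ff:11:22:33"))  # False
```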
Starting point is 00:08:52 Each switch in the fabric has a copy of the zoning information, and it's distributed across the entire fabric. So if one switch goes down, there's no loss of that information. And the other thing that's a little bit different from ACLs: if you're familiar with ACLs, they're not necessarily standardized. In Fibre Channel zoning, the format is actually standardized
Starting point is 00:09:14 in the Fibre Channel standards. And Fibre Channel, if you're familiar with networking, there are the OSI layers; Fibre Channel has something similar. We have the lowest layer, which is FC-0, which is the physical interface, which is what pushes the bits onto the wire or the optics. We have the FC-1 layer, which encodes those bits. And depending on the speed you're going, some of that encoding can start getting pretty complicated
Starting point is 00:09:42 in order to get through the noise that may be out there in the real world. FC-2 is the framing, and flow control happens at that layer. FC-3 is what we call common services, which is what I mentioned: the name server, the zoning database, and some other management server types of things. And FC-4 is the protocol layer, and that's where the protocols such as FC-NVMe, SCSI, and FICON are defined.
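As a quick reference, the layer roles just described can be collected into a small summary, expressed here as a Python dictionary purely for compactness.

```python
# Summary of the Fibre Channel levels described above.
fc_levels = {
    "FC-0": "physical interface: pushes the bits onto the wire or optics",
    "FC-1": "encoding of those bits (more complex at higher speeds)",
    "FC-2": "framing and flow control",
    "FC-3": "common services: name server, zoning database, management services",
    "FC-4": "protocol mappings: FC-NVMe, SCSI (FCP), FICON",
}

for level, role in fc_levels.items():
    print(f"{level}: {role}")
```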
Starting point is 00:10:23 So one term that I'll be using a bit in the talk is FCP. And so, what is FCP? Some of you who have been familiar with Fibre Channel for a long time may have heard it associated with SCSI. FCP is SCSI, and that's what it was originally designed to transport. But later on, other protocols have borrowed the FCP data engine to transport their data. And the reason for that is that it's an existing transport engine that exists in existing hardware and software, and it allows you to reuse some of the high-performance paths that exist in the hardware.
Starting point is 00:10:57 So, a little bit on NVMe. People are probably all familiar with NVMe in this room, but I'll kind of go through a quick deck here on that. So, you know, NVMe started as an industry standard for PCIe, basically designed to attach to PCI Express. And, of course, it was designed for SSDs, meaning you want low latency and high IOPS. So the newer addition, now of course it's probably two years old,
Starting point is 00:11:37 so it's not as new, but the newer addition is NVMe over Fabrics. And the goal of that is to build that NVMe infrastructure out over a fabric. And the fabrics can be anything; there's a whole range of different fabrics. The initial fabrics were RDMA and Fibre Channel, and recently there's now a TCP binding as well. So, some basics. The NVMe system consists of certain components.
Starting point is 00:12:10 Of course, you have drivers. And for FC-NVMe, the drivers, of course, will be provided by the vendors and put into the Linux stack, as happens with other technologies. The other component in NVMe is the subsystem, and this is really the storage device. A subsystem contains the controller, the media, namespaces, and interfaces. And the next one, the controller, is the device that allows you to access the subsystem. There's an ID that's associated with the controller,
Starting point is 00:12:52 and this is a unique ID that allows you to identify it either on the fabric or locally. And there's a concept of namespaces, and a namespace is a set of blocks within the storage device. It's similar to LUNs, if you're familiar with SCSI. It's not exactly the same, but it does have similar properties.
Starting point is 00:13:23 And one NVMe subsystem may have multiple namespaces. Now, one of the most important architectural constructs of NVMe is the queues. And the reason they're important is that NVMe is designed to have a large number of queues, up to 64,000 queues with 64,000 entries each. And this allows you to take good advantage of aggregate bandwidth with multiple devices.
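To put that queue scaling in perspective, here is a quick back-of-the-envelope calculation using the round numbers from the talk; the exact limits in the NVMe specification differ slightly, so treat this as an order-of-magnitude illustration.

```python
# Rough upper bound on outstanding commands, using the round numbers above.
queues = 64 * 1024             # ~64K I/O queues
entries_per_queue = 64 * 1024  # ~64K entries per queue

print(f"~{queues * entries_per_queue:,} commands can be outstanding at once")
# -> ~4,294,967,296. For comparison, legacy AHCI/SATA offers a single queue
#    of 32 commands, which is why deep queues matter so much for SSDs.
```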
Starting point is 00:14:07 And here's a little bit of an overview of how these things all fit together. On the left, you have PCI Express, which is really a memory interface. In the middle, you have Fibre Channel, which is a message-passing interface. And on the right, you have RDMA, which is kind of a combination of both, because RDMA, of course, presents a memory interface across the network. TCP, I believe, would probably fit in the middle one; it would probably fit in the message interface. And, of course, in order to make
Starting point is 00:14:37 NVMe over Fabrics work, you have to extend the queues over the network. And since the queues are a very important component of NVMe, devices pretty much have the same view of the queues across the network or across the fabric as they would if it was a locally attached bus. So, a little bit of a primer on FC-NVMe.
Starting point is 00:15:01 So basically what we're going to talk about is how it works, an update on FC-NVMe now that it's been out for a year, and then an update on what we're doing for the next generation. So, our goals when we designed FC-NVMe: first of all, it had to comply with the NVMe over Fabrics specification.
Starting point is 00:15:27 Of course, we also wanted to have high performance and low latency. And we wanted to reuse existing hardware; we didn't want to have to require new hardware to be designed and built to make it work. Of course, it has to fit into the existing infrastructure of Fibre Channel
Starting point is 00:15:42 with little or no changes to that. And we also wanted to make sure that the Fibre Channel layer didn't have to touch the NVMe frames. Fitting into the existing infrastructure means that the existing service layers, such as the name server, zoning, and management, are still in place. And the goal of high performance and low latency means that we want to use existing hardware acceleration.
Starting point is 00:16:09 Fibre Channel traditionally does not have an RDMA interface, so we are using FCP, as I mentioned, to do the data transfer. And FCP, previously, before FC-NVMe, was being used by SCSI and FICON for data transfers. And the reason we want to use it is because many of the existing HBAs have FCP as a hardware-accelerated option. And like FC, FCP itself is a connectionless protocol. There may be connections in the
Starting point is 00:16:47 protocol itself, but FCP is a connectionless protocol; it's really a data transfer protocol. So we need, then, to map the NVMe constructs into FCP. The NVMe command and response capsules, as they're called in the Fabrics definition, are mapped into FCP information units. You'll see some of those in the upcoming slides. And then the NVMe I/O operation, which means the
Starting point is 00:17:17 command and the response, is mapped into a Fibre Channel exchange. So that's the idea of how a command is mapped into an exchange. So we have the different components from NVMe: the SQE, which is the submission queue entry, which is actually the command, is mapped into the FCP Command IU. Data is mapped into the FCP Data IU. Since we don't have RDMA, we're not doing memory access; we're mapping into messages that are sent across the wire.
Starting point is 00:17:52 And then the CQE, which is the NVMe response, is mapped into the FCP Response IU. And if anybody is familiar with the SCSI mapping, these are not the same, but similar constructs exist on the SCSI side. Some of the fields have been changed, of course, but the constructs are similar.
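As a compact summary of the mapping just described (the left-hand names are NVMe constructs, the right-hand names are the FCP information units mentioned above; the wording is descriptive, not the literal structure names from the standard):

```python
# Descriptive summary of how FC-NVMe reuses the FCP data engine.
nvme_to_fcp = {
    "SQE (submission queue entry, i.e. the command capsule)": "FCP Command IU",
    "Data transfer (message-based, no RDMA)":                 "FCP Data IU",
    "CQE (completion queue entry, i.e. the response capsule)": "FCP Response IU",
}

for nvme_construct, fcp_iu in nvme_to_fcp.items():
    print(f"{nvme_construct:<55} -> {fcp_iu}")
```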
Starting point is 00:18:14 And then, of course, I/O operations are bundled into an exchange. So for a write, the command, the data, and the response are bundled together. The same thing with a read: the read command, data, and response are bundled into their own exchange. And one thing I'd like to mention is that RDMA
Starting point is 00:18:37 is all about zero copy. Zero copy allows the data to be transferred from the network device to the user application with a minimum number of data copies in memory, which can be expensive. RDMA is a semantic for making zero copy easier, or making it more enforceable up front. You don't need to have RDMA to do zero copy. Most of the FC implementations have been doing zero copy
Starting point is 00:19:11 for 20 years before there was RDMA. Really, the difference between RDMA and what the FC devices are doing is the APIs. There didn't exist a ready-made API when FC was defined, but it still uses a zero-copy architecture. Of course, the other important component is discovery. And both Fibre Channel and NVMe
Starting point is 00:19:43 have their own discovery mechanisms. And so the approach that we took is we use basically a dual model. We use the Fibre Channel name server to do the things it's good at, which is knowing what the Fibre Channel ports are, and then we use the NVMe discovery server to do what it's good at, which is knowing which subsystems exist out there. So, just a quick example, you know, you have the name server,
Starting point is 00:20:10 you look at the name server first to see where the discovery controller is, and you look at the discovery controller to find out where the storage devices are, and then you can talk to the storage devices.
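A hedged sketch of that two-step discovery flow is shown below. The fabric object and its methods (query_name_server, query_discovery_controller, connect) are placeholders invented for illustration, not a real driver API.

```python
def discover_nvme_subsystems(fabric):
    """Two-step discovery: FC name server first, then the NVMe discovery controller."""
    # Step 1: the FC name server knows which ports exist in the fabric,
    # including the port(s) where an NVMe discovery controller can be reached.
    discovery_ports = fabric.query_name_server(fc4_type="NVMe")

    # Step 2: the NVMe discovery controller knows which subsystems exist
    # behind those ports.
    subsystems = []
    for port in discovery_ports:
        subsystems.extend(fabric.query_discovery_controller(port))

    # Step 3: connect to each advertised subsystem (zoning still applies).
    return [fabric.connect(subsystem) for subsystem in subsystems]
```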
Starting point is 00:20:29 And then, of course, FC-NVMe, as I mentioned before, works with zoning, the management server, and other Fibre Channel services. So, an update on FC-NVMe. The standard, and this is the first revision, was ratified in 2017. And at that time, there were already Linux drivers in development. I think the first Linux driver, and somebody can correct me if I'm wrong on this,
Starting point is 00:20:53 but the first Linux kernel that had the FC-NVMe driver was probably 4.13. And it is also part of the unified host target SPDK. We are using existing devices, existing HBA and switch hardware, to support FC-NVMe, and in most cases this is simply a firmware or driver upgrade. For switches, you don't actually usually have to do anything; it's just for the HBAs. For performance, I'm not going to try to show different vendors' performance here,
Starting point is 00:21:30 but we have seen performance numbers of up to 50% lower latency, depending on the configuration. Right now, there are Linux implementations available with the drivers in place. And as I will be mentioning in a bit, FC-NVMe-2 started development this last spring, and our focus in FC-NVMe-2 is enhanced error recovery. So, just a summary of where we're at today with FC-NVMe.
Starting point is 00:22:03 Switches are available, because switches don't really need to change; it's just another packet. There are new switches coming out that have additional data collection abilities, so that you can see some statistics on what's going on in an FC-NVMe configuration. But, you know, that's a plus.
Starting point is 00:22:24 You don't need that in order for it to work. HBAs: the host-side driver is available for download today. Target-side drivers are in development, and the firmware is available in sort of a trial mode. Many operating systems support it, and there is some engagement from VMware and Microsoft, and storage devices are expected to start
Starting point is 00:22:54 being released en masse in the second half of this year, which probably means now. So, FC-NVMe-2. As I mentioned, FC-NVMe-2's focus is on enhanced error recovery. And our goal is to allow errors to be detected at the transport layer before the protocol layer knows anything has gone wrong. So why are we doing this?
Starting point is 00:23:26 You might say, well, this is a reliable network, why would you need to do additional error recovery? Well, you know, stuff happens. Bit errors do happen. There's a theoretical bit error rate on any network line you may be using. Typically, the actual bit error rate is lower than that, but there is a theoretical rate, and so bit errors can happen, and depending on the speed,
Starting point is 00:23:49 they could theoretically happen often. There's also a possibility of software errors or hardware errors. What causes it? We always hear about the cosmic rays. I did a little bit of digging, and there's actually some research by IBM in the 90s that suggested that for every 256 megabytes of RAM,
Starting point is 00:24:11 you would see one cosmic-ray-induced error per month. Now, of course, our memories are bigger today and the chips are smaller: they have more memory on smaller geometries, so that number may not be correct anymore. There may be a smaller chance of the memory being hit, but if it is hit by a particle, then there's a larger possibility of something happening, because the actual physical geometry is smaller. You also have radiation from the local environment.
Starting point is 00:24:42 One thing that chip makers are always careful of is making sure that they source the correct materials. You don't want to have radioactive solder in your package, because you could then have radiation coming from your own chip and causing errors. So there is radiation in the environment that can cause errors. And then there's RF noise and power noise from local equipment, or non-local equipment, because I did hear a story recently. Somebody told me about a data center that was crashing frequently at the same time every day. It took them a couple of weeks to figure out that the reason it was crashing
Starting point is 00:25:16 was because at that time the local power company was switching its generators, and that generator switch would cause a low-frequency noise to be put out on the power lines, which was getting up into the electronics and causing some havoc. Software and hardware bugs, we're all familiar with those; I probably don't need to say more about that. And then there's the bit error rate, which is currently specified as 10^-12 to 10^-15, which at some of the current speeds means you could theoretically see multiple bit errors per hour.
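To see why that matters at today's link speeds, here is a quick worked example. The 32 Gb/s figure is just a representative line rate chosen for illustration; as noted next, real links usually do far better than the specified worst case.

```python
line_rate_bps = 32e9    # representative 32GFC-class line rate, in bits per second
worst_case_ber = 1e-12  # specified worst-case bit error rate
best_case_ber = 1e-15

bits_per_hour = line_rate_bps * 3600
print(f"worst case: ~{bits_per_hour * worst_case_ber:.0f} bit errors per hour")  # ~115
print(f"best case:  ~{bits_per_hour * best_case_ber:.2f} bit errors per hour")   # ~0.12
```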
Starting point is 00:25:51 You don't usually see that. Links are usually much better than that, but it is the theoretical amount. So with all these possible errors, how did this stuff work before? The lower link level does have some limited error recovery. A lot of the higher-speed links
Starting point is 00:26:17 have what's called forward error correction, which is like a really advanced parity check. And depending on the encoding and the speed, it can recover multiple-bit errors. The protocols themselves also have recovery mechanisms, and both NVMe and SCSI have their own recovery mechanisms. They're usually not real quick, but they will recover from errors. So our goal in FC-NVMe-2
Starting point is 00:26:50 is to basically not let the protocol layer see any of these bit errors. We want to detect and recover from them in a fast manner, which you can do in the transport layer and which is harder to do in the protocol layer. So, the basic error recovery:
Starting point is 00:27:12 Missing frames time out and are retransmitted. We've defined a new set of Fibre Channel basic link services to facilitate the recovery. Basically, a basic link service in Fibre Channel is a packet that does a basic service, such as setting up the sequences that I mentioned, or terminating them, and what have you. And then, of course, as I mentioned,
Starting point is 00:27:37 the protocol does not know anything happened. Good leading question. Would you repeat the question? He was asking if it's applicable to SCSI as well. I'll answer it in two slides. So, an example. And of course, one caveat for those of you who may know the standards: this is not true to the actual process. The recovery is very complicated, and if I were to go through that,
Starting point is 00:28:05 I could probably spend the entire time talking about the mechanism. But, of course, an example is, frame goes missing, we have a two-second timer, we know, in most cases, within two seconds that something has gone wrong. We've defined a new flush service,
Starting point is 00:28:20 which we can use to double-check whether or not the frame has really gone missing, because one of the worst things you can do in error recovery is send out a duplicate frame when the original frame didn't go missing. So we make sure that the frame has actually gone missing, and then once we go through that process, we know that we can retransmit.
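Below is a heavily simplified sketch of that sequence: a timer catches the missing frame, the new flush service confirms it is really gone, and only then is it retransmitted. The link object, its methods, and the polling loop are invented purely for illustration; the actual FC-NVMe-2 procedures are considerably more involved.

```python
import time

ERROR_DETECT_TIMEOUT = 2.0  # seconds; the transport-level timer from the talk

def recover_missing_frame(link, frame):
    """Simplified transport-level recovery: detect, confirm with a flush, retransmit."""
    deadline = time.monotonic() + ERROR_DETECT_TIMEOUT
    while time.monotonic() < deadline:
        if link.acknowledged(frame):
            return  # nothing was lost; never send a duplicate of a delivered frame
        time.sleep(0.01)

    # Timer expired: use the flush service to confirm the frame really went missing.
    if link.flush_confirms_missing(frame):
        link.retransmit(frame)
    # Either way, the NVMe protocol layer above never sees the error.
```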
Starting point is 00:28:45 And as I said, error recovery is a very complicated topic. There are lots of conditions that you have to consider, and each one of them may have a slightly different process. You can lose any one of these things. You can also lose the recovery frames; the recovery mechanism's frames themselves could get lost. So you have to have a scenario for all these different types of issues. And so the goal, of course, is to recover quickly, with recovery within a two-second time frame.
Starting point is 00:29:21 If you look at the protocol layers, they may take much longer. If you let SCSI or NVMe try to recover, it may take much longer. We think this is going to become more important as link speeds go up. Of course, as link speeds go up, there are more bits being sent, and there's more possibility within a given window of basically
Starting point is 00:29:38 getting an error. Here's the answer to that leading question about SCSI. We think this is going to be so good that we're also applying it to SCSI. We've opened a new project, FCP-5, within T10, to define this error recovery mechanism for SCSI links.
Starting point is 00:30:02 So why do we want to use FC-NVMe? It's based on a dedicated storage network. Fibre Channel is not carrying other types of traffic; it's not going to be congested by somebody watching a YouTube movie or what have you. It's a dedicated network for storage. You can run NVMe and SCSI side-by-side on the same port. You have a long-tested discovery service, which, in most cases, is automatically configured.
Starting point is 00:30:49 Zoning and security also apply to FC-NVMe. And we also have the years of integration into data centers that FC has, which now FC-NVMe can use. And of course, the last one, and this is the new one, is we think that with the enhanced error recovery, we're going to have an industry-leading error recovery mechanism, which occurs at the transport layer. So this slide was kind of written by a marketing person,
Starting point is 00:31:24 but... FC-NVMe: wicked fast. Well, you know. It builds on 20 years of storage experience. It can be run side-by-side with SCSI. It inherits the benefits of the discovery mechanism that Fibre Channel has, and it capitalizes on the qualification that's already taken place in data centers for Fibre Channel. And that's it.
Starting point is 00:31:52 That's my contact information. If you have any questions... anybody have any questions right now? All right, thank you. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference,
Starting point is 00:32:30 visit www.storagedeveloper.org.
