Storage Developer Conference - #92: Fibre Channel – The Most Trusted Fabric Delivers NVMe
Episode Date: April 16, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast Episode 92. I'm going to be speaking on FC-NVMe
today. A little bit about myself.
My name is Craig Carlson.
I'm with Marvell.
A bunch of other qualifications, which really mean that I have tested lots of airline seats.
And if you have any questions about what airlines or planes have the best seats,
you can come see me afterwards.
I'm sure I can tell you.
I also want to thank J Metz from Cisco for providing some of the material for this deck.
So the agenda for my talk, we start with a refresher on FC, on NVMe.
I know some of you may already have some experience with NVMe and Fiber Channel,
but I just wanted to bring everybody up to the same level.
And then we're going to talk about FC-NVMe
and do an FC-NVMe update,
because FC-NVMe has been out for at least a year or so.
Then we're going to talk about the next generation, FC-NVMe-2,
and go on to some reasons why you might want to use FC-NVMe. So just some watermarks here.
We're going to talk about how NVMe works,
how NVMe over Fabrics works,
then give a review of FC-NVMe
and an update on FC-NVMe-2.
We're not going to do a real deep dive,
and we're not going to be comprehensive
and mention every single feature
that's out there for these technologies.
We're also not going to try to do a
comparison between different fabric
technologies. That's not my goal for here.
Basics on Fibre Channel.
For those
of you who may not be familiar with it, Fibre Channel
is a purpose-built network for storage.
There is a physical connection between each host and its storage, in the sense that each device has a one-to-one connection to the switches.
And there's also a logical connection between host and storage, which can go from one host to multiple storage devices.
Some of the design requirements for Fibre Channel provide that one-to-one connectivity.
The transport and the services layers are on the same layer.
There's a well-defined role for each device.
You have initiators and targets.
It does not tolerate packet drop,
meaning that the fabric does not intentionally drop packets
if it gets into a congestion condition.
And there's really only north-south traffic,
meaning that there's really only traffic
between the initiators and targets.
The targets don't usually talk to each other,
and the initiators don't usually talk to each other.
The network is designed for
scale and availability. You can have multiple fabrics
connected to one host, which means that if one fabric goes
down, you have another fabric there to keep things running.
Some of the elements in a Fibre Channel
configuration: you have the initiator, the switch fabric, and the target.
This may be familiar for those of you who know SCSI;
these are some of the same names that old SCSI uses.
In Fibre Channel, the adapter is called an HBA.
It plays the role that a NIC plays in an Ethernet network:
basically, it encapsulates the protocol data into a Fibre Channel frame
and puts it on the network.
For our purposes, that protocol data is NVMe, SCSI, or FICON.
So the switch is really the heavy lifter in a Fibre Channel network.
The switch, or the fabric, manages basically the entire configuration.
The biggest component of the switch is the name server,
which contains all the information about what exists in the fabric.
The name server is implemented as a redundant database,
which means that each switch has a copy of all the data,
so that if one switch goes down, the name server can continue to run.
There's no single point of failure.
Essentially, the name server knows about everything that's going on in the fabric.
So for the actual data transfer, Fibre Channel typically uses an unacknowledged datagram service.
Now, there are other classes of service which are not used as often, but what's typically
used is a service known as Class 3. And Class 3 is a datagram service, which means it's connectionless.
It also has the property that the fabric will not drop it if there's a congestion situation.
Some other protocols, you could have a drop if there's congestion.
Usually it doesn't happen, but you still theoretically could have a drop.
You won't have a dropped Fibre Channel frame unless there's some unrecoverable error such as a bit error or frame corruption or something like that.
There are three fundamental constructs. There's the frame, which is the packet; that's what
you send the data in. There are sequences, which are collections of frames,
which allow you to send more than a single frame's worth of data.
And there are exchanges, which group sequences into a logical entity.
In most protocols, an exchange equates to a command and its response,
bundled into one exchange.
So if you look at a frame,
a single frame can carry up to 2112 bytes of payload, and each frame
consists of a header, which has the routing information,
the payload, of course, which is the data, and the CRC. And of course, there are other
headers and things that you can put on there
if security is turned on,
or if you have a fabric using VMIDs;
there are other headers that can be put in there
to accommodate those.
So, as I mentioned,
multiple frames can be bundled into a sequence.
So a sequence can be used to transfer large amounts of data up to megabytes.
It's really up to what the use is for a particular protocol.
And then the sequences are grouped into exchanges.
And an exchange is a set of sequences which form a particular job.
And in most protocols, as I mentioned, the exchange is a single command response.
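To make the frame/sequence/exchange hierarchy concrete, here's a minimal sketch in Python. The field names and the splitting logic are illustrative assumptions, not the actual FC-FS frame format; only the 2112-byte payload limit comes from the talk.

```python
from dataclasses import dataclass, field
from typing import List

MAX_PAYLOAD = 2112  # maximum payload of a single Fibre Channel frame, in bytes

@dataclass
class Frame:
    header: dict        # routing information (illustrative; real headers are fixed-format fields)
    payload: bytes      # up to 2112 bytes of data
    crc: int            # integrity check over the frame

@dataclass
class Sequence:
    frames: List[Frame] = field(default_factory=list)        # one large transfer, split into frames

@dataclass
class Exchange:
    sequences: List[Sequence] = field(default_factory=list)  # e.g. command, data, and response

def make_sequence(data: bytes) -> Sequence:
    """Split a payload larger than one frame into a sequence of frames."""
    seq = Sequence()
    for off in range(0, len(data), MAX_PAYLOAD):
        seq.frames.append(Frame(header={"offset": off}, payload=data[off:off + MAX_PAYLOAD], crc=0))
    return seq

# A read or write in most protocols becomes one Exchange holding its command,
# data, and response sequences.
```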
So the other big part of, of course, any network is discovery.
And so Fibre Channel has a name server
which lives in the switch.
And the name server basically collects all the information
on each device that logs in.
And it automatically collects World Wide Names.
Each device has two World Wide Names:
a World Wide Name for the individual port
and a World Wide Name for the node,
which may be a collection of ports.
The World Wide Name is a unique number,
usually an assigned number that is permanently assigned to the device
and uniquely identifies it.
The other thing that the fabric does
is it provides and enforces zoning.
Zoning is there to allow security.
You can separate ports
and decide which ports can talk to which.
And zoning can be dynamic.
You can change zoning at night when you do backups
and then put it back to the normal running configuration
in the morning.
Zoning is similar to ACLs in Ethernet.
The difference being that in Fibre Channel,
there's a central point of authority.
The fabric maintains it,
and each switch has a copy.
This is the name server.
Each switch in the fabric has a copy of the zoning information,
and it's distributed across the entire fabric.
So if one switch goes down, there's no loss of that information.
And the other thing that's a little bit different:
ACLs, if you're familiar with them,
are not necessarily standardized,
whereas in Fibre Channel zoning,
the format is actually standardized
in the Fibre Channel standards.
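As a rough illustration of zoning acting like a centrally enforced, standardized ACL, here's a hypothetical sketch. The zone names and WWPNs are made up, and the real zoning database format is the one defined in the Fibre Channel standards.

```python
# Hypothetical zoning database: each zone is a set of port World Wide Names (WWPNs).
zones = {
    "zone_prod":   {"10:00:00:00:c9:aa:bb:01", "50:06:01:60:3b:20:19:ac"},
    "zone_backup": {"10:00:00:00:c9:aa:bb:02", "50:06:01:60:3b:20:19:ad"},
}

def can_communicate(wwpn_a: str, wwpn_b: str) -> bool:
    """Two ports may talk only if at least one zone contains both of them."""
    return any(wwpn_a in members and wwpn_b in members for members in zones.values())

# The fabric enforces this check; every switch keeps a copy of the zoning database.
print(can_communicate("10:00:00:00:c9:aa:bb:01", "50:06:01:60:3b:20:19:ac"))  # True
print(can_communicate("10:00:00:00:c9:aa:bb:01", "50:06:01:60:3b:20:19:ad"))  # False
```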
And Fibre Channel, if you're familiar with networking,
you know the OSI layers; Fibre Channel has something similar.
We have the lowest layer, FC-0, which is the physical interface,
which is what pushes the bits onto the wire or the optics.
We have the FC-1 layer, which encodes those bits.
And depending on the speed you're going, some of that encoding can start getting pretty complicated
in order to get through the noise that may be out there in the real world.
FC-2 is the framing, and flow control happens at that layer.
FC-3 is what we call common services, which is what I mentioned: the name server, the
zoning database, and some other management-server types of things.
And FC-4 is the protocol layer, and that's where the protocols such as FC-NVMe, SCSI,
and FICON are defined.
So one term that I'll be using a bit in the talk is FCP.
And so what is FCP?
Some of you who have been familiar with Fibre Channel for a long time may have heard FCP associated with SCSI,
and that's what it was originally designed to transport.
But later on, other protocols have borrowed the FCP data engine to transport their data.
And the reason for that is that it's an existing transport engine
that exists in the existing hardware and software.
And it allows you to reuse some of the high-performance paths
that exist in the hardware.
So a little bit on NVMe.
People are probably all familiar with NVMe in this room,
but I'll kind of go through a quick deck here on that.
So, you know, NVMe started as an industry standard for PCI,
basically designed to attach to PCI Express.
And, of course, it was designed for SSDs,
meaning you want low latency and high IOPS.
So the newer addition, now of course it's probably two years old,
so it's not as new, but the newer addition is NVMe over Fabrics.
And the goal of that is to build that NVMe infrastructure out over a fabric.
And the fabrics can be anything.
There's a whole range of different fabrics.
The initial fabrics were RDMA and Fibre Channel.
And recently there's now a TCP binding as well.
So some basics.
The NVMe system consists of certain components.
Of course, you have drivers.
And for FC-NVMe, the drivers, of course, will be provided by the vendors and put into the Linux stack as it happens in other technologies.
The other component in NVMe is the subsystem,
and this is really the storage device.
And a subsystem contains the controller, the media, namespaces, and interfaces.
And the next one, the controller,
is the entity that allows you to access the subsystem.
There's a unique ID associated with the controller
that allows you to identify it
either on the fabric or locally.
And there's a concept of namespaces,
and a namespace is a set of blocks
within the storage device.
It's similar to LUNs, if you're familiar with SCSI.
It's not exactly the same,
but it does have similar properties.
And one NVMe subsystem may have multiple namespaces.
Now, one of the most important architectural constructs of NVMe is the queues.
And the reason they're important is that NVMe is designed to have a large number of queues,
up to 64K queues with 64K entries each.
And this allows you to take good advantage of aggregate bandwidth with multiple devices.
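Here's a minimal sketch of the queue model in Python. It is illustrative only; real NVMe drivers use fixed-size submission and completion queue entries and doorbell registers, none of which are shown here.

```python
from collections import deque

MAX_QUEUE_PAIRS = 65535   # NVMe allows on the order of 64K I/O queue pairs...
MAX_QUEUE_DEPTH = 65536   # ...each up to roughly 64K entries deep

class QueuePair:
    """One submission queue plus its paired completion queue (greatly simplified)."""
    def __init__(self, qid: int, depth: int = 1024):
        assert 0 < depth <= MAX_QUEUE_DEPTH
        self.qid = qid
        self.depth = depth
        self.sq = deque()   # submission queue entries (commands)
        self.cq = deque()   # completion queue entries (responses)

    def submit(self, command: dict) -> None:
        if len(self.sq) >= self.depth:
            raise RuntimeError("submission queue full")
        self.sq.append(command)

# Many queue pairs, e.g. one per CPU core per device, let a host drive
# many devices in parallel and use the aggregate bandwidth.
queues = [QueuePair(qid) for qid in range(1, 9)]
```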
And a little bit of an overview
of how these things all fit together.
On the left, you have the PCI Express, which is really a memory interface.
In the middle, you have Fibre Channel, which is a message-passing interface.
And on the right, you have RDMA, which is kind of a combination of both
because RDMA, of course, presents a memory interface across the network.
TCP, I believe, would probably fit
in the middle one. It would probably fit
in the message interface.
And, of course, in order to make
NVMe over Fabrics work, you have to extend
the queues over the network.
And since the queues are a very
important component of NVMe,
devices pretty much have the same view of the queues
across the network, or across the fabric,
as they would if it were a locally attached bus.
So, a little bit of a primer on FC-NVMe.
Basically what we're going to talk about
is how it works,
an update on FC-NVMe now that it's been out for a year,
and then an update on what we're doing for the next generation.
So our goals when we designed FC-NVMe: first of all,
it had to comply with the NVMe over Fabrics specification.
Of course, we also wanted to have
high performance and low latency.
And we wanted to reuse existing hardware.
We didn't want to have to require
new hardware to be designed and built
to make it work.
Of course, it has to fit into the existing
infrastructure of Fibre Channel
with little or no changes to that.
And we also want to make sure that the Fibre Channel layer
didn't have to touch the NVMe frames.
Fitting into the existing infrastructure means that the existing service layers,
such as the name server, zoning, and management, are still in place.
So the goal of high performance and low latency
means that we want to use
existing hardware acceleration.
Fibre Channel traditionally does not have an RDMA interface,
so we are using FCP, as I mentioned,
to do the data transfer.
And FCP, previously, before FC-NVMe,
was being used by SCSI and FICON for data transfers.
And the reason we want to use it is because many of the existing HBAs have FCP as a hardware accelerated option.
And like FC, FCP itself is a connectionless protocol. There may be connections in the
upper-layer protocol, but FCP itself is connectionless; it's really a data transfer protocol.
So we then need to map the NVMe constructs into FCP. The NVMe command and response
capsules, as they're called in the
Fabrics definition, are mapped into
FCP information units.
So you'll see some of those in the upcoming slides.
And then the NVMe I/O operation, which means the command and the response, is mapped into a
Fibre Channel exchange. So that's the idea: a command and its response are mapped into an exchange.
So we have the different components from NVMe: the SQE, the submission queue entry,
which is actually the command, is mapped into the FCP command IU.
Data is mapped into the FCP data IU; since we don't have RDMA, we're not doing memory access,
we're mapping into messages that are sent across the wire.
And then the CQE, which is the NVMe response, is mapped into the FCP response IU.
And if anybody is familiar with the SCSI version, these are not the same, but similar
constructs exist inside SCSI.
Some of the fields have been changed, of course,
but the constructs are similar.
And then, of course, I/O operations
are bundled into an exchange.
So a write command, its data, and its response
are bundled together, and the same thing with a read:
the read command, data, and response
are bundled into their own exchange.
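As a hedged sketch of that mapping, the snippet below just pairs the NVMe constructs with the FCP information units named above; the data structures are illustrative, not the actual IU formats from the FC-NVMe standard.

```python
# Illustrative mapping of NVMe over Fabrics constructs onto FCP information units (IUs).
NVME_TO_FCP = {
    "SQE (submission queue entry, i.e. the command capsule)":  "FCP command IU",
    "data transfer":                                           "FCP data IU",
    "CQE (completion queue entry, i.e. the response capsule)": "FCP response IU",
}

def build_write_exchange(sqe: bytes, data: bytes) -> list:
    """One NVMe I/O operation (command + data + response) maps to one FC exchange."""
    return [
        ("FCP command IU", sqe),    # carries the NVMe SQE
        ("FCP data IU", data),      # data sent as messages on the wire, not RDMA memory access
        ("FCP response IU", None),  # filled in by the target with the NVMe CQE
    ]

exchange = build_write_exchange(b"\x01" * 64, b"payload bytes")
```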
And one thing I'd like to mention
is that RDMA
is all about zero copy.
Zero copy
allows the data to be
transferred from the network device to the user application
with a minimum amount of data copies in the memory, which can be expensive.
RDMA is a semantic for making zero copy easier or making it more upfront enforceable.
You don't need to have RDMA to do zero copy.
Most of the FC implementations have been doing zero copy
for 20 years before there was an RDMA.
Really, the difference between RDMA
and what the FC devices are doing is the APIs.
There didn't exist an already made API
when FC was defined,
but it still uses a zero-copy architecture.
Of course, the other important component is discovery.
And both Fibre Channel and NVMe
have their own discovery mechanisms.
And so the approach that we took is we use basically a dual model.
We use the Fibre Channel name server to do the things it's good at,
which is knowing what the Fibre Channel ports are,
and then we use the NVMe discovery server to do what it's good at,
which is knowing which subsystems exist out there.
So, just a quick
example, you know, you have the name server,
you look for the name server first
to see where the discovery controller is, and you look
at the discovery controller to
find out where the storage
devices are, and then you can
talk to the storage devices.
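A rough sketch of that two-step flow is below. The function and field names are hypothetical, not an actual driver or SPDK API; it only shows the order of the lookups.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PortEntry:
    wwpn: str
    is_discovery_controller: bool = False

@dataclass
class SubsystemEntry:
    subnqn: str       # NVMe Qualified Name of the subsystem
    port_wwpn: str    # Fibre Channel port where it can be reached

def discover(query_name_server: Callable, query_discovery_ctrl: Callable) -> List[SubsystemEntry]:
    """Two-step discovery: FC name server first, NVMe discovery controller second."""
    # Step 1: the FC name server knows which ports speak NVMe,
    # including the port of the NVMe discovery controller.
    ports = query_name_server(fc4_type="NVMe")
    disc_port = next(p for p in ports if p.is_discovery_controller)
    # Step 2: the NVMe discovery controller knows which subsystems exist out there.
    return query_discovery_ctrl(disc_port)

# Wiring it up with canned data, purely for illustration:
subsystems = discover(
    lambda fc4_type: [PortEntry("10:00:00:00:c9:aa:bb:01", is_discovery_controller=True)],
    lambda port: [SubsystemEntry("nqn.2019-04.org.example:subsys1", "50:06:01:60:3b:20:19:ac")],
)
```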
And then, of course,
FC-NVMe, as I mentioned before,
works with zoning, the management server,
and other Fibre Channel services.
So, an update on FC-NVMe.
The standard, so this is the first revision.
The standard was ratified in 2017.
And at that time, there were already Linux drivers in development.
I think the first Linux kernel,
and somebody can correct me if I'm wrong on this,
but the first Linux kernel that had the FC-NVMe drivers was probably 4.13.
And it is also part of the unified host and target in SPDK.
We are using existing devices,
using existing HBA and switch hardware to support FC-NVMe,
and in most cases, this is simply a firmware or driver upgrade.
For switches, you usually don't have to do anything at all.
It's just for the HBAs.
For performance, I'm not going to try to show different vendors' performance here,
but we have seen performance numbers up to 50% lower latency,
depending on the configuration.
Right now, there are Linux implementations available with the drivers in place.
And as I will be mentioning in a bit,
FC-NVMe-2 started development this last spring,
and our focus in FC-NVMe-2 is enhancing error recovery.
So just a summary of where we're at today
with FC-NVMe.
Switches are available
because switches don't really need to change.
It's just another packet.
There are new switches coming out
that have additional data collection abilities,
so that you can see some statistics
on what's going on in your FC-NVMe configuration.
But, you know, that's a plus.
You don't need that in order for it to work.
For HBAs, the host-side driver is available for download today.
Target side drivers are in development,
and their firmware is available in sort of a trial mode.
Many operating systems support it,
and there is some engagement
from VMware and Microsoft,
and storage devices are expected to start
being released en masse in the second half of this year,
which probably means now.
So, FC-NVMe-2.
As I mentioned, FC-NVMe-2's focus is on enhanced error recovery.
And our goal is to allow errors to be detected at the transport layer
before the protocol layer knows anything has gone wrong.
So why are we doing this?
You might say, well, this is a reliable network.
Why would you need to do additional error recovery?
Well, you know, stuff happens.
Bit errors do happen.
There's a theoretical bit error rate on any network line you may be using.
Typically, the actual bit error rate is lower than that,
but there is a theoretical, and so bit errors can happen,
and depending on the speed,
they could theoretically happen often.
There's also a possibility of software errors or hardware errors.
What causes it?
We always hear about the cosmic rays.
I did a little bit of digging,
and there's actually some research by IBM
in the 90s that suggested that
for every 256 megabytes of RAM,
you would see one cosmic-ray-induced error per month.
Now, of course, our memories are bigger today,
and the chips are smaller.
You have more memory and smaller chips,
so that number may not be correct anymore.
There may be a smaller chance of the memory being hit, but if it is hit by a particle,
then there's a larger possibility of an upset because the actual physical geometry is smaller.
You also have radiation from the local environment.
One thing that chip makers are always careful of is making sure that they source the correct materials.
You don't want to have radioactive solder in your package
because you could then have radiation coming from your own chip and causing errors.
So there is radiation in the environment that can cause errors.
And then there's RF noise and power noise from local equipment,
or non-local equipment, because I did hear a story recently.
Somebody told me about a data center that was crashing frequently at the same time every day.
It took them a couple weeks to figure out that the reason it was crashing
was because at that time the local power company was switching its generators,
and that generator switch would cause a low-frequency noise to be put out in the power lines,
which was getting up into the electronics and causing some havoc.
Software and hardware bugs, we're all familiar with those.
Probably don't need to say more about that.
And then there's the bit error rate, which is currently specified at 10^-12 to 10^-15,
which at some of the current speeds
means you could theoretically see multiple bit errors per hour.
You don't usually see that.
Real links are usually much better,
but that is the theoretical amount.
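To put rough numbers on that, here's a quick back-of-the-envelope calculation; the 32 Gb/s line rate is chosen only for illustration.

```python
# Back-of-the-envelope: worst-case bit errors per hour at a given line rate and BER.
line_rate_bits_per_sec = 32e9     # a nominal 32 Gb/s link, chosen only for illustration
seconds_per_hour = 3600

for ber in (1e-12, 1e-15):
    errors_per_hour = line_rate_bits_per_sec * seconds_per_hour * ber
    print(f"BER {ber:g}: about {errors_per_hour:.3f} bit errors per hour, worst case")

# At 1e-12 that works out to roughly 115 errors per hour in theory; real links are
# usually far better, but the exposure clearly grows with link speed.
```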
So with all these possible errors,
how did this stuff work before?
The lower link level does have some limited error recovery.
A lot of the higher speed links
have what's called forward error correction,
which is like a really advanced parity check.
And depending on the encoding and the speed,
it can recover multiple-bit errors.
The protocols themselves also have recovery mechanisms,
and both NVMe and SCSI have their own recovery mechanisms.
They're usually not real quick, but they will recover from errors.
So our goal in FC-NVMe-2
is to basically not let the protocol layer
see any of these bit errors.
We want to detect and recover from them
in a fast manner,
which you can do in a transport layer
but which is harder to do in a protocol layer.
So it's basic error recovery.
Missing frames time out and are retransmitted.
We've defined a new set of Fibre Channel basic link services
to facilitate the recovery.
Basically, a basic link service in Fibre Channel
is a packet that performs a basic function,
such as setting up or terminating the sequences that I mentioned,
and what have you.
And then, of course, as I mentioned,
the protocol does not know anything happened.
Good leading question.
Would you repeat the question?
He was asking if it's applicable to SCSI as well.
I'll answer it in two slides.
So an example, and of course, one caveat for those of you who may know the standards,
this is not true to the actual process.
Their recovery is very complicated, and if I were to go through that,
I could probably spend the entire time
talking about the mechanism.
But, of course, an example is,
frame goes missing,
we have a two-second timer,
we know, in most cases,
within two seconds that something has gone wrong.
We've defined a new flush service,
which we can use to double-check
whether or not the frame has really gone missing,
because one of the worst things you can do
in error recovery is send out a duplicate frame
when the original frame didn't go missing.
So we make sure that the frame has actually gone missing,
and then once we go through that process, we know that we can retransmit.
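A very simplified sketch of that flow is below. The real FC-NVMe-2 recovery behavior is far more involved, and the "confirm lost" check here is only a paraphrase of the new flush-style link service, not its actual semantics.

```python
FRAME_TIMEOUT_SECONDS = 2.0   # roughly the detection window described above

def deliver_with_recovery(send_frame, confirm_lost, frame, max_retries=1) -> bool:
    """Send a frame; if nothing comes back in time, confirm the loss before retransmitting."""
    for _ in range(max_retries + 1):
        if send_frame(frame, timeout=FRAME_TIMEOUT_SECONDS):
            return True          # delivered; the protocol layer never notices anything
        # Don't retransmit blindly: a duplicate frame when the original actually
        # arrived is worse than waiting, so use the flush-style service to check.
        if not confirm_lost(frame):
            return True          # the frame turned out to be delivered after all
        # Confirmed lost: loop around and retransmit.
    return False                 # give up and let the protocol layer recover
```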
And as I said, error recovery is a very complicated topic.
There's lots of conditions that you have to consider,
and each one of them may have a slightly different process.
You can lose any one of these things.
You can also lose the recovery frames.
The recovery mechanism frames itself could get lost.
So you have to have a scenario for all these different types of issues.
And so the goal, of course, is to recover quickly,
within a two-second time frame.
If you look at the protocol layers, recovery there may take much longer; if you let SCSI or NVMe
try to recover, it can take much longer.
We think this is going to become
more important as link
speeds go up. Of course, as link speeds go up,
there are more bits being sent, and
there's more possibility within a given window
of getting an error.
Here's the answer to the leading question about SCSI:
we think this is going to be so good
that we're also applying it to SCSI.
We've opened a new project, FCP-5, within T10,
to define this error recovery mechanism for SCSI links.
So why do we want to use FC-NVMe? It's based on a dedicated storage network.
Fibre Channel is not carrying other types of traffic; it's not going to be congested
by somebody watching a YouTube movie or what have you. It's a dedicated network for storage.
You can run NVMe and SCSI side-by-side on the same port.
You have a long-tested
discovery service,
which, in most cases,
is automatically configured.
Zoning and security also apply to FC-NVMe.
And we also have the years of integration into data centers that FC has,
which FC-NVMe can now use.
And of course, the last one, and this is the new one,
is we think that with the enhanced error recovery,
we're going to have an industry-leading error recovery mechanism,
which occurs at the transport layer.
So this slide was kind of written by a marketing person, but...
FC-NVMe: wicked fast.
Well, you know.
Builds on 20 years of storage experience.
It can be run side-by-side with SCSI.
It inherits the benefits of the discovery mechanism that Fibre Channel has, and it capitalizes on the qualification
that's already taken place in data centers for Fibre Channel.
And that's it.
That's my contact information.
If you have any questions,
anybody have any questions right now?
All right, thank you.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.