Storage Developer Conference - #59: Introducing Fibre Channel NVMe
Episode Date: January 2, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair.
Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community.
Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, Episode 59.
So welcome to the introduction to FC-NVMe, which if you don't know the acronym, is Fibre Channel over NVM Express.
I'm sorry, the other way around, NVM Express over Fibre Channel.
So a little bit about myself. I'm Craig Carlson. I'm with Cavium.
I chair the FC-NVMe working group within T11.
I also do some other things in the industry.
I also want to thank Jay Metz of Cisco
for contributions to the slide deck,
as well as for taking one of the only pictures
of myself I actually like.
So the agenda for this talk is we're gonna do
some refreshers on the background of Fibre Channel and of NVMe.
And then we're going to go into what FC-NVMe is itself, and then also a short section on why you might want to use FC-NVMe. So just some groundwork here. This presentation is a reminder of how Fibre Channel
works, how NVMe over Fabrics works, and a high-level overview of FC-NVMe
and how they work together. What we're not going to do today is do a technical deep dive, no boiling the ocean, and we're
not going to do a comparison of FC-NVMe with other methods of doing NVMe over
Fabrics.
If you have any further questions, or if you want a deeper dive, you can come and
ask me and I can bore you as much as you want and give you all the details you want.
Just come find me in the hallway. So Fibre Channel.
So what is Fibre Channel? I'm sure a lot of you here already
know or have some experience with Fibre Channel. But Fibre Channel is
purpose-built for storage. It's a high-speed
physical connection between a host and storage,
and it's also a logical protocol between the host and storage.
So what were some of the design requirements that went into making Fibre Channel in the first place?
One of the primary design requirements was one-to-one connectivity. Even though you're on a network, a distributed network or a fabric,
the devices, the hosts and storage, really act like they're connected one-to-one.
Also, transport and services are on the same layer. We don't have different protocols stacked on top,
like TCP/IP, where you have many different layers. We do have layers, but not as many as that. So
most stuff is on the same layer. There's also a well-defined end-device relationship, initiators and
targets, that comes from SCSI. And there is no built-in packet drop.
Now, of course, things happen, packets can be corrupted and things can
drop, but there are no congestion management procedures where packets get dropped like that. There's really only north-south traffic,
meaning traffic between hosts and the storage devices. And the Fibre Channel network
is also optimized for high availability. You can have multiple paths through a fabric, and there are also services built in.
And I'll go into that in a second.
So a basic Fibre Channel configuration looks like this. You have an initiator,
which is your host. You have your fabric in between, and of course you have
your storage device. So for Fibre Channel, the initiator contains something called the HBA,
the host bus adapter. In other network technologies,
this is called a NIC in the Ethernet world.
This is where the protocols get encapsulated
in the Fibre Channel frames.
And then of course the fabric,
which is a set of switches, one or more switches.
And the fabric has a lot of intelligence.
It has something called the name server,
which is the repository of all the information.
It's implemented in the Fabric as a redundant
distributed network, so there's no single point of failure.
And every single device that logs into the fabric is
registered in the name server.
So Fibre Channel typically uses an unacknowledged datagram
class of service.
This is known as Class 3.
It's defined as a reliable datagram,
meaning it won't be dropped.
It won't be dropped for congestion reasons or anything like that.
If you get a bit error, frames can get dropped.
Stuff happens, but it won't be dropped as a matter of the protocol.
And within the Fibre Channel data transfer, there are three fundamental constructs.
There are frames, which is a packet of data,
sequences, which is a set of frames collected together,
and exchanges, which, depending on the protocol,
are many times associated with a command and response tied together. So for a frame, each unit is 2,112 bytes, and
that consists of a Fibre Channel header, a payload, and a CRC.
And then multiple sequences can be bundled together.
I'm sorry, multiple frames can be bundled together into a sequence, and this allows you to transfer
large amounts of data, megabytes, gigabytes, what have you.
And then an interaction between two Fibre Channel endpoints is bundled into something called an exchange.
And for protocols like SCSI and FC-NVMe, an exchange is mapped to a single command and response.
So the important thing about exchanges is, in Fibre Channel, the frames within the individual
exchange are guaranteed to be delivered in order.
Exchanges themselves, individual exchanges, may be delivered out of order in the fabric
so that the fabric can take advantage of any optimizations in paths that may exist between
switches.
So what that means is different commands can take different paths through the fabric, so they can be delivered in a different
order than they were sent.
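To make those constructs a bit more concrete, here's a minimal C sketch of a Fibre Channel frame. The field names follow the standard 24-byte FC frame header, but the struct itself is simplified and purely illustrative; a real implementation also has to deal with byte ordering, SOF/EOF delimiters, and optional headers.

```c
#include <stdint.h>

/* Simplified sketch of the 24-byte Fibre Channel frame header.
 * Field names follow the usual FC header layout; byte ordering,
 * SOF/EOF delimiters, and optional headers are ignored here.      */
struct fc_frame_header {
    uint8_t  r_ctl;      /* routing control: what kind of frame this is   */
    uint8_t  d_id[3];    /* destination port ID assigned by the fabric    */
    uint8_t  cs_ctl;     /* class-specific control                        */
    uint8_t  s_id[3];    /* source port ID                                */
    uint8_t  type;       /* FC-4 protocol type (e.g. FCP)                 */
    uint8_t  f_ctl[3];   /* frame control flags                           */
    uint8_t  seq_id;     /* identifies the sequence this frame belongs to */
    uint8_t  df_ctl;     /* data field control (optional headers)         */
    uint16_t seq_cnt;    /* position of this frame within its sequence    */
    uint16_t ox_id;      /* originator exchange ID                        */
    uint16_t rx_id;      /* responder exchange ID                         */
    uint32_t parameter;  /* e.g. relative offset of the payload           */
};

/* A frame carries the header, a payload of up to 2,112 bytes, and a CRC. */
struct fc_frame {
    struct fc_frame_header hdr;
    uint8_t                payload[2112];
    uint32_t               crc;
};
```

The SEQ_ID and SEQ_CNT fields are what tie frames into a sequence, and OX_ID/RX_ID tie sequences into an exchange, which is why in-order delivery can be guaranteed within an exchange while different exchanges are free to take different paths.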
And as I mentioned before, the other thing that Fibre Channel
has is a discovery layer, which is handled
with the name server.
The name server contains information such as the worldwide names of all devices, the port
IDs that they exist on, what type of device they are, and so on and so forth.
The fabric also provides a service called zoning, which allows ports to be separated from each other.
It's a security method.
It's also a data integrity method so that you don't have devices or hosts messing with your storage that you don't want them to touch.
And zoning is implemented in each switch in the fabric in a distributed fashion as well,
so it's also high availability.
So if you look at the layers that exist in Fibre Channel, at the lowest level you have FC-0, which is the physical layer, which is
the bits and photons.
And then you have the FC-1 layer, which is the encoding. In any high-speed network,
if anybody's looked at how networks work these days, because we're pushing so much data through
copper that sometimes can't really handle it,
there's a lot of encoding that goes into it
for error recovery and correcting bit errors.
Then above the byte encoding layer,
you have the framing layer, FC-2,
and then you have the services.
FC-3 is the services, which is the name server,
the zoning server.
And then FC-4 is the upper layer, which is the protocols, which would be SCSI, FC-NVMe, FICON, what have you.
So there's one term that keeps on getting brought up in Fibre Channel discussions,
and it's FCP.
And so what is FCP?
FCP was traditionally or historically defined to carry one storage protocol,
and that was SCSI.
And since that time, it's been adapted to carry other storage protocols.
So FCP really has evolved into a data transfer protocol, which can carry SCSI, can carry FICON,
and now we've been using it for FC-NVMe as well.
And the reason that we do that is because the fabric and the HBAs have optimized paths for FCP.
So it allows us to take advantage of existing optimizations.
So on to a quick NVMe refresher.
So NVMe stands for Non-Volatile Memory Express.
It began as a PCIe-attached storage protocol.
Many of you in this room probably already have one of these. If you have one of these laptops,
the newest Apples have NVMe drives in them,
and I know a lot of other laptops do now.
And about two or three years ago,
there was a project within the NVMe group
to define a fabric method of sending the data.
And so NVMe over Fabrics was born.
And NVMe over Fabrics itself is a generic set of tools to extend NVMe over a fabric.
The initial fabrics that were defined are RDMA-based fabric protocols and Fibre Channel.
So some basics on NVMe.
For the different layers, you have, of course, drivers.
And, you know, for inbox NVMe devices, you have one set of drivers,
and of course the fabric NVMe devices will require new drivers,
which the group is working on right now,
and a lot of them have been pushed into the upstream kernel for Linux
and are also being ported to other operating systems.
The next layer is the subsystem.
And a subsystem is really what's contained in a storage device.
And it contains the controllers, the media, namespaces and interfaces.
Controller is, of course, what you would expect it to be.
It's the actual entity that executes the commands and returns the responses and manages the storage.
And within the purview of a controller are the namespaces, which are the actual storage extents,
the actual disk, equivalent to a disk image or a disk.
And the one thing that NVMe has over other storage protocols is very deep queuing
and a large set of possible queues that can be associated with any particular
controller. And so the important thing about NVMe is maintaining this
queuing model, so that you have large numbers of queues
associated with any particular controller.
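As a rough illustration of that containment, here's a minimal C sketch of the subsystem, controller, namespace, and queue-pair hierarchy just described. The type and field names are made up for illustration; they're not taken from any driver or from the NVMe specification.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: a toy model of the NVMe hierarchy described above.
 * A subsystem contains controllers; each controller exposes namespaces
 * (the actual storage extents) and owns a set of queue pairs.           */

struct nvme_queue_pair {
    uint16_t qid;           /* queue identifier (0 = admin queue)         */
    uint16_t depth;         /* number of entries in the queue             */
    void    *sq_entries;    /* submission queue memory (64-byte SQEs)     */
    void    *cq_entries;    /* completion queue memory (16-byte CQEs)     */
};

struct nvme_namespace {
    uint32_t nsid;          /* namespace ID                               */
    uint64_t capacity_lbas; /* size of the storage extent                 */
};

struct nvme_controller {
    uint16_t                cntlid;      /* controller ID                 */
    struct nvme_namespace  *namespaces;  /* storage exposed by this ctrl  */
    size_t                  ns_count;
    struct nvme_queue_pair *queues;      /* admin + I/O queue pairs       */
    size_t                  queue_count; /* NVMe allows very many of these */
};

struct nvme_subsystem {
    char                    subnqn[256]; /* subsystem NVMe Qualified Name */
    struct nvme_controller *controllers;
    size_t                  controller_count;
};
```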
So if you look at a taxonomy of the transport,
NVMe itself was defined as a memory-based model.
Now, of course, you can't, even in RDMA,
you can't use memory-based models in a fabric
because the data still has to be moved over the fabric.
So fabrics evolved into a message-based model.
And Fibre Channel maps the NVMe data, of course, onto Fibre Channel frames.
And as I mentioned,
the queue pairs are important in NVMe.
So in order to port NVMe onto a fabric,
you have to have a method for porting the queue pairs
across the fabric. And so the NVMe over Fabrics definition has a set of commands
and a set of methods for maintaining the queues across the fabric, which were
before in local memory and now are spread across a network.
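As a hedged sketch of what that means in practice: in NVMe over Fabrics, each queue pair is created explicitly with a Fabrics Connect command rather than being set up in shared host memory. The helper function, structure layout, and queue sizes below are hypothetical stand-ins; only the overall Connect flow, admin queue first and then the I/O queues, reflects the NVMe over Fabrics model.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical transport hook: sends a Fabrics command over whatever
 * transport is in use (RDMA, FC-NVMe, ...) and waits for the response. */
int fabric_send(void *conn, const void *cmd, size_t cmd_len,
                void *resp, size_t resp_len);

/* Simplified stand-in for the NVMe-oF Connect command; the real layout
 * is defined in the NVMe over Fabrics specification.                   */
struct nvmf_connect_cmd {
    uint16_t qid;          /* 0 = admin queue, >0 = I/O queue            */
    uint16_t sqsize;       /* requested submission queue size            */
    char     subnqn[256];  /* which subsystem we want to connect to      */
    char     hostnqn[256]; /* who we are                                 */
};

/* Each queue pair that would live in local PCIe host memory is instead
 * established across the fabric with a Connect command.                */
int connect_queues(void *conn, const char *subnqn, const char *hostnqn,
                   int io_queue_count)
{
    struct nvmf_connect_cmd cmd;
    uint8_t resp[16];

    memset(&cmd, 0, sizeof(cmd));
    strncpy(cmd.subnqn, subnqn, sizeof(cmd.subnqn) - 1);
    strncpy(cmd.hostnqn, hostnqn, sizeof(cmd.hostnqn) - 1);

    /* 1. Connect the admin queue (QID 0) to reach the controller.      */
    cmd.qid = 0;
    cmd.sqsize = 31;                     /* arbitrary example size      */
    if (fabric_send(conn, &cmd, sizeof(cmd), resp, sizeof(resp)) != 0)
        return -1;

    /* 2. Connect each I/O queue pair that will carry SQEs and CQEs.    */
    for (int qid = 1; qid <= io_queue_count; qid++) {
        cmd.qid = (uint16_t)qid;
        cmd.sqsize = 127;                /* arbitrary example size      */
        if (fabric_send(conn, &cmd, sizeof(cmd), resp, sizeof(resp)) != 0)
            return -1;
    }
    return 0;
}
```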
So for FC-NVMe,
in this section we're gonna look at a high-level understanding of how it works
and understand how FCP can be used to map
NVMe to Fibre Channel.
So when we were designing FC-NVMe, we had some goals.
First off, we wanted to comply with the NVMe over Fabrics spec.
And of course, we wanted to maintain high performance and low latency.
NVMe, of course, is a low latency protocol,
and so we wanted to maintain that.
So in order to do this, we also wanted to use existing HBA and switch hardware.
We didn't want to require ASIC respins
to implement FC-NVMe.
And we wanted to fit into the existing
Fibre Channel infrastructure,
management, name server, and all those other things that exist for other Fibre Channel
protocols.
We wanted to be able to pass the NVMe commands with little or no interaction from the Fibre
Channel layer.
And of course, the name server, zoning, and management come with it.
So high performance, low latency.
In order to maintain parity with existing protocols and improve on existing protocols too,
we need to use the same tools.
So we wanted to keep the same hardware acceleration
that exists, say, for SCSI.
And Fibre Channel does not have an RDMA protocol,
so we use FCP as the data transfer protocol.
Currently, both SCSI and FICON use FCP.
And FCP is deployed in many, if not all,
implementations as a hardware accelerated layer.
So to map onto FCP, we have to map the NVMe command, response, and data onto Fibre Channel FCP frames.
And an NVMe I/O operation is mapped directly into a Fibre Channel exchange.
So that means that a single read operation would be one exchange.
So, for example, if you take an SQE, which stands for a submission queue entry,
a submission queue entry in NVMe is basically a command.
It's 64 bytes long,
and it's the entry that's put on the queue
that tells the controller what to do.
So the first step is to map one of these SQEs
into an FCP command IU.
And then if that command results in any data transfer,
of course, then the data has to be mapped into FCP data IUs.
And the data IU portion is what is accelerated by hardware.
The hardware engines will automatically transfer the data across the Fibre Channel network and place it into memory,
so that there's no software handling of the data while it's in progress.
And then, of course, the response, which is the CQE,
which stands for completion queue entry, is mapped into the FCP response IU.
And then the transactions for a particular I/O are bundled into an exchange. So in this example, you have a read operation, which is a single exchange,
and a write operation, which is a single exchange.
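Here's a hedged C sketch of that mapping. The IU layouts are simplified placeholders, not the actual FC-NVMe IU formats defined by T11, but they show the basic idea: the 64-byte SQE rides in the command IU, the payload rides in data IUs that the HBA hardware moves, and the 16-byte CQE comes back in the response IU, all within a single exchange.

```c
#include <stdint.h>
#include <string.h>

/* Simplified placeholders -- the real FC-NVMe IU formats carry more
 * fields and are defined in the T11 FC-NVMe standard.                   */

struct nvme_sqe { uint8_t bytes[64]; };   /* submission queue entry      */
struct nvme_cqe { uint8_t bytes[16]; };   /* completion queue entry      */

/* Command IU: carries the SQE to the target at the start of an exchange. */
struct fcp_cmnd_iu {
    uint16_t        iu_length;
    struct nvme_sqe sqe;                  /* the NVMe command itself     */
};

/* Data IU: carries the payload; this is the part the HBA hardware moves
 * and places directly into memory with no software touching the data.   */
struct fcp_data_iu {
    uint32_t relative_offset;             /* where this chunk lands      */
    uint32_t burst_length;
    /* payload follows */
};

/* Response IU: carries the CQE back, completing the exchange.           */
struct fcp_rsp_iu {
    struct nvme_cqe cqe;
};

/* One NVMe I/O maps to one Fibre Channel exchange:
 *   read : command IU -> data IU(s) -> response IU
 *   write: command IU -> transfer ready -> data IU(s) -> response IU    */
void build_cmnd_iu(struct fcp_cmnd_iu *iu, const struct nvme_sqe *sqe)
{
    memset(iu, 0, sizeof(*iu));
    iu->iu_length = sizeof(*iu);
    memcpy(&iu->sqe, sqe, sizeof(*sqe));  /* wrap the 64-byte SQE        */
}
```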
And one thing I keep on hearing about FCP is, well, how do you do zero copy?
RDMA was designed to allow network protocols to do zero copy implementations easily.
The fact is that FCP has been doing that for 20 years.
It just wasn't called RDMA.
FCP, in current implementations going back 20 years, has been doing zero copy.
So you don't need to have RDMA in order to do zero copy.
RDMA is a set of tools, a semantics, that make it easier to do zero copy, but it's not required.
And so really the difference is the APIs.
So for RDMA, you have a defined set of APIs
that enforce or make it easier to do zero copy,
where FCP, it's the implementation which does it.
So for example, if you look at the transactions that take place on the wire,
FCP transactions, the ladder diagram here, we have a read and a write operation taking place at the bottom.
And if you look at the same operation done with RDMA,
you have basically the same flow of commands.
No matter what you're doing on the wire,
you still have to send a command, get the data, or send the command and wait for the other side to say,
I'm ready for the data.
So it's basically the same operation at the lower level.
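To put that side-by-side claim in concrete terms, here's a small illustrative sketch of the steps in the FCP read and write exchanges from the ladder diagram; the labels are descriptive only, not IU names from the spec.

```c
/* Purely illustrative: the wire-level steps of the two exchange types.
 * An RDMA-based transport goes through an equivalent set of steps, just
 * expressed with different semantics and different frame formats.       */
enum fcp_read_step {
    READ_CMND,       /* initiator sends the command (carries the SQE)    */
    READ_DATA,       /* target streams data back to the initiator        */
    READ_RSP         /* target sends the response (carries the CQE)      */
};

enum fcp_write_step {
    WRITE_CMND,      /* initiator sends the command                      */
    WRITE_XFER_RDY,  /* target says "I'm ready for the data"             */
    WRITE_DATA,      /* initiator streams data to the target             */
    WRITE_RSP        /* target sends the response                        */
};
```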
Of course, the other thing that is important is to maintain discovery mechanisms within FC-NVMe.
And in order to facilitate that, we use pieces from both layers. We use the Fibre Channel name server to do the discovery of the ports.
So we can go out and say, what are the NVMe ports out here?
And once we find them, then we can use the NVMe Discovery Controller, which knows the details about the subsystems
which exist in the NVMe devices.
This allows each component to manage the data that it has the most knowledge about.
So for example, for an FC-NVMe initiator to connect,
the first thing it will do is go to the name server, the Fibre Channel name server, and it will identify
where the discovery controllers are.
And then it will talk to the discovery controller,
which identifies the subsystems that it can talk to,
and then once it has that,
then it can start talking directly to the storage devices.
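Here's a rough pseudocode-style sketch, in C, of that connection sequence. Every helper name is hypothetical, standing in for the name server query, the FC-NVMe connect, and the discovery log read; the only literal detail is the well-known discovery NQN, and the point is simply the order of operations just described.

```c
#include <stdint.h>

/* Hypothetical helpers -- names invented for illustration only.        */
int   fc_nameserver_query(const char *fc4_type, uint64_t *wwpns, int max);
void *fc_nvme_connect(uint64_t wwpn, const char *subnqn);
int   nvme_discovery_get_log(void *disc_ctrl, char subnqns[][256], int max);

/* Sketch of the FC-NVMe discovery flow described above:
 *  1. Ask the Fibre Channel name server which ports support NVMe.
 *  2. Connect to the discovery controller behind each such port.
 *  3. Read its discovery log to learn which subsystems exist there.
 *  4. Connect directly to those subsystems for actual I/O.
 * This happens at initialization (and again after a fabric change
 * notification), not on every exchange.                                 */
void discover_and_connect(void)
{
    uint64_t ports[64];
    int nports = fc_nameserver_query("NVMe", ports, 64);

    for (int p = 0; p < nports; p++) {
        void *disc = fc_nvme_connect(ports[p],
                                     "nqn.2014-08.org.nvmexpress.discovery");
        if (!disc)
            continue;

        char subnqns[16][256];
        int nsub = nvme_discovery_get_log(disc, subnqns, 16);

        for (int s = 0; s < nsub; s++)
            fc_nvme_connect(ports[p], subnqns[s]);  /* talk to the storage */
    }
}
```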
Is that for every exchange, or just for the...
That happens during initialization.
That would happen once for each initiator target pair.
And then once you have that, it stays that way
unless something changes in the fabric.
If something changes in the fabric,
then you'll get a notification,
and you have to do it again.
And of course, the zoning and management server and other service mechanisms continue to work with FC-NVMe.
So why do you want to use it?
So I think the key thing about Fibre Channel is it's a dedicated storage network.
It's not sharing resources with anybody else.
It's not sharing administrators with anybody else.
It's a dedicated storage network.
You can also run NVMe and SCSI side by side on the same wire.
A lot of the implementations that exist right now
can run both protocols simultaneously on the same port
from an HBA into the switch.
Fibre Channel's been around for a while.
All the testing that has gone into making Fibre Channel
an enterprise storage solution
in the last 20 years, 25 years, 30 years,
continues to be there for FC-NVMe
because we're keeping the same data transfer layer,
we're keeping the same fabric,
we're touching as little as we can in the path.
Question?
When you say NVMe and SCSI, not only the side-by-side, you mean the Fibre Channel?
The Fibre Channel is more common than the side-by-side.
So here's where…
You have Fibre Channel gear, you're running NVMe on top of FCP.
That's right, yeah.
Okay, so when you say it's SCSI side-by-side,
you mean I can run NVMe on FCP
and Fibre Channel on the same gear?
Well, Fibre Channel is not SCSI.
Fibre Channel is a protocol.
Fibre Channel is Fibre Channel.
Yeah, if you're taking the traditional definition of FCP, which is it's SCSI, you're right.
So what I'm saying is you can run NVMe, which is running over FCP, side by side with SCSI,
which is running over FCP on the same wire.
Does that make sense?
Okay.
And then, of course, the built-in zoning security for Fibre Channel remains in place.
Of course, that picture is not a good example of security.
And there has been a lot of qualification of
Fibre Channel devices, and the idea for deployment is
that while a change is required, it's not a
hardware change, it's a firmware change.
So the change is adding the queues to the HBA and stuff like that?
It's a bit more complicated than that.
It is a different protocol, so it does require a change to the firmware
to understand the commands and responses.
The queues, yes.
And it depends on the implementation.
A lot of times the queues are stored in the hosts.
Sometimes they may be in the controller adapter itself.
It really depends on the implementation.
Okay, so you can tell this slide was written by Jay, who's the marketing guy.
So FC-NVMe, wicked fast.
It builds on top of Fibre Channel, which is fast, too.
We're looking at 64-gigabit Fibre Channel right now.
Currently, 32-gigabit Fibre Channel is in the field.
So Fibre Channel remains one of the fastest storage networks that you can deploy.
It builds on 20 years of storage network experience.
It can be run side by side with existing SCSI-based Fibre Channel storage.
And actually you can run it side by side with FICON, too, because they're all layers on top
of the existing protocols.
It inherits the benefits of the devices that exist out there.
So for more info, you can go to fibrechannel.org.
My email is up there if you want to talk to me.
You can talk to me in the hallway.
Any questions?
Can you describe why you need to be compatible with NVMe over Fabrics for this? Because we want to work with the existing infrastructure. People are designing devices,
in particular large all-flash arrays, which do NVMe over
Fabrics, and the interface then becomes possibly a pluggable module. You
can put RDMA in there, you can put Fibre Channel in there, but the protocol
that the device is talking is NVMe over Fabrics, so we want to maintain that compatibility with existing and emerging devices.
Does that make sense?
Any questions?
A significant number of organizations, maybe 50% or better of the Fortune 100,
are running FICON and SCSI over FCP on the same directors.
Would the fact that we already have two protocols running on top of FCP
preclude adding a third, or not?
No. With compatible storage and an update to the HBAs,
you can run NVMe side by side with SCSI and FICON if you have it in a data center.
And considering that those directors are on the ground in so many organizations,
whether they're running FICON or just SCSI or some combination,
is this a protocol update, and is this a code update to that hardware?
So when you talk about the directors,
are you talking about the switches themselves,
the fabric itself?
Well, let's say we have Cisco 9509s,
and so we're running, let's say, just SCSI. How do we upgrade that?
So for a switch, there really is no change.
It's data, it's frames.
Now, some switches do have additional
layers that do some traffic analysis and things like that.
You may have to update that to get the new functionality,
but in order to pass, to send the data,
the switch doesn't have to change at all.
So the directors don't have to change it all
in order to send NVMe.
It's really a heavier lift on the HBA
and storage side to make that work.
And another question I have is we have,
now we have the appearance of companies like Vexata here in the Valley with super fast NVMe storage arrays, very unique architectures.
They're running 6 million IOPS with good workloads at 8K block sizes, like unbelievable performance. If we connect two Vexata-type arrays or equivalents over distance, what kind of performance, do you have some comparison?
Because today we're using fiber channel for that.
Okay, so I did not bring any, this is meant to be a tutorial, so I didn't bring any specific vendor performance numbers. One thing I can say, you know, you're not going to
speed up the network storage or network performance by running a different
protocol. The latency in the network is going to be the same no matter what
you're running. What you gain is on the endpoints and you gain a lower latency in the NVMe stack.
And NVMe has a possibility of much deeper queues.
Deeper queues.
You can have 64,000 queues of 64,000 entries.
And that's where in some cases we're seeing
some big performance improvements
because you're able to use the aggregate bandwidth
of the fabric much more effectively.
So you're picking up parallelism?
Exactly, yeah.
And do you have any benchmarks or anything?
I don't have any I can show you right now, sorry.
If you want to talk to me offline,
I can provide you with some pointers.
That is including this slide deck, so.
Thank you.
Any other questions?
Yes.
So I guess what's not making sense in my head is RDMA is pretty much, it reverses
the role of the initiator and target in data transfer.
Because when you're using SCSI, you issue a transfer ready, and then the data gets written out. But in RDMA, it's a read from the target back to the queue.
Doesn't that, I don't know, it seems like it's...
So I'm not sure what your question is.
Well, doesn't that conflict with what's existing in FCP?
It seems to me that...
Isn't that a difficult thing to do
without changing your lower layers?
Well, we're not trying to work like RDMA,
but if you look at the lower layers,
yes, the semantics of setting something up,
where you have to set up your memory regions,
are very different. But once the data starts
going across the wire, it looks the same, so it's just a semantic change,
and then basically the frames look the same. Yeah, you know, of course,
between different protocols you obviously have
different frames, but the data transactions look the same on the wire.
Does that make sense to you?
I'd have to look at it in more detail.
Any other questions?
All right, thank you.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.