Storage Developer Conference - #97: Delivering Scalable Distributed Block Storage using NVMe over Fabrics
Episode Date: June 3, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, Episode 97.
Hi, good morning. Welcome to the session. So, Sujoy and I will be talking about using NVMe over Fabrics to do distributed scale-out storage in this session.
My name is Mohan Kumar. I'm a fellow at Intel, and I've been with Intel for over 25 years.
And we also want to acknowledge our friends and colleagues, Scott and Reddy, who helped us put together this presentation.
So in the previous talk, you've heard about PCIe and what standards can build for you.
On top of PCIe, the other standard that got built was NVMe, which allowed us to do storage very well, in fact, to the extent that any PCIe storage at this point is essentially
NVMe storage, and Devendra showed the numbers
in terms of the CAGR of the growth of those things, right?
There is also distributed scale-out storage.
What do we mean by that?
Things like Ceph, things like Hadoop,
and then there's a bunch of proprietary ones
that are not open source as well.
And there is also another protocol,
which I'll go into, NVMe over Fabrics,
which is the ability to carry NVMe on top of some fabric.
The fabric today is basically Ethernet,
but it could be TCP or RDMA or InfiniBand.
I know they have the definitions for both Ethernet and RDMA fabrics today.
So it's one of those cases where, if you take scale-out storage, NVMe, and NVMe over Fabrics together,
the sum is greater than the parts, essentially.
That's our thesis, and we hope that you will agree with us by the end of this presentation.
Before we jump into it, we want to give you an overview of what NVMe over Fabrics
and what distributed scale-out storage are and what the issues are.
And then Sujoy is going to talk about various options to fix the problems that you have today
with distributed scale-out storage and why what we propose is a better solution for this problem.
So if you think about NVMe over Fabrics, it basically, like I said, builds on a bunch of NVMe drives
connected to storage nodes, right?
And essentially some host can access this over
the network, and it has the
ability to essentially materialize these
as drives.
As far as this host is
concerned, or any of these hosts is concerned,
they look physically connected to them.
It's composed storage, right?
So they,
from a standpoint of their drivers and
everything, and their software,
it looks like it's locally attached to them, right?
They don't know any different.
From NVMe onwards, right?
NVMe, the block layer,
everything sees this as an NVMe device
with a namespace and a certain capacity.
And that capacity is something you can construct
through the NVMe over Fabrics management layer.
And what it allows you to do is take a drive
at the bottom that is, say, one terabyte,
and then essentially split it out to two different hosts,
or go the other way, right?
It allows you to do all those neat things here
in terms of what you're able to materialize.
And the reason for going down this path as opposed to doing a software-based mechanism
is that it's very low latency. It's essentially built on top of, like I said, fabric today
is defined as Ethernet and RDMA. And at one stack level, you're probably looking at like
tens of microseconds latency in order to access it.
And if you want to compare the device latency
of a 4K block being transferred,
if it's anything NAND related,
it's probably in the 70 to 100 microseconds.
So 10 microseconds is 10% of what your actual device latency is.
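To put rough numbers on that arithmetic, here is a minimal back-of-the-envelope sketch in Python, using the approximate figures quoted above rather than measured data:

# Fabric hop overhead as a fraction of NAND device latency for a 4K access.
fabric_us = 10.0                      # approximate added NVMe-oF (RDMA) latency
for device_us in (70.0, 100.0):       # typical NAND 4K access latency range
    print(f"device {device_us:>5.0f} us -> fabric overhead {fabric_us / device_us:.0%}")
# Prints roughly 14% and 10%, i.e. on the order of 10% added latency.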
And it's high performance as well, right,
because there's not a whole lot of overhead.
It allows you to do one-to-one mapping.
Essentially, you can take this drive and completely assign it,
or you can take some storage node entity
and then completely assign it to a particular host.
And then from an initiator standpoint,
this is one of the ways, in my mind,
both PCIe and NVMe and now NVMe Fabric win
is because they have a well-defined mechanism in software
to make this work, right?
So essentially what you do is in the host,
you have the NVMe over Fabrics driver
that sits below the NVMe driver,
and then everything else is just completely transparent to software.
So you don't go around and keep changing your software layers
all the time to access it, right?
And then lastly, it's the theme of this conference,
and I guess it's that it's a standard interface,
so it's got broad adoption, essentially, right?
So if you look at Linux, it's got an NVMe over Fabrics driver.
The other vendors are planning to support
their version of an NVMe over Fabrics driver.
So once you have the driver,
everything just looks and acts like an NVMe device,
and no software above that layer needs to know, right?
So that's on...
So, yeah, that talks to the widespread industry adoption.
I wouldn't say widespread yet.
I mean, it's getting towards that.
Linux definitely has it.
The other OSVs have announced at least plans
for supporting it either this year or next, essentially, right?
But the point is that this will be widely available.
Much of this is today on RDMA,
like I said, RoCE and iWARP.
There is also a definition for doing this on TCP/IP.
Once you go to TCP/IP, your latency will go up,
and you don't have the performance guarantee,
but then the idea is that people want to do
NVMe over Fabrics with things that are, like,
say, a SATA drive, for example, right?
Collect a cluster of SATA drives that they want to
abstract and transport on top using the NVMe protocol.
So then, connecting this to the main topic of today, distributed scale-out storage.
So, think Ceph, for example, right?
Again, the pictures look similar, but I want to show you where they will start diverging as well, right?
You'll have today, as Devendra was pointing out in his earlier session, right?
Many of these are PCIe and NVMe-based SSDs as we go now and into the future.
And these things are attached to storage nodes,
and a bunch of storage nodes.
This is the scale-out portion of it, right?
You can scale your storage capacity
by essentially adding more storage nodes
and more drives into the storage nodes, right?
That's what things like Ceph do.
And then each one has a solution-specific client.
Like here, you have your librados
or your software layer
that essentially understands
your specific storage protocol, right?
If it's Ceph, it's the Ceph client.
If it's some other protocol,
you have some piece of software
that then allows you to get a block allocation
from this thing.
So the concept of taking this hardware with storage
and turning them into a block abstraction
is a combination of this
and then the combination of these green boxes here, essentially.
And it gives you all the virtues that you would like
from any storage software.
It gives you availability
because one of these nodes could go down
because it gets distributed.
It gives you performance by, you know,
having multiple copies
and being able to access them from
whichever copy is the most efficient
for you to get access from.
And they define the protocol for optimizations.
Like I said, in NVMe over Fabrics,
we kind of picked RDMA and now TCP.
But in this case, the vendor gets to define
what protocol they want to use to communicate between,
because it's their client and it's their server.
So both ends, they control,
so they can define the protocol that they do.
And of course, you have both open source like
Ceph, and there's some closed source stuff. I don't want to pick a name.
But there's a bunch of them out there that
do the same. So what's the problem with
this type of a solution? First of all, the problem is that you need to go write
a driver or client software
for each operating system, each hypervisor
that you end up having to support, right?
And then each one has its own solution-specific management, right?
It's not so much at the storage node layer,
but at the client node layer, right?
You need to do management that's very specific
to whatever stack it is that you have signed up for.
And you're bound to that stack.
Yeah, and that's what it's talking to, right?
The host essentially gets very tightly coupled
to your storage service
because you have to deal with
the lifecycle management issues, right?
You upgrade something.
You can't just do isolated upgrades, right?
You have to deal with the client impact of whatever you're doing at the server end.
And essentially, any issue you have at your storage node cluster
extends to your host also, because your host is just as impacted,
because it's got this client software that's fully aware of whatever storage
stack, like Ceph, that's running underneath you.
And then, last but not least, it may take footprint from your host, right?
And this is more with the emergence of cloud and private cloud, if you're running a host and you're running some services,
VMs, containers, functions, you name it,
your primary goal is to essentially optimize that host
for delivering that thing, whatever you're running.
Everything else is essentially infrastructure,
or context, to what you need to deliver.
You don't want your context to consume too much of your resources, right?
And depending on what scale-out storage client you're running,
it's going to take valuable resources away from
whatever you want to use it for, your application layer,
your VMs, containers, and so on, right?
And then there is a push these days to essentially just solve the problem of, you know, I
bought the host to run some application.
I run my VMs or containers on it.
So I really don't want my host to be doing my contextual stuff.
So what do I do?
I take my contextual stuff and offload them to what's called the infrastructure accelerator,
right?
And FPGAs and SoCs are emerging to fill that gap,
and large cloud companies are doing their own, right?
But in order to, if you're going to do something like that,
then it needs to be fairly small footprint
because otherwise this is not going to work out, right?
If you're going to say,
let's say there are three scale-out storage stacks
that you want to support, then for all three of them,
you need to expose whatever abstraction they have.
And their storage software has to run
in your infrastructure accelerator.
Now you're essentially multiplying the problem
to the point where the accelerator part
is no longer true, because it's going to have a hard time.
It's basically becoming a host in itself.
So that's kind of the situation that we are in right now
in terms of where we have this valuable technology
and the status quo on the scale-out storage software.
And the question is, what do we go and do, essentially,
to bridge the gap and take some of the pieces
that we have in standardized solutions
and convert them into a solution for this problem?
And to give you more detail, I'll
let Sujoy cover the rest of this. Thanks, Mohan.
So as Mohan mentioned, I'm Sujoy Sen.
I work at Intel as well,
and I've been focusing on storage and I.O.
disaggregation and pooling in general,
technologies of pooling over fabrics,
over Ethernet primarily, but other fabrics as well.
So Mohan sort of set the stage
for what the problem is that we're trying to solve,
which is really try and provide
a standards-based interface,
an NVMe specifically,
to various storage services that exist, right?
We're sort of targeting scale-out storage
as a good poster child for this
because, you know, that seems to be emerging
as a class of storage service
that's getting used quite a bit.
But, you know, there could be
another set of storage services as well.
But what we really want to do
is provide a standard interface
in front of it
to solve all the issues that Mohan brought up
in the last foil.
So I'll maybe spend a couple of slides on
what can be done today with things that are available today,
what some folks are doing today, and then some of the issues related to that, and then
get into what our particular proposal is to solve this.
So obviously if you want to put a standard interface or some interface in front of something that doesn't support that natively,
you do what any good computer scientist does.
You employ indirection, right?
So you introduce, in this case,
you introduce the concept of gateway
that exposes NVMe, you know, to the host,
uses NVMe fabric, or it could be iSCSI,
you know, if that's the desired end state,
to a gateway node, and that gateway
has the custom client that can go
and talk to the actual storage service below, right?
So you get the benefits of a standard client or the host,
the footprint associated with it,
and more importantly, it decouples this with this.
Decouples the host from the storage service
as much as possible.
So it brings in all those goodness, but then of course you increase latency, so
you reduce performance, you have extra hops. The gateway can become a bottleneck depending
on your workload and IO patterns. And of course your management complexity increases because now you have one more extra component to manage
so all good
depending on what you want
that may work
so next thing you can do is
well I'll add multiple gateways
so that alleviates my bottleneck at the gateway
and I can now use, you know,
load balancing and other techniques
to map my volumes on the host
and distribute them to different gateways
and let each gateway handle a subset of the volumes.
This works really well to support
a large number of hosts and a large number of volumes.
Of course, you still have the extra hop latency issue, right?
Because you're still going through a gateway to get to
these, to your actual storage services.
But depending on, again, your workload and your volume access
patterns, it is still possible for a particular gateway to get
overloaded.
Right? And then you can bring in orchestration complexity, use telemetry to
sort of move these assignments, and you can get really sophisticated with
what you're doing. But at the end of the day, that just increases
more complexity in your management. And because
you're adding multiple gateways,
physical or virtual doesn't really matter.
It does add cost to the solution.
So the next obvious step is, well,
why have separate gateways?
Why not just integrate them into the storage nodes themselves?
And that works.
You know, you get rid of one layer of, you know, machines.
Again, physical or virtual doesn't matter.
But your latency really doesn't change, right?
The extra hop latency you still incur
because at the end of the day,
all the IOs to a particular volume
end up on one particular node first before getting distributed, you know, into the storage layer itself.
And again, that bottleneck that was occurring here on, you know, when we had separate gateways simply moves down to the storage node.
And in many cases, the storage node is CPU heavy, right?
And anything you take away,
any resources you take away from the storage node
in processing this gateway functionality
is basically directly affecting your storage performance
that you deliver to your clients.
So what would be nice
is if we had this integrated gateway concept,
but it was distributed, right?
This goes back to the original DSS picture
that Mohan showed, which was,
hey, you know, I have a volume.
I want to get to the correct node right in the beginning,
right from the get-go, right?
But instead of having this custom sort of client here,
I want this to be NVMe, right?
This is really the end goal that we're trying to achieve.
Which is good.
The question is, what does that mean?
The first thing it means is it requires these volumes
to be aware of multiple targets, right?
Today with NVMe Fabric, multipath notwithstanding,
primarily it is a one-to-one relationship, right?
One volume is mapped completely to one target,
to one subsystem.
You can have multiple subsystems.
The idea, though, is that's really for multipath, right?
It's not really aware of a volume mapping with
different LBA ranges being mapped to different nodes.
And that's one thing that, you know,
that's one change that's required, right?
The second one is, well, how does this guy decide,
the client decide how to place the data, right?
Because the storage service that runs here,
there's various, you know, as Mohan said,
there's open source versions, there's proprietary versions.
So it's very hard for this, the NVMe interface here
and the NVMe device here to support all of them,
you know, using one scheme.
So the placement scheme that this needs to use
has to be extensible, right?
So it's something that can be extended
to support multiple storage backends here.
The other thing that really is needed is,
you know, this architecture will decouple, you know,
the NVMe with the storage nodes here,
but the problem is management, right?
Every time something changes down here because of either a failure
that, you know, where, you know, it's trying to recover from a disk failure or node failure
and it's changing its placement around, or it's trying to load balance because of performance,
which all of these scale-out storages do, you're affecting the interface up here at the host.
And that brings in management complexity.
So this extension that we're talking about in terms of placement really needs to be,
you know, to have, you know, the least amount of management intrusion, right?
So that's really the end state we are looking for.
Now, the question is, how do we get there?
So we, you know, what we thought was,
let's come up with a solution
that gets to that end state
using NVMe and NVMe Fabric as it stands today, right?
And basically, if you remember, we need NVMe over Fabrics to be aware of multiple targets.
We need an extensible way to place data, to figure out where data goes.
And we need that management complexity to go away so that these two can truly be decoupled.
So for the last part, we want it to be self-learning, right?
So we sort of took a leaf out of the, you know,
Internet routing world,
where even if you get a route that's not exactly accurate,
packets still get routed,
but eventually the router learns.
So we sort of want to take that concept
and apply it to this solution.
So what we're doing is we're proposing
the two concepts of a redirector and hints
into this NVMe fabric sort of solution.
A redirector basically does similar things on the initiator side
and on the target side.
On the initiator side, a redirector, you know, gets the IO
and figures out, for this IO and for this namespace,
where it should go.
And it looks up some table,
which we call the hint table here,
to figure that out, right?
When a target sort of gets an I.O.,
the target redirector figures out
if it should service it itself,
because it's sort of, it's a redirector that's specific
to the storage service,
or if it doesn't own it,
should it just send it to somebody else
who does own it?
But the main thing is,
even if the I.O. goes to the wrong target redirector,
that target always has to complete the I.O.
So there's no error back to the initiator.
So that way, that ensures, you know,
the weakest coupling, if you will, right?
The storage service is free to move things around
without the initiator getting impacted.
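A rough sketch in Python of that target-side behavior (the class, its injected lookup and backend objects, and the hint queue are hypothetical placeholders, not part of any specification or the talk's implementation):

# Hypothetical target-side redirector: always complete the I/O, even when this
# node does not own the LBA, and send a hint back so the initiator learns.
class TargetRedirector:
    def __init__(self, node_id, owner_lookup, backend, peers):
        self.node_id = node_id
        self.owner_lookup = owner_lookup   # storage-service-specific placement lookup
        self.backend = backend             # local drives this node owns
        self.peers = peers                 # connections to the other storage nodes

    def handle_io(self, io):
        owner = self.owner_lookup(io.namespace, io.lba)
        if owner == self.node_id:
            result = self.backend.submit(io)        # service it locally
        else:
            result = self.peers[owner].submit(io)   # proxy to the owning node
            io.hints_out.append((io.lba, owner))    # teach the initiator for next time
        return result                               # never error just for being misdirected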
You can...
So even if the initiator gets it wrong,
the storage service, the redirector,
will take care of it, right, and complete the I.O.
So in this case, the I.O. one flows through.
The initiator doesn't know actually where it should go to.
It sends to a default redirect target.
This guy sends it to the right one, completes it.
And then the second concept of hint comes into the picture, where now the
first redirector sends a hint back to the initiator saying, hey, for this IO, this is
really where you should be going. And the initiator learns from that hint, populates
its hint table, however it wants to implement it. Right? So the next time an IO comes to that LBA or that range,
it goes to the right target.
So with this scheme,
you know, you basically have an initiator
that eventually learns where, you know,
learns about the placement, right?
That's the basic idea.
And, you know, the storage service
is sort of free to do what it's doing, right?
Because once it gets an I.O.,
it services the I.O. as if it just came from its clients.
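To make the initiator side of that concrete, here is a minimal sketch in Python of a hint table that learns placements; the class and field names are my own, not from the NVMe specification or any implementation, and real hints would arrive as log-page data rather than Python objects.

# Hypothetical initiator-side redirector: unknown LBAs go to a default target,
# which completes the I/O anyway; returned hints are learned so later I/Os to
# the same range go straight to the owning node.
from dataclasses import dataclass

@dataclass
class LocationHint:
    lba_start: int
    lba_end: int        # exclusive
    target: str         # e.g. an NVMe-oF subsystem address

class InitiatorRedirector:
    def __init__(self, default_target: str):
        self.default_target = default_target
        self.hints: list[LocationHint] = []

    def resolve(self, lba: int) -> str:
        for hint in self.hints:
            if hint.lba_start <= lba < hint.lba_end:
                return hint.target
        return self.default_target      # no hint yet: default redirect target

    def learn(self, hint: LocationHint) -> None:
        self.hints.append(hint)         # called when a target sends a hint back

# First I/O to LBA 4096 goes to the default; after the hint arrives, the same
# LBA resolves directly to the owning node.
r = InitiatorRedirector(default_target="node-0")
assert r.resolve(4096) == "node-0"
r.learn(LocationHint(lba_start=0, lba_end=1 << 20, target="node-2"))
assert r.resolve(4096) == "node-2"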
So then the question is, what do the hints look like, right?
So as I was saying, that we want to be able to,
these hints are this primary mechanism by which a host is told
where data is placed, right?
And this needs to be extensible, right?
It needs to support a wide range of placement schemes
that, you know, different solution stacks provide,
as well as it needs to be extensible
for new things that are going to show up, right?
So what we sort of came up with is three categories,
and this, of course, I suspect more will be added to this,
but three categories of hints to take care of different kind of known backends today.
So you have, you know,
sort of simple pairwise backends, right,
where, you know, you're basically mapping a volume
across just two nodes and replicating it, right,
to where you have slightly more sophisticated placements
where, you know, extents are taken
from a set of nodes to create a volume,
and then within that, replication is done,
or erasure coding is done.
Then you have backends that do
sort of more RAID-like striping, right?
And then, of course, you have backends like Ceph
that do algorithmic, hash-based placement, right?
And so what we thought was,
if you categorize all of that,
there are sort of three schemes of hints, right,
that we can support.
One is the simple hint, which is really a range of LBAs mapped to a set of targets,
with reads and writes handled separately. And again, a set of targets,
because for reads, you may want to give a priority, or a set of, you know, targets that the
reads can go to if you want to parallelize the
reads. But for writes, maybe you want to give
an ordered list of targets because of the primary
replica versus secondary. So that's what a simple
hint does. It takes care of extent-based mapping. It
takes care of pairwise HA sort of solutions as well.
Right?
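As a small extension of the earlier sketch (again with made-up field names), a simple hint could carry separate read and write target lists like this:

# Hypothetical simple hint: one LBA range, read candidates that can be chosen
# freely or in parallel, and an ordered write list (primary replica first).
from dataclasses import dataclass

@dataclass
class SimpleHint:
    lba_start: int
    lba_end: int               # exclusive
    read_targets: list[str]    # any of these can serve a read
    write_targets: list[str]   # ordered: primary first, then secondaries

hint = SimpleHint(0, 1 << 20,
                  read_targets=["node-5", "node-2"],
                  write_targets=["node-2", "node-5"])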
The second one is striping hints, which are, as I said, backends that would do, that would
support, you know, things like, you know, RAID 0, for example.
Right? Again, here the idea is that you give sort of an LBA range,
what extents it maps to, what kind of striping group it's part of, what's the stripe size,
and that allows the initiator to just calculate exactly where a particular LBA, you know, access needs to land.
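A minimal sketch in Python of that calculation, assuming a plain RAID-0-style layout (the stripe parameters and node names are made up for illustration):

# Given a striping hint (stripe size in blocks, ordered list of extent targets),
# the initiator can compute which target a logical block lands on by itself.
def striped_target(lba: int, stripe_blocks: int, targets: list[str]) -> tuple[str, int]:
    stripe_index = lba // stripe_blocks                # which stripe unit the LBA is in
    target = targets[stripe_index % len(targets)]      # round-robin across the extents
    local_lba = (stripe_index // len(targets)) * stripe_blocks + lba % stripe_blocks
    return target, local_lba                           # owning node and LBA within it

# Example: 64 KiB stripe units (16 blocks of 4 KiB) across three nodes.
print(striped_target(lba=40, stripe_blocks=16, targets=["node-0", "node-1", "node-2"]))
# -> ('node-2', 8): block 40 is in stripe unit 2, which lives on node-2.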
And then the last is the hashing hint,
which is where the hint really comprises the hashing function that needs to be used
and the way an object name is derived, because usually,
especially if you take something like Ceph or Gluster,
you know, from the volume name and the LBA, you derive an object name, an object ID,
you calculate a hash function on it,
and then you go look up a table, basically a hash
bucket table, to figure out which node it needs to go to. And all of that is embodied
into the hashing hint, right? So what kind of chunk size this scheme uses, right? What is the
object name format that this scheme uses? Predefine a bunch of hashing functions that are common,
and I'm sure that can be extended.
And a hashing table location where the actual lookup
is for the actual node.
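Here is a rough Python sketch of that kind of lookup; the object-name format, chunk size, hash choice, and bucket table are hypothetical stand-ins, not Ceph's actual CRUSH placement or anything from a specification.

# Hypothetical hashing-hint lookup: derive an object name from volume + LBA,
# hash it, and map the hash into a bucket table that names the owning node.
import hashlib

CHUNK_BLOCKS = 1024                                   # chunk size the scheme uses (assumed)
BUCKETS = ["node-0", "node-1", "node-2", "node-3"]    # hash bucket -> node table

def hashed_target(volume: str, lba: int) -> str:
    chunk = lba // CHUNK_BLOCKS
    obj_name = f"{volume}.{chunk:016x}"               # assumed object-name format
    digest = hashlib.sha1(obj_name.encode()).digest()
    bucket = int.from_bytes(digest[:4], "little") % len(BUCKETS)
    return BUCKETS[bucket]

# Every initiator with the same hint parameters computes the same placement,
# so most I/Os go straight to the right node; a location hint can still
# override this when the back end has temporarily moved a chunk around.
print(hashed_target("vol1", lba=123456))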
And these three together, essentially we think,
depending on which one you get to use,
or you can combine the two.
You can do a hashing hint,
but sometimes you get the wrong node
because the back end is changing things around.
So you can add a simple hint,
a specific location hint on top of a hashing hint.
That'll take precedence, right?
So you can do these things to always minimize your latency,
get to the right node as quickly as possible,
yet allow this thing to learn
as the storage service is expanding, contracting,
changing its placement. So, if you have to do
this, you know, what are the changes to NVMe over
Fabrics that need to be done? So our first premise
really is that we want to, you know,
we want to reuse the existing elements of NVMe fabrics, right?
As far as the protocol and the architecture are concerned,
we try not to introduce any new element, anything radically different, right?
And that's what we set about doing, trying to see, can we just do this
using existing elements?
That doesn't mean existing implementations
will not have to be changed.
Of course, you have to implement a redirector.
You have to be able to implement, you know,
paying attention to the hints,
using the hints to go to the right place.
But at least from an NVMe fabric protocol standpoint,
we get to use as much as, you know,
existing elements that it already provides.
Also, it was important that legacy initiators
continue to work with storage services
that have these redirector capabilities.
So what I mean by legacy initiators is, you know,
we all know that once you have a standard out there,
and Devendra talked a lot about, you know,
compliance and interoperability.
NVMe and NVMe over Fabrics have certainly been taken up in a big way.
Lots of products out there.
Lots of native support from operating systems
and hypervisors.
So there will be initiators out there,
and when you deploy a storage solution with this capability,
it still needs to work with those existing initiators.
So we want to make sure this is backward compatible,
and basically what makes it backward compatible
is the fact that, you know,
an initiator that doesn't know about,
doesn't pay attention to hints
can still send an I.O. to what it thinks
is the right node,
and that node will complete the I.O.
It'll figure out where the I.O. actually belongs to,
send to that node, get the results, and send it back
at the expense of performance,
but you'll continue getting functional capability.
That's why it was even more important
that we use the existing NVMe over Fabrics elements.
So there are three sort of things we need to worry about. One is how do we, what are the hints? How
do we represent hints? How do we notify hints? And how do we know that a particular NVMe fabric
initiator or target, but mostly it applies to target systems,
is capable of this functionality.
So the first thing, how do we represent hints?
Well, we figured log pages are a good way to do this.
Right?
NVMe already defines the concept of log pages,
both standard log pages as well as you can do vendor-defined log pages.
And that's a good way to basically deliver hints.
So all of those hints I talked about
can map to different log pages and different formats.
Of course, things like when somebody's reading a log
page it needs to be consistent, as a log
page might, you know, be read in multiple chunks.
So things like that have to be taken into
account. But the concept of log pages can be
used for this.
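As a rough illustration, continuing the Python sketches with an entirely made-up layout (not a format defined by NVMe or proposed in the talk), a batch of simple hints could be serialized into a vendor-defined log page something like this:

# Hypothetical vendor-defined log page carrying location hints. Assumed layout:
# a 4-byte entry count, then per entry a 64-bit start LBA, a 64-bit end LBA,
# and a 16-byte target identifier, all little-endian.
import struct

ENTRY = struct.Struct("<QQ16s")

def pack_hint_log_page(hints: list[tuple[int, int, str]]) -> bytes:
    page = struct.pack("<I", len(hints))
    for start, end, target in hints:
        page += ENTRY.pack(start, end, target.encode().ljust(16, b"\0"))
    return page

def unpack_hint_log_page(page: bytes) -> list[tuple[int, int, str]]:
    (count,) = struct.unpack_from("<I", page, 0)
    hints, offset = [], 4
    for _ in range(count):
        start, end, target = ENTRY.unpack_from(page, offset)
        hints.append((start, end, target.rstrip(b"\0").decode()))
        offset += ENTRY.size
    return hints

page = pack_hint_log_page([(0, 1 << 20, "node-2")])
assert unpack_hint_log_page(page) == [(0, 1 << 20, "node-2")]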
The second thing is how do we notify that, you
know, there is a hint, you know, that hint propagation
thing that I showed earlier. AERs are a good
way to go do that. You know, your asynchronous
event requests, that's already supported by NVMe over
Fabrics today. So you can have an AER outstanding
for the particular log page that you're looking for.
So the initiator can send an AER,
and whenever there's a log page
that affects that initiator,
the target can send a completion back,
and that causes the initiator to come back
and read the log page.
So that scheme should... We should be able to use that scheme to do the notification.
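A very rough sketch of that loop on the initiator side; the controller handle and its send_aer / read_log_page calls are hypothetical placeholders for whatever driver interface actually issues those commands, and the helpers come from the earlier sketches.

# Hypothetical notification loop: keep an AER outstanding; when one completes
# for the hint log page, read the page, learn the hints, and re-arm.
HINT_LOG_PAGE_ID = 0xC0   # assumed vendor-specific log page identifier

def hint_notification_loop(ctrl, redirector):
    while True:
        event = ctrl.send_aer()                    # blocks until an async event completes
        if event.log_page_id != HINT_LOG_PAGE_ID:
            continue                                # not a hint notification; re-arm
        page = ctrl.read_log_page(HINT_LOG_PAGE_ID)
        for start, end, target in unpack_hint_log_page(page):
            redirector.learn(LocationHint(start, end, target))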
And then capability discovery is probably the easiest.
We think the supported capabilities and the get features commands, you know, we should be able to add bits to those to allow
an initiator to discover redirector-capable targets.
Until that's there, you know, you can use a
whitelist or some other schemes to sort of at least get the ball rolling.
So, in summary,
and Mohan sort of started with this, right,
what we're really saying is that
any distributed storage service, and as it was
probably obvious, I mean, we're focusing on distributed storage services, but we're really
looking at any storage service, can benefit from a standards-based interface into the host, right?
I think, you know, SNIA has been doing,
you know, that's really what SNIA has been driving a lot,
standards-based interfaces.
From the earlier talk, you know,
we know that with standards,
there's innovation that we can leverage from a large body of work.
So what we really want is to provide a standards-based interface
to any storage service, right?
And we believe any storage service
will benefit from this approach, right?
So once they develop their service,
they have a ready-made ecosystem
that they can just plug into,
and they'll be able to deliver their service
straight to the host right away.
NVMe, of course, we feel is the ideal interface.
It's, you know, it's obviously gotten a lot of ground.
It's got a lot of ecosystem support already.
And so that seems like the right place to be.
And because now there is a network component,
a fabric component to this,
that we believe NVMe Fabric fits that bill naturally.
It already has the elements needed to make such a scheme work.
So in total, between NVMe, NVMe Fabric,
and marrying it to any distributed storage service,
we feel that we can deliver sort of the best experience
to a customer as far as storage is concerned.
And, you know, the table here
kind of just captures everything that we said
in the last 40, 45 minutes or so,
which is you have a distributed storage service today,
and it suffers from a lack of standard host interface.
It, of course, delivers, you know, good performance, right?
And, of course, the gateways aren't applicable to it.
Then you add, you know, the gateway, a single gateway,
distributed gateways,
and gateways exist with, you know, AWS gateways,
Azure gateways, other gateways,
but they bring the goodness of a standard host interface,
but in terms of availability and performance,
you know, they basically bring some, you know,
badness to the original distributed storage solution.
With the self-learning NVMe fabric solution,
you know, in general, we expect that
you restore the goodness of DSS,
but you bring in the goodness of a standard interface.
Yeah, so thank you.
Questions?
Sorry, yeah, go ahead.
How does stuff like reservations,
ANA, security,
something like that,
work?
So the question is,
how do things like reservations
and something like, how will that be supported here?
Well, if you notice, from an NVMe standpoint, it is still, from a host model, it looks like the same, right?
You have a volume that is surfaced on the host.
You send reservations to it. You need to have the NVMe fabric target support reservations to begin with.
Even today, if you don't have the scheme,
it always gets to a target first
and then gets to the right node in the storage service.
So if the storage service supports reservations,
the reservation just flows through.
This doesn't add any new complexity there.
Sorry, let me just get your question.
A lot of what you describe
is software-defined storage architecture
that's already implemented by a lot of upper layers,
a lot of systems.
Who do you see actually shipping the things you propose here?
You seem like you're trying to push it
way down to the bottom of the stack.
Is this devices, appliances, operating systems?
What?
Yeah, I think the...
What we're trying to do is, I think, as we said,
push it down to...
to add the NVMe layer
as far as the operating system is concerned, right?
So you don't need anything on the host that's, you know,
higher than that that's managing this, right?
Yeah, I see.
And have the storage service that's typically running elsewhere
not worry about how that's getting delivered to the host
as long as they have the right thing.
That's sort of the basic idea.
One second.
Let me...
So one thing is that we see,
if you think of scale-out storage,
I mean, they're solving a different set of problems really well,
but, you know, if you look at a scale-out storage software,
the volume management, snapshot,
all kinds of functions that are built into it,
and they do really, really, really well.
Right.
But then on the client side,
they feel obligated to put in a client interface
that goes with their back end, essentially.
And then NVMe Fabric, if you look at it, essentially says, I'm going to help you
disaggregate storage, but then it doesn't do any of these things that you would expect any storage
service to do. So what we're saying is essentially the hybrid approach basically lets you
keep your storage back end, but then provide you an NVMe front end. So
the host burden goes away. The host gets
a standards-based solution.
And essentially, it's pieces of software that you create on top of the backend
to essentially provide the translation and, more importantly, the management mapping.
There is this management layer that has to say,
I'm going to create a volume, take that volume, map it into an NVMe-based thing.
Now it shows up as NVMe, and then the NVMe management or fabric management takes over.
I still don't fully understand who's going to provide all those things.
The devices, the gateways, what the heck.
So, for example, let's take Ceph, right, because it's open.
I can see where Ceph provides this as a variation of Ceph, essentially, that does this.
So it comes from the storage nodes up.
So Red Hat will ship this.
Yeah.
Yeah, so you basically...
I mean, somebody has to develop
on the storage nodes
the right redirectors, right,
with NVMe fabric on top.
But on the host side,
it will be standard NVMe initiator drivers
that will have to support this.
Right.
Yeah, question.
Something similar to the QT question.
Not only from a sanitizing perspective,
but from write-locking,
I mean, TCG Opal, that type of thing.
How does it work in this...
Yeah.
So this doesn't change any of that model
because, you know, again,
if, you know, from a host standpoint,
you see a NVMe device,
and if you did TCG in Opal,
if you expected Opal support
and you were using it as Opal,
you still get to use it,
because that just gets to, again,
the storage service.
And the storage service, of course,
has to support, you know,
the Opal standard.
The targets have to support the Opal standard,
which is true today.
I mean, for any NVMe over Fabrics solution,
the targets have to support the Opal standard.
And so this just brings that to the table.
There's nothing...
The basic device model
that NVMe and NVMe over Fabrics bring
to the host and the application,
whether it's reservations or Opal,
doesn't change.
Ultimately, it has to be supported by the back end
regardless of how you deliver it.
Great question.
So would this require any changes to the
NVMe over Fabrics specification?
Yeah, so we don't...
So, like I said, all the elements exist.
You know, log pages, AER, you know, supported feature bits.
So, you know, we are prototyping with, obviously,
just vendor-defined things right now.
Once all of this gets sort of worked out,
we can standardize the log page information, for example, right?
You know, like, okay, log page XYZ is for hashing hint,
and this is the hashing hint format, right?
It doesn't change the protocol of NVMe over Fabrics.
It's the management portion of it that's going to change.
Question?
It still doesn't enable multipathing.
Well, current multipathing would work.
You know, native NVMe multipathing that's now in the kernel
will continue working.
Sorry,
question?
Sorry, can you repeat the question?
Yeah, I think one of the main things we're trying to avoid is, especially to support large scale,
is having central management as much as possible.
So all the goodness that a distributed storage system brings is that.
And so if you distribute it, you know,
we think, you know, from a self-healing standpoint,
it's going to scale better.
But, yeah, you can implement something
that's more centralized as well.
I mean, I don't think this is mandating
a particular kind of implementation.
Because the hints can come from, like Mohan said,
can be a management solution, right?
Any other questions?
Well, thank you very much for your time.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers
in the storage developer community. For additional information about the storage developer conference,
visit www.storagedeveloper.org.