Storage Developer Conference - #97: Delivering Scalable Distributed Block Storage using NVMe over Fabrics
Episode Date: June 3, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, Episode 97.
Hi, good morning. Welcome to the session. So, Sujoy and I will be talking about using NVMe over Fabrics to do distributed scale-out storage in this session.
My name is Mohan Kumar. I'm a fellow at Intel, and I've been with Intel for over 25 years.
And we also want to acknowledge our friends and colleagues, Scott and Reddy, who helped us put together this presentation.
So in the previous talk, you've heard about PCIe and what standards can build for you.
On top of PCIe, the other standard that got built was NVMe, which allowed us to do storage very well, in fact, to the extent that any PCIe storage at this point is essentially
NVMe storage, and Devendra showed the numbers
in terms of the CAGR of the growth of those things, right?
There is also distributed scale-out storage.
What do we mean by that?
Things like Ceph, things like Hadoop,
and then there's a bunch of proprietary ones
that are not open source as well.
And there is also another protocol,
which I'll go into, NVMe over Fabrics,
which is the ability to carry NVMe on top of some fabric.
The fabric today is basically Ethernet,
but it could be TCP or RDMA or InfiniBand.
I know they have the definitions for both Ethernet and RDMA fabrics today.
So it's one of those cases where, if you take scale-out storage, NVMe, and NVMe over Fabrics together,
the sum is greater than the parts, essentially.
That's our thesis, and we hope that you will agree with us by the end of this presentation.
Before we jump into it, we want to give you an overview of what NVMe over Fabrics
and what distributed scale-out storage are and what the issues are.
And then Sujoy is going to talk about various options to fix the problems that you have today
with distributed scale-out storage and why what we propose is a better solution for this problem.
So if you think about NVMe over Fabrics, it basically, like I said, builds on a bunch of NVMe drives
connected to storage nodes, right?
And essentially some host can access this over
the network, and it has the
ability to essentially materialize these
as drives.
As far as this host is
concerned, or any of these hosts is concerned,
they look physically connected to them.
It's composed storage, right?
So they,
from a standpoint of their drivers and
everything, and their software,
it looks like it's locally attached to them, right?
They don't know any different.
From NVMe onwards, right?
NVMe, the block layer,
everything sees this as an NVMe device
with a namespace and a certain capacity.
And that capacity is something you can construct
through the NVMe over Fabrics management layer.
And what it allows you to do is take a drive
at the bottom that is, say, one terabyte,
and then essentially split it out to two different hosts,
or go the other way, right?
It allows you to do all those neat things here
in terms of what you're able to materialize.
And the reason for going down this path as opposed to doing a software-based mechanism
is that it's very low latency. It's essentially built on top of, like I said, fabric today
is defined as Ethernet and RDMA. And at one stack level, you're probably looking at like
tens of microseconds latency in order to access it.
And if you want to compare the device latency
of a 4K block being transferred,
if it's anything NAND related,
it's probably in the 70 to 100 microseconds.
So 10 microseconds is 10% of what your actual device latency is.
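To put rough numbers on that arithmetic, here is a minimal back-of-the-envelope sketch in Python, using the approximate figures quoted above rather than measured data:

# Fabric hop overhead as a fraction of NAND device latency for a 4K access.
fabric_us = 10.0                      # approximate added NVMe-oF (RDMA) latency
for device_us in (70.0, 100.0):       # typical NAND 4K access latency range
    print(f"device {device_us:>5.0f} us -> fabric overhead {fabric_us / device_us:.0%}")
# Prints roughly 14% and 10%, i.e. on the order of 10% added latency.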
And it's high performance as well, right,
because there's not a whole lot of overhead.
It allows you to do one-to-one mapping.
Essentially, you can take this drive and completely assign it,
or you can take some storage node entity
and then completely assign it to a particular host.
And then from an initiator standpoint,
this is one of the ways, in my mind,
both PCIe and NVMe and now NVMe Fabric win
is because they have a well-defined mechanism in software
to make this work, right?
So essentially what you do is in the host,
you have the NVMe over Fabrics driver
that sits below the NVMe driver,
and then everything else is just completely transparent to software.
So you don't go around and keep changing your software layers
all the time to access it, right?
And then lastly, it's the theme of this conference,
and I guess it's that it's a standard interface,
so it's got broad adoption, essentially, right?
So if you look at Linux, it's got an NVMe over Fabrics driver.
The other vendors are planning to support
their version of an NVMe over Fabrics driver.
So once you have the driver,
everything just looks and acts like an NVMe device,
and no software above that layer needs to know, right?
So that's on...
So, yeah, that talks to the widespread industry adoption.
I wouldn't say widespread yet.
I mean, it's getting towards that.
Linux definitely has it.
The other OSVs have announced at least plans
for supporting it either this year or next, essentially, right?
But the point is that this will be widely available.
Much of this is today on RDMA,
like I said, RoCE and iWARP.
There is also a definition for doing this on TCP/IP.
Once you go to TCP/IP, your latency will go up,
and you don't have the performance guarantee,
but then the idea is that people want to do
NVMe over Fabrics with things that are, like,
say, a SATA drive, for example, right?
Collect a cluster of SATA drives that they want to
abstract and transport on top using the NVMe protocol.
So then, connecting this to the main topic of today, distributed scale-out storage.
So, think Ceph, for example, right?
Again, the pictures look similar, but I want to show you where they will start diverging as well, right?
You'll have today, as Devendra was pointing out in his earlier session, right?
Many of these are PCIe and NVMe-based SSDs as we go now and into the future.
And these things are attached to storage nodes,
and a bunch of storage nodes.
This is the scale-out portion of it, right?
You can scale your storage capacity
by essentially adding more storage nodes
and more drives into the storage nodes, right?
That's what things like Ceph do.
And then each one has a solution-specific client.
Like here, you have your librados
or your software layer
that essentially understands
your specific storage protocol, right?
If it's Ceph, it's the Ceph client.
If it's some other protocol,
you have some piece of software
that then allows you to get a block allocation
from this thing.
So the concept of taking this hardware with storage
and turning them into a block abstraction
is a combination of this
and then the combination of these green boxes here, essentially.
And it gives you all the virtues that you would like
from any storage software.
It gives you availability
because one of these nodes could go down
because it gets distributed.
It gives you performance by, you know,
having multiple copies
and being able to access them from
whichever copy is the most efficient
for you to get access from.
And they define the protocol for optimizations.
Like I said, in NVMe over Fabrics,
we kind of picked RDMA and now TCP.
But in this case, the vendor gets to define
what protocol they want to use to communicate between,
because it's their client and it's their server.
So both ends, they control,
so they can define the protocol that they do.
And of course, you have both open source like
Ceph, and there's some closed source stuff. I don't want to pick a name.
But there's a bunch of them out there that
do the same. So what's the problem with
this type of a solution? First of all, the problem is that you need to go write
a driver or client software
for each operating system, each hypervisor
that you end up having to support, right?
And then each one has its own solution-specific management, right?
It's not so much at the storage node layer,
but at the client node layer, right?
You need to do management that's very specific
to whatever stack it is that you have signed up for.
And you're bound to that stack.
Yeah, and that's what it's talking to, right?
The host essentially gets very tightly coupled
to your storage service
because you have to deal with
the lifecycle management issues, right?
You upgrade something.
You can't just do isolated upgrades, right?
You have to deal with the client impact of whatever you're doing at the server end.
And essentially, any issue you have at your storage node cluster
extends to your host also, because your host is just as impacted,
because it's got this client software that's fully aware of whatever storage
stack, like Ceph, that's running underneath you.
And then, last but not least, it may take footprint from your host, right?
And this is more with the emergence of cloud and private cloud, if you're running a host and you're running some services,
VMs, containers, functions, you name it,
your primary goal is to essentially optimize that host
for delivering that thing, whatever you're running.
Everything else is essentially infrastructure,
or context, to what you need to deliver.
You don't want your context to consume too much of your resources, right?
And depending on what scale-out storage client you're running,
it's going to take valuable resources away from
whatever you want to use it for, your application layer,
your VMs, containers, and so on, right?
And then there is a push these days to essentially just solve the problem of, you know, I
bought the host to run some application.
I run my VMs or containers on it.
So I really don't want my host to be doing my contextual stuff.
So what do I do?
I take my contextual stuff and offload them to what's called the infrastructure accelerator,
right?
And FPGAs and SoCs are emerging to fill that gap,
and large cloud companies are doing their own, right?
But in order to, if you're going to do something like that,
then it needs to be fairly small footprint
because otherwise this is not going to work out, right?
If you're going to say,
let's say there are three scale-out storage stacks
that you want to support, then for all three of them,
you need to expose whatever abstraction they have.
And their storage software has to run
in your infrastructure accelerator.
Now you're essentially multiplying the problem
to the point where the accelerator part
is no longer true, because it's going to have a hard time.
It's basically becoming a host in itself.
So that's kind of the situation that we are in right now
in terms of where we have this valuable technology
and the status quo on the scale-out storage software.
And the question is, what do we go and do, essentially,
to bridge the gap and take some of the pieces
that we have in standardized solutions
and convert them into a solution for this problem?
And to give you more detail, I'll
let Sujoy cover the rest of this. Thanks, Mohan.
So as Mohan mentioned, I'm Sujoy Sen.
I work at Intel as well,
and I've been focusing on storage and I.O.
disaggregation and pooling in general,
technologies of pooling over fabrics,
over Ethernet primarily, but other fabrics as well.
So Mohan sort of set the stage
for what the problem is that we're trying to solve,
which is really try and provide
a standards-based interface,
an NVMe specifically,
to various storage services that exist, right?
We're sort of targeting scale-out storage
as a good poster child for this
because, you know, that seems to be emerging
as a class of storage service
that's getting used quite a bit.
But, you know, there could be
another set of storage services as well.
But what we really want to do
is provide a standard interface
in front of it
to solve all the issues that Mohan brought up
in the last foil.
So I'll maybe spend a couple of slides on
what can be done today with things that are available today,
what some folks are doing today, and then some of the issues related to that, and then
get into what our particular proposal is to solve this.
So obviously if you want to put a standard interface or some interface in front of something that doesn't support that natively,
you do what any good computer scientist does.
You employ indirection, right?
So you introduce, in this case,
you introduce the concept of gateway
that exposes NVMe, you know, to the host,
uses NVMe fabric, or it could be iSCSI,
you know, if that's the desired end state,
to a gateway node, and that gateway
has the custom client that can go
and talk to the actual storage service below, right?
So you get the benefits of a standard client or the host,
the footprint associated with it,
and more importantly, it decouples this with this.
Decouples the host from the storage service
as much as possible.
So it brings in all those goodness, but then of course you increase latency, so
you reduce performance, you have extra hops. The gateway can become a bottleneck depending
on your workload and IO patterns. And of course your management complexity increases because now you have one more extra component to manage
so all good
depending on what you want
that may work
so next thing you can do is
well I'll add multiple gateways
so that alleviates my bottleneck at the gateway
and I can now use, you know,
load balancing and other techniques
to map my volumes on the host
and distribute them to different gateways
and let each gateway handle a subset of the volumes.
This works really well to support
a large number of hosts and a large number of volumes.
Of course, you still have the extra hop latency issue, right?
Because you're still going through a gateway to get to
these, to your actual storage services.
But depending on, again, your workload and your volume access
patterns, it is still possible for a particular gateway to get
overloaded.
Right? And then you can bring in orchestration complexity, use telemetry to
sort of move these assignments, and you can get really sophisticated with
what you're doing. But at the end of the day, that just increases
more complexity in your management. And because
you're adding multiple gateways,
physical or virtual doesn't really matter.
It does add cost to the solution.
So the next obvious step is, well,
why have separate gateways?
Why not just integrate them into the storage nodes themselves?
And that works.
You know, you get rid of one layer of, you know, machines.
Again, physical or virtual doesn't matter.
But your latency really doesn't change, right?
The extra hop latency you still incur
because at the end of the day,
all the IOs to a particular volume
end up on one particular node first before getting distributed, you know, into the storage layer itself.
And again, that bottleneck that was occurring here on, you know, when we had separate gateways simply moves down to the storage node.
And in many cases, the storage node is CPU heavy, right?
And anything you take away,
any resources you take away from the storage node
in processing this gateway functionality
is basically directly affecting your storage performance
that you deliver to your clients.
So what would be nice
is if we had this integrated gateway concept,
but it was distributed, right?
This goes back to the original DSS picture
that Mohan showed, which was,
hey, you know, I have a volume.
I want to get to the correct node right in the beginning,
right from the get-go, right?
But instead of having this custom sort of client here,
I want this to be NVMe, right?
This is really the end goal that we're trying to achieve.
Which is good.
The question is, what does that mean?
The first thing it means is it requires these volumes
to be aware of multiple targets, right?
Today with NVMe Fabric, multipath notwithstanding,
primarily it is a one-to-one relationship, right?
One volume is mapped completely to one target,
to one subsystem.
You can have multiple subsystems.
The idea, though, is that's really for multipath, right?
It's not really aware of a volume mapping with
different LBA ranges being mapped to different nodes.
And that's one thing that, you know,
that's one change that's required, right?
The second one is, well, how does this guy decide,
the client decide how to place the data, right?
Because the storage service that runs here,
there's various, you know, as Mohan said,
there's open source versions, there's proprietary versions.
So it's very hard for this, the NVMe interface here
and the NVMe device here to support all of them,
you know, using one scheme.
So the placement scheme that this needs to use
has to be extensible, right?
So it's something that can be extended
to support multiple storage backends here.
The other thing that really is needed is,
you know, this architecture will decouple, you know,
the NVMe with the storage nodes here,
but the problem is management, right?
Every time something changes down here because of either a failure
that, you know, where, you know, it's trying to recover from a disk failure or node failure
and it's changing its placement around, or it's trying to load balance because of performance,
which all of these scale-out storages do, you're affecting the interface up here at the host.
And that brings in management complexity.
So this extension that we're talking about in terms of placement really needs to be,
you know, to have, you know, the least amount of management intrusion, right?
So that's really the end state we are looking for.
Now, the question is, how do we get there?
So we, you know, what we thought was,
let's come up with a solution
that gets to that end state
using NVMe and NVMe Fabric as it stands today, right?
And basically, if you remember, we need NVMe over Fabrics to be aware of multiple targets.
We need an extensible way to place data, to figure out where data goes.
And we need that management complexity to go away so that these two can truly be decoupled.
So for the last part, we want it to be self-learning, right?
So we sort of took a leaf out of the, you know,
Internet routing world,
where even if you get a route that's not exactly accurate,
packets still get routed,
but eventually the router learns.
So we sort of want to take that concept
and apply it to this solution.
So what we're doing is we're proposing
the two concepts of a redirector and hints
into this NVMe fabric sort of solution.
A redirector basically does similar things on the initiator side
and on the target side.
On the initiator side, a redirector, you know, gets the IO
and figures out, for this IO and for this namespace,
where it should go.
And it looks up some table,
which we call the hint table here,
to figure that out, right?
When a target sort of gets an I.O.,
the target redirector figures out
if it should service it itself,
because it's sort of, it's a redirector that's specific
to the storage service,
or if it doesn't own it,
should it just send it to somebody else
who does own it?
But the main thing is,
even if the I.O. goes to the wrong target redirector,
that target always has to complete the I.O.
So there's no error back to the initiator.
So that way, that ensures, you know,
the weakest coupling, if you will, right?
The storage service is free to move things around
without the initiator getting impacted.
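A rough sketch in Python of that target-side behavior (the class, its injected lookup and backend objects, and the hint queue are hypothetical placeholders, not part of any specification or the talk's implementation):

# Hypothetical target-side redirector: always complete the I/O, even when this
# node does not own the LBA, and send a hint back so the initiator learns.
class TargetRedirector:
    def __init__(self, node_id, owner_lookup, backend, peers):
        self.node_id = node_id
        self.owner_lookup = owner_lookup   # storage-service-specific placement lookup
        self.backend = backend             # local drives this node owns
        self.peers = peers                 # connections to the other storage nodes

    def handle_io(self, io):
        owner = self.owner_lookup(io.namespace, io.lba)
        if owner == self.node_id:
            result = self.backend.submit(io)        # service it locally
        else:
            result = self.peers[owner].submit(io)   # proxy to the owning node
            io.hints_out.append((io.lba, owner))    # teach the initiator for next time
        return result                               # never error just for being misdirected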
You can...
So even if the initiator gets it wrong,
the storage service, the redirector,
will take care of it, right, and complete the I.O.
So in this case, the I.O. one flows through.
The initiator doesn't know actually where it should go to.
It sends to a default redirect target.
This guy sends it to the right one, completes it.
And then the second concept of hint comes into the picture, where now the
first redirector sends a hint back to the initiator saying, hey, for this IO, this is
really where you should be going. And the initiator learns from that hint, populates
its hint table, however it wants to implement it. Right? So the next time an IO comes to that LBA or that range,
it goes to the right target.
So with this scheme,
you know, you basically have an initiator
that eventually learns where, you know,
learns about the placement, right?
That's the basic idea.
And, you know, the storage service
is sort of free to do what it's doing, right?
Because once it gets an I.O.,
it services the I.O. as if it just came from its clients.
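To make the initiator side of that concrete, here is a minimal sketch in Python of a hint table that learns placements; the class and field names are my own, not from the NVMe specification or any implementation, and real hints would arrive as log-page data rather than Python objects.

# Hypothetical initiator-side redirector: unknown LBAs go to a default target,
# which completes the I/O anyway; returned hints are learned so later I/Os to
# the same range go straight to the owning node.
from dataclasses import dataclass

@dataclass
class LocationHint:
    lba_start: int
    lba_end: int        # exclusive
    target: str         # e.g. an NVMe-oF subsystem address

class InitiatorRedirector:
    def __init__(self, default_target: str):
        self.default_target = default_target
        self.hints: list[LocationHint] = []

    def resolve(self, lba: int) -> str:
        for hint in self.hints:
            if hint.lba_start <= lba < hint.lba_end:
                return hint.target
        return self.default_target      # no hint yet: default redirect target

    def learn(self, hint: LocationHint) -> None:
        self.hints.append(hint)         # called when a target sends a hint back

# First I/O to LBA 4096 goes to the default; after the hint arrives, the same
# LBA resolves directly to the owning node.
r = InitiatorRedirector(default_target="node-0")
assert r.resolve(4096) == "node-0"
r.learn(LocationHint(lba_start=0, lba_end=1 << 20, target="node-2"))
assert r.resolve(4096) == "node-2"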
So then the question is, what do the hints look like, right?
So as I was saying, that we want to be able to,
these hints are this primary mechanism by which a host is told
where data is placed, right?
And this needs to be extensible, right?
It needs to support a wide range of placement schemes
that, you know, different solution stacks provide,
as well as it needs to be extensible
for new things that are going to show up, right?
So what we sort of came up with is three categories,
and this, of course, I suspect more will be added to this,
but three categories of hints to take care of different kind of known backends today.
So you have, you know,
sort of simple pairwise backends, right,
where, you know, you're basically mapping a volume
across just two nodes and replicating it, right,
to where you have slightly more sophisticated placements
where, you know, extents are taken
from a set of nodes to create a volume,
and then within that, replication is done,
or erasure coding is done.
Then you have backends that do
sort of more RAID-like striping, right?
And then, of course, you have backends like Ceph
that do algorithmic, hash-based placement, right?
And so what we thought was,
if you categorize all of that,
there are sort of three schemes of hints, right,
that we can support.
One is the simple hint, which is really a range of LBAs mapped to a set of targets,
with reads and writes handled separately. And again, a set of targets,
because for reads, you may want to give a priority, or a set of, you know, targets that the
reads can go to if you want to parallelize the
reads. But for writes, maybe you want to give
an ordered list of targets because of the primary
replica versus secondary. So that's what a simple
hint does. It takes care of extent-based mapping. It
takes care of pairwise HA sort of solutions as well.
Right?
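As a small extension of the earlier sketch (again with made-up field names), a simple hint could carry separate read and write target lists like this:

# Hypothetical simple hint: one LBA range, read candidates that can be chosen
# freely or in parallel, and an ordered write list (primary replica first).
from dataclasses import dataclass

@dataclass
class SimpleHint:
    lba_start: int
    lba_end: int               # exclusive
    read_targets: list[str]    # any of these can serve a read
    write_targets: list[str]   # ordered: primary first, then secondaries

hint = SimpleHint(0, 1 << 20,
                  read_targets=["node-5", "node-2"],
                  write_targets=["node-2", "node-5"])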
The second one is striping hints, which are, as I said, backends that would do, that would
support, you know, things like, you know, RAID 0, for example.
Right? Again, here the idea is that you give sort of an LBA range,
what extents it maps to, what kind of striping group it's part of, what's the stripe size,
and that allows the initiator to just calculate exactly where a particular LBA, you know, access needs to land.
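A minimal sketch in Python of that calculation, assuming a plain RAID-0-style layout (the stripe parameters and node names are made up for illustration):

# Given a striping hint (stripe size in blocks, ordered list of extent targets),
# the initiator can compute which target a logical block lands on by itself.
def striped_target(lba: int, stripe_blocks: int, targets: list[str]) -> tuple[str, int]:
    stripe_index = lba // stripe_blocks                # which stripe unit the LBA is in
    target = targets[stripe_index % len(targets)]      # round-robin across the extents
    local_lba = (stripe_index // len(targets)) * stripe_blocks + lba % stripe_blocks
    return target, local_lba                           # owning node and LBA within it

# Example: 64 KiB stripe units (16 blocks of 4 KiB) across three nodes.
print(striped_target(lba=40, stripe_blocks=16, targets=["node-0", "node-1", "node-2"]))
# -> ('node-2', 8): block 40 is in stripe unit 2, which lives on node-2.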
And then the last is the hashing hint,
which is where the hint really comprises the hashing function that needs to be used
and the way an object name is derived, because usually,
especially if you take something like Ceph or Gluster,
you know, from the volume name and the LBA, you derive an object name, an object ID,
you calculate a hash function on it,
and then you go look up a table, basically a hash
bucket table, to figure out which node it needs to go to. And all of that is embodied
into the hashing hint, right? So what kind of chunk size this scheme uses, right? What is the
object name format that this scheme uses? Predefine a bunch of hashing functions that are common,
and I'm sure that can be extended.
And a hashing table location where the actual lookup
is for the actual node.
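Here is a rough Python sketch of that kind of lookup; the object-name format, chunk size, hash choice, and bucket table are hypothetical stand-ins, not Ceph's actual CRUSH placement or anything from a specification.

# Hypothetical hashing-hint lookup: derive an object name from volume + LBA,
# hash it, and map the hash into a bucket table that names the owning node.
import hashlib

CHUNK_BLOCKS = 1024                                   # chunk size the scheme uses (assumed)
BUCKETS = ["node-0", "node-1", "node-2", "node-3"]    # hash bucket -> node table

def hashed_target(volume: str, lba: int) -> str:
    chunk = lba // CHUNK_BLOCKS
    obj_name = f"{volume}.{chunk:016x}"               # assumed object-name format
    digest = hashlib.sha1(obj_name.encode()).digest()
    bucket = int.from_bytes(digest[:4], "little") % len(BUCKETS)
    return BUCKETS[bucket]

# Every initiator with the same hint parameters computes the same placement,
# so most I/Os go straight to the right node; a location hint can still
# override this when the back end has temporarily moved a chunk around.
print(hashed_target("vol1", lba=123456))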
And these three together, essentially we think,
depending on which one you get to use,
or you can combine the two.
You can do a hashing hint,
but sometimes you get the wrong node
because the back end is changing things around.
So you can add a simple hint,
a specific location hint on top of a hashing hint.
That'll take precedence, right?
So you can do these things to always minimize your latency,
get to the right node as quickly as possible,
yet allow this thing to learn
as the storage service is expanding, contracting,
changing its placement. So, if you have to do
this, you know, what are the changes to NVMe over
Fabrics that need to be done? So our first premise
really is that we want to, you know,
we want to reuse the existing elements of NVMe fabrics, right?
As far as the protocol and the architecture are concerned,
we try not to introduce any new element, anything radically different, right?
And that's what we set about doing, trying to see, can we just do this
using existing elements?
That doesn't mean existing implementations
will not have to be changed.
Of course, you have to implement a redirector.
You have to be able to implement, you know,
paying attention to the hints,
using the hints to go to the right place.
But at least from an NVMe fabric protocol standpoint,
we get to use as much as, you know,
existing elements that it already provides.
Also, it was important that legacy initiators
continue to work with storage services
that have these redirector capabilities.
So what I mean by legacy initiators is, you know,
we all know that once you have a standard out there,
and Devendra talked a lot about, you know,
compliance and interoperability.
NVMe and NVMe over Fabrics have certainly been taken up in a big way.
Lots of products out there.
Lots of native support from operating systems
and hypervisors.
So there will be initiators out there,
and when you deploy a storage solution with this capability,
it still needs to work with those existing initiators.
So we want to make sure this is backward compatible,
and basically what makes it backward compatible
is the fact that, you know,
an initiator that doesn't know about,
doesn't pay attention to hints
can still send an I.O. to what it thinks
is the right node,
and that node will complete the I.O.
It'll figure out where the I.O. actually belongs to,
send to that node, get the results, and send it back
at the expense of performance,
but you'll continue getting functional capability.
That's why it was even more important
that we use the existing NVMe over Fabrics elements.
So there are three sort of things we need to worry about. One is how do we, what are the hints? How
do we represent hints? How do we notify hints? And how do we know that a particular NVMe fabric
initiator or target, but mostly it applies to target systems,
is capable of this functionality.
So the first thing, how do we represent hints?
Well, we figured log pages are a good way to do this.
Right?
NVMe already defines the concept of log pages,
both standard log pages as well as you can do vendor-defined log pages.
And that's a good way to basically deliver hints.
So all of those hints I talked about
can map to different log pages and different formats.
Of course, things like when somebody's reading a log
page it needs to be consistent, as a log
page might, you know, be read in multiple chunks.
So things like that have to be taken into
account. But the concept of log pages can be
used for this.
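As a rough illustration, continuing the Python sketches with an entirely made-up layout (not a format defined by NVMe or proposed in the talk), a batch of simple hints could be serialized into a vendor-defined log page something like this:

# Hypothetical vendor-defined log page carrying location hints. Assumed layout:
# a 4-byte entry count, then per entry a 64-bit start LBA, a 64-bit end LBA,
# and a 16-byte target identifier, all little-endian.
import struct

ENTRY = struct.Struct("<QQ16s")

def pack_hint_log_page(hints: list[tuple[int, int, str]]) -> bytes:
    page = struct.pack("<I", len(hints))
    for start, end, target in hints:
        page += ENTRY.pack(start, end, target.encode().ljust(16, b"\0"))
    return page

def unpack_hint_log_page(page: bytes) -> list[tuple[int, int, str]]:
    (count,) = struct.unpack_from("<I", page, 0)
    hints, offset = [], 4
    for _ in range(count):
        start, end, target = ENTRY.unpack_from(page, offset)
        hints.append((start, end, target.rstrip(b"\0").decode()))
        offset += ENTRY.size
    return hints

page = pack_hint_log_page([(0, 1 << 20, "node-2")])
assert unpack_hint_log_page(page) == [(0, 1 << 20, "node-2")]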
The second thing is how do we notify that, you
know, there is a hint, you know, that hint propagation
thing that I showed earlier. AERs are a good
way to go do that. You know, your asynchronous
event requests, that's already supported by NVMe over
Fabrics today. So you can have an AER outstanding
for the particular log page that you're looking for.
So the initiator can send an AER,
and whenever there's a log page
that affects that initiator,
the target can send a completion back,
and that causes the initiator to come back
and read the log page.
So that scheme should... We should be able to use that scheme to do the notification.
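A very rough sketch of that loop on the initiator side; the controller handle and its send_aer / read_log_page calls are hypothetical placeholders for whatever driver interface actually issues those commands, and the helpers come from the earlier sketches.

# Hypothetical notification loop: keep an AER outstanding; when one completes
# for the hint log page, read the page, learn the hints, and re-arm.
HINT_LOG_PAGE_ID = 0xC0   # assumed vendor-specific log page identifier

def hint_notification_loop(ctrl, redirector):
    while True:
        event = ctrl.send_aer()                    # blocks until an async event completes
        if event.log_page_id != HINT_LOG_PAGE_ID:
            continue                                # not a hint notification; re-arm
        page = ctrl.read_log_page(HINT_LOG_PAGE_ID)
        for start, end, target in unpack_hint_log_page(page):
            redirector.learn(LocationHint(start, end, target))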
And then capability discovery is probably the easiest.
We think the supported capabilities and the get features commands, you know, we should be able to add bits to those to allow
an initiator to discover redirector-capable targets.
Until that's there, you know, you can use a
whitelist or some other schemes to sort of at least get the ball rolling.
So, in summary,
and Mohan sort of started with this, right,
what we're really saying is that
any distributed storage service, and as it was
probably obvious, I mean, we're focusing on distributed storage services, but we're really
looking at any storage service, can benefit from a standards-based interface into the host, right?
I think, you know, SNIA has been doing,
you know, that's really what SNIA has been driving a lot,
standards-based interfaces.
From the earlier talk, you know,
we know that with standards,
there's innovation that we can leverage from a large body of work.
So what we really want is to provide a standards-based interface
to any storage service, right?
And we believe any storage service
will benefit from this approach, right?
So once they develop their service,
they have a ready-made ecosystem
that they can just plug into,
and they'll be able to deliver their service
straight to the host right away.
NVMe, of course, we feel is the ideal interface.
It's, you know, it's obviously gotten a lot of ground.
It's got a lot of ecosystem support already.
And so that seems like the right place to be.
And because now there is a network component,
a fabric component to this,
that we believe NVMe Fabric fits that bill naturally.
It already has the elements needed to make such a scheme work.
So in total, between NVMe, NVMe Fabric,
and marrying it to any distributed storage service,
we feel that we can deliver sort of the best experience
to a customer as far as storage is concerned.
And, you know, the table here
kind of just captures everything that we said
in the last 40, 45 minutes or so,
which is you have a distributed storage service today,
and it suffers from a lack of standard host interface.
It, of course, delivers, you know, good performance, right?
And, of course, the gateways aren't applicable to it.
Then you add, you know, the gateway, a single gateway,
distributed gateways,
and gateways exist with, you know, AWS gateways,
Azure gateways, other gateways,
but they bring the goodness of a standard host interface,
but in terms of availability and performance,
you know, they basically bring some, you know,
badness to the original distributed storage solution.
With the self-learning NVMe fabric solution,
you know, in general, we expect that
you restore the goodness of DSS,
but you bring in the goodness of a standard interface.
Yeah, so thank you.
Questions?
Sorry, yeah, go ahead.
How does stuff like reservations,
ANA, security,
something like that,
work?
So the question is,
how do things like reservations
and something like, how will that be supported here?
Well, if you notice, from an NVMe standpoint, it is still, from a host model, it looks like the same, right?
You have a volume that is surfaced on the host.
You send reservations to it. You need to have the NVMe fabric target support reservations to begin with.
Even today, if you don't have the scheme,
it always gets to a target first
and then gets to the right node in the storage service.
So if the storage service supports reservations,
the reservation just flows through.
This doesn't add any new complexity there.
Sorry, let me just get your question.
A lot of what you describe
is software-defined storage architecture
that's already implemented by a lot of upper layers,
a lot of systems.
Who do you see actually shipping the things you propose here?
You seem like you're trying to push it
way down to the bottom of the stack.
Is this devices, appliances, operating systems?
What?
Yeah, I think the...
What we're trying to do is, I think, as we said,
push it down to...
to add the NVMe layer
as far as the operating system is concerned, right?
So you don't need anything on the host that's, you know,
higher than that that's managing this, right?
Yeah, I see.
And have the storage service that's typically running elsewhere
not worry about how that's getting delivered to the host
as long as they have the right thing.
That's sort of the basic idea.
One second.
Let me...
So one thing is that we see,
if you think of scale-out storage,
I mean, they're solving a different set of problems really well,
but, you know, if you look at a scale-out storage software,
the volume management, snapshot,
all kinds of functions that are built into it,
and they do really, really, really well.
Right.
But then on the client side,
they feel obligated to put in a client interface
that goes with their back end, essentially.
And then NVMe Fabric, if you look at it, essentially says, I'm going to help you
disaggregate storage, but then it doesn't do any of these things that you would expect any storage
service to do. So what we're saying is essentially the hybrid approach basically lets you
keep your storage back end, but then provide you an NVMe front end. So
the host burden goes away. The host gets
a standards-based solution.
And essentially, it's pieces of software that you create on top of the backend
to essentially provide the translation and, more importantly, the management mapping.
There is this management layer that has to say,
I'm going to create a volume, take that volume, map it into an NVMe-based thing.
Now it shows up as NVMe, and then the NVMe management or fabric management takes over.
I still don't fully understand who's going to provide all those things.
The devices, the gateways, what the heck.
So, for example, let's take Ceph, right, because it's open.
I can see where Ceph provides this as a variation of Ceph, essentially, that does this.
So it comes from the storage nodes up.
So Red Hat will ship this.
Yeah.
Yeah, so you basically...
I mean, somebody has to develop
on the storage nodes
the right redirectors, right,
with NVMe fabric on top.
But on the host side,
it will be standard NVMe initiator drivers
that will have to support this.
Right.
Yeah, question.
Something similar to the QT question.
Not only from a sanitizing perspective,
but from write-locking,
I mean, TCG Opal, that type of thing.
How does it work in this...
Yeah.
So this doesn't change any of that model
because, you know, again,
if, you know, from a host standpoint,
you see a NVMe device,
and if you did TCG in Opal,
if you expected Opal support
and you were using it as Opal,
you still get to use it,
because that just gets to, again,
the storage service.
And the storage service, of course,
has to support, you know,
the Opal standard.
The targets have to support the Opal standard,
which is true today.
I mean, for any NVMe over Fabrics solution,
the targets have to support the Opal standard.
And so this just brings that to the table.
There's nothing...
The basic device model
that NVMe and NVMe over Fabrics bring
to the host and the application,
whether it's reservations or Opal,
doesn't change.
Ultimately, it has to be supported by the back end
regardless of how you deliver it.
Great question.
So would this require any changes to the
NVMe over Fabrics specification?
Yeah, so we don't...
So, like I said, all the elements exist.
You know, log pages, AER, you know, supported feature bits.
So, you know, we are prototyping with, obviously,
just vendor-defined things right now.
Once all of this gets sort of worked out,
we can standardize the log page information, for example, right?
You know, like, okay, log page XYZ is for hashing hint,
and this is the hashing hint format, right?
It doesn't change the protocol of NVMe over Fabrics.
It's the management portion of it that's going to change.
Question?
It still doesn't enable multipathing.
Well, current multipathing would work.
You know, native NVMe multipathing that's now in the kernel
will continue working.
Sorry,
question?
Sorry, can you repeat the question?
Yeah, I think one of the main things we're trying to avoid is, especially to support large scale,
is having central management as much as possible.
So all the goodness that a distributed storage system brings is that.
And so if you distribute it, you know,
we think, you know, from a self-healing standpoint,
it's going to scale better.
But, yeah, you can implement something
that's more centralized as well.
I mean, I don't think this is mandating
a particular kind of implementation.
Because the hints can come from, like Mohan said,
can be a management solution, right?
Any other questions?
Well, thank you very much for your time.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers
in the storage developer community. For additional information about the storage developer conference,
visit www.storagedeveloper.org.