Storage Developer Conference - #114: NVM Express Specifications: Mastering Today’s Architecture and Preparing for Tomorrow’s

Episode Date: November 22, 2019

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast, episode 114. All right. Hi, how you doing, everybody? My name is Jay Metz. I am a research engineer for Cisco. I am also on the board of directors for NVM Express. And I am working with you today to talk about what's happening in the actual specifications moving from where we currently are into what's going on in the future. And I said to myself, you know what? I have access to people who actually know what they're talking about. So I said to myself, self, because that's what I do,
Starting point is 00:01:10 I say self. And why don't I just ask Nick, who has been instrumental in doing a lot of the documentation for the changes between 1.3 and 1.4. If you've gone on the website and seen the changes, the list of changes for 1.4, that's the man who wrote it, so we're going straight to the source. And then I said, well, you know, what about the consequences of these? What is it going to have as an impact for testing? Well, why don't we just ask David to show up too, because if anybody knows about testing and what's going on in the testing, it's going to be him. So I said, hey, self, you just dropped your level of work down to about two-thirds. Yay! So if you don't mind going to the next one. So what this is about is really the culmination of some things that have been happening both in the marketplace, in the industry,
Starting point is 00:02:01 in development, and as Nick and David will tell you, also in ways of how people are actually starting to implement the protocol outside of what the expectations really were. And that's fair, because like any new thing, people are going to start using it in ways that you don't necessarily anticipate when you start making the developments in the first place. But as a result, there are some ways of doing things that wind up getting confusing. People in the end consumer group, consumer and enterprise
Starting point is 00:02:32 group, start to wonder, well, what does it mean if I have this number attached to a specification? What do I develop to? If this is optional and that's mandatory, what happens? And what happens to the changes? So we started taking a hard look at what it is that we are doing happens and what happens to the changes so we started taking a hard look at what it is that we are doing as we start to develop this organically fluid specification and start to realize that we probably need to codify it a little bit better so uh what was about a year ago did we start talking about refactoring about that roughly yeah somewhere somewhere between nine months and a year ago we started talking about refactoring? About that, roughly? Somewhere between nine months and a year ago,
Starting point is 00:03:05 we started talking about, well, what we really need to do is we need to take all of these different aspects of the specification, and we need to repackage it in a way that's a lot easier to find stuff. And you can be honest. How many people have had a hard time finding what you need easily? Okay, again, we're here in the audience participation portion of the day. I'm not asking you to cheer, just give me a sign of life. Okay, so what we're going to talk about today is we're going to talk about the reasons why we do the refactoring, what we're
Starting point is 00:03:38 expecting people to get out of the refactoring, and how we get to and through the refactoring moving forward. But as I said earlier, one of the most important parts about the whole thing is how do you make sure you're doing what you're supposed to be doing. And as a result, when we started talking to David early on in this process, he said, I'm glad you asked, because there's a lot of stuff going on that is really going to wind up confusing people if we're not clear. So this effort
Starting point is 00:04:05 is to try to make things clear for you and for the viewers at home who are listening to this on the website. So once we get through this, we're going to be able to hopefully have not just a clear idea of how to use the specifications more efficiently, but also what to expect moving on in the future, and how NVM Express
Starting point is 00:04:25 is going to help you get there. All right? If you don't mind. So, quick level setting as to what we're looking to do here. What we're going to do is we're going to be kind of brutally honest in a lot of ways. We're going to tell you where the warts are, how to put the cream on them and make them go away. We're going to help you understand what we think you should probably be doing,
Starting point is 00:04:47 so we're going to suggest courses of action, but we're not going to be prescriptive. We're not going to tell you what to do because there is a lot of optional stuff in the features by design. So there's this tightrope walk that we're trying to take that will help you understand where you can take what you're looking to develop in
Starting point is 00:05:06 the right direction without going too far as a deviance from where the specification itself is going. We're also not going to be exhaustive. As we'll see later on in the presentation, there are many, many, many changes, many mandatory changes from going from 1.3 to 1.4, for instance. There will be other additional changes going from 1.3 to 1.4, for instance. There will be other additional changes going from 1.4 to 1.2. There's no way we can talk about it in 45 minutes and cover everything. So we're going to give you some samples and some examples
Starting point is 00:05:33 of what kind of changes to expect, what the consequences of those examples are going to be, and then we can kind of extrapolate on to additional changes that we can't go over in fine detail. Fair? Okay, for the record of those at home, we had nod heads. Okay. Next. All right. So starting from getting from there to here. We're going to start at the top of the funnel. We're going to work our way down pretty quickly. Now, in theory, the process of getting from here to there from 1.2 to 1.4 should be relatively straightforward.
Starting point is 00:06:05 You pick up 1.4 and you start writing to the spec, right? Well, not really, because if you instead go from 1.2 to 1.3, you have to deal with 12 ECNs, 14 technical proposals, just to get to 1.3. Now, why would you want to do that if I can just pick up 1.3?
Starting point is 00:06:24 Well, because the problem is that a lot of the stuff that goes on in the technical proposals from 1.3 to 1.4 could also be applied to 1.2. So you can go back and read the technical proposals from 1.3 to 1.4 and go back to 1.2. So you're doing a lot of hopscotching. And then what makes things even worse is that sometimes you have multiple ECNs that don't necessarily accumulate the previous ECNs. So there's text in some ECNs that are not in other ECNs, and you wind up having to do a bunch of jumbling. Yay! Fun.
Starting point is 00:06:57 All right. Now, if we go from 1.3 or 1.4 to 2.0, things get a little bit messier. And we're like, you know what? Let's just take a step back and take a breather. The reality of the situation is that what we're doing for NVMe 2.0, which we're going to get to in a little bit, is going to be able to clarify a lot of the hopscotching. We're looking specifically to be able to concatenate a number of these things
Starting point is 00:07:23 so that you can have pretty much a single source of truth. Moving forward, there will be some additional technical proposals that are going into 2.0 as well. But the key thing to keep in mind here, and if you walk away with one thing today, the most important thing to walk away from is that this 1.3 to 1.4 to 2.0 is an easier path. And the reason why it's an easier path has to do with the way that the hopscotching works. So NVM Express is going to be helping everybody get from 1.4 to 2.0 because we'll be using 1.4 language to get to 2.0. Which means that if you try to skip 1.4 and go straight from 1.3 to 2.0, you will find yourself having to do a lot more hopscotching than necessary. So kind of a measured pedantic pedestrian way of approaching this will wind up being
Starting point is 00:08:12 very, very useful, particularly when we start to get into some of the mandatory changes that are going on in 1.4. Skipping over those is going to cause a little bit of hurt. And as a result, it will also become pretty unique. So the problems that you have as a developer going from 1.3 straight to 2.0 will be left up to you, for the most part, to solve because of all the different options.
Starting point is 00:08:36 So what we're going to try and convince you today, and like I said, it's going to be highly suggestive, is go from 1.3 to 1.4, then to 2.0, because that process, that putting on your socks before your shoes, will actually wind up helping you in the long run. So at that, just a summary. You've already seen Nick's great changes made in 1.4. We're doing the same thing for NVMe over Fabrics 1.1. We will be doing one for management as well.
Starting point is 00:09:12 You'll notice that we have those for 1.4. We don't necessarily have them for 1.3. We will likely also have that for 2.0 as well, I would assume. We're going to give more help going from that direction. So start off going from 1.2 or 1.3 to 1.4 before moving on to 2.0. All right. One of the things to just kind of reiterate that Jay's alluding to here, when we say that going from 1.4 to, or excuse me, from 1.3 to 2.0 will be more difficult, it's because we're refactoring the specification, right?
Starting point is 00:09:44 And we'll go into more detail about that in a little bit. But one of the things there is that there will be sections of the specification that move around. There will be some significant changes, not in terms of technical content, but in terms of how the spec is put together. And a lot of the technical proposals, the current technical proposals are written against that 1.4 specification. When we release 2.0, that's going to be a challenge to be able to put together. We just want to make sure that's clear.
Starting point is 00:10:16 This is all you. To start the presentation, we're going to kind of go back a little bit, and we're going to talk about what does it take to get ready for 1.4. You know, we've kind of come in and we've said, hey, it's important to kind of take these steps there. That first step is actually getting to 1.4, and so we want to talk about the types of changes that have gone into this 1.4 spec. We've got changes that are basically new features and feature enhancements, these things are similar,
Starting point is 00:10:47 but then the important thing to be able to iterate here and just be able to make sure you guys take away is that there's a number of required changes that are incompatible with previous versions of the specification. So as you go and get compliance checked through UNH, we need to make sure that all of those things are actually captured well and that you're aware of how to find that stuff. Because when you look at a couple hundred page specification,
Starting point is 00:11:14 it can be challenging to figure out what do I need to do, right? And so that's one of the things that we want to talk about today. Where do I start? As you see the cameras here taking pictures, this is something to keep track of. And you'll be able to get this on the website afterwards. But the idea is this is an important slide. As we went about making the 1.4 specification, and as Jay alluded to with regard to Fabrics 1.1 as well as the MI 1.1,
Starting point is 00:11:42 we have added this new kind of change list to the website. And so specifically out on the NVM Express website, right in the section where we've got all of the specifications, there is a listing of all the changes that have gone into the 1.4 spec. And on top of that, it calls out not just what the change was, but the specific sections in the specification where those changes came in at, as well as which TPs and or ECNs that change is related to. You can really get the detail about where do I need to go look to be able to find out about a particular change. And that change list that
Starting point is 00:12:25 is on the website is exhaustive. Unlike our presentation here, it contains all of the changes that are required to get from 1.3 to 1.4. And so this kind of detail is part of the reason why we really encourage folks to move to that 1.4 kind of release of the spec and that readiness before they move on to 2.0 because it's very, in this way it's very prescriptive. We aren't going to update some of those changes and where they're at into the 2.0 spec. You'd have to kind of search around for that stuff.
Starting point is 00:12:56 So that's one of the things we're trying to encourage here. So here, this is just, I don't know, groups, classifications of updates that were made for required changes. So, this is kind of going through that list of things that if you're going to be 1.4 spec compliant, you need to make updates in these various areas. I'm not going to go through this whole list, but the idea here is to really talk to you about the fact that there are a number of changes and that they're important changes. A lot of these have to do with things like properly handling error conditions or clarifying different kinds of NSID values, namespace identification values. Make sure that things are not left kind of implied in the specification, but we've really
Starting point is 00:13:44 gone through and made sure that things were explicit so that we don't have cases where vendor A implements it one way and vendor B implements it another way. We're putting compliance in place for these things to make sure that people are able to, as they consume SSDs or arrays or what have you, get a consistent implementation across vendors. One of the things that we're going to do here today is dive in just a little bit to give some examples of what those changes are and how we've explained them on the website.
Starting point is 00:14:20 This first one is about controller memory buffer. Controller memory buffer support has been there previously. It was introduced into the 1.3 specification. But what we did here is we kind of hardened the implementation. We found some error conditions where if a drive supported CMB but the OS didn't support CMB, there were some scenarios where you could get some bugs and some issues, and we wanted to make sure that we closed through some of those gaps, especially as some of them might be security-related. And so what we went and did is we added basically a support bit and an enable bit. And the idea here is to make
Starting point is 00:14:59 sure that, you know, as a drive implements the controller memory buffer, that the OS has to be aware to be able to turn on that functionality. And it's not sitting on in the background just kind of as a default without the OS realizing it. So this is a kind of important thing. In addition to that, as part of the changes that we went and did, we kind of removed some of the restrictions for the controller memory buffer.
Starting point is 00:15:26 Previously, you had to have all your data and your command queue, excuse me, your submission queue entries all into that same buffer, but now we kind of removed the restriction. You can have some in host memory and some in the controller memory buffer, kind of either direction and that's okay. In addition to fixing some of the gaps, we also allowed for more flexibility in the controller memory buffer implementation. I talked to some of the whys already, but the key piece of why we did this is to make sure that the OS is aware of the fact that the drive is supporting this
Starting point is 00:16:05 and that some bad actor can't do something in the background without the OS being aware and without the drive knowing that it's not the OS doing it. Those are some bad scenarios. So we wanted to make sure and do that. And the impact of inaction in this space is that you continue to leave a drive that supports CMB potentially vulnerable to some of these types of issues. And so we want to make sure that those things get closed down, that we make folks aware of that fact. And so that's why we called out this change and that we made the associated changes to the specification.
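To make the support/enable handshake concrete, here is a minimal C sketch of how a 1.4-aware host might opt in to a controller memory buffer. The register offsets, bit positions, and the read_reg64/write_reg64 accessors are assumptions for illustration (CAP.CMBS plus a CMBMSC register with CRE and CMSE bits, per my reading of the 1.4 layout); check the published spec before relying on them.

```c
/*
 * Minimal sketch (not a driver) of the 1.4-style CMB handshake described
 * above: the drive advertises support, and the host must explicitly enable
 * the buffer before anything lives in controller memory.
 *
 * Assumptions to verify against the 1.4 base spec: CAP.CMBS at bit 57 of the
 * CAP property (offset 00h), CMBMSC at offset 50h with CRE at bit 0, CMSE at
 * bit 1, and the controller base address in the upper bits. The register
 * accessors below are stand-ins so the example is self-contained.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NVME_REG_CAP     0x00u
#define NVME_REG_CMBMSC  0x50u          /* assumed offset                    */
#define NVME_CAP_CMBS    (1ull << 57)   /* assumed bit position              */
#define NVME_CMBMSC_CRE  (1ull << 0)    /* Capabilities Registers Enabled    */
#define NVME_CMBMSC_CMSE (1ull << 1)    /* Controller Memory Space Enable    */

/* Stubbed MMIO so the sketch compiles and runs without hardware. */
static uint64_t regs[0x1000 / 8];
static uint64_t read_reg64(uint32_t off)              { return regs[off / 8]; }
static void     write_reg64(uint32_t off, uint64_t v) { regs[off / 8] = v; }

static bool enable_cmb_if_supported(uint64_t cmb_base_addr)
{
    if (!(read_reg64(NVME_REG_CAP) & NVME_CAP_CMBS))
        return false;                 /* controller does not advertise a CMB */

    /* Step 1: expose CMBLOC/CMBSZ by setting CRE.
     * Step 2: program the base address and turn the memory space on (CMSE). */
    write_reg64(NVME_REG_CMBMSC, NVME_CMBMSC_CRE);
    write_reg64(NVME_REG_CMBMSC,
                NVME_CMBMSC_CRE | NVME_CMBMSC_CMSE | (cmb_base_addr & ~0xFFFull));
    return true;
}

int main(void)
{
    regs[NVME_REG_CAP / 8] = NVME_CAP_CMBS;   /* pretend the drive has a CMB */
    printf("CMB enabled: %s\n",
           enable_cmb_if_supported(0x80000000ull) ? "yes" : "no");
    return 0;
}
```

The important part is the ordering rather than the exact bit names: support is advertised, then the host explicitly opts in, so the buffer can never be active in the background without the OS knowing.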
Starting point is 00:16:38 One of the things you'll note down on the lower right is kind of an example of what we did inside that change list on the website. We've got the NVMe revision 1.4, section 3.1, 4.7, 4.8, 7.3, and blah, blah, blah, blah, blah, right? But the idea is it's very explicit as to which section the changes are in so that when you go and you try and read through a large document, you're able to find where that stuff is at, and it's not hidden. The other thing, especially when it comes to some of the required changes, like something like this, we actually call out the explicit sentence where the change happened,
Starting point is 00:17:17 so that not only do you know the section, but you actually know this is the line that you really got a key in on, and this is what was important about the change. Some of these have half a dozen of those sentences in there, and it's really important that you read through each one of those things because it will help you make sure that your implementation is solid. Yes, so on this one, Joe gave this. This is Jay. He's much better at these deliveries than I am, but, you know,
Starting point is 00:17:47 it's all good. Here, what we've got is, you'll notice one of the changes is with namespace IDs. We have this kind of general usage for FFFF, well, 8Fs here. And what it generally means is that we're broadcasting that action to all of the namespaces inside of the subsystem. Okay? But the thing here is that we didn't explicitly define it for every single command. Right? And so there's a lot of kind of vendor-specific implementations that are out there, and the vast majority of them are exactly
Starting point is 00:18:25 what we intended. And then there's some that aren't. And some of them, even when we went through as the experts and we're saying, oh, yeah, it needs to be this way, it needs to be that way, and we're like, oh, we're at loggerheads with regard to what's the right thing that's intended for the various situations. And so what we did inside of 1.4 is we went and we made very explicit exactly what the different FFFF definitions were for each of the commands. This is true for, I guess, IO commands, set and get features, admin commands, as well as reservations. You know, all of these things, it's either, you know, FFFF is supported or it's not supported. When it is supported, it means this. You know, these are the error conditions to send back when it's not the case.
Starting point is 00:19:11 And honestly, the majority were actually defined before. But what we did is we went through and we found some gaps. And so what this did is, you know, this change really was about making sure that we had filled those gaps and that it was explicit what to do. Again, it's that cross-vendor support. As you're a host or as you're an application software, you're able to know exactly how the device is going to respond ahead of time. This is a really key thing. One of the things that we want to make sure is clear is what's the impact of not doing this?
Starting point is 00:19:43 My drive's fine. Nobody complains about this stuff. I never hear anything from my customers. Well, you know, this last example we have, what happens when a delete command is sent with NSID FFFF? Do you delete all your namespaces? Do you not delete all your namespaces? I mean, what do you do? I mean, that seems pretty obvious, right? It's a broadcast. You should delete everything. I see a cringe in the front row, second row. No one's in
Starting point is 00:20:10 the front row. So in the second row, there's a cringe. So we'll get onto this a little bit more later. But that's an example of the types of things that were unclear before that we've now clarified. Moving forward. So we couldn't go and not talk about any of the new features. So one of the things that we wanted to talk about was the persistent event log. Again, there's a number of new features that were added to the specification, but here we wanted to give, this is kind of a two-sided thing. This provides all kinds of ability to be able to capture logs and persist them across power cycles. That's important for a number of reasons, the biggest of which generally is kind of debug and to figure out what happened and to be able to keep logs
Starting point is 00:20:54 of what's going on. But the idea here was that we create a consistent way to do that again across vendors and across OSs. So this allows for basically all your SSD manufacturers to be able to generate things that are custom for their drive, but inside of this number of different types of events. So firmware commits, which are obviously going to be implemented in a vendor-specific way. Thermal excursions, the hows and whys of that are, again, going to be specific to a particular vendor. There's a lot of different things here where there's vendor specifics to the implementations here,
Starting point is 00:21:36 but how you get that data and the types of data, those types of things can be consistent. And so we wanted to provide a mechanism, a framework that could be used both by the device vendor as well as by the OS side or the host side. So that's what we wanted to do. And that one's building for some reason. So now we will go back to Jay and we'll get a little bit more about kind of where we're going forward from here with refactoring. That's right. It's exciting.
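To make the persistent event log flow above a bit more concrete, here is a hedged C sketch of how a host might consume it. The log identifier (0Dh) and the establish/read/release-context actions in the Log Specific Field are my reading of 1.4, and admin_get_log_page() is a hypothetical helper, not a real library API.

```c
/*
 * Hedged sketch of reading the 1.4 Persistent Event Log from the host side.
 * Assumptions: Log Identifier 0Dh, and a Log Specific Field that selects
 * establish-context-and-read (1), read (0), or release-context (2).
 * admin_get_log_page() is a hypothetical helper, not a real library call.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LID_PERSISTENT_EVENT  0x0D    /* assumed log identifier  */
#define PEL_ACTION_READ       0x0     /* assumed LSP encodings   */
#define PEL_ACTION_ESTABLISH  0x1
#define PEL_ACTION_RELEASE    0x2

/* Hypothetical helper: issue Get Log Page, fill buf, return 0 on success. */
static int admin_get_log_page(uint8_t lid, uint8_t action,
                              uint64_t log_offset, void *buf, uint32_t len)
{
    (void)lid; (void)action; (void)log_offset;
    memset(buf, 0, len);              /* stub so the sketch runs standalone */
    return 0;
}

int main(void)
{
    uint8_t header[512];

    /* 1. Establish a reporting context and read the fixed header, which
     *    tells the host how large the full log currently is.              */
    if (admin_get_log_page(LID_PERSISTENT_EVENT, PEL_ACTION_ESTABLISH,
                           0, header, sizeof(header)) != 0)
        return 1;

    /* 2. Read the remaining event data in chunks with the plain read
     *    action, advancing the log page offset each time (omitted here).  */

    /* 3. Release the context so the controller can reuse it.              */
    admin_get_log_page(LID_PERSISTENT_EVENT, PEL_ACTION_RELEASE,
                       0, header, 0);

    printf("persistent event log read (sketch)\n");
    return 0;
}
```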
Starting point is 00:22:05 Oh, yeah. Okay, so let's start with a little bit of dirty laundry here. One of the things that we've noticed as we've been evolving, the intentions have always been pure. There's no question about that. The idea of NVM Express from the very beginning was that it was supposed to be a very simple approach to handling block storage for non-volatile memory.
Starting point is 00:22:33 As a result, the organization wanted to keep things relatively simple. They wanted to keep a very small mandatory set of commands, and then you just add in a bunch of optional features. Great in theory. In practice, what's happened is that with all the different optional features and the different dependencies that's happened,
Starting point is 00:22:52 you've wound up with some inconsistencies in how the actual specification has been written. The other thing that's wound up happening as a result is that because of the fact that NVMe was originally designed for PCIe, there has been a trend in conventional wisdom, for lack of a better word, where people have associated NVMe with PCIe. In fact, have you all heard of the website Quora, where people can ask questions and get answers from so-called experts? Well, so if you look at the questions on NVMe and Quora, a lot of those happen to wind up being an equation of three different, very different elements.
Starting point is 00:23:33 NVMe equals PCIe equals M.2. I can only do so much. All right. But the idea here is that NVMe and PCIe being equivalent is a very real problem, even in the engineering space where people should actually know. So I work for a company that's not well known for storage, and so sometimes we have to have a little bit of a come to Jay moment and identify what's actually what. Now, as a result, we're realizing that, generally speaking,
Starting point is 00:24:07 the way that we're treating NVMe right now is that it is using PCIe as its own transport capability, not that they're one and the same. Now, if you look at some of the older specifications, you start to realize that the two get used as synonyms a lot. So, okay, our bad. We're trying to fix that. The other problem is that when we started adding in fabrics,
Starting point is 00:24:27 when we started writing in NVMe over fabrics, the way that NVMe over fabrics is written is slightly different than the way that NVMe is written. So the structure is a little bit different, and some of the language conventions are a little bit different. All that stuff
Starting point is 00:24:43 has made for an interesting interpretation exercise on the part of the reader. So we're going to try and fix that, if you don't mind. So what we need to do is we need to figure out, we're re-examining, let me put it that way, we're re-examining what we have to understand: what are the core table stakes? What are the things that are absolutely positively part of the spec? What is the true element, the pure distilled NVMe, right? Now I think it's alcohol time or something. I don't know. We're looking for the distillery of NVMe, which basically means effectively what are the parts that are going to be the core? I mean, we've got our queue pairs, right?
Starting point is 00:25:25 We have our IO queue pairs. We have our admin queue pairs. Those are part and parcel of the core of what NVMe is. Those won't change. But what about other things? What are the optional things? What are the things that people will want to do for specialized use cases? So we need to separate out namespaces, which is a core aspect of what NVMe is,
Starting point is 00:25:44 and the type of namespace that we're going to use. Those are going to have to be separated out intellectually and logically. So when we start to look at this, we also have to look at longevity. A longitudinal study of what is going to be implemented versus what is not. What is a temporary optional feature versus a long-term optional feature? So we've really rethought the way that we're going to be repackaging the specification to that end. And so as a result, we're going to be changing the actual writing of the specification to mirror this longitudinal approach to handling
Starting point is 00:26:19 the specification. And we think that over time, that will wind up being a better way of developing to the specification consistently. Now, the tightrope, as I mentioned earlier, that we have to walk is how do we do this without being prescriptive. That's the big question. So, if you don't mind, the way that we've originally done things, it helps to kind of see where things go in a nice big picture, and Nick is the genius behind this graphic. So, for the sake of cadence. So we started off with the optional, the NVMe base specs, the original base specs,
Starting point is 00:26:55 and then we added NVMe over fabrics. And we also added NVMe over RDMA and eventually TCP. Fiber channel was its own thing. The T11 group was doing its own. So you could sort of have an imaginary dotted bubble to the side because that was a separate entity unto itself. And then we started adding in additional major categories from the NVMe specification. But it also had, a lot of these had very, very real implications for NVMe over fabrics as well.
Starting point is 00:27:32 And one of the questions that has come up often is, well, does this apply to NVMe only, or does this apply to NVMe over fabrics? A perfect example for this is asymmetric namespace access. Is that a fabrics thing, or is it a PCI thing? Is it an NVMe thing? Where does it actually fit? So as we start to add bubbles to all these different things and these bubbles start to overlap, the Venn diagrams get a little confusing. So what we've done is we've decided to change this around and have a core set of specification that includes both the NVMe spec and the fabric spec while maintaining what constitutes a transport versus what constitutes features. All right, so let's break this down a little bit more and see what it looks like. Now, if you look at the way that we currently have this, we have the NVMe base spec
Starting point is 00:28:15 in blue in the middle. We have the teal, it's teal, right? Anybody know the colors better than me? I feel like I should be in kindergarten again. Okay, this is red, this is blue, this is chartreuse. Anyway, NVMe over Fabrics has its own discovery service, the NVMe over Fabrics command set, various data structures, and a lot of these things kind of cross-pollinate. A lot of these things overlap. So what we're looking to do is saying, okay, well, which one of these actually is the same, and which one of these is not? So we've gone through and said, look, well, okay, well, we need to have a discovery service for both. We need to have a queuing model, the logs and the status codes for both. We need to have admin commands for both, but we don't necessarily need to have the NVM command set for both. That's only going to be on the base side.
Starting point is 00:29:01 And so on and so forth. So if we take these things in pink and we start to put those things together, then we really have a core set of NVMe. So it really kind of looks like this, where we have the pink-based specification with all those different things that were all common to both sets, but then we have the transport mapping separate.
Starting point is 00:29:24 We have the individual transport mappings, and we can include other transport mappings if they should eventually come to pass, and likewise the different types of feature sets that are not necessarily mandatory. Key Value is a great new namespace type, but it's not mandatory. Zoned Namespaces is not mandatory, but it is a great new feature. But all of these things are basically, if you will forgive the expression, plugged into the core
Starting point is 00:29:53 elements. So the way that the management, I'm sorry, the way that the specification is handled is that we have management on one side which kind of covers all of this, but at the same time you can individually identify what is necessary and what's not, and easily find where the dependencies fall. So you don't necessarily... I'm going to go back real quick here, going backwards.
Starting point is 00:30:15 This is very easy to understand if your perspective happens to be understanding how NVMe over Fabrics works. What's not very easy, however, or at least it's not as easy as this, is understanding where your dependencies fall. If your dependencies for this TCP transport mapping do not go into there, but all you have here, this is a much easier way of understanding where your dependencies fall.
Starting point is 00:30:42 Therefore, it's a lot easier to find out where to find the information in the specification. And, you know, the stuff that we've done with the change logs that Nick was talking about earlier is great for itemizing everything out once you've understood this. Right? So does this make sense? Any questions so far about what this is? Now, I also want to point out really quickly that there was a lot of thought that went into these kinds of philosophical questions about how NVMe should be presented to developers. So the idea here is to make it easier for developers to do the job. We are eager to get feedback as to how things
Starting point is 00:31:24 are or are not working. And actually to Jay's point there, we're still going through this process with some of the details. So there's work ongoing today with regard to what some of the specifics of these look like. So feedback is definitely welcome. And now we're going to get to the real meat of the stuff where the real smart guy comes up. And just before Dave gets going, I really want to underscore this. I said this earlier before, but when we started talking to Dave, he's like, oh, there's so much stuff that people need to know.
Starting point is 00:32:02 I'm like, okay, like what? And then he just started listing things off. And there aren't just more tests. There's different types of tests. Some tests are going away. Some tests have to be repackaged. Some tests have to be redone. So I want to bring this back to those bubbles at the beginning about avoiding the bears. You want to understand very quickly why going from 1.3 to 1.4 is easy.
Starting point is 00:32:31 This is why. Once you get to the compliance part of this, it will make a lot more sense as to why going from 1.4 to 2.0 is going to be a lot easier. Okay. Thank you, sir. All right. Thank you, Jay. So one of the things we want to point out with regard to compliance is that our bigger goal with compliance is to protect interoperability.
Starting point is 00:32:49 Our goal, even with the refactoring, is to protect interoperability. The steps from the 1.3 spec to the 1.4 spec is to ensure that products remain interoperable. And so we're putting in a lot of work into our test documentation, a lot of work into our test tools to enable people to protect the interoperability of their products. And so one of the things that we've seen people doing is in the steps from 1.2 to 1.3, even now in the steps from 1.3 to 1.4, they're using those test docs, they're using those test tools,
Starting point is 00:33:20 and running them against their products in development to ensure that they stay compliant and they stay interoperable, even running them weekly and nightly. So there's resources that have been provided for that. Now, as we've gone through creating these compliance tests over the last several years, one of the things we've been paying a lot of attention to is whether a product is advertising support for the 1.2 spec or the 1.3 spec or now the 1.4 spec, a host is going to treat that differently. A host is going to look at that version field and then behave accordingly and
Starting point is 00:33:57 expect certain types of features, expect certain types of behaviors based on that. We've put that into the compliance test specifications as well. We've put that into the test tools as well. So our tests are going to behave differently depending on what version of the specification that product is advertising support for. So that's one of the things that we're doing in order to preserve compliance and to preserve interoperability.
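In practice that means keying off the Version (VS) property. A small C sketch of that idea follows; the VS field layout (major, minor, tertiary) is standard, but the sample policy function is just an illustration of gating expectations on the advertised version, not the actual UNH test logic.

```c
/*
 * Sketch of version-aware expectations, as described above. The VS property
 * layout (major 31:16, minor 15:8, tertiary 7:0) is standard; the sample
 * policy -- only expect 1.4-mandatory behaviors when VS reports 1.4.0 or
 * later -- is an illustration, not the actual UNH test plan.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static inline uint32_t nvme_ver(uint16_t mjr, uint8_t mnr, uint8_t ter)
{
    return ((uint32_t)mjr << 16) | ((uint32_t)mnr << 8) | ter;
}

static bool expect_1_4_broadcast_nsid_rules(uint32_t vs)
{
    /* Example: Get Features on a namespace-specific feature with
     * NSID = FFFFFFFFh must be rejected by a 1.4 controller, while an
     * older controller may have accepted it.                              */
    return vs >= nvme_ver(1, 4, 0);
}

int main(void)
{
    uint32_t vs = nvme_ver(1, 3, 0);   /* pretend the product reports 1.3 */

    printf("VS reports %u.%u, apply 1.4 broadcast-NSID checks: %s\n",
           vs >> 16, (vs >> 8) & 0xFFu,
           expect_1_4_broadcast_nsid_rules(vs) ? "yes" : "no");
    return 0;
}
```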
Starting point is 00:34:33 going to magically create a bunch more tests. What it means is that just as there's going to be portions of the base specification that move, there's going to be things get added into the base specification, things are going to move in the test specification that move. There's going to be things that get added into the base specification. Things are going to move in the test specifications as well. But the refactoring in and of itself isn't going to add a bunch more tests. But it is going to be some complexity in determining where the tests that apply to your product now reside and where they're documented. And I've got a couple slides to make that a little bit clearer. So this is the situation with specifications and test documentation today. There's three big specs that
Starting point is 00:35:11 have come from NVMe, the base specification, the management specification, and the fabric specification. And we have a test document for each of those. So depending on which spec affects your product, you're going to look in a different test document to determine what you need to do for compliance. This is the situation today. When we go into refactoring, that's going to change. The fabric specification goes away. Parts of that go into the base specification. We end up with transport specifications for the different transports.
Starting point is 00:35:41 There's going to be independent compliance documents for that. Even command sets, which we just alluded to, there's going to be separate compliance documents for that. So, again, the number of places that need to be checked, the number of documents you need to look at for a particular product, that's expanded, but the number of tests hasn't expanded. And I'm going to show that on the next couple slides. So it's not the refactoring that's going to cause more tests.
Starting point is 00:36:08 It's new TPs, which are new features. It's new ECNs, clarifications in the specification. Those kinds of things will create more tests, but the refactoring in and of itself will not. So to illustrate that, I'm going to do another today and tomorrow comparison. So today, if we start on the top row in this diagram, if you have a basic run-of-the-mill NVMe SSD using PCIe as a transport,
Starting point is 00:36:39 the test specification you're going to want to look at is the NVMe base spec conformance document. In that document, there's about 270 tests. Now, the ones that apply to your product, it might be a little less than that depending on feature support, but defined in that document, about 270 tests. If we go to the middle row here, you've got an NVMe SSD that's implementing the management interface. So naturally, there's more tests, about 323 tests, because now there's two test specifications you want to look at. If the product is something like an all-flash array, there's some things in the base specification,
Starting point is 00:37:13 in fact, quite a few things in the NVMe base specification that you need to pay attention to. There's also things in the fabric specification that you need to pay attention to. All told, that's about 217 tests. Again, feature support, whether you support certain optional features or not, is going to change that total test number that applies to a specific product, but this is what's defined today. Now, again, looking at tomorrow, you can see there's more test specifications that you need to look to, but the tests have simply found new homes because the number of tests has not increased just because we've done the refactoring. So again, if you have an NVMe SSD using PCIe as a transport, it doesn't have a management interface, that number of tests is still about 270. Now I say about because there could be some TPs that
Starting point is 00:37:57 get applied, there could be some ECNs that get applied, that may add a few tests, but again, the refactoring in and of itself isn't adding a number of tests. So now we want to look into a couple examples, a couple of things that were changed from 1.3 to 1.4 that are very important for us to pay attention to, that if we don't get right, are going to cause compliance problems, are going to cause interoperability problems. And some of this is a little bit of some of the dirty laundry that we referred to earlier, but we're going to dig a little deeper and see what can happen here. So one of the things that got cleared up in the 1.4 spec was the proper use of that NSID of all Fs.
Starting point is 00:38:40 And as was alluded to, there were some cases where the use of that NSID were well-defined, and then there were some cases where it was maybe optional. And then actually there were implementations that kind of took how it was defined for one command and then used that in another command, and basically were doing some undefined behavior. And so a lot of effort went in on the 1.4 spec to clear that up. The example we're looking at here is a get feature command for a namespace specific feature. If that get feature command gets sent with an NSID of all Fs,
Starting point is 00:39:13 what exactly is gonna happen? Now under 1.3 there was a controller who might accept that. So it could get that get feature command and say, yeah, I know what I'm going to do with that. Here's the information about my namespaces and how they are persistent across a power loss. And so the controller can send that off to the host. And the host gets that information, but the problem here is that the host and the controller, they weren't talking the same language about that.
Starting point is 00:39:44 So if there's a power loss, the host gets upset with what happened. Not because of the power loss. We built the protocol to be resilient to those kinds of things, to be able to deal with those kinds of things. The problem is that the controller and the host had different expectations about how those name spaces would be, or reservations on those name spaces would be persistent across power losses. So that was a little bit of a gap in NVMe 1.3. So in 1.4, we cleaned it up. If that get feature command is sent with an NSID of all Fs for a 1.4 compliant controller, now the controller is required to write back and say, that's an invalid namespace ID and report an error. So the host gets that error. And then what can it do? How can it accommodate
Starting point is 00:40:23 for that behavior? Well, it can start to query each of those namespaces individually. So it can send that get feature command again for namespace 1 and get actually the information back for that particular namespace. It can send it again for namespace 2 and get the information back for that particular namespace. So now the persistence may actually be the same between these two products. The power loss might still happen, but now the host and the controller have the exact same expectation about how those things will behave. That's one little adjustment that was made with regard to the all Fs NSID.
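A hedged C sketch of that host-side fallback is below. The broadcast NSID value and the Invalid Namespace or Format status code are taken from the spec as I read it, while get_features(), the feature identifier, and the namespace list are hypothetical stand-ins for a real admin-command path.

```c
/*
 * Host-side sketch of the 1.4 behavior just described: Get Features for a
 * namespace-specific feature with the broadcast NSID now fails, so the host
 * falls back to asking each active namespace individually.
 * Assumptions: broadcast NSID FFFFFFFFh, "Invalid Namespace or Format" as
 * status 0Bh, a Reservation Persistence feature ID of 83h, and get_features()
 * as a hypothetical stand-in for a real admin-command path.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NSID_BROADCAST        0xFFFFFFFFu
#define SC_INVALID_NS_FORMAT  0x0B

/* Hypothetical helper: returns the NVMe status code, fills *value on success. */
static int get_features(uint8_t fid, uint32_t nsid, uint32_t *value)
{
    /* Stub modelling a 1.4 controller: reject broadcast for this feature. */
    if (nsid == NSID_BROADCAST)
        return SC_INVALID_NS_FORMAT;
    (void)fid;
    *value = 1;                        /* pretend persistence is enabled   */
    return 0;
}

int main(void)
{
    const uint8_t fid = 0x83;           /* assumed Reservation Persistence FID */
    uint32_t active_nsids[] = { 1, 2 }; /* would come from an Identify list    */
    uint32_t value;

    if (get_features(fid, NSID_BROADCAST, &value) == SC_INVALID_NS_FORMAT) {
        /* 1.4 path: there is no single broadcast answer, so query per NSID. */
        for (size_t i = 0; i < sizeof(active_nsids) / sizeof(active_nsids[0]); i++) {
            if (get_features(fid, active_nsids[i], &value) == 0)
                printf("nsid %u: feature 0x%x = %u\n",
                       active_nsids[i], (unsigned)fid, value);
        }
    }
    return 0;
}
```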
Starting point is 00:40:59 One more, and this is the one that Nick alluded to earlier, is how that namespace management command with the delete action will behave with that all Fs NSID. So under 1.3, if that namespace management command came in for all Fs, the controller might delete or he might not. And so that namespace management command comes in, and maybe the controller says, I don't need to do anything with this. It's an optional thing, whether I'm going to delete this or not. Then the host goes to double-check that that delete action occurs with the identify command. The controller sends back the namespace list. All the namespaces are still there. That's not what the host was expecting.
Starting point is 00:41:40 That can cause a problem. In 1.4, that activity or that behavior was cleaned up. Now that all Fs namespace ID is treated as applying to all namespaces. So the host sends that namespace management with delete and NSID of all Fs and the controller can write back and say, yep, I'm going to delete those.
Starting point is 00:42:02 It actually deletes it. Then the host can do the double check to make sure that those namespaces were deleted with the identify command, and the controller writes back with exactly what the host was expecting. So here we see that what was expected to be deleted actually got deleted.
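Here is a minimal C sketch of that delete-then-verify sequence from the host's point of view. The helpers and the stubbed controller state are hypothetical; the only behavior being modeled is the 1.4 rule that a broadcast-NSID delete removes all namespaces, which the host then confirms with an Identify active namespace ID list.

```c
/*
 * Sketch of the delete-then-verify flow on a 1.4 controller.
 * ns_mgmt_delete() and identify_active_list() are hypothetical helpers with
 * stubbed controller state so the example runs standalone.
 */
#include <stdint.h>
#include <stdio.h>

#define NSID_BROADCAST 0xFFFFFFFFu

/* Stubbed controller state. */
static uint32_t fake_active[1024] = { 1, 2 };
static uint32_t fake_count = 2;

static int ns_mgmt_delete(uint32_t nsid)
{
    if (nsid == NSID_BROADCAST)
        fake_count = 0;                /* 1.4: broadcast means every namespace */
    return 0;
}

static uint32_t identify_active_list(uint32_t *out, uint32_t max)
{
    uint32_t n = fake_count < max ? fake_count : max;
    for (uint32_t i = 0; i < n; i++)
        out[i] = fake_active[i];
    return n;                          /* number of active namespace IDs */
}

int main(void)
{
    uint32_t list[1024];

    ns_mgmt_delete(NSID_BROADCAST);    /* "delete everything" */

    uint32_t remaining = identify_active_list(list, 1024);
    printf("%s: %u namespaces still active after broadcast delete\n",
           remaining == 0 ? "PASS" : "FAIL", remaining);
    return 0;
}
```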
Starting point is 00:42:17 The host and the controller have the same understanding. They're on the same page about how that namespace management command actually is implemented. And so these are things that we're going to be checking with compliance with regard to 1.4. So a quick summary about compliance. Kind of like we said earlier, if there's one thing that you're going to walk away from this presentation for, it's going to be that moving to 2.0 is going to be much easier from 1.4 than from 1.3.
Starting point is 00:42:47 And so we're putting in a lot of effort right now, both in the test specification and the test tools, to ensure that the community can get 1.4 compliance right. So with that, we'll hand it back to Jay to bring us home. Bring it on home. Okay. So we try to time this in such a way that we would be able to have some questions at the end, and I think we did an okay job. But, again, I just want to reinforce a couple of the different things that we're trying to get across here.
Starting point is 00:43:19 Once you go through this particular type of a presentation, you start to get a little antsy because it seems like, well, this is kind of a duh moment. It all makes sense now. But one of the things that we're trying to combat is some of the confusion that's been going on, both in terms of the end user space as well as the developer space, about how certain things are supposed to behave,
Starting point is 00:43:38 both at the really low level and also at the high definitional level. So to that end, it's best not to wait until the refactoring comes out, which will happen sometime next year. The best thing to do right now is to understand, especially the stuff that's going on in 1.4. If you haven't taken a look at the list of changes that's going on in 1.4,
Starting point is 00:43:58 you really should do that ASAP because it's extensive. And a lot of the mandatory changes for expected behavior are going to be particularly salient, especially since we can leapfrog and hopscotch back and forth between different specifications for optional features. If you're thinking about taking in TPs and putting them into a 1.2.1 specification, you're going to get some unexpected results. So start to think about this transition now
Starting point is 00:44:26 and start to plan a migration strategy, for lack of a better word. It's not really migration, but I hope you understand what I'm talking about. But a plan of action as you move forward to go from 1.4 to 2.0. One of the other things that we're also trying to key in on is to get really good feedback
Starting point is 00:44:44 for the kinds of needs that you have for what's confusing in your own implementations as well. We want to be sure that we're actually addressing real-world problems, not just the theoretical ones that happen and that we get on a Thursday morning call. So to that end,
Starting point is 00:45:02 the idea here overall is that we really want to try to help developers be able to be consistent. Nobody likes developing something that, A, isn't used, or, B, is used incorrectly, or, C, doesn't work. So we're trying to make that as easy as possible, given how difficult life is already.
Starting point is 00:45:21 Now, before we get too far into it, any questions that we can answer? Any specifics about some of the changes for 1.4, for example? Yes, sir? You noted that the T11 community has a wider channel of data. Correct.
Starting point is 00:45:36 What's the relationship between NVM Express and T11? Do you guys get along? And how do you actually make that big pink horizontal bar where you're making things in common go in there? Excellent question. So I'm going to repeat the question for both the recording as well as the people in the back. The question really revolved around the relationship between NVM Express
Starting point is 00:46:00 and the T11 group, which manages the fiber channel standards. And in particular, the question was, what's the relationship between NVM Express and the T11 group, which manages the Fibre Channel standards. And in particular, the question was, what's the relationship between NVM Express and T11? And the second question is, does the T11 group use the new refactored approach to doing NVMe? Did I get that right? Okay. Is that a five minutes? Okay. It was a glare.
Starting point is 00:46:20 I couldn't see. Excellent question. So I'm also on the board of directors for Fiber Channel. And it turns out that there are several members of T11 in the technical working group for NVM Express. There's a lot of cross-pollination on that end. The relationship between T11 and NVM Express, there's a memorandum of understanding between the two groups, and they work hand-in-hand.
Starting point is 00:46:44 What goes on inside of T11 is tightly coupled with what goes on in NVM Express. Now, to your other question about does fiber channel use the core elements of it, the way that NVM Express is layered is that you have the NVM Express layer on top, then you have an NVM over fabrics binding, and then you have the fiber channel or other transport underneath. The key thing is that the bindings between the fiber channel and the NVMe layer, that bindings layer, those are worked on jointly. And the fiber channel specification will match what's going on inside of the bindings,
Starting point is 00:47:17 but the binding is handled by NVMe. Now the good news is that the same people are on both committees. So it makes it a lot easier. And on top of that, the refactoring effort won't change that binding. That's right. It's just a matter of how the specification is called out and making that clear inside the specification that that binding doesn't exist that way.
Starting point is 00:47:34 That's correct. So from a practical working relationship, that part won't actually change. I would think that on the fiber channel side, you'd have less skew than with, say, the IP or the channel side. Would have less... Skew. Skew. The question is that Fiber Channel would have less skew.
Starting point is 00:47:59 It has less skew because it's a much more mature protocol for storage. Whereas RoCE is really kind of transferring things over from HPC into a storage world. So we've got 25 years of fiber channel with a very well understood and well defined relationship between hosts, targets, and switches. Whereas it's not quite so well defined in the Ethernet space. Now, InfiniBand also has a well-defined area, too, but for whatever reason, really RoCE and TCP and fiber channel, those happen to be the ones
Starting point is 00:48:33 that are getting most questions about. Any other questions before we have to close up shop? Yes, sir? Can I just make one suggestion? I don't know if the web page is changing. Ooh, we have suggestions. All right. It's very useful, but one thing I could suggest is that maybe it could be a downloadable file.
Starting point is 00:48:52 Yes. We'll do that. Yeah. No problem. Did you just volunteer for work? Oh, no. No, I volunteered for work. No, I volunteered Liz for work.
Starting point is 00:49:04 Oh, okay. Poor Liz. That's our admin, by the way. Poor Liz. It's a good suggestion. We'll do that. It might take a little while, but we'll do that. Yeah, as the number of items gets bigger,
Starting point is 00:49:18 then having an easier way to track is what we're looking to try to do. You have to remember we're also working with a bunch of people who are really effectively volunteering to do this. Now, if you want to pay me a lot of money to do it, that's a different question. Alright, last question. Anyone? Bueller? Bueller?
Starting point is 00:49:38 Alright. Thank you very much, gentlemen and ladies. I appreciate it. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers dash subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the Storage Developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
