Storage Developer Conference - #202: What is the NVM Express® Flexible Data Placement (FDP)
Episode Date: February 21, 2024...
Transcript
Hello, this is Bill Martin, SNIA Technical Council Co-Chair.
Welcome to the SDC Podcast.
Every week, the SDC Podcast presents important technical topics to the storage developer community.
Each episode is hand-selected by the SNIA Technical Council from the presentations at our Annual Storage
Developers Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast Episode 202. Good morning, everybody. Welcome to
this session on TP-4146. In case you're in the wrong room, TP-4146 is a spec addendum to NVMe that adds flexible data placement,
which is one of a couple of placement modes that are part of the NVMe standard.
And from a high level, you know, the whole goal of placement modes is giving the host system much more control over where and how data is placed within the drive among the available flash.
There are several benefits: you can decrease cycling on the NAND, you can get improved performance, you can get improved write amplification. And essentially it's a way of unlocking performance that is typically kept inside the drive for a typical block-based drive.
So as our customers have gone up and down their stack and removed all the inefficiencies that are on the host side of the stack, they've started focusing over the last several years on removing the remaining
inefficiencies that they see that are in the SSD. So that's where there were several different
placement modes that came about, but the one that I think has the most traction around it now is TP-4146,
flexible data placement. So at a high level, if that's not what you want to hear,
you're probably in the wrong presentation.
So this presentation is going to go more over the standard,
just to give you an idea of what is in the FDP TP,
and what will be part of the standard.
Obviously, as you know, to build out a drive based on NVMe, you got
to adhere to the remaining parts of the spec. So we're not going to get into a lot of NVMe details.
I'm just going to try to highlight as a software developer what you have to
be concerned with from flexible data placement.
So the first thing that you have to understand is, as I mentioned,
all our customers have removed all the inefficiencies in the software stack
and they're now looking at the drive.
And one of the ways to get that efficiency on the host side
is to be able to explicitly place data into sections of the NAND inside the drive
and giving that control to the host is what FDP is all about.
So the first thing that the drive has to share with the host
is it has to give the host the geometry that the host is going to use
to steer I/O to particular sections of the NAND.
And it can be physical allocation of NAND or a logical allocation of NAND. But the diagram here shows an overview of a drive.
And essentially within this cartoon of this drive here, there's what's called reclaim units.
And a reclaim unit can be an erase block.
It can be a stripe of NAND.
It can be whatever configuration the drive or the manufacturer of that drive decides to
set up. And there's a mechanism where we tell you what, and I'll get to this on a couple of slides
forward, where we tell you the configuration of that NAND. So you can have a physical set of NAND
that's sliced and diced in different ways. And you'd check the configurations that are available in that
drive to figure out what that NAND arrangement is. But again, looking at this cartoon
here, essentially I have an FDP configuration. It's going to consist of one or more of what we
call reclaim units. And those reclaim units are combined together into what's called a reclaim group.
So what will happen is as I write to one of these reclaim units and it gets full and it gets erased,
it pretty much stays within that reclaim group.
So those are the first two things that the host system needs to understand.
And again, this will be defined by the configuration. So the host uses the identify
command to get the configuration and it would tell you what the NAND arrangement is.
Then, as I mentioned, with the placement modes, the whole concept around them is to steer I/O to particular configurations or groupings of NAND.
So in this instance here, I show a yellow, I guess that's a green, and a red, what's called a reclaim unit handle,
which is essentially a pointer to a reclaim unit that sits within a reclaim group.
So in this example here, I've got three reclaim unit handles,
and they each point to a unique reclaim unit within a unique reclaim group.
So that's how I route I/O.
So using, say, the red handle pointing to the first reclaim group on the right,
if I wanted to write something to that specific portion of NAND,
I would use that reclaim unit handle in the NVMe write command,
and it essentially goes into the DSPEC and DTYPE fields of the NVMe write command,
and then that I/O would be routed to that specific reclaim unit. Likewise, if I wanted to write to, say, on the far left side
to the reclaim unit that's pointed to by the green handle,
I would use that handle,
and then those IOs would go to that particular reclaim unit.
So as you can see, data is completely isolated physically
because the stuff that I
wrote with the red handle goes to a completely different section of NAND than the stuff that
was written with the green handle. So within this physical arrangement of NAND, it's basically hung
inside an endurance group to maintain compatibility with the overall NVMe spec.
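To make the geometry concrete, here is a small conceptual sketch in Python. It is illustrative only; the class names, the number of reclaim units per group, and the reclaim unit size are all invented, and this is not any real driver or library API. It just models reclaim groups, reclaim units, and reclaim unit handles, and shows how a write tagged with a handle lands in whatever reclaim unit that handle currently references.

```python
# Conceptual model of the FDP geometry described above.
# Illustrative sketch only; names and sizes are hypothetical, not an NVMe API.
from dataclasses import dataclass, field

RU_SIZE = 4  # hypothetical reclaim unit capacity, in "blocks"

@dataclass
class ReclaimUnit:
    capacity: int = RU_SIZE
    written: int = 0

    def space_left(self) -> int:
        return self.capacity - self.written

@dataclass
class ReclaimGroup:
    units: list = field(default_factory=lambda: [ReclaimUnit() for _ in range(8)])

@dataclass
class ReclaimUnitHandle:
    group: ReclaimGroup   # which reclaim group this handle points into
    unit_index: int = 0   # which reclaim unit it currently references

    def write(self, blocks: int) -> None:
        """Route a write to the reclaim unit this handle references."""
        ru = self.group.units[self.unit_index]
        ru.written += min(blocks, ru.space_left())

# Three handles, each pointing into a different reclaim group, like the
# colored handles in the diagram: writes tagged with different handles
# land in physically separate NAND.
groups = [ReclaimGroup() for _ in range(3)]
red, green, blue = (ReclaimUnitHandle(g) for g in groups)
red.write(2)
green.write(1)
print(groups[0].units[0].written, groups[1].units[0].written)  # 2 1
```

The point is that the handle, not the LBA, determines which physically separate region of NAND the data lands in.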
And an endurance group, again, depending on what the vendor decides to define as a configuration for the drive,
there could be one or more configurations within an endurance group.
So you would have to query that to find out what are the configurations the drive is going to support,
what is the arrangement of those within the endurance group,
how do I want to configure that drive, which one of those configurations do I want to select for
that drive? You as the host do not get to make that up. You can only query the capability from
the drive and implement one or more of those capabilities depending on what the system will
allow you to do. So I already mentioned this, but this build further lays out exactly what's happening. So I pick a particular reclaim unit handle,
and then when I do that write, using that particular reclaim unit handle to that particular
reclaim group, that write goes to the reclaim unit referenced by the highlighted reclaim unit handle in the diagram there.
So essentially, the host now has the ability to steer writes at the granularity that's defined by the configuration.
So in addition to hanging within the specification for endurance groups, I have to hang within namespaces. So when you want to set up an FDP configuration, you identify the namespace and you say, I want this namespace to be an FDP namespace.
And then that basically sets up that drive namespace to work with FDP.
What's shown here on the bottom left side is a placement handle and a reclaim unit handle identifier.
So the drive will have reclaim unit handle identifiers that it specifies in a log page.
But the host system, if they want to, they can provide a table which says,
you gave me these reclaim unit handle identifiers.
Here's the value I would like to use for those.
You can use it that way by specifying your own handles or you can use the controller defined handles.
Either way is an option that the host system
can decide however they want to do that.
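A minimal sketch of that optional mapping, with made-up identifier values (none of these numbers come from the talk or the spec): the controller reports its reclaim unit handle identifiers, and the host may build its own table whose index is the placement handle it will use on writes.

```python
# Illustrative sketch, hypothetical values, not a real API: at namespace
# creation the host may pass a list of controller-defined reclaim unit
# handle identifiers; the index into that list is the placement handle
# the host uses on writes.
controller_ruh_ids = [0, 7, 12, 31]   # identifiers reported by the drive (made up)

# The host chooses which of those it wants for this namespace, in its own order.
placement_handle_table = [7, 31, 12]  # handle 0 -> RUH 7, 1 -> RUH 31, 2 -> RUH 12

def ruh_for_placement_handle(ph: int) -> int:
    """Resolve the placement handle used in a write to an RUH identifier."""
    if ph >= len(placement_handle_table):
        raise ValueError(f"invalid placement handle {ph}")
    return placement_handle_table[ph]

print(ruh_for_placement_handle(1))  # 31
```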
This is incorrectly titled log pages, but
there is a bunch of stuff that you can query the drive for
to identify exactly what the capabilities are
and how you want to use them.
Yeah.
Oh, sorry.
Yeah.
Yeah.
So I'm sorry if I'm missing this.
I'm going backwards.
This one?
Yeah, and this is probably my ignorance.
Where's the build?
Okay.
Why would you want a reclaim unit in each reclaim group?
It depends on the granularity that you want to work it out at. So say this is a
trivially simple example here. I've got four die. If I want to slice up my configuration where those
dies are separately addressable, I can steer stuff to the right die or the left die.
That's why you would want the reclaim groups to be different.
Did that answer your question?
I thought that the reclaim unit handle has a reclaim unit in each group.
So I understand that you would probably want a reclaim group to be a die.
Well, that's one possible configuration.
I can also create this as a stripe,
in which case that red number at the top
would be pointing to that entire stripe as a unit.
Does that answer your question?
Mike, can you help?
Is there something?
All right, so the number of reclaim groups is a configuration.
Some configurations are one reclaim group,
and the controller is going to, as writes come in,
stripe writes across all the dies
and take the burden of managing that off the host.
Maybe a configuration is I want to expose every die,
and so I expose every die, and
with one namespace and one
reclaim unit handle, that
host through that namespace can write
to each die by specifying
the reclaim group there. Now
when the host does that, the host has a responsibility
to stripe the data across
to deal with endurance and write it
across all the dies. But this
configuration allows that to be exposed to
the host, if that's what a customer and vendor want to agree on for a configuration.
All right, so some of the information that you would have to deal with
when working with an FDP-based drive is, again, I mentioned configuration.
Mike just gave a good explanation of that.
I may potentially say I've got several sets of configurations,
some down to an erase block, some at a stripe level,
some potentially at a super block level.
So there are some pages that articulate that information to you
on what the configurations are.
There are some pages on identifying
what the reclaim unit handle usage is,
a couple on FDP statistics that are unique
from the regular NVMe statistics,
and then FDP events, again,
that are unique from the regular NVMe events.
So the way this works is you query the page
to find out what the configurations are,
and essentially the detail that's in the page is it shows the number of configurations.
So as I mentioned, I could support one configuration.
I could support 100 configurations if I want to.
Ideally, though, as a device vendor, I will support a limited number of configurations,
a very limited number of configurations because it's a qualification problem.
It blows up the time to qualify a drive with multiple configurations.
So we're kind of settling in the industry on two configurations.
One is based on an erase block granularity,
and the other one is based on a stripe granularity.
And even among those configurations,
there might be some variation on maybe what the granularity is of the stripe or whatever.
So the additional information is it tells you what the size is,
and then there's basically a descriptor.
And within that descriptor, there's a bunch of information
that tells you exactly what that configuration is.
So if I've got two configurations, I'll have two of these descriptors.
And they tell me things like what the attributes are for the configuration, and what the vendor-specified size is. And as far as the
attributes go, it tells me whether the configuration is valid or not, what the configuration
of the volatile write cache is for that configuration, and what the reclaim group ID format is.
And then other information tells you what the size is,
how many reclaim groups there are.
As Mike mentioned, there can be one if you're potentially in a stripe,
and there can be many if you're going with the die configuration.
The number of reclaim unit handles that you will support.
So I could support one, I could support 16.
It's up to the vendor to describe that. But
this is all information that you as a host system are going to want to query from the drive to
understand what that configuration actually is so that you can effectively leverage the
capabilities that we're providing. There's also the maximum number of placement IDs, as I mentioned.
The number of namespaces that we would support in that configuration,
the reclaim unit nominal size, if there is a variable size potentially,
the reclaim unit time limit. So a reclaim unit of NAND has to be closed after a particular period
of time. So, you know, we would be telling you how long you can keep that open before,
as the drive, I'm going to close that reclaim unit.
And then there's basically descriptors for each of the reclaim units where you can get information,
more detailed information on each of the reclaim units.
And as far as the descriptor goes, I would tell you what the media type or what the handle type is,
whether it's initially isolated or persistently isolated.
The difference between those two is, if you think of that diagram that I showed with the different reclaim groups, if I'm initially isolated, the placement will try to stay within
a reclaim group. But if for some reason I run out of storage or space in that reclaim group,
I can move it to another reclaim group.
With persistently isolated, I would stay within that reclaim group as I do media events
or any relocation of data.
So that's something that you would have to know about.
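As a rough summary of the fields just described, here is an illustrative Python data model of one FDP configuration descriptor. The field names and example values are invented for readability; they follow the talk, not the exact byte layout in the specification.

```python
# Rough data model of what an FDP configuration descriptor tells the host,
# following the fields described in the talk. Names and example values are
# illustrative only, not the spec's byte layout.
from dataclasses import dataclass

@dataclass
class RuhDescriptor:
    ruh_type: str                  # "initially_isolated" or "persistently_isolated"

@dataclass
class FdpConfigDescriptor:
    valid: bool                    # attribute: is this configuration usable
    volatile_write_cache: bool     # attribute: volatile write cache behavior
    reclaim_group_id_format: int   # how the DSPEC bits are split (RGIF)
    size: int                      # descriptor/vendor-specified size
    num_reclaim_groups: int        # 1 for a stripe config, many for per-die
    num_reclaim_unit_handles: int
    max_placement_ids: int
    num_namespaces: int
    reclaim_unit_nominal_size: int # bytes
    reclaim_unit_time_limit: int   # seconds the drive keeps an RU open
    ruh_descriptors: list          # one RuhDescriptor per reclaim unit handle

# Two hypothetical configurations: erase-block granularity vs. stripe granularity.
configs = [
    FdpConfigDescriptor(True, False, 4, 0, 16, 8, 8, 32, 64 << 20, 600,
                        [RuhDescriptor("initially_isolated")] * 8),
    FdpConfigDescriptor(True, False, 0, 0, 1, 8, 8, 32, 1 << 30, 600,
                        [RuhDescriptor("persistently_isolated")] * 8),
]
print(len(configs), configs[1].num_reclaim_groups)  # 2 1
```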
So reclaim unit handle, on that one slide where I showed the reclaim unit
handles and whether it has a definition provided by the host, this log page just essentially
reflects that. If you gave the system a host-specified value for the reclaim unit handle,
this is where you would see that mapping. Then there's some statistics that are unique
to flexible data placement,
and this is, again, over and above what's already in NVMe. Sorry.
I thought I had a build on there, but there's stuff like the number of bytes written
by the host system and the number of bytes written in the media. So you can use this
as feedback to help you understand if you're getting the isolation that you want and getting
the performance characteristics out of the drive that you're trying to capture by moving to a
placement mode to begin with. Then, as I mentioned, same with statistics and same with the other stuff,
there's unique FDP events that are part of the standard.
Again, they're unique over and above the events that are part of NVMe.
And the events that you'll get from FDP are, you'll get an articulation
that tells you what the event type is.
There's flags associated with it,
which placement identifier this event is relative to,
timestamp on it, again, which namespace it was part of,
and then there's more specific detail on exactly what those events are. So when you get an FDP event,
it'll tell you whether it's an event caused by a host interaction,
like an RU is not fully written to capacity, or I don't know what three is in there for, but whether the active time limit
has been exceeded. So that gets articulated to the host that way. Whether there's been a controller
level reset inside the drive, and whether, if you did a write,
it had an invalid placement ID. Then, as a vendor, I can put whatever vendor-specific information I want in that event also. And then there's also potentially events along with
what's happening inside the drive, where if I reallocated the data for some media event reason,
you would get an event. You could get an event for that.
Same with if I implicitly modified a reclaim unit handle,
that is I bumped it to a different reclaim unit.
And again, there's also media-specific stuff here.
So same with NVMe.
If you subscribe to these events and you set the event type flag,
you'll get this information.
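Here is a hedged sketch of how a host might represent and dispatch those FDP events. The enumeration names, record fields, and handling policy are illustrative only, not the spec encoding.

```python
# Sketch of interpreting FDP event records, based on the event types
# mentioned in the talk. Enum values and record layout are illustrative.
from dataclasses import dataclass
from enum import Enum

class FdpEventType(Enum):
    RU_NOT_FULLY_WRITTEN = "ru_not_written_to_capacity"
    RU_TIME_LIMIT_EXCEEDED = "ru_active_time_limit_exceeded"
    CONTROLLER_RESET = "controller_level_reset"
    INVALID_PLACEMENT_ID = "invalid_placement_identifier"
    MEDIA_REALLOCATED = "media_reallocated"
    RUH_IMPLICITLY_MODIFIED = "ruh_implicitly_modified"

@dataclass
class FdpEvent:
    etype: FdpEventType
    flags: int
    placement_id: int
    nsid: int
    timestamp: int      # host correlates this with its own logs

def handle_event(ev: FdpEvent) -> None:
    # Host-side events usually mean the host software is not following the
    # protocol the way it intended; drive-side events are informational.
    if ev.etype in (FdpEventType.RU_NOT_FULLY_WRITTEN,
                    FdpEventType.RU_TIME_LIMIT_EXCEEDED,
                    FdpEventType.INVALID_PLACEMENT_ID):
        print(f"host-side issue {ev.etype.value} on placement id {ev.placement_id} at {ev.timestamp}")
    else:
        print(f"drive-side event {ev.etype.value} on nsid {ev.nsid}")

handle_event(FdpEvent(FdpEventType.RU_TIME_LIMIT_EXCEEDED, 0, 1, 1, 123456))
```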
So, yeah.
It's the same.
It's no different from regular NVMe and how you subscribe to events today. In this particular case, there is no asynchronous event notification.
It's currently a pull model, but there's a timestamp, because this is mostly affecting how the host figures
out why they aren't behaving properly to utilize FDP, and a timestamp can be used to look at
their internal logs to figure out what they were doing when the event happened. One of the things about FDP is writes are going to succeed
even if the protocol is being violated.
So, as I mentioned, there's additional stuff
that you have to deal with from the host side system
when you're
working with FDP. So I drew together two slides here to give a very, very, very high level overview
from a software perspective of what the host needs to deal with when it tries to work with FDP. So,
of course, the first thing is you have to configure an FDP drive if you're going to use it. So
essentially what you do is you check the identify controller data structure
and you see if the FDP support bit is set to one.
If it is, then the drive can support FDP.
So your next step is you go query it for the configurations that the drive supports.
And then once you do that, as I showed on the configuration log pages,
you know, you read all that information.
You figure out which configuration that you want to use.
And then you use the set feature.
Again, set feature I'm not talking about here because that's one of the standard NVMe commands.
You use the set feature command to put the drive into the FDP configuration that you've chosen.
And then once you've done that, as I mentioned with the previous stuff,
it's optional for you to configure the placement handles.
If you want to provide that list, you can as the host system,
or you can use the list that the drive supports.
Another step that's optional is you set up the events.
And then, again, the set feature.
To put it into FDP mode is FDPE.
You set that to one, and then you're good to go for placement modes within that drive. Do you make sure that all the time?
Well, if you've got a namespace that's already up, yeah, you have to start from scratch, right?
Generally, that implies that you have to do a format also.
But, yeah, that's correct.
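A minimal sketch of that bring-up sequence, assuming stub functions in place of the real NVMe admin commands (Identify, Get Log Page, Set Features, namespace creation). None of these function names are a real library API; they just mirror the steps described above.

```python
# Minimal sketch of the FDP bring-up sequence described above. Every
# function is a stub standing in for real NVMe admin commands.

def identify_controller():
    return {"fdp_supported": True}                     # stubbed Identify data

def get_fdp_configurations():
    return [{"id": 0, "granularity": "erase_block"},   # stubbed config log page
            {"id": 1, "granularity": "stripe"}]

def set_feature_fdp(config_index, enable):
    print(f"Set Features: FDP config {config_index}, FDPE={int(enable)}")

def create_namespace(placement_handle_table):
    print(f"Create namespace with RUH table {placement_handle_table}")

# 1. Check the FDP support bit in Identify Controller.
if identify_controller()["fdp_supported"]:
    # 2. Read the configurations the drive offers and pick one.
    chosen = get_fdp_configurations()[1]["id"]
    # 3. Select that configuration and enable FDP (FDPE) via Set Features.
    set_feature_fdp(chosen, enable=True)
    # 4. Optionally supply a host-chosen placement handle table at namespace
    #    creation (pass None to accept controller defaults), and optionally
    #    enable the FDP events you care about.
    create_namespace([0, 2, 3])
```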
So what do you do when you're – so after you've configured the drive,
put in the configuration that you want and the drive's ready to go.
I probably left that off here, set up the namespace.
But essentially, once it's running, you do the writes,
and Mike's going to go into detail on exactly how the writes work,
but you add the parameter of the reclaim unit handle
or the placement ID to the write operation.
And then pretty much from there, what you do is you just check for events.
If you have an FDP event, you handle it, you respond to it.
You check for reclaim unit handles to see if there's anything going on there.
And then if you want, like Mike mentioned,
you can monitor the statistics to see what's going on with the drive
to confirm that it's operating the way you want.
Some of the customers that we talked to about this,
they think initially they'll be, you know, doing a lot of this interaction with the events and the statistics until they get their host
software dialed in. But once they get it dialed in, you know, these are probably potentially
optional at that point once you get your host software working very well with an FDP drive.
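A sketch of that steady-state loop, with hypothetical stand-in functions for the write path, the events log page, and the statistics log page. The WAF check at the end uses the host-bytes-written and media-bytes-written counters mentioned earlier.

```python
# Sketch of the steady-state loop: tag writes with a placement handle,
# poll FDP events, and use the FDP statistics to watch write amplification.
# All functions are hypothetical stand-ins for the real I/O plumbing.

def write_lba(lba, data, placement_handle):
    pass  # stub: issue an NVMe write carrying the placement handle

def poll_fdp_events():
    return []  # stub: read the FDP events log page

def read_fdp_statistics():
    # stub: host bytes written vs. media bytes written, as in the stats log page
    return {"host_bytes_written": 100 * 2**30, "media_bytes_written": 104 * 2**30}

write_lba(0x1000, b"hot data", placement_handle=0)
write_lba(0x2000, b"cold data", placement_handle=1)

for event in poll_fdp_events():
    print("handle FDP event:", event)

stats = read_fdp_statistics()
waf = stats["media_bytes_written"] / stats["host_bytes_written"]
print(f"observed WAF ~ {waf:.2f}")   # ~1.0 means the isolation is working
```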
So Mike's going to walk through, I think the next is, yeah, Mike's going to walk through a very detailed example of a write command.
So essentially I talked about the benefits of using FDP,
the stuff that you have to query the drive for to figure out how to set it up,
and a very high-level overview from a host software's perspective
on what it takes to interact with the drive.
I'm Mike Allison. I'm a senior director of the NAND product planning team with
Samsung. I was actually the lead author of the technical proposal. But the unique thing about the FDP proposal is it really was an NVMe membership proposal.
A lot of feedback from the sponsors, a lot of feedback from various SSD vendors,
and it was a privilege to work with everybody in doing this.
So what I'm going to do is you've seen all the words, you've seen all the pictures,
but I'm a very visual guy,
so I want to actually walk through what really happens when a write occurs. So before I go there,
let's go over a couple things here. So what I have here is I have an SSD with one single
endurance group, and typically most SSDs are only going to have one endurance group,
and it includes all of the NAND associated
with that. And FDP was enabled on that SSD. It has, in this particular case, I'm just doing one
reclaim group, basically, so it fits on the slide and it doesn't get cumbersome. But what it means
then is that this reclaim unit here is, you could think of it as a super block.
It may be striping a NAND block across all the dies.
And so then as writes are occurring, the controller is actually trying to parallelize all the writes or whatever.
But you know what?
The host doesn't really have to care about that.
I have a reclaim unit.
It has a certain size, and I can write to it.
It also has multiple reclaim unit handles.
And each reclaim unit handle will have a reference
to a single reclaim unit.
And these reclaim unit handles,
if a write gets targeted there,
then wherever it's referencing the reclaim unit,
it's going to perform the write,
and that's how you get the placement.
And in this particular case, reclaim unit handle zero
is referencing reclaim unit zero. And there's been some data already written there. And the
same is true for reclaim unit three and four. Now I have numbers here on the reclaim units
for this presentation only. The host does not know anything about the number of reclaim units. It just knows
that data will be written on the granularity of reclaim units. The host is given the number of
reclaim unit handles when it configured the drive to be in this configuration. In this case there were four.
A namespace was created, and during the namespace creation, the host provided this table, and it
said, look, I'm going to have an index into this list of
reclaim unit handles, and this here indicates the actual physical handles that are going to be used.
But the placement handle is what's used by the host in the interface to say which one am I going
to use. And the reason we did that is, now if the host wants to validate any write command,
it can just say, if this value is two or less, it's valid.
that I have a valid RUH defined in a write command.
So what I'm going to do here is let's walk through a write sequence.
So let's say the host issues a write to this controller to namespace A, the only namespace
that I have here, and we're going to use placement handle one and reclaim group zero, and there's only
one reclaim group, so I'm not going to mention reclaim group. And so when that write comes into
the controller, the first thing it has to do is find the reclaim unit handle associated with that placement handle one.
In this case here, now we're going to use the reclaim unit handle two.
So we're going to use this blue reclaim unit handle.
Now, reclaim unit handles allow you, you know, to do parallel writes.
It kind of looks like streams, right?
You got a stream handle and you can do this.
And when this data came in here,
it came in through the new data placement directive. So we had the streams directive,
we added a new directive, and that directive provides two pieces of information, the placement
handle and which reclaim group. Now there is also an option that if you only have one reclaim group,
then only the placement handle needs to be passed in the interface.
So once the controller figures out the reclaim unit handle, it's going to use that reclaim unit
handle to buffer the data and eventually write it out to the reclaim unit, and you placed your data
in the reclaim unit that was referenced. At any point in time, the host can use a new IO receive command
to say get me information about this RUH,
and this would tell you how much data you can still put in here
and how much time I have to fill it to capacity.
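To tie the pieces together, here is a toy, controller-side model of the flow just described, including the rotation to an empty reclaim unit that Mike walks through next. Sizes, names, and the bookkeeping are invented for illustration; it is not how any real controller is implemented.

```python
# Toy model of the write flow: the write carries a placement handle, the
# controller resolves it to a reclaim unit handle, buffers the data into the
# referenced reclaim unit, and rotates the handle to an empty reclaim unit
# when the current one is written to capacity. Purely illustrative.

RU_CAPACITY = 8  # hypothetical reclaim unit size in blocks

class ToyFdpController:
    def __init__(self, num_rus, placement_table):
        self.ru_fill = [0] * num_rus             # blocks written per reclaim unit
        self.placement_table = placement_table   # placement handle -> RUH id
        # each RUH references one reclaim unit; RUH i starts on RU i here
        self.ruh_ref = {ruh: ruh for ruh in placement_table}
        self.next_free = max(placement_table) + 1

    def write(self, blocks, placement_handle):
        ruh = self.placement_table[placement_handle]   # table lookup
        while blocks > 0:
            ru = self.ruh_ref[ruh]
            chunk = min(blocks, RU_CAPACITY - self.ru_fill[ru])
            self.ru_fill[ru] += chunk
            blocks -= chunk
            if self.ru_fill[ru] == RU_CAPACITY:        # RU written to capacity:
                self.ruh_ref[ruh] = self.next_free     # point RUH at an empty RU
                self.next_free += 1

    def ruh_status(self, placement_handle):
        """Roughly what an I/O management receive would report: space left."""
        ru = self.ruh_ref[self.placement_table[placement_handle]]
        return RU_CAPACITY - self.ru_fill[ru]

ctrl = ToyFdpController(num_rus=16, placement_table=[0, 2, 3])
ctrl.write(5, placement_handle=1)      # lands in the RU referenced by RUH 2
ctrl.write(6, placement_handle=1)      # fills that RU and spills into a fresh one
print(ctrl.ruh_status(1))              # blocks still writable in the current RU
```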
Let's do another write here.
This write's going to come in.
It's going to use a different placement handle.
In this case, it's going to use placement handle 2,
which goes to the reclaim unit handle 3.
Same kind of concept.
It looks it up in the table,
figures out the reclaim unit handle,
buffers it up,
eventually writes it out to the reclaim unit.
Now, in this particular case,
reclaim unit 4 was written to capacity.
And at this point, this is where the controller comes in and says,
I have no more writes to that reclaim unit.
I need to go find another reclaim unit.
And it's going to update its reference to a different reclaim unit that is empty.
Now, if you have ZNS, you always have to write to this boundary.
But here, and I'm going to go through a case in a moment, you just keep writing,
because the controller is always going to be updating the reclaim unit as they get filled to capacity.
So let's say you have a large write come in, and you go and look up the reclaim unit handle here, which is two.
And in this particular case, we realize that it's a big write, so it's going to fill one to capacity.
So the RUH2 goes in and fills up reclaim unit three. The controller is going to update to a
new reclaim unit and complete the write. This is something you couldn't do with ZNS. So the host can manage
what LBAs get written to a reclaim unit, but it also can just write and understand
how many writes are left and just keep writing, and not necessarily have to be on the
boundary if they don't want to be. They just have to know that when I reach my internal boundary,
and I'm tracking things, and I went over the count, then the count went to the other one. There's an IO management
receive command where they can figure out all the information about how much data is available
and whether the two are in sync. Now, FDP was purposely architected to be fully backwards compatible,
both in namespace creation and in IO commands.
So what that means is I can take an FDP-enabled device
that's in a server that understands FDP and is using FDP,
unplug that device, go into a server,
and plug it in to a server that knows nothing about FDP,
and it can read and write the namespaces. It can create namespaces and it can issue commands.
Now, it's not going to do placement because it doesn't understand placement, but it works.
Okay. So let me give you an example. I, as a host, can buy an FDP-capable drive. I can plug it into my server, and I can
enable FDP, and I can use it like I do today. Then another time later, I could say, well, I want to
take advantage of FDP, and I could start modifying my software to start utilizing FDP over time,
and take advantage of it, and just use it normally. Okay. I'll go into a little bit more detail.
So let's say I have an application on this same server that is using FDP,
but that application does not understand FDP at all.
And so it was never modified.
It may even be third-party software that I bought,
and I can't manage it and change it.
It issues a write command and it does not specify
the new data placement directive, which means there is no reclaim unit handle being specified
or reclaim group. When that comes in, the controller will say, I don't know where to place this.
So by requirement of the standard, it's always going to use entry zero for the write in terms of picking which
reclaim unit handle to use. If your FDP configuration has multiple reclaim groups,
the controller will pick which reclaim group to place it in. There's some advantages with doing
this here. So in this case, it's going to use reclaim unit handle zero, which goes over here and does the write.
What's the advantage here is, these reclaim unit handles,
there's no requirement that the host can't share them
across multiple namespaces.
So if I create an FDP-capable SSD,
plug it into a server,
I take all my namespaces,
and I give them the same RUH to run and use,
it works just like a conventional SSD does today.
There is no difference.
It's kind of what we do today.
So it's backwards compatible.
So it just works.
Put this table together here.
Yes.
Yes.
Yes. Say reclaim unit handle two and reclaim unit handle three still have that space. But there's no space left,
and you get to the end of reclaim unit zero.
Would it fail to write at that point?
So what you're asking is,
so first of all, namespaces,
in terms of how they get allocated to which reclaim unit,
is a dependency of how you set up your RUHs to your namespaces.
So namespaces can be written to any reclaim group,
to any reclaim unit, up to the selection of the host.
Normally in an SSD, when you create namespaces,
you cannot exceed our maximum capacity.
It may be over-provisioning or whatnot. So you can't ever write more than what would be available somewhere in the drive.
Okay. So you don't have that issue. What you're getting into though, is if I have multiple
reclaim groups and a host decides to always select one reclaim group to do the write,
and you fill it to capacity, what happens? The controller will do the
write and honor striping that data to make sure that we maintain our warranty and our thing. But
the trick here is if the host is using FDP and they want to manage it, then they have to honor
the protocol to get what they want. If they don't honor the protocol, the drive is still going to
protect itself, the writes will occur, and you won't get the placement that you want.
We're assuming that what we're hearing from these customers that want FDP is they do want to control it and understand it.
Mostly what they want to do is do data separation, right?
Because this gives you the ability.
Because one of the things here is this namespace has three reclaim unit handles.
It could actually do hot and cold separation of data.
And in an SSD, when we go and write data to a NAND,
the data that has the same lifetime is really good for NAND
because I either erase it all or move it all at one chunk
and don't have it mixed.
Does that help?
Okay.
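A tiny sketch of that hot/cold separation idea from the host side. The lifetime thresholds and handle assignments are invented; the point is only that the placement handle is chosen by expected data lifetime.

```python
# Sketch of hot/cold separation: the host picks a placement handle by
# expected data lifetime, so short-lived and long-lived data never share a
# reclaim unit and can be erased together. The classification rule and
# handle assignments here are invented for illustration.

HOT_HANDLE, WARM_HANDLE, COLD_HANDLE = 0, 1, 2   # three RUHs in the namespace

def placement_handle_for(expected_lifetime_s):
    if expected_lifetime_s < 60:
        return HOT_HANDLE       # temporary/scratch data
    if expected_lifetime_s < 24 * 3600:
        return WARM_HANDLE
    return COLD_HANDLE          # long-lived data

for name, lifetime in [("temp upload", 10), ("journal", 3600), ("archive", 30 * 86400)]:
    print(name, "-> placement handle", placement_handle_for(lifetime))
```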
All right, so I put together this chart here.
Let me see what time it is.
So I'm going to walk through some of these here.
I kept getting questions of how is this like streams and how is this like ZNS.
Really, FDP is streams-plus, ZNS-minus, is the way I like to view it. Streams says,
give me a bunch of reclaim unit handles and open them up. And at start time, you know that I'm at
the beginning and there is no feedback to tell you where I am in a stream if I'm aligned or
anything like that. There's no feedback to the host of where you are. ZNS is, I'm going to give you zones, and you can write to the zone,
and you can only write to the zone,
and you're going to always understand everything,
and it has to be very strict,
and if you ever violate the protocol, you get an error.
FDP is halfway in between.
It says I'm going to give you streams, but I'm going to give you feedback.
And I'm going to allow you to write across the
zone boundaries. If you want to, you're managing it, you go ahead and do that. I'll automatically
update the pointers to the next one, whereas in ZNS the host has to do it. If you want to move across
boundaries, you have to change the zone number that you're writing to. Just like, you know,
ZNS, there are only so many open things that we could write to,
which here is the reclaim unit handles. So there's a balance there between the two.
I like to think of it as streams came first. It was great. We did ZNS and then FDP. So one learned
from the other. And if you look at this list here, some of them are the same, some of them are different.
Some of the key ones here is you can achieve a WAF of one with all three of these.
Okay?
ZNS, it's guaranteed.
It's in the protocol.
But it requires that you change all software in the whole software stack
that's dealing with that drive to make it happen.
Streams, you think you can get there, but because there's no feedback, you're never really sure what
you got. Maybe you get it, maybe you don't. FDP, you get feedback to know, am I really using the
protocol correctly? But if I mess up my write, the write is still going to happen and I can go fix my
software later. Again, it's backwards compatible, so you can choose when you want to update your software. Okay.
The other big thing that I want to say is
that that's really it; you can read the table. I'm not going to go through the table. I made it
really fancy and made it so you can step
through all of this stuff here.
No, wait.
This is the one I like right here.
Ready? Watch this one.
I'm getting better at these animation stuff.
Let me get past this here.
Yes.
Namespaces are relevant.
So what I want to say here before we get to some questions here is there are three more FDP sessions.
Two of them are right in this room right after this in succession.
We have one looking at the host software stack.
We're going to have somebody look at just the SSD itself.
And then one of the things that we didn't go into
is, we just talked about here's FDP,
but why did we really do FDP?
What is this garbage collection?
At a birds of a feather session tonight,
I'm going to go into more detail.
I'm going to show you
where the write amplification comes from.
I'm going to talk about, okay, host, how many RUHs should you ask for?
And it's going to be a very informal session.
I have a bunch of more animation slides to go over stuff,
but the intent is to have kind of the dialogue right here
and just look at it from a host perspective and not get into the SSD portion
and just various things that you have to think about with that.
And there will be snacks and beverages available there.
So with that, any more questions?
Any question?
When you say the LBA at submission time, how do you know?
So, yeah, he wanted to know, you know the LBA at submission time, how do you know the LBA?
So today in a conventional drive, when you do a write, you say, here's an LBA.
And the drive writes it wherever it wants in NAND.
You have no clue where it's going.
ZNS, when you say write of an LBA, it's two pieces of information.
It's the LBA, but it's also which zone is it going into.
Okay?
So it's a tuple of information.
The difference with FDP is when you do a write of the LBA, it's the LBA.
You have a different field in the write command that says where to place it.
So in ZNS, if I was to have a bunch of
objects, and I want to put it in a specific zone, then I have to figure out which LBA I want to
write it to. And if I want to do multiple parallel commands, I have to do the append command, and I
don't know the LBA until the write is complete. Whereas here, the LBA is the LBA, whatever object
you defined, it doesn't change. You're just trying to place it.
And if I have objects of different temperature, I have a temporary one.
I want to put all my temporary ones together.
If I have an object that's going to be long-term, more long-term, you know,
I'm on a web page and I have a short temporary guy playing with stuff and I want to save it.
But then when he saves it long-term, I can actually separate them in different super blocks
and keep long-term and short-term data around. Okay. Good question. Another thing that's
subtle in the spec that I do want to talk about: in the FDP configuration, there's this thing
called power safe. And what we were thinking ahead of time is,
you've built an SSD, and some customer comes in and says, I want this configuration.
And that configuration may require more than the power protection can cover, so if you select it, I'm not power safe,
and you have to use the volatile write cache mechanism of NVMe, and as the host
you would have to do flushes. But then I may have a configuration that does fit within
the power protection, and it could say I'm power safe or not. And we have mechanisms to do backwards compatibility there such that if any
FDP configuration says that I'm not power safe, then if you look at the NVMe mechanisms, it would
say that it has a volatile write cache. So that way, if you plugged it from one server to the next,
it will say, oh, I have volatile write cache because I don't know which FDP configuration
was enabled. So those are little subtleties that we were trying to look forward to make sure we were fully compatible in any scenario
that may come up and not have to go re-architect the FDP
as we go further. Yes, sir?
So I guess maybe I missed this.
So when the host sends data and says, here's the
LBA I want to send to the location.
When the controller moves that data around,
is that the feedback mechanism you're talking about
to tell the host, okay, the data's now somewhere else,
or does just the drive itself track,
I sent data to location one, it's actually in location two,
so I'm asked what location one is.
How is that?
The intent is that the hosts are going to track
which LBAs they wrote during an open reclaim unit.
They're going to know that this group of LBAs were written to a reclaim unit somewhere in there.
And the host knows that to avoid future garbage collection where we need to go take that reclaim unit
and erase it for future writes, if there's still data in there, it may have to internally copy it, okay? If you come at 7 p.m., I actually have a diagram that goes all over this.
But the intent of FDP is the host is going to either delete or invalidate that data long before
that, so I never have to copy the data, and my WAF goes to one. Today, they don't know where the data
is, right? And so they don't know how to do it,
and we intermix namespaces, and they have no clue. So this lets them have the ability to manage that
if they want to do it, okay? It knows, it can know when a new reclaim unit was started by
knowing how many writes are available. There's also an IO management send command that says,
point to a brand new empty reclaim unit,
and I'm going to track the number of writes that I do and which LBAs I've written there.
I know the count, so I know how many I can write there
before you're going to automatically switch to the next one.
So if I can write a gig per reclaim unit, I write a gig,
and I know those LBAs, and I write the second gig,
and I know they're put together.
Is there an event on the reclaim handle?
No.
No event.
It just happens.
The host needs to track the number of writes that they're doing,
and the size of the reclaim units is advertised in the FDP configuration.
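A sketch of that host-side bookkeeping, under the assumption of a made-up reclaim unit size: the host counts the bytes it writes through a handle against the advertised reclaim unit size and records which LBAs went into each filled reclaim unit.

```python
# There is no event when the controller moves a handle to a new reclaim
# unit, so a host that cares can count what it has written against the
# reclaim unit size advertised in the FDP configuration. Illustrative
# sketch; the size and bookkeeping structure are made up.

RU_SIZE_BYTES = 64 * 2**20   # hypothetical reclaim unit size from the config

class RuhTracker:
    """Tracks which LBAs the host has placed into the current reclaim unit."""
    def __init__(self):
        self.bytes_in_current_ru = 0
        self.current_ru_lbas = []
        self.sealed_rus = []   # LBA sets for reclaim units that have filled

    def note_write(self, lba, length_bytes):
        self.current_ru_lbas.append(lba)
        self.bytes_in_current_ru += length_bytes
        if self.bytes_in_current_ru >= RU_SIZE_BYTES:
            # the drive has rotated to a fresh reclaim unit by now
            self.sealed_rus.append(self.current_ru_lbas)
            self.current_ru_lbas = []
            self.bytes_in_current_ru -= RU_SIZE_BYTES

tracker = RuhTracker()
for i in range(40):
    tracker.note_write(lba=i * 4096, length_bytes=2 * 2**20)
print(len(tracker.sealed_rus), "reclaim units filled so far")
```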
Yes, sir? So there are two mechanisms in NVMe.
So let me back that up.
How does the host do the invalidate is what I believe the question is.
So when you write an LBA into a drive, okay, it gets written to some NAND.
That data stays there until I can erase a bigger portion of the NAND to get rid of it.
If the host does another rewrite of that LBA, it gets written somewhere
else, and I have the new copy, but the old copy we like to think is invalid, right? So it's there.
I can't get rid of it until I erase the whole block, but there may be other data in there that
is still valid, okay? So one way that the host does that is just rewrite an LBA. Another way is the
dataset management command that says, I no longer need that LBA. It's invalid. Okay. And then there's requirements of what to return
if they read it and stuff like that. The intent of FDP is if the host is tracking which LBAs they
wrote there, they can either rewrite them or they can invalidate them. But what they want to do is
do that before I have to do garbage collection in the drive
and avoid that copy.
I still have to do the erase of the NAND block to be able to write it for future writes
and distribute writes and do wear leveling across my NAND.
But those are the two operations.
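A sketch of those two invalidation paths, with stub functions standing in for the real write and Dataset Management plumbing (nothing here is an actual API): the host relocates the few LBAs it still needs and deallocates the rest, so the reclaim unit can be erased without any copying by the drive.

```python
# Sketch of the two invalidation paths just described: the host either
# rewrites an LBA (the old copy becomes stale automatically) or explicitly
# deallocates it, ideally for a whole tracked reclaim unit's worth of LBAs
# before the drive has to garbage collect it. The functions are stand-ins.

def rewrite(lba, data):
    pass  # stub: a normal write; the previously written copy is now invalid

def deallocate(lbas):
    # stub: Dataset Management with the deallocate attribute for these LBAs
    print(f"deallocating {len(lbas)} LBAs")

def retire_reclaim_unit(tracked_lbas, still_needed):
    """Invalidate everything the host placed in one reclaim unit."""
    keep = [lba for lba in tracked_lbas if lba in still_needed]
    drop = [lba for lba in tracked_lbas if lba not in still_needed]
    for lba in keep:
        rewrite(lba, b"...")        # relocate data the host still wants
    deallocate(drop)                # tell the drive the rest is invalid
    # With every LBA in the RU now stale, the drive can erase it with no copying.

retire_reclaim_unit(tracked_lbas=[0, 8, 16, 24], still_needed={16})
```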
If you come at 7 p.m., I actually walk through all of that
and show those two concepts at 7 p.m. tonight.
Okay? Yes?
In the spec, there is nothing stated there. The spec says that a reclaim unit has to be one or more of the erasable size of the media. So it could be one NAND block or many. What we're seeing is in discussions out there that they've had in NVMe and others is typically
if you're doing a single reclaim group, it's going to be the super blocks.
I can't give you an answer from a vendor point of view because there are different types of NANDs
and different companies and all this stuff.
So what that size is is variable in the ecosystem.
The other thing is, maybe they'll do a reclaim group per die, which is another thing that people talked about.
And they want to manage writing to dies.
And then it could be one or more NAND blocks within that die that they may group together in there. So that is a negotiation between customer and vendor
of what configurations that should be built
and put into the drives.
And so...
Yeah, and that configuration will tell you the size of those
and they could have one or more of them.
And again, that's still, we're early on in FDP
and we're working the ecosystems, figuring out what they are there. And John said, in the end, I think there's going to be
one, two or three that'll probably settle out across the industry. But the spec has been designed
to be very scalable and not define that from a spec level. And each controller is doing it. And
one of the reasons, as John said, that we're letting the controller advertise,
and instead of letting the host come in and just say,
give me a random number of reclaim unit handles and this size of this,
is we can't test it.
It's just there.
But if I give a set of configurations that meets my customer's needs,
and it's a set that I can test very well in that,
and it just helps the development and the validation.
Yes, Randy.
Do you mind going back to the write?
Yep.
I knew that was going to.
Go back one more.
Okay.
It's not the random one.
Okay.
So...
You mean one of these?
Yes.
That'll work.
So the controller uses the reclaim unit handle.
So here you have a write.
You send the write to the controller.
The host specifies placement handle one. Yes.
Placement handle one is tied to
reclaim unit handle two.
Now here you only have one reclaim group.
Yeah, so the directive within NVMe has two fields in every write command.
One is the directive type: am I streams, am I FDP, or am I not defined?
Because those are the three defined by NVMe today.
And then there's another field called the directive specific field.
And in the directive specific field, that field gets filled with the definition by the specific directive type.
In streams, it would be the stream identifier.
For the data placement directive, it can have two pieces of information.
One is the placement handle, and the other one is the reclaim group.
Now, there is a caveat.
That field is only 16 bits.
I have variable-sized reclaim groups and variable-sized placement handles.
So what we allowed is, the configuration will tell you the bit offset of
what's placement handle and what is reclaim group,
which limits the number of reclaim groups you can have,
which would also limit the number of placement handles that could be given
to a namespace.
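A small sketch of that packing, assuming the reclaim group identifier occupies the upper bits of the 16-bit directive specific value and the placement handle the lower bits; the exact split comes from the reclaim group ID format in the configuration, and this layout is an illustration of the idea rather than the spec's definitive encoding.

```python
# Sketch of how the 16-bit directive specific (DSPEC) value can carry both a
# placement handle and a reclaim group, split at a bit offset given by the
# configuration's reclaim group ID format. Illustrative packing only.

def pack_dspec(placement_handle, reclaim_group, rgif_bits):
    ph_bits = 16 - rgif_bits
    assert placement_handle < (1 << ph_bits) and reclaim_group < (1 << rgif_bits)
    return (reclaim_group << ph_bits) | placement_handle

def unpack_dspec(dspec, rgif_bits):
    ph_bits = 16 - rgif_bits
    return dspec & ((1 << ph_bits) - 1), dspec >> ph_bits

dspec = pack_dspec(placement_handle=2, reclaim_group=5, rgif_bits=4)
print(hex(dspec), unpack_dspec(dspec, rgif_bits=4))   # 0x5002 (2, 5)
```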
Now, one thing is here, I do want to really.
Yes.
And if the host doesn't fill in the directive,
then it defaults to the zeroth entry, and the write happens anyway.
Yes.
Even if it's FDP aware, that still happens.
So, again, on a write, controller gets a command.
It looks up the placement handle, or it defaults to zero
to figure out which reclaim unit handle.
It uses that resource to do the writing, which is where buffering happens.
And then when there's enough data to buffer up, it writes it out to the NAND.
It's the same sequence over and over and over again.
Yes?
At the end, does it take the name placement identifier or would that change?
Name placement identifier.
Is the placement identifier plus the...
The placement, again, the placement identifier
was just to use in the interface
from the host to the controller to say,
hey, at creation time, I asked for three RUHs
that could be, you know, spread out in numbering,
and I want to have a nice indexing capability to them.
But the namespace information is saved across power cycles,
and that table is saved.
But that table is defined at namespace creation.
And if the host does a namespace create command
and doesn't specify this table,
the controller will fill in entry zero for you,
and you always have a table from the specification.
Okay?
At 7 p.m., we'll dive deeper into this.
At 7 p.m., we'll dive into this.
You can ask all the questions you want.
I have more animation slides, and I've been having a lot of fun.
I'm getting pretty good at it now.
All right, any other questions?
All right, thank you. Thank you very much. Thanks for listening. For additional information on the
material presented in this podcast, be sure to check out our educational library at snia.org
slash library. To learn more about the Storage Developer Conference, visit storagedeveloper.org.