Storage Developer Conference - #94: Key Value Storage Standardization Progress

Episode Date: May 6, 2019

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast Episode 94. So I am Bill Martin. I am from Samsung. I am presenting the status of standardization efforts on something that Samsung is helping to drive along with some other companies, which is key value storage. And as a representative of Samsung, I have to throw up this disclaimer that nothing I say can be held against me.
Starting point is 00:01:11 So what is key value storage? I'll start off with what it is. It is storing an object or a value associated with a key. It is also a new paradigm that is different from block storage. With block storage, you pass a logical block address and some amount of data that is defined by a length that you pass with it. Instead of doing that, you are passing a key. The key may be variable length or fixed length, and you are passing a value. The value, again, may also be variable length or fixed length. And you don't pass a logical block address, and we'll talk about why that's a good thing to be doing. It's also different from object storage, and the next slide will talk about how. So object storage, in terms of how people typically think of it, is a solution or platform.
Starting point is 00:02:18 It's something more than an individual device, generally. It's something like a cluster of storage with a global namespace that has an extensive higher level feature set. And it has features like global namespace resiliency and other things that are associated with this large pool of storage. We're not trying to do object storage in its large sense and move that down into the disk drive. Rather, let me talk about two other things on object storage. Object storage at times has been considered a device,
Starting point is 00:03:03 for example, the Kinetic object storage. One of the differences here is Kinetic did not have a native object storage interface language. It used an existing RESTful interface where you did puts and gets, but they weren't specific to the object and the label of that object. Another characteristic typically associated with object storage is that it is searchable based on the value in the object, as opposed to just searchable by the key that that object is stored with. And finally, object storage, when you think of a large system that is an object storage
Starting point is 00:03:55 system, performs operations based on the object key and the object value. We're not talking about doing any of that high-level stuff with key value storage. What we are talking about is a device, an SSD. So we're not talking about a large system of a whole bunch of SSDs. We're talking about an interface to that individual SSD. We're talking about a native key and value interface where we're defining a new set of commands,
Starting point is 00:04:33 a new set of application programming interfaces. And we're talking about doing store, retrieve, delete, et cetera, with the key and the value instead of a logical block address. Searches and other operations on the value are done outside of the storage element. So we're not trying to put a whole lot of compute into the device, but we're trying to offload some things.
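To make the contrast concrete, here is a minimal C sketch of the two call shapes. Every name in it is hypothetical rather than taken from the SNIA API or the NVMe command set, and a real interface carries more options than this:

```c
#include <stddef.h>
#include <stdint.h>

/* Block paradigm: a logical block address plus a length's worth of data. */
int block_write(int dev, uint64_t lba, const void *data, uint32_t num_blocks);
int block_read(int dev, uint64_t lba, void *data, uint32_t num_blocks);

/* Key value paradigm: no logical block address at all. The key and the
 * value may each be fixed or variable length. */
struct kv_key   { const void *bytes; size_t len; };
struct kv_value { void       *bytes; size_t len; };

int kv_store(int dev, struct kv_key key, struct kv_value value);
int kv_retrieve(int dev, struct kv_key key, struct kv_value *value_out);
int kv_delete(int dev, struct kv_key key);
int kv_exist(int dev, struct kv_key key);
```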
So why do we want to do key value storage? Solid state storage currently maps from an address to a physical location. That's how it works.
Starting point is 00:05:12 In the early days of disk drives, there was a fairly fixed mapping from logical blocks to physical blocks. When we moved to solid state storage, you moved to having to have a mapping table. So what is key value storage? Well, it's a different mapping than what you currently do for storing logical blocks, but it's what solid state storage already does. We already do a mapping.
Starting point is 00:05:44 It removes the triple mapping that currently occurs in systems today. Today, if I want to store a key value pair onto a solid state storage device, I have a mapping from my key to a file in a file system. Then I have a mapping from the file system to a logical block. Both of those mappings take place generally in the host. Then finally, when that write command or the retrieve command, the read command, comes down to the device,
Starting point is 00:06:24 there's a third layer of mapping from a logical address to a physical address. So we have three layers of mapping right now, and what we're doing with key value storage is turning those three into a single mapping, where you pass a key to the storage device, you pass the value to the storage device, and the storage device maps the key to a physical location or set of physical locations on your SSD. So through doing this we're trying to eliminate a piece of the overhead.
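As a rough illustration of collapsing those three mappings into one, here is a hedged sketch of the two host-side store paths. The kv_store call is the same kind of hypothetical call as in the earlier sketch, the file-based path is ordinary POSIX, and the /data path is made up:

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Flattened variant of the hypothetical kv_store sketched earlier. */
int kv_store(int kv_dev, const void *key, size_t key_len,
             const void *val, size_t val_len);

/* Today's path: the key is mapped to a file name (mapping 1, application),
 * the file system maps the file to logical blocks (mapping 2, host), and the
 * SSD maps each logical block to a physical location (mapping 3, device). */
int store_via_filesystem(const char *key, const void *val, size_t len)
{
    char path[256];
    snprintf(path, sizeof path, "/data/%s", key);
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, val, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

/* KV SSD path: one mapping, done by the device, from the key itself to a
 * physical location or set of physical locations. */
int store_via_kv_ssd(int kv_dev, const char *key, size_t key_len,
                     const void *val, size_t len)
{
    return kv_store(kv_dev, key, key_len, val, len);
}
```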
So where are we developing this? We're actually working with two different organizations
Starting point is 00:07:19 at the moment. We're working in SNIA, the organization that's putting on Storage Developer Conference, to define a key value storage API. We are working within NVMe to develop a key value command set. And I'll get a little bit later to actually where we're at in each of those organizations. We don't have full approval in the NVMe work group, but I'll talk about that a little bit later. So I don't want to imply that this is fully embraced within the industry,
Starting point is 00:07:57 but want to talk about where we are currently and where we see things going. So in SNIA, we have begun developing the key value storage API. We currently have a revision that has been released for public review. It's available for anybody who wants to look at it to pull it up and look at it. And one of the things we really would like is people who are interested in this to go grab it, review it, and say, hey, hold on. I see some value in key value storage, but I think you're missing this, or I think you ought to have this, or the way you've defined this is not exactly what we want. We really are interested in public review of the work that we're doing within SNIA. So please go grab it from the website.
Starting point is 00:08:56 SNIA has a lot of things under public review, and they really are things that SNIA would like to see people go out and look at and review. So what's included in our key value storage API? We have key space management. We have a store API. We have a retrieve API. And I'll go into each of these in detail as we go through this. We have an exist API to determine whether or not a key exists.
Starting point is 00:09:31 We have a delete API, and we have group operations. All of those are defined within the key value storage API. So, key space management. Multiple key spaces can coexist on a single key value device. Each key space has its own characteristics. It has its own capacity. It has key ordering if that is supported by the particular device, or that may actually be supported in a library above the device. So this is an API. There is an interface between the device and the API, and some of the processing may be done in the host, or all of that processing may be done on the
Starting point is 00:10:34 device. There are also device characteristics that are global. So the API talks about some of those global characteristics and also the specific characteristics related to key spaces. One of the other characteristics of a key space is that a key within one key space is unique from a key in another key space. So there is a uniqueness in terms of addressing between the key spaces.
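A hedged sketch of what key space management might look like through such an API. The structure and function names are illustrative only, not the ones in the released draft, but they capture the points above: per-key-space capacity, optional ordering, and keys that are only unique within their own key space:

```c
#include <stddef.h>
#include <stdint.h>

/* Ordering may be provided by the device itself or by a library above it. */
enum kv_key_order { KV_ORDER_NONE, KV_ORDER_ASCENDING, KV_ORDER_DESCENDING };

/* Per-key-space characteristics; device-wide characteristics live elsewhere. */
struct kv_keyspace_opts {
    uint64_t          capacity_bytes;  /* capacity dedicated to this key space */
    enum kv_key_order ordering;        /* ignored if ordering is not supported */
};

typedef uint32_t kv_keyspace_id;

/* Several key spaces can coexist on one device, and the same key stored in
 * two different key spaces names two independent values. */
int kv_keyspace_create(int dev, const char *name,
                       const struct kv_keyspace_opts *opts,
                       kv_keyspace_id *id_out);
int kv_keyspace_delete(int dev, kv_keyspace_id id);
int kv_keyspace_list(int dev, kv_keyspace_id *ids, size_t *count_inout);
```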
The store command is fairly simple. It's very much like a write command. It specifies the key space that you're addressing.
Starting point is 00:11:28 It specifies the key that you are attempting to store. It specifies the value you're attempting to store. Okay, that's pretty much what you have in a normal write command. There are three options that are associated with it. The first is fairly straightforward: do I compress the data? That's an instruction to the device as to what it does with the data: if the device has the capability to compress, it says do or don't compress this data. Now, the next two are a little bit different from what you have in a traditional block storage paradigm. They have to do with when do I or don't I perform this store command, and they are overwrite and update only.
Starting point is 00:12:17 With overwrite, overwrite says: if that particular key exists, do I overwrite the value that's there, or do I give an error indicating that the value already exists and you are not allowed to overwrite it? Update only is a matter of determining whether the key exists on the device; if it does, then I'm allowed to update it. If the key does not exist on the device, then I flag that as an error: you have not created this key, therefore you cannot update it, therefore you cannot write it. What this does is handle the cases where you want to create the key by saying overwrite and not update only. So you create the key initially by writing to it, by storing to it.
Starting point is 00:13:24 And the next time you come down, you say, I know I already created the key, therefore I am updating it. Now, one of the other differences between key value storage and block storage is that when you do an update, you update the entire value. In block storage, if I write LBA N and I say I want to write 10 logical blocks starting with LBA N, I can then come back and say, okay, I want to just write LBA N plus one, and I can write some piece of that overall range that I wrote initially. With key value storage, your unit of atomicity is the entire value. So when I come down to write the value again, I rewrite the entire thing. I don't update. Now, one of the things related to this, and it wasn't really on the why-do-we-want-to-go-here slide, is what this avoids when you look at NAND flash technology: by updating the entire value,
Starting point is 00:14:48 you avoid a piece of garbage collection. If you can say, my entire value is contained in some number of erase blocks, then you're not saying, I have to move things around when I write a subset of it as an update. When I update, I rewrite the entire thing someplace new, and now I can erase the entire old piece. So store really kind of contains a lot of the architecture that we feel is important in the key value storage API.
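Here is a minimal sketch of a store call carrying the three options just described. The flag names are hypothetical, and note that the value is always written as a whole:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical store options mirroring the three discussed above. */
#define KV_STORE_COMPRESS     (1u << 0)  /* ask the device to compress, if it can  */
#define KV_STORE_NO_OVERWRITE (1u << 1)  /* fail if the key already exists         */
#define KV_STORE_UPDATE_ONLY  (1u << 2)  /* fail if the key does NOT already exist */

int kv_store(int dev, uint32_t keyspace,
             const void *key, size_t key_len,
             const void *value, size_t value_len,  /* the whole value, every time */
             uint32_t flags);

int store_example(int dev, uint32_t ks)
{
    const char key[] = "sensor-42";
    const char v1[]  = "first full value";
    const char v2[]  = "second full value, rewritten in its entirety";

    /* Initial store: fail if the key already exists. */
    if (kv_store(dev, ks, key, sizeof key - 1, v1, sizeof v1 - 1,
                 KV_STORE_NO_OVERWRITE) != 0)
        return -1;

    /* Later store: assert the key was already created; the whole value is
     * rewritten, there is no partial update of a stored value. */
    return kv_store(dev, ks, key, sizeof key - 1, v2, sizeof v2 - 1,
                    KV_STORE_UPDATE_ONLY);
}
```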
Starting point is 00:15:22 Retrieve, once you know what store is, is pretty much the converse of it. However, when you get down to the options, it's got decompress, the converse of compress. The other one, though, that we have put into the API is the ability to delete after retrieving. We have particular applications where people have come to us and said, we want to be able to retrieve the value, but after we retrieve it, we are done with that value and would like to delete it. So it is kind of a fused command, two commands in one, where you do a retrieve immediately followed by a delete.
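Sketched the same way, a retrieve call with its two options. The flag names are again invented, and the fused retrieve-then-delete is a single command to the device rather than two round trips:

```c
#include <stddef.h>
#include <stdint.h>

#define KV_RETRIEVE_DECOMPRESS  (1u << 0)  /* converse of the store-side compress */
#define KV_RETRIEVE_THEN_DELETE (1u << 1)  /* delete the pair once it is returned */

/* On success, *value_len is updated to the number of bytes actually returned. */
int kv_retrieve(int dev, uint32_t keyspace,
                const void *key, size_t key_len,
                void *value_buf, size_t *value_len,
                uint32_t flags);

/* Consume-once usage: read the value and drop it in a single command. */
int kv_pop(int dev, uint32_t ks, const void *key, size_t key_len,
           void *buf, size_t *len)
{
    return kv_retrieve(dev, ks, key, key_len, buf, len,
                       KV_RETRIEVE_THEN_DELETE);
}
```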
Starting point is 00:16:14 So the next one is list. List returns a list of keys. That list has a starting point, and the starting point is an index into that list. And, again, this is on a per key space basis. I didn't actually run down all the parameters that go with this, so: list has a parameter of the key space that you're listing in. It also has a parameter of the key that is the starting point for your list. It has a size of what is to be
Starting point is 00:17:08 returned. And based on the key size, it determines how many keys you will return. So if you have small keys, you may return more keys. If you have large keys, you may return fewer keys. And I'll get into this a little bit more at the very end of the API stuff, when I'll talk about our group functions. List is one of the commands that can use the group functions within the API to narrow down the scope of what you're doing, but I'll touch on that a little bit later in another slide.
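A sketch of a list call as described: it names the key space, a starting key, and how much is to be returned, and the number of keys that come back depends on how large the keys are. The packed result encoding in the comment is an assumption, not the format in the draft API:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list call. Keys are returned packed into buf; how many fit
 * depends on the size of the keys themselves (assumed encoding: a 2-byte
 * length followed by the key bytes, repeated). */
int kv_list(int dev, uint32_t keyspace,
            const void *start_key, size_t start_key_len,  /* where to begin     */
            void *buf, size_t buf_len,                    /* how much to return */
            uint32_t *keys_returned);
```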
Exist, again, is a fairly straightforward API. You specify a key space, as you do with all of these APIs, and you return success if that particular... Oops, I did my slide wrong.
Starting point is 00:18:15 You return success if the key exists, not the value. So I apologize, my slide is wrong. Secondly, this may be performed on a list of keys. So you may actually ask, here's a whole list of keys, do they exist? Again, this is another one. No, this is not one that's part of the group commands. So you can give it a list of keys and ask, does this list of keys exist? And it basically returns success for each particular key.
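And a corresponding sketch of exist, which in the API may be asked about a whole list of keys at once. The one-status-byte-per-key result is an assumed format used only for illustration:

```c
#include <stddef.h>
#include <stdint.h>

struct kv_key_ref { const void *bytes; size_t len; };

/* Hypothetical exist call: returns a per-key status (assumed: 1 = exists,
 * 0 = does not) without transferring any value data, unlike retrieve. */
int kv_exist(int dev, uint32_t keyspace,
             const struct kv_key_ref *keys, uint32_t key_count,
             uint8_t *status_out /* key_count entries */);
```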
Starting point is 00:19:10 Yes? Yes? Pardon? If you attempt to retrieve a key that is not there, you will get an error. Yes. Okay. Part of why you have exist as opposed to just retrieve is exist will let you know if the key is there, but without returning the data which you would get from a retrieve. So, group operations. We have defined the group operations to allow operations to be performed on a group of keys. That group of keys is specified by a set of bits within the keys. They are high order bits only, so they start from the high order and go down. And we needed
Starting point is 00:20:09 to pick an end of that, and we picked the high order end so that if you have variable keys, it's always the same set of bits. It's always the high order bits. So the group is specified by two different fields. One is how big is the set of bits that define the group. And the second is a mask of those bits to say if those bits equal this particular value, then all of the keys that match that characteristic are part of the group. So the operations that are supported doing that are the list operation. So you can say rather than a list of all of the keys in that key space, which may be a large number, I only want the ones that fall within this group.
Starting point is 00:21:17 Now, your list, if you're not getting all of them but getting some of them, it's controlled by the group characteristics, but also is controlled by the size that you're asking for. The other supported operation is the delete operation which allows me to delete all of the keys that are associated with that particular group identifier. So what this does is it means that if I have a key and that key has four zero hex in the top byte, in the most significant byte of the key, all of those values are considered to be part of the group. And if I store things using that high order byte, differentiating different groups, I can then say, oh, I'm done with this group of keys and their associated values, and delete all of them in a single operation.
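A hedged sketch of how the group selector described above might be expressed: a count of high-order key bits plus the value those bits must match. The helper shows the matching rule for a group defined by one whole byte (0x40, as in the example from the talk); all names here are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A group is named by how many high-order key bits are significant and by
 * the value those bits must equal. */
struct kv_group {
    uint32_t bit_count;  /* number of high-order key bits that define the group */
    uint8_t  match[32];  /* the value those bits must have (high-order first)   */
};

/* The matching rule for a group defined by one whole byte: every key whose
 * most significant byte is 0x40 belongs to the group. */
bool in_group_0x40(const uint8_t *key, size_t key_len)
{
    return key_len >= 1 && key[0] == 0x40;
}

/* Hypothetical group operations: list only the matching keys, or delete every
 * key/value pair in the group with a single call. */
int kv_group_list(int dev, uint32_t keyspace, const struct kv_group *grp,
                  void *buf, size_t buf_len, uint32_t *keys_returned);
int kv_group_delete(int dev, uint32_t keyspace, const struct kv_group *grp);
```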
Starting point is 00:22:22 I can also go and find what are all of the keys within that group. No. And we've changed the nomenclature to key space because there are some differences from an NVMe namespace. So the question was, do we have operations that operate across namespaces? And no, we don't. Yes? You said in one of your questions that you have operations across key spaces.
Starting point is 00:22:57 For example, how to check what is the amount of free space on a drive? Okay. Those are not... In the API, the API is focused basically on the IO commands. There is discussion of developing a key value management specification. That management specification would be the place that we would put things that gave us characteristics of the overall key value device. And that will be a separate specification. We have not started work on that.
Starting point is 00:23:37 That will be worked on within the same working group that's working on the API, but it would be a management interface. There's another talk being done this week, I believe it's tomorrow, by Mark Carlson, talking about a management specification that's already been done for IP-based storage devices. This would be a similar thing to what's been done for that. It, again, will all be within the same technical work group within SNIA. Right. That's actually, that would be in the NVMe command set as opposed to the API. Okay, so that's in the next part.
Starting point is 00:24:31 Right. And regarding the number of key spaces that you guys are thinking about, like unlimited? We have, for the API, the question was, are we thinking about how many key spaces are supported for the API? The API is written as a data structure-based API, and there is not a... I don't think there's a limit on it, but I'm trying to remember now.
Starting point is 00:25:00 I don't remember whether or not that is a data structure which would allow an indefinite number or if there's an actual number of bits associated with it. There is a name associated with your key space which is not limited. So it is a character string of unlimited size in the API. Now there again, when we get to the command set, which is the second half of this, there will be more limits there in terms of the size of fields. But in the API, there is not a limit
Starting point is 00:25:43 to the number of key spaces that the API is capable of supporting. Okay. Again, I'll talk about that a little bit more in the second half. That comes down to the command specification. The question was, are there limits on the size of the value based on what you can store? When you get to the command set, there are limits. Within the API, there are not currently limits.
Starting point is 00:26:20 But there again, that is a characteristic that needs to be discovered through management to determine, for a given device, what those limits are. Another question back here. So the question was, is there any ordering to the enumeration? Within the API you can specify the ordering characteristics, whether it is ordered or is not ordered, and the type of ordering. What the device supports may be different from the breadth of what the API supports. But the API does support ordering information. One more question. On the wider...
Starting point is 00:27:15 Can you fit this KV SSD within the existing XFS file system, databases, etc., easily, or will it be a heavier lift to modify those to kind of fit this into the existing file systems, databases, etc.? Okay. It's not intended to fit within the file system, because it really is intended to be key value storage, which is not file system storage. So the question is, have we done research that looks
Starting point is 00:27:52 at how this fits within the file system? What we've looked at, rather than that, is how you can use things like RocksDB, and how RocksDB can utilize this without having to go through the file system. So this is basically taking an architecture where you get rid of the file system piece and use this instead. Now, an alternative here that has been looked at, has been explored, and is actually usable, is within the file system to pass the file name directly as your key, and what's stored for that particular file is your object or your value. Yes, we do. Okay, and as I get to the end of this, I will talk about the fact that we do have some open source code available.
Starting point is 00:29:02 So what's the status of the KV API? We are developing this in the Object Drive Technical Workgroup within SNIA. If you are a SNIA member, you're welcome to join. If you're not a SNIA member, please join SNIA and join the Object Drive Technical Workgroup. I do wear multiple hats. One of my hats is that I am co-chair of the Object Drive Technical Work Group; another is that I am co-chair of the SNIA Technical Council. So you'll hear me advertising, hey, come join SNIA and work with us. But I also say that because I'd really like as much of the industry
Starting point is 00:29:40 looking at this and how they can use it so that we can make this a prolific next generation step for storage within the industry. We are currently meeting weekly to review the specification, enhance the specification, take inputs on the specification, and make modifications. That's not the only thing that the Object Drive Technical Workgroup is doing. They are also working on the IP storage management and one other item that Mark Carlson's pushing, and I can't remember what it is at the moment. So we have multiple things that that technical work group does work on. We meet for one hour via conference call weekly.
Starting point is 00:30:37 In addition to that, coming up in October, SNIA hosts a technical symposium that will be in Colorado Springs. There, we'll have an entire eight-hour day that we will meet doing work in the Object Drive technical work group, of which about half of that day will be spent on key value and half on other items. Currently, revision 0.16 is available for public review, as I mentioned at the beginning of the API stuff. And right now, my goal is aiming for the end of this year for release of the specification.
Starting point is 00:31:21 So it is in public review now. It is fairly solid at the moment, and we are looking for inputs, and the types of things that we're having come in at the moment are more on the order of clarification as opposed to enhancements. So the NVMe command set. Where are we? Right now we're in architectural development. At the end of this set, I'll do more of a status overview like I did in the KV API.
Starting point is 00:31:57 But what is the NVMe command set? It is a new command set. It is not being built as an extension of the block commands, but as a new command set. It is intended to operate over either PCIe or NVMe over Fabrics. And as part of our architectural discussions, we've come up against places where we've said, oh, we need to make some modifications to this to make certain that it does work over fabric. And we are, within those architectural discussions, attempting to make certain that it works both in PCIe and over fabric. Architectural overview. So first off, a single controller supports either block commands or
Starting point is 00:32:49 key value commands, not both. So you don't have a controller that supports both things. However, an NVM subsystem may have a block command controller and a key value controller. So within your subsystem, you could have one storage element that is a block storage element and a different storage element that is a key value storage element, where you may even have the capability of configuring that on the fly. We're not specifying that particular operation. The biggest thing we're specifying is that there is indeed a separation between those two, and they're not the same.
Starting point is 00:33:40 Yes? When you say controller, are you referring to the media? I'm referring to an NVMe controller, which is an architectural entity within the NVMe system. So it is an entity on your disk drive that controls a particular piece of that drive. Okay. So this is a new specification that will reference the NVMe base spec. It will reference, if necessary, the NVMe over fabric specification. And NVMe is currently working on an NVMe PCIE specification. It will reference that as that
Starting point is 00:34:37 work develops. But it will be a new specification. It won't be rolled into the NVMe base specification and trying to add a different command set there. Part of what that means is if you build a block storage device and that's all you want to build, this particular specification, you don't have to develop to it at all. You don't have to read it. It doesn't affect you in any way, shape, or form. If you want to develop a key value command set device, you read the new specification, and it points you to the pieces of the base specification
Starting point is 00:35:21 that are important for you to be aware of as well, such as things that define your queue structures, all of that type of stuff. All of that's in the base specification. So key space. What is key space? It is comparable to namespace. It allows separate spaces on a controller to use the same keys without overlapping. It allows partitioning of the controller resources. So part of what you're able to do with keyspace, and this comes back to one of the questions that was asked earlier, you have characteristics that are complete controller characteristics. Within NVMe,
Starting point is 00:36:07 you have the Identify Controller that gives you information about the entire controller. You have other things in the current base that are namespace specific, and you have Identify Namespace that gives you information about what's in the namespace. Key space will be similar to that. Within Identify Controller, currently, you have... no, sorry, within Identify Namespace, you currently have a set of logical block formats that you can format the device to. Similarly, within key space, there will be a set of key value characteristics that you can format the key space to. So, the store command. One of the unique things about key value storage as compared to block storage is that we have defined keys. And right now, our view for phase one of this development is that the keys can be up to 32 bytes long. Well, that's too much to fit into the current NVMe command structure. So what we have done in our description of it currently is that keys greater than 16 bytes are passed as part of the data pointer, and they may be SGL or PRP.
Starting point is 00:37:55 Now, PRPs cannot be transmitted across the fabric. So if you want to do this across a fabric, you will need to do this as an SGL formatted version. The other thing about keys that are greater than 16 bytes is that they are the first descriptor pointed to by the SGL or PRP, and they are located in a single element of your scatter gather list or your physical region page list. The value is then pointed to by the data pointer. If you are in the situation where the key is 16 bytes or less, then the entire data pointer is pointing to your value. If you are in the case where your key is greater than 16 bytes and less than or equal to 32 bytes,
Starting point is 00:38:50 then your value is pointed to by everything in the data pointer after the first element. So this isn't, for the store command, isn't too much of a stretch because both the key and the value are being retrieved from the host. When we get to the retrieve command, we've done the same thing. The keys up to 16 bytes are carried in the command. Keys greater than 16 bytes are passed as part of the data pointer. In other words, they're pointed to by the first element of the data pointer, whether it's SGL or PRP. The value is returned to locations pointed to by the data pointer.
Starting point is 00:39:40 Again, SGL or PRP. Now, the anomaly here is that your scatter gather list or your PRP has one element that is pointing to something that you are pulling from the host, and the rest of the elements are something that you're putting back out to the host. But in terms of how the device operates, it's really not any different, because you go fetch your SGL or PRP, it gives you information on how to operate, and the only anomaly is that you have to go retrieve more information from the host that's pointed to by the SGL or PRP. But you're already capable of doing that because you're capable of doing store commands anyway.
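A hedged sketch of that key-placement rule as plain host-side logic rather than real NVMe structures: keys of 16 bytes or less ride in the command itself, longer keys (up to 32 bytes in this first phase) go in the first SGL or PRP element, and the value is described by the data pointer, or by what is left of it. The field names are invented for illustration and are not the fields of the draft command set:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Invented stand-in for the relevant pieces of a key value command. */
struct kv_cmd_sketch {
    uint8_t key_in_cmd[16];    /* key bytes carried directly in the command     */
    uint8_t key_len;
    bool    key_in_first_elem; /* longer key: first SGL/PRP element carries it  */
    bool    value_uses_rest;   /* value described by the remaining elements     */
};

int build_kv_cmd(struct kv_cmd_sketch *c, const uint8_t *key, uint8_t key_len)
{
    if (key_len > 32)
        return -1;                            /* beyond phase one of the proposal */
    memset(c, 0, sizeof *c);
    c->key_len = key_len;
    if (key_len <= 16) {
        memcpy(c->key_in_cmd, key, key_len);  /* key fits in the command itself   */
        c->key_in_first_elem = false;         /* whole data pointer -> the value  */
    } else {
        c->key_in_first_elem = true;          /* first element -> the key         */
    }
    c->value_uses_rest = true;                /* rest of the data pointer -> value */
    return 0;
}
```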
So the list command returns a list of keys within the key space. It starts at the key specified by the command,
Starting point is 00:40:42 and again, the list command has to be able to use the data pointer if you have a larger key. Now, the only thing the data pointer points to at this point is the key that you're starting with. For the first release of this, the list command is not ordered. Within the command set, we don't have a requirement for any ordering. What happens is that whatever data structure the device has is used to order your keys. What that means is if I start with key foo, it will give me the next set of keys based on that starting point. I take the last key that I get in that command, use that, and I can get the next set. And that list, as long as I haven't stored or deleted a key in the process of doing the list, will remain in that same order. Now, if I delete a key and I now start from
Starting point is 00:41:56 somewhere, that should not change the order, but there's not any requirement that it not change the order. So in other words, if I delete something and I now write something new in, and it happens to go into that same place in my data structure, my lookup table, then when I do the same list again, I may get something different from my list if I did a store or delete during that. Yes? For a given key size. You don't need the value because all you're doing in the list command is returning keys. You do need to know the size of your keys. You're asking, do you return the size of the value?
Starting point is 00:43:10 Okay. I know that we've discussed that, but I don't remember the answer to that question at the moment. That is something we have discussed both in the API as well as in here. You do need to be able, within the list command to return,
Starting point is 00:43:31 the value size as part of what you return, yes. What happens if you delete the key that was the last one in the last row? If you use that, you will start at the beginning of... If you use any key that doesn't happen to exist, you'll start at the beginning of whatever the structure is that the device has. So basically, if I pass it a null key, I start at the beginning. If I pass it an unknown key, I'll start at the beginning. So if, in the process of doing a sequence of list commands, I delete one in the middle and that was where I was going to start again, I'm going to bump myself back up to the beginning. And basically,
Starting point is 00:44:18 that's where the list command will not be the same if you make any modifications to the keys that are stored. So the exist command: very much like the API, it returns success if the key exists in the key space. The exist command as we have it defined today, for the first phase of this, only takes one key. It does not allow you to pass a list of keys and find out if all of them exist. So this is a one-key-at-a-time exist.
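The iteration pattern just described for the list command, taking the last key of each batch as the starting point of the next, might look like this on the host. The kv_list call and its packed result format are the same assumptions as in the earlier sketch, and the walk is only stable if no store or delete happens in between:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same hypothetical call and packed result encoding as the earlier sketch
 * (2-byte length, then the key bytes, repeated). Error handling is minimal. */
int kv_list(int dev, uint32_t keyspace,
            const void *start_key, size_t start_key_len,
            void *buf, size_t buf_len, uint32_t *keys_returned);

int walk_keyspace(int dev, uint32_t ks)
{
    uint8_t  buf[4096];
    uint8_t  last_key[32];
    size_t   last_len = 0;   /* null/unknown starting key: begin at the beginning */
    uint32_t n;

    do {
        if (kv_list(dev, ks, last_key, last_len, buf, sizeof buf, &n) != 0)
            return -1;
        size_t off = 0;
        for (uint32_t i = 0; i < n; i++) {
            size_t klen = (size_t)buf[off] | ((size_t)buf[off + 1] << 8);
            off += 2;
            /* ... visit the key at buf + off, klen bytes ... */
            if (i == n - 1 && klen <= sizeof last_key) {
                memcpy(last_key, buf + off, klen);  /* remember the last key,   */
                last_len = klen;                    /* assumed to be the start  */
            }                                       /* point of the next batch  */
            off += klen;
        }
    } while (n > 0);   /* a store or delete in between can change the ordering */
    return 0;
}
```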
Where are we at?
Starting point is 00:45:10 So the work in NVMe has been provisionally approved by the NVMe board. They are still looking at the value of this within the industry. Samsung and a number of other people see some significant value, but the NVMe board wants to make certain that NVMe doesn't get too broad, too diluted. We are developing the architecture. We are having weekly subgroup meetings.
Starting point is 00:45:46 They are one hour in length. We use up the full hour. There are lots of discussions going on. So if you're involved in NVMe and are interested in this, I would strongly encourage you to come participate in that. The document is being posted, so you have access to it if you are an NVMe member. But we are working hard towards completing the architecture. So we are having very deep architectural discussions of what this architecture should look like. My push is for member review beginning around the end of 2018, possibly early into 2019. So that's where we're at in NVMe. With that, I'd like to thank you, and we do have about three minutes left for questions. Thank you. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an
Starting point is 00:46:55 email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
