Storage Developer Conference - #144: Key Value Standardized

Episode Date: April 6, 2021

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 144. Hello, good morning, good afternoon, good evening, wherever you may happen to be, whatever time it is that you are watching this presentation. This is the Storage Developer Conference 2020 presentation on key value standardization. I am Bill Martin. I am from Samsung. I'm also co-chair of the SNIA Technical Council, co-chair of the SNIA Object Drive Technical Work Group, where key value API standardization has been done.
Starting point is 00:01:12 I am on the NVMe Board of Directors, and I am also the chair of the NVMe Key Value Task Group, developing the NVMe key value standard. So thank you for joining me today. And I hope to present to you information that will help inform you about the key value architecture and key value standardization. Our agenda will start by going over the architecture, talk about why key value is important, and then talk about the standardization effort in NVMe and in SNIA, and finally present to you some open source libraries that are available. Starting with the architecture. I'd like to start by just defining how key value is different than some other architectures that you may be familiar with. First off, block storage versus key value storage. Block storage, the data is stored in blocks of a fixed size. Today, that size is 4K bytes, could be 8K bytes.
Starting point is 00:02:24 Older devices were 512 byte blocks. Key value, the data is stored as unstructured data. There is no fixed size to it. It is whatever size is defined for a particular key value pair. In block storage, your data is addressed by a logical block address. It addresses one logical block or multiples of logical blocks. In key value, your data is addressed by a key, which has nothing to do with the physical storage media or even a logical translation to that. In block storage, the LBA is a fixed number of bytes, which limits you to exactly how many logical blocks you can store on a device. On key value, the key is variable length. This still limits you based on whatever the largest size the key is to how many key value pairs could be stored. However, it is not related to blocks on the device. In block storage, your storage space is allocated in integer multiples of block size.
Starting point is 00:03:38 So if you have a 4K byte block size, you allocate either 4K bytes, 8K bytes, 12K bytes, 16K bytes, etc. In key value, the storage space is allocated in increments of bytes. So if your value happens to be one gigabyte, then you allocate one gigabyte of space. If your value happens to be 32 bytes, then you allocate 32 bytes of space. That does not mean that on a key value device, you may not have some physical limitations that control how you actually allocate data, but it allows you to allocate it on whatever byte boundary your physical media is capable of allocating on. In block storage, there's a one-to-one mapping
Starting point is 00:04:36 between logical blocks and physical blocks. Now that doesn't mean that logical block one maps to physical block one, but your storage is broken into physical blocks that are of the same size as your logical blocks. So each logical block maps to one specific physical block. That mapping could change. In key value, the value is associated with an amount of physical storage necessary that has nothing to do with a block size. Again, you could store 32 bytes. The mapping is done within a mapping table that maps your key to point to some physical storage where your value is stored.
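The allocation difference described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the 4K byte block size is the example figure from the talk, and the key value case is idealized (a real device may still round to whatever boundary its media supports).

```python
def block_allocation(value_len: int, block_size: int = 4096) -> int:
    """Bytes consumed when space is allocated in whole blocks (rounded up)."""
    blocks = -(-value_len // block_size)  # integer ceiling division
    return blocks * block_size


def kv_allocation(value_len: int) -> int:
    """Bytes consumed when space is allocated per byte (idealized key value case)."""
    return value_len


if __name__ == "__main__":
    for size in (32, 4096, 4097, 1 << 30):
        print(size, block_allocation(size), kv_allocation(size))
```

Under these assumptions a 32 byte value consumes a full 4K byte block on the block device and only 32 bytes on the key value device.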
Starting point is 00:05:29 Next, I'd like to cover what is the difference between key value and object storage. This is a little bit more complex, a little bit less obvious. Often people will think of object storage as key value storage. The key value that I'm defining here is key value storage as defined in the NVMe and SNIA standards. So this is not necessarily a generic industry term, but it is the term that is used related to those standards. So key value, the data is stored based on a key on a native key value device. In object store, the data is stored based on an object identifier, but typically it is stored on a block storage device. Now, as we move forward, you may find that you are actually taking your object and your object identifier and using those to map onto a key value device.
Starting point is 00:06:33 That may save you some, but you still have that translation that has to happen due to some of the other differences as we go down this list. Key value, the key is a variable length entity. In object storage, the object identifier is fixed length, but then in addition to the object identifier, you have metadata that is also associated with pointing to the actual object.
Starting point is 00:07:06 So your location of the object is identified both by a fixed length object identifier and a variable piece of metadata. In key value, the storage provides the mapping of key to value. In object storage, you have a protocol that provides mapping of the object identifier to the object and that protocol is typically done on the host system some layer above the physical storage media. And key value storage is device level only in its implementation that we're describing here. There may be other implementations where it is at a higher level. But for what we're talking about here, it is device level only.
Starting point is 00:08:11 In object storage, the object may be split across multiple levels. That allows you to shard your object, etc. But when you get to the bottom of that, key value storage can be a perfect place to store a piece of your object. In key value, you have no metadata associated with your value. In object storage, as I talked about earlier, you do have metadata that is associated with your object. In key value, the value is contained on the key value device. In object storage, the object may be split across multiple devices and the splitting of that is handled by a protocol or software layer
Starting point is 00:08:59 somewhere above the physical device. One of the things that that means is in key value, because your entire value is stored on the device based on a key, it makes your data potentially more searchable. In object storage, you have to pull your object together from the multiple devices and then do whatever searching you want to do on that object. So the characteristics of key value storage, again, this is based on the current implementation. The key is variable length. The current NVMe specification allows from one byte to 32 bytes. Currently, it is extensible to allow for a larger key. It is unique across that key value device. You could, by protocol, make it unique across multiple key value devices.
Starting point is 00:10:11 And depending on what your implementation and your application are doing, that may be a useful thing to do, but it is not required. The value is variable length. It is from one byte to megabytes or more, and we'll get into the exact size limits in NVMe further down as we talk about the NVMe standardization of key value. So how does key value operate? Storing. Data is stored as a single value associated with a key. It is not updatable in place. So what that means is I can't go and say replace this portion of the value. You have to either have the entire value in your memory or read it up, modify it, and then rewrite the entire key value pair, and it then gets placed typically
Starting point is 00:11:20 in a new location on the storage device. It's not extendable in place. In other words, if I have written a key value pair where the value is 32 bytes long and I want to add another 32 bytes to it, I can't append to that existing one, I need to read up the 32 bytes and create a 64-byte value and then write that 64-byte value as a single key value store operation. When you store, you are storing the complete value. All of this is the current implementation within the standards and there is room for extending those and potentially changing some of these to where you could potentially update in place. You could potentially append to what's out there, you could potentially write your value as several different pieces.
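The read, modify, rewrite pattern described above can be sketched as follows. `ToyKVDevice` is a hypothetical in-memory stand-in used only for illustration; it is not the NVMe command set or the SNIA API, and it models only the rule that a store always writes the complete value.

```python
class ToyKVDevice:
    """Hypothetical in-memory stand-in for a key value device."""

    def __init__(self):
        self._pairs = {}

    def store(self, key: bytes, value: bytes) -> None:
        # A store writes the complete value; an existing value is replaced whole.
        self._pairs[key] = bytes(value)

    def retrieve(self, key: bytes) -> bytes:
        return self._pairs[key]


def append_to_value(dev: ToyKVDevice, key: bytes, extra: bytes) -> None:
    """Emulate an append: read the old value, build the new one, store it whole."""
    old = dev.retrieve(key)
    dev.store(key, old + extra)


if __name__ == "__main__":
    dev = ToyKVDevice()
    dev.store(b"rec1", b"a" * 32)
    append_to_value(dev, b"rec1", b"b" * 32)
    print(len(dev.retrieve(b"rec1")))
```

The host, not the device, does the concatenation here, which matches the talk's 32 byte to 64 byte example.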
Starting point is 00:12:29 However, the current implementation does not do that. One of the questions that comes up is, what is the atomicity of a key value pair? And because the key value pair is writing a complete value, it is atomic to that key value pair. It does not require that you maintain a log of what you're doing in order to be able to recover. When you do a write or a store of a key value pair, if you have a power fail in the middle of that, when you come back, you will either get all of your original key value pair that was there, or you'll get all of your new key value pair. Writing a complete value simplifies the ability to provide an atomic operation. Retrieving. Data is retrieved as a single value associated with a key. In the future, it could be a portion of the value. Even today it can be a portion of the value; however, that portion has to start at the beginning. And we'll get a little bit more into that as we go into the actual
Starting point is 00:13:58 NVMe standardization and how you can retrieve just the beginning portion of the value or some other piece of it that you're interested in. Deleting. A key value pair may be deleted. Unlike block storage, block storage has the concept potentially of deallocating storage, but there is no delete operation for a particular data stored at a logical block address. In key value, you may delete it, which effectively does a deallocation of that, but it is a deallocation as opposed to a recommendation for a deallocation, which most deallocates on block addressable devices are recommendations that may or may not be actually implemented. And it's only when you do an overwrite that you know that that old data is not accessible. Key value operations have the ability to list all of the keys stored on the device.
Starting point is 00:15:18 This is useful if you have a device and you want to find out all of the data that's there and be able to copy it to another device. In a logical block address device, you can just start reading at logical block address 0 and go to your maximum logical block address and attempt to read everything that's there, whether it was or wasn't written, and copy it somewhere else. This allows you to find all of the key value pairs that are stored on a device in order to recover those to some other media. So next, let's talk about why is key value important. So I'd like to start by talking about a block architecture versus a key value architecture for storing data. So in a block architecture you have your data center software infrastructure goes through some key value glue logic
Starting point is 00:16:15 into a key value API but then you have to do a software key value store where you map that key to a block storage and you actually typically will be doing that through some sort of a file system so you actually are now mapping the key to a file system which then is in turn mapped to the block storage, you then go through a block interface to your block storage device and write it in blocks where you have mapped between the block and the logical block and the physical block. In key value, you've got the same top part of this, but when you hit the key value API, you have a fairly thin library interface to your particular device protocol. And then you actually store the key value pair right down onto the key value device.
Starting point is 00:17:19 And there's one mapping table here that is your key mapping to where the physical storage for the value is. This removes two layers of mapping. It removes some transaction information. So with this, you increase your transactions per second. You decrease your write amplification factor, and you decrease your latency because you have a key value interface for things that are appropriate to store as key value pairs. Some use cases. Storing photos or videos as a single addressable object. This means that the photo or video is stored on your device with just a single address that points to it which is your
Starting point is 00:18:34 key. What this means is if I add computation to my storage and I want to do some sort of processing on the photos that happen to be on the device, I can go through and look at the data associated with each key, knowing that it's a photo, and do whatever processing on that I want to do, whether that is sorting or converting the format, whatever it is. And I know that each key points to a complete object that I can operate on. Additional use cases would be storing records associated with a unique identifier. For example, if you want to store medical records for an individual. Your key in this case would be the personal medical record identifier for that individual, and your value would be their complete medical record in
Starting point is 00:19:36 whatever format you happen to store it in. Likewise, employment records. You have an employee identification number and then, associated with that, the employment record. You then have something on your storage device that is easy to take computational storage and utilize that to then search for what you want on your device and return just the information that you want. So in summary, benefits of key value. It removes a translation layer, which is a performance benefit. It allows storage device to manipulate data based on content, search for values for a particular pattern, perform encoding on the value. It removes the provisioning overhead that you have in block storage because there is no pre-assigned mapping of logical to physical association. You make that mapping at the time that you store a key value pair where you map the appropriate amount of storage to your value. You know how much
Starting point is 00:21:10 storage by reading information from the device. You know how much total storage is available and you can use all of that storage and not have to worry about how the mapping is done. In logical block address devices, when you run out of space, you may or may not know that if you are thin provisioned and you have fewer physical blocks than you have logical blocks. This removes that layer of ambiguity. The limit to the address range is not based on the size of the physical storage. So you can have an address range that is much greater than your physical storage, but it is an address that has somewhat more of a meaning than a logical block address. It is possible by the protocol that you use above the devices to ask each of them whether or not that particular key exists on the device.
Starting point is 00:22:45 And my result would be that one of them would come back and say, oh, that key value pair does exist on this device. And at that point in time, your application knows where to go to retrieve that information. So next, let's talk about standardization. First, we're going to talk about the NVMe key value command set. Second, we'll talk about the SNIA key value API. So in the NVMe standardization, we have defined an NVMe key value command set. It is a unique specification within NVMe. Within NVMe, we needed to come up with a way for multiple command sets to be associated with a single NVMe base specification, which includes the administrative commands, the queue definitions, the log pages, asynchronous event notification,
Starting point is 00:24:14 as well as the transport layer of NVMe over PCIe or NVMe over Fabrics. There is another Storage Developer Conference talk that is being presented during this Virtual Storage Developer Conference that is talking about the NVMe 2.0 specification process and where we are at within that process. So NVMe key value basic constructs. The key is specified in the command. What that means is we have a 32-byte maximum length key. The reason for that is we have a 64-byte command, and in order to do anything more than 32 bytes, we either have to extend the command structure or go outside of the command to pass the key. We are looking at both of those as enhancements
Starting point is 00:25:27 for the future, but the current specification does limit you to a 32 byte maximum length for the key. There is a one byte minimum length for the key. The reason for that is because the key length is specified in bytes, which is the next point here, is that your key is in one byte granularity. So you can specify anything from one byte of key up to 255 bytes of key length. An N byte key does not match an M byte key.
Starting point is 00:26:13 And that is actually a very important fact to be aware of, in that if you have a two byte key that is 00BE hex, that does not match a one byte key that is BE hex. So the additional byte, even if it happens to be zeros in hex, does have significance, and it actually allows many more possible combinations if you choose to use them. In terms of the value, the length again is specified in the command. The command allows up to four gigabytes of length. Another unique characteristic about the value, however, is the value may be zero length. There are several reasons why. One is that the existence of the key could be a flag indicating something. The other, as we go through the commands and options for the store command, is that you may want something that says, these are the valid keys that may be stored on this device, and so you go and set the device up to accept certain keys and only those keys, and then only those keys you
Starting point is 00:27:58 pre-assigned may be written on the device, so you would therefore start by storing those keys with zero length values. So I'd like to go through each of the commands that are defined for the NVMe key value command set. The first of those commands is the store command. It provides the ability to store a key value pair. There are three options available within that command. The first of them is compress versus no compress. This is only valid, or is only meaningful, if you have a key value device that does compression. If you have a key value device that does do compression, there are a couple of reasons why you might send your store command requesting no compression. The first of them is if you know that what you're storing
Starting point is 00:29:09 is already compressed, then you do not want to compress it again during the store operation, which may actually cause the size to increase. The second of the reasons for using the no compress option is if you have a device that you are trying to replicate and you want to read the data and store it on another device, what you want to do is read the data that's already been compressed and then store it without compression so that you are only moving the compressed data from
Starting point is 00:29:54 one device to another. Now, that's only valid if you know that the compression algorithm used on both devices is the same. The next option is do not overwrite. This option basically says if a key value pair exists and you do a store command for that same key with the same value or a different value doesn't matter, that store command will fail if you have set the do not overwrite option. This prevents you from writing over a key value pair
Starting point is 00:30:37 that exists inadvertently if you believe that you're only writing new key value pairs to the device. The other option that exists is do not create. This is almost the inverse of do not overwrite. And it goes back to what I described in the previous slide of if I set this option, it means I want to have first gone down and stored on the device all of the keys that are allowed to be written to this device. And then by setting do not create, it means that if that key has not already been written to the device, I would get an error when I try to store that key value pair. Setting do not overwrite and do not create both at the same time will basically mean that you cannot do anything with a store command.
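The do not overwrite and do not create options described above can be sketched with a small in-memory model. Everything here is hypothetical and for illustration only: the class, the keyword arguments, and the error type are not names from the NVMe specification. The 1 to 32 byte key limit is the figure given in the talk.

```python
class KVError(Exception):
    """Hypothetical error type standing in for a failed command status."""


class ToyKVDevice:
    MAX_KEY_LEN = 32  # maximum key length stated in the talk

    def __init__(self):
        self._pairs = {}

    def store(self, key: bytes, value: bytes, *,
              no_overwrite: bool = False, no_create: bool = False) -> None:
        if not 1 <= len(key) <= self.MAX_KEY_LEN:
            raise KVError("invalid key length")
        if no_overwrite and no_create:
            # Both options together would forbid every store.
            raise KVError("illegal option combination")
        exists = key in self._pairs
        if no_overwrite and exists:
            raise KVError("key already exists")
        if no_create and not exists:
            raise KVError("key does not exist")
        self._pairs[key] = bytes(value)

    def exists(self, key: bytes) -> bool:
        return key in self._pairs


if __name__ == "__main__":
    dev = ToyKVDevice()
    # Pre-register the allowed keys with zero length values.
    for allowed in (b"\x00\xbe", b"\xbe"):
        dev.store(allowed, b"")
    # Later stores use no_create so that only pre-registered keys are accepted.
    dev.store(b"\xbe", b"payload", no_create=True)
    try:
        dev.store(b"\xbf", b"payload", no_create=True)
    except KVError as err:
        print("rejected:", err)
```

Because keys are compared as byte strings of their exact length, the two byte key `00BE` and the one byte key `BE` are distinct entries in this model, as the talk describes.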
Starting point is 00:31:46 So setting the two of those at the same time is an illegal option. So the retrieve command, this is the equivalent of a read command in a logical block device. It provides the ability to retrieve a value associated with a key. Again, you have options. However, in this case, there's actually only one option, and that one option is either decompress or provide raw data. Again, this has a couple of reasons why you would use this option. One of them is if you know that the data was compressed before being stored using some
Starting point is 00:32:36 algorithm that is different than the compression algorithm on the device, then you don't want to decompress it using the wrong algorithm. For example, say you know that you have a zip file that has been compressed using some form of zip, and you want to store it on the device. When you stored it, you would store it saying, do not compress. When you retrieve it, you want to retrieve the raw data and do the unzip in the host or in the application. The second, also talked about during the store command, is if I want to move data from one device to another, where the devices share the same compression algorithm, I would like to retrieve my data as raw data,
Starting point is 00:33:39 meaning I'm getting the compressed version of it, and then store it without compression to the other device. When I actually want to retrieve the data for the purpose of using the data, then I would allow it to be decompressed on retrieve. The size of the value is returned in the completion queue entry. Now what this means is assume that I have a 7K byte value stored with the key. I do two things with that. If I do a retrieve command, if my buffer that I'm given for that retrieve command is 7k bytes or greater, then I retrieve all of the value into that host buffer, and I return the length of 7k bytes in the completion queue entry.
Starting point is 00:34:46 What this does is allows the application to know how much of the buffer that it has has valid data in it. Now, the other thing that I can do with that is I can give it a retrieve command with a zero length buffer and simply find out how big the value is. This allows me then in the future, if I haven't maintained that information in my application, to find out how big it is, then allocate a buffer that is an appropriate size for the value that is stored there, and then go back and issue a subsequent retrieve command for the entire value. I can also issue a retrieve command where I specify 32 bytes or 64 bytes. And what I may be doing with that is I will get the first 32 bytes or 64 bytes. It will always start at the beginning, but if I have a header that I could parse to determine whether
Starting point is 00:36:05 or not this particular value is one that I'm actually interested in, then I've been able to retrieve that header and make those determinations without retrieving the entire value. The thing that I cannot do, the last point here, is I cannot return data starting at an index. If I want the entire value or I want a specific portion of the value, the host must provide a buffer large enough to retrieve the entire value or at least the value up to the point that is of interest. So in other words, if I wanted the piece of the value that happened to exist from 1 Kbyte into the value to 2 Kbytes into the value, and the value happens to be 1 Megabyte long,
Starting point is 00:36:59 I need to provide a 2 Kbyte buffer and then within the application, parse out the particular 1 Kbyte that I am interested in. The next command is an exist command. This command takes a key as an input. It returns a status of 00 if the key value pair exists, where a status value of 00 is command completed successfully. It returns a status of key does not exist if the key value pair does not exist. This allows you to go and determine whether or not a specific key value pair does exist on the device that you are talking to. The next one is a more complicated command in terms of how it operates.
Starting point is 00:37:54 It is the list command. It returns a list of keys, starting at the key specified in the command if that key exists on the device. If that key does not exist on the device, then it starts at the beginning of its list. However, the list is not in a sorted order. What that means is it isn't alphanumerically sorted. It just is in whatever order the device chooses to return it in. This may be in the order that things are stored in the lookup table that the device has for key value pairs or any other order. However, that list is idempotent if there are no intervening store or delete commands.
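The retrieve behavior described above (data always starts at the beginning of the value, and the full value length comes back with the completion) can be sketched as below. The function names and the dictionary standing in for the device are hypothetical; this is a model of the described semantics, not the NVMe command format.

```python
def retrieve(pairs: dict, key: bytes, buffer_len: int):
    """Return (bytes copied into the host buffer, full length of the stored value)."""
    value = pairs[key]
    # Data always starts at offset 0; at most buffer_len bytes are copied.
    return value[:buffer_len], len(value)


def value_length(pairs: dict, key: bytes) -> int:
    """Probe the size with a zero length buffer, as described in the talk."""
    _, full_len = retrieve(pairs, key, 0)
    return full_len


def read_range(pairs: dict, key: bytes, start: int, end: int) -> bytes:
    """No retrieve-at-offset exists, so fetch up to `end` and slice on the host."""
    data, _ = retrieve(pairs, key, end)
    return data[start:end]


if __name__ == "__main__":
    pairs = {b"photo": bytes(range(256)) * 28}  # a 7K byte value
    print(value_length(pairs, b"photo"))
    header, _ = retrieve(pairs, b"photo", 64)   # first 64 bytes only
    print(len(header), len(read_range(pairs, b"photo", 1024, 2048)))
```

To read the second kilobyte of a value, the host in this sketch supplies a 2 Kbyte buffer and discards the first kilobyte, which is the workaround the talk describes.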
Starting point is 00:38:57 What does that mean? If I give two identical list commands with the same starting key, I will get the exact same response. What this allows for is if I don't have a buffer large enough for all of the keys on a device, then I can send a list command with perhaps zero as my starting point. It will then start from the first key in its key value list and give me however many keys fit within the buffer that is allocated. I can then take the last key that is returned in that list command, use that in a subsequent list command, and get the next chunk of keys and keep doing that until I have gotten all of the keys available on the device. Now, that only works if you don't have intervening store or delete commands. Retrieve commands would not impact this because it doesn't change my list.
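The paging loop described above can be sketched as follows. The model is hypothetical: `device_keys` stands in for the device's internal, unsorted key order, `max_keys` stands in for how many keys fit in the host buffer, and the initial starting key `b"\x00"` is assumed not to be stored on the device. Because the list starts at the given key when it exists, each continuation page repeats the last key already seen, and the host drops that duplicate.

```python
def list_keys(device_keys: list, start_key: bytes, max_keys: int) -> list:
    """Return up to max_keys keys, starting at start_key if it exists, else at the beginning."""
    try:
        begin = device_keys.index(start_key)
    except ValueError:
        begin = 0
    return device_keys[begin:begin + max_keys]


def list_all(device_keys: list, max_keys: int, first_start: bytes = b"\x00") -> list:
    """Collect every key by chaining list calls; assumes no intervening store or delete."""
    if max_keys < 2:
        raise ValueError("need room for at least two keys per page to make progress")
    page = list_keys(device_keys, first_start, max_keys)
    result = list(page)
    while len(page) == max_keys:
        # Continue from the last key returned; it is repeated as the first entry.
        page = list_keys(device_keys, result[-1], max_keys)
        result.extend(page[1:])
    return result


if __name__ == "__main__":
    keys_on_device = [b"q", b"a", b"zz", b"m", b"\x01"]  # device order, not sorted
    print(list_all(keys_on_device, max_keys=2))
```

If a store or delete happens between two list calls, the device order may change and this loop can miss or repeat keys, which is the caveat the talk raises.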
Starting point is 00:40:27 The other thing that the list command does not do is it does not return a value length associated with each key. It just gives you a list of keys that exist on the device. If you need to find the value length, you would need to take your key and do a retrieve of that key with a zero length buffer for putting the value into. And that in turn would give you the value length in the completion queue entry for that command. So next I'd like to move on to the standardization in the SNIA key value API. General concepts, it is aligned with the NVMe key value command set. The APIs that you have, you have the ability to open a device. When you open a device, it returns a handle for that device. Once you've opened the device, you can retrieve device information,
Starting point is 00:41:41 which will give you things like capacity, the maximum key length, the maximum value length. You can create and delete key spaces on that device. A key space is equivalent to an NVMe namespace that is associated with the NVMe key value command set. The API allows you to retrieve key value pair information. It allows you to store a key value pair. It allows you to retrieve a key value pair, and each of these have the same options that are available in the NVMe command set. It allows you to delete a key value pair and to list your key value pairs. So those pieces of information are the same.
Starting point is 00:43:08 It also provides synchronous and asynchronous functions. The asynchronous functions carry a context that goes with the function so that when the result is returned asynchronously, you can associate that return with the command that actually was being operated on. The other thing that the key value API provides that is not part of the NVMe key value command set is grouping. This is done in the user library because it's not currently supported in the KV command set. In order to do grouping, you must have fixed length keys. So it limits what you can do on the device, but it provides some added functionality that may be worth that limitation. So what grouping does is it allows a portion of the key to identify a group. In other words, if you have four bits that are being used for your grouping, then if those four bits are set to a certain value, then all keys that start with the first four bits set to that value are part of that group. Now, it does require you to create a group. You can't just go and utilize it without first creating it. In order to create the group, if you have key value pairs already stored on your device,
Starting point is 00:44:50 then the device has to walk the tree to put the keys into the group. So what this does is it means that if you haven't set the grouping up before you store information, the device doesn't have to somehow maintain some order of the keys that are used for storing devices. Now, once you have created a group, future stores to the device do put the keys into appropriate groups
Starting point is 00:45:27 when those key value pairs are stored. Now, why do you want this? There are two specific functions that are enabled by grouping. One is you can list key value pairs that exist on the device only within the group. So this allows you to list a subset of keys but a set of keys associated with something that you have predetermined based on the bit settings for that group field. The
Starting point is 00:46:01 second thing that it allows is it allows you to delete an entire group with a single call to the API. So this means that you don't have to go and find all your keys that match a certain pattern and then go back and delete them one at a time, but can rather, if they are grouped, delete an entire group. So the specific APIs: you have open device. This is actually a repeat of a previous slide that got duplicated in here. So I've covered this and will not cover it again. Open source libraries. So Samsung has developed a number of open source libraries. The pointers to them are here. The slides are available to you as an attendee at the SDC conference, but we have KV API kernel driver and emulator that are available on a public GitHub. We have a key value user space driver that is also available on GitHub. We have a Ceph object storage designed for the Samsung key value SSD.
Starting point is 00:47:30 This allows you to take a Ceph object storage and implement it utilizing a key value storage device. Network key value APIs at a host software level that abstract multiple direct attach or remote key value SSDs are available. NVMe-oF drivers are coming soon. And finally, we have an open source implementation of KVRocks, which is a RocksDB-compatible key value store and MyRocks-compatible storage engine designed specifically for key value SSDs. All of these are available on the nvmexpress.org website under releases.
Starting point is 00:48:52 And the SNIA key value API is available on the SNIA website under their publicly available standards. So with that, I hope this has been useful to you. There is another presentation on key value during this conference and there are a number of presentations on computational storage, which could take advantage of key value storage. And finally, there are talks on NVMe, and I can point you to the NVMe 2.0 talk that is done by Jonmichael Hands and myself, which talks more about the process of going to NVMe 2.0 and the multiple command set work that is an undergirding for the key value storage. So I thank you very much and hope you enjoy the remainder of your day. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org.
Starting point is 00:50:31 Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
