Storage Developer Conference - #94: Key Value Storage Standardization Progress
Episode Date: May 6, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast Episode 94. So I am Bill Martin. I am from Samsung.
I'm presenting more on the status of standardization efforts for something that Samsung is helping to drive along with some other companies,
which is key value storage.
And as a representative of Samsung, I have to throw up this disclaimer that nothing I say can be held against me.
So what is key value storage? I'll start off with what it is.
It is storing an object or a value associated with a key.
It is also a new paradigm that is different than block storage. So with block
storage, you pass a logical block address and some amount of data that is defined by a length
that you pass with it. Instead of doing that, you are passing a key. The key may be variable length, it may be fixed length, and you are passing a value.
The value, again, also may be variable length or fixed length. And you don't pass a logical block
address, and we'll talk about why that's a good thing to be doing. It's also different than object storage. And the next slide will talk about how it's different than object storage.
So object storage, in terms of how people typically think of it, is it is a solution or platform.
It's something more than an individual device, generally.
It's something like a cluster of storage with a global namespace
that has an extensive higher level feature set. And it has features like global namespace
resiliency, other things that are associated with this large pool of storage.
We're not trying to do object storage in its large sense
and move that down into the disk drive.
Rather, well, let me talk about two other things on object storage.
Object storage at times has been considered a device, for example, the Kinetic object storage.
One of the differences here is Kinetic did not have a native object storage interface language.
It used an existing RESTful interface where you did puts and gets, but they weren't specific to the object and the label of that object.
Another characteristic typically associated with object storage
is that it is searchable based on the value in the object
as opposed to just searchable by the key that that object
is stored with.
And finally, object storage, when you think of a large system that is an object storage
system, performs operations based on the object key and the object value.
We're not talking about doing any of that high-level stuff
with key-value storage.
What we are talking about, we are talking about a device, an SSD.
So we're not talking about a large system of a whole bunch of SSDs.
We're talking about an interface to that individual SSD.
We're talking about a native key and value interface where we're defining a new set of commands,
a new set of application programming interfaces.
And we're talking about doing store, retrieve, delete, et cetera,
with the key and the value instead of a logical block address.
Searches and other operations on the value are done outside of the storage element.
So we're not trying to put a whole lot of compute into the device, but we're trying to offload some things.
So why do we want to do key value storage?
So solid state storage currently maps from an address to a physical location.
That's how it works.
Unlike the early days of disk drives, where there was a fairly fixed mapping from logical blocks to physical blocks,
when we moved to solid state storage, we moved to having to have a mapping table.
So what is key value storage? Well, it's a different mapping than what you currently do
for storing logical blocks,
but it's what solid state storage already does.
We already do a mapping.
It removes triple mapping that currently occurs in systems
today. Today, if I want to store a key value pair onto a solid state storage device, I have a mapping
from my key to a file system. Then I have a mapping from the file system
to a logical block.
Both of those mappings take place generally in the host.
Then finally, when that write command
or the retrieve command, the read command,
comes down to the device,
there's a third layer of mapping from a logical address to a physical address.
So we have three layers of mapping right now,
and what we're doing with key value storage is turning those three into a single mapping where you pass a key
to the storage device, you pass the value to the storage device, and the storage device
maps the key to a physical location or set of physical locations on your SSD. So we're trying through doing this
to eliminate a piece of the overhead.
So where are we developing this?
We're actually working with two different organizations
at the moment.
We're working in SNIA, the organization that's putting on Storage Developer Conference, to define a key value storage API.
We are working within NVMe to develop a key value command set.
And I'll get a little bit later to actually where we're at in each of those organizations. We don't have full approval in the NVMe work group,
but I'll talk about that a little bit later.
So I don't want to imply that this is fully embraced within the industry,
but want to talk about where we are currently and where we see things going.
So in SNIA, we have begun developing the key value storage API.
We currently have a revision that has been released for public review.
It's available for anybody who wants to look at it to pull it up,
look at it. And one of the things we really would like is people who are interested in this to go grab it, review it, and say, hey, hold on. I see some value in key value storage,
but I think you're missing this, or I think you ought to have this, or the way you've defined this is not exactly what we want.
We really are interested in public review of the work that we're doing within SNIA.
So please go grab it from the SNIA website.
SNIA has a lot of things under public review,
and they really are things that SNIA would like to see people go out and look at and review.
So what's included in our key value storage API?
We have key space management.
We have a store API.
We have a retrieve API.
And I'll go into each of these in detail as we go through this.
We have an exist API to determine whether or not a key exists.
We have a delete API, and we have group operations.
All of those are defined within the key value storage API.
So key space management.
Multiple key spaces can coexist on a single key value device.
Each key space has its own characteristics. It has its own capacity. It has key ordering
if that is supported by the particular device, or that may actually be supported in a library
above the device. So this is an API. There is an interface between the device and the API.
That device may do some of the processing in the host. It may do all of that processing on the
device. There are also device characteristics that are global. So the API talks about some of
those global characteristics and also the specific characteristics related to key spaces.
One of the other characteristics of a key space is that a key within one key space
is unique from a key in another key space.
So there is a uniqueness in terms of addressing between the key spaces.
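To make that concrete, here is a minimal sketch, in C, of what addressing by the (key space, key) pair means: the same key can live in two key spaces and refer to two different values. The types and function names are purely illustrative assumptions, not the identifiers defined in the SNIA KV API.

```c
/* Illustrative only: lookups are effectively keyed on (key space, key). */
#include <stdio.h>
#include <string.h>

struct toy_entry { int key_space; char key[33]; char value[64]; };

static const char *toy_lookup(const struct toy_entry *tbl, int n,
                              int key_space, const char *key)
{
    for (int i = 0; i < n; i++)
        if (tbl[i].key_space == key_space && strcmp(tbl[i].key, key) == 0)
            return tbl[i].value;
    return NULL;
}

int main(void)
{
    struct toy_entry tbl[] = {
        { 0, "config", "value-in-key-space-0" },
        { 1, "config", "value-in-key-space-1" },  /* same key, different key space */
    };

    printf("%s\n", toy_lookup(tbl, 2, 0, "config"));
    printf("%s\n", toy_lookup(tbl, 2, 1, "config"));
    return 0;
}
```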
The store command, fairly simple.
It's very much like a write command.
It specifies the key space that you're addressing.
It specifies the key that you are attempting to store. It specifies the value you're attempting to store. Okay, that's pretty
much what you have in a normal write command. There are three options that are associated with
it. The first is fairly straightforward: do I compress the data? That's an instruction to the device, if it has the capability to compress, to either compress or not compress this data. Now the next two are a little bit different than what you have in a traditional block storage paradigm.
And they have to do with when do I or don't I perform this store command.
And they are overwrite and update only.
With overwrite, overwrite says if that particular key exists,
do I overwrite the value that's there, or do I give an error indicating that value already exists, you are not allowed to overwrite it?
Update only is a matter of determining if the key exists on the device, then I'm allowed to update it.
If the key does not exist on the device, then I flag that as an error that you have not created this key,
therefore you cannot update it, therefore you cannot write it.
What this does is there are cases where you want to create the key
by saying overwrite and not update only.
So you create the key initially by writing to it, by storing to it.
And the next time you come down, you say,
I know I already created the key, therefore I am updating it. Now, one of the other differences
between key value storage and block storage is when you do an update, you update the entire value. In block storage, if I write
LBA N and I say I want to write 10 logical blocks starting with LBA N, I can then come back and say, okay, I want to just write LBA N plus one. And I can write some piece of that
overall block that I wrote initially. With key value storage, your unit of atomicity
is the entire value. So when I come down to write the value again, I rewrite the entire thing.
I don't update.
Now, one of the things related to this, and it wasn't really on the slide about why we want to go here, is what this avoids when you look at NAND flash technology. By updating the entire value,
you avoid a piece of garbage collection.
If you can say my entire value is contained in some number of erase blocks,
then you're not saying I have to move things around when I write a subset of it when I update it.
When I update, I rewrite the entire thing someplace new,
and now I can erase the entire old piece.
So store really kind of contains a lot of the architecture
that we feel is important in the key value storage API.
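As a rough, self-contained sketch of the overwrite and update-only behavior described above, here is a toy in-memory store in C. The names, error codes, option flags, and limits are illustrative assumptions, not the actual SNIA KV API definitions; the point is only the decision logic and the fact that the whole value is replaced on every store.

```c
/* Toy, in-memory model of the overwrite / update-only options. Illustrative only. */
#include <stdio.h>
#include <string.h>

enum kv_status { KV_OK, KV_ERR_EXISTS, KV_ERR_NOT_FOUND, KV_ERR_NO_SPACE };

#define TOY_SLOTS 8
struct toy_pair { char key[33]; char value[128]; int in_use; };
static struct toy_pair table[TOY_SLOTS];

static struct toy_pair *find(const char *key)
{
    for (int i = 0; i < TOY_SLOTS; i++)
        if (table[i].in_use && strcmp(table[i].key, key) == 0)
            return &table[i];
    return NULL;
}

static struct toy_pair *free_slot(void)
{
    for (int i = 0; i < TOY_SLOTS; i++)
        if (!table[i].in_use)
            return &table[i];
    return NULL;
}

/* Store with the two existence-related options from the talk. The value is
 * always written in its entirety; there is no block-style partial update. */
static enum kv_status toy_store(const char *key, const char *value,
                                int overwrite, int update_only)
{
    struct toy_pair *p = find(key);

    if (p && !overwrite)   return KV_ERR_EXISTS;    /* key exists, refuse to replace */
    if (!p && update_only) return KV_ERR_NOT_FOUND; /* key was never created         */

    if (!p) p = free_slot();
    if (!p) return KV_ERR_NO_SPACE;                 /* toy table is full             */

    snprintf(p->key, sizeof p->key, "%s", key);
    snprintf(p->value, sizeof p->value, "%s", value);
    p->in_use = 1;
    return KV_OK;
}

int main(void)
{
    printf("create foo:          %d\n", toy_store("foo", "v1", 1, 0)); /* KV_OK            */
    printf("update foo:          %d\n", toy_store("foo", "v2", 1, 1)); /* KV_OK            */
    printf("update missing bar:  %d\n", toy_store("bar", "v1", 1, 1)); /* KV_ERR_NOT_FOUND */
    printf("no-overwrite on foo: %d\n", toy_store("foo", "v3", 0, 0)); /* KV_ERR_EXISTS    */
    return 0;
}
```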
Retrieve, once you know what store is, it's pretty much the converse of it.
However, when you get down to the options, it's got decompress, the converse of compress.
The other one, though, that we have put into the API is the ability to delete after retrieving.
We have particular applications that people have come to and said,
we want to be able to retrieve the value, but after we retrieve it, we are done with that value
and would like to delete it.
So it is kind of a fused command where it's two commands in one,
where you do a retrieve immediately followed by a delete.
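Here is a similarly hedged sketch of that fused retrieve-plus-delete behavior. The function name and signature are made up for illustration; only the semantics, retrieve the value and optionally remove the pair in the same operation, come from the talk.

```c
/* Toy illustration of delete-after-retrieve. Names and signature are illustrative only. */
#include <stdio.h>
#include <string.h>

struct toy_pair { char key[33]; char value[128]; int in_use; };

/* Retrieve the value for a key; if delete_after is set and the retrieve
 * succeeds, the pair is removed as part of the same operation. */
static int toy_retrieve(struct toy_pair *tbl, int n, const char *key,
                        char *out, size_t out_len, int delete_after)
{
    for (int i = 0; i < n; i++) {
        if (tbl[i].in_use && strcmp(tbl[i].key, key) == 0) {
            snprintf(out, out_len, "%s", tbl[i].value);
            if (delete_after)
                tbl[i].in_use = 0;   /* the value is consumed exactly once */
            return 0;
        }
    }
    return -1;  /* retrieving a key that is not there is an error */
}

int main(void)
{
    struct toy_pair tbl[2] = { { "job-1", "payload-A", 1 }, { "job-2", "payload-B", 1 } };
    char buf[128] = "";

    printf("first retrieve:  %d (%s)\n", toy_retrieve(tbl, 2, "job-1", buf, sizeof buf, 1), buf);
    printf("second retrieve: %d\n",      toy_retrieve(tbl, 2, "job-1", buf, sizeof buf, 1));
    return 0;
}
```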
So the next one is list.
So, list returns a list of keys.
That list has a starting point.
The starting point is an index into that list.
And, again, this is on a per key space basis. So I didn't actually run down all
the parameters that go with this. So list has a parameter of the key space that you're listing in.
It also has a parameter of the key that is the starting point for your list. It has a size of what is to be
returned. And based on the key size, it determines how many keys you will return.
So if you have small keys, you may return more keys. If you have large keys, you may return fewer keys. And I'll get into a little bit
more. At the very end of the API stuff, I'll talk about our group functions. List is one of the
commands that can use the group functions within the API to narrow down the scope of what you're doing, but I'll touch
on that a little bit later in another slide.
Exist, again, is a fairly straightforward API, and you specify a key space, as you do with all of these APIs,
and you return success if that particular...
Oops, I did my slide wrong.
You return success if the key exists, not the value.
So I apologize, my slide is wrong.
So secondly, this may be performed on a list of keys.
So you may actually ask, here's a whole list of keys, do they exist?
Again, this is another one.
No, this is not one that's part of the group commands.
So you can give it a list of keys and ask, does this list of keys exist?
And it basically returns success for each particular key.
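A small illustrative sketch of exist over a list of keys, returning a per-key success indication and no value data, which is the point of exist compared with retrieve. The names are hypothetical, not the API's actual identifiers.

```c
/* Toy existence check over a list of keys. Illustrative only. */
#include <stdio.h>
#include <string.h>

/* results[i] is set to 1 if keys[i] is in the stored set, 0 otherwise.
 * No value data is returned. */
static void toy_exist(const char **stored, int nstored,
                      const char **keys, int nkeys, int *results)
{
    for (int i = 0; i < nkeys; i++) {
        results[i] = 0;
        for (int j = 0; j < nstored; j++)
            if (strcmp(keys[i], stored[j]) == 0) { results[i] = 1; break; }
    }
}

int main(void)
{
    const char *stored[] = { "alpha", "charlie" };
    const char *query[]  = { "alpha", "bravo", "charlie" };
    int results[3];

    toy_exist(stored, 2, query, 3, results);
    for (int i = 0; i < 3; i++)
        printf("%s exists: %d\n", query[i], results[i]);   /* 1, 0, 1 */
    return 0;
}
```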
Yes? Yes? Pardon?
If you attempt to retrieve a key that is not there, you will get an error.
Yes.
Okay.
Part of why you have exist as opposed to just retrieve is exist will let you know if the key is there,
but without returning the data which you would get from a retrieve. So, group operations. We have defined the group operations to allow operations to be performed on a group of keys.
That group of keys is specified by a set of bits within the keys.
They are high order bits only, so they start from the high order and go down. And we needed
to pick an end of that, and we picked the high order end so that if you have variable
keys, it's always the same set of bits. It's always the high order bits. So the group is specified by two different fields. One is how big is the set of bits that define the group.
And the second is a mask of those bits to say if those bits equal this particular value,
then all of the keys that match that characteristic are part of the group.
So the operations that are supported doing that are the list operation.
So you can say rather than a list of all of the keys in that key space,
which may be a large number,
I only want the ones that fall within this group.
Now, your list, if you're not getting all of them but getting some of them,
it's controlled by the group characteristics,
but also is controlled by the size that you're asking for. The other supported operation is the delete operation which allows me to delete all
of the keys that are associated with that particular group identifier. So what this does is it means that if I have a key and that key has
40 hex in the top byte, in the most significant byte of the key, all of those keys and their
values are considered to be part of the group. And if I store things using that high order byte, differentiating different groups,
I can then say, oh, I'm done with this group of keys and their associated values,
and delete all of them in a single operation.
I can also go and find what are all of the keys within that group.
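The group-matching rule, a bit count plus a value for those high-order bits, can be sketched in a few lines of C. This is only an illustration of the rule as described in the talk, with made-up names; the actual field layouts in the API and command set may differ.

```c
/* Sketch of group matching on the high-order bits of a key. Illustrative only. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Return 1 if the first group_bits bits of key equal the first group_bits
 * bits of group_value, comparing from the most significant bit down. */
static int key_in_group(const uint8_t *key, size_t key_len,
                        const uint8_t *group_value, unsigned group_bits)
{
    if (group_bits > key_len * 8)
        return 0;                       /* key too short to carry the group prefix */

    size_t full_bytes = group_bits / 8;
    unsigned rem_bits = group_bits % 8;

    for (size_t i = 0; i < full_bytes; i++)
        if (key[i] != group_value[i])
            return 0;

    if (rem_bits) {
        uint8_t mask = (uint8_t)(0xFF << (8 - rem_bits)); /* keep only the top rem_bits */
        if ((key[full_bytes] & mask) != (group_value[full_bytes] & mask))
            return 0;
    }
    return 1;
}

int main(void)
{
    /* Group: keys whose most significant byte is 0x40, as in the example above. */
    uint8_t group[1]  = { 0x40 };
    uint8_t key_a[4]  = { 0x40, 0x00, 0x01, 0x02 };
    uint8_t key_b[4]  = { 0x41, 0x00, 0x01, 0x02 };

    printf("key_a in group: %d\n", key_in_group(key_a, sizeof key_a, group, 8)); /* 1 */
    printf("key_b in group: %d\n", key_in_group(key_b, sizeof key_b, group, 8)); /* 0 */
    return 0;
}
```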
No.
And we've changed the nomenclature to key space because there are some differences from
an NVMe namespace.
So the question was, do we have operations that operate across namespaces?
And no, we don't.
Yes?
You said in one of your questions that you have operations across key spaces.
For example, how to check what is the amount of free space on a drive?
Okay.
Those are not... In the API, the API is focused basically on the IO commands.
There is discussion of developing a key value management specification.
That management specification would be the place that we would put things
that gave us characteristics of the overall key value device.
And that will be a separate specification.
We have not started work on that.
That will be worked on within the same working group that's working on the API, but it would
be a management interface.
There's another talk that's being done this week, I believe it's tomorrow, by Mark Carlson,
talking about a management specification that's already been done for IP-based storage devices.
This would be a similar thing to what's been done for that.
It, again, will all be within the same technical work group within SNIA.
Right.
That's actually, that would be in the NVMe command set as opposed to the API. Okay, so that's in the next part.
Right.
And regarding the number of key spaces that you guys are thinking about,
like unlimited?
We have, for the API, the question was,
are we thinking about how many key spaces are supported for the API?
The API is written as a data structure-based API,
and there is not a...
I don't think there's a limit on it, but I'm trying to remember now.
I don't remember whether or not that is a data structure which would allow
an indefinite number or if there's an actual number of bits associated with it. There is
a name associated with your key space which is not limited. So it is a character string of unlimited size in the API.
Now there again, when we get to the command set,
which is the second half of this,
there will be more limits there
in terms of the size of fields.
But in the API, there is not a limit
to the number of key spaces that the API is
capable of supporting.
Okay.
Again, I'll talk about that a little bit more in the second half.
That comes down to the command specification.
The question was, are there limits on the size of the value based on what you can store?
When you get to the command set, there are limits.
Within the API, there are not currently limits.
But there again, that is a characteristic that needs to be discovered through management to determine for a given device what those limits are.
Another question back here.
So the question was, is there an enumeration within the API to specify the ordering characteristics,
whether it is ordered or is not ordered, and the type of ordering.
What the device supports may be different than the breadth of what the API supports.
But the API does support ordering information.
One more question.
On the wider...
Can you use this KV SSD
within the existing XFS file system, databases, etc., easily,
or will it be a heavier lift to modify those,
to kind of fit this into the existing file systems, databases, etc.?
Okay.
It's not intended to fit within the file system
because it really is intended to be a key value storage
which is not a file system storage. So the question is, have we done research that looks
at how this fits within the file system? What we've looked at rather than that is how can
you use things like RocksDB and how RocksDB can utilize this without having to go through the file system.
So this is basically taking an architecture where you get rid of the file system piece
and use this instead.
Now, an alternative here that has been looked at, has been explored, and is actually usable, is within the file system to pass the file name
directly as your key, and what's stored for that particular file is your object or your value.
Yes, we do.
Okay, and as I get to the end of this, I will talk about we do have some open source code available.
So what's our status of the KV API?
So we are developing this in the Object Drive Technical Workgroup within SNIA.
If you are a SNIA member, you're welcome to join.
If you're not a SNIA member, please join SNIA and join the Object Drive Technical Workgroup.
I do wear multiple hats. One of my hats is I am co-chair of the SNIA Technical Council.
So you'll hear me advertising, hey, come join SNIA and work with us.
But I also say that because I'd really like as much of the industry
looking at this and how they can use it so that we can make this a prolific next generation step for storage within the industry.
We are currently meeting weekly to review the specification, enhance the specification,
take inputs on the specification, and make modifications.
That's not the only thing that the Object Drive Technical Workgroup is doing.
They are also working on the IP storage management and one other item that Mark Carlson's pushing,
and I can't remember what it is at the moment.
So we have multiple things that that technical work group does work on.
We meet for one hour via conference call weekly.
In addition to that, coming up in October,
SNIA hosts a technical symposium that will be in Colorado Springs.
There, we'll have an entire eight-hour day where we will meet doing work in the Object Drive Technical Work Group,
of which about half of that day will be spent on key value and half on other items.
Currently, revision 0.16 is available for public review,
as I mentioned at the beginning of the API stuff.
And right now, my goal is aiming for the end of this year
for release of the specification.
So it is in public review now.
It is fairly solid at the moment, and we are looking for inputs,
and the types of things that we're having come in at the moment are more on the order of clarification
as opposed to enhancements.
So the NVMe command set.
Where are we?
Right now we're in architectural development.
At the end of this set, I'll do more of a status overview like I did in the KV API.
But what is the NVMe command set?
It is a new command set.
It is not being built as an extension of the block commands, but as a new
command set. It is intended to operate over either PCIe or over NVMe over Fabrics. And as part of our
architectural discussions, we've come up against places where we've said, oh, we need to make some modifications to this to make certain that it does
work over fabric. And we are, within those architectural discussions,
attempting to make certain that it works both in PCIe and over fabric.
Architectural overview. So first off, a single controller supports either block commands or
key value commands, not both. So you don't have a controller that supports both things.
However, an NVM subsystem may have a block command controller and a key value controller.
So within your subsystem, you could have one storage element that is a block storage element
and a different storage element that is a key value storage element,
where you may even have the capability of configuring that on the fly.
We're not specifying that particular operation.
The biggest thing we're specifying is that there is indeed a separation between those
two, and they're not the same.
Yes? When you say controller, are you referring to the media?
I'm referring to an NVMe controller,
which is an architectural entity within the NVMe system.
So it is an entity on your disk drive that controls a particular piece of that drive.
Okay.
So this is a new specification that will reference the NVMe base spec.
It will reference, if necessary, the NVMe over fabric specification.
And NVMe is currently working on an NVMe PCIe specification. It will reference that as that
work develops. But it will be a new specification. It won't be rolled into the NVMe base specification
and trying to add a different command set there.
Part of what that means is if you build a block storage device
and that's all you want to build,
this particular specification, you don't have to develop to it at all.
You don't have to read it.
It doesn't affect you in any way, shape, or form. If you want to develop a key value command set
device, you read the new specification, and it points you to the pieces of the base specification
that are important for you to be aware of as well, such as things that define
your queue structures, all of that type of stuff. All of that's in the base specification.
So key space. What is key space? It is comparable to namespace.
It allows separate spaces on a controller to use the same keys without overlapping.
It allows partitioning of the controller resources.
So part of what you're able to do with keyspace,
and this comes back to one of the questions that was asked earlier,
you have characteristics that are complete controller characteristics. Within NVMe,
you have the identify controller that gives you information about the entire controller.
You have other things in the current base that are namespace specific, and you have
identify namespace that gives you information about what's in the namespace. Keyspace will be similar to that.
Within Identify Controller, currently, you have... no, sorry, within Identify Namespace,
you currently have a set of logical block formats that you can format the device to. Similarly, within key space, there will
be a set of key value characteristics that you can format the key space to. So the store command. One of the unique things about
key value storage as compared to block storage is we have defined keys. And right now, our view for phase one of this development is that the keys can be up to 32 bytes long.
Well, that's too much to fit into the current NVMe command structure. So what we have done in our description of it currently is keys greater than 16 bytes are passed as part of the data pointer, and they may be SGL or PRP.
Now, PRP cannot be transmitted across NVMe over Fabrics.
So if you want to do this across a fabric, you will need to do this as an SGL formatted version.
The other thing about them is the keys that are greater than 16 bytes,
they are the first descriptor pointed to by the SGL or PRP,
and they are located in a single element of your scatter gather list or your physical region pointer list.
The value is then pointed to by the data pointer. If you are in this situation where the key is 16
bytes or less, then the entire data pointer is pointing to your value. If you were in the case where your key is greater than 16 bytes
and less than or equal to 32 bytes,
then your value is pointed to by everything in the data pointer
after the first element.
So this, for the store command, isn't too much of a stretch because both the key and the value are being retrieved from the host.
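As a sketch of the phase-one key placement rule just described, keys of 16 bytes or less ride in the command itself, while keys of 17 to 32 bytes occupy the single first SGL or PRP element, here is the decision expressed as a small C function. This is only an illustration of the rule from the talk, not a real NVMe data structure or driver code.

```c
/* Sketch of the phase-one key placement rule. Illustrative only. */
#include <stdio.h>
#include <stddef.h>

enum key_placement {
    KEY_IN_COMMAND,        /* key <= 16 bytes: carried in the command itself;
                              the data pointer describes only the value          */
    KEY_IN_FIRST_ELEMENT,  /* 16 < key <= 32 bytes: key occupies the single first
                              SGL/PRP element; the remaining elements describe
                              the value                                          */
    KEY_TOO_LONG           /* outside the phase-one limit                        */
};

static enum key_placement place_key(size_t key_len)
{
    if (key_len <= 16) return KEY_IN_COMMAND;
    if (key_len <= 32) return KEY_IN_FIRST_ELEMENT;
    return KEY_TOO_LONG;
}

int main(void)
{
    printf("12-byte key: %d\n", place_key(12)); /* KEY_IN_COMMAND       */
    printf("24-byte key: %d\n", place_key(24)); /* KEY_IN_FIRST_ELEMENT */
    printf("40-byte key: %d\n", place_key(40)); /* KEY_TOO_LONG         */
    return 0;
}
```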
When we get to the retrieve command, we've done the same thing.
The keys up to 16 bytes are carried in the command.
Keys greater than 16 bytes are passed as part of the data pointer.
In other words, they're pointed to by the first element of the data pointer, whether it's SGL or PRP.
The value is returned to locations pointed to by the data pointer.
Again, SGL or PRP. Now, the anomaly here is that now your scatter gather list or your PRP has one element that is pointing to something that you are pulling from the host,
and the rest of the elements are something that you're putting back out to the host. But in terms of how the device operates, it's really not any different because
you go fetch your SGL or PRP, it gives you information how to operate, and the only anomaly
here is you have to go retrieve more information from the host that's pointed to by the SGL or PRP.
But you're already capable of doing that
because you're capable of doing store commands anyway.
So the list command returns a list of keys within the key space.
It starts at the key specified by the command,
and again, the list command has to be able to use the data pointer if you have a larger key.
Now, the only thing the data pointer points to at this point is the key that you're starting with.
For the first release of this, the list command is not ordered. Within the command set, we don't
have a requirement for any ordering. What happens is whatever data structure the device
has is used to order your keys. What that means is if I start with key foo, it will give me
the next set of keys based on that starting point. I take the last key that I get in that command,
use that, and I can get the next set. And that list, as long as I haven't stored or deleted a key in the process
of doing the list, will remain in that same order. Now, if I delete a key and I now start from
somewhere, that should not change the order, but there's not any requirement that it not change the order. So in other words,
if I delete something and I now write something new in, and it happens to go into that same place
in my data structure, my lookup table, then, and I do the same list again, I may get something
different from my list if I did a store or delete during that.
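Here is a self-contained sketch of the listing pattern described above: request a batch of keys from some starting point, then feed the last key returned back in as the next starting point, falling back to the beginning if the starting key is unknown. The device-side order is simulated with a fixed array; whether the starting key itself is included in the result is a detail of the command definition, and this toy simply starts after it.

```c
/* Toy illustration of paging through keys with the list command. Illustrative only. */
#include <stdio.h>
#include <string.h>

static const char *device_order[] = { "alpha", "bravo", "charlie", "delta", "echo" };
#define NKEYS (sizeof device_order / sizeof device_order[0])

/* Toy "list" call: copy up to max keys starting after start_key (or from the
 * beginning when start_key is NULL or unknown), return how many were copied. */
static int toy_list(const char *start_key, const char **out, int max)
{
    size_t start = 0;
    if (start_key) {
        for (size_t i = 0; i < NKEYS; i++)
            if (strcmp(device_order[i], start_key) == 0) { start = i + 1; break; }
        /* unknown start key: fall back to the beginning, as described above */
    }
    int n = 0;
    for (size_t i = start; i < NKEYS && n < max; i++)
        out[n++] = device_order[i];
    return n;
}

int main(void)
{
    const char *batch[2];
    const char *cursor = NULL;
    int n;

    while ((n = toy_list(cursor, batch, 2)) > 0) {
        for (int i = 0; i < n; i++)
            printf("%s\n", batch[i]);
        cursor = batch[n - 1];   /* last key of this batch starts the next one */
    }
    return 0;
}
```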
Yes?
For a given key size. You don't need the value because all you're doing in the list command is returning keys.
You do need to know the size of your keys.
You're asking, do you return the size of the value?
Okay.
I know that we've
discussed that, but I don't remember the answer
to that question at the moment.
That is something we have
discussed both in the
API as well as in here.
You do need to be able, within the list command, to return
the value size as part of what you return, yes.
What happens if you delete the key that was the last one in the last row?
If you use that, you will start at the beginning of...
If you use any key that doesn't happen to exist,
you'll start at the beginning of whatever the structure is that the device has.
So basically, if I passed it a null key, I start at the beginning.
If I pass it an unknown key, I'll start at the beginning.
So if in the process of doing a sequence of list commands, I delete one in the middle and that was where I was going to start again, I'm going to bump myself back up to the beginning. And basically
that's where the list command will not be the same if you make any modifications to the keys that are stored.
So the exist command, very much like the API, returns success if the key exists in the key space.
the exist command as we have it defined today
for the first phase of this only takes one key.
It does not allow you to pass a list of keys
and find out if all of them exist.
So this is a one key at a time exist.
Where are we at?
So the work in NVMe has been provisionally approved
by the NVMe board.
They are still looking at the value of this
within the industry.
Samsung and a number of other people see some significant value,
but the NVMe board wants to make certain that NVMe doesn't get too broad, too diluted.
We are developing the architecture.
We are having weekly subgroup meetings.
They are one hour in length. We use up the full hour. There are lots of discussions going on. So if you're involved
in NVMe and are interested in this, I would strongly encourage you to come participate in
that. The document is being posted so you have access to it if you are an NVMe member.
But we are working hard towards completing the architecture. So we are having very deep
architectural discussions of what should this architecture look like. My push is for member review beginning around the end of 2018,
possibly early into 2019. So that's where we're at in NVMe. With that, I'd like to thank you,
and we do have about three minutes left for questions. Thank you. Thanks for listening. If you have questions about the
material presented in this podcast, be sure and join our developers mailing list by sending an
email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers
in the storage developer community. For additional information about the Storage
Developer Conference, visit www.storagedeveloper.org.