Storage Developer Conference - #119: Squeezing Compression into SPDK
Episode Date: February 19, 2020...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode 119. So I'm Paul, and this is Jim.
My claim to fame this year is I named this presentation,
so I get to take credit for that.
Isn't that cool?
Squeezing compression into SPDK.
Isn't that awesome?
Okay, so first, this is probably, I think I know the answer,
but show of hands, who's pretty familiar with SPDK,
at least knows what it is?
Okay, great.
I mean, we're here every year doing three or four talks.
I don't know, I think we have four talks this year
between the Intel folks.
So, yeah, I won't spend a lot of time
going over what SPDK is,
but this is really to talk about the compression piece
that Jim and I worked on this year.
There's still some to-do items.
We'll talk about that as we go.
But this is really more about that.
I'll give you enough context that if you're not
familiar with SPDK, you'll know kind of what we're talking about. Okay, so we'll go over like super
high-level architecture. And SPDK is so big, and there's so many different components, there's
really no way that we could get through all of them and focus on compression. So I've really
trimmed that slide down to just cover what we're talking about today.
Talk a little bit about DPDK.
You guys have all heard of DPDK, right?
The other DK, DPDK.
I'm going to bring up the Crypto BDEV module.
I actually was here last year with Fiona from the DPDK team,
and we talked about the Crypto module. That was our first foray into the DPDK framework. So I'll explain
a little bit about how that works, give you the high-level flow of that, only because
I think it's really interesting to see how that compares and contrasts with the compression,
because the compression work was quite a bit more complicated than the crypto work, but
they both tie into DPDK in the same way. Then I'll talk about the compression BDEV module, or virtual BDEV module, and sort of
explain how that works.
And then Jim will cover libreduce, or the reduce library.
And we'll, as we go through the slides, you'll get a feeling for what that is and how that
fits into the big picture.
Okay, so here's the super high level architecture.
Like I said, it's missing a tremendous number of components,
but at the top is typically where we put our front-end stuff.
So this is the example that I'm using here is the NVMe Fabrics Target,
which today supports both RDMA and TCP.
So this would be the higher end, and then we're using NVMe as an initiator.
And then this is our block device layer
with just a few examples shown in here,
and this is kind of the user space storage stack
is the best way to think about it, right?
We've got block devices,
which actually have something physical backing them,
and then virtual block device modules
that you can stack on top of block device modules
to add functionality and capability
like crypto and compression, and we've got a logical volume manager.
So there's all sorts of different things you can do by adding virtual block devices.
So what I'm showing here is we've got all sorts of different virtual block devices,
the DPDK encryption virtual block device, the compression block device,
and then ReduceLib, which is new with compression,
which you'll see here in a minute, uses persistent memory.
And we do so through the persistent memory development kit, the PMDK.
Then down at the bottom, beneath NVMe Express, we've got an environment abstraction layer in SPDK
so that we can actually plug into any kind of environment, like DPDK, that supports the services we need:
memory management, PCI management, all that kind of good stuff.
Okay, next slide.
All right, so DPDK has got a lot more than this, too.
But this is the slide we used last year to talk about crypto.
You can see here's the crypto stuff over here.
So they've got a CryptoDev module and a CompressDev module, and we're using both of them now.
Now, we chose the DPDK framework to implement these things because it's really cool for doing this.
So if you guys are familiar with ISA-L, everybody knows what ISA-L is, right?
That's actually out of our team as well, the team that Jim and I work on.
So ISA-L does these things too.
So we've been asked a few times, right, why don't you just use ISA-L?
Well, with the DPDK framework, you can just bolt it right into SPDK.
And we do use it for some things within SPDK.
But the cool thing about, like, take, for example, the crypto BDEV.
They're actually both very similar.
The DPDK guys have built a generic abstract API to hide whatever's behind it actually doing the crypto or the compression.
So we write to a single API,
regardless of whether the back end is a hardware-assisted device
or a software poll-mode driver,
one of which happens to be ISA-L, or anything.
So we really write to a generic API, and then anybody can go in and
plug in anything they want in the end with very, very little integration work. And it's the same thing with
CompressDev, right? So they followed the same model. There's one API
for CompressDev, and then you can use hardware assist on the back end,
or you can use software.
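To make that concrete, here's a rough sketch of what writing to that one CompressDev API looks like. The enqueue/dequeue burst calls are real DPDK functions, but the helper and its error handling are simplified assumptions, not the actual SPDK module code:

```c
#include <rte_comp.h>
#include <rte_compressdev.h>

/* Hypothetical helper: submit one compression op and poll for completion.
 * The same code runs whether dev_id refers to a hardware PMD (e.g. QAT)
 * or a software PMD (e.g. ISA-L); the generic API hides the difference. */
static int
compress_one(uint8_t dev_id, uint16_t qp_id, struct rte_comp_op *op)
{
	struct rte_comp_op *done = NULL;

	/* Fire the operation off asynchronously... */
	if (rte_compressdev_enqueue_burst(dev_id, qp_id, &op, 1) != 1) {
		return -1; /* queue pair full; caller should retry */
	}

	/* ...then poll until the PMD reports it complete. A real module
	 * would do this from a poller, not a busy loop. */
	while (rte_compressdev_dequeue_burst(dev_id, qp_id, &done, 1) == 0) {
		;
	}
	return done->status == RTE_COMP_OP_STATUS_SUCCESS ? 0 : -1;
}
```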
Okay?
Okay, so this is an animation, so I'll just sort of flip it at you when I'm ready to go.
Okay, so here's our, not yet.
Here's our SPDK application up here,
and then I'm showing the block device layer,
and this is the crypto example.
Like I said, I wanted to go over this because it's a pretty simplified view of the world.
And then the BDEV is actually the block device
that's representing the device.
So this would be an NVMe BDEV
in the example I'm talking about.
And then this is our virtual BDEV
that we just stack on top of it.
So you could actually put crypto on top of anything.
You could put it on top of a logical volume.
You could put it on top of an NVMe BDEV.
You could put it on top of a RAM disk.
You could put it on top of anything you want.
Then we've got our SSD down here, of course. And then off to the right, this is where we've
got DPDK. This is the crypto dev API. So this is the generic API I was talking about. And
then the two examples are the AESNI poll-mode driver and the QAT poll-mode driver. So that first one is completely software-based cryptography,
and the second one is the Intel QuickAssist card.
And from our perspective, writing this module, it's basically the same API.
Both asynchronous: fire them off, poll them, and get the completions.
Okay, so let's start with the first click here.
So an IO comes down from the application,
and its first stop is the Crypto Virtual BDEV module
because that's how this thing was configured.
It's going to take that I/O, whether it's a read or a write,
and it's going to fire it off asynchronously
over to the CryptoDev framework.
And we've got a poller that's going to be running over here,
cruising along looking for completions on those. And as soon as the operation is done, the encryption
or decryption is done, then it comes back to the crypto BDEV driver with, you know,
the buffers identified and what's in them. So then what we do is we take either the encrypted
data or the decrypted data, and we fire it down
to the BDEV, which in turn fires it down through the driver and gets it out
to the SSD. So pretty simple example, right? And that's the beauty of the
virtual block device model that we've got. It allows us to do these kinds of things pretty easily.
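As a sketch of that completion path, here's roughly what a crypto completion poller can look like. The spdk_poller_register() and rte_cryptodev_dequeue_burst() calls are real APIs, but the channel structure and completion helper are illustrative stand-ins, not the actual vbdev_crypto code:

```c
#include <rte_crypto.h>
#include <rte_cryptodev.h>
#include "spdk/thread.h"

#define MAX_DEQUEUE_BURST 32

/* Hypothetical per-channel context; field names are illustrative. */
struct crypto_io_channel {
	uint8_t  dev_id; /* CryptoDev device this channel submits to */
	uint16_t qp_id;  /* queue pair owned by this channel */
};

static void
complete_crypto_op(struct rte_crypto_op *op)
{
	/* Stand-in for the module code that takes the encrypted or
	 * decrypted buffers and fires the I/O down to the base bdev. */
	rte_crypto_op_free(op);
}

/* Poller: runs on the SPDK reactor, reaps completed crypto ops. */
static int
crypto_dev_poller(void *arg)
{
	struct crypto_io_channel *ch = arg;
	struct rte_crypto_op *ops[MAX_DEQUEUE_BURST];
	uint16_t i, num;

	num = rte_cryptodev_dequeue_burst(ch->dev_id, ch->qp_id,
					  ops, MAX_DEQUEUE_BURST);
	for (i = 0; i < num; i++) {
		complete_crypto_op(ops[i]);
	}
	return num; /* nonzero tells the framework the poller did work */
}

/* Registered at channel-creation time, e.g.:
 *   spdk_poller_register(crypto_dev_poller, ch, 0);
 */
```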
Okay, so now let's talk about the big picture here for compression.
You can see there's a couple of new things on the slide.
The first, most prominent one is persistent memory.
So we use persistent memory, and we get into a lot more detail on this
when Jim goes through libreduce.
But efficiently allocating and keeping blocks packed together
as we compress the data
requires a lot of metadata operations.
So we have a need for very fast metadata operations,
which makes this a really good fit for persistent memory.
The other stuff in here that's new that you see on top of the compression driver is libreduce and PMDK.
So libreduce is its own module.
It's only used by the compression virtual block device.
But this model is a little bit different than what we saw before.
And as I walk you through this, it'll become clearer.
The compression vbdev in this implementation is much more of a traffic cop
than it is a virtual block device in the sense of what we just saw over there.
When it comes up and initializes, we trade function pointers back and forth with libreduce,
so that libreduce knows how to talk to raw disks
and knows how to do compression and decompression operations,
but it does all of them
through the compression virtual block device module.
Okay, so let's click on the first one here.
First thing that happens is same as with crypto.
It still comes down to the compression driver.
So the compression driver gets a read or write operation to a logical device of some kind.
And instead of firing over to the compression framework like we did in crypto,
now what we're going to do, go ahead and click Jim.
Now what we're going to do is we're going to take that read or write
and we're going to tell libreduce: hey, there's a read or write,
but I don't know anything about the physical layout of the disk.
I don't know whether the data is compressed; I don't know anything about it.
So it fires that off to libreduce,
and libreduce is going to do some metadata operations through PMDK,
so it's talking to its persistent memory.
And then it gets a little bit complicated,
so the slide isn't completely accurate for what the code does.
But then libreduce, go ahead, Jim,
libreduce is going to call back into the compression vbdev module
with some function pointers that we traded at initialization time
and say, okay, here's the data that you need to compress or decompress.
And then the compression vbdev module does the compression operation
and actually gets back to libreduce and says,
okay, now this is done.
And then we take the completed operation
and fire it down to the BDEV,
and it makes it out to the SSD.
So like I said, a little bit more complicated than crypto,
but pretty cool that, you know,
we've got good use of persistent memory
and use of PMDK.
I'm sure you guys have seen all the PMDK stuff going on this week
and over the last couple of years.
Really easy bolt-in, really convenient to use for us.
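A minimal sketch of that function-pointer exchange might look like the following. This uses hypothetical names modeled on the description above rather than the exact SPDK structures:

```c
#include <sys/uio.h>
#include <stdint.h>

/* Hypothetical "backing device" handed to libreduce at init time.
 * libreduce never touches disks or the compression framework directly;
 * it calls back through these pointers, which the compression vbdev
 * module fills in before initializing or loading a volume. */
struct reduce_backing_dev {
	uint64_t blockcnt; /* size of the backing block device */
	uint32_t blocklen;

	/* Raw disk I/O, routed down to the base bdev. */
	void (*readv)(struct reduce_backing_dev *dev, struct iovec *iov,
		      int iovcnt, uint64_t lba, uint32_t lba_count,
		      void *cb_arg);
	void (*writev)(struct reduce_backing_dev *dev, struct iovec *iov,
		       int iovcnt, uint64_t lba, uint32_t lba_count,
		       void *cb_arg);
	void (*unmap)(struct reduce_backing_dev *dev, uint64_t lba,
		      uint32_t lba_count, void *cb_arg);

	/* Compression, routed out to the CompressDev framework. */
	void (*compress)(struct reduce_backing_dev *dev,
			 struct iovec *src_iov, int src_iovcnt,
			 struct iovec *dst_iov, int dst_iovcnt,
			 void *cb_arg);
	void (*decompress)(struct reduce_backing_dev *dev,
			   struct iovec *src_iov, int src_iovcnt,
			   struct iovec *dst_iov, int dst_iovcnt,
			   void *cb_arg);
};
```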
Okay, so with that, I'm going to turn it over to Jim,
who's going to go through the details of libreduce
and how we actually take advantage of the compression algorithms that DPDK gives us.
Okay. Can you guys hear me? Okay. So Paul gave a really good overview of how this libreduce
library fits into the overall SPDK picture. So I'm going to focus now mostly just on that
compression library itself. We purposely wanted to build this library so that it wasn't tied
directly to the BDEV interfaces that Paul talked about,
so that it could be used standalone. It really helped us kind of make sure that the APIs
were correct. It made it really easy for us to build test infrastructure around it. And
it means that you could pick up this library and you could use it in a different context.
It's not tied directly to our SPDK block device layer. So go ahead to the next.
There we go.
So there's three main pieces to this library.
The first one, to construct this, is you need some
sort of block device for your backing I/O units.
I'm going to use this term backing I/O units here. It's basically where we're going to
store the compressed blocks on disk. I'm going to talk about some cases where we actually don't
store compressed blocks, we store the uncompressed blocks. And so these backing
I/O units and the backing I/O device really refer to that. And with this
implementation we're going to typically be using a thin-provisioned SPDK logical volume,
so that as we start writing blocks to that logical volume, that'll actually start consuming
the capacity, and that's how we're going to realize our savings.
So the key one here is that we're using a persistent memory file for the mapping metadata.
So if
you've done block compression, you know that you have this problem where you've got maybe
16K of data that's coming in and only 8K of data that's
coming out. And you've got to keep track of how
those uncompressed blocks map to the compressed blocks on disk.
And so we really thought persistent memory would be a
really unique way to take advantage of that byte
addressability and persistence, and so we're using PMDK directly to read
and write that persistent memory file.
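For flavor, this is roughly what direct libpmem usage looks like; a generic PMDK sketch with a made-up file path, not code lifted from libreduce:

```c
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	size_t mapped_len;
	int is_pmem;

	/* Create and map a 1 MiB persistent memory file. After this,
	 * reads and writes are just loads and stores through 'addr'. */
	void *addr = pmem_map_file("/mnt/pmem/metadata", 1 << 20,
				   PMEM_FILE_CREATE, 0600,
				   &mapped_len, &is_pmem);
	if (addr == NULL) {
		perror("pmem_map_file");
		return 1;
	}

	/* Update 8 bytes of metadata with an ordinary store... */
	memcpy(addr, "chunk007", 8);

	/* ...then explicitly flush it out of the CPU caches so it's
	 * durable. On a non-pmem file this falls back to msync(). */
	if (is_pmem) {
		pmem_persist(addr, 8);
	} else {
		pmem_msync(addr, 8);
	}

	pmem_unmap(addr, mapped_len);
	return 0;
}
```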
Let me go ahead and hit the next one.
We do have some amount of metadata on the
block device. So, for example, in SPDK,
when block devices start showing up in the system,
let's say a logical volume is exposed,
we need to know about this logical volume:
is it compressed?
And so we store a small amount of metadata at the beginning of the block device
with some parameters describing how we're using libreduce.
So, for example, things like the path to the persistent memory file, how big the chunks are, how big the I/O units are, some of those parameters.
And then the key thing is, this is a metadata algorithm only. We're not implementing our
own compression algorithm. We're going to use standard compression algorithms like, you
know, DEFLATE, LZ4, et cetera. This is just describing the algorithm we use for the mapping
metadata and how we store it in persistent
memory.
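As an illustration, the on-disk parameters just described could be modeled something like this; a hypothetical layout, not the actual libreduce on-disk format:

```c
#include <stdint.h>

/* Hypothetical superblock stored at the start of the backing block
 * device. It lets a later load (or the bdev "examine" pass) recognize
 * a compressed volume and find its persistent memory file. */
struct reduce_superblock {
	char     magic[16];    /* identifies a reduce volume */
	char     pm_path[256]; /* path to the persistent memory file */
	uint64_t vol_size;     /* logical (uncompressed) volume size */
	uint32_t chunk_size;   /* e.g. 16 KiB */
	uint32_t io_unit_size; /* e.g. 4 KiB backing I/O units */
};
```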
OK. So this is going to talk a little bit
more about how this library fits into things like
the SPDK block device layer. Go ahead and
hit the first.
So the first part here is, at the beginning, when you
want to create what we call a reduce volume, you're going to call this reduce vol init.
And it's basically going to call into here: you're going to pass it the block device,
and you're going to pass it a path for the persistent memory file.
It's going to go ahead and create the persistent memory file, store the metadata on disk, and return to you a handle that you can then use
to start doing read and write operations.
And then later, you can do a load operation. So we use this load operation whenever we
get a new block device. This will not only load an existing compressed
volume; you can also use it to ask, is there a compressed volume on this block device?
So we use this in the SPDK path. We have a process called examine:
whenever a new block device comes up, each of the virtual block device modules gets a
chance to see, do I want to claim this block device?
And this is what we use for that.
This is how we can tell whether any existing block device is a compressed volume or not.
Then readv and writev are how the upper-layer application actually performs
read or write operations to the compressed volume. And then
on the bottom end, this is where we end up with a number of operations. So one is, if
you know anything about PMDK (if you don't, definitely attend some of the persistent memory
sessions; I know Andy Rudoff is going to be doing a hackathon with PMDK, and you can understand
more how this works), reading and writing to persistent memory is just loads and stores.
There's no special API there.
But at very critical junctions, when we're updating the metadata,
we do need to make sure that it's persisted.
And so we'll make calls out to PMDK to persist the specific regions.
Go ahead and hit the next.
We'll do readv, writev, and unmap out to the backing I/O device. So whenever we have to go and read
compressed blocks or write compressed blocks or unmap blocks, we'll make those calls. And then go
ahead and hit the next one. And then of course there'll be cases where we'll have to go and say,
we want somebody to go compress or decompress some data for us. So inside of here, we're depending on something
outside of us, in this case Paul's compression BDEV module, to actually do that compress or
decompress operation. This is not tied specifically to the DPDK framework;
the DPDK framework stuff is handled in the module that Paul talked about. So to kind
of summarize here, this is completely independent from the SPDK framework
and the BDEV layer, so it can be used
as a standalone module.
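Pulled together, the entry points Jim just walked through have roughly this shape. These signatures are a paraphrase of the description above; the real SPDK libreduce names and prototypes differ in detail:

```c
#include <sys/uio.h>
#include <stdint.h>

struct reduce_vol;         /* opaque compressed-volume handle */
struct reduce_backing_dev; /* backing device with callbacks, sketched earlier */

typedef void (*reduce_cb)(void *cb_arg, struct reduce_vol *vol, int status);

/* Create a brand-new compressed volume: writes the superblock to the
 * backing device and creates the persistent memory file. */
void reduce_vol_init(struct reduce_backing_dev *dev, const char *pm_file_dir,
		     uint32_t chunk_size, reduce_cb cb, void *cb_arg);

/* Open an existing volume; also used by "examine" to probe whether a
 * newly appeared block device holds a compressed volume at all. */
void reduce_vol_load(struct reduce_backing_dev *dev,
		     reduce_cb cb, void *cb_arg);

/* Upper-layer reads and writes against logical (uncompressed) offsets. */
void reduce_vol_readv(struct reduce_vol *vol, struct iovec *iov, int iovcnt,
		      uint64_t offset, uint64_t length,
		      reduce_cb cb, void *cb_arg);
void reduce_vol_writev(struct reduce_vol *vol, struct iovec *iov, int iovcnt,
		       uint64_t offset, uint64_t length,
		       reduce_cb cb, void *cb_arg);
```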
I'll talk a little bit about what chunks mean.
So whenever you're doing compression,
you have to decide what chunk of data
you're going to compress on.
We decided, for simplicity, to force the caller to split I/Os on chunk boundaries. If we say that a chunk is
16K and you have an I/O that spans a 16K boundary, it's up to the caller to actually
make sure that those I/Os are split. We already have quite a bit of functionality in SPDK
and other areas of the code to do this splitting. And so we decided to leverage that and not
duplicate all of it here in the libreduce library.
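A tiny sketch of the boundary check the caller is responsible for, assuming a 16K chunk size:

```c
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_SIZE (16 * 1024)

/* libreduce requires that no I/O crosses a chunk boundary; the layers
 * above (which already know how to split bdev I/O) enforce this. For
 * example, an 8K write at offset 12K spans chunks 0 and 1, so it must
 * be split into a 4K write at 12K and a 4K write at 16K. */
static bool
io_needs_split(uint64_t offset, uint64_t length)
{
	uint64_t first_chunk = offset / CHUNK_SIZE;
	uint64_t last_chunk = (offset + length - 1) / CHUNK_SIZE;

	return first_chunk != last_chunk;
}
```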
And then it's single threaded.
So for each compression volume,
there are operations where you have to make sure
that you synchronize:
if you've got multiple writes going to one chunk,
you have to synchronize them.
Within SPDK we can pass messages between threads very, very quickly. And so
from a complexity standpoint, we decided to make this
single threaded, and then we have code in the
upper layers that makes sure that if you
have multiple threads that are sharing the same volume,
they will send the I/Os to one thread to make sure
that they are all synchronized.
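In SPDK terms, funneling a volume's I/O to one thread looks roughly like this; spdk_thread_send_msg() and spdk_get_thread() are real SPDK calls, while the volume structure and helpers are made up for the sketch:

```c
#include "spdk/thread.h"

/* Hypothetical per-volume context: each compressed volume remembers the
 * one thread that is allowed to touch its metadata. */
struct comp_volume {
	struct spdk_thread *md_thread;
};

static void
do_write_on_md_thread(void *io_ctx)
{
	/* Runs only on vol->md_thread, so chunk-map updates for this
	 * volume need no locking. */
}

static void
submit_write(struct comp_volume *vol, void *io_ctx)
{
	if (spdk_get_thread() == vol->md_thread) {
		do_write_on_md_thread(io_ctx);
	} else {
		/* Cheap cross-thread message instead of locks. */
		spdk_thread_send_msg(vol->md_thread,
				     do_write_on_md_thread, io_ctx);
	}
}
```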
Okay, so layouts. I'm going to describe a little bit about how the data is actually laid out on the SSD
and in persistent memory, and then I'm going to go through a few examples
on what read and write operations actually look like.
So first we've got the SSD.
Go ahead and hit the next one there, Paul. So we
split this up into I/O units. So for
this example, we're going to assume each of these I/O
units is 4K, which means we're always going to be
reading or writing to the backing I/O device in
4K units.
And next we have the persistent memory file.
The persistent memory file is split up
into two main pieces. The first one is a chunk map. So a chunk
map contains what we call chunk entries. Go ahead
and hit the next.
So what this does is, you can think of
each of these chunk entries as representing 16K
of data. So in this case what we're saying
is that we have 16K of data. It was
able to be compressed down to
10,121 bytes, which means that it would take three
4K units to store that data, and then it
points to those three units. And then finally we
have the logical map, and this basically maps
the user-visible logical view. When they
say, I want to write to LBA zero, well,
that means we're talking about this here.
If they're talking about an LBA that's at a 16K offset,
that's going to be this entry.
And so this is where we basically store an entry
which points to the chunk map entry,
and that's kind of how all these pieces fit together.
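Putting that layout into hypothetical C, the two persistent-memory structures might look like this (field names invented to match the description):

```c
#include <stdint.h>

#define CHUNK_SIZE         (16 * 1024)
#define IO_UNIT_SIZE       (4 * 1024)
#define IO_UNITS_PER_CHUNK (CHUNK_SIZE / IO_UNIT_SIZE) /* 4 */
#define EMPTY_ENTRY        UINT64_MAX

/* One chunk entry per allocated 16K chunk, stored in pmem. In the
 * example above, compressed_len is 10,121, so three of the four
 * io_unit slots point at backing 4K units and one is EMPTY_ENTRY. */
struct chunk_entry {
	uint32_t compressed_len;              /* 16384 means "stored raw" */
	uint64_t io_unit[IO_UNITS_PER_CHUNK]; /* backing 4K unit indices */
};

/* The logical map is an array in pmem with one 8-byte slot per 16K of
 * logical LBA space: slot i holds the chunk_entry index for logical
 * offset i * CHUNK_SIZE (or EMPTY_ENTRY if never written). An 8-byte
 * store to one slot is atomic, which is what makes commits safe. */
typedef uint64_t logical_map_entry;
```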
Okay, so let's walk through a couple of examples.
OK, so the first one: we've just initialized the logical volume. So we have no data on
this whatsoever yet. The user says they want to write 4K of data at LBA zero. So the first
thing we do is we look up in the logical map, and this is empty. So we know there's no existing data;
we don't have to worry about any kind of read-modify-write operation.
We keep a bitmap in memory of the chunk maps,
which ones are used and which ones are free.
So first we're going to allocate an empty chunk map, entry zero.
Then we're going to compress the data. So I'm going to call out to Paul's module,
and I'm going to say, I want you to compress this
4K of data with 12K of zeros, because we're
only writing 4K out of the 16K chunk. And
let's assume this compresses down to 2,500 bytes.
Since it compresses down to 2,500 bytes, it
means we just need one I/O unit, one 4K I/O
unit, so we're going to allocate one of those.
So then here we write that compressed block to
disk. We go ahead and we write the chunk map
entry. And then this is the part that makes
it what I would say is official; that's when
it's basically committed. Once we write the logical map entry
to persistent memory,
then the next time somebody goes and reads the data,
they're going to find that data and they're going to be able to read it.
So you can see kind of how these point to each other here.
Let's go ahead to the next.
Okay, so now we're going to write a full chunk.
We're going to write a full 16K to offset 64K.
Go ahead and hit the next.
So here we can see this is empty, right?
So, this says 16K; this should have said 64K.
So we're going to look up here.
This is empty; there's nothing there yet.
We're going to allocate another chunk entry in the map.
Okay, so now let's say we compress this 16K and it only compresses
down to 14,000 bytes. At 14,000 bytes, it's going to take four I/O units to store
that data. So why compress it, right? We might as well just store the uncompressed
data on the disk, so that later we can just read it back and save time. So
we'll actually allocate the four backing I/O units, but we're going to write the uncompressed
data to the disk instead. Go ahead and hit the next. And then again, we're going to store
this here. Since we see 16384, that's going to be our clue later that it's
uncompressed data, so we'll know that when we
read the data back, we won't have to decompress
it. And then again, we write and persist the logical
map entry.
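The give-up-on-compression decision Jim just described can be sketched like this, assuming the 16K chunk and 4K I/O unit sizes from the examples:

```c
#include <stdint.h>

#define CHUNK_SIZE   (16 * 1024)
#define IO_UNIT_SIZE (4 * 1024)

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Returns how many backing I/O units to allocate and, via store_len,
 * whether to store the chunk compressed or raw. */
static uint32_t
units_to_write(uint32_t compressed_len, uint32_t *store_len)
{
	uint32_t units = DIV_ROUND_UP(compressed_len, IO_UNIT_SIZE);

	if (units * IO_UNIT_SIZE >= CHUNK_SIZE) {
		/* e.g. 14,000 bytes still needs four units, the same as
		 * raw 16K, so store uncompressed; recording a length of
		 * 16384 is the clue that reads can skip decompression. */
		*store_len = CHUNK_SIZE;
		return CHUNK_SIZE / IO_UNIT_SIZE;
	}
	*store_len = compressed_len; /* e.g. 2,500 bytes -> one unit */
	return units;
}
```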
OK. So now we're going to write to offset 4K
in this chunk that we already wrote to earlier.
And this is going to be a really good example of showing why persistent memory works really well here,
why byte addressability and 8-byte atomic operations make this a really nice solution.
So we look up the entry, and we find something.
So now we know we can't just take our 4K of data and write it.
We've got to read the old data, we've got to merge it, and we've got to write new data
back.
So we're going to read the I/O unit.
We're going to decompress it into our data buffer.
And then we need to merge the incoming 4K.
So we've got the old data that was out on disk, that
original 4K that the user wrote with a bunch of
zeros, and now we're going to merge this 4K of
incoming data into that.
Go ahead to the next. So here, we're going to say
this compresses down to 5,000 bytes. So now
this means we need two backing I/O units. One of the things
I want to call out here is that we never update anything in place. Right, so you can see here,
we don't overwrite this chunk map entry. We're going to allocate a new one. And
same thing here for the backing I/O units: we're not going to try to reuse that entry zero.
We're going to allocate two new ones. And when the volume gets initialized, we
account for this. We account for extra space in
the persistent memory file, and we account for some extra
space on the SSD, so that we can make
sure we never have
to do any of these updates in place.
So now we've written the compressed data to the
SSD. We write and persist the chunk map entry.
And then this is the really key part. Actually, just go back one.
So you can see here, as of this point, this still is not committed, right?
If the system were to crash at this point and we came back up,
this is still going to point to the old data.
So at this point we persist this first,
and then we write this.
And only once this is written is the data actually committed.
And so this is how persistent memory allows us to atomically update these chunks so that it's going to be power-fail safe.
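The ordering that makes this power-fail safe can be sketched with libpmem. The pmem_persist() calls are the real PMDK API; everything else is a simplified stand-in for the sequence described above:

```c
#include <libpmem.h>
#include <stddef.h>
#include <stdint.h>

/* Never update in place: persist the *new* chunk entry first, and only
 * then flip the 8-byte logical map slot. If power fails before the
 * final persist, the old slot still points at the old, intact data. */
static void
commit_chunk(void *new_entry, size_t entry_len, /* new chunk entry in pmem */
	     uint64_t *logical_slot,            /* logical map slot in pmem */
	     uint64_t new_chunk_idx)
{
	/* Step 1: make the new chunk entry durable. */
	pmem_persist(new_entry, entry_len);

	/* Step 2: a single atomic 8-byte store switches old -> new... */
	*logical_slot = new_chunk_idx;

	/* Step 3: ...and persisting that slot is the commit point. */
	pmem_persist(logical_slot, sizeof(*logical_slot));
}
```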
So this is probably a good point to call out what happens when we reload. So here, in memory, we can release the old chunk entry and the I/O units.
This is only releasing them in our internal in-memory bitmaps.
We don't have to write anything to disk to actually free them.
When we load this thing back up later, we're going to walk through this logical map, and it's
going to tell us which
of the chunk entries are valid, and those are going to tell us which of the I/O units are
used, and that's going to implicitly free everything that may have still been there
from when we ran previously.
OK. Trim is pretty easy, especially if you're trimming a full chunk. Go ahead
and hit the next. Basically all we have to do is clear this. So,
go ahead and hit the next one. We clear that out, and that implicitly releases those chunks,
because now when we come back up, when we read that, there's not going
to be anything there. So there's nothing referencing this chunk map entry, and there's
nothing referencing those four units on disk.
OK. And then the final example here is reading some
of the data back. Sometimes after you write the data,
you want to read it back. So we're going to look
up the chunk map entry. Go ahead and hit the next one here.
So here we're going to read those two backing I/O units, 5 and 6.
What we can do here is, since the user's only reading 4K,
we can actually target the user buffer for the 4K that he's interested in
and just use a bit bucket for the other 12K.
Because we want to try to avoid
having to read the data into one buffer
and then copying it over into the user buffer.
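One way to express that bit-bucket trick with iovecs; the buffer names are illustrative, and this assumes the 16K chunk from the example:

```c
#include <sys/uio.h>
#include <stdint.h>

#define CHUNK_SIZE (16 * 1024)

/* Build the iovec handed to the decompress callback so the 4K the user
 * asked for lands directly in their buffer and the rest of the chunk
 * lands in a scratch "bit bucket", avoiding a decompress-then-memcpy. */
static int
build_decompress_iov(struct iovec iov[3], void *user_buf,
		     uint64_t offset_in_chunk, uint64_t user_len,
		     void *bit_bucket)
{
	int n = 0;

	if (offset_in_chunk > 0) { /* leading bytes: discard */
		iov[n].iov_base = bit_bucket;
		iov[n].iov_len = offset_in_chunk;
		n++;
	}
	iov[n].iov_base = user_buf; /* the part the user wants */
	iov[n].iov_len = user_len;
	n++;
	if (offset_in_chunk + user_len < CHUNK_SIZE) { /* trailing bytes */
		iov[n].iov_base = (uint8_t *)bit_bucket + offset_in_chunk;
		iov[n].iov_len = CHUNK_SIZE - offset_in_chunk - user_len;
		n++;
	}
	return n; /* iovec count for the decompress callback */
}
```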
Okay, so that was a whole bunch of examples.
Hopefully that all made sense.
We do have some next steps here.
One of them is on the unmap path:
we have some optimizations we can make
for doing what we're calling sub-chunk allocation masks. So right now, we're not keeping track in the
metadata of which LBAs in the chunk have actually been written to. In some of those cases where we're
writing zeros, we could actually make some optimizations where we don't have to do that.
We could save on some bandwidth. We were really hoping we would have some really
cool performance charts for this. We don't have the
charts yet. Once we do, they will be out on
the SPDK website. Some of the initial data looks
good, but there are still some things that we need
to run through.
We still have some additional on-disk metadata parameters
that we want to store on disk that aren't being stored yet.
Go ahead and hit the next one.
And then we have some methods for reducing the metadata file size.
So right now, when the chunk entries refer to the I/O units, they're using 64-bit values.
And we've got some methods where we could use 32-bit values instead.
We would still support volumes that have more than 2 to the 32 I/O units,
but with a little bit of extra finagling, we can effectively cut the size of the metadata file in half.
And when you start looking at the cost effectiveness of this,
that's one of the things you want to do, right? Because persistent memory is more expensive than
the NAND, of course, and so, you know, we want to make sure we're making use of that persistent
memory as well as we can. And I think that is it. If you want to get more information, there's tons of info out on our spdk.io website.
A lot of information on the documentation link
that goes through a lot more details on what Paul talked about.
And there's some more details on LibReduce here as well.
One thing to call out,
we're actually having a developer meetup here in the Bay Area.
What is it, November 11th and 12th?
So if you're interested, if you're somebody who is, you know, hands-on
and wants to engage in a really, like, deep dive technical session over about a day, day and a half,
you know, check this out.
There's an opportunity to RSVP for that.
Yeah, that one's actually hosted by Nutanix also.
Yeah.
So with that, questions?
Jay?
So with regards to the unallocated blocks,
how do you handle sanitization?
You want to sanitize the drive for security purposes. How do you handle
the sanitization of the actual data on the drive at that point, since you've got so many
redirections?
Yeah.
So, I mean, typically, let's say we take a logical volume
and then we're going to make that compressed.
That is typically going to go to one client.
And so I think
your question is, what happens when that storage
is going to be repurposed for
some other client? So
in libreduce we're not
handling that sanitization part. That would
be handled at the layer where
you allocate the logical volume. So you have the opportunity to sanitize that
logical volume before you repurpose it to create storage for some other client.
So in the model that you had before with the three stages, you've got the chunk
data, you've got the logical data, and you've got the actual writes to the NVMe drive. So libreduce
isn't handling all of that.
You're still relying heavily on
some of the physical media
for some of this, is that right?
Yeah, so
libreduce is handling these parts
here, but everything
here is associated with one compressed
volume.
So...
The virtual volume?
Yes, the virtual volume.
Yeah. So, you know, how about I take
a step back and walk through the process of how you get one of these set up. So the first
thing you would do is, let's say you want to create a 10-gigabyte compressed volume. The first thing you're going to do is you're
going to create a 10-gigabyte logical volume on the SSD. Then you're going to take that
10-gigabyte logical volume and you're going to pass it to libreduce, which is going to
create a compressed volume that's going to be 10 gigabytes minus a little bit for some
of those extra blocks we talked about, so we don't have to do things in place. But then that 10 gigabytes
is going to be
assigned to one
client. Maybe an NVMe over Fabrics
client or a VM,
et cetera.
And then
libreduce is going to handle this. It's also going to handle
the persistent memory.
At some point,
once that logical volume is no longer needed,
you're going to destroy the compressed volume, which is basically going to zero the portion
of the logical volume. And then it's also going to delete the logical volume, because
it's no longer needed. So...
I was going to say, just in case it's not clear, logical volume and compressed
volume are two totally different things, right?
The logical volume is a different abstraction layer,
and maybe it helps if you think about it this way:
the very top layer that's exposed to the application is the compressed volume.
The compressed volume and all this stuff sits on top of a logical volume,
which abstracts away what's underneath it.
So we've got multiple layers of abstraction here.
So that's why you're saying when you delete the logical volume,
independent of the compression volume,
you've got the option of saying destroy this when I start it
or destroy this when I complete it and wipe out all the data.
The genesis of the question is, if I've got a host
that's going to be sending a sanitization command to the drive,
and I'm going through this method where the logical volume is being managed by SPDK,
where does...
Oh, you're talking about sanitize operations that are coming from the
top end to the compressed volume.
Jay, don't call them sanitize. Sanitize applies to an entire device,
a physical device. You're talking about secure data deletion,
which might be a namespace or something. You're confusing everybody
by using that word.
Bad Jay.
Yeah, so right now, I mean, we have
unmap operation support on the top end. Looking at sanitize is something we probably need to look at.
That's a good...
Yeah, we'll either do unmaps or write-zeroes in the operation we're talking about.
Yeah.
Yeah, please.
Yeah, so your question is, could you put this stuff on disk instead of persistent memory?
So, PMDK actually supports that.
With PMDK, you can actually have a file that's on disk,
and what it'll do is
mmap it, and then when you do the pmem persist, it's
actually going to do an msync. So you can actually do
this on disk today. It's just going to be slower, because
when you want to update eight bytes, you're updating a
whole block.
Quite a bit slower.
What's that?
Quite a bit slower.
It is going to be quite a bit slower.
So there certainly are ways that you could put this on disk that
might be faster.
This was one that we purposely implemented to take advantage of persistent memory.
If you wanted to have one where all of this metadata was stored on disk,
this may or may not be the best algorithm to do that, but
this one is well tuned if you do have persistent
memory on your system.
Go ahead.
Can you do compression and encryption in the same stream,
one on top of the other, all the way
through the whole process?
Yes. Yep.
And the SPDK keeps those in the correct order?
Meaning, compress before encrypt?
Well, so, you could decide to do encrypt before compress.
I mean, if you decide to set it up that way,
we don't, you know,
that wouldn't be a smart way to do it,
so we don't enforce it.
We don't say, like,
well, that's a stupid way to layer them.
But it does support it... The user configures that?
Yeah, yeah, the crypto and compression vbdev modules are completely independent.
I see, okay. But as far as streaming one... so the compressed data, does it just stay in RAM and then...
Yeah, so what would happen is if you...
Why don't you go back to my intro slide?
Sorry, there's lots of animation.
Yeah, so here... Yeah, actually, so go forward here.
So here, if you had an encrypted block device down here,
then this is basically going down to that device instead.
And then that, in turn, could go to the logical volume manager,
and that, in turn, could go down to the actual SSD.
So these can really stack on top of each other.
I have one other question.
What if you want to unmount the compressed volume?
Will you then take the metadata file, push it out to disk,
and be done, so you can reuse your persistent memory
or whatever is coming out next?
Or do you assume that they're just mounted all the time,
and therefore you have to keep the metadata in there all the time?
So, let me make sure I understand what you're asking.
Yeah, you can certainly do that.
I mean, you have to keep the persistent memory file around.
We persist the path to that file.
So you could, I mean, yeah, you could unload all of those.
You could unmount the persistent memory file system. But if you wanted to come back and you wanted to load that volume again,
you've got to make sure that the persistent memory file system is
loaded, so that it can find that file.
Okay. And that file is...
That's right. Yep. In the back.
I've written this in years past, so I have a few questions.
Okay.
It looks like you're throwing away 12.5% of the compression ratio because of your allocation.
There is some, yep.
Yeah, there are some cases where we're doing some rounding,
and there are some bytes that are lost there, yep.
Can you clarify?
Yeah. Yeah.
So that's certainly true.
I mean, that's kind of how the SSD handles it internally anyways.
Right?
Like, if you go back to...
Let me just go to here.
Yeah, so I think what you're saying here
is now you have a case where you've just got one block here
in isolation.
Yep. Yeah.
Yeah, so we, you know, we felt like,
since the SSD is already internally doing its FTL mapping at 4K granularity,
you know, as you're writing to the SSD,
it typically has its own log internally in the SSD. And so even though the LBA numbers themselves may
not be contiguous, that data still ends up getting written contiguously on the SSD. So
we felt like, you know, having a bitmap here where we can just pick, you know, the
next four open blocks would be sufficient for this algorithm. Thank you.
Go back to the first question.
What I understand is, all data is still very visible
to anyone who has access to the underlying box.
I don't know why you did that.
Have you considered adding coordinated support so that you can no longer get to the old data?
That's an interesting idea.
We haven't talked about that in terms of this.
I mean, for what we've talked about today,
they're two distinct layers.
They don't really have knowledge of each other.
Something like that could certainly be done, but that's...
Yeah, so I think you would just have to keep track of what those
keys are that you have for each block.
Right?
So that you could read it back later.
Yeah.
And going back to this
gentleman: have you considered
creating a region on
the block volume that is the same
size as your persistent memory file,
so that you can copy the persistent file there?
Well, no, we...
Yeah, I mean, that's a really good...
So we have heard that.
We haven't implemented that yet,
but that's something that I think we'll probably do,
because that does make it easier to be able to,
you know, maybe move it from one system to another.
And, of course, that metadata
file on disk isn't going to be updated every time you do an I/O, but it would be a way,
if you took the volume offline, to make it easier to
move that logical volume between systems.
Yeah. When I did this years ago, I didn't have any persistent memory, so I had to update
it.
Sure. Yeah.
So it was more of a linear stream.
Yeah.
And then the linear stream is lost.
Yeah. Yeah, it's always a trade-off.
I mean, with any algorithm, there's definitely a trade-off
on speed versus compression ratio.
And so this is...
Yeah.
Okay.
Let me see, I think there was another.
Yeah.
So we do have garbage collection.
So in this case here, after we've
done a read-modify-write,
we are no longer using this block.
We don't have any logical map entries that are pointing to this.
So we will clear the bit in memory
so that we can use this again. And if the
system reboots, it will see that there's nothing referencing this, and so this bit will stay
clear as well, and we'll use it the next time we boot. So we are doing garbage collection
in memory. We're not doing garbage collection from the standpoint of trying to get contiguous
LBA regions.
Would there be fragmentation in that?
Yeah, there could be fragmentation.
Yep.
I mean, the one advantage, which is something we would need to explore, is that if you have,
you know, three blocks here and one block here, that's two write operations.
You can't do that in one,
because when you do a write operation, you can only give it a single LBA range.
Whether or not that makes it faster,
we'd have to take a look at that.
But that'll be for next year's talk.
Any more questions?
Yes.
I suppose that the compression is done in a... Yes? I'm sorry?
The compression is...
I mean, you could do the compression on either side.
You could do it on the initiator or the target side.
It would really depend on, I guess, your use case.
If your file is a shared file between different users?
I think even if you had a block device
that was shared between multiple clients,
you could still have the compression
done on the target side.
And then as clients were connecting to it,
I mean, they're each going to get decompressed data.
It's going to be decompressed on the target side
before the data's returned.
You mean because you're transferring the uncompressed data over the
network instead of the compressed data? Yeah, so I think that would come down to: if it's
a shared volume, read-only, then certainly doing the decompression on the client side would make more sense.
If it's a shared volume... I mean, typically
you're not going to have shared volumes where they're both writing to it, because that just wouldn't work, because
then you would have no way for those clients to coordinate the metadata updates.
But I think it would just kind of depend on your usage model,
which side you would do it on.
I think there's cases both for the initiator side,
like you're talking about for the shared volume,
but I think there's a lot of cases where doing the compression
on the target side makes sense as well.
A lot of times clients may not always have the offload engines
or things to do the compression most efficiently,
and a lot of times those might be built into the target,
and the target can do that compression and decompression very efficiently.
Yeah, you have to keep the persistent memory with...
Yep, yep.
And certainly if it's on the target side, that's easy to keep those together.
On the initiator side, at least for
this mechanism, you would have to have the persistent
memory file be able to be accessed by all the
clients as well. So, yeah, there are some trade-offs. Certainly if you're doing this on
the client side, yeah, sending that persistent memory
file to all the clients may work,
but you'd be duplicating that persistent memory file across all the clients.
So any more questions?
Okay.
Well, thanks, everybody.
Lots of good questions.
Appreciate the feedback.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.