Storage Developer Conference - #119: Squeezing Compression into SPDK

Episode Date: February 19, 2020

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 119. So I'm Paul, and this is Jim. My claim to fame this year is I named this presentation, so I get to take credit for that. Isn't that cool?
Starting point is 00:00:48 Squeezing compression into SPDK. Isn't that awesome? Okay, so first, this is probably, I think I know the answer, but show of hands, who's pretty familiar with SPDK, at least knows what it is? Okay, great. I mean, we're here every year doing three or four talks. I don't know, I think we have four talks this year
Starting point is 00:01:07 between the Intel folks. So, yeah, I won't spend a lot of time going over what SPDK is, but this is really to talk about the compression piece that Jim and I worked on this year. There's still some to-do items. We'll talk about that as we go. But this is really more about that.
Starting point is 00:01:23 I'll give you enough context that if you're not familiar with SPDK, you'll know kind of what we're talking about. Okay, so we'll go over like super high-level architecture. And SPDK is so big, and there's so many different components, there's really no way that we could get through all of them and focus on compression. So I've really trimmed that slide down to just cover what we're talking about today. Talk a little bit about DPDK. You guys have all heard of DPDK, right? The other DK.
Starting point is 00:01:55 I'm going to bring up the Crypto BDEV module. I actually was here last year with Fiona from the DPDK team, and we talked about the Crypto module. That was our first foray into the DPDK framework. So I'll explain a little bit about how that works, give you the high-level flow of that, only because I think it's really interesting to see how that compares and contrasts with the compression, because the compression work was quite a bit more complicated than the crypto work, but they both tie into DPDK in the same way. Then I'll talk about the compression BDEV module, or virtual BDEV module, and sort of explain how that works.
Starting point is 00:02:29 And then Jim will cover libreduce, or the reduce library. And we'll, as we go through the slides, you'll get a feeling for what that is and how that fits into the big picture. Okay, so here's the super high level architecture. Like I said, it's missing a tremendous amount of components, but at the top is typically where we put our front-end stuff. So this is the example that I'm using here is the NVMe Fabrics Target, which today supports both RDMA and TCP.
Starting point is 00:02:58 So this would be the higher end, and then we're using NVMe as an initiator. And then this is our block device layer with just a few examples shown in here, and this is kind of the user space storage stack is the best way to think about it, right? We've got block devices, which actually have something physical backing them, and then virtual block device modules
Starting point is 00:03:19 that you can stack on top of block device modules to add functionality and capability like crypto and compression, and we've got a logical volume manager. So there's all sorts of different things you can do by adding virtual block devices. So what I'm showing here is we've got all sorts of different virtual block devices, the DPDK encryption virtual block device, the compression block device, and then ReduceLib, which is new with compression, which you'll see here in a minute, uses persistent memory.
Starting point is 00:03:46 And we do so through the persistent memory development kit, the PMDK. Then down at the bottom, beneath the NVMe driver, we've got an environment abstraction layer in SPDK so that we can actually plug into any kind of environment like DPDK that supports the services that we use from DPDK, memory management, PCI management, all that kind of good stuff. Okay, next slide. All right, so DPDK has got a lot more than this, too. But this is the slide we used last year to talk about crypto. You can see here's the crypto stuff over here.
Starting point is 00:04:22 So they've got a crypto dev module and a compress dev module that we're both using now. Now, we chose the DPDK framework to implement these things because it's really cool for doing this. So if you guys are familiar with ISA-L, everybody knows what ISA-L is, right? That's actually out of our team as well, the team that Jim and I work on. So ISA-L does these things too. So we've been asked a few times, right, why don't you just use ISA-L instead of the DPDK framework, just bolt it right into SPDK? And we do use it for some things within SPDK.
Starting point is 00:04:58 But the cool thing about, like, take, for example, the crypto BDEV. They're actually both very similar. The DPDK guys have built a generic abstract API to hide whatever's behind it actually doing the crypto or the compression. So we write to a single API, regardless of whether the back end is a hardware-assisted device or a software polled-mode driver, one of which happens to be ISA-L, or anything. So we really write to a generic API, and then anybody can go in and
Starting point is 00:05:28 plug in anything they want in the end with very, very little integration work. And it's the same thing with CompressDev, right? So they followed the same model. There's one API for CompressDev, and then you can use hardware assist on the back end, or you can use software. Okay? Okay, so this is an animation, so I'll just sort of flip it at you when I'm ready to go.
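To make that generic API concrete, here is a rough sketch of the polled-mode pattern in C. It assumes the device, queue pair, mempools, and xforms are already configured, and the rte_comp_op details are paraphrased from the public DPDK headers, so treat it as illustrative rather than the exact code in the SPDK module:

```c
#include <rte_compressdev.h>

/* Illustrative polled-mode flow shared by cryptodev and compressdev:
 * fire operations off asynchronously, then poll for completions.
 * Device, queue pair, and xform setup and error handling are omitted. */
static void
submit_then_poll(uint8_t dev_id, uint16_t qp_id,
                 struct rte_comp_op **ops, uint16_t nb_ops)
{
    /* Enqueue returns immediately; the PMD works asynchronously. */
    uint16_t enq = rte_compressdev_enqueue_burst(dev_id, qp_id, ops, nb_ops);

    /* In SPDK this loop lives inside a registered poller that keeps
     * "cruising along looking for completions". */
    struct rte_comp_op *done[32];
    uint16_t deq = rte_compressdev_dequeue_burst(dev_id, qp_id, done, 32);

    for (uint16_t i = 0; i < deq; i++) {
        if (done[i]->status == RTE_COMP_OP_STATUS_SUCCESS) {
            /* done[i]->produced holds the number of output bytes. */
        }
    }
    (void)enq; /* a real caller would retry any ops that didn't enqueue */
}
```

The crypto side has the same shape with rte_cryptodev_enqueue_burst and rte_cryptodev_dequeue_burst, which is what makes swapping back ends between software and hardware so cheap.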
Starting point is 00:05:55 Okay, so here's our, not yet. Here's our SPDK application up here, and then I'm showing the block device layer, and this is the crypto example. Like I said, I wanted to go over this because it's a pretty simplified view of the world. And then the BDEV is actually the block device that's representing the device. So this would be an NVMe BDEV
Starting point is 00:06:15 in the example I'm talking about. And then this is our virtual BDEV that we just stack on top of it. So you could actually put crypto on top of anything. You could put it on top of a logical volume. You could put it on top of an NVMe BDEV. You could put it on top of a RAM disk. You could put it on top of anything you want. Then we've got our SSD down here, of course. And then off to the right, this is where we've
Starting point is 00:06:35 got DPDK. This is the crypto dev API. So this is the generic API I was talking about. And then the two examples are the AESNI polled-mode driver and the QAT polled-mode driver. So that first one is a completely software-based cryptography, and the second one is the Intel QuickAssist card. And from our perspective, writing this module, it's basically the same API. Both asynchronous, fire them off, poll them, and get the completions. Okay, so let's start with the first click here. So an IO comes down from the application, and its first stop is the Crypto Virtual BDEV module
Starting point is 00:07:08 because that's how this thing was configured. Its next stop is going to be the crypto. It's going to take that I.O., whether it's a read or a write, and it's going to fire it off asynchronously over to the CryptoDev framework. And we've got a poller that's going to be running over here and cruising along looking for completions on those. And as soon as the operation is done, the encryption or decryption is done, then it comes back to the crypto BDEV driver with, you know,
Starting point is 00:07:35 the buffers identified and what's in them. So then what we do is we take either the encrypted data or the unencrypted data, and we fire it down to the BDEV, which the BDEV in turn fires it down through the driver and gets it out to the SSD. So, pretty simple example, right? And that's the beauty of the virtual block device model that we've got. It allows us to do these kind of things pretty easily. Okay, so now let's talk about the big picture here for compression. You can see there's a couple of new things on the slide. The first most prominent one is, like, persistent memory.
Starting point is 00:08:14 So we use persistent memory, and we get into a lot more detail on this when Jim goes through LibReduce. But in order to efficiently allocate and keep blocks stacked up as we compress the data, it requires a lot of metadata operations. So we have a need for very fast metadata operations. So it's basically a really good fit for persistent memory. The other stuff in here that's new that you see on top of the compression driver is libreduce and PMDK.
Starting point is 00:08:41 So libreduce is its own module. It's only used by the compression virtual block device. But this model is a little bit different than what we saw before. And as I walk you through this, it'll become clearer. The compression vbdev in this implementation is much more of a traffic cop than it is a virtual block device in the sense of what we just saw over there. When it comes up and initializes, we trade function pointers back and forth with libreduce so that libreduce knows how to talk to raw disks.
Starting point is 00:09:10 It knows how to do a compression and decompression operation, but it does so all of them through the compression virtual block device module. Okay, so let's click on the first one here. First thing that happens is same as with crypto. It still comes down to the compression driver. So the compression driver gets a read or write operation to a logical device of some kind. And instead of firing over to the compression framework like we did in crypto,
Starting point is 00:09:37 now what we're going to do, go ahead and click, Jim. Now what we're going to do is we're going to take that read or write and we're going to tell libreduce, hey, there's a read or write, but I don't know anything about the physical layout of the disk. I don't know whether the data is compressed, and I don't know anything about it. So it fires that off to libreduce, and libreduce is going to do some metadata operations through PMDK, so it's talking to its persistent memory.
Starting point is 00:10:00 And then it gets a little bit complicated, so the slide isn't completely accurate for what the code does. But then libreduce, go ahead, Jim, libreduce is going to call back into the compression vbdev module with some function pointers that we trade at initialization time and say, okay, here's the data that you need to compress or decompress. And then the compression vbdev module does the compression operation and actually gets back to libreduce and says,
Starting point is 00:10:26 okay, now this is done. And then we take the completed operation and fire it down to the BDEV, and it makes it out to the SSD. So like I said, a little bit more complicated than crypto, but pretty cool that, you know, we've got good use of persistent memory and use of PMDK.
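One way to picture that function-pointer trade is as a descriptor struct that the compression vbdev fills in and hands to libreduce at initialization time. The sketch below uses invented names that mirror the description; the real thing in SPDK is the spdk_reduce_backing_dev structure, whose exact layout may differ:

```c
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical completion callback for the asynchronous operations. */
typedef void (*reduce_backing_cpl)(void *cb_arg, int status);

/* Illustrative descriptor: how libreduce talks to the raw disk and to
 * the compress/decompress engine without knowing what's behind them.
 * The compression vbdev fills these in when it initializes. */
struct reduce_backing_dev {
    uint64_t blockcnt;   /* size of the backing block device  */
    uint32_t blocklen;   /* e.g. 4096-byte backing I/O units  */

    /* Raw disk access, routed back through the vbdev to the bdev. */
    void (*readv)(struct reduce_backing_dev *dev, struct iovec *iov,
                  int iovcnt, uint64_t lba, uint32_t lba_count,
                  reduce_backing_cpl cpl, void *cb_arg);
    void (*writev)(struct reduce_backing_dev *dev, struct iovec *iov,
                   int iovcnt, uint64_t lba, uint32_t lba_count,
                   reduce_backing_cpl cpl, void *cb_arg);
    void (*unmap)(struct reduce_backing_dev *dev, uint64_t lba,
                  uint32_t lba_count, reduce_backing_cpl cpl, void *cb_arg);

    /* Compression, serviced by the vbdev via DPDK compressdev. */
    void (*compress)(struct reduce_backing_dev *dev,
                     struct iovec *src, int src_cnt,
                     struct iovec *dst, int dst_cnt,
                     reduce_backing_cpl cpl, void *cb_arg);
    void (*decompress)(struct reduce_backing_dev *dev,
                       struct iovec *src, int src_cnt,
                       struct iovec *dst, int dst_cnt,
                       reduce_backing_cpl cpl, void *cb_arg);
};
```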
Starting point is 00:10:45 I'm sure you guys have seen all the PMDK stuff going on this week and over the last couple of years. Really easy bolt-in, really convenient to use for us. Okay, so with that, I'm going to turn it over to Jim, who's going to go through the details of libreduce and how we actually take advantage of the compression algorithms that DPDK gives us. Okay. Can you guys hear me? Okay. So Paul gave a really good overview of how this libreduce library fits into the overall SPDK picture. So I'm going to focus now mostly just on that
Starting point is 00:11:36 compression library itself. We purposely wanted to build this library so that it wasn't tied exactly to the BDEV interfaces that Paul talked about, so that it could be used standalone. It really helped us kind of make sure that the APIs were correct. It made it really easy for us to build test infrastructure around it. And it means that you could pick up this library and you could use it in a different context. It's not tied directly to our SPDK block device layer. So go ahead to the next. There we go. Yeah. So there's three main pieces to this library. So the
Starting point is 00:12:19 first one to construct this is you need some sort of block device for your backing I.O. units. I'm gonna use this term backing I.O. units here. It's basically where we're gonna store the compressed blocks on disk. I'm gonna talk about some cases where we actually don't store compressed blocks, we store the uncompressed blocks. And so these backing I.O. units and the backing I.O. device really refer to that. And so with this implementation we're going to typically be using a thin provisioned SPDK logical volume. So that as we start writing blocks to that logical volume, that'll actually start consuming
Starting point is 00:12:54 the capacity and that's how we're going to realize our savings. So the key one here is that we're using a persistent memory file for the mapping metadata. So if you've done block compression, you know that you have this problem where you've got maybe 16K of data that's coming in and only 8K of data that's coming out. And you've got to keep track of how those uncompressed blocks map to the compressed blocks on disk. And so we really thought using persistent memory would be a
Starting point is 00:13:22 really unique way to take advantage of that byte persistence, and so we're using PMDK directly to read and write that persistent memory file. Let me go ahead and hit the next one. We do have some amount of metadata on the block device. So, for example, in the SPDK block device layer, when block devices start showing up in the system, so, for example, let's say a logical volume is exposed,
Starting point is 00:13:49 we need to know if this logical volume, does it have a compressed, I mean, is it compressed? And so we store a small amount of metadata at the beginning of the block device with some parameters describing how we're using libreduce. So, for example, things like the path to the persistent memory file, how big is the chunk, how big are the I.O. units, some of those parameters. And then the key thing is this is a metadata algorithm only. We're not implementing our own compression algorithm. We're gonna use standard compression algorithms like, you know, DEFLATE, LZ4, et cetera. This is just describing the algorithm we use for mapping
Starting point is 00:14:26 this metadata and how we store it on persistent memory. OK. So this is gonna talk a little bit more about how this library fits into things like the SPDK block device layer. Go ahead and hit the first. So the first part here is we have, you know, at the beginning, when you want to create what we call a reduce volume, you're going to call this reduce vol init.
Starting point is 00:14:54 And it's basically going to call into here, you're going to pass it the block device, and you're going to pass it a path for the persistent memory file. It's going to go ahead and create the persistent memory file, store the metadata on disk, and return to you a handle that you can then use to start doing read and write operations. And then later, you can do a load operation. So we use this load operation whenever we get a new block device. This will not only load an existing compressed volume, but you can also use it to say, is there a compressed volume on this block device? So we use this in the SPDK path, in a process we call examine.
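As a rough paraphrase of that surface, everything is asynchronous and completion comes via callback. These declarations are reconstructed from memory of SPDK's include/spdk/reduce.h, so take the exact names and signatures as assumptions:

```c
/* Hedged paraphrase of the libreduce API described above; the real
 * declarations live in SPDK's include/spdk/reduce.h and may differ. */
struct spdk_reduce_vol;            /* opaque compressed-volume handle  */
struct spdk_reduce_vol_params;     /* chunk size, I/O unit size, ...   */
struct spdk_reduce_backing_dev;    /* the function-pointer descriptor  */

typedef void (*reduce_op_with_handle_cb)(void *cb_arg,
                                         struct spdk_reduce_vol *vol,
                                         int reduce_errno);

/* Create a brand-new compressed volume: creates the pmem file, writes
 * the superblock to the backing device, returns a handle via callback. */
void spdk_reduce_vol_init(struct spdk_reduce_vol_params *params,
                          struct spdk_reduce_backing_dev *backing_dev,
                          const char *pm_file_dir,
                          reduce_op_with_handle_cb cb_fn, void *cb_arg);

/* Open an existing volume; also doubles as the "is there a compressed
 * volume on this block device?" probe used during examine. */
void spdk_reduce_vol_load(struct spdk_reduce_backing_dev *backing_dev,
                          reduce_op_with_handle_cb cb_fn, void *cb_arg);
```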
Starting point is 00:15:36 Whenever a new block device comes up, each of the virtual block device modules gets a chance to see, do I want to claim this block device? And this is what we use for that. This is how we can tell whether any existing block device is a compressed volume or not. Then these, the readv and writev, are how the upper layer application actually performs read or write operations to the compressed volume. And then on the bottom end, this is where we end up with a number of operations. So one is, if you know anything about PMDK, if you don't, definitely attend some of the persistent memory
Starting point is 00:16:16 sessions. I know Andy Rudoff's going to be doing a hackathon with PMDK, and you can understand more how this works. Reading and writing to persistent memory is just loads and stores. There's no special API there. But we do need, at very critical junctures when we're updating the metadata, to actually make sure that it's persisted. And so we'll make calls out to PMDK to persist the specific regions. Go ahead and hit the next.
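Those calls are the libpmem ones. pmem_map_file, pmem_persist, pmem_msync, and pmem_unmap are the real libpmem API; everything else here, the file layout and the helper, is invented for illustration:

```c
#include <libpmem.h>
#include <stddef.h>
#include <stdint.h>

/* Map (or create) a metadata file and update one 8-byte entry.
 * Reads and writes are plain loads and stores; only at the critical
 * junctures do we explicitly flush so the update is truly durable. */
static int
update_entry(const char *path, size_t file_size, size_t idx, uint64_t value)
{
    size_t mapped_len;
    int is_pmem;
    uint64_t *map = pmem_map_file(path, file_size, PMEM_FILE_CREATE,
                                  0600, &mapped_len, &is_pmem);
    if (map == NULL) {
        return -1;
    }

    map[idx] = value;                    /* a plain 8-byte store       */

    if (is_pmem) {
        pmem_persist(&map[idx], sizeof(uint64_t));  /* CPU cache flush */
    } else {
        /* Regular (non-DAX) file: falls back to msync(), which is why
         * this still works on disk, just quite a bit slower. */
        pmem_msync(&map[idx], sizeof(uint64_t));
    }

    pmem_unmap(map, mapped_len);
    return 0;
}
```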
Starting point is 00:16:47 We'll do readv, writev, unmap out to the backing I.O. device. So whenever we have to go and read compressed blocks or write compressed blocks or unmap blocks, we'll make those calls. And then go ahead and hit the next one. And then of course there'll be cases where we'll have to go and say we want somebody to go compress and decompress some data for us. So inside of here, we're depending on something outside of us, in this case, Paul's compress BDEV module, to actually do that compress or decompress operation. This is not tied specifically to the DPDK framework. The DPDK framework stuff is handled in the module that Paul talked about. So to kind of summarize here, this is completely independent from the SPDK framework and the BDEV layer, so it can be used
Starting point is 00:17:29 as a standalone module. I'll talk a little bit about what chunks mean. So whenever you're doing compression, you have to decide what chunk of data you're gonna compress on. We decided for simplicity to force the caller to split on chunk boundaries. If we say that a chunk is 16K and you have an I.O. that spans a 16K boundary, it's up to the caller to actually make sure that those I.O.s are split. We already have quite a bit of functionality in SPDK
Starting point is 00:18:00 and other areas of the code to do this splitting, and so we decided to leverage that and not duplicate all that here in the libreduce library. And then it's single threaded. So for each compression volume, there's operations where you have to make sure that you synchronize. If you've got multiple writes going to one chunk, you have to synchronize them.
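About that splitting: it's just offset arithmetic. A sketch with a hypothetical helper, assuming the 16K chunks used throughout the examples; an 8K write at offset 12K comes out as two 4K pieces, one per chunk:

```c
#include <stdint.h>

#define CHUNK_SIZE 16384u  /* assumed 16K chunks, as in the examples */

/* Hypothetical splitter: carve [offset, offset + len) into pieces that
 * each stay inside a single chunk, as libreduce requires of callers. */
static void
split_io(uint64_t offset, uint64_t len,
         void (*submit)(uint64_t off, uint64_t len))
{
    while (len > 0) {
        uint64_t room = CHUNK_SIZE - (offset % CHUNK_SIZE);
        uint64_t this_len = len < room ? len : room;
        submit(offset, this_len);
        offset += this_len;
        len -= this_len;
    }
}
```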
Starting point is 00:18:25 Within SPDK we can pass messages between threads very, very quickly. And so from a complexity standpoint, we decided to make this single threaded, and then we have code in the upper layers that makes sure that if you have multiple threads that are sharing the same volume, they will send them to one thread to make sure that those I.O.s are all synchronized. Okay, so layouts. I'm going to describe a little bit about how the data is actually laid out on the SSD
Starting point is 00:18:54 and in persistent memory, and then I'm going to go through a few examples on what read and write operations actually look like. So first we've got the SSD. Go ahead and hit the next one there, Paul. So we split this up into the I.O. units. So for this example, we're gonna assume each of these I.O. units is 4K, which means we're always gonna be reading or writing to the backing I.O. device in
Starting point is 00:19:16 4K units. And next we have the persistent memory file. And the persistent memory file is split up into two main pieces. The first one is a chunk map. So a chunk map contains what we call chunk entries. Go ahead and hit the next. So what this does is, you can think of each of these chunk entries as representing 16K
Starting point is 00:19:38 of data. So in this case what we're saying is that we have 16K of data. It was able to be compressed down to 10,121 bytes, which means that it would take three 4K units to store that data, and then it points to those three units. And then finally we have the logical map, and this is basically the user-visible map. When they
Starting point is 00:20:02 say I want to write to LBA zero, well, that means we're talking about this here. If they're talking about an LBA that's at a 16K offset, that's gonna be this entry. And so this is where we basically store an entry here which points to the chunk map entry, and that's kinda how all these pieces fit together. Okay, so let's walk through a couple of examples.
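Sketched in C with invented names, and with the 4K I.O. units and 16K chunks assumed throughout the examples, the two persistent-memory pieces look roughly like this:

```c
#include <stdint.h>

#define IO_UNIT_SIZE   4096u     /* assumed 4K backing I/O units      */
#define CHUNK_SIZE     16384u    /* assumed 16K chunks                */
#define IOUS_PER_CHUNK (CHUNK_SIZE / IO_UNIT_SIZE)
#define EMPTY_ENTRY    UINT64_MAX  /* assumed "never written" marker  */

/* One chunk map entry represents 16K of logical data: the size it
 * compressed down to and which backing I/O units hold it. In the
 * slide's example, 16K compressing to 10,121 bytes occupies three 4K
 * units, so only three of the four slots are valid. A compressed_size
 * equal to CHUNK_SIZE is the clue that the chunk is stored
 * uncompressed. */
struct chunk_entry {
    uint32_t compressed_size;
    uint64_t io_unit[IOUS_PER_CHUNK];  /* indexes into backing device */
};

/* The logical map is an array of 8-byte entries, one per 16K of
 * user-visible LBA space, each holding a chunk map index or
 * EMPTY_ENTRY. A write to LBA 0 resolves through logical_map[0]; a
 * write at a 16K offset resolves through logical_map[1]; and so on.
 * Both arrays live in the persistent memory file. */
struct reduce_pm_metadata {
    struct chunk_entry *chunk_map;
    uint64_t *logical_map;
};
```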
Starting point is 00:20:28 OK, so the first one, we've just initialized the logical volume. So we have no data on this whatsoever yet. User says they want to write 4K of data at LBA zero. So first thing we do is we look up in the logical map. This is empty. So we know there's no existing data. We don't have to worry about any kind of read, modify, write operation. So the first thing we do is we allocate, we keep a bitmap in memory of the chunk maps and which ones are used and which ones are free. So first we're going to allocate an empty chunk map, entry zero. Then we're gonna compress the
Starting point is 00:21:06 data. So I'm gonna call out to Paul's module, I'm gonna say I want you to compress this 4K of data with 12K of zeros, because we're only writing 4K out of the 16K chunk. And let's assume this compresses down to 2500 bytes. So since it compresses down to 2500 bytes, it means we just need one I.O. unit, one 4K I.O. unit, so we're going to allocate one of those.
Starting point is 00:21:30 So then here we write that compressed block to disk. We go ahead and we write the chunk map entry. And then this is the part that makes it, what I would say, official. That's when it's basically committed. Once we write to the persistent memory, the logical map, then the next time somebody goes and reads the data, they're going to find that data and they're going to be able to read it.
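The ordering in that sequence is the important part, so here it is as a sketch. The names are invented and pmem_persist is the real libpmem call; the point is that the 8-byte logical map store is persisted last and acts as the atomic commit:

```c
#include <libpmem.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative commit sequence for the write just described. Order is
 * everything: the chunk map entry is persisted first, and the 8-byte
 * logical map store comes last; that single atomic store is the
 * commit point. */
static void
commit_chunk(uint64_t *logical_map, uint64_t logical_idx,
             void *chunk_entry, size_t entry_size, uint64_t chunk_idx)
{
    /* Step 1, writing the compressed block to the SSD, already done. */

    /* Step 2: persist the filled-in chunk map entry. */
    pmem_persist(chunk_entry, entry_size);

    /* Step 3: flip the logical map pointer and persist it. Until this
     * completes, a crash leaves the old (or empty) mapping intact, so
     * a reader never sees a half-written chunk. */
    logical_map[logical_idx] = chunk_idx;
    pmem_persist(&logical_map[logical_idx], sizeof(uint64_t));
}
```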
Starting point is 00:21:53 So you can see kind of how these point to each other here. Let's go ahead to the next. Okay, so now we're gonna write a full chunk. We're gonna write a full 16K to offset 64K. Go ahead and hit the next. So here we can see this is empty, right? So, this says 16K, it should have said 64K. So we're gonna look up here.
Starting point is 00:22:21 This is empty, there's nothing there yet. We're gonna allocate another chunk entry in the map. Okay, so now let's say we compress this 16K and it only compresses to 14,000 bytes. So at 14,000 bytes, it's going to take four I.O. units to be able to store that data. So why compress it, right? We might as well just store the uncompressed data on the disk so that later we can just read it back and we can save time. So we'll actually allocate the four backing I.O. units, but we're gonna write the uncompressed
Starting point is 00:22:54 data to the disk instead. Go ahead and hit the next. And then again, we're gonna store this here. Since we see 16384, that's gonna be our clue later that that's uncompressed data, so we'll know that when we read the data back, we won't have to decompress it. And then again, we write and persist the logical map entry. OK. So now we're gonna write to offset 4K to this chunk that we already wrote to earlier.
Starting point is 00:23:27 And this is going to be a really good example of showing why persistent memory works really well here, why being able to do byte addressability and 8-byte atomic operations makes this a really nice solution. So we look up the entry, we find something. So now we know we can't just take our 4K data and write it. We've got to read the old data, we've got to merge it, and we've got to write new data back. So we're going to read the I.O. unit. We're going to decompress it into our data buffer.
Starting point is 00:23:59 And then we need to merge the incoming 4K. So we've got the old data that was out on disk, it was that original 4K that the user had with a bunch of zeros, and now we're gonna merge this 4K of incoming data into that. Go ahead to the next. So here, we're gonna say this compresses down to five thousand bytes. So now this means we need two backing I.O. units. So one of the things
Starting point is 00:24:25 I want to call out here is that we never update anything in place. Right, so you can see here, we don't overwrite this chunk map entry. We're gonna allocate a new one. And same thing here for the backing I.O. units. We're not gonna try to reuse that entry zero. We're gonna allocate two new ones. And when the volume gets initialized, we account for this. We account for extra space in the persistent memory file. We account for some extra space on the SSD so that we can make sure that we have extra space. We don't have
Starting point is 00:24:57 to do any of these updates in place. So now we've written the compressed data to the SSD. We write and persist the chunk map entry. And then this is the really, actually, just go back one. So you can see here, as of this point, this still is not committed, right? If the system were to crash at this point and we came back up, this is still going to point to the old data. So now when we do the, so at this point, we persist this first,
Starting point is 00:25:26 and then we write this. And only once this is written is the data actually committed. And so this is how we can really make sure that the persistent memory allows us to atomically update these chunks so that it's going to be power fail safe. So this is probably a good point to call out that when we reload, so here in memory we can release the old chunk entry and the I.O. units. This is only releasing it in our internal in-memory bitmaps. We don't have to write anything to disk to actually free them. When we load this thing back up later, we're gonna walk through this list, and this is gonna tell us which
Starting point is 00:26:06 of these entries are valid, and these are gonna tell us which of the I.O. units are used, and that's going to implicitly free everything that may have still been there from when we ran previously. OK. Trim is pretty easy, especially if you're doing, you know, trimming a full chunk. Go ahead and hit the next. Basically all we have to do is we just clear this. So we can just, go ahead and hit the next one. So we clear that out. That implicitly releases those chunks, because now when we come back up, when we read that, there's not going to be anything there. So there's nothing referencing this chunk map entry. There's
Starting point is 00:26:47 nothing referencing those four units on disk. OK. And then the final example here is reading some of the data back. Sometimes after you write the data, you want to read it back. So we're gonna look up the chunk map entry. Go ahead and hit the next one here. So here we're going to read those two backing I.O. units, 5 and 6. So what we can do here is, since the user's only reading 4K,
Starting point is 00:27:16 we can actually target the user buffer for the 4K that he's interested in and just use a bit bucket for the other 12K, because we want to try to avoid having to read the data into one buffer and then copy it over into the user buffer; the sketch below shows the idea. Okay, so that was a whole bunch of examples. Hopefully that all made sense. We do have some next steps here.
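About that bit-bucket read: it's plain scatter-gather, aiming one iovec element at the user's buffer and the rest at scratch space. A sketch with invented names, assuming a 16K chunk and a 4K read into the middle of it:

```c
#include <stdint.h>
#include <sys/uio.h>

/* Build a scatter list so a decompressed 16K chunk lands in up to
 * three pieces: the bytes the user asked for go straight into their
 * buffer, and the rest go to a throwaway bit bucket, so there is no
 * extra copy afterwards. Returns the iovcnt for the decompress call. */
static int
build_read_iov(struct iovec *iov, void *user_buf, void *bit_bucket,
               uint64_t off_in_chunk, uint64_t user_len)
{
    const uint64_t chunk_size = 16384;
    int n = 0;

    if (off_in_chunk > 0) {                /* leading bytes we skip   */
        iov[n].iov_base = bit_bucket;
        iov[n].iov_len = off_in_chunk;
        n++;
    }
    iov[n].iov_base = user_buf;            /* the bytes of interest   */
    iov[n].iov_len = user_len;
    n++;
    if (off_in_chunk + user_len < chunk_size) {  /* trailing bytes    */
        iov[n].iov_base = (char *)bit_bucket + off_in_chunk;
        iov[n].iov_len = chunk_size - off_in_chunk - user_len;
        n++;
    }
    return n;
}
```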
Starting point is 00:27:39 One of those next steps is on the unmap path; we have some optimizations we can make for doing what we're calling sub-chunk allocation masks. So right now, we're not keeping track in the metadata of which LBAs in the chunk have actually been written to. In some of those cases where we're writing zeros, we could actually make some optimizations where we don't have to do that. We could save on some bandwidth. We were really hoping we would have some really cool performance charts for this. We don't have the charts yet. Once we do, they will be out on
Starting point is 00:28:13 the SPDK website. Some of the initial data looks good, but there's still some things that we need to run through. We still have some additional metadata parameters that we want to store on disk that aren't being stored yet. Go ahead and hit the next one. And then we have some methods for reducing the metadata file size. So right now, the chunk entries, when they refer to the I.O. units, they're using 64-bit values for those.
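To put rough numbers on that (back-of-envelope arithmetic, not figures from the talk): with 16K chunks and 4K I.O. units, each chunk map entry carries four I.O. unit references, so 64-bit references cost 32 bytes per 16K of logical data, roughly 2 MB of persistent memory per GB of volume before counting the logical map and the spare entries. 32-bit references would cut that to about 1 MB per GB.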
Starting point is 00:28:43 And we've got some methods where we could do 32-bit instead. So we would still support volumes that have more than 2 to the 32 I.O. units, but with a little bit of extra finagling, we can effectively cut the size of the metadata file in half. And that's when you start looking at, you know, cost effectiveness of this, that's one of the things you want to do, right, because persistent memory is more expensive than NAND, of course, and so, you know, we want to make sure we're making use of that persistent memory as well as we can. And I think that is it. If you want to get more information, there's tons of info out on our spdk.io website. A lot of information on the documentation link
Starting point is 00:29:31 that goes through a lot more details on what Paul talked about. And there's some more details on LibReduce here as well. One thing to call out, we're actually having a developer meetup here in the Bay Area. What is it, November 11th and 12th. So if you're interested, if you're somebody who is, you know, hands-on and wants to engage in a really, like, deep dive technical session over about a day, day and a half, you know, check this out.
Starting point is 00:30:03 There's an opportunity to RSVP for that. Yeah, that one's actually hosted by Nutanix also. Yeah. So with that, questions? Jay? So with regards to the unallocated blocks, how do you handle sanitization? You want to sanitize the drive so that, you know, for security purposes, how do you handle
Starting point is 00:30:27 the sanitization of the actual data on the drive at that point, since you've got so many redirections? Yeah. So, I mean, typically, let's say we take a logical volume and then we're going to make that compressed. That is typically going to go to one client. And so I think your question is what happens when that storage
Starting point is 00:30:54 is going to be repurposed for some other client. So in libreduce we're not handling that sanitization part there. That would be handled at the layer where you allocate the logical volume. So that logical volume, you have the opportunity to sanitize that logical volume before you repurpose it to create storage for some other client. So in the model that you had before with the three stages, you've got the chunk
Starting point is 00:31:19 data, you've got the logical data, and you've got the actual write to the NVMe drive. So libreduce isn't handling all of that. You're still relying heavily on some of the physical media for some of this, is that right? Yeah, so libreduce is handling these parts here, but everything
Starting point is 00:31:37 here is associated with one compressed volume. So... The virtual volume? Yes, the virtual volume. Yeah. So, you know, typically what happens is, how about I take a step back and walk through the process of how you get one of these set up. So the first thing you would do is, let's say you want to create a 10 gigabyte compressed volume. The first thing you're going to do is you're going to create a 10 gigabyte logical volume on the SSD. Then you're going to take that
Starting point is 00:32:11 10 gigabyte logical volume and you're going to pass it to libreduce. It's going to create a compressed volume that's going to be 10 gigabytes minus a little bit for some of those extra blocks we talked about, so we don't have to do things in place. But then that 10 gigabytes is going to be assigned to one client. Maybe an NVMe over Fabrics client or a VM, et cetera.
Starting point is 00:32:37 And then libreduce is going to handle this. It's also going to handle the persistent memory. At some point, once that logical volume is no longer needed, you're going to destroy the compressed volume, which is basically going to zero the portion of the logical volume. And then it's also going to delete the logical volume, because this is no longer needed. So...
Starting point is 00:32:59 I was going to say, just in case it's not clear, logical volume and compressed volume are two totally different things, right? So logical volume is a different abstraction layer, and maybe it helps if you think about it, that the very top layer that's exposed to the application is the compressed volume. The compressed volume and all this stuff sits on top of a logical volume, which abstracts away what's underneath it. So we've got multiple layers of abstraction here.
Starting point is 00:33:23 So that's why you're saying when you delete the logical volume, independent of the compression volume, you've got the option of saying destroy this when I start it or destroy this when I complete it and wipe out all the data. The genesis of the question is, if I've got a host that's going to be sending a sanitization command to the drive, and I'm going through this method where the logical volume is being managed by SPDK, where does... Oh, you're talking about sanitize operations that are coming from the
Starting point is 00:33:49 top end to the compressed volume. Jay, don't call them sanitize. Sanitize applies to an entire device, a physical device. You're talking about secure data deletion, which might be a namespace or something. You're confusing everybody by using that word. Bad Jay. Yeah, so right now, I mean, we have unmap operation support on the top end. Looking at sanitize is something we probably need to look at.
Starting point is 00:34:20 That's a good... Yeah, we'll either do unmaps or write zeros in the operation we're talking about. Yeah. Yeah, please. Yeah, so your question is, could you put this stuff on disk instead of in persistent memory. So, PMDK actually supports that. PMDK, you can actually have a file that's on disk,
Starting point is 00:34:41 and what it'll do is it'll mmap this, and then when you do the pmem persist, it's actually gonna do an msync. So you can actually do this on disk today. It's just gonna be slower, because when you wanna update eight bytes, you're updating a whole block. Quite a bit slower. What's that?
Starting point is 00:35:03 Quite a bit slower. It is gonna be quite a bit slower. So there, you know, there certainly are ways that you could put this on disk that might be faster. This was one that we purposely implemented to take advantage of persistent memory. If you wanted to have one where all of this metadata was stored on disk, this may or may not be the best algorithm to do that, but this one is well tuned if you do have persistent
Starting point is 00:35:28 memory on your system. Go ahead. Can you do compression and encryption in the same stream, one after the other, all the way through the whole process? Yes. Yep. And SPDK keeps those in the correct order? Meaning, compress before encrypt?
Starting point is 00:35:36 Well, so, you could decide to do encrypt before compress. I mean, if you decide to set it up that way, we don't, you know, that wouldn't be a smart way to do it, so we don't enforce it. We don't say, like, well, that's a stupid way to layer them. But it does support... The user configures that. Yeah, yeah, the crypto and compression vbdev modules are completely independent. I see. Okay, but as far as streaming one... Yep. So the, uh, the encrypted stuff, sorry, the compressed data, does it just stay in RAM and then... Yeah, so what would happen is if you... Why don't you go back to my intro slide? Sorry, there's lots of animation.
Starting point is 00:36:45 Yeah, so here... Yeah, actually, so go forward here. So here, if you had an encrypted block device down here, then this is basically going down to that device instead. And then that, in turn, could go to the logical volume manager, and that, in turn, could go down to the actual SSD. So these can really stack on top of each other. I have one other question. What if you want to demount the compressed volume?
Starting point is 00:37:07 Will you then take the metadata file, push it out there, and be done, so you can reuse your Optane memory or whatever is coming out next? Or do you assume that they're just mounted all the time, and therefore you have to keep the metadata in there all the time? So, if you want, let me make sure I understand what you're asking. Yeah, you can certainly do that. I mean, you have to keep the persistent memory file on disk. We persist the path to that file.
Starting point is 00:37:59 So you could, I mean, yeah, you could unload all those. You could unmount the persistent memory file system. But if you wanted to come back and you wanted to load that volume again, you've got to make sure that the persistent memory file system is loaded, so that it can find that file. Okay. And that file is... That's right. Yep. In the back. I've written this in years past, so I have a few questions. Okay.
Starting point is 00:38:21 It looks like you're throwing away 12.5% compression ratio because of your allocation. Okay. There is some, yep. Yeah, there's some cases where we're doing some rounding and there's some bytes that are, yep. Yep. Yeah. Can you clarify? Yeah. Yeah.
Starting point is 00:39:07 Yeah. So that's certainly true. I mean, that's kind of how the SSD handles it internally anyways. Right? Like, if you go back to... Let me just go to here. Yeah, so I think what you're saying here is now you have a case where you've just got one block here
Starting point is 00:39:37 in isolation yep yep yep yep Yeah, yeah, yeah. Yep. Yeah, so we, you know, we felt like with the, since the SSD is already internally doing FTLs on a 4K granularity,
Starting point is 00:40:04 you know, as you're writing to the SSD, it has its own log typically in the SSD. And so even though the LBA numbers themselves may not be contiguous, that data still ends up getting written contiguously on the SSD. So we felt like, you know, having a bitmap here where we can just pick the next four open blocks would be sufficient for this algorithm. Thank you. Go back to the first question. What I understand is all data is still very visible to anyone who has access to the underlying box.
Starting point is 00:40:40 I don't know why you did that. Have you considered adding coordinated encryption support so that you can no longer get to the old data? That's an interesting idea. We haven't talked about that in terms of this. I mean, for what we've talked about today, they're two distinct layers. They don't really have knowledge of each other. Something like that could certainly be done, but that's...
Starting point is 00:41:24 Yeah, yeah. So I think you would just have to keep track of what those keys are that you have for each block. Right? So that you could read it back later. Yeah. And going back to this gentleman, have you considered
Starting point is 00:41:55 creating a region on the block volume that is the same size as your persistent file so that you can copy the persistent file? Yeah, we... Well, no, we... Yeah, I mean, that's a really good... So we have heard that.
Starting point is 00:42:15 We haven't implemented that yet, but that's something that I think we'll probably do, because that does make it easier to be able to, you know, maybe move it from one system to another. And, of course, that metadata file on disk isn't going to be updated every time you do an I.O., but if you took the volume offline, then it would make it easier to be able to move that logical volume between systems.
Starting point is 00:42:37 Yeah. When I did this years ago, I didn't have any persistent memory, so I had to update it. Sure. Yeah. So it was more of a linear stream. Yeah. And then the linear stream is lost. Yeah. Yeah, it's always a trade-off.
Starting point is 00:43:10 I mean, with any algorithm, there's definitely a trade-off on speed versus compression ratio. And so this is... Yeah. Yeah. Okay.
Starting point is 00:43:29 Yeah. Let me, I think there was another question. Yeah. So we do have garbage collection. So in this case here, here's a case where we did a read-modify-write. We no longer are using this block.
Starting point is 00:43:59 We don't have any logical map entries that are pointing to this. So we will clear the bit in memory so that we can use this again. And if the system reboots, it will see that there's nothing referencing this, and so this bit will stay clear as well, and we'll use it the next time we boot. So we are doing garbage collection in memory. We're not doing garbage collection from the standpoint of trying to get contiguous LBA regions. Would there be fragmentation in that? Yeah.
Starting point is 00:44:25 There could be fragmentation. Yep. Yeah. I mean, the one advantage, which is something we would need to explore, is that if you have, you know, three blocks here and one block here, that's two write operations. You can't do that in one, because when you do a write operation, you can only give just an LBA range. Whether or not that makes it faster,
Starting point is 00:44:50 that's up for debate, we'd have to take a look at that. But that'll be for next year's talk. Any more questions? Yes. I suppose that the compression is done in a... Yes? I'm sorry? The compression is... I mean, you could do the compression on either side. You could do it on the initiator or the target side.
Starting point is 00:45:26 It would really depend on, I guess, your use case. If your file is a shared file between different users... I think even if you had a block device that was shared between multiple clients on the target side, you could still have the compression done on the client side.
Starting point is 00:45:51 And then as clients were connecting to it, I mean, they're each going to decompress the data. But it's going to be on the target side before the data's returned. You mean because you're transferring the uncompressed data over the network instead of the compressed data? Yeah, so I think that would come down to, if it's a shared volume read-only, then certainly doing the decompression on the client side would make more sense. If it's a shared volume, but, I mean, typically
Starting point is 00:46:31 you're not going to have shared volumes where they're both writing to it, because that just wouldn't work, because then you would have no way for those clients to coordinate the metadata updates. But I think it would just kind of depend on your usage model, on which side you would do it on. I think there's cases both for the initiator side, like you're talking about for the shared volume, but I think there's a lot of cases where doing the compression on the target side makes sense as well.
Starting point is 00:46:57 A lot of times clients may not always have the offload engines or things to do the compression most efficiently, and a lot of times those might be built into the target, and the target can do that compression and decompression very efficiently. Yeah, you have to keep the persistent memory with... Yep, yep. And certainly if it's on the target side, that's easy to keep those together. On the initiator side, at least for
Starting point is 00:47:27 this mechanism, you would have to have the persistent memory file be able to be accessed by all the clients as well. So, yeah, there's some trade-offs. Certainly if you're doing this on the client side, yeah, sending that persistent memory file to all the clients may work, but, you know, you'd be duplicating that persistent memory file across all the clients.
Starting point is 00:47:52 So any more questions? Okay. Well, thanks, everybody. Lots of good questions. Appreciate the feedback. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community.
Starting point is 00:48:26 For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
