Storage Developer Conference - #148: End To End Data Placement For Zoned Block Devices
Episode Date: June 29, 2021...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 148.
Hello, my name is Mark Acosta. I'm here to give a talk on data placement. Why data placement? Well, data placement, if done correctly in the right situation, can save a significant amount of storage costs. So cost savings is always a good thing. Data placement is one way to go off and achieve it. So data placement is basically for people that like organization
and believe that if you organize things,
they're easier to retrieve and store.
So if you can imagine your storage device as a bag
and all your files are these beads,
you can see that if it came time to move all the white ones, it would be kind of a little bit difficult.
So this is kind of a, I don't know, a really, really bad storage device.
But if you took a look at this as a storage device, then you can imagine that you had to move the white beads.
Things kind of suddenly got a lot easier.
And that's kind of the nature of data placement,
that if you put things in a particular order and then you maintain that order,
things become easier to move around and to find.
But when you initially store something, you know, and organize it,
and I think about, okay, my garage, I take it all apart, I take it all out, I stack it up. But over time, things come in and things need to change.
The order kind of gets all messed up. So if you're like
me, I kind of have an area where I just collect all my changes, or the things that I want to add,
and then at some point I go do a bulk update, a bulk reordering of the garage.
And a lot of things in storage are like that too,
where the media prefers large changes over small changes. With large block writes, the total megabytes per
second and throughput of a system are so much higher if you do big chunks
versus all these smaller ones. It was about 1994
that Rosenblum and Ousterhout introduced the log-structured file
system. Now, what the log-structured file system did was it understood the nature
of the media at the time, HDDs, and optimized the placement of files and updates of files
to reduce the mechanical latency.
At the time, seeks, I remember, were around 10 milliseconds
or maybe 12 milliseconds,
and maybe in 1994, 10 to 15 megabytes per second on the media rate.
So if you can imagine if I'm doing 10 millisecond seeks,
I can do about 100 IOs per second.
If they're 4K IOs, oh, that doesn't come out much.
It comes out to be like 400K bytes per second.
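As a rough sketch of that arithmetic, using only the numbers from the talk (10 ms seeks, 4K I/Os, and the low end of the 10 to 15 megabytes per second media rate):

```python
# Back-of-the-envelope HDD throughput, using the circa-1994 numbers from the talk.
seek_time_s = 0.010                              # ~10 ms per seek
io_size_bytes = 4 * 1024                         # 4K random I/Os

random_iops = 1 / seek_time_s                    # ~100 IOs per second
random_mb_s = random_iops * io_size_bytes / 1e6  # ~0.4 MB/s

sequential_mb_s = 10.0                           # low end of the 10-15 MB/s media rate

print(f"random 4K writes : {random_mb_s:.1f} MB/s")
print(f"sequential writes: {sequential_mb_s:.1f} MB/s")
print(f"speedup          : ~{sequential_mb_s / random_mb_s:.0f}x")
```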
But if somehow I can make all of those writes sequential on the media,
I can do 10 to 15 megabytes per second, about a 25x improvement in throughput,
just by eliminating the mechanical movement of the heads. So their log-structured file system
took that concept and pointed it in two directions.
One, let's put the metadata and the files close together on the media to minimize seeks.
And then two, on updates, let's turn these random writes into sequential writes onto the media. So the first innovation on the log structured file system was the
recognition that the file had metadata associated with it, and the metadata wasn't co-located. So
here they show the one, two, three, four different types of structures on the media,
but yet when they were placed by the file systems,
they were placed not directly next to each other,
but required seek operations to go retrieve the entire set.
With the log structure file system, however,
you can see that all the data was placed very close together,
such that when you went to go do an operation like read a file,
a lot less mechanical movement was needed.
The second optimization was done on how files are updated.
So if you can imagine the Unix fast file system at time zero already has four data files laid out on media sequentially in pretty nice order.
And then when it comes time to do the changes to the files, they need to go off and do these random writes.
Because the file system updates in place, it will actually do a seek to this location and then write the data to the media.
Then it will need to do another seek, another seek, and another seek. So this operation of
four updates took four seek operations. But imagine in the log structured file system,
instead of doing that, the operations, the writes are all collated in one place in a new segment of the media.
And what's left are what we call holes, places where data no longer exists, or is no longer valid, on the media.
It's essentially old data.
So you can see in this operation, instead of doing the four in-place file updates,
it does four sequential updates, and an indirection table is used
so that when this file is read, the system knows where to seek to find it.
So it's pretty efficient on the write.
If you're reading back sequentially, these seek operations do slow things down. So the third and final operation is what Sprite called the copy
and compact. We talked a little bit about those holes. You know, here are some holes where there
used to be some old data, but I'd much rather have some wide,
long, sequential place to write new data.
So what do we need to do?
We do this copy operation, which frees up a nice continuous space.
So this is the original garbage collection that we now refer to in SSDs,
very similar in operation.
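As a minimal sketch of those ideas put together, assuming nothing about the real Sprite implementation: an append-only log, an indirection table, and a copy-and-compact pass.

```python
# Minimal sketch of a log-structured write path: updates are appended to the
# head of the log, an indirection table maps (file, block) -> log position,
# and copy-and-compact reclaims the holes left behind by overwrites.

class TinyLFS:
    def __init__(self):
        self.log = []        # list of (file_id, block_no, data); None marks a hole
        self.index = {}      # (file_id, block_no) -> position in the log

    def write(self, file_id, block_no, data):
        # Overwrites never update in place; the old location becomes a hole.
        old = self.index.get((file_id, block_no))
        if old is not None:
            self.log[old] = None
        self.index[(file_id, block_no)] = len(self.log)
        self.log.append((file_id, block_no, data))

    def read(self, file_id, block_no):
        # One indirection lookup, then one "seek" to the log position.
        return self.log[self.index[(file_id, block_no)]][2]

    def copy_and_compact(self):
        # Copy the live entries forward and drop the holes, leaving a long
        # contiguous free region at the tail for new sequential writes.
        live = [e for e in self.log if e is not None]
        self.log = live
        self.index = {(f, b): i for i, (f, b, _) in enumerate(live)}

fs = TinyLFS()
fs.write("f1", 0, "A")
fs.write("f2", 0, "B")
fs.write("f1", 0, "A2")      # leaves a hole at position 0
fs.copy_and_compact()
print(fs.read("f1", 0))      # "A2"
```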
So the results that they achieved on the log structure file system
are, as expected, extremely good.
If you take a look at this create number, what is that?
It's about a 15x improvement in creating files,
which you would expect if you eliminated these small block random movements to go and write data and replace them with one big large sequential write, yeah, we're going to be a lot faster.
So that was really, really good news.
And then also on read, that's a little bit about the file placement, that the locality of the metadata was close to the data. But one of the things that was very
interesting about the conclusion section was this one thing where they noted
that this LFS works very well for large files, in particular files that
require no garbage collection, things that are created and destroyed in their entirety.
So wouldn't that be great if files never had to change
and storage never had to change?
But that was kind of what was said is,
if we can figure out a way to go do this,
things would get a lot better.
In 1996, Patrick O'Neil introduced the log-structured merge tree.
Now, what Patrick and his team found was that small changes to B-trees were relatively inefficient.
And if you contrast that to I have lots of changes to index,
or I want to create an entire index from scratch,
it was much more efficient.
So the log-structured merge tree did exactly that.
It avoided updating existing B-trees.
It kept enough changes in memory,
and when it came time, it would create an entire B-tree index.
And this was a lot more efficient and provided some performance improvement.
Now, the bulk loading versus the insert gets kind of complicated.
So I'm going to defer this discussion to this gentleman that I found here,
Jens Dittrich.
He wrote this e-book, Patterns in Data Management,
which focuses a lot on SQL queries
and what workloads do they generate,
how do they work, how do indexes work and joining.
But he had a very good explanation
of the bulk loading of B-trees.
So it is important to understand a little bit about how log structures merge trees worked. And
you see in this area right here, let's just say I get some updates and I keep them in memory.
And so all the new writes go to this tree in this area.
But eventually you need to merge that into the next level tree.
And as in this example, this is simply a two-level tree,
the memory tree and a tree on disk. So if you could imagine that this tree on disk has all the data that exists in the database,
and when this memory table needs to be merged into this, and just for the sake of argument,
let's pretend it's going to be merged into this file right here.
You will take the data from this file and this file, and then go off
and create a new file from it. And once that new file is created, the original
file, or set of original files, is deleted. They're no longer needed. So you
have an example here: this is the entire data in a file, or a B-tree in a file, and the
updates come and get merged into it, and once that merge is completed, this file
is no longer needed, so it's erased in its entirety. This right here kind of
shows a similar thing: this leg of the B-tree needs to be merged with
this leg of the B-tree, so that they come together and then
create a new file, a new immutable file that has all the data inside of it as well as the index to find the data.
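A minimal sketch of that merge step, purely illustrative and not Cassandra's or RocksDB's actual code: the in-memory table and an existing immutable run are merged into a brand-new run, and the old one is deleted in its entirety.

```python
# Minimal sketch of a two-level log-structured merge: new writes go to an
# in-memory table; when it fills up, it is merged with an existing immutable
# on-disk run into a brand-new run, and the old run is deleted whole.

memtable = {}                      # in-memory tree: key -> value
disk_runs = [dict(a=1, b=2, c=3)]  # each run stands in for an immutable sorted file

def put(key, value, memtable_limit=2):
    memtable[key] = value
    if len(memtable) >= memtable_limit:
        flush_and_merge()

def flush_and_merge():
    global memtable
    old_run = disk_runs.pop()          # the run being merged
    new_run = {**old_run, **memtable}  # newer values win on key collisions
    disk_runs.append(dict(sorted(new_run.items())))  # written out sequentially
    memtable = {}                      # old run + memtable are now garbage as a whole

put("b", 20)
put("d", 4)          # triggers the merge
print(disk_runs[-1]) # {'a': 1, 'b': 20, 'c': 3, 'd': 4}
```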
All right, so let's do a little bit of a recap here. We started off with an HDD device that, first of all, maintains the spatial
locality of the files on writes, and its media is much more efficient on sequential accesses.
And then we go on over and we pair that up with a log-structured file system that minimizes the head movement of the
HDD by placing data sequentially on media. But it had this problem with GC.
Next comes the log-structured merge tree, which minimized the garbage collection
for the log-structured file system by the creation and use of immutable files.
You put it all together and we end up with a very efficient storage system, and I represent that once again by these beads that are separated by file, with the color indicating each file.
So life was pretty good in the 1990s. You had an end-to-end data placement. First of all,
you had a file system that was optimized for the storage media. It understood that hard drives
really like sequential, that you can get great performance
improvement with these sequential writes that they did. And then you then had applications
that were storing data, such as log structured merge tree that started to use these large block
immutable files. So here you had really an end-to-end data placement, a perfect coming together of all three
components. What happened next was kind of interesting: in came SSDs. SSDs are all built on
NAND, and one of the most interesting things about NAND is that it has some very nice properties. It likes to be written and erased in large segments.
And that type of property, if you thought quickly about it, guess what?
The log-structured file system likes to write in large segments for efficiency
and is used to collecting whatever is left over in fragments and
can do the garbage collection. So put a log-structured file system on top of NAND,
add a storage interface to it, and voila, you've got your modern SSD, and life is great. So now in the system, instead of a hard drive, we have an SSD,
and the question comes up is, how does the addressing scheme of an SSD affect the overall system performance
relative to how an addressing scheme of a hard drive would affect system performance.
So let's talk a little bit about this access right here.
First of all, I did a RocksDB overwrite workload, 100% write workload.
All I want to do is take a look at the write amp of the SSD.
And if you remember correctly, just for argument's sake, if you have a
write amp equal to four, what does that mean? Well, it means for every one megabyte that you write
from the host, you need to have four megabytes of writes plus three megabytes of reads at the NAND.
So write amp of 4 has 7x of the NAND bandwidth of a write amp of 1,
4x being write, 3x being read.
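A small sketch of that relationship, for a fixed 1 MB/s of host writes; the factor of seven from the talk falls straight out of it.

```python
# NAND back-end traffic implied by a given write amplification (WA), for
# 1 MB/s of host writes: WA MB/s of NAND writes plus (WA - 1) MB/s of
# garbage-collection reads.

def nand_traffic(write_amp, host_mb_s=1.0):
    nand_writes = write_amp * host_mb_s
    gc_reads = (write_amp - 1) * host_mb_s
    return nand_writes, gc_reads, nand_writes + gc_reads

for wa in (1, 4):
    w, r, total = nand_traffic(wa)
    print(f"WA={wa}: {w:.0f} MB/s written + {r:.0f} MB/s read = {total:.0f} MB/s of NAND bandwidth")
# WA=4 generates 7x the NAND bandwidth of WA=1 for the same host workload.
```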
So I took a look at the normalized host bandwidth,
and by normalized I simply said, oh, okay. That's going to be the write plus the read normalized to this point right here of one.
So here's the normalized host bandwidth, and on this axis right here,
I have my number of background jobs,
and that's how many simultaneous merge operations I'm doing in this log-structured merge tree.
So if you take a look at it, I've got one background job running.
I get about a 1x bandwidth.
That's my baseline.
But if I take a look at this 3.84 terabyte drive,
well, what do I mean by 3.84 terabytes?
Let's just say I have four terabytes of data
of NAND on the device,
and I format it to be 3.84 user capacity.
So there's a little bit of OP in there,
but not a whole lot.
But when I scale it to four, I virtually get no increase in throughput. So
what did I do? I wanted to speed things up. I increased the number of background jobs
and nothing happened. All right. So good enough. So what if I did another thing?
So what if I took that same drive and I reformatted it to 3.2 terabytes? Oh, okay, that's pretty good, and
you can take a look at this line right there, and I was able to scale by 2x. So doubling the number
of background jobs, I got 2x the bandwidth, so I got some type of scalability. But then, just for grins, I took that same drive and I formatted it
down to two terabytes of user capacity, and I got about two and a half times
the 1x bandwidth of the one background job. So there's a little bit more scaling in that.
So what is the difference between these three? Well, we know that it's user capacity,
but we also know that the less user capacity I have,
the more I can give to over-provisioning.
So this has, well, you know, 100% over-provisioning,
50% capacity, 50% OP,
and by the definition of over-provisioning, that's 100%.
This is a little bit more, and this is more.
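Those over-provisioning percentages follow from the usual (raw - user) / user definition and the four terabytes of raw NAND assumed above (ignoring decimal-versus-binary capacity details):

```python
# Over-provisioning for the three formats of the same 4 TB (raw NAND) drive.
raw_tb = 4.0

for user_tb in (3.84, 3.2, 2.0):
    op_pct = (raw_tb - user_tb) / user_tb * 100
    print(f"user capacity {user_tb:>4} TB -> ~{op_pct:.0f}% OP")
# ~4% OP, ~25% OP, and 100% OP respectively.
```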
If I took a look at the write amp, wow, this 2 terabyte right here
pretty much had a write amp of 1 across the board.
And you would expect that because there's so much extra space
that I can literally rewrite the entire drive before I do any garbage collection.
So you would expect a write amp of about one.
The 3.2 terabyte went from a write amp close to one here to a write amp of around two.
On the 3.84, I started off at a write amp of two and a half,
and then it ran up to a write amp of 3.5.
So you can see this write amp of 3.5 here represents a considerable amount of NAND workload,
because that's 3.5 of writes plus 2.5 of reads as the total NAND bandwidth, so I got a total of 6 megabytes of NAND bandwidth for every
1 megabyte of host traffic.
Well, if I take a look over here with a write amp of 1, well, for 1 megabyte of host writes I get 1
megabyte of NAND bandwidth, so I have one-sixth the NAND bandwidth going here. The back end of the SSD is one-sixth as busy.
So it enables that extra bandwidth to be given to the host,
and this is why we see this scaling of the throughput.
So if I took a little bit of a deeper dive into it, well, okay, Mark,
was there any spatial locality hints or information given by the file system?
And in this test workload, I used XFS.
And I took a block trace of XFS, and this was for a number of background jobs; I think I set this one to four.
And you can see for the most part, if I take a look at the LBA ranges in one period,
there's typically four operations being written at a time.
Sorry about this, but this axis is the LBA written,
and this is simply the commands coming to the drive.
So right here looks like a small straight line,
but that's the sequential write to the SSD.
Here's another sequential write.
Here's another sequential write.
And here's a mostly sequential write through here.
Any hard drive, if you would have done this,
would have had four different sequential threads,
and the spatial locality of the operation would have been maintained,
meaning this file would have been all located physically close to each other,
and same with this file and this file and this file.
But with the SSD, spatial locality is not maintained. And to demonstrate that, once
again, I'll go back to if this was a hard drive and you're doing updates, things are
changing in place or we go back and forth. You know, maybe I write this amount to this data.
I write this amount.
I come back here.
I can start off with this LBA and finish it all up.
Now, this file right here has complete locality on the media.
It's all sequential.
But an SSD with a log-structured file system doesn't store based on spatial locality; it stores based on temporal locality.
So as writes come into these different files, it all gets mixed up in the same
erase block. So when it comes down to erasing, for instance, this little blue-green file,
I've got it all over the place,
so there's going to need to be a lot of garbage collection, and hence a higher write amp.
So write amp goes up in this system simply because we did not maintain the spatial locality of the files.
They were stored based on temporal locality.
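A small sketch of that difference: four files arriving as interleaved sequential streams, placed either temporally (arrival order into shared erase blocks) or spatially (one zone per file), and then a count of how many files share each erase block. The chunk and block sizes here are arbitrary.

```python
# Four sequential write streams arriving interleaved, placed two ways:
# temporally (arrival order into shared erase blocks) vs spatially
# (each file gets its own zone). Count distinct files per erase block.

from itertools import chain

BLOCK_CHUNKS = 4  # chunks per erase block / zone

files = {f: [f] * 8 for f in "ABCD"}             # 8 chunks per file
arrival = list(chain(*zip(*files.values())))     # A B C D A B C D ...

# Temporal placement: fill erase blocks in arrival order.
temporal = [arrival[i:i + BLOCK_CHUNKS] for i in range(0, len(arrival), BLOCK_CHUNKS)]

# Spatial placement: each file is written into its own zone.
spatial = [files[f][i:i + BLOCK_CHUNKS] for f in files for i in range(0, 8, BLOCK_CHUNKS)]

def files_per_block(layout):
    return [len(set(block)) for block in layout]

print("temporal:", files_per_block(temporal))  # [4, 4, ...] -> deleting one file needs GC everywhere
print("spatial: ", files_per_block(spatial))   # [1, 1, ...] -> deleting a file frees whole blocks
```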
So let's do another recap.
Like before, we had a log-structured merge tree application that created large immutable files.
In this example, we used an extent file system that placed data sequentially on media in large blocks.
It didn't do any GC, but it still gave enough spatial locality through its addressing to the storage system to intelligently place the files in a sequential order. But then we ended up with an SSD that uses a log-structured file system,
which uses a temporal data placement algorithm and does not necessarily place data sequentially on media,
and hence did not take advantage of the large and beautiful immutable files. And we have high levels of
garbage collection and write amp. So I would really think of this system not as a neat, squared-away box with a bunch of places to store things.
We ended up with kind of more of a bag storage, where everything gets mixed up together.
And we have problems with write amp and performance scalability.
So the question that comes up is how to fix this problem.
Well, we can fix this problem by a very simple fix.
Make an SSD that uses a block addressing method to place data spatially.
And that's exactly what ZNS, or zoned namespaces, does. There's another speaker who is going to give a complete presentation and do a deep dive into zoned namespaces, so make sure you sit in on his presentation
for all the details. With ZNS in place, we've returned to this end-to-end data placement.
Once again, we have, for example, a log-structured merge tree application where the files are all immutable.
Then in this example, we can also have a log-structured file system that maintains the spatial locality of these immutable files.
It allocates a sequential LBA range, a sequential block addressing method, for each file.
Once again, the large files reduce GC.
And then finally, with the introduction of ZNS,
we have a method to communicate to SSDs on which files we want to be placed where.
And once again, that will reduce garbage collection.
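A minimal conceptual model of the zone behavior being relied on here, not the actual NVMe ZNS command set: each zone has a write pointer, accepts only sequential writes, and is reset as a whole when the file it holds is deleted.

```python
# Conceptual model of a zone: sequential-only writes tracked by a write
# pointer, and a whole-zone reset when the data it holds is deleted.

class Zone:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.write_pointer = 0          # next block that may be written

    def write(self, lba_offset, nblocks):
        # ZNS-style rule: writes must land exactly at the write pointer.
        if lba_offset != self.write_pointer:
            raise ValueError("non-sequential write rejected")
        if self.write_pointer + nblocks > self.capacity:
            raise ValueError("write exceeds zone capacity")
        self.write_pointer += nblocks

    def reset(self):
        # Deleting the immutable file that fills this zone frees it wholesale,
        # with no garbage collection needed.
        self.write_pointer = 0

z = Zone(capacity_blocks=1024)
z.write(0, 256)
z.write(256, 256)   # OK: lands on the write pointer
z.reset()           # the file is gone, the zone is empty again
```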
And all the files are very neatly stored in nice little zones that make them easy to migrate, replace, and erase, which reduces garbage collection. So understanding these files and their behavior,
their migration patterns is really important
if you're looking into data placement.
So I started off most of my career using block trace
to analyze the workloads of storage systems.
But over time, it's become more and more apparent
that these LBA addressings or these blocks written
aren't independent writes.
They're really part of files.
And that to understand workloads,
you need to go up a level
and start taking a look at the migration patterns of files
and their different characteristics.
So the study of files I call phylogy.
So I took a look into the Cassandra files,
mostly because everybody starts off and does RocksDB,
and Cassandra is also a log-structured merge tree.
So I just decided to do something a little bit different
and examine Cassandra instead of a RocksDB workload or a RocksDB application.
So basically, Cassandra is a KV store that uses this log-structured merge tree.
And on this right side right here, I've got a list of all the files that are created by, not all the files,
a list of files that were created by Cassandra. And you can see that there are
these groups of files that are designated on there and by far the
largest is this data.db. In RocksDB, the rest of these files are all part of the main SST file, but for some reason Cassandra does them a little bit differently; it has the file layouts a little bit differently. So I'm mostly interested in the data.db file, because it represents the majority of the data storage of the system. In this example, I just set the size of the data.db file to 256 megabytes, and you can see it came out at 257 megabytes.
The test workload was simply YCSB, with a thread count of one. About the only change I made to the script was I went out and set it for a uniform distribution
versus the default, which was the Zipfian.
And I put 15 million key loads
and then I followed by 16 million random puts
just to make sure everything was steady state
near the end of it.
Background compactions I set to eight,
and then I set a target file size of 160 megabytes.
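A sketch of that workload configuration expressed as YCSB properties; the property names are standard YCSB ones, but the values and the file name here are just a reconstruction of the settings described, not the actual script used.

```python
# Reconstruction of the workload settings described above as YCSB-style
# properties (requestdistribution, recordcount, operationcount, threadcount
# are standard YCSB property names; the values mirror the talk).
workload = {
    "requestdistribution": "uniform",   # instead of the default zipfian
    "recordcount": 15_000_000,          # 15 million keys loaded
    "operationcount": 16_000_000,       # 16 million random puts afterwards
    "threadcount": 1,
}

with open("workload_uniform.properties", "w") as f:
    for key, value in workload.items():
        f.write(f"{key}={value}\n")

# The Cassandra-side knobs (background compactions, ~160 MB target SSTable
# size) are set in the Cassandra/compaction configuration, not in YCSB.
```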
So data collection on data.db files.
You know, I looked at block traces for so much of my career for workloads, but
there is a Linux command called inotifywait, and inotifywait allows you to
point to a directory or set of files and capture events for those files. Those events could
be anything from when a file is read, when it's
opened, to when it's deleted. The one problem I had with inotifywait was
that it had, I believe, one-second resolution, and when you're opening many
files very quickly, that wasn't enough. So one of the guys from our research team
went in and modified inotifywait to give me nanosecond resolution.
This slide right here is simply the man page for inotifywait.
You can see it's pretty simple.
You have the events over here that you're allowed to watch for, and then you point to a directory on it.
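A sketch of how inotifywait can be scripted for this kind of capture; the flags and event names are standard inotifywait options, while the data directory path and the Data.db filter are placeholders.

```python
# Launch inotifywait to watch a data directory and stream file events with
# timestamps. -m = monitor continuously, -r = recurse, -e = events of interest.
# The path below is a placeholder for the Cassandra data directory.
import subprocess

cmd = [
    "inotifywait", "-m", "-r",
    "-e", "create,open,close_write,delete",
    "--timefmt", "%s",                          # epoch-seconds timestamps
    "--format", "%T %w%f %e",                   # time, full path, event
    "/var/lib/cassandra/data",
]

with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
    for line in proc.stdout:
        timestamp, path, events = line.split(maxsplit=2)
        if path.lower().endswith("data.db"):     # only the main data files
            print(timestamp, events.strip(), path)
```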
So the very first thing I looked at in the data.db files was their lifetime.
And I like this graph because it's pretty interesting
if you take a look at the different levels or the different life cycles. Let's go over this
graph a little bit. On this axis right now is the file lifetime, and this is simply the test time. So
this test ran for about 42,000 seconds, but these files down here all have the shortest life cycle.
And then you can see there's these nice even bands,
which is kind of really, really kind of cool when you think about it.
Because if you remember in the log structured merge tree,
you have these growing sets of trees that have
larger and larger data sets. So you would expect the ones that are very, very small
and at the lowest level of the tree would have the highest access rate
and be overwritten the quickest. And that's exactly what you see here. My
guess is, or my belief is, that these are all at the lowest level,
these may be here, and so forth. And this is the reason I changed it to a uniform distribution.
With the Zipfian, I wouldn't see this. With a uniform, you would expect each of the entire,
well, the entire log structure merge tree to have an even workload, so the files in each level should have about the same
lifecycle on it.
When you take a look at these lifecycles of the files, I did find one pretty interesting paper by
Taejin Kim in FAST '19, and I have a reference right here for it. What Kim did was
he was looking at using NVMe streams, and was applying stream allocations to the highest
or the hottest files, and showed a significant reduction in write amp by using the streams, which makes sense, because if you have a super large file
that doesn't get modified, and then you have some high-frequency data in there, you can end up with what we call trapped OP. So this file portion may be on this level right there. This portion comes to here.
This is written once in a blue moon.
This is written frequently.
You mix them together, and you end up with this dead spot in there.
So you may have to go and reclaim just this little bit of space,
and this may not be the best greedy choice for the garbage collection algorithm.
So his paper showed that, hey,
if I group all the high-frequency data together,
I end up with less trapped OP,
and I have a lower write amplification. inotifywait kind of gave some other interesting data for us SSD kind of guys. One of them was the number of open files: when I start running this application, how many files are open at one particular time?
For SSD people, this tells us how many zones or parity contexts
that we need to have to support this application.
And in this version of Cassandra,
I think I used a number of background jobs of eight,
and I saw 45 open files as the maximum number reached through the run.
One could probably imagine that with some tuning on the application side, this could possibly be reduced.
But this is the number I got just running it as is.
Another useful piece of information was this file open time.
And this is interesting for NAND: how long after a file starts to be written, potentially into a NAND block, is it before the file later says,
I'm done.
So this can give you how much time I have to keep the NAND
block open.
And an open NAND block means that not every page is written,
so there's a little bit higher susceptibility to read disturb effects.
So NAND open time is something we NAND suppliers specify and like to keep as short as possible.
If you like this inotifywait, there is the standard Unix version, but I got my patch from
Nicholas Cassell that gave me that increased resolution, to better measure
things like open data files. So I wanted to thank Nicholas for providing that piece of code for me. So we talked a little bit about all these large immutable files, how we can
put them nicely in zones, and everything should work
really well, but life is not really perfect.
File sizes will vary.
I mean, it's not 100% certain you can align these files into NAND storage blocks such that you get this perfect world where, when I delete the file,
it means I erase one NAND block and there's no garbage collection whatsoever.
It may or may not happen, but it does not happen that often.
So when we're taking a look at what we can do with storage,
I kind of like to take a look at, you know, setting up a goal post.
So I think of the left goal post is these are large files.
If I was able to maintain the spatial locality of these large files,
and they may not be perfectly aligned with the NAND blocks,
is there any gain I can expect in reduction of write-up
or being able to utilize more of the disk space?
But on the other side of it, you know, well, this left goalpost is your entitlement. You know, without any other fancy tricks, I should be able to achieve this type of savings. But on the right side, hey, we know that these are immutable files. We can see from the phylogy of Cassandra that there are some high-frequency files and some low-frequency files. If I was able to take that information, I could possibly do better in the utilization and the reduction of write amp.
So let's talk a little bit about this left goalpost.
So one of the first places I saw this, quote, left goalpost,
was this paper on the performance of a greedy garbage collection scheme
in flash-based solid-state drives, published by IBM Research.
And what was interesting about it is, if you take a look at it,
what he did here was he said, oh, okay, if I have a NAND block
and I have, let's say, 512 separate pages in this NAND block,
and I did random writes of a page size into this NAND block.
What is the corresponding write amplification?
So if you take a look at this portion right here,
he talks about the occupancy of it.
And that just means, how much OP do we have?
But on this graph axis right here, we have the write amplification.
And if you take a look at it, this C equals 512 is when the NAND block has 512 locations.
Let's just pick this point right here.
And it has a write amp of around maybe 2.8 or so.
Okay, that's pretty interesting.
But let's take a look at this other end of the spectrum.
Now, what if I had my NAND block just once again,
but instead I only had four blocks,
and I was doing random writes to these four blocks.
Huh.
Well, it doesn't take much.
If I just happen to get a hit here,
I now have a much lower write amp,
because I only have to garbage collect these two blocks right here.
So, quote, dividing the NAND into larger blocks,
or hence writing larger blocks, also reduces write amplification.
So if you take a look at that same point, it's about 1.5 or so.
So I went from something like a 2.8 to something like a 1.5, just eyeballing it,
a change in write amplification simply due to the size of the write, or the percentage
of the write relative to the entire NAND block. That was really kind of cool.
So I really liked the work that IBM Research did.
I think they introduced the concept that there's more than one way
to reduce the write amp of a system. You can
reduce the write amp, as we all know, by adding over-provisioning,
but you can also reduce
the write amp of a system by increasing the size of the file, or increasing the size of the write,
and it's really meant as a percentage. So, as we saw earlier, if I could do
random writes of 25% of the NAND block size
at a time, I was able to reduce the write amp from 2.8 to 1.5. So that was a big gain. So
I wanted to go off and reproduce that. So I created an FIO script. And what I did in this FIO
script was I varied the block size such that it went anywhere from practically zero, meaning a very small, let's just say, 4K random write, to a very large write that was 2.6 or so times the size of the zone capacity, or, you know, what we also think of as the NAND block size.
So this is kind of interesting, because when you get to these larger block sizes,
with this axis being the ratio of the block size to the zone capacity,
and this axis being the write amp of the system,
as the ratio of the block size to the zone capacity increases,
we got some very steep drops in the write amps of these curves, you know,
down here.
So without spatial locality, I'm kind of stuck on this axis,
meaning with the way that the SSD stores everything temporally.
I got those four workloads.
They're all getting merged together.
The only way to scale performance,
improve quality of service is to add OP,
which takes away from the user capacity.
And you can see in this system, if I'm at this node,
the only way to achieve a write amp of one is to kind of reformat my drive and give up half of its capacity to over-provisioning and only use half of its capacity to store the database or whatever I'm trying to store on the system. So as I go and I take a look at this node right here, this 4,
if I was able to achieve a ratio of about, it looks like about 2.6,
of the write size over the zone capacity, I achieved a write amp of 1.
And that's actually very, very cool, because for performance scaling and quality of service,
if I had to use a two terabyte drive before, suddenly, with spatial locality, and using ZNS for the
spatial locality, I can almost double the amount of capacity or disk utilization. So where before it might have taken two SSDs
to store a four terabyte database,
I can now store that on a single drive,
cutting my storage costs down considerably
just by increasing the utilization.
And just to make sure that this is clear,
so imagine that I have like these NAND blocks right here.
And what I mean by this 2.6 or so is that I go off and I just get lucky. I start here. I write one, two, 2.6 NAND blocks.
And when later that large immutable file is deleted, wow, look at this.
This NAND block needs no GC.
This NAND block needs no GC.
And I got 0.4 of this block right here.
So potentially this might be GC, but with a little bit of over-provisioning,
one of these files or one of these parcels gets overwritten,
so I get pretty much close to a write amp of one.
So I can increase my user capacity, increase my performance,
but there's one other really super cool thing that comes along with this.
The higher the write amp, the more PE cycles I need to service the lifetime writes of a drive.
And let's just say, for argument's sake, that at a write amp of four I needed 4K PE cycles in order to support the write workload
of this particular SSD over its lifetime.
That would be at a write amp of four.
But if I was able to use spatial locality,
and I changed my log-structured merge tree to write at 2.5 times the zone capacity,
I could then claim a write amp equal to one,
which means that the PE cycles should go down roughly accordingly.
So this now goes down to maybe like 1.2K or 1.1K.
The reduction is a little bit less than proportional because the OP doesn't count in it. So there's a little bit of background there,
but let's just say it's about 1.1K for the sake of this discussion. So now, guess what? I had a 4K PE cycle NAND, and now I need something like 1.1K.
Wow, this might be a QLC NAND, where that was a TLC NAND.
So this has a better cost structure.
It stores more bits per cell.
So not only did I increase my user capacity, increase my performance scaling,
but I've also enabled
lower cost media like QLC. So ZNS with spatial locality, in this end-to-end data placement,
does a lot of good things, and one of them is that it accelerates the transition to this
higher density, this QLC, with more bits per cell,
once again further increasing capacity.
So this 4 terabyte goes maybe to a 5 terabyte if I had QLC drive.
So 2 terabytes in TLC to 5 terabytes.
So this is a pretty big gain in utilization.
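A sketch of that endurance arithmetic; the write-amp values and the 4K-cycle figure come from the talk, while the lifetime-host-writes number is a hypothetical chosen so the example lines up.

```python
# Required program/erase cycles scale with write amplification:
#   required PE cycles ~= lifetime host writes * WA / raw capacity
# Holding the host workload fixed, dropping WA drops the PE-cycle requirement
# almost proportionally (minus a small correction for the OP share).

lifetime_host_writes_tb = 4000.0   # hypothetical lifetime host writes, in TB
raw_capacity_tb = 4.0

for wa in (4.0, 1.1):
    pe_cycles = lifetime_host_writes_tb * wa / raw_capacity_tb
    print(f"write amp {wa}: ~{pe_cycles:.0f} PE cycles needed")
# ~4000 cycles at WA=4 (TLC territory) vs ~1100 at WA~1.1 (within QLC reach).
```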
But it all comes back with this end-to-end data placement and
maintaining the spatial locality of the rights throughout the entire system.
So what do we do with this information of data placement and the spatial locality?
Well, if you can imagine an imaginary system that goes off and has three components to it,
the very first component is any application
with large immutable files.
And what this does, it starts the process of data placement by using these large immutable files that are written once,
maybe read many, but deleted at one time.
We now know that a log structure file system is very good at utilizing the large immutable files,
but there might be a little bit of GC left on remnants.
But this log structure file system
will pass down spatial information
via the addressing method of the drive.
In a standard SSD, the spatial information is lost
because the SSD is using a log-structured file system and is storing data temporally. So as things come in,
it just stores them based on time. But a ZNS SSD uses that spatial information
to go and place the data in this really nice order.
And we know by now that this nice order, when it comes time to move or delete files,
makes it so much easier to reclaim the space, and that reduces write amp.
And we also know that the lower write amp also starts enabling things like QLC and even further cost reduction,
while at the same time increasing the scalability of performance and also our quality of service.
So if we were to have large block immutable files on top of a log structure file system, on top of a ZNS storage device,
life returns to being good.
And so that's the left goalpost.
I mean, with this type of system and large files,
you can use your Unix tools to take a look at your average file size.
You can see if they're immutable.
And if they meet all those criteria,
you can go off and use ZNS to significantly reduce the cost of the system
while improving performance and quality of service.
At the right goalpost,
I'm going to kind of leave a little bit for this talk that Hans Holmberg has
done. It's called ZenFS, Zones and RocksDB:
Who Likes to Take Out the Garbage?
But he takes all of this information, knowing that this is a log-structured merge tree, that I've got some high-frequency writes, and I'm going to utilize all of that information. I'm going to put together a system that's going to be the lowest cost, highest performing, and have a write amp close to one. So make sure you listen in to his talk to find out exactly how he did that.
In summary, what we found is that large immutable files with the right file systems can significantly reduce the write amp.
We talked mostly about files in a log-structured merge tree, but when you expand it a little bit and think about,
oh, if I have sets of files in directories, and these directories get archived in order,
like, this is January data, and at the end of the month you move January to lower cost storage, then
February's data to lower cost storage, that can be used as spatial locality, and something like a ZNS drive
that maintains that spatial locality means that when it comes time to move those directories or sets of files,
they move as a bulk. And what that enables is a lower write amp. So it's not just the log-structured
merge tree. It's really any type of files or directories or sets of directories
that move or migrate at the same time. Now, for traditional SSDs, I think by now we should understand
that they use temporal locality. We know that we have NVMe streams, but they really
didn't provide the scalability; that was trying to use another method. The best method
is to have an addressing capability in the SSD, such that the SSD can understand that the addressing is inferring some spatial
locality, and the SSD will go off and maintain that spatial locality. And when we put that all
together, we have lower cost storage. We start enabling QLC. We improve the scalability and improve the quality of service.
Well, I'd like to thank all of you for taking the time to listen to my talk. Feel free
to drop me an email if you have any questions, or at the end of this
talk there'll be some time for Q&A.
Thank you very much.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers
in the storage developer community. For additional information about the storage developer conference,
visit www.storagedeveloper.org.