Storage Developer Conference - #104: Introduction to Open-Channel/Denali Solid State Drives

Episode Date: August 5, 2019

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 104. My name is Matthias, and I want to talk about open-channel SSDs.
Starting point is 00:00:47 I've been doing this for a couple of years now, and I'm kind of going to give an introduction to open-channel SSDs: what is it, where are we going, and then we're going to talk a little bit about what's next. So we're at the stage now where we're going to look a little bit forward and see where we're going to end up and what we're going to work on in the next one or two years. So I previously was at a startup and now I'm at a big company,
Starting point is 00:01:10 so I get to present this nice-looking slide which basically says that you shouldn't trust what I say. It doesn't mean anything, but still take it with a grain of salt. So that's all good. But you got lawyers. I got lawyers.
Starting point is 00:01:25 I got lots of them. Like, that's SanDisk. I mean, we're lawyered up. I mean, we don't need it anymore. Yeah. So there's, yeah, the motivation, the interface, the ecosystem, and what's next. And there's some great news here regarding standardization. For me personally, it's something very big for me,
Starting point is 00:01:46 something I look forward to. We're going to talk about that. So open-channel SSDs, why are they interesting? So there's this thing, when you look at an SSD today and you do 4K random reads to it, and you kind of look at the latency and how long it takes,
Starting point is 00:02:06 and then you look at the percentiles, like how many of the IOs complete within 100 microseconds, 150 microseconds. And you can kind of draw this curve, and if you only have reads to an SSD, it's pretty consistent. Like most of your IOs complete within 150 microseconds. So that's great. Obviously, over time, you will have these outliers, but generally, your curve will look like that.
Starting point is 00:02:29 The problem with SSDs is that as soon as you start adding just a small bit of writes, that's when it hits you, and you start having these outliers. Four milliseconds, six milliseconds, and so on. And I mean, that's okay. They're still really fast. But the thing is, if you're running these things at scale and you have 100 SSDs, 1,000 SSDs,
Starting point is 00:02:53 and you have to go ask each of them for this answer, and you're kind of waiting on the last one in that row, like you need to get all the data back before you can give back the final answer, then your latency will be those four milliseconds. And that's not good enough. So how can you either control it, make it better, and how can you kind of make sure
Starting point is 00:03:15 that it doesn't happen so you avoid it? And obviously you can say, well, if I can isolate my workloads and say, well, I have one SSD. They used to be small, like a couple hundred gigabytes, and your workload could fit within that. That's great. The thing with NAND in general, the media that is within SSDs, is that they're getting bigger and bigger, and more and more dense. And that means that,
Starting point is 00:03:52 if you're a hyperscaler and you run virtual machine workloads, then you're going to have multiple customers sharing the same drive. And they're not coordinating. One guy can read, another guy can write. And suddenly everyone has really bad performance. And that drops into that kind of problem where you have a multi-tenancy environment, and you kind of have this where many users
Starting point is 00:04:12 are sharing the same drive, and you get these unpredictable latencies. The other part is that all these different users have different workloads. Some of them are running databases. Some of them write in place, like a MySQL database. Some of them write out of place, like RocksDB.
Starting point is 00:04:30 You have sensors, you have analytics, virtualization, you have video. And they all have different characteristics, how much can be compressed, and so forth. And in these environments, you stuff them all onto the same SSD. How can you build an SSD that has to be generic for all these workloads? That's the hard part: it's really hard to make an efficient SSD that works for all these workloads. So what usually happens is that a customer, a hyperscaler or an all-flash array guy, tells them,
Starting point is 00:05:03 they have the volume: this is my workload, please go optimize for it. And then an SSD vendor will go off and do these optimizations. And you can do that, but then you're optimizing for that particular workload. And the problem that especially hyperscalers have is that obviously you know some workloads,
Starting point is 00:05:23 they don't change that much, but in general, you don't know, you say my workload is this one week, and the next week is something else. So you cannot say this is general, and it changes too much, and they have to plan years in advance for kind of when is the new drive, when are we gonna deploy, and so on. So that's not really feasible for them,
Starting point is 00:05:41 so the idea here would be to kind of, can we shuffle this around a bit. So before that, I just want to give a short introduction to SSDs in general. And basically, within an SSD, you have this media on the bottom, the dies, the NAND dies. And when you access those, you can read from them. It takes like 50 to 100 microseconds.
Starting point is 00:06:06 When you write to them, it takes like one to 10 milliseconds, and when you erase them, it's like three to 15 milliseconds, so the idea is that you can read anywhere from within the NAND chip if you want to, but you have to write sequentially within this thing called a flash block, and if you wanna write again, you have to erase it. That's kind of the constraints that we work under. There's a lot more detail,
Starting point is 00:06:30 but that's the general kind of things around it. And then the way that the SSD gets its performance is that we're gonna shuffle a lot of them together, and that's how we get the throughput of the drive, and we kind of get this parallelism going on and so on. That's why you see the curves, like at low queue depths, you have, like, so-so performance, whereas when you start firing off higher queue depths at them,
Starting point is 00:06:54 then you see you get the bandwidth of the drive. That's because you can utilize more of these at the same time. That's really great. So what you probably notice is that NAND has this read-write-erase interface, but up at the host, we're actually talking read-write. So there has to be something in between.
Starting point is 00:07:09 So this can be in hardware and firmware, whatever. But generally, you have to have this logical to physical translation layer where you translate between the read-write interface and the read-write-erase interface. So that's the translation map. You have wear-leveling, so NAND is not perfect at all.
Starting point is 00:07:32 I mean, you're lucky if you can get your data back unless you have some really good ECC scheme. That's kind of what it boils down to. So you need to make sure how you place the data. You have to wear it evenly. You have to do a lot of tricks to kind of make the promises of a normal hard drive,
Starting point is 00:07:50 and to get to that durability level, you actually have to do quite a lot of work. Then there's all this bad block management, there's media error handling, which is where, yeah, you do all the stuff that actually makes the media usable and actually work as you expect it to. So that's kind of what goes on within an SSD in the broader picture.
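To make that logical-to-physical indirection concrete, here is a minimal toy sketch (a bash associative array standing in for the map; an illustration of the idea, not how any real FTL is implemented): writes always land on the next free flash page, and rewriting an LBA just remaps it, leaving the old page stale for garbage collection to clean up later.

  declare -A l2p             # logical-to-physical map: LBA -> physical page
  next_page=0
  write_lba() { l2p[$1]=$next_page; next_page=$((next_page + 1)); }   # append-only: always take the next free page
  write_lba 10; write_lba 20; write_lba 10    # rewriting LBA 10 leaves its old page (page 0) stale
  for lba in "${!l2p[@]}"; do echo "LBA $lba -> physical page ${l2p[$lba]}"; done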
Starting point is 00:08:16 But that's not the only thing. The other thing is that when we look at the host, it's actually also a problem. If I'm an application with data, I sit on the top in user space, and I'm RocksDB, and I'm communicating, I want to read LBA 10. Fine.
Starting point is 00:08:34 And it hits my file system, which is kind of also log-structured. So RocksDB is log-structured, and the file system is log-structured, and what goes on is when I hit my file system, it might convert it to LBA 20. Fair enough. And then the thing is, with the flash translation layer from before,
Starting point is 00:08:51 that's also kind of log-structured. So, again, I'm going to end up maybe on LBA 30 or something like that. So, I have no idea, when I start up in RocksDB and go down through the layers, where my data is actually ending up. And the thing with SSDs is you want to co-locate data on a flash block
Starting point is 00:09:12 such that the data have the same age. You want to make sure that you don't, at some point, have half active data. When you come around and you want to garbage collect, if you have like half of the flash block with active data and half of it cold, then half of it will be invalid, and nobody cares about that, but you still have to move
Starting point is 00:09:28 the active data away. And that part is really expensive. That's called write amplification on SSDs. And that's what we wanna avoid, and that's what Open Channel tries to solve. And that brings me to this indirect write. So within an SSD, we have the log structure, but we are also writing into a write buffer.
Starting point is 00:09:49 So when you're gonna write to this flash block, you're not writing 4K at a time. You do if you maybe have SLC or something, that kind of memory. But in general, you write at like 16K or 48K or something like that, or maybe even more than that if you're writing across something called planes. So you kind of collate all the data together
Starting point is 00:10:09 and then you flush it out to the media. And you don't know if you just started at the start of a block or at the end of a block, of a flash block. The user has no knowledge of that. So that's a big problem. And it basically becomes this best-efforts approach where, I mean, we do our best with the data we got. And the SSD has a read-write interface.
Starting point is 00:10:36 Obviously, it does have more today with NVMe streams, for example, so we can do more. But in general, up to recently, that was kind of what you had, and the SSD could only do so much. So that's kind of what we want to look at. So open-channel SSDs, the goals are that we have IO isolation, so these dies, you want to be able to access them independently.
Starting point is 00:11:01 We have predictable latency, so we want to make sure that we can avoid having these outliers that we saw in the beginning. And we want to control the data placement and when we actually access the drive, access the media. And so this comes into, we want to know, like, if we know the boundaries of the flash block, we can then be smart, and maybe the host already knows how hot my data is and what kind of data fits together. Let's put it on the same flash block so when I'm gonna need to erase it,
Starting point is 00:11:29 I can do it at the same time. So often the way this works is that on the first pass you just write it in, and then later on you optimize it and split the data out into different superblocks. I'm gonna get back to that, but that's generally how it works. And so what Open Channel is, is that we take these parts, the logical to physical translation map, we take the garbage collection,
Starting point is 00:11:54 and then we kind of split the responsibility in two. That's what's happening. We have this logical to physical translation map, which we now give to the host, or the SoC, or whoever we want to give that responsibility to. The garbage collection, if we know, so usually when you get these outliers in the SSD
Starting point is 00:12:12 and you don't know why it happens, because the user's not doing any writes but the drive is, then hey, it's because of the garbage collection within the drive. We'll move that into the host so that we can make the decision of when to do it. And with the data placement, if the application doesn't need garbage collection
Starting point is 00:12:28 in the traditional way, we can avoid that too. For example, RocksDB has a log structure in itself, so we don't need, in the common sense, a garbage collector, which is normally within the SSD. Then there's the wear leveling, which we kind of abstract away a little bit. We don't want to tell the host, we have like the erase cycles, PE cycles in SSDs.
Starting point is 00:12:49 We don't want to tell the host that detail because it's meaningless. It can change over the lifetime of the SSD and so on. So instead we want to give the host a hint of where to place the data. And that's what we're gonna do. Obviously, now that we moved all this up to the host, well, now we need software to drive all that.
Starting point is 00:13:08 So if you wanna have an open-channel drive and let it be a block device like a normal drive, we can build a host-side FTL, which does this L2P, the logical to physical translation map, does the garbage collection, the wear leveling. And it has similar overheads to traditional SSDs. So one of the things with SSDs that are on the market today is that if you have one terabyte of storage media,
Starting point is 00:13:34 you usually use one gigabyte of DRAM to kind of buffer that mapping table in the drive. There's different optimizations you can do, but if you don't want to do the kind of lookups that give you extra latency and so on, you're going to need one gigabyte of DRAM for one terabyte of storage. So that's really bad.
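As a rough back-of-the-envelope check of that ratio, here is the arithmetic, assuming a page-level map with 4 KiB pages and 4-byte entries (common ballpark figures, not a specific product); the second half uses the 16 MiB chunk size from the demo later and an assumed ~32 bytes of per-chunk state to show why a chunk-granularity map only needs a couple of megabytes per terabyte.

  media=$(( 1024 ** 4 ))                       # 1 TiB of media
  entries=$(( media / 4096 ))                  # one map entry per 4 KiB page: ~268 million entries
  echo "$(( entries * 4 / 1024 ** 2 )) MiB"    # ~1024 MiB, so about 1 GiB of DRAM per TiB
  chunks=$(( media / (16 * 1024 ** 2) ))       # 65,536 chunks of 16 MiB per TiB
  echo "$(( chunks * 32 / 1024 )) KiB"         # ~2048 KiB, so about 2 MiB if you track ~32 bytes per chunk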
Starting point is 00:13:51 That costs money. That's just how it is. And also, like file systems and databases, I mean, if we can integrate it there and utilize what we already have there as the FTL, then we can kind of remove this layer that is within the SSD, the FTL, remove that logic,
Starting point is 00:14:10 and kind of integrate it into what we already have today. So when we look at an open-channel SSD, we look at the concepts. Like, we break it out: what is it actually we are introducing with this. So there's a concept called chunks, which are sequential-write-only LBA ranges. You can say that this maps onto a flash block where you have to write
Starting point is 00:14:33 sequentially within it. That's the kind of concept that we expose up to the host. Then we want to make sure we can align the writes to these internal block sizes, so we want to tell the host what the boundaries are that it should write at. Then there's hierarchical addressing. So we have these parallel units,
Starting point is 00:14:52 which we're going to talk about. How do you address them individually? We have host-assisted media refresh. So an SSD actually does scrubbing internally and moves data around. We don't see it, but it happens. We could move it around as an SSD, but we can also tell the host, please move it around, because then the host would know,
Starting point is 00:15:11 and it can kind of schedule it out. Maybe it doesn't need to because the data is somewhere else. But there's all these optimizations that we can start applying. And then there's this host-assisted wear leveling that I was talking about. So for chunks, so it's, yeah, it's a range of LBAs, right, so it's purely sequential.
Starting point is 00:15:29 The cool thing about it is that we reduce the DRAM for the logical to physical mapping table by orders of magnitude, so instead of having a gigabyte of DRAM per one terabyte of media, we can then get away with like one or two megabytes per terabyte, which is great. Like, this is the best case, obviously you wanna have
Starting point is 00:15:49 some hybrid, but generally that's the range we're in. And we can do this hot-cold separation: when we start writing, you can kind of write to this chunk with the cold data and to that one with the hot data and so on. And then the key part is that, as I said with the NAND flash, you need to erase it before you can write again. So within the open-channel world,
Starting point is 00:16:12 that's called a reset, so when you want to go back and write it again, then you need to reset it. Oh, there it is. Perfect. So basically a chunk starts in this free state, and when you start writing to it, it changes state to open. Then you write a bunch to it, and then at some point it's full.
Starting point is 00:16:34 Like, say you have four megabytes, that's what it is. Then when you wrote four megabytes, it will go into a closed state. And then you have valid data there, until you want to use it again, and then you have to reset it and go back to the start. So that's kind of how you normally work with it. So for those of you familiar with, like,
Starting point is 00:16:50 SMR drives, like shingled magnetic recording drives from the hard drive world, that's basically the same device model. We kind of, we took that from the SMR world and brought it into Open-Channel. So that meant that we could both use the ecosystem that is already there for SMR, but it also means that we have a tried model
Starting point is 00:17:08 for how to do this. So it's more or less by accident that it kind of fits together, both for SMR and for this open-channel model. Cool, so, and obviously, now that we are talking SMR drives, some of these drives have conventional zones and they have sequential-write-only zones.
Starting point is 00:17:31 And we have chunks. So in SMR it's called zones, and in open channel it's called chunks. So the open-channel spec kind of allows that, and if it's conventional, you can do random writes as well as sequential, that's all good. And then for those chunks where we need to write sequentially, we define that and we have to do that.
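This zone/chunk state machine is exactly what the zoned-block-device tooling shows, which is what makes the SMR ecosystem reusable here. As a small hedged sketch using util-linux's blkzone (the device path is an assumption, and this only works once the drive is exposed as a zoned block device, like in the demo later):

  blkzone report /dev/nvme0n1              # per zone: start, length, write pointer, and state (empty, open, full)
  blkzone reset -o 0 -c 1 /dev/nvme0n1     # reset the first zone back to empty so it can be written again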
Starting point is 00:17:48 The host software needs to do that sequential writing. That's all great. So then there's hierarchical addressing, where we kind of see you have a normal SSD down here. And normally within an SSD you have like a NAND controller and then you have multiple channels. And then on top of those channels, you have like these dies attached that you saw before.
Starting point is 00:18:10 So in open channel, we call those channel groups, and then we call the dies parallel units. So the guarantee for a parallel unit is that it is independent from the other parallel units. So you can do read or write to either of them at the same time and they won't conflict. Obviously if they share the same channel or group, they do share some bandwidth constraints,
Starting point is 00:18:31 but in general you can do read and write at the same time. And the important thing here is that this is all, it's not a one-to-one mapping. It doesn't have to be. In implementations today it usually is. But you could have like four dies that are grouped together and be one parallel unit and expose that up to the host if you wanted to as an SSD.
Starting point is 00:18:52 But it's not a requirement. You can do it however you want to. And then how does that look? How does the host address it? So it has these LBAs that we saw before. We lay the chunks on top, and then within those, then a parallel unit has a set of chunks within it,
Starting point is 00:19:11 and then there's the group that kind of groups the parallel units together. And that's all exposed up through the NVMe address space, like the LBA address space that you get through your normal drive. You just tell the host this is how it works. So that means you have your nvme namespace
Starting point is 00:19:26 and then you have your groups, your parallel units, and the chunks. That's a logical way to look at it. That's how it's exposed. Then there's the media refresh we have. The idea is that, yeah, so I talked about this: this is NAND. It doesn't store the data forever.
Starting point is 00:19:46 You kind of need to refresh it. And yeah, you as a drive can do that. But now that we kind of let the host do the data placement, well, now we can tell the host, hey, please go refresh this data. So that's what we're gonna do. And then you can read the data and then write it somewhere else.
Starting point is 00:20:02 Or maybe, so if you're a hyperscaler, you might have three different copies somewhere in your data center, you don't really need to refresh it. That's the next thing, you can do that kind of optimization as well. Then there's the host-assisted wear leveling, where we kind of, yeah, when we write to the SSD,
Starting point is 00:20:20 we don't know the temperature of the data, that's the idea. And we kind of, when we go through it, when we garbage collect it, so we do all the writes, and we have this concept of a superblock within an SSD. It has many names, but let's call it a superblock. It's basically where you stripe across multiple chunks or multiple flash blocks within an SSD.
Starting point is 00:20:38 So you shuffle everything into that to begin with. You write it out. And then later on, the garbage collector comes around, picks it up, and then starts garbage collecting the data. And then it sees there's some of it that's warm, there's still valid data, and there's some of it that's cold, and then splits it up into two different superblocks. And then as we go on, then we see, hey,
Starting point is 00:20:59 now we have superblocks that have the warm data and the cold data, and we kinda make sure that the data fits together. What we can do is, if we have prior information about this, how they're gonna fit, what's the age of the data, we can do this placement directly at the same time so we don't have to rewrite the data again and again. That's kind of the idea.
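A rough way to see why that placement matters, using a simplified greedy-GC model (an illustration, not a claim about any particular drive): if the superblocks you reclaim are still X% valid, every host write costs roughly 1/(1-X) media writes, so separating hot and cold data so that reclaimed blocks are mostly invalid pushes the write amplification back toward 1.

  for pct in 0 25 50 75; do
    waf=$(echo "scale=2; 1 / (1 - $pct / 100)" | bc)
    echo "reclaimed blocks ${pct}% valid -> write amplification ~ ${waf}"
  done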
Starting point is 00:21:22 We enable the host to know: if I have cold data, I should probably use chunks which are near their end of life. So, say you have a flash block, and that flash block, you can erase that, let's say, 3,000 times. And I use it for hot data a lot, and then suddenly I only have like a thousand erases left. But I have lots of other blocks
Starting point is 00:21:49 which still have three thousand erases left, resets left. I want to make sure that, okay, then I know the cold data. Let's move that into the chunks that only have these thousand erases left, because then we're not going to update it as much.
Starting point is 00:22:06 That's the kind of optimization, so if we know something about the data, we can kind of apply it. So all this together is all in the spec, that's kind of the concepts when you go through the spec, that's kind of what it's built upon. So we have the IO isolation, which we get through the, yeah, groups and parallel units.
Starting point is 00:22:22 You have the fine-grained data refresh, where we do refresh the data. You have the reduced write amplification, because we can kind of place the data down at the right places such that we don't have to rewrite data as much, so it just reduces the amount of rewriting that we have to do. There's the DRAM and over-provisioning reduction,
Starting point is 00:22:40 because we're writing append-only, so we can reduce the mapping table to these couple of megabytes. And then we have these direct writes instead of the expensive internal data movement. So that's all great. That's the spec. It's worth exactly zero if you don't have a software ecosystem on top of it.
Starting point is 00:23:00 So one thing is putting all this together. The other part has been to build this open ecosystem around it. And basically what this means is that we've been building, like in the Linux kernel, we've been extending it such that, for example, we are taking the NVMe device driver and we extended it such that it supports detecting open-channel SSDs.
Starting point is 00:23:22 There's support for the 1.2 and 2.0 specifications. And basically what it does is that there's this LightNVM subsystem in the kernel, which is the open-channel subsystem part. And basically it registers with that part, so we know it exists. It's great. And then the next part is, and then we also, there's a new thing now
Starting point is 00:23:47 which we're going to come back to, it's like we have this zoned block device, and that's how you kind of combine it in with how SMR drives work today. So that's really great. So the LightNVM subsystem, what is that? That's the core functionality which, we say, we register up to this subsystem. So then when a drive comes up, we talk to the LightNVM subsystem, and then we're there.
Starting point is 00:24:14 And then we expose it up to the host. It doesn't do anything at this point. We then say, hey, we want to put something on top, and this can be something called, for example, there's a host-side FTL called pblk, which we can put on top, and then the open-channel drive shows up as a normal block device. So that's cool.
Starting point is 00:24:31 And we could also, if a file system supports these SMR kind of semantics, we could also put a file system here if we wanted to. So that's kind of what we did on the kernel side. And then on the user space side, since there's SMR support, there's support like using libzbc, which is what's used for traditional shingled magnetic recording hard drives.
Starting point is 00:24:53 There's FIO, which has gained some support. There's liblightnvm, that's been built on for a couple of years. That's Simon sitting there as the maintainer of it. There's SPDK; Jim has been graciously taking in patches on Open Channel, and not yelled too much at people, but that's really awesome. So that works as well.
Starting point is 00:25:14 So that's kind of, support is growing and it kind of makes it easier and easier to kind of start using it. So that's really awesome. So all this support kinda started back in 2016, where we put up the subsystem in the Linux kernel. We put up the user space library, liblightnvm, and had support for that in like April last year.
Starting point is 00:25:39 And that went in in like 4.11 of the kernel, and pblk came in in like 4.12, we had the Open-Channel 2.0 spec released in January this year, and it got support in 4.17, so that was pretty quick. Then we got SPDK support here in June that Jim took in, and then we have FIO with zone support that came in in August. And then on the side, we're working on enabling the zoned block device, so we have an open-channel SSD working and showing itself as an SMR drive
Starting point is 00:26:07 and use that ecosystem. And then there's an interesting project, it was at Stanford, now it's at Santa Cruz. There's a guy called Heiner Litz, who has been working on this idea that normally within an SSD you have XOR where you're striping across multiple chunks and then you have this parity on the side.
Starting point is 00:26:26 His idea was that, hey, let's use that, and then use that to give lower latency. So the idea is that if he knows that the data he kind of wants to read is on a busy die, he'll use the parity to kind of recover the data without going into that particular busy die. So this makes sense because the reads maybe take 50 microseconds,
Starting point is 00:26:51 but the write or erase might take 15 milliseconds. It's not that high, but I mean, if you don't have something called erase/program suspend on your chip, that makes a big difference. If you do have it, it makes less of a difference, but you can still get something out of it. It still makes a lot of sense.
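To put rough numbers on that idea, using the figures from the talk (reads around 50 microseconds, a program or erase tying up a die for up to about 15 milliseconds), reconstructing from the parity stripe on idle dies can beat waiting behind the busy one by a couple of orders of magnitude:

  read_us=50; erase_us=15000
  echo "waiting behind the busy die: up to ${erase_us} us"
  echo "rebuilding from parity peers: ~${read_us} us (peer reads issued in parallel, plus a cheap XOR)"
  echo "worst-case win: ~$(( erase_us / read_us ))x"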
Starting point is 00:27:07 And then we're also working on, yeah, on making a new revision of the spec. It's been proven to be fairly robust, so people have been implementing it. I haven't got too many bug reports. I have a small list that I kinda keep track of, and then fix all those small things and put it in a new revision of it.
Starting point is 00:27:25 That's kind of the overview of Open Channel. So one thing I want to show, now that we have time for it today, is that... So this is just to show that it kind of works. So this is kind of inception running, where we have a virtual machine on my laptop, and within that we have a virtual machine again.
Starting point is 00:27:46 So it's kind of slow, but it gets the idea across. So for this we kind of brought it up, and we have this NVMe drive that's open channel, which we can see because it's nvm, that's the prefix we use for the LightNVM subsystem, and it's kind of registered. So that kind of means that we are there. And then we go in here and the block subsystem there,
Starting point is 00:28:12 and let me see, this is normally where you kind of get your information about the block device that you have. And here we have a LightNVM thing. Hold there. And basically here, what we can see is that this particular drive is a 2.0 drive. We can see that it has, per
Starting point is 00:28:32 chunk, 4096 LBAs, so this is a 16 megabyte chunk, because it's 4K LBAs. Then we can see how many chunks there are. There's 64, but then there's multiple parallel units, which means we have four.
Starting point is 00:28:49 So in general, there's 256 chunks in this SSD. So that's great. So one of the things, so this particular kernel is extended with exposing an open-channel SSD as a zoned block device. So that means I can take my favorite command line tool with block zone support, which is basically available in all newer Linux versions,
Starting point is 00:29:12 and I can basically just point it to the open-channel SSD. And then it will list, like, so this is kind of a convoluted format, but basically it shows here's the state of the disk. Chunks and zones, when I intermix them, it's the same thing, roughly. They have a start, they have a length,
Starting point is 00:29:36 and they have a write pointer, and then there's some SMR-specific fields, and then you can see this particular zone is empty. That's great. All that is in place. Cool. Then what we can do is, I talked about pblk. Let's do that.
Starting point is 00:29:51 We can create it. In the nvme-cli tool, we have support for that. We can go in and say, lnvm create. Then we have the NVMe drive, and we're gonna call this particular drive instance, the device we wanna create, the block device, pblk0, and then we're gonna go, the type is, we wanna instantiate pblk as a target,
Starting point is 00:30:17 so the target type is called pblk, and then we're gonna say we had four parallel units, so let's put an FTL on those parallel units. That's great, so that worked. Basically, we can see here in the bottom, yeah, it initialized, so that's great. So let's put a file system on it.
Starting point is 00:30:44 There we go. Oh, the right one. There we go, and it's created. So the file system is on. Let's mount it. So now the picture is: we have the drive, it has pblk, and now we've put a file system on top. So that's great, so that worked as well, yay, that's good.
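For reference, a rough reconstruction of this part of the demo using the lightnvm plugin in nvme-cli (the device name, the pblk0 target name, the parallel-unit range, the choice of ext4, and the mount point are all assumptions here, not a transcription of what is typed on screen):

  nvme lnvm create -d nvme0n1 -n pblk0 -t pblk --lun-begin=0 --lun-end=3   # host-side FTL (pblk) over parallel units 0..3
  mkfs.ext4 /dev/pblk0                                                     # any regular file system works; ext4 assumed
  mount /dev/pblk0 /mnt/pblk0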
Starting point is 00:31:04 And then we can run an FIO script. So we just do small IOs, like 128K, iodepth 1, small file, whatever. Just show, hey, it works. So the performance is out the window, it doesn't matter. So basically show that it kind of works. So it lays it out, and we can kind of read-write from it. So this kind of read-write workload,
Starting point is 00:31:29 we get like a measly 30, 60 megabytes per second. It's nothing amazing. It's there. It works. That's awesome. Yeah, cool. That was the one thing. So we show that we kind of have an open-channel SSD, and we can see that it can be a block device. Let's turn this off again.
Starting point is 00:31:44 And then we're going to kill this one. And remove it. Removed. And kill this pblk one. There we are. So now we took pblk off, and now we're back to the device. So the cool thing I want to show, and that's what we kind of lead up to, is that what if we could just take F2FS and point it to our open-channel drive? So F2FS is actually log-structured internally,
Starting point is 00:32:12 and it has support for SMR. And that means, now that I exposed the open-channel SSD as an SMR drive, it should work seamlessly. Obviously, when I try to create it, well, it sees, hey, I'm host-managed, you really need the zoned block device feature for this, it's required. Okay, so we go in there, and you can do like dash m.
Starting point is 00:32:34 That does the trick. Whoops, it gets created. So that's awesome. So now we can do a mount. This one here, there, and then we can do here. And now I simply mounted F2FS on top of an open-channel SSD; F2FS is the FTL. So what this basically means, so pblk,
Starting point is 00:32:58 if you bring it up, it takes one gigabyte of DRAM per one terabyte of storage. F2FS doesn't do that, it's a couple of megabytes. So suddenly we reduced the need for DRAM by 1,000x. So that's pretty cool. So what we can do now, we can do the same read-write and fire it off, do it twice, and see, kind of get the same kind of idea.
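Again, as a hedged reconstruction of this step (the device, mount point and fio parameters beyond the 128K block size and queue depth 1 mentioned above are assumptions; -m is the mkfs.f2fs switch for zoned, host-managed devices):

  mkfs.f2fs -m /dev/nvme0n1       # build F2FS for a zoned (host-managed) block device
  mount /dev/nvme0n1 /mnt/f2fs
  fio --name=smoke --directory=/mnt/f2fs --rw=readwrite --bs=128k --iodepth=1 --size=256m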
Starting point is 00:33:22 It's like, it just works out of the box. So that's pretty cool. That's what I wanted to show. So for a long time it's been like, you can use Open Channel, but there's no big idea in it if you just expose it as a block device. The cool part is when you can do this.
Starting point is 00:33:38 Or when you can do what, I've heard, some computational startup has been doing with databases, and that's pretty awesome. That's kind of what you could do with this kind of architecture. You could reduce this DRAM, and you can have... And you can also... So say when you were Fusion-io back in the days,
Starting point is 00:33:56 you had all this CPU overhead. Now that CPU overhead doesn't exist, because it's in the file system, and it's something you needed to do anyway, so you get an FTL for free. And that's pretty awesome. So you don't need to have resources for it. And so that's kind of the idea of it. Cool. Then the last thing I want to talk about is what's next. And this has been the path to standardization. So yesterday, the board approved these two new TPARs
Starting point is 00:34:33 that kinda went into NVMe. So there's two main use cases. So these are the official NVMe slides, kind of, as they were presented to the NVMe work group a while back, and there's two use cases. One is hardware isolation, which is these kind of parallel units, and another one is write reduction,
Starting point is 00:34:51 which is the streams/zones kind of approach. And there is a need from certain customers that want to be able to split an SSD into different pieces, which for Open-Channel means that you have these parallel units, and what kind of fixed that up in NVMe a little while back is IO Determinism. And one thing is that we want more of that.
Starting point is 00:35:11 We want to extend it. So that's going in there. And then there's, yeah, where we want to have these zones, where we want to actually be able to place the data more explicitly. So streams kind of fixed this in NVMe, but it's a logical construct, you don't have bounds on it.
Starting point is 00:35:30 The idea would be to add these bounds to make the host more intelligent about placing its data. So this means that there's this work beginning in the NVMe work group. So yesterday in the board meeting, it got approved for phase two, which means now we can start working on the spec. So the one thing is what we call zoned namespaces,
Starting point is 00:35:50 and it adds this chunk abstraction into NVMe. So this is kind of the SMR interface; that's kind of where we want to keep it compatible, such that, if you had the crazy idea that you want to tunnel an NVMe hard drive through that, you could. I don't know if it makes sense, but you could. But we kind of keep it similar,
Starting point is 00:36:12 because, not because of the hardware, but because of the software ecosystem. So the open channel subsystem took three years, roughly, to make, and similar on the SMR side, this has been going on for three or four years, so it's a lot of work that kind of goes into making this support. And we don't want to duplicate that,
Starting point is 00:36:31 because then, if we go in and build a new spec, and a new thing, and a new paradigm, then it's gonna be three to four years before we have a software stack that can actually utilize it. And then there's, so that's one part, and then the other part is the NVM Sets and Endurance Groups management, where we wanna expose
Starting point is 00:36:50 this parallelism further. So usually when you take an SSD that supports IO determinism, you kind of pre-configure it to one configuration. The idea is that you can configure it at runtime if you want to. And I talked a little bit about the ecosystem. So basically all this,
Starting point is 00:37:08 what I just showed you here with F2FS, I mean, building this and getting this into NVMe and getting it standardized, this will work out of the box. So we have prototypes working internally that kind of work with zoned namespaces and everything. So that's pretty awesome. So the idea is that all this code
Starting point is 00:37:25 is already being used by Microsoft and other people in the Linux ecosystem. So it's already in use today and has been tested and is production friendly. So when we then bring these zoned namespaces and extend the IO determinism into NVMe, we can from day one have production software
Starting point is 00:37:42 that the people that use the SSD can use from day one. Cool. So that was what I had. Thanks. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join
Starting point is 00:37:59 our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
