Storage Developer Conference - #201: Towards large-scale deployments with Zoned Namespace SSDs

Episode Date: February 13, 2024

...

Transcript
Starting point is 00:00:00 Hello, this is Bill Martin, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast Episode 201. All right, so I'm going to talk about large-scale deployments with zoned namespaces and where we kind of are today.
Starting point is 00:00:45 Before I get into all the ecosystem work that we've been doing, I want to set the scene for why we've been spending so much time building ecosystems around these ZNS SSDs and so on. One of the challenges that hyperscalers and cloud service providers have is that
Starting point is 00:01:01 they're obviously constantly challenged with large volumes of data, and customer demand for cost-efficient storage means they want it as cheap as possible. And while they want that, they also want it to be high performance. So the metrics they usually look at are IOPS, terabytes, throughput, latency, quality of service, and so on, and the TCO impact of all of that, just having it deployed in the data center. And one of the cool parts,
Starting point is 00:01:27 one of the really interesting parts, is lifetimes and drive writes per day. Depending on where you are, roughly one drive write per day is what some of them are looking for, and that's looking at a five-year horizon. And some of these hyperscalers,
Starting point is 00:01:44 they're moving to extend the lifetime of the hardware in the fleet from like five years to seven years or more if they could. So it is really important, as we're moving to QLC and PLC NAND, which have lower drive writes per day, to have technologies that can extend that, such that drives can stay longer in the fleet and they can make use of them. And obviously they have a lot of good reasons for doing this, like carbon neutrality, emissions, and all that. And what they're seeing so far is that, if they can get away with it, they will actually go on to 10 years and onwards. And so that's
Starting point is 00:02:21 really very interesting and fits into the climate discussion we have these days. And so conventional SSDs are really not able to achieve this from their point of view. They have this typical lifetime of three to five years; the hyperscalers want seven years or more. And there are different ways to achieve that. Either you use TLC memory and pay more for it, that's one way, or you go with QLC, but that has low drive writes per day.
Starting point is 00:02:55 And so they need a solution that eliminates the issue that SSDs have with write amplification, and they can use that to increase the drive writes per day, but also to improve the performance overall. So ZNS, with zoned namespaces, is one way to solve these challenges.
Starting point is 00:03:12 And down here at the bottom, I've shown conventional SSDs, and we have high cost for TLC, relatively high cost, right? It is what it is. And then you have QLC, which is getting better, has a better cost structure. It has a typical drive write per day.
Starting point is 00:03:30 There are some that are better, like some vendors that have slightly higher drive writes per day, but I haven't seen a QLC-based conventional SSD that has more than one drive write per day. And then when you add zoned namespace support to an SSD, TLC becomes very, very efficient. I mean, you could do more than two and a half drive writes per day with zero OP. You can achieve the same with a conventional SSD if you do like 28% OP. You can get to those drive writes per day,
Starting point is 00:04:00 but you have to pay roughly 30% more for your media. And then there's QLC, which is this balanced approach where we see roughly one drive write per day; it's in that ballpark. So that's one way to look at it. You can have these different ways to look at the media. Another way to look at it is that performance is expensive, just in general. So we have one use
Starting point is 00:04:25 case that Facebook has described, which is the CacheLib caching engine. In the paper they wrote a couple of years ago on the CacheLib engine, one of the things they did was say, hey, we want as low a write amp as possible on the drive. And to achieve that, like 1.1 and 1.4 on the workloads they had, they basically over-provisioned 100%, so you only use half the capacity of the SSD. So if we extrapolate what it takes to get that performance with a conventional SSD
Starting point is 00:05:00 versus an SSD with ZNS, we see that you need roughly twice the capacity, or, if you use ZNS, you need half the capacity. And then what is the cost there? Especially on SSDs, the cost is primarily the NAND media. There are other costs like the controller,
Starting point is 00:05:19 DRAM, and so on. But as you have more and more media attached to one controller, the dominating factor becomes the media itself, which means, in Facebook's case, they're overpaying for storage by 2x for this particular workload when they want to have this low
Starting point is 00:05:35 write amplification. All right, so now a little bit about what SSDs with zoned namespace support have and what they can do. So essentially, they eliminate the SSD's write amplification, that is, the garbage-collection-driven write amplification. There are still media reliability things going on within the SSD,
Starting point is 00:05:59 but it eliminates one of the primary sources, the primary source, of write amplification within the SSD. And that happens by fixing the mismatch between the storage interface that is presented to the host, which is a random-write interface, and what happens
Starting point is 00:06:17 when it goes into the SSD, matching those up so that host and device can collaborate on the data placement. So one part here is the throughput, where you hammer the drive with writes, and as soon as garbage collection kicks in on a conventional SSD with 7% OP, it drops like 3x. And if you have like 28% OP, it's around 2x. So extra OP is one established way to deal with it,
Starting point is 00:06:44 but if you have a ZNS drive, it's just stable; there's not much to it. Another way to look at it is from the latency point of view. There, on the x-axis, you're adding writes, like 200 megabytes per second, 400 megabytes per second, and it keeps increasing, and the 4K read latency increases as well, but it's very stable, linear. Whereas if you look at conventional SSDs, because the GC is going on and getting activated,
Starting point is 00:07:15 you can see the read latency bump up, and as soon as you hit the max, let's say for the 7% OP drive, that's around 360 megabytes per second for this particular case. Meanwhile, the media configuration that drive has can do like 1.1 gigabytes per second of writes, and it is actually doing that internally in the SSD,
Starting point is 00:07:39 but the host can actually only write 360 of that. The rest of it is going to GC, and therefore there are all these activities going on within the drive, and that's what we want to eliminate. All right. So those are the benefits of ZNS.
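As a rough illustration of the steady-state comparison described above, the sketch below drives such a workload with fio. It is a minimal sketch, not the benchmark from the talk: the device paths, block size, runtime, and open-zone count are placeholder assumptions, and the ZNS namespace is assumed to appear as a zoned block device so that fio's zonemode=zbd applies.

```python
# A minimal sketch, assuming fio is installed. /dev/nvme0n1 is assumed to be a
# conventional namespace and /dev/nvme0n2 a ZNS namespace exposed as a zoned
# block device; both names are hypothetical.
import subprocess

def run_fio(args):
    subprocess.run(["fio"] + args, check=True)

# Conventional namespace: sustained random overwrites eventually trigger
# device-side garbage collection, which is where the throughput drop shows up.
run_fio([
    "--name=conv-steadystate", "--filename=/dev/nvme0n1",
    "--ioengine=libaio", "--direct=1", "--rw=randwrite", "--bs=128k",
    "--iodepth=32", "--time_based", "--runtime=600",
])

# ZNS namespace: writes are sequential within zones (zonemode=zbd), so there is
# no device-side GC for the throughput to fall off of.
run_fio([
    "--name=zns-steadystate", "--filename=/dev/nvme0n2",
    "--ioengine=libaio", "--direct=1", "--rw=write", "--bs=128k",
    "--iodepth=32", "--time_based", "--runtime=600",
    "--zonemode=zbd", "--max_open_zones=8",
])
```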
Starting point is 00:07:57 And you can then easily look at this from throughput, or from latency, or from drive writes per day, like lifetime; you can cut it whichever way you want and go from there. All right. So how did we get here? Many of you in this room have been part of building this ecosystem.
Starting point is 00:08:15 A lot of you have been involved in the ZNS standardization, which we started together back in 2018 and then worked on for a year and a half or so to get to a complete specification. So this was a lot of work with a lot of you involved, which was very exciting. And it was also one where we added a new specification, a new specification document, to NVMe. So that was a big amount of work we had to do there.
Starting point is 00:08:47 So we finished that up in June 2020. And then, so that was the specification; we had something to work off. Now we needed to have support for it, because while the numbers look great for ZNS, you do need software support for it to be used.
Starting point is 00:09:06 So there was also that other body of work. And for that, there was a lot of support added into the software ecosystem around the same time that the spec was released. So this was all the general support in the Linux kernel. In the Linux ecosystem here specifically, I'm talking about the Linux kernel enabling it natively. And that was added in straight after, and then we improved on it. And then SPDK support was added a little bit later.
Starting point is 00:09:33 If you're building something with SPDK, like an all-flash array or something like that, you can use SPDK and have library functions to work with the ZNS drives. Also, right after the spec was released, vendors started to come out announcing their offerings of SSDs
Starting point is 00:09:53 that have ZNS support. So all of that has been evolving since 2020, and we are getting to the second generation of ZNS drives in the market, which is really exciting. So one of the things that we have been involved in is the applications and so on, enabling that ecosystem and showing, hey, there are all these ways: how do I use it, how do I deploy it, and so on. All right, so just a quick primer, because I've said ZNS and
Starting point is 00:10:24 zoned namespaces many times now, and what is it actually? Very briefly, it's an NVMe namespace which has this abstraction of zones: the logical blocks within the namespace are boxed into fixed-size zones, and those are communicated to the host, and then the host can use that to do data placement.
Starting point is 00:10:43 So it places its data into them. There's one key thing here: in NVMe, an SSD exposes namespaces. So in an SSD, you can have a zoned namespace that has this zone abstraction, and you can also have the same SSD exposing a conventional namespace. So the SSD can also be, like, a normal SSD.
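To make the zone abstraction just described a bit more concrete, here is a toy model of it in Python. It is purely illustrative and not tied to any real device or library; the zone size and zone count are arbitrary assumptions.

```python
# Illustrative only: a toy model of the zone abstraction (fixed-size zones, a
# per-zone write pointer, sequential writes, explicit reset). Sizes are arbitrary.

class Zone:
    def __init__(self, start_lba, zone_size):
        self.start = start_lba      # first LBA of the zone
        self.size = zone_size       # fixed zone size in blocks
        self.wp = start_lba         # write pointer, advances sequentially
        self.state = "EMPTY"        # EMPTY -> OPEN -> FULL

    def write(self, lba, nblocks):
        # Writes must land exactly at the write pointer and stay inside the zone.
        if lba != self.wp or self.wp + nblocks > self.start + self.size:
            raise ValueError("out-of-order or overflowing write rejected")
        self.wp += nblocks
        self.state = "FULL" if self.wp == self.start + self.size else "OPEN"

    def reset(self):
        # Resetting a zone is the host's explicit "this data is dead" signal,
        # which is what lets the device avoid garbage collection of its own.
        self.wp = self.start
        self.state = "EMPTY"

zones = [Zone(i * 4096, 4096) for i in range(8)]   # hypothetical 8-zone namespace
zones[0].write(0, 1024)                            # append at the write pointer
zones[0].reset()                                   # host reclaims the zone
```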
Starting point is 00:11:03 So in the benchmarks that we're doing, we're showcasing that: we have a conventional drive and we have a drive with zoned namespace support. It's the same drive; it's just that in one case it's a conventional namespace that we're using and in the other it's a zoned namespace. So that makes it so that we can do a really good comparison of the
Starting point is 00:11:20 actual benefits that we get from it. Because we can do an apples-to-apples comparison, which is something that, academically, I'm very excited we could do, because that took away all the noise that usually can impact the results. Here it's the same hardware, same everything. The firmware difference,
Starting point is 00:11:38 the data path within the firmware, is very similar. That was something that was really cool. Another part which is interesting about this interface is that it mimics some of the work that's been done for the ZAC and ZBC models for host-managed SMR drives. And especially Damien Le Moal's team has, for many years, built up a robust ecosystem for that. So we didn't start from scratch with ZNS
Starting point is 00:12:03 because we aligned to that model. So there were previous years of work which had been done to enable all of this. We didn't start from nothing. We started from something that was already working and deployed in the field by end users and so on, and we were adding in and filling in the blanks.
Starting point is 00:12:20 So here's an overview of the ecosystem. The development has been going on since 2016, and even before that. 2016 is the year where there was generic zone support added in the kernel. Before that, it was pass-through commands and all that. But this is where things became really stable.
Starting point is 00:12:41 So that's been there since 2017. But this is a 10-year effort we're getting to now, which is usually the time it takes to build something in storage these days. There is support across all the different major distributions like Red Hat,
Starting point is 00:12:59 CentOS, Debian, Ubuntu, you name it. I mean, all of them have support for zoned storage. So if you have a zoned storage drive, and it supports the specific device model, and I'm going to get back to that, you plug it in and it'll just work. And there will be tooling and everything.
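As a small illustration of that "plug it in and the tooling is there" point, the sketch below inspects a zoned block device with standard Linux interfaces. The device name is a placeholder assumption; it assumes a kernel with zoned block device support plus the usual util-linux blkzone and nvme-cli utilities.

```python
# A minimal sketch, assuming a ZNS namespace shows up as the zoned block device
# /dev/nvme0n2 (hypothetical name) on a recent kernel with blkzone (util-linux)
# and nvme-cli installed.
import subprocess
from pathlib import Path

dev = "nvme0n2"

# The block layer reports the zoned model ("host-managed") and the zone count via sysfs.
print(Path(f"/sys/block/{dev}/queue/zoned").read_text().strip())
print(Path(f"/sys/block/{dev}/queue/nr_zones").read_text().strip())

# Both blkzone and nvme-cli can list zones, write pointers, and zone states.
subprocess.run(["blkzone", "report", f"/dev/{dev}"], check=True)
subprocess.run(["nvme", "zns", "report-zones", f"/dev/{dev}"], check=True)
```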
Starting point is 00:13:16 Another part that's coming through is that we have local file system support. Initially, it was just F2FS, which targeted mobile systems like phones and tablets and so on. And then recently, Damien's team has added support to btrfs, which means there's an enterprise file system which now works with both ZNS drives and host-managed SMR drives. I'm going to get back to why that's really, really exciting,
Starting point is 00:13:53 because it eliminates a lot of the issues and challenges that have been around ZNS. Well, how do we use it? If you have a local file system that's generally available, you just put it on, and then on top it's general purpose. I mean, it's a normal file API. You can do random writes. You can do what you want. Then there are storage systems,
Starting point is 00:14:07 like the Ceph distributed storage system. And then when we move into the cloud, there's OpenEBS Mayastor, and there's something called the Cloud Storage Acceleration Layer in SPDK, like a host-side FTL, which is good for all-flash arrays. There's a bunch of libraries,
Starting point is 00:14:21 like libzbd, libnvme, SPDK, fio, QEMU, blkzone, blktests, and many, many others. So there's general support and so on. And our focus, with the teams that we're working with, is end-to-end application enhancements. That's cloud orchestration and databases, and databases, and caching, but I wrote it twice there. And so one of the cool things here is that this is used in production at cloud service providers today at scale, especially driven by the SMR ecosystem.
Starting point is 00:15:03 So this is already in daily use and in production across millions of drives today. All right. That's something I think we're very proud of, where we got to here after these 10 years. There have been a lot of people involved and it's been a huge effort over the last decade to get to this.
Starting point is 00:15:25 It really has been a lot of people, yeah, 20 people working full-time on it over this time. So it's really amazing. So I want to talk about how customers, end users, deploy this when they deploy zoned storage. And we see two major ways to do it when you deploy SSDs: one is through a storage array,
Starting point is 00:15:52 and another one is through local storage. So with storage arrays, I mean, there are DIY solutions, and there are some storage vendors who have their own solutions. Another way is to use these off-the-shelf ZNS drives and so on. That's one way.
Starting point is 00:16:09 And then you run an FTL within that box, and you expose it as conventional storage at the other end. Another way is local storage, where you can run any application using the local file system support that I talked about. So the application doesn't know, isn't aware of it. And then there's one where we've gone all in:
Starting point is 00:16:31 okay, let's go build it and make it as fast as we can. So for storage arrays, this is the all-flash array. You commonly expose it through NVMe over Fabrics, NFS, Samba, and so on. And the storage box runs some software that terminates it and does the translation,
Starting point is 00:16:50 and then you expose conventional storage to the end users. It's a high-performance storage system that you use for AI, ML, streaming, databases, and so on. And one of the goals when going towards this kind of solution, the dream, is to replace some HDD workloads with QLC SSDs. And by having the drive writes per day guaranteed to be more than one, you're actually getting closer to them being able to do that. So one case here is Alibaba.
Starting point is 00:17:22 They have had a project going together with Intel and Solidigm, where they replace hard drives with QLC SSDs in their third-generation big data platform. Compared to the old one, where they didn't use QLC, they were roughly twice as fast and had more density and so on. There's a lot more to it, but that's the gist of it. So that's one way that we're seeing people deploy this kind of storage.
Starting point is 00:17:52 Another one is this: rather than everyone doing their own proprietary solution, Solidigm, together with Alibaba, has been working on this CSAL part, the Cloud Storage Acceleration Layer inside of SPDK, which is this translation layer.
Starting point is 00:18:17 And it has this write-shaping tier and it scales well with QLC and so on. And one of the things Solidigm is doing, together with their partners, is to have a reference platform, a reference implementation, where you just use it out of the box, which is really, really cool.
Starting point is 00:18:37 So that's one way to deploy, where Solidigm has released this image, and you just deploy it and then you have your solution. It's very easy to use. All right.
Starting point is 00:18:48 Then there's the local storage approach, with a file system that has zoned storage support. And this is for users who don't need the full speed benefits of zoned storage. They have it, and maybe they have some applications that don't need it or just haven't been optimized yet. And typically, when you use this storage,
Starting point is 00:19:12 you use it as part of a file system anyway. So you're going to put something on it regardless; might as well put on a file system which has zoned storage support. Two of those are F2FS and btrfs, which has now been enabled. btrfs specifically has had it since 5.12. Initially, it was SMR hard drives that were stable; in the latest release of the kernel, we are now stable on ZNS as well. And the work Damien's team has been doing there is really cool.
Starting point is 00:19:38 One of the learnings we had from that integration, when we looked at it, was this: we compared a conventional drive with a zoned drive, just using the file system with the zoned support, and did the benchmarking. And we got a little bit extra. You weren't slower; you were like five, ten percent better when you used a drive
Starting point is 00:20:01 that has ZNS support. And the interface up to the applications is just a file, a POSIX API. You don't know that the zoned storage is underneath. You don't care anymore. But since you're using a ZNS drive underneath, you get this 5 to 10% extra throughput. Which is, it's okay.
Starting point is 00:20:18 But where it really shines is on the latencies, where, when you get to the four nines and so on, you really see the conventional drive fall off a cliff, while it's much more stable with the ZNS drive. And additionally, you also get the 7% or 28% OP back. With ZNS drives, you can just utilize the whole drive.
Starting point is 00:20:44 There's no slowdown, it just works. So you also have the extra capacity. Another thing is that it works natively with these hint-based placement approaches like streams. It just works. So any improvements that are done there natively work with ZNS as well, with this kind of drive.
Starting point is 00:21:04 And there are, again, no software requirements. There's nothing. It just works. You don't have to do anything. These days, you plug it in, format it with mkfs, and then you point it to the drives, and that's fine. And then you put on your applications. You don't care anymore that it's zoned storage. It just works.
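A minimal sketch of that "format it with mkfs and put your applications on top" flow is shown below for btrfs. The device path and mount point are placeholder assumptions, and it assumes a btrfs-progs version that detects the host-managed zoned device and enables zoned mode on it.

```python
# A minimal sketch, assuming /dev/nvme0n2 is a hypothetical ZNS namespace and that
# the installed btrfs-progs detects the host-managed zoned device and enables
# zoned mode automatically. Needs root; run only against a scratch device.
import os
import subprocess

dev = "/dev/nvme0n2"
mountpoint = "/mnt/zoned"

subprocess.run(["mkfs.btrfs", "-f", dev], check=True)   # zoned mode assumed auto-detected
os.makedirs(mountpoint, exist_ok=True)
subprocess.run(["mount", dev, mountpoint], check=True)
# From here on, applications see a normal POSIX file system; the sequential-write
# and zone-reset handling happens inside btrfs.
```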
Starting point is 00:21:21 All right. But sometimes you just want to go all in. And that's where we have this end-to-end application integration. That's for IO-intensive applications, where you're banging the drive all day at high write throughput and so on. And it's these large-scale storage systems where you have exabytes, where it really matters
Starting point is 00:21:41 that you're using your storage efficiently. So for that, we are working on different applications. We've done many, and the industry and community have done many, but the main ones are MySQL, ByteDance on TerarkDB, RocksDB with native upstream support, something called Longhorn in Kubernetes land, Ceph, and OpenEBS, which is also over in Kubernetes land. And there are all the distributions, of course.
Starting point is 00:22:09 So we have all these integrations, and I just want to go through the ones we have here in bulk and talk a little bit about them. That's MySQL, CacheLib, and then Ceph and these cloud integrations. All right, so for Percona MySQL, this is a collaboration between WD and Percona,
Starting point is 00:22:31 where we have worked with them on taking the MyRocks storage backend, with Percona MySQL using that. Hans, who is here in the room, developed this ZenFS storage backend, which plugs into RocksDB and is upstream today, and which allows RocksDB to run natively on top of drives that have zoned storage. And today, the work is such that you can take the public container image
Starting point is 00:23:00 that Percona provides, like MySQL, just like you normally do when you have a container and say, hey, I need one MySQL. That image has support for zoned storage. If it sees zoned storage, it just deploys on it and works with it. So here we get 80% higher throughput
Starting point is 00:23:18 just by using it, and there's lower latency, tail latency, and so on. So it's the same story. You get really high insert performance, because now the writes are really efficient. And you also get a little bit of improvement on the read side as well.
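If you want to try the ZenFS path by hand rather than through the Percona container image, here is a hedged sketch of wiring RocksDB's db_bench to a ZNS drive via ZenFS. The device name is a placeholder, and the exact zenfs and db_bench flags are assumptions based on the upstream ZenFS tooling; check the plugin's documentation for the version you build.

```python
# A hedged sketch, not the setup used in the talk. nvme0n2 is a hypothetical ZNS
# namespace (passed to zenfs without the /dev/ prefix), and the zenfs/db_bench
# flags below are assumptions based on the upstream ZenFS tooling.
import subprocess

zbd = "nvme0n2"

# One-time: lay out the ZenFS metadata on the zoned device.
subprocess.run(["zenfs", "mkfs", f"--zbd={zbd}", "--aux_path=/tmp/zenfs-aux"],
               check=True)

# Point RocksDB's benchmark tool at the ZenFS backend instead of a POSIX path.
subprocess.run(["db_bench", f"--fs_uri=zenfs://dev:{zbd}",
                "--benchmarks=fillrandom,readrandom", "--num=10000000"],
               check=True)
```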
Starting point is 00:23:32 One of the cool parts here is, I mean, it's very subtle. Normally when you use MySQL, you use a storage backend called InnoDB. And normally when you use MyRocks, which is optimized for space amplification, it is slower on the read path. But when you couple it together with ZNS and these optimizations,
Starting point is 00:23:52 you're getting into the realm where the InnoDB and the MyRocks implementations have roughly the same performance. So while typically you'd say, for read-heavy database workloads I'll go use InnoDB, and for write-heavy workloads where I want write efficiency I'll go use MyRocks, with this drive integration and all that, we can actually get to a point where you could just go ahead with MyRocks and work with that. So that's really exciting.
Starting point is 00:24:21 Of course, MyRocks doesn't have the same functionality as InnoDB, so it's never black and white. But it's one of the interesting things we saw. Then we've been looking at CacheLib. There's a lot more work to this; I'm only showing a little bit. There's a paper under review,
Starting point is 00:24:42 so I'm just going to take a very small part of the work that's gone into it. So, CacheLib is this general-purpose open-source caching engine from Facebook. They use it for many of the different storage services that they have within Facebook. And the cool part is it can use both DRAM and flash. So one of the benchmarks we have done is a KV cache workload with this five-day trace from Meta that they provide. And the key part here is we're running the drive at full capacity. That even means we're also using the OP, the 7% or the 28% OP.
Starting point is 00:25:25 In this particular case it's 7%, so 107%; we're using the whole drive. If you do this on a normal drive, it performs terribly. That's where Facebook goes, hey, let's only use half. But we're just using it all in this particular case. And the key is, of course, it's a WAF of one, as usual.
Starting point is 00:25:45 I don't want to get too much into it, but there are some tricks being done where you do more reads, but you get very efficient write performance out of it. So what we see is 3x write throughput on this particular benchmark, and we see 2x read. And on the latency side, we see 2x to 10x improvement,
Starting point is 00:26:02 depending on how many nines we look at. But it's a major difference and shows, hey, you can go ahead and do this. That's really exciting. And it's just plug and play. You plug it in and it works. So there's not much to it.
Starting point is 00:26:18 This work is not upstream yet, but it's to show what's possible, because the paper is not out yet. We're still waiting. Then another one is Ceph Crimson. If you're building HPC, if you're building really massive storage systems with hard drives, for example, you often turn to Ceph.
Starting point is 00:26:37 I looked up how much is deployed: they have this telemetry thing that reports something like 1.1 exabytes deployed, and that's just the deployments that opt in to telemetry. So it's quite a widely used storage system these days. And one of the things being worked on is that in Crimson, which is the next generation of Ceph, there's native support for zoned storage. So whether it's host-managed SMR drives or ZNS drives,
Starting point is 00:26:59 it just works. So that's all, yeah. And when you use Ceph, it exposes conventional storage. You don't know you have zoned storage underneath; it just doesn't care. It just works. So here you still have conventional workloads running on top, but you still get 30% extra throughput.
Starting point is 00:27:16 All is good. So that's another one. And all of this is natively supported, so you don't need to care about it. It just works. Then the last one is the cloud integration. Many times when you do deployments, application developers use these cloud orchestration platforms.
Starting point is 00:27:34 And we're getting into the state where, initially, all these containers were stateless, and now we're making them stateful. And for that part, you want to expose the storage into these containers so they can use it. And here we've integrated with Longhorn and OpenEBS and CSAL.
Starting point is 00:27:52 And what we've shown here on the right is where we have it backed by blocks. So there's the Longhorn work and there's the SPDK CSAL work. There's also Mayastor, where we actually expose the zoned drive, just for those cases where we want to go all end-to-end fast,
Starting point is 00:28:10 and show it, hey, and then we run a workload. And this is all within a container where everything is wired in. And the application that runs in the container doesn't have to care that it's zoned storage. It doesn't matter, it just works.
Starting point is 00:28:24 All of that is taken care of somewhere else. Yes, and Hans and Dennis, in the talk later today, will talk more about the details of all of this. I'm not going to talk too much about it. All right. So let's look at the ecosystem, where it is today compared to 2020.
Starting point is 00:28:45 So a lot of this support for the Zoned Namespace command set has been announced or added into products across a broad set of vendors. I think there's solid support in the Linux software ecosystem. It's been helped a lot by the existing foundation for host-managed SMR HDDs. But recently, these improvements that have been made to the local file systems are extremely exciting, and so are the relational and key-value database systems, because those are some of the really IO-heavy workloads. So we have all these vendors that are building both SSDs
Starting point is 00:29:22 but also validation tooling and so on; they have also been active. One of the key things that we learned after the release was that it wasn't as easy as it should be. Vendors implemented different versions of ZNS, which
Starting point is 00:29:39 meant they weren't always supported by the work that was in the Linux kernel, or customers were getting different drives that didn't quite work the same way, so they didn't know what they were getting. So one of the cool parts of the work that's been done within SNIA has been to standardize common device models, such that we can all rally behind the same storage devices that have the same properties, such that when we build a software
Starting point is 00:30:03 stack, it works together. And second is making it easy to use, such that you don't need to care about the zoned storage, you just use it and put your applications on top. And that's something where I think the local file systems and these cloud solutions that now support it natively really help.
Starting point is 00:30:23 All right. So on to this new model: this comes from the zoned storage technical work group, which has been going on for quite a while. And one of the things that we got to there is defining these common requirements for a zoned storage device. And it's really, really interesting in that, when we have that, where we say these are the properties, and I'm going to get into a little bit of what we define here, you get multi-sourcing. End users can source from multiple vendors. They can say, this is the model I want, and then they can get devices in the same realm.
Starting point is 00:30:58 But it also gives common software requirements, so you have something that you know you're building for. Specifically, some of the things here that I'm really proud of got standardized: there is the model where it's high performance and high capacity, and there's the use case described. That's great, and I think it was good that it was added in; it gave a lot of clarity. But what I'm really excited about is this common requirement of a zoned storage device,
Starting point is 00:31:25 which defines who manages reliability. Is it the host? Is it the drive? When we defined the zoned namespace command set specification, which I was part of, I thought wear-leveling was always on the drive, that the reliability was in the drive. And I didn't think that anyone would think it would be in the host. It didn't cross my mind back then, but it turned out later,
Starting point is 00:31:55 hey, some really would like to have the host take care of the reliability instead of the drive. And that changes everything. It makes it much more difficult for the software; there's a different software stack. So the model says it's on the drive; the drive manages it.
Starting point is 00:32:13 Great, data doesn't go away. If the host does nothing, data stays there. Then there's something about the zones: when you write into them, the capacity is static and fixed. That's also important. And there's something called zone active excursions. That's very technical: when you program the media,
Starting point is 00:32:28 sometimes you get a program failure and you have to back off, and then you can't write to that flash block anymore. It's important that the SSD, when that happens, just says, okay, I'll fix that. You, as the host, don't need to care about that. You can just keep writing.
Starting point is 00:32:44 So that's being defined. And then there are things around end of life, read-only mode, and so on. So this is really exciting work that was released back in July; it's very recent and very exciting. Another one we see is zoned storage for embedded devices. I personally haven't been much involved in it, but Google has been
Starting point is 00:33:09 pushing zoned storage support into mobile. This is within JEDEC, and it has been driven by Google and some of their partners. As you can see, there's SK hynix there, so they've been working closely with them. So they're a known one. And the use case they have in mind here is the
Starting point is 00:33:26 Android hardware ecosystem. So this is tablets, Google Pixel, phones, and so on. And there's a roadmap for getting into that; there was a talk from Google discussing the roadmap back at FMS. So this is all coming together. The specification was completed back in July, and this is planned to go into next-generation mobile platforms. And one of the cool parts is that while for ZNS we had to bootstrap everything, like file system support,
Starting point is 00:34:01 for zoned UFS, the main Android file system for doing this is F2FS, which was initially developed by Samsung, and now Google is the main developer there. So that work started out earlier, and it's very stable today. And the zone support in it is very stable; it's been tested over a long time now. So when the vendors, the UFS vendors, deliver these drives,
Starting point is 00:34:30 you basically just take the drive, put the F2FS file system on it, put it together, and ship it. So where we had a long lead time for ZNS to get adopted, and ByteDance has probably talked about when they deployed it in production, for zoned UFS I believe this lead time is much shorter because the ecosystem is already there. It's working and so on.
Starting point is 00:34:54 So that's very exciting, and something I very much look forward to happening. All right, we're getting to the last slide. So with zoned namespaces, hyperscalers and these CSPs can meet this increasing customer demand. It helps them go from just five years to seven years and even beyond that. And I feel the ecosystem, the software ecosystem, is very mature at this point. For storage arrays, you have turnkey solutions;
Starting point is 00:35:26 I'm thinking of CSAL here with ZNS and QLC. There's robust file system support, both client-side, like embedded for tablets and phones, and for enterprise workloads with btrfs. That is also stable, so you don't have to worry about there being zoned storage underneath. And then there are these end-to-end integrations where you just want to go full speed ahead and get all you can. And of course there are databases that have been accelerated, but there are also distributed file systems and so on,
Starting point is 00:35:54 and Kubernetes and OpenStack and all that. And we have all of that working today. Yeah, and I want to close on this: there are more talks here at SDC. There was one talk here on Monday by Swapna, which talked about these cloud workloads and the requirements they have for storage media, where they talk about the requirements that they
Starting point is 00:36:12 see now and in the future for their storage systems, both for SSDs and for hard drives. Later today, there's a talk on bridging the gap between host-managed SMR hard drives and software-defined storage, where Piotr from Light Storage is going to talk about it. That's going to be exciting.
Starting point is 00:36:26 And then here at the end of the day, Hans and Dennis are going to have fun talking about zones and the art of log-structured storage. All right. Thank you. Thanks for listening. For additional information on the material presented in this podcast, be sure to check out our educational library at snia.org slash library. To learn more about the Storage Developer Conference, visit storagedeveloper.org.
