Storage Developer Conference - #68: Andromeda: Building the Next-Generation High-Density Storage Interface for Successful Adoption

Episode Date: March 27, 2018

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast Episode 68. Hi everyone, my name is Laura Caulfield and I'm a developer in the hardware division of Microsoft Azure. My responsibilities there are in the design and pathfinding work area for solid-state storage in our data centers. Today, I want to tell you about some of the technology trends and application requirements I see in this role
Starting point is 00:00:59 and the interface changes that we're adopting to support cloud-scale workloads for the foreseeable future. So first, we're going to start out with a brief primer on current SSD architecture and the priorities of today's cloud applications. Next, we'll dive into the prototype that we've been building for use across Azure. And then last, we'll explore the current state-of-the-art interface between host and drive that we're looking to develop. So first, we'll dive into the design principles for cloud hardware. These tend to apply to more than just storage, but I've tailored them a bit for Flash. So first, we need to be able to support a broad variety of applications.
Starting point is 00:01:46 Azure alone has over 600 different services, and we use the same set of hardware for our mail servers and our office suite and our search engine and others. This huge number of applications is also a huge amount of hardware, so we need to make sure that our supply chain is very healthy. And we do this in part by making sure all of our devices from different manufacturers behave almost in the same way, at least in the areas that we require. We also are seeing rapid evolution of NAND generations. It's actually still following Moore's law. But on the flip side, we're seeing huge qualification times, right? The complexity of the firmware and its interactions with the host are turning the test time for any single workload into hours,
Starting point is 00:02:32 and we have hundreds of workloads that could each present their own corner case that matters and is important to find and debug before we put this hardware into the data center. Last, we need to make sure that our software has enough flexibility to evolve faster than our hardware. We use SSDs for at least three or five years, typically more because they last so long. And the process for updating this hardware and the firmware
Starting point is 00:02:57 is much heavier than daily pushes of software updates. So now that we've seen the environment that we're working with in the cloud, let's take a look at what the technology provides. So how many of you guys are really familiar with what goes on inside of an SSD? Garbage collection, write amification. All right, so we're about half and half. I'll try to go quickly on the background. In your SSD, essentially, it's mostly flash memory. And each flash memory die is composed of a set of flash blocks. And then within the block, you program individual flash pages.
Starting point is 00:03:30 Program, right? Those two words are used interchangeably. And the flash itself has this bulk erase operation. And then you have to write to each of the pages in order. So to translate this type of interface into the standard update in place interface the ssd has a large amount of dram for every terabyte of storage it has about a gigabyte of dram so now when the host is writing data for example we have four streams of data each individual stream is sequential but they're happening at the same time
Starting point is 00:04:01 the drive accepts this data into its data cache until it has enough data to fill a whole flash page. At this point, it updates the address map and writes the data to the flash. The host continues to send data and the flash continues to fill up until all the flash is full. Now at this point, the SSD controller's job is to free up some flash space. It's got a whole bunch of over-provisioned extra flash space. So by this point, the SSD controller's job is to free up some flash space. It's got a whole bunch of over provision, extra flash space. So by this point, data has been overwritten, or maybe some applications have trimmed their data altogether, and you have this fragmentation in your data and physical space. So at this point, the SSD controller does garbage collection. It copies out the valid data, which is the main step in garbage collection, to a new block.
Starting point is 00:04:45 And then it erases the old block. Now this write amplification is enemy number one. All it does is put additional overhead in the drive. It eats away at performance. You get down to 20% of the performance in steady state for random workload. It also eats away at endurance, which is very limited in Flash. And becoming more limited as the density scales up. So we looked at this and said, how are we going to reduce the write amplification? We'd
Starting point is 00:05:10 like to buy these off-the-shelf parts, so let's try caching up our data on the host side, right? If we cache up a whole block and just write a whole flash block at once, we'll get write amplification down to one, right? Unfortunately, we ran the experiment and saw a much different picture. Writing four megabyte random chunks of data wasn't much different than writing four kilobyte random chunks of data. And we had to increase the block size up to a gigabyte before we saw the right amplification improvements
Starting point is 00:05:37 that we wanted. Now, this might be okay if we have four streams of data, but now scale up to the hundreds or thousands of streams of data that we have, and the host side buffer cache becomes untenable. Now this gets at the heart of a fundamental design trade-off that cloud system designers see very differently from SSD designers. I've given you a hint about the cloud SSD designers, but let me first dive into the perspective of the SSD design. When you have your SSD, you've got actually a large array of flash memory, not just a single die. So you'd like to scale out your performance as much as possible.
Starting point is 00:06:12 And the controller sensibly does this by caching up enough data to stripe it across a whole flash page across all the dies. So the data keeps coming in, keeps getting written, and your effective flash page size and flash block size is now multiplied by the number of dies in your device. A little m.2, for reference, has 64 dies. So this is how we quickly get up from the 4-megabyte flash block size to the effective 1-gigabyte flash block size. And this is our current state-of-the the art, but if you look at how flash is scaled over time, as the density of flash goes up, the block size also goes up. So this is what the
Starting point is 00:06:54 SSD looks like. Now let's switch back to the applications and see in particular how their design trade-offs would establish how you'd map their data to the physical flash array. We'll take a look at three applications, the first of which is our Azure storage backend. Now, this is a lower tier in a storage hierarchy, so by this point, the data is organized into hundreds of distinct streams, each went into sequentially. It's got a couple of priorities in the performance space. It can scale up its performance by scaling up the number of streams on a device. Any single stream doesn't have to get great throughput as long as the whole system gets good throughput. And they try to keep the
Starting point is 00:07:36 reclaim unit size as small as possible because this helps the end latency of their whole system. Now, these application priorities map to what I call a vertical stripe. This is where each stream is scheduled to its own block on its own die. It has the append-only semantics, so your write amplification becomes very low. Each stream is isolated, so if you trim one stream, the other one isn't fragmented. You get high throughput by increasing the number of streams on your drive,
Starting point is 00:08:05 and you get the smallest effective block size possible, that four megabytes today, which might become more megabytes tomorrow. Our next application is a legacy application scheduled in a virtual machine. Now, these tend to be, I mean, all over the map, right? But they can have small updates, they can have bursty performance. But the bottom line here is they expect the same type of behavior that they saw with legacy SSDs. So for this, it makes sense to stripe across blocks on different dies. You get the bursty performance, the high throughput for any given user. But the big difference here between legacy SSDs is that you've scheduled your VMs to different sets of blocks. So when a VM is well-behaved and decides to write sequentially,
Starting point is 00:08:54 it can get low write amplification without the other VMs fragmenting it. And also, when you close a VM, then it doesn't fragment other VMs when it trims its data. Our last application is a new application that's still run within a VM guest. Now, the host is still scheduling a horizontal stripe, scheduling each VM to the horizontal stripe. So the design knobs that this application has are dividing up those blocks within that stripe, perhaps for a set of different logs that it has. And so this is what I call the hybrid stripe, where you have a horizontal stripe, but then you've further striped it into vertical stripes. Now there's a few things to note. One more thing before I say that. So there's a wide variety of applications here,
Starting point is 00:09:47 and you might imagine that they're on different machines doing completely different things in their own world. But in fact, to make the best use of our hardware and scale up and down with demand, we need the flexibility to be able to put all of these applications in the same SSD at the same time. So we might want a vertical stripe here, a horizontal stripe over
Starting point is 00:10:05 there. We essentially need the flexibility to partition these dies out and these blocks out into whatever configuration dynamically. Now there's a few things to notice here. All we really need is a chance to expose all these log write points, right? We don't need access to the NANs that use Ingersys. In fact, we don't want access to those things. We want that to still be managed in the drive. And the bottom, the most fundamental thing, difference here,
Starting point is 00:10:42 is that as we scale up the capacity in the data center, we're not scaling up the size of each application we're not scaling up the size of each application. We're scaling up the number of each application. So to support these priorities, we've boiled it down into three aspects that this new interface needs to have. First, we need to replace the blog abstraction with something that looks more like pend-only write points, and lots of them. We need the interface to have the flexibility to scale up to hundreds or thousands of write points per terabyte. We'd also like, if we're guaranteeing that we're going to be writing in this way, we don't want the drive to have to reserve
Starting point is 00:11:19 large amounts of flash or large amounts of DRAM in the case that we're going to revert back to the legacy system. The next aspect that we need in this new interface is the ability for the host to place the data physically. It needs to be able to understand whether it's co-scheduling on the same die as another application or isolated on its own, or whether it's divided among different blocks and it needs to be able to make the trade-off between reclaim unit size and throughput sorry i've got a smoke thing going on from the wildfires up in seattle okay so the last last aspect we need that has historically been kind of minimized in open channel interfaces
Starting point is 00:12:07 is to keep the reliability management down in the drive, right? We're going to have some challenges and the interesting challenges in defining this interface to make sure that the flash can continue to scale on Moore's Law. We need to enable innovation in that space for new ECC algorithms or whatever gnarliness that I'm not as familiar with these days. So taking a step back and looking at these priorities, I can see how it's kind of evolved over time in the community. So we've got our log abstraction, our in-host data placement policy, and our in-drive reliability.
Starting point is 00:12:50 And I'm going to show some different interface proposals that have evolved mostly over time. In the early days, then, the community realized that there's huge overheads in SSD storage systems, right? We designed SSDs to match what hard drives did so that we could get them into the market, and it worked. It was successful. It was a great strategy. But then people started to realize how much overhead it takes to mimic the hard drive. And so we addressed these overheads first by pulling more control into the drive.
Starting point is 00:13:27 Unfortunately, this has the side effect of locking the software innovation into that pace of hardware innovation, right? You can only move as fast as your firmware is going to evolve. And you can't try new things between the software and firmware in a very agile way. So next on the scene were multi-stream SSDs and IO determin determinism and this is where we started discovering the benefits of isolating uh different users down to the hardware and also some of the benefits of um well some of the ways to place the data in the drive unfortunately the the constraints in this space are kind of a double-edged sword. Both of these interfaces support legacy I.O. patterns, and so they retain all of the DRAM and flash overheads
Starting point is 00:14:13 that are required to quickly revert back to those legacy I.O. patterns. And so then the final set that we're considering here are in the open channel space, and this can mean a lot of things. There's a huge spectrum of proposals there, and there's a lot of evolution that's happened recently. And so my point here is that we need to make sure that this interface evolves so that the media management can stay in the drive and that these systems can become production-ready.
Starting point is 00:14:50 So our next steps are to take this interface and further develop it into a production-ready system. So now that we've evaluated applications, priorities, technology trends, and some of the available options for the new interface, then our next steps here were to build proof of concept and evaluate the overheads and performance that it gets. This is also our first step towards creating a system that's viable for a replacement for a conventional storage stack, right? We still have to support those legacy applications with the same level of performance as conventional SSDs.
Starting point is 00:15:28 So let's start out with defining a few terms, since so many people have so many different definitions for the same set of words. Open channel SSDs, at the kernel of it, we're exposing physical access, or physical addresses from the drive to the host, such as channels, right? So the channels are open.
Starting point is 00:15:49 We're also going to be talking about the flash translation layer. This has conventionally been in the SSD's firmware, and we're starting to see separate into two distinct parts, one called log management, one called media management. Now, the log manager managers main role here is to receive any kind of right maybe an update in place maybe a log structured right and to emit definitely to a new emit IO patterns that are definitely sequential maybe to one or more right points and in order to do that and it
Starting point is 00:16:19 maintains the address map and performs garbage collection the media manager their main role is to basically translate the gnarly physics of flash memory into something that looks more like a software logical sequence for how to access media. So this set of algorithms is conventionally like ECC algorithms, read-retry, read-scrubbing, all the things that change with the NAND generations and across vendors.
Starting point is 00:16:52 So with these terms in place, now we can talk about open channel and the two main variants. So here on the right, we have, next to a standard SSD, the two ends of the spectrum for open channel. So in standard SSDs, the entire FTL is down in the SSD, right? In the early open channel SSD systems, including our prototype, that entire FTL is shifted up into the host. This includes the media management and the log management. And this gives us the best flexibility for playing around with these algorithms and understanding the best division of labor between the log manager and the host manager.
Starting point is 00:17:31 But this isn't a system that I can take to my boss and say we can take to production because the drive can't be warrantied. I have to recompile the host side driver for every new NAND generation. So it's important that we reach this final destination where the media management is pulled back down into the drive. All right, so without further ado, here is our open channel prototype running in what we call Windows Cloud Server, basically the type of server that runs in Azure and our other applications. On the right, we have the prototype card. It's not our standard M.2 form factor
Starting point is 00:18:06 because it's a prototype, but it has essentially the same architecture. It's also running here in the disk manager and the two black screenshots are of it running, the system running under our conventional qualification tools. In the center we have store score running the standard four corners that you hear about,
Starting point is 00:18:26 random read, random write, sequential read, sequential write. And then in the bottom left corner we have a tool that can do secure erase and some of these other operations. And it's exposed as an open channel type of SSD. So this is a very real system. It's pretty exciting to see running. One of the first things we wanted to measure is what opportunity we have for optimization,
Starting point is 00:18:52 right? We all keep talking about reducing the map size and reducing the total end-to-end CPU overhead by reducing the total amount of work. What do those overheads actually look like? So we looked at three areas. The first of these is write amplification, right? This is enemy number one. We want to reduce that as much as possible. We look at a reasonable worst-case workload, which is highly fragmented.
Starting point is 00:19:15 This is a 4K random write workload. And in our normal SSDs, we usually see a write amplification factor between 4 and 5. In this particular system, then we saw a write amplification factor between four and five. In this particular system, then we saw a write amplification factor of four, and it moves from not being in the drive now, but the amplification is happening in the host. Our next area that we looked at was memory. Now, this is essentially the address map. So,
Starting point is 00:19:44 when you do a 4K addressing, then you have the one gigabyte of DRAM for the 1 terabyte of flash. And that's exactly what we saw, right? Our memory usage went from 1 gigabyte in the drive to 0, and the host from 0 gigabytes to 1 gigabyte. And this one is an easy one to optimize for applications, right? If IO patterns start shifting from four kilobytes to eight kilobytes, then right away you can reduce your granularity to 8K, and you won't get that performance hit of read-modify, right? And you don't have to lock your decision into when I'm manufacturing the hardware, how much DRAM do I have to put on my device? It's dynamically available. The last area we looked at was CPU overhead. We're shifting all this work up into the host now. It's doing more activity
Starting point is 00:20:29 in the host rather than in the drive. Now, there are some overheads in this system that are specific to the prototype. There were design decisions we had to make to get the prototype out the door. And in our next iteration of software development, those will go away right away. And then beyond that, there's further optimization we can do by just reducing the overall activity, right? Same time we reduce the write amplification, we'll be reducing the CPU activity as well. So the big takeaway here is that we've quantified the overheads. Now we understand how much optimization opportunity we have. We also need to make sure that we're addressing those legacy applications, right?
Starting point is 00:21:13 We can't ask everyone to rewrite all their applications to get the savings. So we need to make sure they're behaving at least at parity. So I'm happy to say they are. Here we've just run the four corners plus a mixed workload and measured the throughput. In gray we have the average of our standard SSD that we're calling right now. It's an average of three SSDs.
Starting point is 00:21:38 And then our open channel SSD is in green, and our standard SSDs are in various shades of blue. So in looking at this, you can get the high-level takeaways. Basically, the read performance is fantastic. It's in the top of the class or even better. The write performance is a little on the low side, but these FTL algorithms also haven't gone through the optimizations that FTLs typically go through right before shipping.
Starting point is 00:22:04 So I have every faith that that right performance will come up. The second workload that we looked at is notoriously challenging for SSDs, but it's one that I smile because the application designers every month or two come to me and say, what the heck is going on with my read performance? And basically, it's a mismatch in what's going on in the background activity. So they're doing writes in the background. Oftentimes, the benchmarks aren't doing writes in the background. So in this one, we are doing writes in the background. And this SSD does just as well, if not better, than the other SSDs. I'm
Starting point is 00:22:40 showing average latency in the purple and then scaling up through 2, 3, 4, 5, 9 percentile latency. And then a maximum latency run over about an hour after preconditioning. So the open channel SSD here gets maximum latency better than 10 milliseconds, which is better than the other drives. And then the other 2 through 5 nines latency are about on par. So overall, this proof of concept has been very successful, and we've seen that legacy applications can perform at parity, and now we've set up our system, and it's ripe for optimization by all of our different application groups. So then our next step is to make this system production ready. And the important part here is to make sure
Starting point is 00:23:31 that all of our SSD vendors can implement it and we have a standard interface that works for all of them. And so in the next section, I'm going to dive into what we currently have in... Yeah, the current state of the art for the interface that we're looking at. At the core is the physical page addressing interface, right? We keep talking about having a physical address. So the address format now, instead of having logical addresses, has a segment for each of the main architectural elements in an SSD.
Starting point is 00:24:06 You have your channel, which is a SSD channel. You have your parallel unit, which maps to a NAND die. You have a chunk, which maps to a multi-plane block, and then your sectors and pages. Now, all these addresses map to a physical location. They don't change, which also means that the host is exposed to the access pattern required by the flash. We have to erase a whole chunk before writing any of the sectors, and then we have to write the sectors sequentially. This last bullet here about the cache minimum write size
Starting point is 00:24:37 is pretty cool. So this is one example of how we're defining an interface that works for how NAND has scaled in the past and exposes a logical thing that the host-side software can work with. Let me tell you more about what I mean here. Basically, your NAND cells, your memory element, contains more than one bit of data, right? Your MLC typically has two bits, and then your TLC has three bits. And any time, well, okay, so these bits are also split across different pages. So it's possible to write one page and have your memory element half-written. And when it's half-written like this, then the written data is more susceptible to errors caused by reading of this data.
Starting point is 00:25:31 So one of the gnarly things that SSDs do now is make sure that you're not reading from those half-written memory elements. Now enter cache minimum write size. Basically, the contract between the host and the drive is now to say, don't read from these last n pages of data. Make sure you cache at least n kilobytes of data and read from that cache instead of reading from the flash physically. Now you can take this another way and say, well, let's take that cache and put it in the drive.
Starting point is 00:26:02 Let's hide it from the host. And in that case, then you just that cache and put it in the drive. Let's hide it from the host. And in that case, then you just expose a cache minimum write size of zero. There's a funky little animation here I didn't go through. But basically, the picture on the left here is a picture of your memory elements. Each row is a memory element, and then each column is a different bit in that memory element. The numbers are the pages in order.
Starting point is 00:26:29 So at this point, we've written up until page 18. Now we write page 19 and half fill that memory element. Now we write page 20. And then at page 21, you've filled that memory element, so it exits the window of the cache minimum write size. So this is kind of a hint at one of the things we're going to have to do to define a good interface that works for the drive and for the host and scales with new generations of technology. The next one you're probably familiar with. It's a tradeoff that's been going around in the IO determinism circle as well.
Starting point is 00:27:05 And basically, we have a tradeoff between providing good quality of service and providing good reliability. I'm seeing some smiles back there. Basically, as soon as you break up your RAID stripe, you stripe across dyes, right? And as soon as you break that up to get good quality of service, now you don't have your RAID providing a good reliability. So RAID and isolation are at odds, and this is an ongoing discussion that is being solved in the IODeterminism space and is overlapping with the open channel space. So we really do have a spectrum of interfaces here,
Starting point is 00:27:36 and we're going to find some place in the middle that addresses hopefully both sides. Ah, this last bullet here. So in the cloud, we have the benefit of sometimes, Hopefully both sides. Ah, this last bullet here. So in the cloud, we have the benefit of sometimes, in some applications, doing a higher level of replication that can provide the higher reliability. So this is what allows us to maybe reduce the reliability provided by the drive to a known lower level of reliability
Starting point is 00:28:00 because this higher level of replication can rebuild the data. Okay, and then my last topic on the new interface is kind of scaling back a little bit from the pure physical addressing. So we want to do physical addressing, but in moderation. Even in SSDs, we saw how they remap bad blocks through a sparse map, so it's kind of a low overhead option. And I see a similar thing possibly happening with open channel. Basically, at your block level, then you have bad blocks show up, and you want to perhaps have a sparse map to map those out. But if you make your block logical, you can also have the drive provide wear leveling guarantees at the LUN level.
Starting point is 00:28:52 So now you kick your wear leveling into these two parts where the drive guarantees a die level wear level, and then the host uses its normal migration patterns to wear level across the dives. So in terms of wear leveling, we're kind of moving our boundary from the drive level to a long level. And my final point, which I didn't write up here, is that we can do this with a block, but maybe not with a page or a channel because your block in Flash is independent of physical location, right? I expect the same guarantees from block two as from block three. And if I schedule two applications in the same die on these two different blocks,
Starting point is 00:29:36 I can kind of map it around and not notice any difference. And in fact, that's even more the case with NAND than it is with hard drives. Okay, so I'm actually finishing a bit early with the conclusion, so I'm hoping we'll have a good discussion after. We've seen over these slides that it's important for us to architect this interface to address both the cloud scale that we're seeing today and the cloud scale we expect to see in the future
Starting point is 00:30:04 with hundreds and thousands of workers per terabyte. But we also want to make sure we have the correct division of labor to allow NAND to continue to scale. I showed you our proof of concept and the data that we got out of it, basically showing the system overheads and how we're ready to optimize at the application level with this open channel interface. And the final steps that we have to do to bring this production involve the whole community in defining this interface, which brings me to my final solution,
Starting point is 00:30:36 this, or my final point. The final solution for this interface needs input from the whole community, right? We want it to work for all of our different drive designs, for all of our different hyperscalers. So it's a great time to jump in and talk about what this interface should look like and what things we need, what mistakes we need to make sure we're not repeating. So with that, thanks for your attention, and I'm happy to take questions now. Thank you. Yeah. Timeline for production.
Starting point is 00:31:21 So I hesitate to say because I haven't finalized this with everyone, but some things I can say is that Open Channel, we hope to be faster to implement on the drive side. We don't have to have each vendor implement new FTL algorithms and vet them out. And we've already done the majority of the work of getting the complex FTL algorithms working in the host. So next step is getting the interface defined,
Starting point is 00:31:49 and that's hopefully kicking off now. And once we have that, maybe a year after for final hardware. Yeah. Yeah. You mentioned that there are different kinds of applications that you know to support. For example, there's a streaming application where there are multiple streams and it's best for each stream to have an arrow strike, just go through one time.
Starting point is 00:32:20 And there are other applications such as VMs where it's bursty and it's good to describe. So the final solution that you're proposing, do you have some sort of dynamic approach? For basically building that picture? Yeah. Yeah, that's what I would like to see. And that's the way everyone in the company is talking, right? It's how they already operate.
Starting point is 00:32:47 And somehow the amount of buffering also is structuring here. Because ideally you would like to buffer as much as possible so you can then leverage the bandwidth for all of them. You're saying like buffer up, sorry. Like for the first application, make sure you buffer up enough to fill all the dies with your write. What kind of buffering do you need?
Starting point is 00:33:10 The assumption, I guess, is that you have some finite amount of persistent buffer. Ah, yes. So there's persistent and non-persistent as well. And that, I mean, it all happens case by case, right? You talk to this application, they say, yeah, it's okay if you lose that. Most of the time, the data center's powered on.
Starting point is 00:33:31 Once in a blue moon, it'll be powered off, and I've got replication to pull it in from elsewhere. So in that case, you know, it doesn't need to be persistently buffered. In another application space, maybe they're making really small writes, like 200 kilobytes, and so they're not able to fill up a whole flash page, and they don't want to rework their system to basically handle that data loss.
Starting point is 00:33:57 So then they slap in the persistent memory buffer to fill up a whole flash page. So is this going to be described somewhere in a kind of algorithms that you're using to construct stripes? Good question. So the way I see it, so I don't specifically have plans on publishing that kind of thing. The way I see it is once I get the interface going for people, there's going to be this explosion in the application space where, I mean, there won't be the necessity to change your application, but people start to see, oh, I can schedule my data in this way, and my write amplification goes from five to one. So why wouldn't I enable that new technology that has 20% of the endurance?
Starting point is 00:34:40 So I think you'll see a lot of interesting work come out all over the place in how to do this striping. Yeah. For the minimum cache write size, that's addressing a specific problem of the resterve effect on your lots are fully rented. Yes, so I highlighted this in part because it's something that we've seen on many generations, right? Anytime you have a multi-level cell,, it gets denser and it gets harder to manage. But the thing that's consistent is that you need to cache the last end pages. So my goal in this space is basically to rely on the community expertise to say, this is what Flash has been doing for the last 10 or 20 years, and this is how we expect it to continue to behave.
Starting point is 00:35:50 These are the patterns that we're seeing. On the device side, sometimes we don't find this out until we have the memory in our hands for a little bit. It's pretty late when we start to find out some of these things. Especially at higher levels. It's very hard to
Starting point is 00:36:11 provide insight on what we expect the N to do. I'd love to talk with you more about I don't know what you can share. Say again? I'm sorry. Why would a host need to know that? more about, like, I don't know what you can share, but. Say again? Why, I'm sorry, I have a question.
Starting point is 00:36:27 Why would the host needs to know that? You can still, in our model, you can still manage that transparently on the device? Well, this is an example of one that's not fully transparent. But it could be something that you can still in the device and the participation that you're saying, basically, is that what we have now with current SSDs
Starting point is 00:36:59 gives you a lot of wiggle room, right? You can come up with all sorts of creative solutions for managing the errors that you see pop up late, right? So I agree we need to be careful with how we change that to make sure that you still have freedom to do those things. But, I mean, kind of the abstraction we have right now is just requiring so many overheads and we're repeating algorithms across the whole stack. I mean, because you kind of opened up talking about the long test times, the fall times,
Starting point is 00:37:40 and now it just may shift from, say, the device. It may just shift it. It may not improve. Yeah. Yeah, there is that danger. Yep. So my question is more just to you as a system designer with respect to strings. So you've used strings before. Did you get a chance to test that out?
Starting point is 00:38:01 We've been playing with them, yeah. So I'm just curious, because from Box OEMs, guys that we've talked to that use them, there's this finite number of streams, which I can only imagine is far worse in Hyperscale. And how do you coordinate those streams, given that they're finite, they're ephemeral, with the fact they're limited? And how does the FTL deal with the fact that it only has so many streams? How does that all work practically speaking?
Starting point is 00:38:32 Using the FTL on the drive side? Just how to, what's the success been with streams? There's a very finite amount, number of streams, and you guys are probably creating tons of them and it's very dynamic. Yeah, so, I mean, that's a great question. Basically, streams as they are now, you basically get eight or 16 streams in a terabyte, and we're seeing adoption, but it's limited. So, only one of the many applications I see, it's a big one, but only one of them has actually found the
Starting point is 00:39:06 benefit worth the extra implementation they've had to do. And another application is taking that and saying, well, I still need hundreds of streams per terabyte. What if now I guarantee you that I'm only going to write sequentially in each of those streams. Can you provide me that better scale? And the answer is looking like yes. But again, this is, it's kind of a minor step and it applies to one more group. But then there's the next group after that that I've found in the last couple months who want thousands of streams per terabyte. What I don't hear anyone talking about, though, is how an FTL could possibly track thousands of streams.
Starting point is 00:39:48 Yeah. Hundreds. Yeah. There are none that do today. And the resources for that would be amazing. And coordinating that with the folks. Yeah. You're just adding overhead in order to do these things.
Starting point is 00:40:00 And so one of the major goals with Open Channel is to strip out a lot of those overheads and make the whole system more efficient instead of having everyone tracking more and more stuff on their respective sides of the fence. More importantly, I mean, the whole point of NVMe was to simplify things, right? So I mean, one of the reasons for success
Starting point is 00:40:22 in that was that it was very easily implemented and the problem shifted down Yes. into the operating system or use of stack to warrant this? Yes. Yeah. And then I have actually this, where that thing goes. And then I have all of these. So you're trying to say, how does that eliminate the things that you need to do in the house? That wouldn't be a problem. It would just happen.
Starting point is 00:41:30 So what do you think about this? You know, for your application, how do you decide? Is that completely, are you taking the entire source stack from the application down, or is that, is it something else? Yeah, so everywhere. Yeah, so I mean, Is it something else? So, everyone...
Starting point is 00:41:46 Yeah, so, I mean, one of the challenges in discussing these concepts across Microsoft is that each audience member has a different benefit that they could get from Open Channel, right? So when I talk to the qualification team, then for them it means they can put the drives into the data center sooner, mostly. And it's gnarly talking about, like, getting the same hardware working for everyone, right? So we've essentially shown that it can work for everyone at Parity
Starting point is 00:42:26 and those who want to make it better have the chance to optimize up to 5x write amp reduction which right away translates to now use QLC instead of TLC which is at least 10% cheaper and
Starting point is 00:42:44 get performance up 5x what it was before. And part of the reason that the benefit is so great is that we share our hardware massively, so we don't have the situation where we get the low-write amplification. There's almost no applications in the data center that can own a whole SSD. So getting getting it striped out to a large number of people makes a big difference. Yeah. So I'm a little confused by why you think this will actually speed up drive qualification. Okay.
Starting point is 00:43:25 With a new generation of drive from now with XLC, it requires entirely different algorithms for some of these management stuff. It seems like it will take a lot longer for that to perfectly up into the file systems and application layers and stuff. And it would slow down the introduction of new technology. And before all of that gets hung out, now the drive vendors have the device providers have the ability to work out those algorithms and have them in the device and provide something that is functional and working at the interface level not interdependent with the massive amounts of operating systems
Starting point is 00:44:15 and similar. Yeah. So I see the emphasis in a very different way, right? This interface isn't going to work if it needs to change with every NAND generation. So we haven't done our job and we haven't finished it if it changes with every generation. The thing that I do see every day is a new NAND
Starting point is 00:44:35 generation comes out, it's almost no different, right? You've changed some parameters, the silicon's even the same, you just get into test mode and you do some tweaks, and yet you have to go through the full qualification process. And there's two aspects to that that make it so long. Every time you do a new workload, you have to run it for two hours, right? You're not going to get consistent performance if you run it for five minutes because you have to precondition, you have to warm up the garbage collector. And I don't see that happening with new tweaks to NAND management that have to happen. I don't think you're going to come up with something that requires two hours of testing each workload. And I also don't think you're going
Starting point is 00:45:16 to come up with something that has corner cases in one of 500 different workloads that I could run on the device. So I see the emphasis in a much different area with very little changes to our current interface, to our current NAND generations. We have to run a huge amount of qualification. I could do that. You're saying that the log management would not change so much. Yeah. The management might change so much. Yeah. Media management might change.
Starting point is 00:45:46 Exactly. Yeah. So, for example, ECC, et cetera, and the latest optimizations at a physical level, that can happen in the cloud. Yeah, and for reference, I mean, look at other devices, right? They make little changes or even big changes to the media that's in the device,
Starting point is 00:46:07 and they rerun their qualification. The workloads they run, they can run for five minutes. With SSDs, I mean, every other week I'm telling someone new, no, you can't run it for five minutes. You're not going to get the same performance. And people don't realize the overheads it takes to qualify an SSD and how different they are from other devices. In your prototyping phase,
Starting point is 00:46:34 how many different types of N did you actually play with? Just the one. That's a problem. That is a problem, yeah. So, yeah. I will give you some space for a while. Your expectations have changed. It is small. I will tell you something. I've been in space for a while. Your expectations have changed. It is small.
Starting point is 00:46:49 So I've used one. Mattias, how many have you played with? We have all the major TLCs. All the major TLCs? Okay. So we've done one prototype. He's done, yeah. It's easy.
Starting point is 00:47:01 I have done it. But the FTL changes as they change, the hardware. Yeah. Yeah. You're going to see the fact that media managers still in the device. All the things to do with disturbs, data retention, nastiness stays in the device. It's just putting the address translation and the garbage collection commands in the host. All the stuff you're speaking of stays in the device.
Starting point is 00:47:24 But garbage collection and other algorithms do change in the FTL as they change. Yeah. They can be decoupled. Yeah, and I don't mean to minimize this decoupling. This is going to be our biggest challenge, right? We have to come together and figure out what this division of labor looks like to get it right. And that's... My fear is that that you come out with
Starting point is 00:47:45 a 6LC whatever, you know, and yes, your stuff works, but the thing dies after two months. You know, I mean, that's what I'm... I mean, the idea of this, you know, how much, you know, how much interplay between the two layers is
Starting point is 00:48:01 necessary. We're assuming right now that it's not possible. We're actually seeing the opposite. We're talking about people using open channel in order to enable QLC. Our applications can't use QLC because the write amplification is too high. I definitely like that.
Starting point is 00:48:19 You're getting the DRAM out of the QLC. Yeah, that would be fantastic. DRAM out. Use half the flash again. Yeah, yeah. I mean, that's an easy win. Once you get the remapping in the host, a lot of the applications are already doing that mapping, so it just goes away.
Starting point is 00:48:41 All right, I think they're going to kick me off in a couple of minutes, so unless there's other burning questions, thanks for the discussion. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snea.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
