Storage Developer Conference - #68: Andromeda: Building the Next-Generation High-Density Storage Interface for Successful Adoption
Episode Date: March 27, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair.
Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community.
Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast Episode 68. Hi everyone, my name is Laura Caulfield and
I'm a developer in the hardware division of Microsoft Azure.
My responsibilities there are in the design and pathfinding work area for solid-state storage in our data centers.
Today, I want to tell you about some of the technology trends and application requirements I see in this role
and the interface changes that we're adopting to support cloud-scale workloads for the foreseeable future.
So first, we're going to start out with a brief primer on current SSD architecture
and the priorities of today's cloud applications.
Next, we'll dive into the prototype that we've been building for use across Azure.
And then last, we'll explore the current state-of-the-art interface between host and drive that we're looking to develop.
So first, we'll dive into the design principles for cloud hardware.
These tend to apply to more than just storage, but I've tailored them a bit for Flash.
So first, we need to be able to support a broad variety of applications.
Azure alone has over 600 different services, and we use the same set of hardware for our mail servers and our office suite and our search engine and others. This huge number of
applications is also a huge amount of hardware, so we need to make sure that our supply chain
is very healthy. And we do this in part by making sure all of our devices from different
manufacturers behave almost in the same way, at least in the areas that we require.
We also are seeing rapid evolution of NAND generations. It's actually still
following Moore's law. But on the flip side, we're seeing huge qualification times, right?
The complexity of the firmware and its interactions with the host
are turning the test time for any single workload into hours,
and we have hundreds of workloads that could each present their own corner case
that matters and is important to find and debug
before we put this hardware into the data center.
Last, we need to make sure that our software
has enough flexibility to evolve faster than our hardware.
We use SSDs for at least three to five years,
typically more because they last so long.
And the process for updating this hardware and the firmware
is much heavier than daily pushes of software updates.
So now that we've seen the environment
that we're working with in the cloud,
let's take a look at what the technology provides. So how many of you guys are really familiar with
what goes on inside of an SSD? Garbage collection, write amplification. All right, so we're about half
and half. I'll try to go quickly on the background. In your SSD, essentially, it's mostly flash memory.
And each flash memory die is composed of a set of flash blocks.
And then within the block, you program individual flash pages.
Program, write: those two words are used interchangeably.
And the flash itself has this bulk erase operation.
And then you have to write to each of the pages in order.
So, to translate this type of interface into the standard update-in-place interface, the SSD has a
large amount of DRAM: for every terabyte of storage, it has about a gigabyte of DRAM.
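To make that rule of thumb concrete, here is a quick back-of-envelope sketch (my own illustration, assuming a flat map at 4 KB granularity with 4-byte entries; the real map layout is up to the firmware):

```python
# Rough sizing of a flat logical-to-physical map, assuming 4 KB mapping
# granularity and 4-byte entries (illustrative numbers only).
TB = 10**12
KB = 1024

def map_size_bytes(capacity_bytes, granularity_bytes=4 * KB, entry_bytes=4):
    """DRAM needed for a flat L2P map: one entry per mapped unit."""
    entries = capacity_bytes // granularity_bytes
    return entries * entry_bytes

print(map_size_bytes(1 * TB) / 10**9)          # ~1.0 GB per 1 TB at 4 KB granularity
print(map_size_bytes(1 * TB, 8 * KB) / 10**9)  # ~0.5 GB if the map moves to 8 KB
```

Coarsening the mapping granularity is the same lever that comes up again later when the map moves into the host.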
So now, when the host is writing data, say we have four streams of data. Each
individual stream is sequential, but they're happening at the same time.
The drive accepts this data into its data cache until it has enough
data to fill a whole flash page. At this point, it updates the address map and writes the data
to the flash. The host continues to send data and the flash continues to fill up until all the
flash is full. Now at this point, the SSD controller's job is to free up some flash space.
It's got a whole bunch of over-provisioned, extra flash space. By this point, data has been overwritten, or maybe some applications
have trimmed their data altogether, and you have this fragmentation in your data and physical space.
So at this point, the SSD controller does garbage collection. It copies out the valid data, which is
the main step in garbage collection, to a new block.
And then it erases the old block.
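To make those mechanics concrete, here is a minimal sketch of the copy-then-erase step (purely illustrative toy code, not anything from a drive's firmware; block contents and sizes are made up):

```python
# Toy model of SSD garbage collection (illustrative only).
# A "block" is a list of page slots; each slot is either live data or None (stale).

def garbage_collect(blocks, free_block):
    """Reclaim the block with the fewest valid pages.

    Copies its still-valid pages into free_block (this copy traffic is the
    write amplification), then erases the victim so it can be rewritten.
    Returns (pages_copied, erased_block_index).
    """
    victim = min(range(len(blocks)),
                 key=lambda i: sum(p is not None for p in blocks[i]))
    copied = 0
    for page in blocks[victim]:
        if page is not None:            # valid data must be relocated
            free_block.append(page)
            copied += 1
    blocks[victim] = []                 # bulk erase: the whole block at once
    return copied, victim

# Example: three 4-page blocks with varying amounts of stale data.
blocks = [["a", None, "b", None], [None, None, None, "c"], ["d", "e", "f", "g"]]
spare = []
print(garbage_collect(blocks, spare))   # picks block 1, copies one page ("c")
```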
Now this write amplification is enemy number one.
All it does is put additional overhead in the drive.
It eats away at performance.
You get down to 20% of the performance in steady state for random workload.
It also eats away at endurance, which is very limited in Flash.
And becoming more limited as the density scales
up. So we looked at this and said, how are we going to reduce the write amplification? We'd
like to buy these off-the-shelf parts, so let's try caching up our data on the host side, right?
If we cache up a whole block and just write a whole flash block at once, we'll get write
amplification down to one, right? Unfortunately, we ran the experiment and saw a much different picture.
Writing four megabyte random chunks of data
wasn't much different than writing
four kilobyte random chunks of data.
And we had to increase the block size up to a gigabyte
before we saw the write amplification improvements
that we wanted.
Now, this might be okay if we have four streams of data,
but now scale up to the hundreds or thousands
of streams of data that we have, and the host side buffer cache becomes untenable. Now this gets at the
heart of a fundamental design trade-off that cloud system designers see very differently from SSD
designers. I've given you a hint about the cloud SSD designers, but let me first dive into the
perspective of the SSD design. When you have your SSD, you've got actually a large array of flash memory, not just a single die.
So you'd like to scale out your performance as much as possible.
And the controller sensibly does this by caching up enough data to stripe it across a whole flash page across all the dies.
So the data keeps coming in, keeps getting written, and your effective flash page size and flash block size
is now multiplied by the number of dies in your device.
A little M.2, for reference, has 64 dies.
So this is how we quickly get up from the 4-megabyte flash block size
to the effective 1-gigabyte flash block size.
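Here is a rough sketch of that multiplication (the per-die block size and plane count are assumptions for illustration; with 64 dies and multi-plane striping on top you land in the gigabyte range described here):

```python
# Effective erase/reclaim unit when the controller stripes across the array
# (illustrative; plane count and per-die block size are assumptions).
MB = 1024 * 1024

def effective_block_size(per_die_block_bytes, dies, planes_striped=1):
    """Data the host must treat as one reclaim unit once writes are striped
    across every die (and optionally every plane) in the device."""
    return per_die_block_bytes * dies * planes_striped

print(effective_block_size(4 * MB, dies=64) / MB)                    # 256 MB across 64 dies
print(effective_block_size(4 * MB, dies=64, planes_striped=4) / MB)  # 1024 MB with 4-plane striping
```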
And this is our current state of the art. But if you look at how flash has
scaled over time, as the density of flash goes up, the block size also goes up. So this is what the
SSD looks like. Now let's switch back to the applications and see in particular how their
design trade-offs would establish how you'd map their data to the physical flash array.
We'll take a look at three applications, the first of which is our Azure storage backend.
Now, this is a lower tier in a storage hierarchy, so by this point, the data is organized into
hundreds of distinct streams, each written to sequentially. It's got a couple of priorities
in the performance space. It can scale up its
performance by scaling up the number of streams on a device. Any single stream doesn't have to
get great throughput as long as the whole system gets good throughput. And they try to keep the
reclaim unit size as small as possible because this helps the end latency of their whole system.
Now, these application priorities map to what I call a vertical stripe.
This is where each stream is scheduled to its own block on its own die.
It has the append-only semantics,
so your write amplification becomes very low.
Each stream is isolated, so if you trim one stream,
the other one isn't fragmented.
You get high throughput by increasing the number of streams on your drive,
and you get the smallest effective block size possible: that's four megabytes today, which might
become more megabytes tomorrow. Our next application is a legacy application scheduled in a virtual
machine. Now, these tend to be, I mean, all over the map, right? But they can have small updates, they can have bursty
performance. But the bottom line here is they expect the same type of behavior that they saw
with legacy SSDs. So for this, it makes sense to stripe across blocks on different dies.
You get the bursty performance, the high throughput for any given user.
But the big difference here from legacy SSDs is that you've scheduled your VMs to different sets of blocks.
So when a VM is well-behaved and decides to write sequentially,
it can get low write amplification without the other VMs fragmenting it.
And also, when you close a VM, then it doesn't fragment other VMs when it trims its data. Our last application is
a new application that's still run within a VM guest. Now, the host is still scheduling
a horizontal stripe, scheduling each VM to the horizontal stripe. So the design knobs that this
application has are dividing up those blocks
within that stripe, perhaps for a set of different logs that it has. And so this is what I call the
hybrid stripe, where you have a horizontal stripe, but then you've further striped it into vertical
stripes. Now there's a few things to note. One more thing before I say that. So there's a wide variety of applications here,
and you might imagine that they're on different machines
doing completely different things in their own world.
But in fact, to make the best use of our hardware
and scale up and down with demand,
we need the flexibility to be able to put all of these applications
in the same SSD at the same time.
So we might want a vertical stripe here,
a horizontal stripe over
there. We essentially need the flexibility to partition these dies out and these blocks out
into whatever configuration dynamically. Now there's a few things to notice here.
All we really need is a chance to expose all these log write points, right?
We don't need access to the NAND idiosyncrasies.
In fact, we don't want access to those things.
We want that to still be managed in the drive.
And the most fundamental difference here
is that as we scale up the capacity in the data center,
we're not scaling up the size of each application.
We're scaling up the number of each application.
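As a rough sketch of those three stripe shapes, here is a toy placement model (entirely illustrative; the function names and the (die, block) write-point tuples are invented for this example, not a real API):

```python
# Toy placement policies for mapping streams onto a flash array
# (illustrative only; a real interface would deal in physical addresses).

def vertical_stripe(stream_id, num_dies):
    """One stream per die: smallest reclaim unit, strong isolation."""
    return [(stream_id % num_dies, "next_free_block")]

def horizontal_stripe(vm_id, num_dies, blocks_per_vm):
    """One VM striped across every die for bursty, high-throughput I/O,
    but scheduled onto its own set of blocks so other VMs aren't fragmented."""
    return [(die, f"vm{vm_id}_block{b}") for die in range(num_dies)
            for b in range(blocks_per_vm)]

def hybrid_stripe(vm_id, log_id, num_dies, dies_per_log):
    """A VM gets a horizontal stripe, then carves it into narrower vertical
    stripes, e.g. one per log the guest application maintains."""
    start = (log_id * dies_per_log) % num_dies
    return [((start + i) % num_dies, f"vm{vm_id}_log{log_id}_block")
            for i in range(dies_per_log)]

print(vertical_stripe(stream_id=5, num_dies=64))
print(len(horizontal_stripe(vm_id=0, num_dies=64, blocks_per_vm=1)))  # 64 write targets
print(hybrid_stripe(vm_id=0, log_id=2, num_dies=64, dies_per_log=4))
```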
So to support these priorities, we've boiled it down into three aspects that this new interface needs to have.
First, we need to replace the block abstraction with something that looks more like append-only write points, and lots of them. We need the interface to have the flexibility to scale up to hundreds or thousands of write points per terabyte.
We'd also like, if we're guaranteeing
that we're going to be writing in this way,
we don't want the drive to have to reserve
large amounts of flash or large amounts of DRAM
in the case that we're going to revert back to the legacy system.
The next aspect that we need in this new interface is the ability for the host to place the data physically.
It needs to be able to understand whether it's co-scheduling on the same die as another application
or isolated on its own, or whether it's divided among different blocks
and it needs to be able to make the trade-off between reclaim unit size and throughput
Sorry, I've got a smoke thing going on from the wildfires up in Seattle.
Okay, so the last aspect we need, which has historically been kind of minimized in open channel interfaces,
is to keep the reliability management down in the drive, right?
We're going to have some interesting challenges in defining this interface
to make sure that the flash can continue to scale on Moore's Law. We need to enable innovation in that space for new ECC algorithms
or whatever gnarliness that I'm not as familiar with these days.
So taking a step back and looking at these priorities,
I can see how it's kind of evolved over time in the community.
So we've got our log abstraction, our in-host data placement policy,
and our in-drive reliability.
And I'm going to show some different interface proposals
that have evolved mostly over time.
In the early days, then, the community realized
that there's huge overheads in SSD storage systems, right?
We designed SSDs to match what hard drives did so that we could get them into the market, and it worked.
It was successful. It was a great strategy.
But then people started to realize how much overhead it takes to mimic the hard drive.
And so we addressed these overheads first by pulling more control into the drive.
Unfortunately, this has the side effect of locking the software innovation into that pace of hardware
innovation, right? You can only move as fast as your firmware is going to evolve. And you can't
try new things between the software and firmware in a very agile way. So next on the scene were
multi-stream SSDs and I/O determinism, and this is where we started
discovering the benefits of isolating different users down to the hardware, and also some of the
ways to place the data in the drive. Unfortunately, the constraints
in this space are kind of a double-edged sword. Both of these interfaces support legacy I/O patterns,
and so they retain all of the DRAM and flash overheads
that are required to quickly revert back to those legacy I/O patterns.
And so then the final set that we're considering here are in the open channel space,
and this can mean a lot of things.
There's a huge spectrum of proposals there,
and there's a lot of evolution that's happened recently.
And so my point here is that we need to make sure that this interface evolves
so that the media management can stay in the drive
and that these systems can become production-ready.
So our next steps are to take this interface
and further develop it into a production-ready system.
So now that we've evaluated applications,
priorities, technology trends,
and some of the available options for the new interface,
then our next steps here were to build proof of concept and evaluate the overheads and performance
that it gets. This is also our first step towards creating a system that's viable for a replacement
for a conventional storage stack, right? We still have to support those legacy applications with the same level of performance as conventional SSDs.
So let's start out with defining a few terms,
since so many people have so many different definitions
for the same set of words.
Open channel SSDs, at the kernel of it,
we're exposing physical access,
or physical addresses from the drive to the host,
such as channels, right?
So the channels are open.
We're also going to be talking about the flash translation layer.
This has conventionally been in the SSD's firmware,
and we're starting to see it separate into two distinct parts,
one called log management, one called media management.
Now, the log manager's main role here is
to receive any kind of write, maybe an update in place, maybe a log-structured write,
and to emit I/O patterns that are definitely sequential,
maybe to one or more write points. In order to do that,
it maintains the address map and performs garbage collection. The media manager's
main role is to basically translate
the gnarly physics of flash memory into something
that looks more like a logical sequence
for how software should access the media.
So this set of algorithms is conventionally like ECC algorithms,
read-retry, read-scrubbing,
all the things that change with the NAND generations and across vendors.
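To make that division of labor concrete, here is a minimal sketch of the two roles (an illustration of the split only; the class and method names are invented, and ECC, read retry, and garbage collection are reduced to comments):

```python
# Illustrative split of a flash translation layer into the two roles
# described above (names and signatures are invented for this sketch).

class MediaManager:
    """Drive-side media management: hides ECC, read retry, scrubbing, and the
    rest of the NAND-generation-specific physics behind append/read."""
    def __init__(self, write_points=8):
        self.logs = [[] for _ in range(write_points)]

    def append(self, write_point, data):
        self.logs[write_point].append(data)            # ECC encode, program page, etc.
        return (write_point, len(self.logs[write_point]) - 1)

    def read(self, phys):
        wp, off = phys
        return self.logs[wp][off]                      # ECC decode, read retry, etc.

class LogManager:
    """Log management: accepts any write pattern, emits strictly sequential
    writes to one or more write points, and owns the address map
    (garbage collection is omitted from this sketch)."""
    def __init__(self, media):
        self.media = media
        self.l2p = {}                                  # logical address -> physical location

    def write(self, lba, data, write_point=0):
        phys = self.media.append(write_point, data)    # always sequential per write point
        self.l2p[lba] = phys
        return phys

    def read(self, lba):
        return self.media.read(self.l2p[lba])

ftl = LogManager(MediaManager())
ftl.write(lba=42, data=b"hello", write_point=3)
print(ftl.read(42))                                    # b'hello'
```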
So with these terms in place, now we can talk about open channel and the two main variants.
So here on the right, we have, next to a standard SSD, the two ends of the spectrum for open channel.
So in standard SSDs, the entire FTL is down in the SSD, right?
In the early open channel SSD systems, including our prototype,
that entire FTL is shifted up into the host.
This includes the media management and the log management.
And this gives us the best flexibility for playing around with these algorithms and understanding
the best division of labor between the log manager and the media manager.
But this isn't a system that I can take to my boss and say we can take to production
because the drive can't be warrantied.
I have to recompile the host side driver for every new NAND generation.
So it's important that we reach this final destination
where the media management is pulled back down into the drive.
All right, so without further ado, here is our open channel prototype running in what we call
Windows Cloud Server, basically the type of server that runs in Azure and our other applications.
On the right, we have the prototype card. It's not our standard M.2 form factor
because it's a prototype,
but it has essentially the same architecture.
It's also showing up here in the disk manager,
and the two black screenshots are of the system
running under our conventional
qualification tools.
In the center we have StorScore running
the standard four corners that you hear about:
random read, random write,
sequential read, sequential write.
And then in the bottom left corner we have a tool
that can do secure erase and some of these other operations.
And it's exposed as an open channel type of SSD.
So this is a very real system.
It's pretty exciting to see running.
One of the first things we wanted to measure is what opportunity we have for optimization,
right? We all keep talking about reducing the map size and reducing the total end-to-end CPU
overhead by reducing the total amount of work. What do those overheads actually look like?
So we looked at three areas.
The first of these is write amplification, right?
This is enemy number one.
We want to reduce that as much as possible.
We look at a reasonable worst-case workload,
which is highly fragmented.
This is a 4K random write workload.
And in our normal SSDs,
we usually see a write amplification factor
between four and five. In this particular system,
we saw a write amplification factor of four, and it moves: the amplification is not happening in the drive now,
but in the host.
Our next area that we looked at was memory. Now, this is essentially the address map. So,
when you do a 4K addressing, then you have the one gigabyte of DRAM for the 1 terabyte of flash. And that's
exactly what we saw, right? Our memory usage went from 1 gigabyte in the drive to 0, and
the host from 0 gigabytes to 1 gigabyte. And this one is an easy one to optimize for applications, right? If I/O patterns
start shifting from four kilobytes to eight kilobytes, then right away you can move your
mapping granularity to 8K, and you won't get that performance hit of read-modify-write, right? And you don't have to
lock your decision into when I'm manufacturing the hardware, how much DRAM do I have to put on my
device? It's dynamically available. The last area we looked at was CPU overhead.
We're shifting all this work up into the host now. It's doing more activity
in the host rather than in the drive. Now, there are some overheads in this system that are
specific to the prototype. There were design decisions we had to make to get the prototype
out the door. And in our next iteration of software development,
those will go away right away. And then beyond that, there's further optimization we can do
by just reducing the overall activity, right? Same time we reduce the write amplification,
we'll be reducing the CPU activity as well. So the big takeaway here is that we've quantified the overheads.
Now we understand how much optimization opportunity we have.
We also need to make sure that we're addressing those legacy applications, right?
We can't ask everyone to rewrite all their applications to get the savings.
So we need to make sure they're behaving at least at parity.
So I'm happy to say they are.
Here we've just run the four corners plus a mixed workload
and measured the throughput.
In gray we have the average of our standard SSDs
that we're qualifying right now.
It's an average of three SSDs.
And then our open channel SSD is in green,
and our standard SSDs are in various shades of blue.
So in looking at this, you can get the high-level takeaways.
Basically, the read performance is fantastic.
It's in the top of the class or even better.
The write performance is a little on the low side,
but these FTL algorithms also haven't gone through the optimizations
that FTLs typically go through right before shipping.
So I have every faith that that write performance will come up.
The second workload that we looked at is notoriously challenging for SSDs, but it's one that
I smile because the application designers every month or two come to me and say,
what the heck is going on with my read performance?
And basically, it's a mismatch
in what's going on in the background activity. So they're doing writes in the background. Oftentimes,
the benchmarks aren't doing writes in the background. So in this one, we are doing
writes in the background. And this SSD does just as well, if not better, than the other SSDs. I'm
showing average latency in the purple and then scaling up through two-, three-, four-, and five-nines percentile latency.
And then a maximum latency run over about an hour after preconditioning.
So the open channel SSD here gets maximum latency better than 10 milliseconds, which is better than the other drives.
And then the other two-nines through five-nines latencies are about on par. So overall,
this proof of concept has been very successful, and we've seen that legacy applications can
perform at parity, and now we've set up our system, and it's ripe for optimization by all
of our different application groups. So then our next step is to make this system production ready.
And the important part here is to make sure
that all of our SSD vendors can implement it
and we have a standard interface that works for all of them.
And so in the next section, I'm going to dive into
what we currently have in...
Yeah, the current state of the art for the interface that we're looking at.
At the core is the physical page addressing interface, right? We keep talking about having a
physical address. So the address format now, instead of having logical addresses, has a segment
for each of the main architectural elements in an SSD.
You have your channel, which is an SSD channel. You have your parallel unit, which maps to a
NAND die. You have a chunk, which maps to a multi-plane block, and then your sectors and pages.
Now, all these addresses map to a physical location. They don't change,
which also means that the host is exposed
to the access pattern required by the flash.
We have to erase a whole chunk before writing any of the sectors,
and then we have to write the sectors sequentially.
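Here is a sketch of what packing such an address might look like (illustrative only; the field widths are arbitrary, not an actual geometry, which a real device would report to the host):

```python
# Illustrative physical-page-style address: channel / parallel unit (die) /
# chunk (multi-plane block) / sector. Field widths here are arbitrary.
from collections import namedtuple

PPA = namedtuple("PPA", "channel parallel_unit chunk sector")

FIELD_BITS = {"channel": 4, "parallel_unit": 6, "chunk": 12, "sector": 10}

def pack(ppa):
    """Pack the address fields into a single integer, lowest field last."""
    value = 0
    for name, bits in FIELD_BITS.items():
        value = (value << bits) | (getattr(ppa, name) & ((1 << bits) - 1))
    return value

def unpack(value):
    """Recover the individual fields from a packed address."""
    fields = {}
    for name, bits in reversed(list(FIELD_BITS.items())):
        fields[name] = value & ((1 << bits) - 1)
        value >>= bits
    return PPA(**fields)

addr = PPA(channel=3, parallel_unit=17, chunk=250, sector=9)
assert unpack(pack(addr)) == addr
print(hex(pack(addr)))
```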
This last bullet here about the cache minimum write size
is pretty cool.
So this is one example of how we're defining an interface that works for how NAND has scaled
in the past and exposes a logical thing that the host-side software can work with. Let me tell you
more about what I mean here. Basically, your NAND cells, your memory element, contains more than one bit of data, right?
Your MLC typically has two bits, and then your TLC has three bits.
And any time, well, okay, so these bits are also split across different pages.
So it's possible to write one page and have your memory element half-written.
And when it's half-written like this, then the written data is more susceptible to errors caused by reading of this data.
So one of the gnarly things that SSDs do now is make sure that you're not reading from those half-written memory elements.
Now enter cache minimum write size.
Basically, the contract between the host and the drive is now to say,
don't read from these last n pages of data.
Make sure you cache at least n kilobytes of data
and read from that cache instead of reading from the flash physically.
Now you can take this another way and say,
well, let's take that cache and put it in the drive.
Let's hide it from the host.
And in that case, then you just expose a cache minimum write size of zero.
There's a funky little animation here I didn't go through.
But basically, the picture on the left here is a picture of your memory elements.
Each row is a memory element,
and then each column is a different bit in that memory element.
The numbers are the pages in order.
So at this point, we've written up until page 18.
Now we write page 19 and half fill that memory element.
Now we write page 20.
And then at page 21, you've filled that memory element,
so it exits the window of the cache minimum write size.
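Here is a toy sketch of how a host writer might honor that contract (my own illustration; the device object and the three-write window are made up, and a real window would be expressed in pages or kilobytes as the drive reports it):

```python
# Toy host-side writer that honors a "cache minimum write size" contract:
# the last N writes to a chunk stay in a host buffer, and reads in that
# window are served from the buffer instead of the partially-programmed flash.
from collections import deque

class ChunkWriter:
    def __init__(self, device, cache_min_writes):
        self.device = device
        self.cache = deque(maxlen=cache_min_writes)   # (offset, data) of last N writes
        self.next_offset = 0

    def append(self, data):
        offset = self.next_offset
        self.device.program(offset, data)             # sequential program into the chunk
        self.cache.append((offset, data))
        self.next_offset += 1
        return offset

    def read(self, offset):
        for cached_offset, data in self.cache:
            if cached_offset == offset:
                return data                           # still inside the unsafe window
        return self.device.read(offset)               # fully programmed, safe to read

class FakeChunk:                                      # stand-in device for the example
    def __init__(self): self.pages = {}
    def program(self, off, data): self.pages[off] = data
    def read(self, off): return self.pages[off]

w = ChunkWriter(FakeChunk(), cache_min_writes=3)
for i in range(22):
    w.append(f"page{i}")
print(w.read(18))   # outside the window: read from flash
print(w.read(21))   # one of the last 3 writes: served from the host cache
```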
So this is kind of a hint at one of the things we're going to have to do to define a good interface that works for the drive and for the host and scales with new generations of technology.
The next one you're probably familiar with.
It's a tradeoff that's been going around in the IO determinism circle as well.
And basically, we have a tradeoff between providing good quality of service and providing good reliability. I'm seeing some smiles back there. Basically,
as soon as you break up your RAID stripe, you stripe across dies, right? And as soon as you
break that up to get good quality of service, now you don't have your RAID providing a good
reliability. So RAID and isolation are at odds,
and this is an ongoing discussion
that is being solved in the I/O determinism space
and is overlapping with the open channel space.
So we really do have a spectrum of interfaces here,
and we're going to find some place in the middle
that addresses hopefully both sides.
Ah, this last bullet here.
So in the cloud, we have the benefit of sometimes,
in some applications, doing a higher level of replication that can provide the higher reliability.
So this is what allows us to maybe reduce the reliability provided by the drive
to a known lower level of reliability
because this higher level of replication can rebuild the data.
Okay, and then my last topic on the new interface is kind of scaling back a little bit from the pure
physical addressing. So we want to do physical addressing, but in moderation. Even in SSDs, we saw how they remap bad blocks through a sparse map,
so it's kind of a low overhead option.
And I see a similar thing possibly happening with open channel.
Basically, at your block level, then you have bad blocks show up,
and you want to perhaps have a sparse map to map those out. But if you make your block
logical, you can also have the drive provide wear leveling guarantees at the LUN level.
So now you split your wear leveling into these two parts, where the drive guarantees a die-level
wear level, and then the host uses its normal migration patterns to wear level across the dies. So in terms of wear leveling,
we're kind of moving our boundary from the drive level to a LUN level.
And my final point, which I didn't write up here, is that we can do this with a block,
but maybe not with a page or a channel because your block in Flash is independent of physical location, right?
I expect the same guarantees from block two as from block three.
And if I schedule two applications in the same die
on these two different blocks,
I can kind of map it around and not notice any difference.
And in fact, that's even more the case with NAND
than it is with hard drives.
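Here is a rough sketch of how small such a sparse remap can be (my own illustration; a real drive would also fold in the die-level wear-leveling guarantee mentioned above):

```python
# Sketch of a sparse bad-block remap table: logical chunk addresses pass
# straight through unless the underlying block has been retired, in which
# case the drive substitutes a spare block on the same die (illustrative).

class SparseRemap:
    def __init__(self, spares_per_die):
        self.remap = {}                       # (die, chunk) -> spare chunk
        self.spares = dict(spares_per_die)    # die -> list of spare chunk ids

    def retire(self, die, chunk):
        """Called when a block goes bad: map it to a spare on the same die."""
        self.remap[(die, chunk)] = self.spares[die].pop()

    def resolve(self, die, chunk):
        """Translate a host-visible chunk to the physical chunk actually used."""
        return self.remap.get((die, chunk), chunk)

table = SparseRemap({0: [1000, 1001], 1: [1000, 1001]})
table.retire(die=0, chunk=42)
print(table.resolve(0, 42))   # 1001: redirected to a spare
print(table.resolve(0, 43))   # 43: passes through, no table entry needed
```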
Okay, so I'm actually finishing a bit early with the conclusion,
so I'm hoping we'll have a good discussion after.
We've seen over these slides that it's important for us to architect this interface
to address both the cloud scale that we're seeing today
and the cloud scale we expect to see in the future
with hundreds and thousands of workers per terabyte.
But we also want to make sure we have the correct division of labor
to allow NAND to continue to scale.
I showed you our proof of concept and the data that we got out of it,
basically showing the system overheads
and how we're ready to optimize at the application level
with this open channel interface. And the final steps that we have to do to bring this production
involve the whole community in defining this interface, which brings me to my final point.
The final solution for this interface needs input from the whole community,
right? We want it to work for all of our different
drive designs, for all of our different hyperscalers. So it's a great time to jump in and
talk about what this interface should look like and what things we need, what mistakes we need
to make sure we're not repeating. So with that, thanks for your attention, and I'm happy to take
questions now. Thank you.
Yeah.
Timeline for production.
So I hesitate to say because I haven't finalized this with everyone,
but some things I can say is that Open Channel,
we hope to be faster to implement on the drive side.
We don't have to have each vendor implement new FTL algorithms
and vet them out.
And we've already done the majority of the work
of getting the complex FTL algorithms working in the host.
So next step is getting the interface defined,
and that's hopefully kicking off now.
And once we have that,
maybe a year after for final hardware.
Yeah.
Yeah.
You mentioned that there are different kinds of applications that you need to support.
For example, there's a streaming application where there are multiple streams and it's
best for each stream to have a narrow stripe, just going to one die at a time.
And there are other applications such as VMs where it's bursty and it's good to stripe wide.
So the final solution that you're proposing,
do you have some sort of dynamic approach?
For basically building that picture?
Yeah.
Yeah, that's what I would like to see.
And that's the way everyone in the company is talking, right?
It's how they already operate.
And somehow the amount of buffering also factors in here.
Because ideally you would like to buffer as much as possible
so you can then leverage the bandwidth for all of them.
You're saying like buffer up, sorry.
Like for the first application,
make sure you buffer up enough to fill all the dies
with your write.
What kind of buffering do you need?
The assumption, I guess,
is that you have some finite amount of persistent buffer.
Ah, yes.
So there's persistent and non-persistent as well.
And that, I mean, it all happens case by case, right?
You talk to this application, they say,
yeah, it's okay if you lose that.
Most of the time, the data center's powered on.
Once in a blue moon, it'll be powered off,
and I've got replication to pull it in from elsewhere.
So in that case, you know,
it doesn't need to be persistently buffered.
In another application space,
maybe they're making really small writes,
like 200 kilobytes,
and so they're not able to fill up a whole flash page, and they don't want to rework their system to basically handle that data loss.
So then they slap in the persistent memory buffer to fill up a whole flash page.
So is this going to be described somewhere in a kind of algorithms that you're using to construct stripes?
Good question.
So the way I see it, so I don't specifically have plans on publishing that kind of thing.
The way I see it is once I get the interface going for people,
there's going to be this explosion in the application space where, I mean, there won't be the necessity to change your application, but people start to see,
oh, I can schedule my data in this way, and my write amplification goes from five to one.
So why wouldn't I enable that new technology that has 20% of the endurance?
So I think you'll see a lot of interesting work come out all over the place in how to do this striping.
Yeah.
For the minimum cache write size, that's addressing a specific problem of the read disturb effect when your cells aren't fully written? Yes, so I highlighted this in part because it's something that we've seen on many generations,
right? Anytime you have a multi-level cell, it gets denser and it gets harder to manage.
But the thing that's consistent is that you need to cache the last n pages.
So my goal in this space is basically to rely on the community expertise to say,
this is what Flash has been doing for the last 10 or 20 years,
and this is how we expect it to continue to behave.
These are the patterns that we're seeing.
On the device side,
sometimes we don't find this out
until we have the memory in our hands for a little bit.
It's pretty late when we start to find out some of these things.
Especially at
higher levels.
It's very hard to
provide insight
on what we expect the NAND to do.
I'd love to talk with you
more about, like, I don't know what you can share, but.
Say again?
Why, I'm sorry, I have a question.
Why would the host need to know that?
You can still, in our model, you can still manage
that transparently on the device?
Well, this is an example of one
that's not fully transparent.
But it could be something that you can still keep
in the device. And the point that you're making, basically,
is that what we have now with current SSDs
gives you a lot of wiggle room, right?
You can come up with all sorts of creative solutions
for managing the errors that you see pop up late, right?
So I agree we need to be careful with how we change that
to make sure that you still have freedom to do those things.
But, I mean, kind of the abstraction we have right now is just requiring so many
overheads and we're repeating algorithms across the whole stack.
I mean, because you kind of opened up talking about the long test times, the qual times,
and now it just may shift from, say, the device. It may just shift it. It may not improve.
Yeah.
Yeah, there is that danger.
Yep.
So my question is more just to you as a system designer
with respect to streams.
So you've used streams before.
Did you get a chance to test that out?
We've been playing with them, yeah.
So I'm just curious,
because from Box OEMs,
guys that we've talked to that use them, there's this finite number of streams, which I can only
imagine is far worse in Hyperscale. And how do you coordinate those streams, given that they're
finite, they're ephemeral, with the fact they're limited? And how does the FTL deal with the fact
that it only has so many streams?
How does that all work practically speaking?
Using the FTL on the drive side?
Just how to, what's the success been with streams?
There's a very finite amount, number of streams,
and you guys are probably creating tons of them
and it's very dynamic.
Yeah, so, I mean, that's a great question. Basically, streams as they are now,
you basically get eight or 16 streams in a terabyte, and we're seeing adoption, but it's
limited. So, only one of the many applications I see, it's a big one, but only one of them has actually found the
benefit worth the extra implementation they've had to do. And another application is taking
that and saying, well, I still need hundreds of streams per terabyte. What if now I guarantee
you that I'm only going to write sequentially in each of those streams. Can you provide me that better scale? And the answer is looking like yes. But again, this
is, it's kind of a minor step and it applies to one more group. But then there's the next
group after that that I've found in the last couple months who want thousands of streams
per terabyte.
What I don't hear anyone talking about, though, is how an FTL could possibly track thousands
of streams.
Yeah.
Hundreds.
Yeah.
There are none that do today.
And the resources for that would be amazing.
And coordinating that with the folks.
Yeah.
You're just adding overhead in order to do these things.
And so one of the major goals with Open Channel is to strip out a lot of those overheads and make
the whole system more efficient instead of having
everyone tracking more and
more stuff on their respective sides of the fence.
More importantly,
I mean, the whole point of
NVMe was to simplify things,
right? So I mean, one of the reasons for success
in that was that it was
very easily implemented and the problem shifted down. Is there enough benefit in the operating system or user stack to warrant this?
Yes.
Yeah. And then I have actually this, where that thing goes.
And then I have all of these.
So you're trying to say, how does that eliminate the things that you need to do in the host?
That wouldn't be a problem.
It would just happen.
So what do you think about this?
You know, for your application,
how do you decide?
Is that completely,
are you taking the entire source stack from the application down,
or is that,
is it something else?
Yeah, so everywhere.
Yeah, so, I mean, one of the challenges
in discussing these concepts across Microsoft
is that each audience member has a different benefit
that they could get from Open Channel, right?
So when I talk to the qualification team,
then for them it means they can put the drives into the data center sooner, mostly.
And it's gnarly talking about, like, getting the same hardware working for everyone, right?
So we've essentially shown that it can work for everyone at parity
and
those who want to make it better
have the chance to optimize up to
5x write amp reduction
which right away translates to
now use QLC instead of TLC
which is at least 10%
cheaper and
get performance up 5x what it was before.
And part of the reason that the benefit is so great
is that we share our hardware massively,
so we don't have the situation where we get the low write amplification.
There's almost no application in the data center that can own a whole SSD. So getting it
striped out to a large number of people makes a big difference. Yeah.
So I'm a little confused by why you think this will actually speed up drive qualification.
Okay.
With a new generation of drive, now with XLC, it requires entirely different algorithms for some of this management stuff.
It seems like it will take a lot longer for that to percolate up into the file systems and application layers and stuff.
And it would slow down the introduction of new technology.
And before all of that gets hashed out, now the drive vendors, the device providers, have the ability to work out those algorithms and have them in the device and provide something that is
functional and working
at the interface level
not interdependent with the
massive amounts of operating systems
and similar.
Yeah.
So I see the emphasis
in a very different way, right?
This interface isn't going to work
if it needs to change
with every NAND generation. So we haven't done our job and we haven't finished it if it
changes with every generation. The thing that I do see every day is a new NAND
generation comes out, it's almost no different, right? You've changed
some parameters, the silicon's even the same, you just get into test mode and you
do some tweaks, and yet you have to go through the full qualification process. And there's two aspects to
that that make it so long. Every time you do a new workload, you have to run it for two hours,
right? You're not going to get consistent performance if you run it for five minutes
because you have to precondition, you have to warm up the garbage collector. And I don't see that happening
with new tweaks to NAND management that have to happen. I don't think you're going to come up
with something that requires two hours of testing each workload. And I also don't think you're going
to come up with something that has corner cases in one of 500 different workloads that I could
run on the device. So I see the emphasis in a much different area
with very little changes to our current interface,
to our current NAND generations.
We have to run a huge amount of qualification.
I could do that.
You're saying that the log management would not change so much.
Yeah. The media management might change.
Exactly.
Yeah.
So, for example, ECC, et cetera,
and the latest optimizations at a physical level,
that can happen in the drive.
Yeah, and for reference, I mean, look at other devices, right?
They make little changes or even big changes
to the media that's in the device,
and they rerun their qualification.
The workloads they run, they can run for five minutes.
With SSDs, I mean, every other week I'm telling someone new,
no, you can't run it for five minutes.
You're not going to get the same performance.
And people don't realize the overheads it takes to qualify an SSD
and how different they are from other devices.
In your prototyping phase,
how many different types of NAND did you actually play with?
Just the one.
That's a problem.
That is a problem, yeah.
So, yeah.
I will tell you something. I've been in this space for a while.
Your expectations have changed.
It is small.
So I've used one.
Mattias, how many have you played with?
We have all the major TLCs.
All the major TLCs?
Okay.
So we've done one prototype.
He's done, yeah.
It's easy.
I have done it.
But the FTL changes as they change the hardware.
Yeah.
Yeah.
You're going to see that the media manager is still in the device.
All the things to do with disturbs, data retention, nastiness stays in the device. It's just putting the address translation and the garbage collection commands in the
host.
All the stuff you're speaking of stays in the device.
But garbage collection and other algorithms do change in the FTL as they change.
Yeah.
They can be decoupled.
Yeah, and I don't mean to minimize this decoupling.
This is going to be our biggest challenge, right?
We have to come together and figure out what this division of labor looks like to get it right.
And that's...
My fear is that you come out with
a 6LC whatever,
you know, and
yes, your stuff works,
but the thing dies after two months.
You know, I mean, that's what I'm...
I mean, the idea of this, you know,
how much, you know, how much
interplay between the two layers is
necessary. We're assuming right now
that it's not possible. We're actually seeing the opposite.
We're talking about people using
open channel in order to enable QLC.
Our applications can't use
QLC because the write amplification
is too high.
I definitely like that.
You're getting the DRAM out of the QLC.
Yeah, that would be fantastic.
DRAM out. Use half the flash again.
Yeah, yeah.
I mean, that's an easy win.
Once you get the remapping in the host,
a lot of the applications are already doing that mapping,
so it just goes away.
All right, I think they're going to kick me off
in a couple of minutes,
so unless there's other burning questions, thanks for the discussion.
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss
this topic further with your peers in the storage developer community. For additional information
about the Storage Developer Conference, visit www.storagedeveloper.org.