Storage Developer Conference - #104: Introduction to Open-Channel/Denali Solid State Drives
Episode Date: August 5, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair.
Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community.
Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference.
The link to the slides is available in the show notes
at snia.org/podcasts.
You are listening to SDC Podcast, Episode 104.
My name is Matthias, and I want to talk about Open-Channel SSDs.
I've been doing this for a couple of years now,
and I'm going to give an introduction to Open-Channel SSDs:
what is it, where are we going,
and then we're going to talk a little bit about what's next.
So we're at the stage now where we're going to look a little bit forward and see where we're going to end up
and what we're going to work on in the next one or two years.
So I previously was at a startup,
and now I'm at a big company,
so I get to present this nice-looking slide
which basically says that
you shouldn't trust what I say.
It doesn't mean anything,
but still take it with a grain of salt.
So that's all good.
But you've got lawyers.
I've got lawyers. I've got lots of them.
Like, that's SanDisk.
I mean, we're lawyered up.
So that's all good.
I mean, we don't need it anymore.
Yeah.
So there's, yeah, the motivation, the interface, the ecosystem, and what's next.
And there's some great news here regarding standardization.
For me personally, it's something very big,
something I've been looking forward to.
We're going to talk about that.
So open-channel SSDs, why are they interesting?
So there's this thing,
when you look at an SSD today
and you do 4K random reads to it,
and you kind of look at the latency
and how long it takes,
and then you look at the percentiles, like how many of the IOs complete within 100 microseconds,
150 microseconds.
And you can kind of draw this curve, and if you only have reads to an SSD, it's pretty
consistent.
Most of your IOs complete within 150 microseconds.
So that's great.
Obviously, over time, you will have these outliers,
but generally, your curve will look like that.
The problem with SSDs is that as soon as you start adding
just a small bit of writes, that's when it hits you,
and you start having these outliers.
Four milliseconds, six milliseconds, and so on.
And I mean, that's okay.
They're still really fast.
But the thing is, if you're running these things at scale
and you have 100 SSDs, 1,000 SSDs,
and you have to go ask each of them for this answer,
and you're kind of waiting on the last one in that row,
like you need to get all the data back
before you can give back the final answer,
then your latency will be those four milliseconds.
And that's not good enough.
So how can you either control it, make it better,
and how can you kind of make sure
that it doesn't happen so you avoid it?
And obviously you can say,
well, if I can isolate my workloads
and say, well, I have one SSD.
They used to be small, like a couple hundred gigabytes, and your workload could fit within that.
That's great.
The thing with NAND in general, the media that is within SSDs, is that they're getting bigger and bigger, and more and more dense.
And that means that
if you're a hyperscaler and you run virtual machine workloads,
then you're going to have multiple customers sharing the same drive.
And they're not coordinating.
One guy can read, another guy can write.
And suddenly everyone has really bad performance.
And that drops into
that kind of problem where you have a multi-tenancy environment,
where many users
are sharing the same drive,
and you get these unpredictable latencies.
The other part is that all these different users
have different workloads.
Some of them are running databases.
Some of them write in place, like a MySQL database.
Some of them write out of place, like RocksDB.
You have sensors, you have analytics, virtualization, you have video.
And they all have different characteristics, how much can be compressed, and so forth.
And in these environments, you stuff them all onto the same SSD.
How can you build an SSD that has to be generic for all these workloads? That's the hard part:
it's really hard to make an efficient SSD
that works for all these workloads.
So what usually happens is that a customer,
a hyperscaler or an all-flash array vendor, tells them:
that's the volume,
this is my workload, please go optimize for it.
And then an SSD vendor will go off
and do these optimizations.
And you can do that, but then you're optimizing
for that particular workload.
And the problem that especially hyperscalers have
is that obviously you know some workloads,
they don't change that much, but in general,
you don't know; you say my workload is this one week,
and the next week it's something else.
So you cannot say this is the general case,
and it changes too much, and they have to plan years
in advance for when the new drive comes,
when they're going to deploy, and so on.
So that's not really feasible for them,
so the idea here is:
can we shuffle this around a bit?
So before that, I just want to give a short introduction
to SSDs in general.
And basically, within an SSD,
you have this media on the bottom, the dies, the NAND dies.
And when you access those, you can read from them.
It takes like 50 to 100 microseconds.
When you write to them, it takes like one to 10 milliseconds,
and when you erase them, it's like three to 15
milliseconds. So the idea is that you can read anywhere
from within the NAND chip if you want to,
but you have to write sequentially within this thing
called a flash block, and if you want to write again, you have to erase it.
That's kind of the constraints that we work under.
There are a lot more details,
but those are the general kind of things around it.
And then the way that the SSD gets its performance
is that we're gonna shuffle a lot of them together,
and that's how we get the throughput of the drive,
and we kind of get this parallelism going on and so on.
That's why you see the curves: at low queue depths,
you have, like, so-so performance,
whereas when you start firing more queue depth at them,
that's when you get the bandwidth of the drive.
That's because you can utilize more of these dies
at the same time.
That's really great.
So what you probably noticed is that
NAND has this read-write-erase interface,
but up at the host,
we're actually talking read-write.
So there has to be something in between.
This can be in hardware or firmware, whatever.
But generally, you have to have this
logical-to-physical translation layer
where you take the host's read-write
and convert it into the media's read-write-erase.
So that's the translation map.
You have wear leveling, because NAND is not perfect at all.
I mean, you're lucky
if you can get your data back
unless you have some really good ECC scheme.
That's kind of what it boils down to.
So you need to be careful how you place the data.
You have to wear it evenly.
You have to do a lot of tricks
to make the promises of a normal hard drive,
and you do rewrites, so to get to that durability level,
you actually have to do quite a lot of work.
Then there's all this bad block management,
there's media error handling, which is where, yeah,
you do all the stuff that actually makes the media usable,
so it actually works as you expect it to.
So that's kind of what goes on within an SSD
in the broader picture.
But that's not the only thing.
The other thing is that when we look at the host,
it's actually also a problem. If I'm an application,
I sit on the top in user space,
and I'm RocksDB,
and I'm communicating:
I want to read LBA 10.
Fine.
And it hits my file system,
which is also log-structured.
So RocksDB is log-structured,
and the file system is a log-structured system as well,
and what goes on is, when I hit my file system,
it might convert it into LBA 20.
Fair enough.
And then the thing is, the flash translation layer below,
that's also kind of a log-structured system.
So, again, I'm going to end up
maybe on LBA 30 or something like that.
So I have no idea, when I go from RocksDB
down through the layers,
where my data is actually ending up.
And the thing with SSDs is
you want to co-locate data on a flash block
such that the data has the same age.
You want to make sure that you don't end up
mixing hot and cold data.
When you come around and you want to garbage collect,
if you have like half of the flash block
with hot, active data and half of it with cold data,
then the hot half will be invalid by then and you don't care,
but you still have to move
the cold, still-valid data away.
And that part is really expensive.
That's called write amplification on SSDs.
And that's what we want to avoid,
and that's what Open Channel tries to solve.
And that brings me to this indirect write.
So within an SSD, we have the log structure,
but we are also writing into a write buffer.
So when you're going to write to this flash block,
you're not writing 4K at a time.
You do if you maybe have SLC or something,
that kind of memory.
But in general, you write at like 16K or 48K
or something like that, or maybe even more than that
if you're writing across something called planes.
So you kind of collate all the data together
and then you flush it out to the media.
And you don't know if you just landed
at the start of a block or at the end of a block,
of a flash block.
The user has no knowledge of that.
So that's a big problem.
And it basically becomes this best-effort approach where, I mean, we do our best with the data we've got.
An SSD has a read-write interface.
Obviously, it does have more today
with NVMe streams, for example, so we can do more.
But in general, up to recently,
that was kind of what you had, and the SSD could
only do so much.
So that's kind of what we want to look at.
So open-channel SSDs. The goals are that we have IO isolation, so these dies, you want
to be able to access them independently.
We have predictable latency, so we want to make sure that we can avoid
having these outliers that we saw in the beginning. And we want to control the data placement
and when we actually access the media.
And so this comes down to: if we know the boundaries of the flash block,
we can be smart, because the host may already know how hot my
data is and what kind of data fits together.
Let's put it on the same flash block
so when I need to erase it,
I can do it all at the same time.
Often how this works is that on the first pass
you just write the data in, and then later on you optimize it
and you flush the data out into different superblocks.
I'm going to get back to that, but that's generally how it works.
And so what Open Channel is, is that we take these parts,
the logical-to-physical translation map,
we take the garbage collection,
and then we kind of split the responsibilities in two.
That's what's happening.
We have this logical-to-physical translation map,
which we now give to the host, or the SoC,
or whoever we want to give
that responsibility to.
Then there's the garbage collection. Usually when you get these outliers in the SSD
and you don't know why it happens,
because the user's not doing any writes
but the drive is, then hey, it's because of
the garbage collection within the drive.
We move that into the host
so that we can make the decision of when to do it.
And with the data placement, if the application
doesn't need garbage collection
in the traditional way, we can avoid that too.
For example, RocksDB has a log structure in itself,
so we don't need, in the usual sense, the garbage collector
which is normally within the SSD.
Then there's wear leveling,
which is abstracted away a little bit.
We don't want to tell the host about
the erase cycles, the P/E cycles, in SSDs.
We don't want to tell the host that detail
because it's meaningless.
It can change over the lifetime of the SSD and so on.
So instead we want to give the host a hint
of where to place the data.
And that's what we're gonna do.
Obviously, now that we've moved all this up to the host,
well, now we need software to drive all that.
So if you want to have an open-channel drive
and let it be a block device like a normal drive,
we can build a host-side FTL,
which does this L2P, the logical-to-physical translation map,
does the garbage collection and the wear leveling,
and has similar overheads to traditional SSDs.
So one of the things with SSDs that are on the market today
is that if you have one terabyte of storage media,
you usually use one gigabyte of DRAM
to hold that mapping table within the drive.
There are different optimizations you can do,
but if you don't want to do the kind of lookups
that give you extra latency and so on,
you're going to need one gigabyte of DRAM
per one terabyte of storage.
So that's really bad.
That costs money.
That's just how it is.
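As a back-of-the-envelope check on that number, assuming 4 KB logical blocks and roughly four bytes per map entry: $1\,\mathrm{TB} / 4\,\mathrm{KB} \approx 2.5 \times 10^{8}$ entries, and $2.5 \times 10^{8} \times 4\,\mathrm{B} \approx 1\,\mathrm{GB}$ of DRAM per terabyte of media.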
And also, like file systems and databases,
I mean, if we can integrate it there
and utilize what we already have there as the FTL,
then we can remove this layer that is
within the SSD, the FTL, remove that logic,
and integrate it into what we already
have today.
So when we look at an open-channel SSD, we look
at the concepts. Like, we break out what
it is we're actually introducing with this. So there's a concept called chunks,
which are sequential-write-only LBA ranges.
You can say that a chunk maps
onto a flash block, which you have to write
sequentially within. That's the kind of concept that we expose
up to the host.
Then we want to make sure we can align
the writes to these internal write sizes, so we
want to tell the host what the boundaries are
at which it should write.
Then there's hierarchical addressing.
So we have these parallel units,
which we're going to talk about.
How do you address them individually?
We have host-assisted media refresh.
So an SSD actually does data scrubbing internally and moves data around.
We don't see it, but it happens.
We could move it around as an SSD,
but we can also tell the host, please move it around,
because then the host would know,
and it can kind of schedule it out.
Maybe it doesn't need to because the data is somewhere else.
But there's all these optimizations
where we can start applying.
And then there's this host-assisted wear leveling
that I was talking about.
So chunks: a chunk is a range of
LBAs that is written sequentially.
The cool thing about it is that we reduce
the DRAM for the logical-to-physical
mapping table by orders of magnitude, so
instead of having a gigabyte of DRAM
per one terabyte of media, we can get
away with like one or two megabytes
per terabyte, which is great.
This is the best case; obviously you may want
some hybrid, but generally that's the range we're in.
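Again as a rough sketch, if the map only needs an entry per chunk, say 16 MB chunks like in the demo later and a few tens of bytes of state per chunk (that per-chunk size is just an assumption here): $1\,\mathrm{TB} / 16\,\mathrm{MB} \approx 65{,}000$ chunks, and $65{,}000 \times 32\,\mathrm{B} \approx 2\,\mathrm{MB}$ per terabyte.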
And we can do this hot-cold separation:
when we start writing,
we can write to this chunk
with the cold data and to that chunk
with the hot data and so on.
And then the key part is, as I said with NAND flash, you need to erase it
before you can write it again. Within the open-channel world,
that's called a reset.
So basically a chunk starts in the state free,
and when you start writing to it,
it changes state to open.
Then you write a bunch to it,
and at some point it's full.
Say the chunk is four megabytes, that's what it is.
Then when you've written four megabytes,
it will go into a closed state.
And then you have valid data there,
and when you want to use it again,
you have to reset it, and it goes back to free.
So that's how you normally work with it.
So for those of you familiar with, like,
SMR drives, shingled magnetic recording drives
from the hard drive world,
this is essentially the same device model.
We took that from the SMR world
and brought it into Open Channel.
That meant that we could both use the ecosystem
that is already there for SMR,
but it also means that we have a tried model
for how to do this.
So it's no accident that they fit together,
both for SMR and for how it fits together
with this open-channel model.
Cool. And obviously,
now that we are talking SMR drives,
some of these drives have conventional zones
and they have sequential-write-only zones.
And we have chunks.
So in SMR they're called zones,
and in Open Channel they're called chunks.
The open-channel spec kind of allows both,
and if it's conventional you can do random
as well as sequential writes, that's all good.
And then for those chunks that need to be written sequentially,
we define that, and
the host software needs to do that sequential writing.
That's all great.
So then there's hierarchical addressing,
where you have a normal SSD down here.
Normally within an SSD you have a NAND controller
and then you have multiple channels,
and then on those channels
you have these dies attached that you saw before.
So in Open Channel, we call those channels groups,
and then we call the dies parallel units.
The guarantee for a parallel unit
is that it is independent from the other parallel units.
So you can do a read or a write to either of them
at the same time and they won't conflict.
Obviously, if they share the same channel or group,
they do share some bandwidth constraints,
but in general you can read and write to them independently.
And the important thing here is that this is
not necessarily a one-to-one mapping.
It doesn't have to be.
In implementations today it usually is.
But you could have like four dies that are slotted together
as one parallel unit and expose that up to the host
if you wanted to as an SSD.
But it's not a requirement.
You can do it however you want.
And then how does that look?
How does the host address it?
So it has these LBAs that we saw before.
We lay the chunks on top,
and then within those,
a parallel unit has a set of chunks within it,
and then there's the group
that kind of groups the parallel units together.
And that's all exposed up
through the NVMe address space,
like the LBA address space
that you get through your normal drive.
You just tell the host this is how it works.
So that means you have your NVMe namespace
and then you have your groups,
your parallel units, and the chunks.
That's a logical way to look at it.
That's how it's exposed.
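As a sketch of how that single logical address space is carved up, you can think of an address as nested offsets; the field names below are paraphrased rather than quoted from the spec: $\mathrm{LBA} = ((\mathrm{group} \cdot N_{\mathrm{PU}} + \mathrm{pu}) \cdot N_{\mathrm{chunk}} + \mathrm{chunk}) \cdot N_{\mathrm{sectors}} + \mathrm{sector}$, where $N_{\mathrm{PU}}$, $N_{\mathrm{chunk}}$, and $N_{\mathrm{sectors}}$ are the parallel units per group, chunks per parallel unit, and logical blocks per chunk. In the 2.0 geometry these fields are reported as fixed-width bit fields, so the arithmetic is really shifts and masks.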
Then there's the media refresh.
The idea is that, as I talked about,
this is NAND.
It doesn't store the data forever.
You need to refresh it from time to time.
And yes, the drive can do that itself.
But now that we let the host do the data placement,
well, now we can tell the host,
hey, please go refresh this data.
So that's what we're going to do.
The host can then read the data
and write it somewhere else.
Or maybe, if you're a hyperscaler,
you might have three different copies
somewhere in your data center,
so you don't really need to refresh it at all.
That's the next thing: you can do that
kind of optimization as well.
Then there's the host-assisted wear leveling.
When data is written to the SSD,
the drive doesn't know the temperature of the data, that's the idea.
And when we go through it,
when we garbage collect it,
so we do all the writes,
and we have this concept of a superblock within an SSD.
It has many names, but let's call it a superblock.
It's basically where you stripe across multiple chunks
or multiple flash blocks within an SSD.
So you shuffle everything into that to begin with.
You write it out.
And then later on, the garbage collector comes around,
picks it up, and starts garbage collecting the data.
And then it sees there's some of it that's warm,
there's still valid data, and there's some of it that's cold,
and it splits it up into two different superblocks.
And as we go on, we see, hey,
now we have superblocks that hold the warm data
and superblocks that hold the cold data, and we make sure
that the data fits together.
What we can do is, if we have prior information about this,
how the data fits together, what the age of the data is,
we can do this placement directly the first time,
so we don't have to rewrite the data again and again.
That's kind of the idea.
We enable the host to know: if I have cold data, I should probably use chunks
which are near their end of life.
So say you have a flash block,
and that flash block you can erase,
let's say, 3,000 times.
And I use it for hot data a lot, and then suddenly I only
have like a thousand erases left.
But I have lots of other blocks
which still have three thousand erases
left, resets left.
I want to make sure that, okay,
now that I know which data is cold, let's move
that into
the chunks that only
have these thousand erases left,
because then we're not going to update it as much.
That's the kind of optimization:
if we know something about the data,
we can apply it.
So all of this together is in the spec;
those are the concepts when you go through the spec,
that's what it's built upon.
So we have the IO isolation, where we
go through the groups and parallel units.
You have the fine-grained data refresh,
where we refresh the data.
You have the reduced write amplification,
because we can place the data down
at the right places such that we don't have
to rewrite data as much, which also reduces
the amount of data refresh that we have to do.
There's the DRAM and over-provisioning reduction,
because we're writing append-only,
so we can reduce the mapping table to these couple of megabytes.
And then we have these direct writes
that avoid the expensive internal data movement.
So that's all great.
That's the spec.
It's worth exactly zero
if you don't have a software ecosystem on top of it.
So one thing is putting all this together.
The other part has been to build
this open ecosystem around it.
And basically what this means is that
in the Linux kernel, we've been extending things
such that, for example, we've taken the NVMe device driver
and extended it so that it can
detect open-channel SSDs.
There's support for both the 1.2 and 2.0 specifications.
And basically there's this
LightNVM subsystem in the kernel,
which is the open-channel subsystem part,
and the driver registers the drive with that part,
so we know it exists.
It's great.
And then the next part is, there's a new thing now,
which we're going to come back to: we have this zoned block device,
and that's how you combine it with how SMR drives work today. So that's really great.
So the LightNVM subsystem, what is that? That's the core functionality
that drives register with.
So then when a drive comes up,
it talks to the LightNVM subsystem,
and then we're there.
And then we expose it up to the host.
It doesn't do anything at this point.
We then say, hey, we want to put something on top,
and this can be, for example,
a host-side FTL called pblk,
which we can put on top, and then it exposes
the open-channel drive as a regular block device.
So that's cool.
And if a file system supports
these SMR kind of semantics,
we could also put a file system here if we wanted to.
So that's kind of what we did on the kernel side.
And then on the user-space side,
since there's SMR support, there's support
for using libzbc, which is what's used
for SMR hard drives, shingled magnetic recording hard drives.
There's fio, which has gained some support.
There's liblightnvm, which has been built on
for a couple of years.
That's Simon sitting there, the maintainer of it.
There's SPDK; Jim has been graciously
taking in patches on Open Channel,
and not yelled too much at people, so that's really awesome.
So that works as well.
So support is growing,
and it makes it easier and easier
to start using it.
So that's really awesome.
So all this support kind of started back in 2016,
when we put the subsystem into the Linux kernel.
We put up the user-space library, liblightnvm,
and had support for that in like April last year.
That went in around 4.11 of the kernel,
and pblk came in in 4.12.
We had the Open-Channel 2.0 spec
released in January this year, and it got support in 4.17, so that was pretty
quick. Then we got SPDK support in June, which Jim took in, and then we have fio with
zone support that came in in August. And then on the side, we're working on enabling
the zoned block device support, so we have an open-channel SSD working
and showing itself as an SMR drive
and using that ecosystem.
And then there's an interesting project,
it was at Stanford, now it's at Santa Cruz.
There's a guy called Heiner Litz
who has been working on this idea:
normally within an SSD you have XOR parity
that you stripe across multiple chunks,
and then you have this parity on the side.
His idea was, hey, let's use that
to give lower latency.
So the idea is that if he knows,
when he wants to read some data, that the die is busy,
he'll use the parity to recover the data
without going to that particular busy die.
This makes sense because the reads
maybe take 50 microseconds,
but the write or erase might take 15 milliseconds.
I mean, if you don't have something called
erase/program suspend on your chip,
that makes a big difference.
If you do have it, it makes less of a difference,
but you can still get a win out of it;
it still makes a lot of sense.
And then we're also working on, yeah,
on making a new revision of the spec.
It's been proven to be fairly robust,
so people have been implementing it.
I haven't got too many bug reports.
I have a small list that I kinda keep track of,
and then fix all those small things
and put it in a new revision of it.
That's kind of the overview of Open Channel.
So one thing I want to show,
now that we have time for it today,
is that...
So this is just to show that it kind of works.
So this is kind of inception running,
where we have a virtual machine on my laptop,
and within that we have a virtual machine again.
So it's kind of slow but it gets the idea across.
So for this we brought it up,
and we have this NVMe drive that's open channel,
which we can see because it has the prefix
we use for the LightNVM subsystem,
and it's registered there.
So that means that we are there.
And then we go in here, into the block subsystem,
and let me see, this is normally where you get
your information about the block device that you have.
And here we have a LightNVM entry.
And basically what we can see here is
that this particular drive is a 2.0
drive. We can see that
each chunk is
4096 LBAs, so this is
a 16-megabyte chunk, because they're 4K
LBAs.
Then we can see
how many chunks there are. There's 64 per parallel unit,
and there are multiple parallel units,
which means we have four.
So in total, there are 256 chunks in this SSD.
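Just to sanity-check the numbers being read off the screen: $4096 \times 4\,\mathrm{KB} = 16\,\mathrm{MB}$ per chunk, and $4\ \text{parallel units} \times 64\ \text{chunks} = 256$ chunks, so $256 \times 16\,\mathrm{MB} = 4\,\mathrm{GB}$ of exposed capacity on this demo drive.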
So that's great.
So one of the things: this particular kernel
is extended to expose an open-channel SSD
as a zoned block device.
So that means I can take my favorite block-zone
command-line tool,
which is available in basically all newer Linux versions,
and I can basically just point it at the open-channel SSD.
And then it will list the zones.
This is kind of a convoluted format,
but basically it shows the state of the zones on the disk.
Chunks and zones, when I intermix them,
it's the same thing, roughly.
They have a start, they have a length,
and they have a write pointer,
and then there's some SMR-specific state,
and then you can see this particular zone is empty.
That's great.
All that is in place.
Cool.
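For readers following along, that zone report can be pulled with the blkzone tool from util-linux; the tool isn't named in the talk and the device name below is just an example:

    # List all zones (chunks) on the zone-exposed open-channel drive; each
    # line reports the zone start, length, write pointer and condition.
    blkzone report /dev/nvme0n1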
Then, what we can do is, I talked about pblk.
Let's do that.
We can create it.
In the nvme-cli tool, we have support for that.
We can go in and say, lnvm create.
We give it the NVMe drive,
and the block device instance we want to create
is called pblk0,
then the type: we want to instantiate pblk as a target,
so the target type is called pblk,
and then we say we have four parallel units, so let's put an FTL
on those parallel units.
That's great, so that worked.
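A sketch of that command, using nvme-cli's LightNVM (lnvm) plugin; the flag spellings here are from memory of that plugin rather than from the talk, so treat them as illustrative and check nvme lnvm create --help:

    # Instantiate the pblk host-side FTL over parallel units 0-3 of the
    # open-channel drive and expose it as the block device /dev/pblk0
    nvme lnvm create -d nvme0n1 -n pblk0 -t pblk -b 0 -e 3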
Basically, we can see here at the bottom
that it initialized, so that's great.
So let's put a file system on it.
There we go. Oh, the right one.
There we go, and it's created.
So the file system is on.
Let's mount it.
So now we have the picture: we have the drive,
it has pblk, and now we put a file system on top.
So that's great, so that worked as well, yay, that's good.
And then we can run an fio script.
So we just do small IOs, like 128K,
iodepth 1, a small file, whatever.
Just to show, hey, it works.
So the performance is out the window, it doesn't matter.
It's basically to show that it works.
It lays the data out, and we can read and write from it.
With this kind of read-write workload,
we get a measly 30, 60 megabytes per second.
It's nothing impressive.
But it's there. It works. That's awesome.
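A minimal fio job along those lines would look roughly like this; the mount point, file size, and job name are made up for illustration:

    # Mixed read/write, 128K blocks, queue depth 1, against a small file
    # on the freshly mounted file system
    fio --name=demo --directory=/mnt/pblk0 --rw=rw --bs=128k \
        --iodepth=1 --size=1g --ioengine=libaio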
Yeah, cool.
That was the one thing:
we showed that we have an open-channel SSD,
and we can make it be a block device.
Let's take this off again.
We unmount it,
and then we're going to kill this one
and remove it.
Removed.
And kill this pblk one.
There we are.
So now we took pblk off, and now we're back to the raw device.
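The teardown being described is roughly the following; the mount point is an example and the remove flag is from memory of the lnvm plugin, so double-check it:

    # Unmount the file system and remove the pblk target instance
    umount /mnt/pblk0
    nvme lnvm remove -n pblk0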
So the cool thing I want to show, and that's what we kind of lead up to, is that what if we could just take F2FS and point it to our open channel drive?
So F2FS is actually log-structured internally,
and it has support for SMR.
And that means, now that I've exposed the open-channel SSD
as an SMR drive, it should work seamlessly.
Obviously, when I try to create it,
well, it sees, hey, I'm host-managed,
you really need to enable the zoned block feature
that is required.
Okay, so we go in there, and you can do like dash m.
That does the trick.
Whoops, it gets created.
So that's awesome.
So now we can do a mount.
This one here, there, and there we go.
And now I simply mounted F2FS on top of an open channel
SSD, F2FS is the FTL.
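Roughly, those two steps look like this; the device and mount point are examples, and -m is mkfs.f2fs's switch for zoned block device support:

    # Create F2FS directly on the zone-exposed open-channel SSD and mount it
    mkfs.f2fs -m /dev/nvme0n1
    mount /dev/nvme0n1 /mnt/f2fs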
So what this basically means, so P block,
if you bring it up, it takes one gigabyte of DRAM
per one terabyte of storage.
F2FS doesn't do that, it's a couple of megabytes.
So suddenly we reduced the need for DRAM by 1,000x.
So that's pretty cool.
So what we can do now, we can do the same read-write
and fire it off, do it twice,
and see, kind of get the same kind of idea.
It's like, it just works out of the box.
So that's pretty cool.
That's what I wanted to show.
So for a long time it's been like,
you can get the use of Open Channel,
but there's no big win in it
if you just expose it as a block device.
The cool part is when you can do this.
Or when you can do, I've heard,
some computational startup has been doing this
with databases,
and that's pretty awesome.
That's kind of what you could do with this kind of architecture.
You could reduce this DRAM, and you can have...
And you can also...
So say, with Fusion-io back in the day,
you had all this CPU overhead.
Now that overhead doesn't exist, because it's in the file system,
and it's something you needed to do anyway,
so you get an FTL for free. And that's pretty awesome. So you don't need to have dedicated resources for it. And so that's kind of the
idea of it. Cool. Then the last thing I want to talk about
is what's next.
And this has been the path to standardization.
So yesterday, the board approved these two new TPARs
that went into NVMe.
So there are two main use cases.
These are the official NVMe slides
from when they were presented to the NVMe work group
a while back, and there are two use cases.
One is hardware isolation,
which is these kind of parallel units,
and another one is write amplification reduction,
which is the streams/zones kind of approach.
There is a need from certain customers
that want to be able to slice an SSD
into different pieces, which for Open Channel
means that you have these parallel units,
but what fixed that up in NVMe
a little while back is IO Determinism.
And one thing is that we want more of that.
We want to extend it.
So that was going into there.
And then there was, yeah, where we want to have these zones,
where we want to actually be able to
place the data more explicitly.
So streams kind of fixed this in NVMe,
but it's a logical construct,
you don't have bounds on it.
The idea would be to add these bounds
to make the host more intelligent about placing its data.
So this means that there's this work beginning
in the NVMe work group.
So yesterday in the board meeting,
it got approved for phase two,
which means now we can start working on the spec.
So the one thing is what we call zoned namespaces,
and it adds this chunk abstraction into NVMe.
This is kind of the SMR interface;
that's where we want to keep it compatible,
such that, if you had the crazy idea
that you wanted to tunnel an NVMe hard drive through that, you could.
I don't know if it makes sense,
but you could. It keeps it similar
not because of the hardware,
but because of the software ecosystem.
So the open channel subsystem took three years,
roughly, to make, and similar on the SMR side,
this has been going on for three or four years,
so it's a lot of work that kind of goes into
making this support.
And we don't want to duplicate that,
because then if we go in and build a new spec,
and a new thing, and a new paradigm,
then it's going to be three to four years
before we have a software stack
that can actually utilize it.
So that's one part,
and then the other part is the Sets and Endurance
Groups management, where we want to expose
this parallelism further.
So usually when you take an SSD that supports
IO determinism, you pre-configure it
to one configuration.
The idea is that you can configure it at runtime
if you want to.
And I talked a little bit about the ecosystem.
So basically all this,
what I just showed you here with F2FS,
I mean, building this and getting it into NVMe
and getting it standardized,
this will work out of the box.
We have prototypes working internally
that work with zoned namespaces and everything.
So that's pretty awesome.
The idea is that all this code
is already being used by Microsoft
and other people in the Linux ecosystem.
So it's already in use today,
tested and production ready.
So when we then bring these
zoned namespaces and the extended
IO determinism into NVMe,
we can from day one have production software
that the people who use the SSD
can use.
Cool. So that was what I had.
Thanks.
Thanks for listening.
If you have questions about
the material presented in this podcast,
be sure and join
our developers mailing list
by sending an email to
developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.