Storage Developer Conference - #4: The NVDIMM Cookbook: A Soup-to-Nuts Primer on Using NVDIMMs to Improve Your Storage Performance
Episode Date: April 25, 2016...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 4.
Today we hear from Jeff Chang, VP of Marketing and Business Development at AgigA Tech,
as well as Arthur Sainio, Senior Director of Marketing with SMART Modular,
as they present on the NVDIMM Cookbook from the 2015 Storage Developer Conference.
Thanks for joining us for this session today.
This is the NVDIMM Cookbook primer.
For those of you that were in Paul's session, I heard he went through the taxonomy in great detail,
so we'll
cover it fairly lightly here. And essentially what we'll go through today
is how you integrate an NVDIMM into your system, why you would want to, how it
can be useful for your system performance, and what the key
gotchas are when looking to use an NVDIMM in your system. So my name's Jeff Chang. I'm with the very enormous company known as AgigA Tech.
It's a small company.
I'm a co-chair of the NVDIMM SIG with Arthur Sainio.
I always get it wrong.
Arthur Sainio.
So we'll be ping-ponging back and forth.
We also have an extra special guest speaker, Rajesh Ananth.
Thank you. And he's a system architect with SMART. So I'll be covering mostly the hardware piece of things.
Arthur will be covering the BIOS integration piece. Rajesh will be covering the software piece. And then it will come back to me and we'll talk about some of these
system application and system performance issues. Okay, so let's get started. This is a SNIA tutorial,
so we are obligated to always have this legal notice,
and essentially it just means if you use NVDIMMs
and your system blows up, you can't blame SNIA.
But today we'll be covering,
this is a very long abstract,
I'm not going to read it in great detail,
but essentially NVDIMMs are non-volatile DIMMs,
which means it's persistent memory, or persistent DRAM, that you can use as storage in your system.
And what we'll be covering today is, you'll understand what an NVDIMM is.
So we'll walk through some of the taxonomy, what the difference between an NVDIMM-N, an NVDIMM-F, and an NVDIMM-P is.
You'll understand why an NVDIMM can improve your system performance.
That'll actually come last. Right before that, you'll also understand how to
integrate an NVDIMM into your system. An NVDIMM is only plug-and-play-ish: there
are a lot of hooks that are required from a hardware perspective, a BIOS
perspective, and a software perspective, and we'll walk through each one of those
things so that, if and when you're ready to use an NVDIMM in your system,
you'll know what to expect.
So the cookbook has several parts.
We have the hardware piece, the NVDIMM part one, that's what I'll cover today.
Part two, Arthur will cover, that's the BIOS.
Part three, OS, will be Rajesh.
And then we'll come back to me and I'll talk about some of the system implementations and use cases.
So the ingredients of the cookbook: if you walk through the quote-unquote stack of the system,
you'll have various ingredients that would be required to integrate an NVDIMM.
From the platform hardware at the very bottom, so all the dark blue boxes are hardware,
to the NVDIMM itself,
to the APIs for the NVDIMM into the software layers.
So you have BIOS integration issues, you have OS integration issues, and then you have specific
use case issues that you have to contend with, like do I use it as a block store device,
or do I use it as byte addressable?
So those are all things that we're going to cover today,
and then how that can get integrated
into the overall stack of the system itself.
So for part one, I'm going to cover the NVDIMM hardware
piece of it.
And we'll start with, first, the taxonomy.
So for those of you that weren't in Paul's session,
JEDEC has standardized, in conjunction with the NVDIMM SIG, a taxonomy for the different types of NVDIMMs, much like you'd see with Wi-Fi,
802.11 a, b, g, n, that type of thing.
Because about a year ago when customers and vendors were talking about NVDIMMs, there
were different flavors of NVDIMMs.
And people were starting to get confused over which type should I use and which type is which.
So if you start at the top, an NVDIMM-N, which is mostly what we'll be talking about today,
is memory-mapped DRAM that is flash-backed.
So it's byte-addressable DRAM, byte-addressable persistent memory using the DRAM.
So the host only has access to the DRAM interface during normal operation.
And when the system loses power, it backs up the data to flash.
That's how you get your persistency in the system.
You have a couple of access methods, though, just like you would with any DRAM.
You could use it as byte-addressable memory for load/store, or block-addressable like a RAM disk. Those
are both available in the NVDIMM-N. Capacities are DRAM-like, so you have ones to tens of
gigabytes of capacity in a single NVDIMM. The latency is the speed of DRAM because you're
accessing the DRAM directly natively by the host. However, you do need an energy source to back up the data
when you have a catastrophic power loss event.
So you can use super caps, you can use batteries,
you can use some hybrid of the two.
We'll touch a little bit on that later in the session.
But for the most part, most vendors that deploy NVDIMMs today
use supercapacitors, because it's the right technology
for this type of application.
Moving down to NVDIMM-F,
NVDIMM-F you can think of more like an SSD that's in a DIMM socket.
So you actually have access to the flash itself.
The DRAM's not memory-mapped; the flash is memory-mapped.
So you have block-oriented access to the flash itself. So it looks like
an SSD sitting in the DRAM channel. You get the benefits of a faster interface than, say,
PCIe, SATA, or SAS, but you still have some software latency above the hardware because
you now have a file system that you have to contend with. But that's NVDIMM-F. The capacities are NAND-like, so you can get up to terabytes of NAND flash
capacity, just like you would with an SSD. So you can see there are a couple of very
different use cases between NVDIMM-N and NVDIMM-F.
And then the last type of NVDIMM that has been defined, or at least proposed, within
JEDEC is NVDIMM-P. So NVDIMM-N and -F, by the way, have been ratified, and NVDIMM-P is just
a proposal at this point. I think in the last JEDEC meeting, or maybe two meetings ago, the first
proposal was shown, the first showing. So it's still going through its definition phases.
But essentially it's a combination of an N and an F. You can get access to the DRAM,
byte addressable access to the DRAM, but you can also have block access to the flash at
the same time. Okay? So those are the three types that we have defined so far in JEDEC.
I expect over the next one to two years we'll probably have
a few more, and then five years from now who knows what will be there. But what we'll focus
mostly on today is NVDIMM-N. By the way, I don't mind questions during the session, so
if you have any that might come up while I'm talking, feel free to raise your hand and
we'll call on you. Okay, so I touched a little bit on the taxonomy sheet about
how an NVDIMM works.
In a very simplistic view,
an NVDIMM plugs directly into a standard JEDEC socket,
and it looks like DRAM when it interfaces to the host.
When the system boots up,
there is an interrogation process in the BIOS
that identifies it as an NVDIMM so that the host can use it as persistent memory.
And then when the system powers up, the supercaps, or whatever you have as your energy source, will begin to charge.
And then at some point the NVDIMM becomes available to be non-volatile; when it first boots up, it's not necessarily
in a non-volatile state.
There are a few health monitor items
that the host will have to interrogate to ensure
that you're non-volatile.
Once you're non-volatile, then it
can be used as persistent memory.
We'll go into a little more detail on that a little later.
OK, so the energy source is charged.
You are now non-volatile.
So the system can use the NVDIMM,
the DRAM itself, as persistent memory.
If you have 8, 16, 32, 64 gigs in your system,
that can be addressed as
byte-addressable persistent memory for your system
or block-addressable as well.
Once the health check clears,
it's used as persistent
memory by the host.
During a catastrophic power loss event,
and we'll go through the sequence of exactly
what happens in the system
itself, essentially
we're piggybacking off a feature called
ADR, asynchronous DRAM refresh,
that was originally developed for
battery-backed memories, but we're piggybacking on sort of that prior art for NVDIMM-Ns.
But during that unexpected power loss event,
the DRAMs are placed in a self-refresh.
At that point, the NVDIM itself, the controller,
can take control of the DRAM and move the contents to the onboard flash.
So it does a data transfer: the DRAM contents,
the in-memory state at the point you lost power, are moved to the flash.
It's almost like a suspend-resume event, in a sense.
But it's completely localized to the NVDIMM itself.
So the process takes literally tens of microseconds to complete,
to hand off the memory to the NVDIMM, and
depending on how much capacity you have, it can take 60-ish seconds to back up about 8
gigs of memory. And it's done in parallel, so if you have multiple 8 gig NVDIMMs in your
system it still takes 60 seconds. So there is no hold up time beyond the time it takes to place the DRAM
in a self-refresh. The entire system can go offline at that point. As opposed to, say,
a cache-to-flash implementation where you're having to hold up your entire system while
you're moving your DRAM contents to, say, an SSD with a UPS. And that can take a long
time, and it's probably happening at the worst possible time
in the system.
When you've lost power, you lost your AC,
your cooling's gone out, and you have
to keep your root complex up, your processor root complex up
the entire time.
So we feel that this is a reliable and safe way
to move your critical DRAM contents
into a non-volatile state.
So again, once you've saved the DRAM contents in the flash,
which takes seconds,
the NVDIMM itself goes into a zero power state.
You cut off the power to the caps or the battery or whatever it might be,
and then your data retention is essentially
the specs of the NAND flash itself,
10 years, 5 years, whatever it might be.
When the system power comes back up,
the process happens in reverse.
The DRAM contents are restored back from the NAND flash.
The data is simply transferred from the NAND flash back into
the DRAM. There is some negotiation or communication with the host to cause this to happen, because
essentially an NVDIMM is a slave device to the host. The NVDIMM doesn't make many or any assumptions
really about the state of the host. So the host, when the system comes back up,
it will ask the NVDIMM
whether there was a saved image,
whether there is a valid image, and if there is
it will command the NVDIMM
to move the image back into
DRAM. The NVDIMM will tell the host
when it's done, and then the host
can do whatever it wants with the DRAM
contents at that point, either move it
out to a safe state
or just continue to use the memory as it did when it lost power. Yes, question? So the question is, how
does it react to a power cycle, let's say a brownout, for example. So if a save command comes in, essentially a save trigger,
and we have started our save,
we will complete the save until
it's done. In the
next-generation JEDEC
spec, I believe there is an
abort command. So the host could
have some control over that if it wanted
to. But in current,
what we call legacy,
implementations,
the save will complete.
So once the DRAM is handed back to the host,
it can be used as standard memory again,
as standard persistent memory.
And then the process will happen again
once you have another catastrophic power loss event. Okay, so that's sort of the simplistic view.
As I mentioned, there's a lot of behind-the-scenes activity that occurs
in order to ensure that the DRAM itself will be saved correctly and safely.
So you can't just monitor, let's say, the DRAM power, 1.5 volts or 1.2 volts,
and assume that when a power loss occurs
the DRAM is in a safe state at that point and we can just do the backup.
It doesn't happen that way.
So what essentially occurs, starting in the upper left hand corner, so these are the stages here, I think. So we have board logic up here.
We have CPU and chipset here.
And then we have the NVDIMM hardware here.
So everything happens at the board level.
The board itself, or the power supply specifically,
will detect an AC power loss event.
That AC power loss event will assert this ADR feature.
So ADR stands for asynchronous DRAM refresh.
And essentially what that means is the processor in the chipset or the memory controller will flush the ADR protected buffers.
So there are ADR-protected write buffers in the memory controller.
Once an ADR trigger comes through, it will flush those buffers out to DRAM.
It will place all of
the DRAM in self-refresh, and at that point, the DRAM is in a safe state for the NVDIMM
to take control over it.
So once self-refresh is asserted, ADR complete is then asserted as well, and this ADR complete
is what we have been calling the save pin. So within JEDEC
there is a standard pin that has been defined as the backup trigger. So once the NVDIMM
sees this trigger it will then switch control of DRAM over to the controller that resides
on the NVDIMM itself and copy the DRAM to flash. And then at the same time, the entire system can shut off all of its power rails.
In a cache-to-flash type of implementation,
you would move all of your data
and have to hold up your system,
including your standard DRAM,
for however long it would take to move that data.
And sometimes that can be tens of minutes, right?
So this whole process, from here to here,
takes roughly 30 microseconds.
If you were using processor caching,
it could take a little longer, on the order of, let's say,
two, three, to five milliseconds.
And then the backup itself will take tens of seconds. So 8 gigabytes
will roughly be about 60 seconds.
And it all depends, of course, on
what type of flash you use,
how fast your DRAM controller is,
how fast your NAND flash controller is.
It's very vendor specific.
But roughly, for an 8 gigabyte
backup, it's about 60 seconds.
Okay, so each of those stages is shown here more in a timing diagram. So again
AC power loss is detected by the power supply and it's sort of a race at this
point. So the ATX power supply spec says that your hold-up time, your DC hold-up time, is at least one millisecond from when you lose power to your DC rails going down.
As I mentioned, this whole process here to get this backup trigger typically takes about 30 microseconds.
So you have a power good signal that then triggers the ADR trigger.
This is the ADR trigger
into the chipset.
The processor is in normal execution.
Once it sees that trigger,
it will flush out
the ADR-protected write buffers.
It will place the DRAM in self-refresh.
It will give a trigger to the NVDIMM
over the DIMM interface.
This is the SAVE_n trigger.
And at that point,
the NVDIMM itself will take control of the DRAM.
It's in self-refresh at that point.
It will pull it out of self-refresh,
and then it will start moving the data into the NAND flash that's on board.
So, a nice thing about NVDIMM-N: up until about a year or two ago, none of
these features were standardized.
There were very proprietary implementations for NVDIMMs.
No NVDIMM vendor looked alike.
There are probably five or six different NVDIMM vendors in the industry that offer solutions
today.
So the problem with that is, if you wanted to integrate an NVDIMM in your system,
they weren't necessarily plug and play between the vendors.
Well, JEDEC has solved a lot of that.
So JEDEC got together about a year ago,
maybe a year and a half ago,
and started standardizing both the hardware interfaces
and the software interfaces for NVDIMM types.
And the first type is NVDIMM-N.
So one of the things that JEDEC did was added 12-volt power to the pins.
Now, this maybe wasn't specifically for NVDIMM-Ns,
probably for future instantiations of other DIMM types,
but that 12-volt power over the DIMM interface is very, very helpful for an NVDIMM, as you can imagine.
So we can use that power to charge whatever power source might be connected to the NVDIMM,
but also that 12-volt power bus can be used to power the NVDIMMs during a catastrophic power loss event.
So you can actually switch in a central battery
or central super caps or whatever
it might be in your system and keep just the DIMMs up
to do this backup process, as opposed to a very large UPS
type of implementation.
JEDEC also standardized on the hardware interfaces
to trigger the backup.
So the SAVE_n pin is
standardized on DDR4; it's pin 230. It's a bidirectional pin. There are some specific implementation
aspects as to why we made it bidirectional. We'll not go into detail on that today. But
essentially that's the pin that the NVDIMM controller will monitor, and when it sees it assert, we will do the backup.
Okay?
There is an event pin that's an asynchronous event notification.
This is implementation-specific,
but in some implementations, when you trigger an event,
it will tell the host that something wrong happened. The NVDIMM can no longer be
used as a non-volatile memory, so the host can go in and then check what went wrong with
the NVDIMM itself. There can be multiple reasons why an NVDIMM goes bad. The flash is bad,
any number of things. You don't have enough energy to do a full backup. Those are all things that the host needs to continue to monitor to make sure that
when it's using that NVDIMM
it knows that it's in a safe state. The I2C device addressing,
that has been standardized within JEDEC. And as I mentioned, the 12 volts on
DDR4 simplifies NVDIMM power circuitry and cable routing.
We've taken great advantage of this
12 volts and all the different types of implementations for NVDIMMs. The other thing
that's not up here is I guess that's what this device addressing is. So not only is the device
addressing itself standardized, so the SPD addresses, the SPD values have been all standardized. But also there is a command set.
So when the host talks to the NVDIMM controller over SMBus in the channel,
the protocol it uses to talk to the different vendor implementations previously
was different from vendor to vendor.
Well, JEDEC standardizes all of that.
So next-generation implementations will all be standardized.
Think of this like a USB class device, for example.
So USB mass storage.
Every USB mass storage device looks exactly the same to the host system.
You have a USB class driver.
So that USB class driver can be plug and play for all the different implementations.
Well, we've striven for the same thing here with NVDIMMs. If you have an NVDIMM-supported
BIOS in your system, it will look exactly the same from vendor to vendor.
So this is kind of an eye chart. I'm not going to go into great detail on
this. It shows, for DDR4, what we're calling legacy, sort of the first
generation, these proprietary implementations,
and then the second generation, which is the JEDEC-standardized implementations. All of the items in
yellow are some of the areas where a lot of standardization has occurred. So I've covered
some of the stuff in the hardware piece; Arthur and Rajesh will cover more of the software aspects.
The software is really, I think,
where more of the exciting stuff is occurring in the industry.
Because the hardware is kind of obvious.
If I can use DRAM as persistent memory or non-volatile memory,
yeah, it's obvious that it's going to be the fastest persistent memory
that you could possibly get in your system.
But then you get software bottlenecks on top of that. And that's
where the NVM programming
model, which I think Paul and
Doug had talked about earlier today, comes in;
that's where a lot of the work is
happening to make those interfaces standardized so
that you remove the software latency
in your system and truly get
the best performance out of your
persistent memory as possible in your applications.
So I'm going to hand it off now to Arthur who's going to cover more of the BIOS aspects
of NVDIMM use in the system and then Rajesh will cover the OS, some of the software stack
activity and then I'll come back and do system use cases.
Thank you Jeff.
Good afternoon, everyone, I'm Arthur Sainio. Just a reminder,
we do have a Birds of a Feather session this evening at seven o'clock. So anybody that's
interested, you're invited to come and bring up any topic related to NVDIMMs at all.
All right. So I'll talk a little bit about the BIOS. And one thing that I wanted to say is that this has really been a key enabler for the adoption of NVDIMMs in the industry.
So with DDR3, there were just a certain number of off-the-shelf systems out there available that would support NVDIMMs.
With DDR4, the number has grown dramatically.
So these are Intel-based systems I'm referring to, with the MRC and the BIOS enabled. So once that MRC is released to AMI and Phoenix and Insyde and so forth, they do support all the different vendors, and that has really enabled this market to grow quite a bit. So in terms of the BIOS support specifically, there are some functions that have been added.
Just seven of them are listed here, and those enable the system to detect these NVDIMMs once
they're plugged in. So there are seven different functions that allow the memory map to be set up,
that allow the NVDIMM to be armed so that it's ready for a power loss scenario.
Other commands, such as flushing the write buffers
and restoring the data upon the power loss,
and also the I2C read/write process.
So all these need to be in place on a system
in order for the NVDIMMs to be supported.
So this is the standard flow, with the shaded area in the middle being the MRC section,
the memory reference code.
So the system has to detect the DIMMs that are plugged in,
what density they are, the configuration, the voltage,
and then all the timing parameters are set up, etc. So that's the standard flow, and then
you can see the difference here for the NVDIMM BIOS. So there are a number of different features here in
terms of NVDIMM detection. So the N versus the V is what Jeff mentioned, the command set: the N is for the legacy command set and the V is for the JEDEC command set. So we just need
to make sure that when these NVDIMMs are plugged in, the system or the host knows what
it is and configures that memory accordingly. So that's all that work on the taxonomy and
the SPD bytes and so forth. So as the NVDIMMs continue to proliferate, there will be more letters to be added
and then other modifications made to this BIOS flow.
So it would need to be updated based on the NVDIMM types.
The other ones here, the NVDIMM restore.
So those are additional commands after the system reboots.
It sees the NVDIMM;
first of all, it recognizes it's
in the system, and then it checks for an image. If that image needs to be restored,
it has to enable a signal, CKE low, to do the restore
of the data from the flash to the DRAM before the system continues
to boot.
So the recovery process, another bit of an eye chart here. So I'll try to walk through this thing.
So as I mentioned, when the system is restarted,
it first checks if there's a save in progress.
If there's a save in progress, it has
to wait, so it's not going to interrupt that, and then once that is completed it continues to boot,
just as with a standard DIMM. But the other thing, on the top right there, would be: does the NVDIMM
contain the backup data? And that's when it goes and enables a CKE low
to enable that NVDIMM to go through the restore process.
So the data is restored
back from the flash to the DRAM,
and the time that that takes can vary
depending on the vendor,
maybe anywhere from six to ten seconds
approximately per gigabyte.
So that does impact the boot-up time for a server; a 16-gigabyte NVDIMM, for example, could add a couple of minutes to the boot.
And then it continues to restore and completes the process over here, and then the system
boots up.
All right, the E820 table, some changes going on with the type 12 here. So that's the reserved section of memory, which has now been updated to type 7. So type 7 is for persistent memory, and as Doug mentioned about the NFIT tables, that gives specific instructions on where that persistent memory is located. So moving from type 12 to type 7,
and giving specific instructions to the operating system
in order for that block of persistent memory to be located.
So again, once that whole system is booted up,
and depending on the number of NVDIMMs
that are plugged in there,
that'll be blocked off as persistent.
And then it's up to the host system
or the user or the application
to determine what can be used with that portion
of memory.
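As an editorial aside, one minimal way to see where that persistent range ends up once the OS is running is to look for it in /proc/iomem. The sketch below assumes a Linux system whose kernel labels such ranges "Persistent Memory" (the exact label can vary by kernel version), and it is illustrative only, not part of the BIOS flow the talk describes:

    /* Print the address ranges the kernel reports as persistent memory.
     * Assumes the "Persistent Memory" label used by recent Linux kernels
     * in /proc/iomem; run as root so the addresses are not masked to zero. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/iomem", "r");
        char line[256];

        if (!f) {
            perror("fopen /proc/iomem");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            if (strstr(line, "Persistent Memory"))
                fputs(line, stdout);    /* print the matching range lines */
        }
        fclose(f);
        return 0;
    }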
And then these are just some other examples here of some of
the features about enabling ADR, asynchronous DRAM
refresh, the restore functions, and extra features
related to the BIOS commands that can be variable depending on the user.
Okay, and Jeff mentioned this other section here about the legacy versus the JEDEC command set,
so that was the case with DDR3, where most of the systems, or virtually all the systems, are using the legacy command set.
The industry is moving toward the JEDEC command set, which should be done by the end of this year and which allows a standard implementation for all NVDIMMs to be
recognized in a uniform manner. So there's also an updated MRC and BIOS that goes along with that
command set, and that is still in progress. It has not been released yet, but it should also be released by the end of this year.
Okay, so that's it for the BIOS section. Rajesh is going to cover the
operating system. I think Doug covered some of this in other talks, but if there are
any questions here. Thank you, Arthur. I'm Rajesh Ananth. I'll be talking about the software things that are
happening in this spectrum.
So when the NVDIMM came into the picture, a lot of
vendors had a lot of proprietary solutions.
They had their own drivers.
They had their own stack and everything.
So an initiative got started to
standardize that.
So the whole trend is, instead of having a traditional storage stack, which has got
the file system sitting on a block driver, sitting on the disk, now the trend is to
have non-volatile memory.
It's not just for NVDIMMs, it could be for any persistent memory.
So this is the trend that is going to happen.
What you're seeing is that
all the applications want to have full control over the storage.
They want to have direct access to the storage. Why do we need multiple layers? Why do we
have to have this traditional file system way of accessing, and
all these things? So that is what is happening. So many different vendors
were trying to do it in their own ways,
like, you have a memory-map interface,
you use my driver, use my application, everything.
Then, a couple of years back, what Intel did was
say, okay, let's make it more standard.
So Intel kind of started defining this,
and now this is pretty much coming into place.
So this is what it has become in the 4.0 kernel.
It's got a lot of things floating around.
Everything that's in the gray portions, this is the interesting thing that has happened in the Linux 4.0 kernel.
These have all become standard drivers now.
That is, if you're using NVDIMMs or persistent memory from us, you
don't have to use our drivers.
The Linux 4.0 kernel already has the driver for that.
So this is the NVM library that Intel has on its GitHub site.
So what they have done is build a beautiful library system where you
can plug in any vendor's NVM modules or any persistent memory modules.
The library gives you access: you can use them like an object store, or you can
use them through a typical block interface,
or you can have a direct persistent memory interface.
The library has all the interfaces defined.
So all you've got to do is just write an app.
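To make "just write an app" concrete, here is a minimal sketch against the libpmem piece of Intel's NVM Library as documented on pmem.io. The pmem_map_file/pmem_persist calls and the /mnt/pmem path are assumptions based on the library as published, and may differ in detail from the version being described here, so treat this as illustrative rather than definitive:

    /* Minimal sketch using libpmem from Intel's NVM Library (pmem.io).
     * Assumes /mnt/pmem is a DAX-capable mount backed by persistent memory;
     * the path and size are hypothetical. Build with: cc app.c -lpmem */
    #include <libpmem.h>
    #include <stdio.h>
    #include <string.h>

    #define POOL_SIZE (4 * 1024 * 1024)   /* 4 MiB region for the example */

    int main(void)
    {
        size_t mapped_len;
        int is_pmem;

        /* Create (or open) a file and map it into the address space. */
        char *addr = pmem_map_file("/mnt/pmem/example", POOL_SIZE,
                                   PMEM_FILE_CREATE, 0666,
                                   &mapped_len, &is_pmem);
        if (addr == NULL) {
            perror("pmem_map_file");
            return 1;
        }

        /* Byte-addressable store: just write through the pointer. */
        strcpy(addr, "hello, persistent memory");

        /* Flush the stores to the persistence domain. The library picks the
         * mechanism (CPU flush instructions vs. msync), which is also how it
         * shields applications from future instruction-set changes. */
        if (is_pmem)
            pmem_persist(addr, mapped_len);
        else
            pmem_msync(addr, mapped_len);

        pmem_unmap(addr, mapped_len);
        return 0;
    }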
Go ahead, please.
Just a clarification, right?
Are you now referring to the F and the P, or is this the N?
Okay, so the thing is, this ecosystem, interestingly,
they refer to
it as non-volatile memory, which
kind of applies to both the dash-N
and the dash-P. The dash-F
probably
will hook into this infrastructure, but right
now, if I have a dash-N,
which is going to be common,
or if I have anything they call
persistent memory, that is, what they call persistent memory is a memory
device sitting on the memory bus that's accessible using load and
store instructions and everything, anything like that fits into this ecosystem. That's
the beauty of what is happening: if somebody says, hey, this is my memory
device that can sit on the memory bus or anything, it's
accessible, then using this infrastructure you can directly access that. The drivers
will detect it. So it all comes down to this,
which is this new ACPI specification. I think Paul touched a lot on it. This
latest specification has given a standardization of what these
persistent memory modules have to be. They keep referring to them as non-volatile
memory, meaning that anything that has the characteristics of persistent
memory will apply to that. It could be what they call spin-torque memory, that kind of
memory and everything, it doesn't matter. There's a huge change that is happening in the BIOS.
The BIOS constructs what is called the
NFIT, the NVDIMM Firmware Interface Table; that is the new
definition. And the OS kernel detects that thing, and it can say,
hey, I can access this guy this way, this way, and everything. So this is more like a vendor-independent way.
So this whole driver does a lot of work, and it depends on the BIOS.
So a lot of BIOS changes are also happening.
So your motherboard BIOS has to set up this thing,
and once it is done, the kernel detects everything and sets it up for you.
So the interesting thing is there's what
they call the block window driver and
the BTT and everything. The idea is for legacy guys who want to use it, see, at the
end of the day, this is a memory device that is sitting on a memory bus. But still, if
people want to use it like conventional block storage, you still have to provide that.
The interesting aspect is, now you are a memory
device sitting on a memory bus, but the applications and the file system are always thinking
they're talking to block storage. If there is a memory error or that kind of thing happens, the
conventional reporting scheme doesn't report it like there is a sector error or something
like that. To make it happen, they have to build this whole infrastructure,
so that the file system that is sitting on this thing
gets the error as if it's getting a SCSI device error
or something like that.
A lot of work has gone into this piece and this piece.
But there's always a flip side to this thing.
Now you want to maintain the legacy compatibility.
What it introduces is, now you have an upper-level guy who has his own host LBA,
which gets translated into another LBA, which gets translated again, so there's a lot
of multiple translation that happens. The most interesting aspect, which I don't
know at this point, is how much it's going to affect the performance of this guy
if you want to access it, because at the end of the day the bottom-most thing
is doing cache-line I/Os, because it's sitting on the memory bus. So even if
you're doing a 4K I/O, when it goes to the bus it's doing 64 bytes at a time, kind
of thing, because it has to go through the memory controller and all these things.
This is one thing further experiments will tell,
but this path is pretty fast, because it's byte-addressable. So the
whole ecosystem, at least in my opinion, is going to flourish this way,
because people want to use the whole processor-to-memory aspect:
it's a memory device, let me use it like a byte-addressable thing, kind of thing.
So this is more like a legacy thing;
only the future will tell, but it's all enabled now. And this is really fast: the
device just comes up, and depending on how many devices you have, it exposes them
as /dev/pmem-something, and you can access them right away. So these are, whatever I said, the things that
exist, what's going to come, and all these things.
So the kernel aspects are in the 4.2 kernel, but Intel is trying very hard to push in
the libraries and everything; that hasn't happened yet.
And now, as we all know about Linux distributions,
just because it's in the kernel, it doesn't mean it's going to make it to
the distributions soon or anything. Most likely Fedora is going to be the first one that's
going to have the 4.2 kernel; we don't know yet. But the rest of the distributions might
just pick it up later. Yeah, sorry?
What's the difference between the block window driver and the PMEM block driver?
Yes.
So the PMEM is basically persistent memory direct access, meaning that
you can use an mmap command kind of thing, so it gives you a byte-addressable thing.
So the whole aspect of the PMEM driver is, it sets you up to access the device, and then it
kind of gets out of the way.
So the driver is not involved during the I/O operations.
It's just a small thing.
The block driver is a typical block driver.
It's like a SCSI driver sitting on top of the memory device.
So it has to translate everything, your block
host LBAs, into what offsets in the DIMM it's going to go to,
and all these SCSI things.
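To make that distinction concrete, here is a rough sketch using plain POSIX calls against the device node the pmem driver exposes. The /dev/pmem0 name is an assumption, and msync stands in for whatever flush-to-persistence mechanism the platform actually requires, so treat it as illustrative only; the separate block window driver and its LBA translation are not shown:

    /* Sketch: two ways to touch the same pmem device node.
     * 1) mmap + CPU stores: the driver sets up the mapping, then stays out
     *    of the I/O path (the direct-access case described above).
     * 2) pread(): goes through the driver like an ordinary block device.
     * /dev/pmem0 is an assumed device name. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        char buf[64];
        int fd = open("/dev/pmem0", O_RDWR);
        if (fd < 0) {
            perror("open /dev/pmem0");
            return 1;
        }

        /* Direct, byte-addressable access. */
        char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            close(fd);
            return 1;
        }
        strcpy(addr, "written with a plain CPU store");
        msync(addr, len, MS_SYNC);          /* stand-in flush to persistence */

        /* Block-style access through the driver, for comparison. */
        if (pread(fd, buf, sizeof(buf) - 1, 0) > 0) {
            buf[sizeof(buf) - 1] = '\0';
            printf("read back: %s\n", buf);
        }

        munmap(addr, len);
        close(fd);
        return 0;
    }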
With the block driver also, what they have done is, see, if you look at a typical
memory system, you have multiple channels,
you have interleaving among the channels and all these kinds of things. That's how the memory devices
are supposed to operate. Now,
the PMEM driver basically, if I have, let's say, four
DIMMs on a socket,
sees them as a whole so that it can interleave data and all these things.
The flip side it brings in is, if one of the DIMMs goes bad,
you're seeing your whole storage as a whole device,
so it's all going to go bad.
So with the block window, what they do is, you can set them up like RAID portions, kind of thing.
On every individual DIMM, I can assign a RAID distribution, like RAID 0, RAID 1,
those kinds of things.
That kind of support is there in the block driver.
So imagine you have memory DIMM devices and you are treating them like a RAID kind
of thing.
So if one of the DIMMs fails, they have the protection logic in the driver;
they have the parity
and all those kinds of things.
That's why it adds a lot of complexity;
as for performance, time will tell whether it is worth it.
Just because it's sitting on the memory bus
doesn't mean it's going to be running at memory speed,
because there's a lot of software,
what do you say, layers put on top of it.
But the direct access mode,
it's completely different.
Any questions?
So the PMEM mode operates like a...?
Yes.
So you just get a file, and accesses just go straight through?
Yes.
Yeah, that's its primary intention.
So yes, there are some interesting things also. This PMEM driver, you see, it's also called a block driver. The reason is, it comes up and you can still use it as a regular device under a file system; it
acts as a block device. So this driver gives you dual functionality. That's why, for this DAX-enabled
file system, which is the ext4, they made huge changes to bypass
all the kernel's buffers when you access it. So I can mount it as a DAX-enabled thing and I can use the
PMEM block driver while using memory maps, and it kind of behaves like that: it doesn't
then copy through all these buffers. So the whole idea, thank you, the whole idea is all about
maintaining the legacy compatibility as much as possible, because this introduction itself is a big shock already.
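As an illustration of that DAX point, here is a rough sketch of the access pattern, with the mount step shown only as a comment. The mount point, file name, and mount option are assumptions, and whether stores truly bypass the page cache depends on the kernel and file system build, so this is a sketch of the idea rather than a reference implementation:

    /* Sketch: a file on a DAX-mounted ext4 file system, accessed via mmap.
     * With DAX, the mapped loads and stores reach the persistent memory
     * without being copied through the page cache; without DAX the same
     * code still runs, it just goes through the kernel's buffers.
     * Assumed setup (as root):  mount -o dax /dev/pmem0 /mnt/pmem  */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        int fd = open("/mnt/pmem/log.dat", O_CREAT | O_RDWR, 0666);
        if (fd < 0 || ftruncate(fd, len) != 0) {
            perror("open/ftruncate");
            return 1;
        }

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        strcpy(p, "record written straight into the mapping");
        msync(p, len, MS_SYNC);     /* flush to persistence */

        munmap(p, len);
        close(fd);
        return 0;
    }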
And most likely this whole thing is going to prevail,
because with the next processor that's going to come out,
the load and store instructions get optimized,
like the flush instructions and all these things;
right now, in the library, they have disabled that.
But the whole idea is, it's just the library; the
applications don't have to worry about it. Once they have the new processor with
the new instructions, they will just change the library, and the applications
don't have to change. So this is the most interesting thing that's going to
happen. This part is pretty much done; it's for the legacy guys. So these are the
pointers.
So if you go to the pmem.io site, it's beautiful.
It's got all kinds of documents, the latest documents, and Intel maintains it.
And they have a GitHub reference.
You can download the source.
And if you want to make some contributions, they're open for
making contributions to that.
If you want to make some changes, fix some bugs and stuff like that.
I'm pretty much done.
Any more questions?
My understanding, according to your
introduction, is that if you enable ADR,
ADR will guarantee the memory channel buffers are flushed into PM.
How about RDMA?
If I use RDMA, will this also guarantee the memory channel buffers are flushed
into PM?
Also, is there any possibility ADR will fail
to flush the buffer cache?
So, when you say ADR, you mean the asynchronous DRAM refresh?
Yeah.
So, you're asking about the RDMA aspect of it?
Yeah, if we use RDMA, will the ADR guarantee the buffer cache, the memory channel buffers, are flushed
into persistent memory on a power loss?
I do not, honestly, I don't know the answer
to the question yet, you know what I'm saying?
It's getting into the details, right?
I'm not aware of any plans to enable access
to the memory when the system's not working.
So is that possible?
If not RDMA, then with generic DMA,
will the ADR fail to flush the memory channel buffers?
Not remote, just in a single host.
Right.
Yeah, I mean RDMA.
Yeah, they're independent of each other
and have different purposes.
Yeah, I will get an answer.
I don't know what that is,
just to be honest. I'll get it from someone more informed. The slides should be available for download.
I think they're already on the NVDIMM SIG site.
Okay.
It's a similar presentation to the one shown before.
The only...
Five minutes.
What's that?
Ah, okay.
I can do this in five.
System implementation and use cases.
So putting it all together and making it work.
There are several platforms available today
that are NVDIMM-enabled.
There's a few from Intel.
There's a few from Supermicro.
They're only on here
because they were the first ones initially, but almost without exception,
most of the OEMs will have NVDIMM enabled platforms available for Haswell and Broadwell,
probably Broadwell, and then eventually Skylake.
Are there four-socket systems?
Four-socket systems.
I don't see any four-socket systems up here, but I do know they're coming. Yes. So for example, Supermicro has a list of 15 different
platforms; we just didn't have enough room. But I believe four-socket systems
are available today. We talked about some of the energy source options. You can
power the NVDIMMs through that 12-volt interface. In some systems, obviously, the system has to be plumbed appropriately for that.
Or you can use an optional external energy source like super caps or a battery or whatever it might be.
NVDIMMs themselves are pretty energy source agnostic, as long as it's energy.
It can be a battery.
It can be a hybrid.
It can be supercaps.
Supercaps are nice because they deliver high energy
density for a short period of time, and that's all you need for an NVDIMM, and
supercaps are much more reliable than batteries; that's why they're the
energy source of choice. The population rules are very similar to standard DIMMs.
You can't use LRDIMMs with RDIMMs in the same channel, but you can put
an NVDIMM essentially anywhere in the system, and the BIOS will detect it and set it up
appropriately in the E820 map that Arthur had mentioned. The BIOS is smart enough to
interleave NVDIMMs that are within a system, and it will interleave standard DIMMs within a system.
So it really doesn't matter where you put them; however, where you put them does impact
performance, right? So you should be smart about it. So for the NVDIMM population tips:
interleaving DIMMs within a channel provides a small performance benefit. Interleaving
DIMMs across channels provides a very large performance benefit. So interleaving NVDIMMs across channels, by association,
gives you the best performance benefit. And here are some optimal interleave
examples; I'll just point one out. The green is an NVDIMM, the light blue is a
standard DIMM, gray is an empty slot. So in this configuration, the system will interleave 32 gigs of standard DIMMs together
and 8 gigs of NVDIMM.
And the way that the OS will recognize the NVDIMMs is,
the BIOS will pass up the E820 map,
and it can allocate the upper 8 gigabytes, for example,
of addressable memory as persistent memory.
So the OS or the application can use it appropriately.
Several use cases.
These should be obvious.
In-memory databasing, of course.
In enterprise storage, you can use storage tiering,
caching, write buffering.
Caching is probably the most popular application today
because it's sort of plug and play into most application-aware systems.
Most applications that use a write buffer or write caching
already need some sort of persistent memory in their system,
so there's not a lot of overhead in terms of application development.
That's one of the reasons why we have the block mode driver that's being developed because a block mode driver can more easily plug into
today's applications. However, byte addressable persistent memory is a more compelling use
case in the future for persistent memory.
So an example that a specific customer has shown, and this was shown I think at the NVM summit back in January, this is a storage bridge bay, so you have two high availability nodes, active-active nodes.
In a system like this, you always keep two copies of your persistent memory, and in this case, it's done with an NVRAM PCIe card.
So you have essentially a bunch of latency hops
in order to make sure that the same copy of that data
is on the other side or the other node,
in case one of the nodes goes down.
So you have the data in volatile DRAM.
It needs to be copied to that PCIe card.
So it has to go through the memory channel over a PCIe hop, has to go
back through the PCIe root complex to a 10 GigE NIC card through the backplane, and then through
those same latencies back into the PCIe card. Well, with NVDIMMs, it gets much simpler, right?
Once the data is in the NVDIMM, it's persistent. So once you've captured it in the NVDIMM,
you can go through the non-transparent
bridge, which mirrors the write to the other node. It's much, much faster. And in an implementation
like this, the customer was able to see a 4x performance improvement in the write latency.
4x. It's pretty significant. And I don't know of a technology today that could give them that type of performance improvement.
So, with that, I think we're out of time.
I'll open it up for questions.
Just one quick comment.
Sure.
The previous slide was showing the NTB configuration.
Yes.
What we're anticipating in the Programming TWG
is that for configurations that are not storage devices
but more servers, it's basically RDMA
instead, but with exactly the same benefits
that you described there.
And that's how we would anticipate addressing
the question you had there about access to remote memory.
No, it wouldn't be access to remote memory
when a server is down.
It would be access to a duplicate copy across the cluster.
So we're using replication, as opposed to some assumption
that we would get to powered-off memory.
So I think we're approaching that problem.
You had the answer that I was going to give you on the slide there,
but with just a couple of letters different.
Perfect.
Thank you.
Thanks for that comment.
Okay, any other questions before we close?
No? All right.
Thanks, everyone.
Appreciate the time.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers in the developer community.
For additional information about the Storage Developer Conference, visit storagedeveloper.org.