Storage Developer Conference - #69: Update on Windows Persistent Memory Support
Episode Date: April 10, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair.
Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community.
Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 69. My name is Neal Christiansen. I'm from Microsoft. I lead the NTFS file system development group.
And one of the things that we have been working on, and obviously it's not just a file system thing, is persistent memory, over the last few years.
In fact, two years ago, I came and talked about our initial support for persistent memory.
And so this is simply an update on that.
This is my agenda.
And the first slide here, what is persistent memory?
You know, so many people have talked about it this week.
If you don't know what it is, you can read it.
But I think everyone here understands what persistent memory is.
So the first thing I want to do is a review. Like I said, I was here two years ago talking about our work. All of that work has already been released. It was released a year ago, through the Windows 10 Anniversary Update on the client side, and through Windows Server 2016. So our support has been out there for a full year already. I'm going to do a high-level review of the support that we released back then, and then I'm going to move on and talk about some of the new stuff we've done since then, which is in an upcoming release that'll be out shortly. So basically,
here's a picture. And again, I think this has been commonly talked about in several presentations,
how what we call a DAX mode, or direct access, volume works. The bottom line of the whole thing is that an application can memory map a file, and instead of today, where the mapping is to RAM and there are paging I/Os through the file system down to the storage, it's direct access from that application straight to the persistent hardware.
The file system gets out of the way, which gives the application optimal performance.
Now, there are some costs to doing this, and I'll talk a little bit about that as well.
But this is a high-level model.
And what's important, of course, is that not every app in the world is going to change.
And so we still have to support all of our existing APIs
so that they can function properly in this environment.
And so this is basically, in this slide,
it calls it a block mode application,
but it's basically an application
using standard file APIs.
And they have to work through this whole model,
and I'll talk a little bit about that.
But this is just kind of the high-level view of how it functions.
And the advantage of persistent memory, being byte addressable and sitting on the memory bus, is this whole direct mapping, getting the OS out of the way.
Everyone wants to get the OS out of the way, right? So in our environment, DAX mode is chosen at volume format time.
So at the time you create a volume, you decide if you want it in traditional block mode or
if you want it as a DAX volume.
And one of the reasons we had to go this path is because there's some compatibility issues
with some various components listed there.
And so in DAX mode, these guys get out of the way.
In fact, BitLocker, which is our volume-level software encryption, and VolSnap, which is our volume-level snapshot provider,
they don't even attach to the DAX volume stack.
It's actually a completely separate stack in Windows,
and none of the existing drivers know how to attach to it. They have to be updated, and those haven't been updated yet.
Also be aware that there is some functionality you lose by losing this hook point that the file system has into people's data, and I'll talk about that. So the real question is, how does memory-mapped I/O work on a DAX volume?
An application uses existing APIs that function today.
There was no change in those APIs.
So any app that already uses memory mapped APIs kind of gets this direct access on a DAX volume.
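To make that concrete, here is a minimal user-mode sketch of mine (not from the talk): it checks for a DAX volume using the documented FILE_DAX_VOLUME flag from GetVolumeInformationW, then maps a file with the same CreateFileMapping/MapViewOfFile calls that work on any volume. The drive letter, file name, and sizes are illustrative.

```c
#include <windows.h>
#include <stdio.h>

#ifndef FILE_DAX_VOLUME
#define FILE_DAX_VOLUME 0x20000000  /* documented volume flag for DAX */
#endif

int main(void)
{
    DWORD flags = 0;
    if (GetVolumeInformationW(L"D:\\", NULL, 0, NULL, NULL, &flags, NULL, 0) &&
        (flags & FILE_DAX_VOLUME))
        printf("D: is a DAX volume; mapped views are direct access.\n");

    HANDLE file = CreateFileW(L"D:\\data.bin", GENERIC_READ | GENERIC_WRITE,
                              0, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;

    /* The exact same mapping calls as on a block volume; on DAX the view
       is backed directly by the persistent memory. */
    HANDLE section = CreateFileMappingW(file, NULL, PAGE_READWRITE, 0, 4096, NULL);
    char *view = section ? (char *)MapViewOfFile(section, FILE_MAP_WRITE, 0, 0, 4096)
                         : NULL;
    if (view) {
        view[0] = 42;              /* on DAX, a store straight to the media */
        UnmapViewOfFile(view);
    }
    if (section) CloseHandle(section);
    CloseHandle(file);
    return 0;
}
```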
And as I said, an application that uses memory mapping maximizes performance, because there's nothing in its way. Now, how did we do for compatibility? All file systems and all operating systems have these standard file APIs: read and write, cached or non-cached. So how does cached I/O work on a DAX volume? It's pretty straightforward: the cache manager creates a direct map. So in essence, this is a one-copy I/O, because the user's buffer is copied by the cache manager directly into the persistent memory. So again, it's much faster; we're not going down the storage stack, and there's this one-copy access to and from the user's buffer on reads and writes.
Oops, did I miss one?
Sorry, too fast.
And then there's non-cached I/O. In our environment, we convert non-cached I/O on a DAX volume to cached. So if you issue non-cached I/O, that's fine, but we actually do it as cached I/O to maximize the performance.
It's doing a simple copy.
An example of an existing application,
the compilers and linkers that are used on Windows all memory map their files to do their work.
And we did some tests with those.
You know, everyone wants to build their system on these environments, right?
And there were pretty nice perf improvements, by eliminating the paging I/Os and everything else, from running our build system on a DAX volume. The problem we have is DAX volumes aren't that huge yet. The hardware availability is still pretty minimal.
Now,
I mentioned earlier that there are impacts to file system functionality on these volumes, because when you get the file system out of the way, it can't do the transformations of data that it likes to do. So one example of something you lose is NTFS software encryption. We have a per-file encryption engine; you can't use that on a memory-mapped file in DAX mode.
We can't do compression.
We have this concept in NTFS called TxF, which provides transactional semantics.
We turn that off on DAX volumes as well.
We have a USN journal, which is a change journal.
And one of the features of the journal is you can track ranges that have been modified.
Well, when you don't know that a range is modified, because you see no I/O to it, because the application is talking directly to persistent memory, you lose that functionality.
Another concept in our file system is resident files. We have our MFT, which is basically equivalent to a Linux inode. If the file is small enough, we can actually embed the data for that file directly in there.
That doesn't work on a DAX volume.
I mean, it's really nice from a space savings point of view because you're not allocating
any clusters for it.
But one of the features that you can do with resident files is you can memory map them.
Well, when it's a DAX volume, you really can't do this memory map
to our internal metadata structures
and have it be robust and work properly.
So we had to simply turn this functionality off
on DAX volumes.
So there is some impact on functionality.
It's just the price you pay for performance.
Some additional functionality.
One of our big challenges is
that you don't see file changes anymore.
And so when you memory map a file,
we had to make changes that basically said,
hey, if you create a writable memory map section,
we're just going to say you modified the file.
We're just going to assume you're going to modify it
if you're creating a writable section.
And so we update modified and access times.
Again, we have this USN
change journal just tracking changes.
We also just say, hey, you modified
it, because we honestly don't know.
And then we also have
a concept called directory change notify, and
it's a way for an application to be told about changes. This is how the shell works in Windows. You create a file through some command or other tool, and all of a sudden the shell will just pop up and show you the new file there. That's because we notify the applications through this mechanism that there have been changes in this directory. That notification really doesn't work very well if you don't know that the file's changed.
So again, when you just create the section, we just signal, hey, this file's changed, so don't worry about it.
Do what you're going to do.
We also currently do not support sparse files or the ability to defrag. These are things that we know how to do; we just haven't had the time, and we have not implemented them yet.
But they are coming. Now, there's this concept in the Windows environment called a file system filter. It's a layering concept: above the file system, you can put a filter that sees all the operations coming into the file system and all the operations coming out, and it can manipulate them. These are examples of the types of filters that exist in the system. In the Windows environment, antivirus products are all implemented as file system filters; that's how they do their real-time scanning. And there are lots of different kinds of them; this is just a random list: encryption, quota, compression, replication. Well, these filters have problems because, again, they've lost their traditional hook points for some of the operations they want to do.
An example, there are lots of encryption filters out there because everyone wants to do their own encryption.
They can't tell when the data is changing, so you can't do encryption on a DAX volume. You don't know when it's being accessed. The idea of these encryption products, these encryption filters, is they want the data encrypted on the media but clear text to the application. They simply can't do that on a DAX volume, because they don't know when the data is being accessed.
They don't know when to transform it. Frankly, it's not sitting in RAM anywhere. It's just directly to the persistent memory, and you can't transform it in place.
And so this functionality is lost. So what we ended up doing for compatibility
is we just don't tell filters about DAX volumes at all, except that we have a new opt-in model.
And so when a filter registers with the system, they can say, hey, I understand DAX.
And then we'll start telling them about this information.
And by the way, that's what most of the antivirus products have done
so that they can understand DAX.
It works fairly well for them
since most of their products scan on open and close
and that really hasn't changed.
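As an illustration of that opt-in, here is a minimal kernel-mode sketch, assuming the documented minifilter registration flag FLTFL_REGISTRATION_SUPPORT_DAX_VOLUME from fltKernel.h; the callback table and everything else here are placeholders, not a real filter.

```c
#include <fltKernel.h>

/* The filter's own operation callbacks, defined elsewhere in the driver. */
extern CONST FLT_OPERATION_REGISTRATION Callbacks[];

CONST FLT_REGISTRATION FilterRegistration = {
    sizeof(FLT_REGISTRATION),               /* Size */
    FLT_REGISTRATION_VERSION,               /* Version */
    FLTFL_REGISTRATION_SUPPORT_DAX_VOLUME,  /* Flags: the "I understand DAX" opt-in;
                                               without it, Filter Manager never
                                               attaches this filter to DAX volumes */
    NULL,                                   /* ContextRegistration */
    Callbacks,                              /* OperationRegistration */
    NULL,                                   /* FilterUnloadCallback */
    /* remaining callback fields omitted (zeroed) in this sketch */
};
```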
So we do have a concept of block mode volumes. All we mean by this is it
just runs the way a hard disk has run for years. It goes down a normal I-O stack.
You do a read and write, it goes down to the storage driver and it does copies
back and forth. And so we have that ability, and it's the fastest.
You take an NVMe device and you compare it to this thing,
it's like 10 times faster.
So it's like the fastest storage device you have.
And so it's there for full compatibility.
Everything works.
All of our file systems work.
All of our existing drivers work.
People can run this way, get a big performance boost,
and maintain their compatibility. Now, one of the big assumptions that applications have made over the years is sector atomicity. They believe that when I write a sector, the whole sector will go out or nothing will go out, because this is basically how hard drives have functioned for a long time.
And you don't get that with persistent memory, as an example even in block mode. The processor is just doing a copy; it's copying from the user's buffer to the sector, and that's it. You can lose power at any point in time and be left in any random state.
And so there is an algorithm called BTT, the Block Translation Table. It was originally created by Intel, but it's been standardized now. And basically, as you all know, in software you can solve all problems by adding a layer of indirection. All it is is a translation table. When you're copying in, you copy to a separate piece of memory and then you swap it atomically in the translation table, so either you see the complete old sector or the complete new sector. That's what BTT does. We use BTT in block mode to make those atomicity guarantees. We do not use BTT in DAX mode, because again, we don't know when an application is changing things, and so those atomicity issues fall on the application. When you're using DAX mode, you have to deal with power failure and being able to recover your state. That's the app's problem, and they have to deal with it.
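Here is a toy sketch of the indirection idea only, not the real BTT on-media format (which adds a free list, a flog, and checksums). Writes are staged in a spare block; a single atomic store of the map entry then publishes the whole sector, so a reader sees either all-old or all-new data. Assume map[i] = i was set up at init time.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define NSECTORS    1024u

static uint8_t media[(NSECTORS + 1) * SECTOR_SIZE]; /* data blocks + 1 spare */
static _Atomic uint32_t map[NSECTORS];              /* LBA -> physical block */
static uint32_t spare = NSECTORS;                   /* current spare block  */

void btt_write(uint32_t lba, const uint8_t *buf)
{
    uint32_t target = spare;
    memcpy(&media[target * SECTOR_SIZE], buf, SECTOR_SIZE); /* stage the full sector */
    /* (on real persistent memory you would flush 'target' to durability here) */
    uint32_t old = atomic_exchange(&map[lba], target);      /* atomic publish */
    spare = old;                       /* displaced block becomes the new spare */
}

void btt_read(uint32_t lba, uint8_t *buf)
{
    uint32_t phys = atomic_load(&map[lba]);
    memcpy(buf, &media[phys * SECTOR_SIZE], SECTOR_SIZE);   /* all-old or all-new */
}
```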
okay so that's kind of a high-level review of all the functionality that we released
last year.
It was all released a year ago.
It's available today, and there are people starting to use it.
So let's talk about some of the new stuff we've been doing, adding support and continuing to enhance this.
Well, one of the things we realized is that when these applications are trying to do their stuff, you have to remember there are still processor caches. There are layers of caches in the system. So even though I write to this persistent memory range, it's not on the media yet. It's just gone into the L3 cache, L2 cache, L1 cache; wherever it is, it's somewhere in this caching hierarchy that's built into our modern-day processors.
And so we needed some APIs to flush regions so that we could make durability guarantees.
So we defined what we call these RTL APIs.
It's a concept in Windows.
And these are available both from user mode and kernel mode. And it performs the necessary work to optimally flush persistent memory through the CPU caches and get it onto hardware.
And there is documentation available on this up on MSDN.
This is the list of APIs, and I'm not going to go into a lot of detail about what they all do. But one of the things, and you see this starting out, is we have this get-a-non-volatile-token API. The idea is that someone has created some section and they're going to be doing multiple manipulations on it. You request a token to cover that section; that's what the buffer and size parameters are, and you get this token back. The idea of the token is that it contains state information about that region.
And what we can do is,
if you're trying to run in a debug mode,
we can actually maintain some debug information
to help make your stuff more robust in these environments
and verify that, hey, are you flushing all the regions that you want to flush?
Have you flushed everything that you've modified?
Things like this.
And so we have this token concept.
It's super fast, but it gets all the information about
what types of flush operations are available on your processor,
what are the durability requirements, etc. For example,
we see points in the future where people will have batteries on machines. And so if you have a battery that makes some guarantees about how it works, maybe you don't even need to do the flush. Maybe you can skip it: you call this thing, and it returns immediately and does nothing, because the system is handling all that. That's what the token gets us: we can track all this state internally and make it transparent to you. You just do this stuff, and it'll do the right thing in your hardware environment, the most optimal thing it can do. So there's a generate token, and obviously there's got to be a free for the given token. This is the general command: it's just flush non-volatile memory.
It's a range.
We do have this concept in here.
So, you know, you give the token,
you give a buffer and a size, however big it is,
and then it will optimally flush that given region
in a DAX environment.
Now, there is this concept in the processors that you actually have to drain. Even though you issue the various flush operations, CLFLUSH or whatever it is, they actually don't complete until you do an SFENCE operation. And so we built in this concept that we don't automatically do the SFENCE instruction every single time. So you can do multiple flushes and then SFENCE.
And so we added this thing to suppress the drain if you want.
We have this flag, and there are additional flags that we can allow. The idea is that I can flush multiple regions and make it as optimal as possible, and then obviously there needs to be a separate drain operation if you're going to do this. And then we also have one that will flush multiple ranges in one call. Basically, you have an array; you identify the array and how many entries are in it, and it can flush all of that. So you can do it all in one operation, and again, even on these, you can say, hey, I'm going to do the drain separately.
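To make the sequence concrete, here is a sketch assuming the documented ntifs.h signatures for these RTL calls (RtlGetNonVolatileToken, RtlFlushNonVolatileMemory with FLUSH_NV_MEMORY_IN_FLAG_NO_DRAIN, RtlDrainNonVolatileFlush, RtlFreeNonVolatileToken); kernel mode shown, and error handling is trimmed. Treat the exact flag and parameter details as best effort from the docs, not from the talk.

```c
#include <ntifs.h>

void FlushTwoRegions(void *Mapped, SIZE_T ViewSize,
                     void *RegionA, SIZE_T LenA,
                     void *RegionB, SIZE_T LenB)
{
    PVOID token = NULL;

    /* Token covers the whole mapped section and caches the state described
       above: which flush instructions exist, whether a drain is needed, etc. */
    if (!NT_SUCCESS(RtlGetNonVolatileToken(Mapped, ViewSize, &token)))
        return;

    /* Flush multiple regions, suppressing the drain on each... */
    RtlFlushNonVolatileMemory(token, RegionA, LenA,
                              FLUSH_NV_MEMORY_IN_FLAG_NO_DRAIN);
    RtlFlushNonVolatileMemory(token, RegionB, LenB,
                              FLUSH_NV_MEMORY_IN_FLAG_NO_DRAIN);

    /* ...then make them all durable with one drain (the SFENCE step). */
    RtlDrainNonVolatileFlush(token);

    RtlFreeNonVolatileToken(token);
}
```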
We also have this write non-volatile memory. Bottom line, it's a memcpy. Right now, it doesn't flush the copied memory. What we need to do is enhance this and add some flags so you can say, hey, I want to copy, and I don't want to drain. We're planning on doing all that; we just didn't have time in this time frame. But we're going to enhance this. Bottom line, it's a memcpy command to DAX memory. And again, you can leverage this when you're running in debug mode: we have the ability to track the ranges you've copied and whether you've flushed all of them, and tell you if you haven't. We can do things like that.
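A hedged sketch of what that looks like in practice, assuming the documented ntifs.h signature for RtlWriteNonVolatileMemory; the zero flag values are placeholders. Since, as noted above, the copy doesn't flush today, a caller pairs it with an explicit flush.

```c
#include <ntifs.h>

void WriteRecord(PVOID Token, void *NvDest, const void *Src, SIZE_T Len)
{
    /* Copy into DAX-mapped persistent memory (a memcpy under the covers). */
    RtlWriteNonVolatileMemory(Token, NvDest, Src, Len, 0 /* no flags yet */);

    /* Make it durable explicitly, since the write itself doesn't flush today. */
    RtlFlushNonVolatileMemory(Token, NvDest, Len, 0 /* flush + drain */);
}
```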
It is not. Actually, that's a really good question. Today, it is not a non-temporal move. Again, additional flags could do things like a non-temporal copy, a copy with drain, things like that. Those are definitely coming.
For anyone that doesn't know, the Intel processors have move instructions that are non-temporal. They bypass the caches, so you can do a write to persistent memory without going through the cache, and it's actually persisted immediately. Obviously, there's a bit of a performance impact in doing that; it's a little more performant to go through the caches. But those are very useful instructions, depending on what you want to do.
Yes. The comment was, if you really want to write to persistent memory, going through the caches may not be a good thing. Completely fair comment.
On Windows, we had an API that would do a non-temporal copy. It was basically a version of this, but non-temporal. We discovered an interesting issue with it; I don't remember if this was an actual issue with the CPU. If you do less than eight bytes, it doesn't do the non-temporal portion of the copy. And we got bit by this because we didn't realize it in the beginning. So you have to understand the processors, too. I think Intel was going to do something about that.
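For illustration, here is a small user-mode sketch of the non-temporal stores just described, using the standard x64/SSE2 intrinsics rather than any of the RTL APIs. Note the 8-byte granularity, which is exactly the sub-8-byte pitfall mentioned above.

```c
#include <immintrin.h>   /* _mm_stream_si64 (x64), _mm_sfence */
#include <stddef.h>

/* Copy 'count' 8-byte words with non-temporal stores that bypass the CPU
   caches, then fence so the stores are complete and globally visible. */
void nt_copy8(long long *dst, const long long *src, size_t count)
{
    for (size_t i = 0; i < count; i++)
        _mm_stream_si64(dst + i, src[i]);   /* non-temporal 8-byte store */
    _mm_sfence();                           /* the drain/SFENCE step */
}
```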
Now, one of the things that came up
as people were trying to use persistent memory
is they said, you know, we have these NUMA environments.
What NUMA node is this given persistent memory sitting on?
And so we've defined an FS control in the file system.
One of the things that we do is require a persistent memory disk to reside on a single NUMA node. Our persistent memory driver down at the bottom does this automatically: if it sees persistent memory on different NUMA nodes, it'll expose them as different disk drives, essentially. So we don't currently allow, at least at the driver level, combining persistent memory drives across NUMA nodes, because you would get weird performance characteristics on that drive.
And so this is our solution to that.
And then we added this operation. So, hey, you can find out what NUMA node it's on and target your application to that NUMA node. Scope your application to run in the context of that NUMA node, just to optimize performance.
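The talk doesn't name the new FS control, so as a hedged stand-in, this sketch queries the NUMA node of a PMEM disk through the storage property interface (StorageDeviceNumaProperty from winioctl.h, which I believe exists in this era of Windows 10 but treat as an assumption); it illustrates getting the same information, not the file-system-level control itself. The drive path is illustrative.

```c
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    HANDLE disk = CreateFileW(L"\\\\.\\PhysicalDrive1", 0,
                              FILE_SHARE_READ | FILE_SHARE_WRITE,
                              NULL, OPEN_EXISTING, 0, NULL);
    if (disk == INVALID_HANDLE_VALUE) return 1;

    STORAGE_PROPERTY_QUERY query = {0};
    query.PropertyId = StorageDeviceNumaProperty;  /* assumed available */
    query.QueryType  = PropertyStandardQuery;

    STORAGE_DEVICE_NUMA_PROPERTY numa = {0};
    DWORD bytes = 0;
    if (DeviceIoControl(disk, IOCTL_STORAGE_QUERY_PROPERTY,
                        &query, sizeof(query), &numa, sizeof(numa),
                        &bytes, NULL))
        printf("PMEM disk resides on NUMA node %lu\n",
               (unsigned long)numa.NumaNode);

    CloseHandle(disk);
    return 0;
}
```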
So, let's talk about large page support. I wasn't sure everyone knew what a large page was, so I did this little write-up of what large pages are in modern CPUs. Basically, as everyone knows, well, hopefully everyone knows, CPUs manage memory for the most part in 4K chunks. Intel has been doing this forever; ARM does it this way as well. An application's memory is managed by the OS in page tables, and they're all, again, 4K page tables. These are trees that are built by the memory manager.
Now, it would be very inefficient
if the CPU had to access these page tables
on every single memory access.
And so they have this construct called the TLB.
It's called a Translation Look-Aside Buffer
that caches page table entries in the processor.
And for applications with a very large footprint,
the CPU can spend a lot of time
swapping these page tables in.
They can exceed the size of the TLB.
And so Intel invented a while ago, I don't even know when, but it's been there for a while, what they call a large page. Basically, it's a 2 megabyte page instead of a 4K page. If you look at it, it's just removing a tier: the page table on a 64-bit processor is multiple tiers deep, and this removes one of the tiers in the page table description. But basically, it is a contiguous 2 megabyte region of memory that is aligned on a 2 megabyte boundary, because that's really important, and is 2 megabytes in length. When you do that, the memory manager can describe that range with a single entry, and then the processor can fault that entire range into the TLB as a single entry. Massively more efficient. One of the numbers I heard when we were talking about implementing this is that SQL, when they were doing some benchmark testing with a large data set, could see a 30% performance boost by using large pages over normal pages. So there's a significant performance impact to this.
So, they came to us
after we had all this nice persistent memory
support, and we're patting ourselves on the back for all the good stuff.
And they said, yeah, but.
Because that's what application guys always say.
Yeah, we like that, but.
We always want more.
And so we needed to implement large page support in our system.
And how we did that is we decided to take NTFS, which had a cluster size limit of 64K, and up it to 2 megabytes. So we now support cluster sizes up to 2 megabytes, in powers of 2. Basically, anywhere from 512-byte clusters all the way up to 2 meg clusters is supported today.
On DAX volumes, it's 4K to 2 megabytes. And then the other thing we did is we had to modify our partition manager, because by default it aligns everything on a 1 meg boundary; we had to modify it to align on a 2 meg boundary for DAX volumes. That way, with the 2 megabytes reserved for partition structures at the front, the volume starts on a beautiful 2 meg boundary.
We create 2 meg clusters.
Everything is beautifully aligned so that the memory manager, when it gets the mapping information, can just create large pages automatically.
So the bottom line is that if you have 2 meg clusters and you memory map a file on a DAX volume,
it'll automatically use large pages.
It'll just happen automatically.
No problems.
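A small sketch of checking that from user mode: the volume would be formatted with 2 MB clusters (with something like format /dax /fs:ntfs /a:2m, exact switch spellings hedged), and GetDiskFreeSpaceW reads the resulting cluster size back. The drive letter is illustrative.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD sectorsPerCluster, bytesPerSector, freeClusters, totalClusters;
    if (!GetDiskFreeSpaceW(L"D:\\", &sectorsPerCluster, &bytesPerSector,
                           &freeClusters, &totalClusters))
        return 1;

    /* Cluster size = sectors per cluster * bytes per sector. */
    unsigned long long cluster =
        (unsigned long long)sectorsPerCluster * bytesPerSector;

    printf("Cluster size: %llu bytes %s\n", cluster,
           cluster == 2ull * 1024 * 1024
               ? "(2 MB: mapped files can be described with large pages)"
               : "(large pages need 2 MB clusters on a DAX volume)");
    return 0;
}
```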
Now, on top of large pages,
there's this concept called huge pages.
And it's just taking another chunk out of the mapping tree.
And a huge page is one gigabyte in size.
Now, I did the math one day and said,
we are not going to add a one gig cluster size to the file system.
That is just not practical.
And so we are going to add huge page support in the future, but we're going to have to modify the allocator inside NTFS to deal with these boundary alignment requirements and manage it. When we do that, we'll also add support for 2 megabyte boundaries without using 2 megabyte clusters. But this was a simpler way to go to get us started.
And by the way, as we were talking about one gigabyte pages, I learned from the owner of the memory manager in Windows, his name is Landy, that there's another page size above one gigabyte: 512 gigabyte pages. It's not there yet, but he says it's coming. When you get to these full 64-bit address spaces, the hierarchy gets a little crazy. So 512 gig, I don't know what they call it, ultra pages? Ultra huge? But he warned me: think about these super huge pages in the future, because they're coming too.
And actually, once we do the work in the allocator to deal with this, it'll work with anything.
But that just means you have to have a lot of persistent memory in your machine to even make that worthwhile.
Okay, one of the questions when I came and spoke two years ago was,
what about Hyper-V? What's the support for it?
At the time, my response was, not supported.
Well, that has changed. In our upcoming release, we have full support for persistent memory and for DAX volumes in our virtualization environment, which is known as Hyper-V. Windows and Linux guests in Generation 2 VMs see virtual persistent memory, vPMEM devices. So basically, what happens is they have defined this new VHD type
called the VHD PMEM,
and you create this VHD PMEM file,
and you put it on a DAX volume,
and when you run the guest on that,
the guest just automatically sees it.
It comes through the normal NFIT tables and everything. It just looks like persistent memory to the guest, and if the guest is persistent memory aware, it'll function: it can create its own DAX volumes, and it'll be direct mapped all the way to the hardware without any intervention by the operating system or anyone else, all set up through the page tables. They do have the ability to convert VHDX to this VHDPMEM format, because there is a little bit of a difference, so they can convert back and forth. And this BTT concept that I talked about: for a VHDPMEM, they can decide at creation time if it has a BTT or not.
So if they're going to use this VHDPMEM file as a block volume in the guest, then they can define a BTT. You get those atomicity guarantees, and it makes everything more robust for the file system.
And so they have a choice.
Also, the VHDPMEM files can be mounted as a SCSI device on the host. You don't get direct access; it doesn't look like a DAX volume when you mount it like that. But you get access to these files, and you can do normal I/O to them and everything. So you can still mount these VHDPMEM files and operate on them, and it can be in BTT mode or not; all of that works in this basically local loopback mount of a VHDPMEM file. And then each virtual PMEM device, which is what it looks like inside the guest, is backed by one VHDPMEM file.
So again, we have full support in the guest to see persistent memory.
Yeah, I have that on the next slide.
So all the functionality is there that is needed.
And then, again, basically everything you can do on the host, you can do in the guest; that's what we're saying there.
They have large page support,
so if you create your DAX volume with 2 megabyte clusters, your VHDPMEM will then be nicely aligned, and you automatically get large page support inside the guest as well.
There is some functionality that's currently missing that normal VHDs have: there's no live migration, no checkpoints, no backup, no save and restore. This is functionality that'll come over time. With these six-month release cycles, there's only so much you can do in one cycle, but they have it functional. Okay, let's talk about NVML. There's been some discussion here at the SDC about what NVML is. It's the Non-Volatile Memory Library, originally implemented by Intel. We have been in a joint venture between Intel, HPE, and HP Labs on porting this to Windows. It is feature complete at this point and usable. It's available up on GitHub; you can go out there and grab it. We've seen a few hundred people download it and potentially use it.
I'll just give a brief description of what NVML is. As I mentioned earlier, if you're talking to persistent memory directly, there are these caches and so on, and even if you're using non-temporal instructions, you can lose power at any point in time. The application is responsible for data recovery. What the NVML library does is give you some basic primitive operations that are guaranteed atomic inside the library. So they make the atomicity guarantees. It makes it easier for application guys to develop persistent memory applications and take advantage of it. The overhead of this is pretty small, so they get all the performance benefits without the complexity on their part.
You know, the big guys of the world, the SQLs of the world, the Azures, the Amazons of the world, they can put in the resources to do this stuff directly, and they don't necessarily want this. This is for all the other people in the world that want to start leveraging persistent memory in their environment. And we are feature complete at this point. It's available. Now, the thing we haven't done yet: we've done all the basic work of supporting these basic primitives, but there has been additional work in the library for remote access to persistent memory. This is an area we haven't started yet; we need to start looking into it and see what we want to do on the Windows side.
That library continues to grow.
I think I already described this. It basically abstracts out OS-specific dependencies. What's nice about this model is that they do everything through memory-mapped files, because that's the model on both Windows and Linux, and it makes its own atomicity guarantees. What's kind of cool about it is it runs in both persistent memory and non-persistent memory environments. You can just run on a normal disk; it's memory mapping, so you'll do paging I/O, but all the guarantees it makes still work. So it works in both models. And again, it's simplifying application development so you don't have to deal with recovery from power loss.
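As a taste of the programming model, here is a minimal libpmem sketch in C, assuming NVML's documented pmem_map_file/pmem_persist API; the path and sizes are illustrative. On a DAX volume, is_pmem comes back true and pmem_persist flushes the CPU caches; on a regular disk, the same code falls back to an msync-style flush, which is the "works in both models" point above.

```c
#include <libpmem.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Create (or open) and map a 4 KB file; works on DAX and non-DAX volumes. */
    char *addr = pmem_map_file("D:\\log.pmem", 4096, PMEM_FILE_CREATE,
                               0666, &mapped_len, &is_pmem);
    if (addr == NULL) return 1;

    strcpy(addr, "hello, persistent world");

    if (is_pmem)
        pmem_persist(addr, mapped_len);  /* flush caches straight to the media */
    else
        pmem_msync(addr, mapped_len);    /* ordinary mapped-file flush path */

    pmem_unmap(addr, mapped_len);
    return 0;
}
```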
This is just the list of libraries.
And again, a website right here where you can get more information on it.
And in this earlier slide, this is the website where you can go out to GitHub and download it. That link, I think, covers both the Windows and the Linux implementations of the library. And right now, the APIs are identical between the two platforms. We normalized the APIs so they're no different, just to make it easy to port back and forth.
Okay, let's talk a little bit about our driver model and what we did for that in Windows.
So originally, when we first did persistent memory back in Server 2016,
we basically broke it up into what we call a bus driver and then the SCM disk driver. The bus driver was not involved in the I/O path. Basically, it was the one that gets all the info from the BIOS through ACPI; it does the management of persistent memory, defines the disks, and makes them available. This diagram is actually a little incorrect: in block mode, we come through the disk driver and copy, but in DAX mode, we don't even come to the driver for I/O. We come to the driver to get the physical addresses to set up the mappings, and then the I/O goes directly. And then management status also comes out of this, through the bus driver and up.
Now, that was a fine architecture to get us started, but as we started thinking about it, there are new types of NVDIMMs coming out. There are specs in progress for NVDIMM-P, which is a different type of device. You know, 3D XPoint: Intel's kind of doing their own thing in how they're exposing 3D XPoint. And so we needed a model where we could grow and expand on this, and not have a monolithic driver. So what we did is we broke up this scmdisk0101 driver into two drivers.
We have this generic PMEM.SYS driver, which handles byte addressability and interleave sets and is responsible for I/O; it's generic. And then we have this very specific NVDIMM-N driver, which is what we support today in Windows. It's responsible for the physical aspects of that particular NVDIMM part.
And so we kind of broke up our environment.
What this allowed us to do is make it easy to support new NVDIMM types as they come out. We have clear separation of responsibilities for management and access. This is kind of what the picture looks like. One of the things you can do is have multiple chips; obviously, you can have multiple of these NVDIMM-N parts. In the BIOS you can describe whether you want them interleaved or standalone, however you want.
It's all configurable.
And then this is how they manage it: they create one for each chip, and then they know whether they're interleaved and deal with all that. And there's this generic driver managing it all. So we've separated things out and cleaned up the environment, and basically this is a big step forward for where we want to go in the future as more and more types of persistent memory become available.
Now, one of the other areas that's really interesting
is uncorrectable error handling.
The processors are kind of getting caught up in this area.
Back when we first released this, if you got an uncorrectable memory error on the fly, you just bug checked; that's kind of what happens with bad RAM today. And they had this concept that you could detect at boot time what memory had gone bad. So if you bug checked and came back up, it would tell you, oh, this part of the persistent memory is bad, so you can prevent people from using it and bug checking the next time. But you had to take a bug check to deal with it. One of the things they've added support for is the ability to do runtime detection of bad memory. So here's what happens in the driver, and it doesn't matter now whether this was boot-time detected or runtime detected. Obviously, if you're in block mode and you do a block I/O to a bad sector, it'll protect that: it just fails that I/O, like a bad sector on a disk drive. If a given block is unmapped, and when I say unmapped, I mean it's not memory mapped into a file at this time, we basically fail the mapping requests.
How it works is, any time someone memory maps a file on a DAX volume, the memory manager asks the file system for those physical page mappings, and we send that down to the driver, because he's the one that actually knows. So we basically say, here are the logical block addresses, the LBAs, for this file, and send them down. He gives us physical pages back, which we pass to MM. Well, when we do that, if there's a bad page, they now fail it, and we have a way to tell, and we can narrow it down to exactly which 4K page is bad. And so we can deal with this.
Now, obviously, if it's mapped and then it becomes bad, that's a runtime detection of bad memory, because we wouldn't have allowed it to be mapped if we'd known at boot time. So now what we do, and this is part of our enhancement, is we can actually ask the memory manager to unmap that page. Unfortunately, there are reasons that applications can lock these pages down. So even though we can ask for it, it is a best effort to unmap it. The memory manager will attempt to unmap it, but if an application has pinned it, there's not much we can do. We don't have the ability to rip it out from under them, because we don't know what state they're in. So there is this best-effort thing going on here. These are areas that we want to see if we can improve on in the future.
Oh.
Say that again?
For which part?
Today, this runtime detection is a notification that the driver is getting from the hardware. There is a notification we get that something has gone bad.
They're doing some sort of scrubbing in the background
at the hardware level,
and they can tell us some page has gone bad on the fly. That's why you can also have situations where an application just reads some piece of data, and that is the moment that it goes bad.
Yes.
Something is wrong.
Yeah.
They're actually doing background scrubbing at the hardware layer to detect these things.
They have ECC and whatever.
They're trying to correct all this stuff, but if it gets too far out, then they raise errors.
And they actually have a concept of a hard fault, and I don't think the term is a soft fault, but the basic idea is that it's going bad, and they warn you that it's going bad. That's defined, but it's not implemented yet in the hardware. And so in the future, we'll have the ability to be warned that it's going bad.
What's good about that model is the file system can remap blocks. We handle bad sectors already; we have that concept from our long history of dealing with hard disks. We can remap to a new location pretty transparently to anyone using it, because we still have the old data to copy.
Does that make sense?
Does that help?
But yes, there still is the problem
of it going bad right at that instant
as you're reading it.
Go ahead.
If you expect the application to do it, it's kind of weird to say that we should not develop the code to identify it. Go ahead. So, how this surfaces to the application: I don't know how this works in the Linux environment, but in the Windows environment, you put try/excepts around your accesses to mapped memory, and a fault will be raised saying there's a bad page.
It's the application's job to figure out what they want to do with it
and how they deal with it.
But basically when they access something,
it'll raise and just tell them
there's something wrong.
And guess what? A lot of apps that use memory-mapped files don't do this try/except capturing, so they just crash. The app will just terminate if they don't handle it gracefully.
But it's up to the app to deal with it.
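Here is a sketch of the try/except pattern just described: on Windows, touching a mapped view whose backing page has gone bad raises EXCEPTION_IN_PAGE_ERROR, which the app can catch with structured exception handling instead of crashing. The `view` pointer is assumed to be a mapped view of a file on a DAX volume; recovery policy is left to the caller.

```c
#include <windows.h>
#include <stdio.h>

/* Returns 1 on success, 0 if the backing page raised an in-page error. */
int ReadByteGuarded(const volatile char *view, size_t offset, char *out)
{
    __try {
        *out = view[offset];               /* may touch a bad PMEM page */
        return 1;
    }
    __except (GetExceptionCode() == EXCEPTION_IN_PAGE_ERROR
                  ? EXCEPTION_EXECUTE_HANDLER
                  : EXCEPTION_CONTINUE_SEARCH) {
        fprintf(stderr, "Bad page at offset %zu\n", offset);
        return 0;                          /* the app decides how to recover */
    }
}
```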
In the last case,
it unmaps only a given page.
Yes.
In the middle case, does it unmap a page
or does it fail the entire map?
No.
Actually, that's a good question. What Tom was asking is, when we're making a mapping request, we're requesting a whole range. What we do is return a failure on the mapping request, but we put in there how many mappings we gave. So we know where the line is, and we can give back those mappings. We know this page is bad, and then MM can re-request the rest of it. So we give back what we can, up to the boundary of where the failure is, and that's how we identify it.
Okay, but the failure is identified at 4K granularity? Yes. It turns out that they can give cache-line information, but we can only control pages, so we do 4K pages.
And if you have bad memory in there,
you can't get a large page.
It doesn't work.
Yes.
Kind of in closing, a little bit on the work we've done on standardization. I'm not going to dig into this, but this whole persistent memory effort has involved a lot of standardization work, so that Linux and Windows and whoever else, Mac, can all use this stuff and have a consistent model for the hardware. So there's been a great deal of standardization work that we have been very involved with.
And then, last of all,
when is this new stuff coming out?
Basically, we've announced October 17th
is our next client update,
and I don't have an exact date
for when Windows Server is going out,
but I've been told it'll be later this year sometime. And I don't get to announce the date.
But that is all the work that we've done.
Thank you for coming and open to questions.
Go ahead.
Yes. So the question is, are there APIs to enumerate the PMEM devices? I deal more with the file system, but I'm positive that's available, because they show up as regular disk drives.
If I wanted to create my own virtual machine instead, do I have to go into the UI to be able to view it? Ah, in our virtualization environment, you're talking about inside the guest. Inside the guest, they wrote a virtual NFIT. No, Hyper-V is doing that. In a Gen 2 VM, it actually emulates the NFIT structures, so it just comes out looking right; the guest can't tell that it's not real PMEM hardware.
I'm no expert on this.
I got my slides from the Hyper-V guys.
I don't work on the Hyper-V team.
What they told me is that they emulate it. I do not know if they have ACPI there or not, but they told me they emulate the NFIT tables and do it through what they call vPMEM, the virtual PMEM environment.
As far as it being public,
come and talk to me afterwards.
If you have some questions,
we can reach out to the devs that are responsible for that.
I saw a question back here. Go ahead.
Yes.
So the question was, in the new model that we talked about, is the NVDIMM driver more of a class driver or device-specific? If you look at the name, it's NVDIMM-N; it is specifically for the dash-N NVDIMMs. And so it handles all classes of NVDIMMs that are JEDEC compliant. We actually had an interesting issue where someone put a non-JEDEC-compliant NVDIMM into one of our machines. It didn't work very well, for obvious reasons, because we're coded to the standard. So in that sense, it works for any NVDIMM-N that's JEDEC compliant. In the future, there will be one for NVDIMM-P, there will be one for 3D XPoint, whatever other standards get created out there, and as long as they conform to the standards,
We are planning on opening up this to third parties
to develop their own persistent memory drivers.
That's coming in the future.
Any other questions?
If not, thank you very much for being here.
You're more than welcome to come up and chat with me afterwards.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further
with your peers in the storage developer community.
For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.