Storage Developer Conference - #69: Update on Windows Persistent Memory Support
Episode Date: April 10, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair.
Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community.
Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 69. My name is Neal Christiansen. I'm from Microsoft. I lead the NTFS file system development group.
And one of the things that we have been working on, and obviously it's not just a file system thing, is persistent memory, over the last few years.
In fact, two years ago, I came and talked about our initial support for persistent memory.
And so this is simply an update on that.
This is my agenda.
And the first slide here, what is persistent memory?
You know, so many people have talked about it this week.
If you don't know what it is, you can read it.
But I think everyone here understands what persistent memory is.
So the first thing I want to do is a review. Like I said, I was here two years ago talking about our work. All of that work has already been released. It was released a year ago, through the Windows 10 Anniversary Update on the client side, and through Windows Server 2016. So our support has been out there for a full year already. I'm going to do a high-level review of the support that we released back then, and then I'm going to move on and talk about some of the new stuff we've done since then, which is in an upcoming release that'll be out shortly. So basically,
here's a picture. And again, I think this has been commonly talked about in several presentations,
how what we call a DAX mode, or direct access, volume works. The bottom line of the whole thing is that an application can memory map a file, and instead of today, where the mapping is to RAM and there are paging I/Os through the file system down to the storage, it's direct access from that application straight to the persistent hardware.
The file system gets out of the way, which gives the application optimal performance.
Now, there are some costs to doing this, and I'll talk a little bit about that as well.
But this is a high-level model.
And what's important, of course, is that not every app in the world is going to change.
And so we still have to support all of our existing APIs
so that they can function properly in this environment.
And so this is basically, in this slide,
it calls it a block mode application,
but it's basically an application
using standard file APIs.
And they have to work through this whole model,
and I'll talk a little bit about that.
But this is just kind of the high-level view of how it functions.
And the advantage of persistent memory, being byte addressable and sitting on the memory bus, is this whole direct mapping, getting the OS out of the way.
Everyone wants to get the OS out of the way, right? So in our environment, DAX mode is chosen at volume format time.
So at the time you create a volume, you decide if you want it in traditional block mode or
if you want it as a DAX volume.
And one of the reasons we had to go this path is because there's some compatibility issues
with some various components listed there.
And so in DAX mode, these guys get out of the way.
In fact, BitLocker, which is our volume-level software encryption, and VolSnap, which is our volume-level snapshot provider,
they don't even attach to the DAX volume stack.
It's actually a completely separate stack in Windows,
and none of the existing drivers know how to attach to it. They have to be updated, and those haven't been updated yet.
Also be aware that there is some functionality you lose by losing this hook point that the file system has into people's data, and I'll talk about that. So the real question is, how does memory-mapped I/O work on a DAX volume?
An application uses existing APIs that function today.
There was no change in those APIs.
So any app that already uses memory mapped APIs kind of gets this direct access on a DAX volume.
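To make that concrete, here is a minimal user-mode sketch of mine (not from the talk): it checks for a DAX volume using the documented FILE_DAX_VOLUME flag from GetVolumeInformationW, then maps a file with the same CreateFileMapping/MapViewOfFile calls that work on any volume. The drive letter, file name, and sizes are illustrative.

```c
#include <windows.h>
#include <stdio.h>

#ifndef FILE_DAX_VOLUME
#define FILE_DAX_VOLUME 0x20000000  /* documented volume flag for DAX */
#endif

int main(void)
{
    DWORD flags = 0;
    if (GetVolumeInformationW(L"D:\\", NULL, 0, NULL, NULL, &flags, NULL, 0) &&
        (flags & FILE_DAX_VOLUME))
        printf("D: is a DAX volume; mapped views are direct access.\n");

    HANDLE file = CreateFileW(L"D:\\data.bin", GENERIC_READ | GENERIC_WRITE,
                              0, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;

    /* The exact same mapping calls as on a block volume; on DAX the view
       is backed directly by the persistent memory. */
    HANDLE section = CreateFileMappingW(file, NULL, PAGE_READWRITE, 0, 4096, NULL);
    char *view = section ? (char *)MapViewOfFile(section, FILE_MAP_WRITE, 0, 0, 4096)
                         : NULL;
    if (view) {
        view[0] = 42;              /* on DAX, a store straight to the media */
        UnmapViewOfFile(view);
    }
    if (section) CloseHandle(section);
    CloseHandle(file);
    return 0;
}
```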
And as I said, an application that uses memory mapping maximizes performance, because there's nothing in its way. Now, how did we do for compatibility? All file systems and all operating systems have these standard file APIs: read and write, cached or non-cached. So how does cached I/O work on a DAX volume? It's pretty straightforward: the cache manager creates a direct map. So in essence, this is a one-copy I/O, because the user's buffer is copied by the cache manager directly into the persistent memory. So again, it's much faster; we're not going down the storage stack, and there's this one-copy access to and from the user's buffer on reads and writes.
Oops, did I miss one?
Sorry, too fast.
And then there's non-cached I/O. In our environment, we convert non-cached I/O on a DAX volume to cached. So if you issue non-cached I/O, that's fine, but we actually do it as cached I/O to maximize the performance.
It's doing a simple copy.
An example of an existing application,
the compilers and linkers that are used on Windows all memory map their files to do their work.
And we did some tests with those.
You know, everyone wants to build their system on these environments, right?
And there were pretty nice perf improvements, by eliminating the paging I/Os and everything else, from running our build system on a DAX volume. The problem we have is DAX volumes aren't that huge yet. The hardware availability is still pretty minimal.
Now,
I mentioned earlier that there are impacts to file system functionality on these volumes, because when you get the file system out of the way, it can't do the transformations of data that it likes to do. So one example of something you lose is NTFS software encryption. We have a per-file encryption engine; you can't use that on a memory-mapped file in DAX mode.
We can't do compression.
We have this concept in NTFS called TxF, which provides transactional semantics.
We turn that off on DAX volumes as well.
We have a USN journal, which is a change journal.
And one of the features of the journal is you can track ranges that have been modified.
Well, when you don't know that a range is modified, because you see no I/O to it, because the application is talking directly to persistent memory, you lose that functionality.
Another concept in our file system is resident files. We have our MFT, which is basically equivalent to a Linux inode. If the file is small enough, we can actually embed the data for that file directly in there.
That doesn't work on a DAX volume.
I mean, it's really nice from a space savings point of view because you're not allocating
any clusters for it.
But one of the features that you can do with resident files is you can memory map them.
Well, when it's a DAX volume, you really can't do this memory map
to our internal metadata structures
and have it be robust and work properly.
So we had to simply turn this functionality off
on DAX volumes.
So there is some impact on functionality.
It's just the price you pay for performance.
Some additional functionality.
One of our big challenges is
that you don't see file changes anymore.
And so when you memory map a file,
we had to make changes that basically said,
hey, if you create a writable memory map section,
we're just going to say you modified the file.
We're just going to assume you're going to modify it
if you're creating a writable section.
And so we update modified and access times.
Again, we have this USN
change journal just tracking changes.
We also just say, hey, you modified
it, because we honestly don't know.
And then we also have
a concept called directory change notify, and
it's a way for an application to be told about changes. This is how the shell works in Windows. You create a file through some command or other tool, and all of a sudden the shell will just pop up and show you the new file there. That's because we notify the applications through this mechanism that there have been changes in this directory. That notification really doesn't work very well if you don't know that the file's changed.
So again, when you just create the section, we just signal, hey, this file's changed, so don't worry about it.
Do what you're going to do.
We also currently do not support sparse files or the ability to defrag. These are things that we know how to do; we just haven't had the time, and we have not implemented them yet.
But they are coming. Now, there's this concept in the Windows environment called a file system filter. It's a layering concept: above the file system, you can put a filter that sees all the operations coming into the file system and all the operations coming out, and it can manipulate them. These are examples of the types of filters that exist in the system. In the Windows environment, antivirus products are all implemented as file system filters; that's how they do their real-time scanning. And there are lots of different kinds of them; this is just a random list: encryption, quota, compression, replication. Well, these filters have problems because, again, they've lost their traditional hook points for some of the operations they want to do.
An example, there are lots of encryption filters out there because everyone wants to do their own encryption.
They can't tell when the data is changing, so you can't do encryption on a DAX volume. You don't know when it's being accessed. The idea of these encryption products, these encryption filters, is they want the data encrypted on the media but clear text to the application. They simply can't do that on a DAX volume, because they don't know when the data is being accessed.
They don't know when to transform it. Frankly, it's not sitting in RAM anywhere. It's just directly to the persistent memory, and you can't transform it in place.
And so this functionality is lost. So what we ended up doing for compatibility
is we just don't tell filters about DAX volumes at all, except that we have a new opt-in model.
And so when a filter registers with the system, they can say, hey, I understand DAX.
And then we'll start telling them about this information.
And by the way, that's what most of the antivirus products have done
so that they can understand DAX.
It works fairly well for them
since most of their products scan on open and close
and that really hasn't changed.
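As an illustration of that opt-in, here is a minimal kernel-mode sketch, assuming the documented minifilter registration flag FLTFL_REGISTRATION_SUPPORT_DAX_VOLUME from fltKernel.h; the callback table and everything else here are placeholders, not a real filter.

```c
#include <fltKernel.h>

/* The filter's own operation callbacks, defined elsewhere in the driver. */
extern CONST FLT_OPERATION_REGISTRATION Callbacks[];

CONST FLT_REGISTRATION FilterRegistration = {
    sizeof(FLT_REGISTRATION),               /* Size */
    FLT_REGISTRATION_VERSION,               /* Version */
    FLTFL_REGISTRATION_SUPPORT_DAX_VOLUME,  /* Flags: the "I understand DAX" opt-in;
                                               without it, Filter Manager never
                                               attaches this filter to DAX volumes */
    NULL,                                   /* ContextRegistration */
    Callbacks,                              /* OperationRegistration */
    NULL,                                   /* FilterUnloadCallback */
    /* remaining callback fields omitted (zeroed) in this sketch */
};
```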
So we do have a concept of block mode volumes. All we mean by this is it
just runs the way a hard disk has run for years. It goes down a normal I-O stack.
You do a read and write, it goes down to the storage driver and it does copies
back and forth. And so we have that ability, and it's the fastest.
You take an NVMe device and you compare it to this thing,
it's like 10 times faster.
So it's like the fastest storage device you have.
And so it's there for full compatibility.
Everything works.
All of our file systems work.
All of our existing drivers work.
People can run this way, get a big performance boost,
and maintain their compatibility. Now, one of the big assumptions that applications have made over the years is sector atomicity. They believe that when I write a sector, the whole sector will go out or nothing will go out, because this is basically how hard drives have functioned for a long time.
And you don't get that with persistent memory, as an example even in block mode. The processor is just doing a copy; it's copying from the user's buffer to the sector, and that's it. You can lose power at any point in time and be left in any random state.
And so there is an algorithm called BTT, the Block Translation Table. It was originally created by Intel, but it's been standardized now. And basically, as you all know, in software you can solve all problems by adding a layer of indirection. All it is is a translation table. When you're copying in, you copy to a separate piece of memory and then you swap it atomically in the translation table, so either you see the complete old sector or the complete new sector. That's what BTT does. We use BTT in block mode to make those atomicity guarantees. We do not use BTT in DAX mode, because again, we don't know when an application is changing things, and so those atomicity issues fall on the application. When you're using DAX mode, you have to deal with power failure and being able to recover your state. That's the app's problem, and they have to deal with it.
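Here is a toy sketch of the indirection idea only, not the real BTT on-media format (which adds a free list, a flog, and checksums). Writes are staged in a spare block; a single atomic store of the map entry then publishes the whole sector, so a reader sees either all-old or all-new data. Assume map[i] = i was set up at init time.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define NSECTORS    1024u

static uint8_t media[(NSECTORS + 1) * SECTOR_SIZE]; /* data blocks + 1 spare */
static _Atomic uint32_t map[NSECTORS];              /* LBA -> physical block */
static uint32_t spare = NSECTORS;                   /* current spare block  */

void btt_write(uint32_t lba, const uint8_t *buf)
{
    uint32_t target = spare;
    memcpy(&media[target * SECTOR_SIZE], buf, SECTOR_SIZE); /* stage the full sector */
    /* (on real persistent memory you would flush 'target' to durability here) */
    uint32_t old = atomic_exchange(&map[lba], target);      /* atomic publish */
    spare = old;                       /* displaced block becomes the new spare */
}

void btt_read(uint32_t lba, uint8_t *buf)
{
    uint32_t phys = atomic_load(&map[lba]);
    memcpy(buf, &media[phys * SECTOR_SIZE], SECTOR_SIZE);   /* all-old or all-new */
}
```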
okay so that's kind of a high-level review of all the functionality that we released
last year.
It was all released a year ago.
It's available today, and there are people starting to use it.
So let's talk about some of the new stuff we've been doing, adding support and continuing to enhance this.
Well, one of the things we realized is that when these applications are trying to do their stuff, you have to remember there are still processor caches. There are layers of caches in the system. So even though I write to this persistent memory range, it's not on the media yet. It's just gone into the L3 cache, L2 cache, L1 cache; wherever it is, it's somewhere in this caching hierarchy that's built into our modern-day processors.
And so we needed some APIs to flush regions so that we could make durability guarantees.
So we defined what we call these RTL APIs.
It's a concept in Windows.
And these are available both from user mode and kernel mode. And it performs the necessary work to optimally flush persistent memory through the CPU caches and get it onto hardware.
And there is documentation available on this up on MSDN.
This is the list of APIs, and I'm not going to go into a lot of detail about what they all do. But one of the things, and you see this starting out, is we have this get-a-non-volatile-token API. The idea is that someone has created some section and they're going to be doing multiple manipulations on it. You request a token to cover that section; that's what the buffer and size parameters are, and you get this token back. The idea of the token is that it contains state information about that region.
And what we can do is,
if you're trying to run in a debug mode,
we can actually maintain some debug information
to help make your stuff more robust in these environments
and verify that, hey, are you flushing all the regions that you want to flush?
Have you flushed everything that you've modified?
Things like this.
And so we have this token concept.
It's super fast, but it gets all the information about
what types of flush operations are available on your processor,
what are the durability requirements, etc. For example,
we see points in the future where people will have batteries on machines. And so if you have a battery that makes some guarantees about how it works, maybe you don't even need to do the flush. Maybe you can skip it: you call this thing, and it returns immediately and does nothing, because the system is handling all that. That's what the token gets us: we can track all this state internally and make it transparent to you. You just do this stuff, and it'll do the right thing in your hardware environment, the most optimal thing it can do. So there's a generate token, and obviously there's got to be a free for the given token. This is the general command: it's just flush non-volatile memory.
It's a range.
We do have this concept in here.
So, you know, you give the token,
you give a buffer and a size, however big it is,
and then it will optimally flush that given region
in a DAX environment.
Now, there is this concept in the processors that you actually have to drain. Even though you issue the various flush operations, CLFLUSH or whatever it is, they actually don't complete until you do an SFENCE operation. And so we built in this concept that we don't automatically do the SFENCE instruction every single time. So you can do multiple flushes and then SFENCE.
And so we added this thing to suppress the drain if you want.
We have this flag, and there are additional flags that we can allow. The idea is that I can flush multiple regions and make it as optimal as possible, and then obviously there needs to be a separate drain operation if you're going to do this. And then we also have one that will flush multiple ranges in one call. Basically, you have an array; you identify the array and how many entries are in it, and it can flush all of that. So you can do it all in one operation, and again, even on these, you can say, hey, I'm going to do the drain separately.
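To make the sequence concrete, here is a sketch assuming the documented ntifs.h signatures for these RTL calls (RtlGetNonVolatileToken, RtlFlushNonVolatileMemory with FLUSH_NV_MEMORY_IN_FLAG_NO_DRAIN, RtlDrainNonVolatileFlush, RtlFreeNonVolatileToken); kernel mode shown, and error handling is trimmed. Treat the exact flag and parameter details as best effort from the docs, not from the talk.

```c
#include <ntifs.h>

void FlushTwoRegions(void *Mapped, SIZE_T ViewSize,
                     void *RegionA, SIZE_T LenA,
                     void *RegionB, SIZE_T LenB)
{
    PVOID token = NULL;

    /* Token covers the whole mapped section and caches the state described
       above: which flush instructions exist, whether a drain is needed, etc. */
    if (!NT_SUCCESS(RtlGetNonVolatileToken(Mapped, ViewSize, &token)))
        return;

    /* Flush multiple regions, suppressing the drain on each... */
    RtlFlushNonVolatileMemory(token, RegionA, LenA,
                              FLUSH_NV_MEMORY_IN_FLAG_NO_DRAIN);
    RtlFlushNonVolatileMemory(token, RegionB, LenB,
                              FLUSH_NV_MEMORY_IN_FLAG_NO_DRAIN);

    /* ...then make them all durable with one drain (the SFENCE step). */
    RtlDrainNonVolatileFlush(token);

    RtlFreeNonVolatileToken(token);
}
```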
We also have this write non-volatile memory. Bottom line, it's a memcpy. Right now, it doesn't flush the copied memory. What we need to do is enhance this and add some flags so you can say, hey, I want to copy, and I don't want to drain. We're planning on doing all that; we just didn't have time in this time frame. But we're going to enhance this. Bottom line, it's a memcpy command to DAX memory. And again, you can leverage this when you're running in debug mode: we have the ability to track the ranges you've copied and whether you've flushed all of them, and tell you if you haven't. We can do things like that.
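A hedged sketch of what that looks like in practice, assuming the documented ntifs.h signature for RtlWriteNonVolatileMemory; the zero flag values are placeholders. Since, as noted above, the copy doesn't flush today, a caller pairs it with an explicit flush.

```c
#include <ntifs.h>

void WriteRecord(PVOID Token, void *NvDest, const void *Src, SIZE_T Len)
{
    /* Copy into DAX-mapped persistent memory (a memcpy under the covers). */
    RtlWriteNonVolatileMemory(Token, NvDest, Src, Len, 0 /* no flags yet */);

    /* Make it durable explicitly, since the write itself doesn't flush today. */
    RtlFlushNonVolatileMemory(Token, NvDest, Len, 0 /* flush + drain */);
}
```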
It is not. Actually, that's a really good question. Today, it is not a non-temporal move. Again, additional flags could do things like a non-temporal copy, a copy with drain, things like that. Those are definitely coming.
For anyone that doesn't know, the Intel processors have move instructions that are non-temporal. They bypass the caches, so you can do a write to persistent memory without going through the cache, and it's actually persisted immediately. Obviously, there's a bit of a performance impact in doing that; it's a little more performant to go through the caches. But those are very useful instructions, depending on what you want to do.
Yes. The comment was, if you really want to write to persistent memory, going through the caches may not be a good thing. Completely fair comment.
On Windows, we had an API that would do a non-temporal copy. It was basically a version of this, but non-temporal. We discovered an interesting issue with it; I don't remember if this was an actual issue with the CPU. If you do less than eight bytes, it doesn't do the non-temporal portion of the copy. And we got bit by this because we didn't realize it in the beginning. So you have to understand the processors, too. I think Intel was going to do something about that.
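For illustration, here is a small user-mode sketch of the non-temporal stores just described, using the standard x64/SSE2 intrinsics rather than any of the RTL APIs. Note the 8-byte granularity, which is exactly the sub-8-byte pitfall mentioned above.

```c
#include <immintrin.h>   /* _mm_stream_si64 (x64), _mm_sfence */
#include <stddef.h>

/* Copy 'count' 8-byte words with non-temporal stores that bypass the CPU
   caches, then fence so the stores are complete and globally visible. */
void nt_copy8(long long *dst, const long long *src, size_t count)
{
    for (size_t i = 0; i < count; i++)
        _mm_stream_si64(dst + i, src[i]);   /* non-temporal 8-byte store */
    _mm_sfence();                           /* the drain/SFENCE step */
}
```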
Now, one of the things that came up
as people were trying to use persistent memory
is they said, you know, we have these NUMA environments.
What NUMA node is this given persistent memory sitting on?
And so we've defined an FS control in the file system.
One of the things that we do is require a persistent memory disk to reside on a single NUMA node. Our persistent memory driver down at the bottom does this automatically: if it sees persistent memory on different NUMA nodes, it'll expose them as different disk drives, essentially. So we don't currently allow, at least at the driver level, combining persistent memory drives across NUMA nodes, because you would get weird performance characteristics on that drive.
And so this is our solution to that.
And then we added this operation. So, hey, you can find out what NUMA node it's on and target your application to that NUMA node. Scope your application to run in the context of that NUMA node, just to optimize performance.
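The talk doesn't name the new FS control, so as a hedged stand-in, this sketch queries the NUMA node of a PMEM disk through the storage property interface (StorageDeviceNumaProperty from winioctl.h, which I believe exists in this era of Windows 10 but treat as an assumption); it illustrates getting the same information, not the file-system-level control itself. The drive path is illustrative.

```c
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    HANDLE disk = CreateFileW(L"\\\\.\\PhysicalDrive1", 0,
                              FILE_SHARE_READ | FILE_SHARE_WRITE,
                              NULL, OPEN_EXISTING, 0, NULL);
    if (disk == INVALID_HANDLE_VALUE) return 1;

    STORAGE_PROPERTY_QUERY query = {0};
    query.PropertyId = StorageDeviceNumaProperty;  /* assumed available */
    query.QueryType  = PropertyStandardQuery;

    STORAGE_DEVICE_NUMA_PROPERTY numa = {0};
    DWORD bytes = 0;
    if (DeviceIoControl(disk, IOCTL_STORAGE_QUERY_PROPERTY,
                        &query, sizeof(query), &numa, sizeof(numa),
                        &bytes, NULL))
        printf("PMEM disk resides on NUMA node %lu\n",
               (unsigned long)numa.NumaNode);

    CloseHandle(disk);
    return 0;
}
```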
So, let's talk about large page support. I wasn't sure everyone knew what a large page was, so I did this little write-up of what large pages are in modern CPUs. Basically, as everyone knows, well, hopefully everyone knows, CPUs manage memory for the most part in 4K chunks. Intel has been doing this forever; ARM does it this way as well. An application's memory is managed by the OS in page tables, and they're all, again, 4K page tables. These are trees that are built by the memory manager.
Now, it would be very inefficient
if the CPU had to access these page tables
on every single memory access.
And so they have this construct called the TLB.
It's called a Translation Look-Aside Buffer
that caches page table entries in the processor.
And for applications with a very large footprint,
the CPU can spend a lot of time
swapping these page tables in.
They can exceed the size of the TLB.
And so Intel invented a while ago, I don't even know when, but it's been there for a while, what they call a large page. Basically, it's a 2 megabyte page instead of a 4K page. If you look at it, it's just removing a tier: the page table on a 64-bit processor is multiple tiers deep, and this removes one of the tiers in the page table description. But basically, it is a contiguous 2 megabyte region of memory that is aligned on a 2 megabyte boundary, because that's really important, and is 2 megabytes in length. When you do that, the memory manager can describe that range with a single entry, and then the processor can fault that entire range into the TLB as a single entry. Massively more efficient. One of the numbers I heard when we were talking about implementing this is that SQL, when they were doing some benchmark testing with a large data set, could see a 30% performance boost by using large pages over normal pages. So there's a significant performance impact to this.
So, they came to us
after we had all this nice persistent memory
support, and we're patting ourselves on the back for all the good stuff.
And they said, yeah, but.
Because that's what application guys always say.
Yeah, we like that, but.
We always want more.
And so we needed to implement large page support in our system.
And how we did that is we decided to take NTFS, which had a cluster size limit of 64K, and up it to 2 megabytes. So we now support cluster sizes up to 2 megabytes, in powers of 2. Basically, anywhere from 512-byte clusters all the way up to 2 meg clusters is supported today.
On DAX volumes, it's 4K to 2 megabytes. And then the other thing we did is we had to modify our partition manager, because by default it aligns everything on a 1 meg boundary; we had to modify it to align on a 2 meg boundary for DAX volumes. That way, with the 2 megabytes reserved for partition structures at the front, the volume starts on a beautiful 2 meg boundary.
We create 2 meg clusters.
Everything is beautifully aligned so that the memory manager, when it gets the mapping information, can just create large pages automatically.
So the bottom line is that if you have 2 meg clusters and you memory map a file on a DAX volume,
it'll automatically use large pages.
It'll just happen automatically.
No problems.
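A small sketch of checking that from user mode: the volume would be formatted with 2 MB clusters (with something like format /dax /fs:ntfs /a:2m, exact switch spellings hedged), and GetDiskFreeSpaceW reads the resulting cluster size back. The drive letter is illustrative.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD sectorsPerCluster, bytesPerSector, freeClusters, totalClusters;
    if (!GetDiskFreeSpaceW(L"D:\\", &sectorsPerCluster, &bytesPerSector,
                           &freeClusters, &totalClusters))
        return 1;

    /* Cluster size = sectors per cluster * bytes per sector. */
    unsigned long long cluster =
        (unsigned long long)sectorsPerCluster * bytesPerSector;

    printf("Cluster size: %llu bytes %s\n", cluster,
           cluster == 2ull * 1024 * 1024
               ? "(2 MB: mapped files can be described with large pages)"
               : "(large pages need 2 MB clusters on a DAX volume)");
    return 0;
}
```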
Now, on top of large pages,
there's this concept called huge pages.
And it's just taking another chunk out of the mapping tree.
And a huge page is one gigabyte in size.
Now, I did the math one day and said,
we are not going to add a one gig cluster size to the file system.
That is just not practical.
And so we are going to add huge page support in the future, but we're going to have to modify the allocator inside NTFS to deal with these boundary alignment requirements and manage it. When we do that, we'll also add support for 2 megabyte boundaries without using 2 megabyte clusters. But this was a simpler way to go to get us started.
And by the way, as we were talking about one gigabyte pages, I learned from the owner of the memory manager in Windows, his name is Landy, that there's another page size above one gigabyte: 512 gigabyte pages. It's not there yet, but he says it's coming. When you get to these full 64-bit address spaces, the hierarchy gets a little crazy. So 512 gig, I don't know what they call it, ultra pages? Ultra huge? But he warned me: think about these super huge pages in the future, because they're coming too.
And actually, once we do the work in the allocator to deal with this, it'll work with anything.
But that just means you have to have a lot of persistent memory in your machine to even make that worthwhile.
Okay, one of the questions when I came and spoke two years ago was,
what about Hyper-V? What's the support for it?
At the time, my response was, not supported.
Well, that has changed. In our upcoming release, we have full support for persistent memory and for DAX volumes in our virtualization environment, which is known as Hyper-V. Windows and Linux guests in Generation 2 VMs see virtual persistent memory, vPMEM devices. So basically, what happens is they have defined this new VHD type
called the VHD PMEM,
and you create this VHD PMEM file,
and you put it on a DAX volume,
and when you run the guest on that,
the guest just automatically sees it.
It comes through the normal NFIT tables and everything. It just looks like persistent memory to the guest, and if the guest is persistent memory aware, it'll function: it can create its own DAX volumes, and it'll be direct mapped all the way to the hardware without any intervention by the operating system or anyone else, all set up through the page tables. They do have the ability to convert VHDX to this VHDPMEM format, because there is a little bit of a difference, so they can convert back and forth. And this BTT concept that I talked about: for a VHDPMEM, they can decide at creation time if it has a BTT or not.
So if they're going to use this VHDPMEM file as a block volume in the guest, then they can define a BTT. You get those atomicity guarantees, and it makes everything more robust for the file system.
And so they have a choice.
Also, the VHDPMEM files can be mounted as a SCSI device on the host. You don't get direct access; it doesn't look like a DAX volume when you mount it like that. But you get access to these files, and you can do normal I/O to them and everything. So you can still mount these VHDPMEM files and operate on them, and it can be in BTT mode or not; all of that works in this basically local loopback mount of a VHDPMEM file. And then each virtual PMEM device, which is what it looks like inside the guest, is backed by one VHDPMEM file.
So again, we have full support in the guest to see persistent memory.
Yeah, I have that on the next slide.
So all the functionality is there that is needed.
And then, again, basically everything you can do on the host, you can do in the guest; that's what we're saying there.
They have large page support,
so if you create your DAX volume with 2 megabyte clusters, your VHDPMEM will then be nicely aligned, and you automatically get large page support inside the guest as well.
There is some functionality that's currently missing that normal VHDs have: there's no live migration, no checkpoints, no backup, no save and restore. This is functionality that'll come over time. With these six-month release cycles, there's only so much you can do in one cycle, but they have it functional. Okay, let's talk about NVML. There's been some discussion here at the SDC about what NVML is. It's the Non-Volatile Memory Library, originally implemented by Intel. We have been in a joint venture between Intel, HPE, and HP Labs on porting this to Windows. It is feature complete at this point and usable. It's available up on GitHub; you can go out there and grab it. We've seen a few hundred people download it and potentially use it.
I'll just give a brief description of what NVML is. As I mentioned earlier, if you're talking to persistent memory directly, there are these caches and so on, and even if you're using non-temporal instructions, you can lose power at any point in time. The application is responsible for data recovery. What the NVML library does is give you some basic primitive operations that are guaranteed atomic inside the library. So they make the atomicity guarantees. It makes it easier for application guys to develop persistent memory applications and take advantage of it. The overhead of this is pretty small, so they get all the performance benefits without the complexity on their part.
You know, the big guys of the world, the SQLs of the world, the Azures, the Amazons of the world, they can put in the resources to do this stuff directly, and they don't necessarily want this. This is for all the other people in the world that want to start leveraging persistent memory in their environment. And we are feature complete at this point. It's available. Now, the thing we haven't done yet: we've done all the basic work of supporting these basic primitives, but there has been additional work in the library for remote access to persistent memory. This is an area we haven't started yet; we need to start looking into it and see what we want to do on the Windows side.
That library continues to grow.
I think I already described this. It basically abstracts out OS-specific dependencies. What's nice about this model is that they do everything through memory-mapped files, because that's the model on both Windows and Linux, and it makes its own atomicity guarantees. What's kind of cool about it is it runs in both persistent memory and non-persistent memory environments. You can just run on a normal disk; it's memory mapping, so you'll do paging I/O, but all the guarantees it makes still work. So it works in both models. And again, it's simplifying application development so you don't have to deal with recovery from power loss.
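As a taste of the programming model, here is a minimal libpmem sketch in C, assuming NVML's documented pmem_map_file/pmem_persist API; the path and sizes are illustrative. On a DAX volume, is_pmem comes back true and pmem_persist flushes the CPU caches; on a regular disk, the same code falls back to an msync-style flush, which is the "works in both models" point above.

```c
#include <libpmem.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Create (or open) and map a 4 KB file; works on DAX and non-DAX volumes. */
    char *addr = pmem_map_file("D:\\log.pmem", 4096, PMEM_FILE_CREATE,
                               0666, &mapped_len, &is_pmem);
    if (addr == NULL) return 1;

    strcpy(addr, "hello, persistent world");

    if (is_pmem)
        pmem_persist(addr, mapped_len);  /* flush caches straight to the media */
    else
        pmem_msync(addr, mapped_len);    /* ordinary mapped-file flush path */

    pmem_unmap(addr, mapped_len);
    return 0;
}
```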
This is just the list of libraries.
And again, a website right here where you can get more information on it.
And in this earlier slide, this is the website where you can go out to GitHub and download it. That link, I think, covers both the Windows and the Linux implementations of the library. And right now, the APIs are identical between the two platforms. We normalized the APIs so they're no different, just to make it easy to port back and forth.
Okay, let's talk a little bit about our driver model and what we did for that in Windows.
So originally, when we first did persistent memory back in Server 2016,
we basically broke it up into what we call a bus driver and then the SCM disk driver. The bus driver was not involved in the I/O path. Basically, it was the one that gets all the info from the BIOS through ACPI; it does the management of persistent memory, defines the disks, and makes them available. This diagram is actually a little incorrect: in block mode, we come through the disk driver and copy, but in DAX mode, we don't even come to the driver for I/O. We come to the driver to get the physical addresses to set up the mappings, and then the I/O goes directly. And then management status also comes out of this, through the bus driver and up.
Now, that was a fine architecture to get us started, but as we started thinking about it, there are new types of NVDIMMs coming out. There are specs in progress for NVDIMM-P, which is a different type of device. You know, 3D XPoint: Intel's kind of doing their own thing in how they're exposing 3D XPoint. And so we needed a model where we could grow and expand on this, and not have a monolithic driver. So what we did is we broke up this scmdisk0101 driver into two drivers.
We have this generic PMEM.SYS driver, which handles byte addressability and interleave sets and is responsible for I/O; it's generic. And then we have this very specific NVDIMM-N driver, which is what we support today in Windows. It's responsible for the physical aspects of that particular NVDIMM part.
And so we kind of broke up our environment.
What this allowed us to do is make it easy to support new NVDIMM types as they come out. We have clear separation of responsibilities for management and access. This is kind of what the picture looks like. One of the things you can do is have multiple chips; obviously, you can have multiple of these NVDIMM-N parts. In the BIOS you can describe whether you want them interleaved or standalone, however you want.
It's all configurable.
And then this is how they manage it: they create one for each chip, and then they know whether they're interleaved and deal with all that. And there's this generic driver managing it all. So we've separated things out and cleaned up the environment, and basically this is a big step forward for where we want to go in the future as more and more types of persistent memory become available.
Now, one of the other areas that's really interesting
is uncorrectable error handling.
The processors are kind of getting caught up in this area.
Back when we first released this, if you got an uncorrectable memory error on the fly, you just bug checked; that's kind of what happens with bad RAM today. And they had this concept that you could detect at boot time what memory had gone bad. So if you bug checked and came back up, it would tell you, oh, this part of the persistent memory is bad, so you can prevent people from using it and bug checking the next time. But you had to take a bug check to deal with it. One of the things they've added support for is the ability to do runtime detection of bad memory. So here's what happens in the driver, and it doesn't matter now whether this was boot-time detected or runtime detected. Obviously, if you're in block mode and you do a block I/O to a bad sector, it'll protect that: it just fails that I/O, like a bad sector on a disk drive. If a given block is unmapped, and when I say unmapped, I mean it's not memory mapped into a file at this time, we basically fail the mapping requests.
How it works is, any time someone memory maps a file on a DAX volume, the memory manager asks the file system for those physical page mappings, and we send that down to the driver, because he's the one that actually knows. So we basically say, here are the logical block addresses, the LBAs, for this file, and send them down. He gives us physical pages back, which we pass to MM. Well, when we do that, if there's a bad page, they now fail it, and we have a way to tell, and we can narrow it down to exactly which 4K page is bad. And so we can deal with this.
Now, obviously, if it's mapped and then it becomes bad, that's a runtime detection of bad memory, because we wouldn't have allowed it to be mapped if we'd known at boot time. So now what we do, and this is part of our enhancement, is we can actually ask the memory manager to unmap that page. Unfortunately, there are reasons that applications can lock these pages down. So even though we can ask for it, it is a best effort to unmap it. The memory manager will attempt to unmap it, but if an application has pinned it, there's not much we can do. We don't have the ability to rip it out from under them, because we don't know what state they're in. So there is this best-effort thing going on here. These are areas that we want to see if we can improve on in the future.
Oh.
Say that again?
For which part?
Today, this runtime detection is a notification that the driver is getting from the hardware. There is a notification we get that something has gone bad.
They're doing some sort of scrubbing in the background
at the hardware level,
and they can tell us some page has gone bad on the fly. That's why you can also have situations where an application just reads some piece of data, and that is the moment that it goes bad.
Yes.
Something is wrong.
Yeah.
They're actually doing background scrubbing at the hardware layer to detect these things.
They have ECC and whatever.
They're trying to correct all this stuff, but if it gets too far out, then they raise errors.
And they actually have a concept of a hard fault, and I don't think the term is a soft fault, but the basic idea is that it's going bad, and they warn you that it's going bad. That's defined, but it's not implemented yet in the hardware. And so in the future, we'll have the ability to be warned that it's going bad.
What's good about that model is the file system can remap blocks. We handle bad sectors already; we have that concept from our long history of dealing with hard disks. We can remap to a new location pretty transparently to anyone using it, because we still have the old data to copy.
Does that make sense?
Does that help?
But yes, there still is the problem
of it going bad right at that instant
as you're reading it.
Go ahead.
If you expect the application to do it, it's kind of weird to say that we should not develop the code to identify it. Go ahead. So, how this surfaces to the application: I don't know how this works in the Linux environment, but in the Windows environment, you put try/excepts around your accesses to mapped memory, and a fault will be raised saying there's a bad page.
It's the application's job to figure out what they want to do with it
and how they deal with it.
But basically when they access something,
it'll raise and just tell them
there's something wrong.
And guess what? A lot of apps that use memory-mapped files don't do this try/except capturing, so they just crash. The app will just terminate if they don't handle it gracefully.
But it's up to the app to deal with it.
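Here is a sketch of the try/except pattern just described: on Windows, touching a mapped view whose backing page has gone bad raises EXCEPTION_IN_PAGE_ERROR, which the app can catch with structured exception handling instead of crashing. The `view` pointer is assumed to be a mapped view of a file on a DAX volume; recovery policy is left to the caller.

```c
#include <windows.h>
#include <stdio.h>

/* Returns 1 on success, 0 if the backing page raised an in-page error. */
int ReadByteGuarded(const volatile char *view, size_t offset, char *out)
{
    __try {
        *out = view[offset];               /* may touch a bad PMEM page */
        return 1;
    }
    __except (GetExceptionCode() == EXCEPTION_IN_PAGE_ERROR
                  ? EXCEPTION_EXECUTE_HANDLER
                  : EXCEPTION_CONTINUE_SEARCH) {
        fprintf(stderr, "Bad page at offset %zu\n", offset);
        return 0;                          /* the app decides how to recover */
    }
}
```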
In the last case,
it unmaps only a given page.
Yes.
In the middle case, does it unmap a page
or does it fail the entire map?
No.
Actually, that's a good question. What Tom was asking is, when we're making a mapping request, we're requesting a whole range. What we do is return a failure on the mapping request, but we put in there how many mappings we gave. So we know where the line is, and we can give back those mappings. We know this page is bad, and then MM can re-request the rest of it. So we give back what we can, up to the boundary of where the failure is, and that's how we identify it.
Okay, but the failure is identified at 4K granularity? Yes. It turns out that they can give cache-line information, but we can only control pages, so we do 4K pages.
And if you have bad memory in there,
you can't get a large page.
It doesn't work.
Yes.
Kind of in closing, a little bit on the work we've done on standardization. I'm not going to dig into this, but this whole persistent memory effort has involved a lot of standardization work, so that Linux and Windows and whoever else, Mac, can all use this stuff and have a consistent model for the hardware. So there's been a great deal of standardization work that we have been very involved with.
And then, last of all,
when is this new stuff coming out?
Basically, we've announced October 17th
is our next client update,
and I don't have an exact date
for when Windows Server is going out,
but I've been told it'll be later this year sometime. And I don't get to announce the date.
But that is all the work that we've done.
Thank you for coming and open to questions.
Go ahead.
Yes. So the question is, are there APIs to enumerate the PMEM devices? I deal more with the file system, but I'm positive that's available, because they show up as regular disk drives.
If I wanted to create my own virtual machine instead, do I have to go into the UI to be able to view it? Ah, in our virtualization environment, you're talking about inside the guest. Inside the guest, they wrote a virtual NFIT. No, Hyper-V is doing that. In a Gen 2 VM, it actually emulates the NFIT structures, so it just comes out looking right; the guest can't tell that it's not real PMEM hardware.
I'm no expert on this.
I got my slides from the Hyper-V guys.
I don't work on the Hyper-V team.
What they told me is that they emulate it. I do not know if they have ACPI there or not, but they told me they emulate the NFIT tables and do it through what they call vPMEM, the virtual PMEM environment.
As far as it being public,
come and talk to me afterwards.
If you have some questions,
we can reach out to the devs that are responsible for that.
I saw a question back here. Go ahead.
Yes.
So the question was, in the new model that we talked about, is the NVDIMM driver more of a class driver or device-specific? If you look at the name, it's NVDIMM-N; it is specifically for the dash-N NVDIMMs. And so it handles all classes of NVDIMMs that are JEDEC compliant. We actually had an interesting issue where someone put a non-JEDEC-compliant NVDIMM into one of our machines. It didn't work very well, for obvious reasons, because we're coded to the standard. So in that sense, it works for any NVDIMM-N that's JEDEC compliant. In the future, there will be one for NVDIMM-P, there will be one for 3D XPoint, whatever other standards get created out there, and as long as they conform to the standards,
We are planning on opening up this to third parties
to develop their own persistent memory drivers.
That's coming in the future.
Any other questions?
If not, thank you very much for being here.
You're more than welcome to come up and chat with me afterwards.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further
with your peers in the storage developer community.
For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.