Storage Developer Conference - #4: The NVDIMM Cookbook: A Soup-to-Nuts Primer on Using NVDIMMs to Improve Your Storage Performance

Episode Date: April 25, 2016

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 4. Today we hear from Jeff Chang, VP of Marketing and Business Development at AgigA Tech, as well as Arthur Sainio, Senior Director of Marketing with SMART Modular, as they present on the NVDIMM Cookbook from the 2015 Storage Developer Conference.
Starting point is 00:00:53 Thanks for joining us for this session today. This is the NVDIMM Cookbook primer. For those of you that were in Paul's session, I heard he went through the taxonomy in great detail, so we'll cover it fairly lightly here. And essentially what we'll go through today is how do you integrate an NVDIMM into your system, why would you want to, how it can be useful for your system performance, and what are the sort of key gotchas when looking to use an NVDIMM in your system. So my name's Jeff Chang. I'm with the very enormous company known as AgigA Tech.
Starting point is 00:01:27 It's a small company. I'm a co-chair of the NVDIMM SIG with Arthur Sainu. I always get it wrong. Arthur Sainu. So we'll be ping-ponging back and forth. We also have an extra special guest speaker, Rajesh Ananth. Thank you. And he's a system architect with SMART. So I'll be covering mostly the hardware piece of things. Arthur will be covering the BIOS integration piece. Rajesh will be covering the software piece. And then will come back to me and talk about some of these
Starting point is 00:01:59 system application and system performance issues. Okay, so let's get started. This is a SNEA tutorial, so we are obligated to always have this legal notice, and essentially it just means if you use NVDIMS and your system blows up, you can't blame SNEA. But today we'll be covering, this is a very long abstract, I'm not going to read it in great detail, but essentially NVDIMS are non-volatile DIMS,
Starting point is 00:02:23 which means it's persistent memory or persistent DRAM that you can use as storage in your system. And what we'll be covering today is you'll understand what an NVDIM is. So we'll walk through some of the taxonomy, what the difference between an NVDIM N, an NVDIM F, and an NVDIM P is. You'll understand why an NVDIM can improve your system performance. That'll actually come last. Right before that, you'll also understand how to integrate an NVDIMP into your system. So an NVDIMP is plug-and-play-ish. So there are a lot of hooks that are required from a hardware perspective, BIOS perspective, software perspective, and we'll walk through each one of those
Starting point is 00:03:04 things so that when you're prepared, if and when you're prepared to use an NVDIMM in your system, you'll know what to expect. So the cookbook has several parts. We have the hardware piece, the NVDIMM part one, that's what I'll cover today. Part two, Arthur will cover, that's the BIOS. Part three, OS, will be Rajesh. And then we'll come back to me and I'll talk about some of the system implementations and use cases. So the ingredients of the cookbook, if you walk through a stack of the system, quote-unquote stack of the system,
Starting point is 00:03:34 you'll have various ingredients that would be required to integrate an NVDIMM. From the platform hardware at the very bottom, so all the dark blue boxes are hardware, to the NVDIMM itself, hardware, to the NVDIMM itself, the APIs for the NVDIMM into the software layers. So you have BIOS integration issues, you have OS integration issues, and then you have specific use case issues that you have to contend with, like do I use it as a block store device, or do I use it as byte addressable? So those are all things that we're going to cover today,
Starting point is 00:04:05 and then how that can get integrated into the overall stack of the system itself. So for part one, I'm going to cover the NVDIMM hardware piece of it. And we'll start with, first, the taxonomy. So for those of you that weren't in Paul's session, JEDEC has standardized, in conjunction with the NVDIMM S SIG a taxonomy for different types of NVDIMMs, much like you'd see with Wi-Fi, 802.11a, b, c, n, that type of thing.
Starting point is 00:04:35 Because about a year ago when customers and vendors were talking about NVDIMMs, there were different flavors of NVDIMMs. And people were starting to get confused over which type should I use and which type is which. So if you start at the top, an NVDIMM N, which is mostly what we'll be talking about today, is memory map DRAM, which is flash-backed. So it's byte-addressable DRAM, byte-addressable persistent memory using the DRAM. So the host only has access to the DRAM interface during normal operation. And when the system loses power, it backs up the data to flash.
Starting point is 00:05:11 That's how you get your persistency in the system. You have a couple of access methods, though, just like you would with any DRAM. You could use it as byte addressable memory for load store, block addressable like a RAM disk. Those are both available in the NVDIMM end. Capacities are DRAM like. So you have ones to tens of gigabytes of capacity in a single NVDIMM. The latency is the speed of DRAM because you're accessing the DRAM directly natively by the host. However, you do need an energy source to back up the data when you have a catastrophic power loss event. So you can use super caps, you can use batteries,
Starting point is 00:05:52 you can use some hybrid of the two. We'll touch a little bit on that later in the session. But for the most part, most vendors that deploy NVDIMs today use super capacitors because it's the perfect technology for this type of application. Moving down to NVDIMMs today use supercapacitors because it's the perfect technology for this type of application. Moving down to NVDIMM F, NVDIMM F you can think of more like an SSD that's in a DIMM socket. So you actually have access to the flash itself.
Starting point is 00:06:18 The DRAM's not memory mapped, the flash is memory mapped. So you have block-oriented access to the flash itself. So it looks like an SSD sitting in the DRAM channel. You get the benefits of a faster interface than say PCIe, SATA or SAS, but you still have some software latency above the hardware because you now have a file system that you have to contend with. But that's NVDIM-F. The capacities are NAND-like, so you can get up to terabytes of NAND flash capacity just like you would with an SSD. So you can see there are a couple of very different use cases between NVDIM-N and NVDIM-F. And then the last type of NVDIM that has been defined or at least proposed within
Starting point is 00:07:06 JEDEC is NVDIM P. So NVDIM N and F, by the way, have been ratified and NVDIM P is just a proposal at this point. I think in the last, or maybe two JEDEC meetings ago, the first proposal was shown, the first showing. So it's still going through its definition phases. But essentially it's a combination of an N and an F. You can get access to the DRAM, byte addressable access to the DRAM, but you can also have block access to the flash at the same time. Okay? So those are the three types that we have defined so far in JEDEC. I expect over the next one to two years we'll probably have a few more, and then five years from now who knows what will be there. But what we'll focus
Starting point is 00:07:50 mostly on today is NVDIMM N. By the way, I don't mind questions during the session, so if you have any that might come up while I'm talking, feel free to raise your hand and we'll call on you. Okay, so in NVDIMM, I touched a little bit about on the taxonomy sheet how an NVDIMM works. In a very simplistic view, an NVDIMM plugs directly into a standard JEDEC socket, and it looks like DRAM when it interfaces to the host. When the system boots up,
Starting point is 00:08:20 there is an interrogation process in the BIOS that identifies it as an NVDIMM so that the host can use it as persistent memory. And then when the system powers up the supercaps or whatever you have as your energy source will begin to charge. And then when you have your system available or the NVDIMM is available to be non-volatile, so when it first boots up, it's not necessarily in a non-volatile state. There are a few health monitor items that the host will have to interrogate to ensure that you're non-volatile.
Starting point is 00:08:53 Once you're non-volatile, then it can be used as persistent memory. We'll go into a little more detail on that a little later. OK, so the energy source is charged. You are now non-volatile. So the system can use the NVDIMM, the DRAM itself, as persistent memory. If you have 8, 16, 32, 64 gigs in your system,
Starting point is 00:09:13 that can be addressed as byte-addressable persistent memory for your system or block-addressable as well. That's when the health check clears it's used as persistent memory by the host. During a catastrophic power loss event and we'll go through the sequence of exactly
Starting point is 00:09:34 what happens in the system itself. Essentially we're piggybacking off a feature called ADR, asynchronous DRAM refresh that was originally developed for battery back memories but we're piggybacking on sort of that prior art for NVDIM ends. But during that unexpected power loss event, the DRAMs are placed in a self-refresh.
Starting point is 00:09:55 At that point, the NVDIM itself, the controller, can take control of the DRAM and move the contents to the onboard flash. So it does a data transfer of the flash, of the DRAM contents, the in-memory state, when you lost power, can be moved to the flash. It's almost like a suspend-resume event, in a sense. But it's completely localized to the NVDIMM itself. So the process takes literally tens of microseconds to complete, to hand off the memory to the NVDIMM, and
Starting point is 00:10:26 depending on how much capacity you have, it can take 60-ish seconds to back up about 8 gigs of memory. And it's done in parallel, so if you have multiple 8 gig NVDIMMs in your system it still takes 60 seconds. So there is no hold up time beyond the time it takes to place the DRAM in a self-refresh. The entire system can go offline at that point. As opposed to, say, a cache-to-flash implementation where you're having to hold up your entire system while you're moving your DRAM contents to, say, an SSD with a UPS. And that can take a long time, and it's probably happening at the worst possible time in the system.
Starting point is 00:11:07 When you've lost power, you lost your AC, your cooling's gone out, and you have to keep your root complex up, your processor root complex up the entire time. So we feel that this is a reliable and safe way to move your critical DRAM contents into a non-volatile state. So again, once you've saved the DRAM contents into a non-volatile state. So again, once you've saved the DRAM contents in the flash,
Starting point is 00:11:31 which takes seconds, the NVDIMM itself goes into a zero power state. You cut off the power to the caps or the battery or whatever it might be, and then your data retention is essentially the specs of the NAND flash itself, 10 years, 5 years, whatever it might be. When the system power comes back up, the process happens in reverse.
Starting point is 00:11:57 The DRAM contents are restored back from the NAND flash. You just move from the NAND flash. The data transfer is moved back into the NAND flash. You just move from the NAND flash, the data transfer is moved back into the DRAM. There is some negotiation or communication with the host to cause this to happen because essentially an NVDIMM is a slave device to the host. The NVDIMM doesn't make many or any assumptions really about the state of the host. So the host, when the system comes back up, it will ask the NVDIMM whether there was a saved image,
Starting point is 00:12:29 whether there is a valid image, and if there is it will command the NVDIMM to move the image back into DRAM. The NVDIMM will tell the host when it's done, and then the host can do whatever it wants with the DRAM contents at that point, either move it out to a safe state
Starting point is 00:12:45 or just continue to use the memory as it did when it lost power yes question so the question is how does it react to a power cycle let's say a brownout for example. So if a save command comes in, essentially a save trigger and we have started our save we will complete the save until it's done. In the next generation JetX spec, I believe there is an abort command. So the host could
Starting point is 00:13:18 have some control over that if it wanted to. But current legacy, what we call legacy implementations, the save will complete. So once the DRAM is handed back to the host, it can be used as standard memory again, as standard persistent memory.
Starting point is 00:13:40 And then the process will happen again once you have another catastrophic power loss event. Okay, so that's sort of the simplistic view. As I mentioned, there's a lot of behind-the-scenes activity that occurs in order to ensure that the DRAM itself will be saved correctly and safely. So you can't just assume when a power loss, if we monitored, let's say, DRAM power, 1.5 volts or 1.2 volts, that we can assume the DRAM is in a safe state at that point and we can just do the backup. It doesn't happen that way.
Starting point is 00:14:15 So what essentially occurs, starting in the upper left hand corner, so these are the stages here, I think. So we have board logic up here. We have CPU and chipset here. And then we have the NVDIMM hardware here. So everything happens at the board level. The board itself, or the power supply specifically, will detect an AC power loss event. That AC power loss event will assert this ADR feature. So ADR stands for asynchronous DRAM refresh.
Starting point is 00:14:47 And essentially what that means is the processor in the chipset or the memory controller will flush the ADR protected buffers. So there are write protected buffers in the memory controller. Once an ADR trigger comes through, it will flush those buffers out to DRAM. It will place all of the DRAM in self-refresh, and at that point, the DRAM is in a safe state for the NVDIMM to take control over it. So once self-refresh is asserted, ADR complete is then asserted as well, and this ADR complete is what we have been calling the save pin. So within JEDEC
Starting point is 00:15:26 there is a standard pin that has been defined as the backup trigger. So once the NVDIMM sees this trigger it will then switch control of DRAM over to the controller that resides on the NVDIMM itself and copy the DRAM to flash. And then at the same time, the entire system can shut off all of its power rails. In a cache-to-flash type of implementation, you would move all of your data and have to hold up your system, including your standard DRAM, for however long it would take to move that data.
Starting point is 00:16:01 And sometimes that can be tens of minutes, right? So this whole process, from here to here, takes roughly 30 microseconds. If you were using processor caching, it could take a little longer, on the order of, let's say, two, three, to five milliseconds. And then the backup itself will take tens of seconds. So 8 gigabytes will roughly be about 60 seconds.
Starting point is 00:16:28 And it all depends, of course, on what type of flash you use, how fast your DRAM controller is, how fast your NAND flash controller is. It's very vendor specific. But roughly, for an 8 gigabyte backup, it's about 60 seconds. Okay, so each of those stages is shown here more in a timing diagram. So again
Starting point is 00:16:53 AC power loss is detected by the power supply and it's sort of a race at this point. So the ATX power supply spec says that your hold-up time, your DC hold-up time, is at least one millisecond from when you lose power to your DC rails going down. As I mentioned, this whole process here to get this backup trigger typically takes about 30 microseconds. So you have a power good signal that then triggers the ADR trigger. This is the ADR trigger into the chipset. The processor is in normal execution. Once it sees that trigger,
Starting point is 00:17:31 it will flush out the right protected buffers. It will place the DRAM in a refresh. It will give a trigger to the NVDIMM over the DIMM interface. This is the save and trigger. And at that point, the NVDIMM itself will take control of the DRAM.
Starting point is 00:17:47 It's in self-refresh at that point. It will pull it out of self-refresh, and then it will start moving the data into the NAND flash that's on board. So the nice thing about NVDIMM N is that about a year ago or two years ago, none of these features were standardized. It was very proprietary implementations for NVDIMMs. No NVDIMM vendor looked alike. There's probably five or six different NVDIMM vendors in the system that offer solutions
Starting point is 00:18:21 today. So the problem with that is if you wanted to integrate an NVDIM in your system, they weren't necessarily plug and play between the vendors. Well, Jetix has solved a lot of that. So Jetix got together about a year ago, maybe a year and a half ago, and started standardizing both the hardware interfaces and the software interfaces for NVDIMM types.
Starting point is 00:18:46 And the first type is NVDIMM-N. So one of the things that JEDEC did was added 12-volt power to the pins. Now, this maybe wasn't specifically for NVDIMM-Ns, probably for future instantiations of other DIMM types, but that 12-volt power over the DIMM interface is very, very helpful for an NVDIMM, as you can imagine. So we can use that power to charge whatever power source might be connected to the NVDIMM, but also that 12-volt power bus can be used to power the NVDIMMs during a catastrophic power loss event. So you can actually switch in a central battery
Starting point is 00:19:26 or central super caps or whatever it might be in your system and keep just the DIMMs up to do this backup process, as opposed to a very large UPS type of implementation. JEDEC also standardized on the hardware interfaces to trigger the backup. So the save and pin is standardized on DDR4. to trigger the backup. So the save and pin is standardized on DDR4, it's pin 230. It's a bidirectional pin. There's some specific implementation
Starting point is 00:19:53 aspects as to why we made it bidirectional. We'll not go into detail on that today. But essentially that's the pin that the NVDIMM controller will monitor, and when it sees it go, we will do the backup. Okay? There is an event pin that's an asynchronous event notification. This is implementation-specific, but in some implementations, when you trigger an event, it will tell the host that something wrong happened. The NVDIMM can no longer be used as a non-volatile memory, so the host can go in and then check what went wrong with
Starting point is 00:20:32 the NVDIMM itself. There can be multiple reasons why an NVDIMM goes bad. The flash is bad, any number of things. You don't have enough energy to do a full backup. Those are all things that the host needs to continue to monitor to make sure that when it's using that NVDIMM it knows that it's in a safe state. The I2C device addressing, that has been standardized with NGEDDIC. And as I mentioned the 12-volt DDR4 simplifies NVDIMM power circuitry and cable routing. We've taken great advantage of this 12 volts and all the different types of implementations for NVDIMMs. The other thing
Starting point is 00:21:12 that's not up here is I guess that's what this device addressing is. So not only is the device addressing itself standardized, so the SPD addresses, the SPD values have been all standardized. But also there is a command set. So when the host talks to the NVDIMM controller over SMBus in the channel, the protocol to which it was talking to the different vendor implementations previously was different from vendor to vendor. Well, JetX standardizes all of that. So next generation implementations will all be standardized. Think of this like a USB class device, for example.
Starting point is 00:21:51 So USB mass storage. Every USB mass storage device looks exactly the same to the host system. You have a USB class driver. So that USB class driver can be plug and play for all the different implementations. Well, we've striven for the same thing here with NVDIMMs. If you have an NVDIMM supported BIOS in your system it will look exactly the same from vendor to vendor. So this is a kind of an eye chart. I'm not going to go into great detail on this. It shows what for DDR4 what we're calling legacy sort of the first
Starting point is 00:22:24 generation these proprietary implementations and then the second generation which is the jetec standardized implementations all of the items on yellow are some of the areas that a lot of standardization has occurred so I've covered some of the stuff in the hardware piece Arthur and Rajesh will cover more of the software aspects so the software is really I think we're more of the software aspects. So the software is really, I think, where more of the exciting stuff is occurring in the industry. Because the hardware is kind of obvious. If I can use DRAM as persistent memory or non-volatile memory,
Starting point is 00:22:57 yeah, it's obvious that it's going to be the fastest persistent memory that you could possibly get in your system. But then you get software bottlenecks on top of that. And that's where a lot of the work with the NVM programming model that I think Paul and Doug had talked about earlier today, that's where a lot of the work is happening to make those standardized so
Starting point is 00:23:16 that you remove the software latency in your system and truly get the best performance out of your persistent memory as possible in your applications. So I'm going to hand it off now to Arthur who's going to cover more of the BIOS aspects of NVDIMM use in the system and then Rajesh will cover the OS, some of the software stack activity and then I'll come back and do system use cases. Thank you Jeff.
Starting point is 00:23:42 Good afternoon everyone, I'm Arthur Sainio. Just a reminder, we do have a Birds of Feathers session this evening at seven o'clock. So anybody that's interested, you're invited to come and bring up any topic related to NVDIM, NVDIMs at all. All right. So I'll talk a little bit about the BIOS. And one thing that I wanted to say is that this has really been a key enabler for the adoption of NVDIMS in the industry. So with DDR3, there were just a certain number of systems out there available, off-the-shelf systems that would support NVDIMS. With DDR4, the number has grown dramatically. So these are Intel-based systems I'm referring to with the MRC and the BIOS enabled. So once that MRC is released to AMI and Phoenix and Insight and so forth, they do support all the different vendors and has really enabled this market to grow quite a bit. So in terms of the bio support specifically there's some functions that have been added. Just seven of them are listed here and those enable the system to detect these NVDIMMs once
Starting point is 00:24:52 they're plugged in. So there's seven different functions that allow the memory map to be set up, that allow the NVDIMM to be armed so that it's ready for a power loss scenario. Other commands, such as flushing the write buffers and restoring the data upon the power loss, and also the I squared read write process. So all these need to be necessary or in place on a system in order for the NVDIDIMS to be supported. So this is the standard flow with the shaded area in the middle of being the MRC section,
Starting point is 00:25:33 the memory reference code. So the system has to detect the DIMS are plugged in, what density they are, the configuration, the voltage, and then they do all the timing parameters are set up etc. So that's a standard flow and then you can see the difference here for the NVDIMM BIOS. So a number of different features here in terms of NVDIMM detection. So the N versus the Y is what Jeff mentioned the the command set. So the N is for the legacy and the V is for the JETA command set. So we just need to make sure that when these NVDIMMs are plugged in that system or the host needs to know what it is and to configure that memory accordingly. So that's all that work on the taxonomy and
Starting point is 00:26:22 the SPD bytes and so forth. So as the NVDIMs continue to proliferate, there will be more letters to be added and then other modifications made into this BIOS flow. So they would need to be updated based on the NVDIM types. The other ones here, the NVDIM restore. So those are additional commands after the system reboots. It sees the NVDIM. First of all, it recognizes it's in the system andots it sees the NVDIMM first of all it recognizes it's in the system and then checks for an image if that image needs to be backed
Starting point is 00:26:49 up it has to initial a signal or enable signal CKE low to read to do the backup for the data from the DRAM to the flash before the system continues to boot okay to boot. So the recovery process, another bit of an eye chart here. So I'll try to walk through this thing. So as I mentioned, when the system is restored and it first checks if there's a save in progress, if there's a save in progress, there's a save in progress it has to wait so it's not going to interrupt that and then once that is completed it continues to boot
Starting point is 00:27:32 and just as a standard dim but the other thing on the top right there would be does the nvdim contain the the backup data and that's when it goes and enables a CKE load to enable that NVDIMM to go through the restore process. So the data is backed up into the, back from the flash to the DRAM, and that can vary, the time that that takes can vary depending on the vendor, maybe anywhere from six to ten seconds
Starting point is 00:27:59 approximately per gigabyte. So that does impact the boot up time for a server. And then it continues to restore and completes the process over here, and then the system boots up. All right, the EA20 table, some changes going on with the type 12 here. So that's the reserved section of memory which has now been updated to type 7. So type 7 is for persistent memory and as Doug mentioned about the infant tables, that gives specific instructions on where that persistent memory is located. So moving from type 12 to type 7, and giving specific instructions to the operating system in order for that block of persistent memory to be located. So again, once that whole system is booted up,
Starting point is 00:28:57 and depending on the number of NVDIMMs that are plugged in there, that'll be blocked off as persistent. And then it's up to the host system or the user or the application to determine what can be used with that portion of memory. And then these are just some other examples here of some of
Starting point is 00:29:13 the features about enabling ADR, asynchronous DRAM refresh, the restore functions, and extra features related to the BIOS commands that can be variable depending on the user. Okay, and Jeff mentioned this other section here about the legacy versus the JETA command set, so that was the case with DDR3 where most of the systems or all the systems virtually are using the legacy command set. The industry is moving toward the JETA command set that should be done by the end of this year, which allows a standard implementation for all NVDIMMs to be recognized in the uniform manner. So there's also an updated MRC and BIOS that goes along with that command set, so that is still in progress. It has not been released yet yet should be also released by the end of this year.
Starting point is 00:30:05 Okay so that's it for the the bio section Rajesh is going to cover the operating system. I think Doug covered some of this and other talks but if there's any questions here. Thank you Arthur. I'm Rajesh Anand. I'll be talking about the software things that are happening in this spectrum. So the NVDIMM, when it came into the picture, a lot of vendors, they were having a lot of proprietary solutions. They were having their own drivers. They were having their own stack and everything.
Starting point is 00:30:39 So there's an initiative that got started that they wanted to standardize that. So the whole trend is like instead of having a traditional storage stack, which has got the file system sitting on a block driver, sitting on the disk, now the trend is like have a non-volatile memory. It's not just for NVIDIMS, it could be for any persistent memory. So this is what the trend is going to happen. What you're saying is like
Starting point is 00:31:05 all the applications, they want to know, they want to have a full control over the storage. They want to have direct access to the storage. Why do we need multiple layers? Why do we have to have this traditional file system way of accessing, this way of accessing and all these things? So that is what is happening. So people try to, this solution, so many different vendors were trying to do it in their own ways, like you have a memory map interface, you use my driver, use my application, everything. Then what Intel did was a couple of years back,
Starting point is 00:31:36 they said, okay, let's make it more standard. So Intel kind of started defining this, and now this is pretty much coming into place. So this is what has become in the 4.0 kernel. So it's got a lot of things floating around. So what all that's in the gray portions, this is the interesting thing that has happened in the Linux 4.0 kernel. These are all become standard drivers now. That is, you don't have to, if you're using us, NVM or a persistent memory from us, you
Starting point is 00:32:15 don't have to use our drivers. The Linux 4.0 kernel already has the driver for that. So this is the NVM library that Intel has in its GitHub site. So what they have done is they have built a beautiful library system that is the, you can have, you can plug in any of the vendor's NVM modules or any persistent memory modules. The library gives you access, like you can use them like an object database or you can use them like a, what's that like a typical block database block interface, or you can have a direct persistent memory interface.
Starting point is 00:32:48 The library has all the interfaces defined. So all you've got to do is just write an up. Go ahead, please. Just clarification, right? Are you now referring to the F and P, or is 2 to N? Okay, so the thing is is this ecosystem, it's pretty interestingly, they call it as non-volatile memory which
Starting point is 00:33:10 kind of applies to both the dash n and the dash b. The dash f probably will cook into this infrastructure but right now if I have a dash b which is going to be common and if I have anything they call it as persistent memory, that is, what they call as persistent memory is like a memory
Starting point is 00:33:28 device sitting on the memory bus and it's quite accessible using load and store instructions and everything anything fits into this ecosystem that's the beauty that is happening that is if somebody says hey this is my memory device that can sit on a video port or anything. It's quite accessible. Using this infrastructure you can directly access that. The drivers will detect it. So it all comes to this which is this new ACPI specification. I think Paul touched a lot about it. This latest specification has given a standardization of what these
Starting point is 00:34:06 persistent memory modules have to be. They keep on referring to them as non-volatile memory, meaning that anything that applies, that has the characteristics of persistent memory will apply to that. It could be like what they call a spin torque memory, that memory and everything, it doesn't matter. There's a huge change that is happening in the BIOS. The BIOS constructs what is called the when we come from an interface table, that is the new definition. And the OS kernel detects that thing and it can say hey, I can access this guy this way, this way, and everything. So this is more like a vendor-independent way.
Starting point is 00:34:46 So this whole driver does a lot of work, and it depends on the buyer. So a lot of buyer changes are also happening. So your motherboard buyer has to set up this thing, and once it is done, the kernel detects everything and sets it up for you. So the interesting thing is there's a window, they call it the block driver and the VTT and everything. The idea is for legacy guys who want to use, see this is, at the end of the day, this is a memory device that is sitting on a memory bus. But still, if
Starting point is 00:35:17 people want to use it like a conventional block storage, you still have to provide that. The interesting aspect is now you are a memory device sitting on a memory bus, but the applications of the file system are always thinking like it's talking to a block storage. If there is a memory error or that thing happens, the conventional reporting scheme doesn't report it like there is a sector error or something like that. To make it happen, this whole infrastructure, they have to build it. It is the file system that is sitting on this thing, it gets the error as if it's getting a SCSI device error
Starting point is 00:35:53 or something like that. A lot of work has gone into this thing and this thing. But there's always a flip side to this thing. Now you want to maintain the legacy compatibility. What it introduces is like now you have a upper level guy who has his own host LBA which gets translated into another LBA which gets translated so there's a lot of multiple translation that happens the most interesting aspect which I don't know at this point is like how much it's gonna affect the performance for this guy
Starting point is 00:36:23 if you want to access it because at the end of the day the bottom most thing is doing cache like IOs because it's sitting on the memory bus. So even if you're doing a 4k IO when it goes to the bus it's doing 64 bytes at a time kind of thing because it has to go to the memory controller and all these things. This is one thing further experiments will tell, but this thing, this is pretty fast because this is byte addressable. So the whole ecosystem, at least in my opinion, is like it's going to flourish this way because people want to use the whole aspect of the process to memory is like
Starting point is 00:36:58 it's a memory device, let me use it like a byte addressable thing kind of thing. So this is more like a legacy thing, only the future will tell, but it's all enabled now. So this is really fast, it's like it just the device comes up and depending on how many devices you have, it exposes this like slash dev slash vmem something like that, and you can access them right away. So these are the, whatever I said, these are the things that are, what is existing, what's going to come, and all these things. So the kernel aspects are in 4.2 kernel, but Intel is trying very hard to push in the libraries and everything that hasn't happened yet.
Starting point is 00:37:40 And now, as we all know about Linux distributions, it's like, just because it's in the kernel, it doesn't mean it's going to be part of the, it's going to make it to the distribution soon or anything. Most likely Fedora is going to be the first one that's going to have the 4.2 kernel, we don't know yet. But the rest of the distributions might be just keeping it all. Yeah, sorry. What's the difference between the window block driver and the PMEM block driver? Yes. So the PMEM is basically, it's a persistent memory, direct memory access, meaning that
Starting point is 00:38:12 you can use a memmap command kind of thing, so it gives you a byte addressable thing. So the whole aspect of the PMEM is like, it sets you up to access the device, then it kind of takes it away from them. So the driver is not involved during the I-O operations. It's just a bit of a small thing. The block driver is a typical block driver. It's like a SCSI driver sitting on top of the memory device. So it has to translate everything that is your block,
Starting point is 00:38:38 host LBAs into like what offsets in the DIMM is gonna go and all this SCSI things. The block driver also, what they have done is, see, if you look at a typical memory system, you have multiple channels, you have interleave among the channels and all these kinds of things. That's how the memory devices are supposed to operate. Now, the PMIM is basically, if I have four, let's say, DIMMs on a socket,
Starting point is 00:39:06 it sees them as a whole so that it can interleave data and all these things. The flip side it brings in is like if one of the DIMMs go bad, you're seeing your whole storage as a whole device. It's going to go bad. So that's what the block window, what they do is you can set them like a RAID portions kind of thing. On every individual DIMMs, I can assign like a RAID distribution, like RAID 0, RAID 1, those kind of things. Those kind of support is there in the block driver.
Starting point is 00:39:34 So imagine you are having memory DIMM devices and you are treating them like a RAID kind of thing. So if one of the DIMMs fail, they have the protection logic in the driver that can record they have the parity and all those kinds of things. That's why it adds a lot of complexity to performance with TAL, whether it is worth it. Just because it's sitting on the memory bus
Starting point is 00:39:55 doesn't mean it's going to be having memory speed because there's a lot of software, what do you say, that layers put on top of it. But a lot more access. The DL mode access, it's completely different. Any questions? The PMM mode operates like a. Yes.
Starting point is 00:40:16 So you just get a file, access doesn't really just go. Yes. Yeah, that's his primary intention is. So yes, there is some interesting things also. This finger driver, you see it's called a block driver. So the reason is it comes up and still use it as a regular device under a file system. It acts as a block device. So this driver gives you dual functionality. That's why this DAX enabled file system, which is the EXT4, they made huge changes to enable or to disable all the kernels, buffers when you access the. If I can mount it as a DAX-enabled thing and I can use the BMAM block driver before using memory maps. So it kind of behaves like that. It doesn't
Starting point is 00:41:12 then copy all these things. So the whole idea is, thank you, the whole idea is all about like maintaining the legacy compatibility as much as possible. So because this introduction itself is a lot of shock. And most likely this whole thing is going to prevail because the next processor that's going to come out, the load and store instructions that got optimized, like flush instructions and all these things, right now the library, they have disabled it. But the whole idea is it's just the library the
Starting point is 00:41:45 applications don't have to worry about it once they have the new processor with the new instructions they will just change the library and the applications don't have to change it so this is the most interesting thing that's going to happen this is pretty much done it's for the legacy guys so these are the pointers so if you go to this bmum.io site it's beautiful it's got So these are the pointers. So if you go to the spmum.io site, it's beautiful. It's got all kinds of documents, the latest documents that Intel maintains it. And they have a GitHub reference.
Starting point is 00:42:15 You can download the source. And if you want to make some contributions, they're open for making contributions to that. If you want to make some changes, fix some bugs and stuff like that. I'm pretty much done. Any more questions? My understanding is that, according to your introduction, my understanding is that if you enable ADR,
Starting point is 00:42:39 ADR will guarantee flash memory channel buffer into PM. How about RDMA? If I use RDMA, this also will guarantee flash to memory channel buffer into PM. Also, is there any possibility ADL will fail to flash the buffer cache? So, when you say ADL will use the auto view and refresh? Yeah.
Starting point is 00:43:13 So, you're asking about the RDMA aspect of it? Yeah, if we use RDMA, the ADL will guarantee flash the buffer cache, the member channel, member channel buffer into persistent memory while power loss? I do not, honestly I don't know the answer to the question yet, so you know what I'm saying? It's trying to be detailed, right? I'm not aware of any plans to enable access to the memory when the system's not working.
Starting point is 00:43:42 So is that possible? If not RDMA, that's the generic DMA, will the ADR fail to flash in memory channel buffer? Not remote, just in some single host. Right. Yeah, I do RDMA. Yeah, they're independent of each other and have different purposes.
Starting point is 00:44:12 Yeah, I will get an answer. I don't know what that is. Just to be honest. I'll get it even more informed on work. The slides should be available for download. I think they're already on the NBDM SIG site. Okay. It's a similar presentation to the before show. The only... Five minutes.
Starting point is 00:44:45 What's that? Ah, okay. I can do this in five. System implementation and use cases. So putting it all together and making it work. There are several platforms available today that are NVDIM enabled. There's a few from Intel.
Starting point is 00:45:01 There's a few from Supermicro. They're only on here because they were the first ones initially, but almost without exception, most of the OEMs will have NVDIMM enabled platforms available for Haswell and Broadwell, probably Broadwell, and then eventually Skylake. Thirty-four socket systems? Four socket systems. I don't see any four socket systems up here, but do know they're coming. Yes. So for example, Supermicro has a list of 15 different
Starting point is 00:45:30 platforms we just didn't have enough room. But I believe four socket systems are available today. We talked about some of the energy source options. You can power the NVDIMMs through that 12-volt interface. In some systems, obviously, the system has to be plumbed appropriately for that. Or you can use an optional external energy source like super caps or a battery or whatever it might be. NVDIMMs themselves are pretty energy source agnostic, as long as it's energy. It can be a battery. It can be a hybrid. It can be super caps.
Starting point is 00:46:01 Super caps are nice because it has high energy density for a short period of time and that's all you need for an NVDIMM and super caps are much more reliable than batteries and that's why they're the energy source of choice. The population rules are very similar to standard DIMMs. You can't use LRDIMMs with R DIMMs in the same channel, but you can put an NV DIMM essentially anywhere in the system and the BIOS will detect it and set it up appropriately in the EA20 map that Arthur had mentioned. The BIOS is smart enough to interleave NV DIMMs that are within a system and it will interleave standard DIMs within a system.
Starting point is 00:46:46 So it really doesn't matter where you put them, however, where you put them does impact performance, right? So you should be smart about it. So for the NVDIM population tips, interleaving DIMs within a channel provides a small performance benefit. Interleaving DIMs across channels provides a very large performance benefit. So interleaving dims across channels provides a very large performance benefit so interleaving NV dims across channels by association gives you the best performance benefit and here are some optimal interleave examples I'll just point one out the green is an NV dim the light blue is standard dim gray is an empty slot. So in this configuration, the system will interleave 32 standard dims together and 8 gigs of NV dim.
Starting point is 00:47:35 And the way that the OS will recognize the NV dims is it will pass up, the BIOS will pass up the EA20 map, and it can allocate the upper 8 gigabytes, for example, of addressable memory as persistent memory. So the OS or the application can use it appropriately. Several use cases. These should be obvious. In-memory databasing, of course.
Starting point is 00:48:00 In enterprise storage, you can use storage tiering, caching, write buffering. Caching is probably the most popular application today because it's sort of plug and play into most application-aware systems. Most applications that use a write buffer or write caching already need some sort of persistent memory in their system, so there's not a lot of overhead in terms of application development. That's one of the reasons why we have the block mode driver that's being developed because a block mode driver can more easily plug into
Starting point is 00:48:29 today's applications. However, byte addressable persistent memory is a more compelling use case in the future for persistent memory. So an example that a specific customer has shown, and this was shown I think at the NVM summit back in January, this is a storage bridge bay, so you have two high availability nodes, active-active nodes. In a system like this, you always keep two copies of your persistent memory, and in this case, it's done with an NVRAM PCIe card. So you have essentially a bunch of latency hops in order to make sure that the same copy of that data is on the other side or the other node, in case one of the nodes goes down.
Starting point is 00:49:15 So you have the data in volatile DRAM. It needs to be copied to that PCIe card. So it has to go through the memory channel over a PCIe hop, has to go back through the PCIe root complex to a 10 gig E NIC card through the backplane, and then through those same latencies back into the PCIe card. Well, with NVDIMs, it gets much simpler, right? Once the data is in the NVDIM, it's persistent. So once you've captured it in the NVDIM, you can go through the non-transparent bridge, which mirrors the right to the other node. It's much, much faster. And in an implementation
Starting point is 00:49:52 like this, the customer was able to see a 4x performance improvement in the right latency. 4x. It's pretty significant. And I don't know a technology today that could give them that type of performance improvement. So, with that, I think we're out of time. I'll open it up for questions. Just one quick comment. Sure. The previous slide was showing the NTD configuration. Yes.
Starting point is 00:50:25 What we're anticipating in the program in Twig is that for Eric, who is not in storage devices, but more servers, is battery-hard DMA internet, but exactly the same benefits that you described there. And that's how we would anticipate in the question you have there about access to remote memory. No, it wouldn't be access to remote memory
Starting point is 00:50:49 when a server is down. It would be access to duplicate copy across the cluster. So we're using replication as opposed to some assumption that we would get to powered off memory. So I think we're approaching that problem. You had the answer that I was gonna to give you on the slide there, but just a couple of letters. Perfect.
Starting point is 00:51:09 Thank you. Thanks for that comment. Okay, any other questions before we close? No? All right. Thanks, everyone. Appreciate the time. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at sneha.org.
Starting point is 00:51:33 Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
