Storage Developer Conference - #2: Managing the Next Generation Memory Subsystem

Episode Date: April 11, 2016

...

Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast. You are listening to SDC Podcast Episode 2. Today we hear from Paul von Behren, software architect with Intel, as he presents Managing the Next Generation Memory Subsystem from the 2015 Storage Developer Conference.
My name is Paul von Behren. I'm at Intel, and I'm going to talk about managing memory and a lot of new concepts. I'm not going to get into a lot of detail; I won't have any XML dumps on the slides or the normal stuff that makes most people scared of management software. It's kind of conceptual, but it tries to set up why things are changing, why this is more interesting. Traditionally, you don't manage memory. It's just there. It has capacity. That's about it. It's getting a lot more interesting. Okay.
I've got to get rid of this. Hold on. How's that, better? So, also, I normally work on the programming model side of things. The architect who has been doing our manageability software at Intel did a lot of this work and pretty well put the slides together himself. But he accepted a new job a few weeks ago and is in training this week.
So I'm presenting on his behalf, but I was involved with the reviews and the work that we had done. So what I'm going to do is talk about memory technologies, things that are coming up that change the picture. This includes NVDIMMs, but some other things as well; concepts and practices for next generation memory; and emerging management standards,
open source code, and documentation. And emerging is a big word here, because we're just getting started. We're trying to get some things in place, trying to get some understanding. We're also looking for feedback, particularly from the operating system vendors, the NVDIMM vendors, and management software vendors. This is an opportunity where we'd like people to get involved in the discussion.
Let's talk about memory technologies and management. What kind of use cases are there? First off, we'll talk about the current way of doing things. What do we know about processor cache? Not a lot. It's pretty transparent. We know how much is there.
We know some information about what's being dedicated for core. But for the most part, you don't really manage it. It's just there. There's also the memory related to... I can't even read my laptop here. To the... Let me scoot around here.
Memory controllers, channels, slots, and DIMMs. This becomes a little bit more important because of NUMA locality, but again, a lot of applications don't really care at all. Also BIOS interleave: you frequently need to set this up in the BIOS, and it can be very confusing. There's frequently a range of settings, including an "I don't want to think about it" choice, frequently one of the BIOS options, just because in a lot of cases letting the hardware do the best thing is the right answer. But it becomes more interesting as we have different kinds of memory moving forward. The memory and channel clock speed, which is the speed of installed memory: there's information about that that's available now. Typically it doesn't get in your way, because the differences in speed across the different kinds of DIMMs in any particular system are relatively small, but again, we're talking about potentially larger differences now that become more important to manage. Then there's redundancy: ranks, rank sparing, mirroring.
So a lot of people don't realize this, but high-end servers allow RAID-like features for memory. They have for a while. If you have the cash to fully populate a very high-end server and have your capacity divided in half, you'll be very interested in the mirroring features. So it's not available in all that many systems, and not a lot of people use it, but those that do value it very much. Another level of redundancy for certain types of computing is always critical. I mentioned here SSD cache for HDDs. So we've been talking about the role of persistent memory; we're not necessarily in this case thinking specifically about SSDs, but the way SSDs work as a cache now is something we see growing into a use of persistent memory in the future.
So I've got this notation here, NVDIMM-N, and I have a couple of other NVDIMM-hyphen-somethings that are coming up. JEDEC is a big force in DIMMs; they are the standardization at the hardware level around DIMMs. They have announced interest in supporting new NVDIMM standards. As far as I know, they are not quite released yet, but there has been quite a bit of information out. And by the way, my understanding is they'll be released real soon now. So we can expect more on this. But there's been a lot of information at Flash Memory Summit and some of the other SNIA events.
So I'll be using their terminology just because it's kind of helpful to have names to talk about these different classes of NVDIMMs. So the first one, NVDIMM-N, is non-volatile memory created by combining volatile and non-volatile media with a power source, the non-volatile media typically today being flash. And what you see in terms of your software activity is DRAM speeds, with triggering mechanisms like ADR, which is Intel's Asynchronous DRAM Refresh. I'm not even going to try.
It's a mechanism to trigger a flush of the volatile memory into flash because there's a power-down situation going on. There's platform support, BIOS support, that makes NVDIMM-Ns work. There are standards that were put in place very recently that allow the operating system to tell the difference between different types of DIMMs. There is interleave, separately from DRAM; it has features much like typical DRAM interleaving, but it's being handled differently, because it doesn't make any sense to interleave volatile and non-volatile DIMMs. So they have different rules. Again, an opportunity for management software. I mentioned the BIOS is uniquely identifying the different types. So the kinds of use cases for this flash-backed DRAM hardware: how to trigger the save, how to trigger the restore. This is mostly transparent, but sometimes there are hooks into the power supplies or power source to specifically manage this or at least tune it. Monitor save and restore status so
that you can understand what the condition is at any moment. Monitor energy source, flash health, save readiness. So it's a lot of diagnostic type of data, but from the operating system perspective, it's basically treated like volatile memory, other than the fact that it can have these management cases. Standard NVDIMM use cases apply as well. So these are the kind of things we touched on
in the previous slide. They're mostly common to DIMMs in general. You want to be able to replace a failed DIMM, update firmware, and decommission, erasing sensitive persistent content. Now that's a very new persistent memory concept. Since volatile memory didn't hold data, you really didn't have to worry about this.
It's becoming a huge issue around persistent memory. The next type that JEDEC has classified, and which are in the market now, are NVDIMMs that act like block devices, basically SSDs. It's a DIMM form factor, and a custom BIOS uniquely identifies block NVDIMM-F capability in the system address map. So this is the information that the BIOS provides to the operating system about the topology of the hardware.
So they identify this stuff so the operating system doesn't try to use it for volatile memory and realizes it's a different access approach. It coexists with DRAM. Some system DRAM may be used for caching; that's another characteristic of this. Sometimes they're taking part of the standard system memory, and sometimes the DRAM they use is on the NVDIMM-F modules themselves. A custom driver presents the DIMMs as standard block devices to the operating system. Better than SSD performance. That's really, you know, one of the big factors here; the benefit is the performance. The other is it's transparent. Everybody knows how to do file systems and do I/O to disks, and these work exactly the same. The kinds of use cases that come up are monitoring flash spares, wear, and other drive attributes. They're using SMART, sometimes with variations, as SMART tends to have, but providing data on the health of the device.
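To underline how transparent that is to software, here is a minimal sketch that treats such a device as just another Linux block device. The device node name is purely illustrative (whatever node the vendor's driver exposes); the calls are the same ones any disk or SSD would accept:

```c
#include <fcntl.h>
#include <linux/fs.h>    /* BLKGETSIZE64 */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    /* Illustrative device node; an NVDIMM-F driver presents a standard block device. */
    const char *dev = "/dev/pmemblk0";

    int fd = open(dev, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    uint64_t bytes = 0;
    if (ioctl(fd, BLKGETSIZE64, &bytes) == 0)
        printf("%s capacity: %llu bytes\n", dev, (unsigned long long)bytes);

    /* Ordinary block reads work exactly as they would on an SSD. */
    char sector[4096];
    ssize_t n = pread(fd, sector, sizeof(sector), 0);
    printf("read %zd bytes from the first block\n", n);

    close(fd);
    return 0;
}
```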
Standard block device partitioning and formatting: these are built-in operating system management features, but they still apply. Software RAID can be used with these kinds of devices. Standard NVDIMM use cases apply as well: update firmware, decommission, erase sensitive data, and back up the data. In the case of the flash-backed NVDIMMs,
you didn't have to worry about backup because they had the built-in backup capability. With these, you would back up the data just like you would any other block device. NVDIMM-P: this is proposed. My understanding is this is not going to be used in the upcoming JEDEC standards, but it's trying to capture this idea of a combination of non-volatile memory fast enough for direct memory controller access. I used MC all over the place because I was running out of space here.
That's the memory controller. Directly accessible DRAM and NAND. So it has persistent media capabilities as well; it's recognized as persistent media, not just a transparent backup. Near-DRAM speeds, directly accessed by the memory controller, very large capacities, and it may be multi-mode capable: byte- and/or block-addressable. So multi-mode capability starts to throw a big wrinkle into the world of manageability. Rather than the management stack thinking just in terms of monitoring and reporting what's there, if the device provides the ability to let the administrator make choices, now in addition to passive management reporting you have to have active management; the administrator gets to make choices here. So this becomes a much more complicated and interesting management problem. Here I just wanted to point out, because it's kind of lost in the paragraph there, a very key concept: the BIOS uniquely identifies volatile and non-volatile memory regions. I mentioned this before, and how the information about the memory topology is reported. So the use cases here are: you can configure
RAS and performance characteristics via the BIOS, and configure block and direct access devices via the driver. So traditionally, if you had any choices about your memory configuration, like interleaving, you really could only do them from the BIOS.
The standards are being extended to the point where there's the possibility of having software control over some, but not all, of these characteristics. And we figured it's a pretty drastic shift to reduce the size of volatile memory with a management command on a running system, but increasing it might not be too bad. And changing whether unused persistent memory is byte- or block-addressable shouldn't be an impact either. So there's a subset of configuration changes you can make to your memory configuration on a live system.
And again, many of the previous use cases apply. Replace a failed interleaved DIMM, update firmware, and we never get away from decommissioning and erasing sensitive data. So these are the types of things that complicate the memory management problem. I touched on this a little bit before, and I should just expand. NUMA, if you're not familiar, is the affinity where you get better performance by using memory that's local to a socket for processes that are running on cores on that socket. In the case of both DRAM and flash, is there any concern about the relative reliability or endurance between those two technologies? I know DRAM has very good endurance. Is there a question about that? There are likely to be differences in endurance characteristics of
these NVDIMMs. It really depends on the types of technology being used, but yes, there are issues there. In terms of the management software being able to monitor things and get statistics: to some extent, a lot of what's being considered
is very close to what's being done with SSDs. So it's a somewhat known problem, but it's a new management issue for memory. Yes? Isn't keeping your hands off the memory controller and letting the hardware figure it out directly at odds with NUMA? We did a test where we wrote some really high performance code using a conventional framework called OpenCL. We ran it on two sockets. We ran it on one socket. We actually got better performance on a single socket than we did on two sockets. Sometimes it depends on the nature of your software; you may be able to predict things. Also, operating systems will, if you can set up and configure how the NUMA configuration is being used, or how an application is tied into it, let you say, I'm setting things up so that the memory for this and the active processes are all tied to a specific core.
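As a reference point, the kind of explicit pinning described here, tying an allocation and the threads that use it to one socket, is typically done with the numactl utility or the libnuma C API. A minimal sketch, with the node number and buffer size purely illustrative:

```c
#include <numa.h>      /* libnuma: link with -lnuma */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int node = 0;                      /* illustrative: pin everything to node 0 */
    size_t len = 64 * 1024 * 1024;     /* 64 MiB working set */

    /* Allocate memory physically backed by the chosen node... */
    void *buf = numa_alloc_onnode(len, node);
    if (buf == NULL)
        return 1;

    /* ...and keep this thread on cores of the same node. */
    numa_run_on_node(node);

    memset(buf, 0, len);               /* touch the memory locally */

    numa_free(buf, len);
    return 0;
}
```

The same effect can be had from outside the program with the utility, for example numactl --cpunodebind=0 --membind=0 ./app, which is the sort of setup tool being referred to here.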
If you don't do that, then you have a choice between the operating system's Monte Carlo choices versus the hardware's choices if you say you want to use UMA. And having the hardware guess at it might be the right answer. So yeah, it can be tricky. Right now, NUMA really seems to be an HPC feature. That's where you get the most use of it. They're more interested in that fine-grained control. For most applications, like I said, letting the server run in UMA mode might be the best choice. And it's certainly the most common choice, as far as I can tell. So I was done with that slide. Oh, I was talking about NUMA. The other thing that's being introduced now, which is another memory management use case, is that HPC-oriented CPUs are now introducing in-package, or on-package, memory, which is basically soldered to the CPU. With today's NUMA configurations, there's a performance penalty if you go off socket. With some of these HPC configurations, you can go off socket to standard DRAM, but a different socket can't get to another socket's in-package memory. So now you have another type of behavior that, again, mostly HPC applications have to be aware of. So there's work being done on providing libraries that can combine information about the topology of the system, understanding of the types of memory, and be able to help software make policy decisions
about what regions of memory it should avoid if it has certain goals. And that's one of the things that Doug was talking about, this open source library. We've taken some of the software we had for faking volatile memory on top of persistent memory and are working it into a common library, so that there's a single place that applications can go and get some of this information. Does this library allow explicit programming control
where to place your data on the memory? Yeah, yes. It's kind of at the memory allocation level at the moment. Yeah. I mean, hopefully it's a little more flexible than that, but your application, to get the most performance, will take advantage of this. And like I said, right now, that's essentially HPC software. Yes? Can I ask, with these different modes, how do you mingle in things like DMA and RDMA? So we have the NVDIMM-N, the three different flavors. Is there any comment on how we might
push data into those I/O devices using DMA? Well, the NVDIMM-F is a block device. Right. So, yeah, that's being handled differently. DMA should work with the byte-addressable persistent memory part. Exactly how this rolls out relative to the hardware is still a little bit unclear.
Because when you're acting like a block device, or part of your capacity is acting like a block device and part of it's acting like memory, it's not really quite clear what the right answer is. But, so. I see; I think, publicly, Intel introduced some instructions for flushing. When you do a DMA, you don't necessarily know if the cache has got the stuff flushed. Right. No. Thank you. Any other questions? Yes.
So do you expect to have regions of the cache lockable, like under programmer control? Instead of just doing a cache line lock, just do a chunk of the cache. They can explicitly lock and explicitly evict. So this would make it easier to be able to move data in, right? So there's a whole bunch of compiler optimizations you can do by hand if you really care.
In my world, I really care. So it's like those optimizations. So if you think of multi-level caches, that's what we're talking about, right? They just have different sizes. So if we're doing cache blocking at every level, one of the things that would be nice is if you could just block or lock the whole chunk.
Right. Do you expect these kinds of memory systems to have that kind of capability? Not just a line, but a block? I think what is done at times to achieve that is to make sure that the application's requirements are met by the hardware and not have anything ever move. So, you know, this is common in dedicated systems that are doing one thing. So if you basically have one application or one service,
you know, there's a little memory left over for the operating system and the application gets everything else, and then it doesn't have to worry about memory moving around. But in terms of hardware support, I'm not sure. Yes? Yeah, on your memory management use cases, they show the NVDIMM working, backing up to the flash, the DIMM in an N mode.
But does it say anything about an external energy source, like a supercap? Are there plans to move forward with that? Which one is that? Are you going to provide that for the Type N or the Type F or, actually, the Type P? Because Type N does have supercaps. Right.
I'm wondering, it has management software to detect that. I'm just wondering, going forward, is that going to be supported? Who owns the management software? I'll get to that, who owns it, soon. In just a couple slides here. In terms of the SuperCAP, that's something that's interesting from a management point of view.
A lot of it's pretty opaque right now, or vendor-specific, and that's one of the other challenges here. But that's true with all devices. I think the management solutions will be a combination of what can be done generically across a variety of vendor solutions along with the vendor-specific stuff. I think that's common for just about any kind of hardware device; there's a little bit of each. But on the supercap, again, I've seen some of the JEDEC work. As far as I know it's not publicly available, and I shouldn't be speaking too much about it
other than what they talked about at the Flash Memory Summit. But I don't recall seeing that they were too specific about the power source information, other than they're also talking about some manageability interfaces, something else that could be monitored if the vendor wanted to provide that. But I don't think they were necessarily describing exactly what had to be there. So, next generation memory concepts, what we're anticipating for the next generation. Memory is not a monolithic resource. I've mentioned this. There are different kinds of memory. They have different features, different quality of service, different characteristics.
Traditionally, memory has been treated as a monolithic resource. Even though NUMA has been in place for quite a while on systems, there's an awful lot of software around manageability that knows nothing about NUMA. So these are a variety of things that are kind of forcing us to think about a change in the way we approach memory. So that's probably the biggest takeaway. Multiple types of devices plugged into the memory bus.
Devices may coexist or require their own channel. They may work cooperatively or may be segregated. BIOS recognizes distinct device characteristics. Management tools need to differentiate memory types and manage accordingly. So there are slightly different issues between management software and the applications that are using the data, but some of the stuff overlaps, and I expect to see more and more awareness of these differences, both how to react to issues and also how to optimize. Configuration required: I mentioned this earlier, that what's bubbling up soon is memory systems, NVDIMM-
type systems that have different types of capabilities, where the administrator can make choices. So volatile versus persistent, interleaved, mirrored sets, those kinds of choices that you would have; block access, byte access; cooperative relationships, that was, you know, NUMA affinity being the main one, but there could be other kinds of relationships that you may want to be aware of. Constraints: topology restrictions, operating system support, workload requirements. And the workload requirements really tie into the ability to make choices around the types of configurations, choices that you have with NVDIMMs. Andy talked about cases where the block-addressable pseudo-SSDs are the best way existing applications work.
This may be a good starting point. As applications get smarter about working with the persistent memory model, you should be able to get better performance with a PMEM-aware application on PMEM hardware. And so I think there will be transition times where the new version of the database starts supporting PMEM and you can switch over your hardware. Is it possible that the use of persistent memory
would simplify the number of devices on the memory bus? At this point, it doesn't seem like it. But it may not make things worse. But it certainly opens the opportunity for things to get pretty complicated. I mean, there's been a real, significant attempt to allow different kinds of devices to be used on the memory bus with NVDIMMs, which adds some complications, particularly population rules. It was touched on in one of the previous slides there. Things have to be on the same channel, or they must be on different channels, and there are things like that that you have to be aware of. Moving on, persistent handles. So the idea here is similar to other kinds of devices. There will be ways that names get assigned for the persistent memory devices; you may not be able to pick them one by one.
It's really kind of a choice between the operating system, the driver design, and hardware choices. But you're going to have something, just like with block devices: things that look like file names that are actually representative of names of devices. So for the DAX work that's in Linux, the PMEM devices, or namespaces, that are discovered are just /dev/pmem0, pmem1, pmem2. So it's a very simple namespace. And then on top of that, you mount a PMEM-aware file system, and then you have names that you can set for the files that represent the blobs of persistent memory that get used by applications. So file systems and the drivers support exactly this type of behavior for other resources, very similar to what you see with disks. They may be able to allocate and label a region of persistent memory. So in some cases, with NVDIMM devices, the capacity you get for a device is exactly the capacity of one DIMM, or maybe all of them formed into one. But there's, and I'll mention it a little later in more detail, an ACPI standard, NFIT, that provides a way for basically a separation of those, kind of similar to the way that storage arrays will allow you to configure a blob of stuff together at the RAID level and then divide it up into a bunch of virtual logical units. Deallocation when done, modify if needed.
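To make the idea of file names as persistent handles concrete, here is a minimal sketch, in C, of an application mapping one of those files once a PMEM-aware file system is mounted. The /mnt/pmem mount point and file name are illustrative; on a non-DAX file system the same code still runs, it just goes through the page cache:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Illustrative path: a file on a file system mounted from /dev/pmem0 */
    const char *path = "/mnt/pmem/example";
    size_t len = 4096;

    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, len) != 0) {
        perror("open/ftruncate");
        return 1;
    }

    /* With DAX, this mapping goes straight to the NVDIMM, with no page cache in between. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    strcpy(p, "hello, persistent memory");

    /* Flush to the persistence domain; msync is the portable way to do it. */
    msync(p, len, MS_SYNC);

    munmap(p, len);
    close(fd);
    return 0;
}
```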
Options are there that may or may not apply to the technology. Data management needed. Persistence creates reliability, serviceability, and security concerns. Interleaving NVDIMMs complicates failure domains. If you've worked with servers where you needed to worry about interleaving, what's there is already complicated, so this makes it even more so. Yes, Tom? But you're also looking at maybe a self-encryption function as well? Yeah, the need for something is very clear.
The right answer is a little bit unclear. We've started some discussions in the NVM Programming TWG in SNIA of: if hardware provided 10 encryption keys for an NVDIMM, would that be enough? Consider you're running 20,000 containers on your server. Are 10 keys enough? Does it even make sense to think that way? Are there going to be niche cases where you absolutely
need one or zero, but you probably couldn't take advantage of more than that? Is software encryption the right answer? So there are a lot of considerations. I was just kind of getting started with that, and it really wasn't a primary concern of the TWG when we got started. But the more you start thinking about this, this is something we really need to have an answer for as an industry.
When the power goes off, the data doesn't go away. Right. Andy had all those pictures of core. I remember my first job: I got out of college and went to Sperry, and was immediately swallowed up into Unisys. We had a beta customer that was having all kinds of funky problems with their storage, with torn writes. What a totally new concept. And I had to debug with somebody at the Air Force; it's great to have a highly confidential site as a beta customer. He had to basically get cleared, had to have intelligence people in the room with him, before he could answer my questions on the phone. And they needed a quiet place to work, so they used a room that was originally built for the core memory.
If they had a power loss in that data center, they'd grab all the core memory out of the servers and lock it up in this concrete vault. And it was like a panic room. So that's where he called me from. At the time, it was just a place you would hang out and have coffee and phone calls. But it was that close to that era.
I didn't work with that; it was just afterwards. Failed server: need to migrate NVDIMMs to a new server. This also gets tricky. Your new server may not have the same kind of socket configuration that you need, but if it does, assuming you have one that's exactly the same, you should be able to move DIMMs and reinsert them, paying attention to the same population rules, things like that. That may be useful in some cases.
But you probably need some help from the software to tell you what strategy you need to take. Repurposing NVDIMMs: if you decide you're done with them in their current use, then you have to worry about the encryption problem and then how to clear them, make sure that any sensitive data is gone. Optimization is hard. I talked about NUMA a bit, and there's probably nothing really new on this slide versus what I said already. The operating systems are already doing a lot of the tricky work with today's NUMA. They will, again, schedule threads to run optimally for the memory. That's really easy to do with volatile memory because you only have to make all those decisions since the last time the system was booted. Anything that was there before is gone after the reboot. It gets a lot more complicated when you have persistent memory. Now you've got, you started an application running, you've got its threads running on the cores on socket 2, which made perfect sense
at the time, and then the thing decides to open up a persistent memory file with affinity to socket 1. What do you do? In the case of typical NUMA right now, you can have some non-optimal I/Os, or you can try to have the application set up, again using utilities that are available, to set up that affinity ahead of time. So generally, we think that existing tools will suffice with persistent memory, versus volatile memory and NUMA now. But the on-socket memory is a brand new headache. It's going to be a real challenge to get that right and optimal. Things will work; it will be trickier to get things optimal. Public resources: what's going on with standards and emerging work and all that, and open source. So I have a little chart here of kind of a stack, and I'm just using this as a reference for where the standards are.
Again, I have only been loosely following the JEDEC work. I believe that the impact is going to be really a combination of where the blue boxes are here, and the physical access to the DIMMs themselves is where their NVDIMM work is going on. They're going to be something very important to monitor; I just didn't have public information yet to share. So they're missing from my discussion here, but they're really a big force. But looking at the top, from the kind of high-level user space, there are end user tools, command line tools, XML representation, integration with management libraries. Kernel: in the case of Linux they have sysfs, ioctls, hooks into the drivers. These are going to be extended the way they have been for other kinds of devices, as ways for management software to figure out what's going on. Out of band, this is through BMCs and that kind of communication being done,
not through the normal operating system interfaces, but you can monitor things before the OS is even installed. You can monitor things when the system appears to be offline. Jim mentioned this morning the need for a power reset; it's one of the other tricks that's provided with BMCs. So IPMI is kind of the way the world is right now, and Redfish is the emerging standard for an out-of-band interface to servers. We're working with that activity to get NVDIMMs plugged into the model.
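For a flavor of what that out-of-band path looks like, here is a minimal sketch that fetches the standard Systems collection from a Redfish service over HTTPS using libcurl. The BMC address and credentials are placeholders, and NVDIMM-specific resources were still being defined in the Redfish model at the time of this talk:

```c
#include <curl/curl.h>   /* link with -lcurl */
#include <stdio.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    /* Placeholder BMC address; every Redfish service exposes /redfish/v1/ */
    curl_easy_setopt(curl, CURLOPT_URL,
                     "https://bmc.example.com/redfish/v1/Systems");
    curl_easy_setopt(curl, CURLOPT_USERPWD, "admin:password"); /* placeholder */

    /* Many BMCs use self-signed certificates; relax verification for a demo only. */
    curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0L);
    curl_easy_setopt(curl, CURLOPT_SSL_VERIFYHOST, 0L);

    /* The default write callback prints the JSON response body to stdout. */
    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "Redfish request failed: %s\n", curl_easy_strerror(rc));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```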
BIOS. ACPI. ACPI 6.0 came out earlier this year. I mentioned earlier they introduced NFIT. NFIT, like I said, adds this level of indirection between the physical topology and the logical topology. So if your PMEM devices support this, you can create what appears to be a pool of capacity
from one or multiple NVDIMMs and divide it up into multiple namespaces. This is not a software construction; this is not a software RAID. This is being done at the BIOS level. Once the configuration is done and saved off, there's not like a RAID manager or a partitioning manager going on.
Once it's done, it's fairly passive in the firmware. But you can configure this stuff at the BIOS level, and again, you have options of configuring this stuff from software as well if the devices support it. You don't have to use that, so there are also going to be NVDIMMs that have much simpler, really read-only, configurations. Hardware: look at registers, firmware interfaces; it's all part of the fun puzzle here. So for the BIOS work that's going on, I mentioned BIOS tables that describe the NVDIMM resources to the operating system. Actually, these are the same kind of tables that talk about all kinds of hardware, motherboard resources, or system resources really. So the data structures describing them are put into memory and the operating system can read them. This has been extended to have multiple kinds of devices for memory, not just one. It also includes, you know, the NFIT, this ability I was just talking about, kind of a level of indirection.
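As an illustration of tables being put into memory where the operating system can read them: on a Linux machine that exposes ACPI tables through sysfs, you can peek at the raw NFIT yourself. This is only a sketch; it parses just the standard 36-byte ACPI table header, assumes the /sys/firmware/acpi/tables/NFIT path and root access, and leaves decoding of the NFIT sub-structures to real tooling:

```c
#include <stdint.h>
#include <stdio.h>

/* Standard ACPI System Description Table header (36 bytes). */
struct acpi_table_header {
    char     signature[4];
    uint32_t length;
    uint8_t  revision;
    uint8_t  checksum;
    char     oem_id[6];
    char     oem_table_id[8];
    uint32_t oem_revision;
    char     creator_id[4];
    uint32_t creator_revision;
} __attribute__((packed));

int main(void)
{
    FILE *f = fopen("/sys/firmware/acpi/tables/NFIT", "rb");
    if (!f) {
        perror("open NFIT (needs root, and an NFIT-capable platform)");
        return 1;
    }

    struct acpi_table_header hdr;
    if (fread(&hdr, sizeof(hdr), 1, f) != 1) {
        fclose(f);
        return 1;
    }
    fclose(f);

    printf("signature: %.4s, total length: %u bytes, revision: %u\n",
           hdr.signature, hdr.length, hdr.revision);
    printf("OEM ID: %.6s\n", hdr.oem_id);
    return 0;
}
```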
DSMs, or device specific methods: there's a link to the examples paper here. So ACPI 6.0 really opened up a lot of functionality related to NVDIMMs at a higher level than what JEDEC is doing; this is mostly the level where you'd have interaction with the operating system, though operating systems will probably be working directly with some of the JEDEC interfaces as well. At the kernel level, the Linux PMEM driver, which is known as DAX, direct access, has gone upstream in kernel 4.0.
This whole universe of the BIOS support that I was just talking about, that's pretty well all in place in the 4.2 kernel, which is also now upstream. There are going to be patches and tweaks and things like that for a while, but the basic structure is in place. You can use it. As far as I know, there are no standard distros that provide it enabled yet. But wait a couple of weeks, and it might have changed. Linux is a very fast-moving community. So for those of us that are playing with this stuff now, it's not terribly difficult to do a kernel build that enables these features.
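Once a kernel with those options is running, one quick way to see what the NVDIMM subsystem has discovered is to walk its sysfs bus. A hedged sketch: the /sys/bus/nd/devices path is what the upstream libnvdimm code registers, but exactly what appears there (regions, namespaces, DIMM devices) depends on kernel version and hardware:

```c
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    /* NVDIMM regions, namespaces and DIMM devices registered by libnvdimm. */
    const char *path = "/sys/bus/nd/devices";

    DIR *dir = opendir(path);
    if (!dir) {
        perror("opendir (no libnvdimm devices, or kernel support missing)");
        return 1;
    }

    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] == '.')
            continue;                    /* skip "." and ".." */
        printf("%s\n", de->d_name);      /* e.g. region0, namespace0.0, nmem0 */
    }

    closedir(dir);
    return 0;
}
```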
One of the cool things that's provided there is a little hook. So the BIOS table I was talking about representing different memory types is the E820 table. And if you identify yourself as an NVDIMM type, then Linux, with these changes in, will just not put your memory into the volatile memory pool. So there's a special syntax for a kernel command line option that says, I want to create a physical memory address range and have you pretend it had an NVDIMM E820 type. And when you do that, you just have a range of memory. It's really volatile memory. You know it's volatile memory; it's the stuff that was there yesterday. It didn't change any features, but the system treats it that way. You can use the DAX features, you can mount a file system, you can use mmap, and it will not do the paging that Andy had talked about earlier, and you can start looking at how you adapt your software to work with persistent memory. Until you reboot, that is; then it's not persistent anymore. But it's still a very useful feature to evaluate how your software works today. The PRD is a Git repo which picks up the kernel and adds emerging work in this area. So Intel is managing this until all of the bits and pieces get in place.
But it's very easy to grab a tarball from it, or a zip file, I guess. Or you can use Git and download a copy of this. It already has the not-quite-approved pieces patched in, and you can get a kernel up and running pretty quickly. The namespace spec is available through pmem.io.
That's the same place that the NVML library is; it's just different areas on the same site. There's also the device writer's guide, a low-level Linux-only library, and a little command-line tool for some NVDIMM management services, just kind of a starting point.
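Since NVML comes up here, a minimal sketch of what byte-addressable persistence looks like through its libpmem API, with names as documented at pmem.io; the file path is illustrative and should sit on a DAX-capable file system to get the full benefit:

```c
#include <libpmem.h>   /* from NVML/PMDK: link with -lpmem */
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Create (or open) a 4 KiB pool file; the path is illustrative. */
    char *addr = pmem_map_file("/mnt/pmem/example.pool", 4096,
                               PMEM_FILE_CREATE, 0644,
                               &mapped_len, &is_pmem);
    if (addr == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    strcpy(addr, "hello from libpmem");

    if (is_pmem)
        pmem_persist(addr, mapped_len);     /* flush CPU caches to the media */
    else
        pmem_msync(addr, mapped_len);       /* fall back to msync semantics */

    pmem_unmap(addr, mapped_len);
    return 0;
}
```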
In addition to that, Intel has been defining experimental standards for CIM-based management. So we worked in two different groups, because right now the server side of CIM-based management is done in DMTF, while the storage side is done in SNIA, in the SMI-S standards. So we realized that it really makes sense to update related stuff on both sides in order to cover this. So there's a read-only model for memory that has been updated to include more information about NUMA topology, because that was just missing in the previous version, but also the possibility that memory might be there, DIMMs might be there, but not part of the volatile memory pool. It doesn't have all of the configuration choices, because the model there was much simpler before and we didn't really want to make it that much more complicated. But at least you can recognize that there's this much capacity and these DIMMs attached to the system, but only this much is part of the volatile memory pool. To understand the rest of it, you have to look at the SNIA models, and those are in two layers. These profiles tend to be small pieces
that you piece together depending upon the capabilities of your hardware. So one of them is just managing the system address map and allocating the system address map. So basically, it says how you take your capacity and expose it: whether it's block-addressable persistent, byte-addressable persistent, or volatile.
So you can do that level of tweaking, and that's about it with that one. The other is around the persistent memory regions, following the same NFIT model that ACPI describes. If you're interested in those, they're emerging, and as with anything emerging, you can get more information if you're an insider. In 2016, you should be able to see public versions of the SMI-S 1.7 spec, including these profiles, from the same place where all of the in-review SNIA technical specs are, through the URL here. If you're an SMI TWG member, which is the working group that deals with the SMI-S stuff, you can see the emerging work now; there are actually ballots going on on approving these right now, and there's a URL for those
that would be the latest. I got an older version of this work made public for review several months ago. But as with anything emerging, a lot has changed in that time. So it's significantly different than... Really, the model is about the same.
The names have gone through significant change. People reviewed them and said, I don't like that name, and changed this around. A couple things got combined. But that's available. It's publicly available now. So all those URLs are there. Just a real one-minute description.
I mentioned that Intel is looking at an implementation of this. We have implemented the CIM model. We're kind of going through validation now. We are really using the same general model outside of CIM as well. So when you do get to look at the CLI we have, because it's really not available yet, it will look a lot like the way that the CIM model works. And the idea is that depending upon the context, depending upon the operating system,
there's no perfect answer for manageability tools right now. For the software that is CIM-enabled, there's a lot of desire, because there's a place where you can drive standards and have discussion. The rest of it just kind of evolves to meet the problem. But at the moment we're going through several approaches. We're also actively looking now at a Redfish adaptation. A few months ago we didn't know what Redfish was, and I think six months ago probably not much of anybody did. It's become a big force in server management. So we've been monitoring that and realized that it's a simpler type of thing; it's much more focused. But we're trying to shoehorn the same model in there and work with that community as well. So we're trying to drive standards where we can. But like I said, there are going to be cases
where there's going to be vendor-specific management. There's just always an aspect of hardware management that really isn't worth standardizing because it's really vendor-specific. Then you kind of get to a layer above that, and some generic rules apply. So we anticipate a combination of both of those for a while. Any other questions?
All right. So it's about moving things between systems and maintaining their identity. So if you have something which maps on one system, it's not necessarily portable? Yeah, it's just as much fun as moving a bunch of disks in an array group. Yeah. Obviously, there is metadata on the device itself, which kind of identifies the... Yes. Some of the information that is in some of these guides and stuff I mentioned, that reference there, talks about how metadata is laid out. Some of the identity information is actually built in at the hardware level as well, so there are IDs that could be used. But it's kind of a combination of both of those, along with the population rules that are server-specific. And so like I said, if you have a spare copy of the same server that you're pulling the DIMMs out of, chances are pretty good you can repopulate your data. Otherwise, take backups a lot.
Any other questions? All right. Well, thank you very much. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference,
visit storagedeveloper.org.
