Storage Developer Conference - #25: The SNIA NVM Programming Model
Episode Date: November 3, 2016...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 25.
Today we hear from Doug Voigt, Distinguished Technologist with Hewlett Packard Enterprise,
as he presents the SNIA NVM programming model from the 2016 Storage Developer Conference.
So today I'm going to describe the SNIA NVM programming model.
I'm going to cover what it is, why we've done this work, and a number of the implications.
This is primarily an overview, maybe a little bit of a tutorial on the programming model.
It's got implications in terms of how you access data.
That's the map and sync concept.
Questions about how to manipulate and manage pointers,
about atomic operations of a slightly different sort than we're mostly accustomed to,
and a little bit about exception handling.
So some of the basic interesting aspects
that the programming model surfaces.
And then a little bit about ongoing work
in the area of persistent memory data structures.
This is the area that I believe Andy Rudoff's talk earlier today focused on. And also in the area of high availability.
So I'm going to start with the why. I think a number of people are aware of this. As we
have moved through flash technology, and now we're starting to see technologies beyond
flash, latencies for persistent memory technologies, or any kind of non-volatile memory technologies, are coming down.
This slide is meant to represent that.
These purple bars are showing sort of abstract ranges of latency for different technologies, such as hard disks, SSDs, perhaps with SATA, and
then NVMe reducing the minimum latency somewhat more, and then persistent memory, which is
a bit lower still.
Now, I've got this range here that's between 200 nanoseconds and 2 microseconds.
That's kind of an interesting range
that causes some disruption in the way data is accessed.
And the reason is because if you are accessing data
and you know that it's going to take something
like a couple microseconds or more,
then you are probably going to context switch.
You're probably going to start the data access
and then let your process block, go away for a while, and then be woken by an interrupt. You come back and your data access is complete.
On the other hand, if you are under, let's say, 200 nanoseconds, and these are approximate
numbers that are pretty architecture specific, but sort of rules of thumb, then you're
doing a memory access. You're in kind of the non-uniform memory access region,
where it's okay to allow a processor instruction to access the memory instead
of waiting for what would have been previously an I/O, basically.
We'll talk some more about the specific implications of that,
but that means that there's this kind of interesting middle ground
between a couple hundred nanoseconds and a couple microseconds,
where the technology is not quite as fast as memory,
so you might not want to stall the processor's data access pipeline during the access.
But it's also faster than what you'd probably wait for with a context switch.
And in this range, you've got some choices.
You might, for example, poll for completion, because by the time you got around to completing
a context switch, the data would already be there.
So the point is that as you descend through this range,
which is what current technologies are now actually evidencing or doing,
you have some choices to make about how you access your data and whether you view it as storage or as memory.
So this was the significant motivation for the NVM programming model, because if you're
going to access data a little differently, if you're going to access persistent memory
using processor instructions, that's a significant difference to an application, and it will take some period of time for applications
to adopt that style of accessing data stored in persistent memory.
So we wanted to start paving the way for that, to develop an ecosystem that emerges over
time, has a lot of components and a lot of software that can tap that ecosystem
so as to drive application acceptance, basically,
of this type of new programming model.
So to talk more about persistent memory specifically,
there are a couple of types of NVDIMMs.
Really, this slide should be labeled NVDIMMs.
And I believe there are other presentations and there have been in other conferences about
NVDIMMs specifically.
But some NVDIMMs (although they all go in memory slots, they are DIMMs) are actually kind of disk-like, even though they're attached to the memory bus.
You know, for example, you might use the memory interface to send a command to flash on the DIMM
and then wait for that command to complete, but still using a memory channel to access it.
So those are the sort of disk-like NVDIMMs,
and they're really block storage from the point of view
of an application. On the other hand, there are other memory DIMMs that are NVDIMMs that
are memory-like. Those do literally appear as memory to applications, and those are the
ones where data is stored directly in byte-addressable memory, and you don't really have any I/O, and you
don't necessarily have a traditional DMA as a kind of bulk data mover involved.
So we actually started work on the programming model back in 2012.
We're currently on revision 1.1 since mid last year and we're now pushing to complete
the content for version 1.2. I'll talk a little bit later about some of the things that we're
expecting to see in there. The programming model includes both block storage and persistent
memory. There are some features that have been developed in
the industry and have been standardized in other standards groups, in some cases, that
are not yet exposed to applications in a uniform way.
So the programming model spends some time elaborating on those features and how they
can be presented to applications. And those specifically include atomicity, block storage atomicity,
what kind of capability and granularity do you have with particular block storage?
Because an application may care about that.
And certain aspects of thin provisioning management.
So what's happening here is that we're embellishing the existing application view without really
perturbing it very much.
On the other hand, when you're using persistent memory, we recommend that you use memory-mapped
files, which is not brand new.
The technique has been around for a while, but when you use it to access persistent memory,
it works particularly well. So it's
an existing abstraction and we feel that it can act as a bridge for applications as these
technologies emerge to allow applications to use them sooner than they otherwise might
have. There are already some open source implementations available of this type of memory-mapped file
using persistent memory.
And this is a programming model as opposed to an API.
The reason we did that, the reason we call it that is because we need consistent enough
behavior across software and hardware, and some of that software is operating system
software, to create an ecosystem that people can depend on
in terms of the behavior of the ecosystem.
But we don't expect all implementations
to have the exact same API.
We want OSs, for example, to be able to present their APIs
in ways that are natural for those OSs.
So that's why we focus on our programming model.
It describes behavior in terms of attributes and actions,
and then we illustrate them as use cases.
And then what happens is to deploy the programming model,
an API will take specific calls, for example,
and say this is our implementation
of the NVM programming model's action or attribute
that corresponds to that API call, right?
So they kind of map, you know,
people can map their APIs then to the programming model.
Let's see.
I think I've really kind of covered this pretty well probably for this audience.
You know, I've found, though, that sometimes when I talk about this difference between
the traditional block and file access methods using I/O, compared with what the
programming model calls the volume and persistent memory modes that use processor
instructions such as load and store, I need to spend some time making sure people
understand what the difference really is. Because when you're doing storage access today
using I/O, you're usually reading or writing data using RAM buffers. You're copying data from a buffer to a storage device or vice versa.
And your software controls how it's going to wait.
It may context switch or poll, as we were talking about earlier,
but the software can make that decision on its own.
And then when the command is done, the status is explicitly checked.
It's returned and checked by the
software.
You get a response code, an error code, that sort of thing.
On the other hand, when you're using load or store instructions, and these are proxies
for any instruction that accesses memory, a processor instruction, the data is generally
being loaded or stored into and out of processor registers as opposed
to RAM.
RAM is on one side the target of the load or store, and on the other side is usually some kind of register inside the processor.
And the processor makes the software wait for the data during the instruction.
Once the instruction starts, the application doesn't have any choice, right?
The processor is executing the instruction.
So it doesn't decide whether to wait or how to wait.
It's stuck at that point.
That thread is stalled until the instruction completes.
So that's a significant difference.
And the other significant difference is that there's no status returned when you access persistent memory.
If there's an error, you'll get an exception. So you can't just check the status. In normal
circumstances, you do the memory access, it finishes and nothing happens, right? Nothing
abnormal happens. You just keep going. You don't really check anything. If something goes wrong with the persistent memory
access, then you get an exception, which certainly there are plenty of things that generate exceptions,
but that is a change from an application point of view because now it means it went to store
something in persistent memory and suddenly it got an exception and it now has to decide
where it was that the exception occurred and how to recover from it.
So that's why exception handling, error handling, is significantly different when you're using persistent memory in the programming model.
So drilling in now on details of the programming model when used for persistent memory.
This is a slightly modified, you know, typical IO stack with an application and a file system
and a driver.
There are two layers that are specified by the programming model in this picture.
They're represented by these red squiggly lines, which has kind of become a trademark
for our illustrations from the programming model.
The lower one we call the volume, and really its purpose is to describe where the ranges
of persistent memory are.
That's its primary function.
It does assist in that way in discovery of the memory
and some information about atomicity of the ranges of memory.
And that can be used from inside the kernel
by any sort of persistent-memory-aware kernel module.
So it can call the PM-capable driver and say,
where are the ranges of persistent memory,
and perhaps what are they called?
Are they structured?
What volumes are they?
Where's this particular volume of persistent memory?
And it gets a group of memory ranges back
and it can then access those memory ranges
knowing that they are persistent memory.
One of the consumers of that, then,
would be a persistent memory-aware file system
where it presents typical file system functionality
with the map and sync commands, which already exist, implemented.
And when you do the map, it's actually called mmap in Linux or MapViewOfFile in Windows,
what happens is the application can now see the range of persistent memory in its address space,
its normal variable address space. That's represented by this line here,
so that once it's memory mapped, the application can do load and store instructions
and, through the processor's MMU, get directly at the persistent memory without going into
the kernel.
Now, with map and sync as they exist currently, the sync can go into the kernel, but we have
an alternative that we call optimized flush, where the intent is to potentially avoid
going into the kernel.
So that's the basic kind of layout of the persistent memory
part of the NVM programming model.
So I've already been talking quite a bit about map and sync. So this is already, I think, a fairly familiar subject.
The map does this association of memory addresses with a file that has previously been opened.
So that now it's in the map.
The caller may request specific addresses,
and that gets into the question of how to deal with pointers,
which is kind of an interesting question with a couple of
options that are not standardized.
It's not like everybody's chosen one option or the other.
The sync command then makes sure when you're using persistent memory
that the CPU's cache, which is generally considered to be volatile, is flushed for the indicated
range. Not just the cache, but anything that may be storing data in a volatile way in the
pipeline to persistent memory. So the intent of the sync is to assure that the data has made it to persistence.
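To make that concrete, here is a minimal sketch of the map-and-sync flow on a POSIX system, assuming a persistent-memory-aware file system mounted at the hypothetical path /mnt/pmem; the file name and the 4 KiB size are illustrative only.

```c
/* Minimal sketch: open a file on a PM-aware file system, map it,
 * store through the mapping, then sync the range to persistence. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem/example", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) != 0) { perror("ftruncate"); return 1; }

    /* Map: the file now appears in the application's address space. */
    char *pmem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (pmem == MAP_FAILED) { perror("mmap"); return 1; }

    /* Ordinary store instructions now reach the persistent memory
     * through the MMU, without entering the kernel. */
    strcpy(pmem, "hello, persistent memory");

    /* Sync: ensure the indicated range is flushed out of the volatile
     * CPU caches and buffers to the persistence domain. */
    if (msync(pmem, 4096, MS_SYNC) != 0) { perror("msync"); return 1; }

    munmap(pmem, 4096);
    close(fd);
    return 0;
}
```

Note that the store itself never enters the kernel; only the msync does, which is exactly the cost the optimized flush discussed below tries to avoid.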
Now, there are several ways to do that. One of them is to literally flush the caches and any
buffers in between in the pipeline through the CPU to the memory. Another way is to have sufficient power when the line voltage is lost,
essentially after an external power failure,
to ensure that any data that was in the pipeline
is flushed to persistent memory before the power goes completely away
within the processor subsystem itself.
So in some cases, it may be that you can take advantage of that sort of flush-on-fail capability
and put less emphasis on the sync.
In other cases, maybe your system doesn't have that capability,
and there are several other possible reasons why you might still want to do a sync.
So there are a couple of options there.
And that's one of the new things that we're working into the programming model:
this difference between a flush-on-demand and a flush-on-fail.
We may not use those particular terms,
but we're starting to work towards incorporating that concept
into the programming model in version 1.2.
Then we, as I mentioned, added the optimized flush.
It has two differences.
The idea is that the sync as originally specified in POSIX,
for example, has just one address range that you're
supposed to ensure is persistent.
The optimized flush has multiple ranges, and it's intended to be able
to execute from user space. The optimized flush and verify was conceived to be similar
to a kind of disk write and verify where you go to extra effort to ensure that the data
has made it to the persistent memory and that any errors that may have occurred in that process are fully reported or create exceptions before the sync or optimized flush
is complete.
So that's to give you some additional assurance that the data has in fact made it all the
way to persistent memory without any errors. The interesting thing about the sync is that it does not
guarantee complete order of operations that occurred prior
to the sync.
It's got a very specific limited guarantee.
If you can picture what's happening, the sync's job is to
make sure that the CPU's cache is flushed to persistent
memory.
The CPU is managing
its own cache all along and that cache may have filled before you even attempted a sync.
So the CPU may have flushed some of your data out earlier. It may have been even before
you even started the sync. So the only guarantee is that when you sync or when you use an optimized flush, the data
in the indicated range or ranges has been flushed to persistent memory before the sync
completes.
But you don't know in what order it was flushed to persistent memory.
And that's kind of an interesting dichotomy, because it means that you have to be very
careful about what you assume the sync gives you.
It only gives you precisely this guarantee.
But on the other hand, that does give you the flexibility to manage a cache to improve the performance of your read and write pipeline
when there are no syncs happening. So that's an important performance acceleration
that you want to selectively force to some degree
when you need to get specific data
all the way to persistent memory.
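As one concrete point of comparison, PMDK's libpmem provides a user-space flush roughly along the lines of the optimized flush described above. This is a sketch only, assuming libpmem is installed and the file lives on persistent memory exposed through a DAX-capable file system at the hypothetical path /mnt/pmem:

```c
/* Sketch of a user-space "optimized flush" using PMDK's libpmem
 * (assumes libpmem is installed; link with -lpmem). */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Map (and create if needed) a 4 KiB file on persistent memory. */
    char *addr = pmem_map_file("/mnt/pmem/example", 4096,
                               PMEM_FILE_CREATE, 0644,
                               &mapped_len, &is_pmem);
    if (addr == NULL) { perror("pmem_map_file"); return 1; }

    strcpy(addr, "flushed from user space");

    if (is_pmem)
        /* True persistent memory: flush the CPU caches from user
         * space, with no kernel call. */
        pmem_persist(addr, strlen(addr) + 1);
    else
        /* Fallback for non-PM mappings: a conventional msync. */
        pmem_msync(addr, strlen(addr) + 1);

    pmem_unmap(addr, mapped_len);
    return 0;
}
```

pmem_persist carries the same limited guarantee just discussed: the named range is persistent when the call returns, but nothing is implied about the order in which earlier stores reached the media.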
So one interesting thing is now,
okay, I've got data structures.
They're in persistent memory.
I have allocated them.
Maybe I have allocated a heap of space for variables in persistent memory.
And I can do a type of memory allocation, a type of malloc, a pmalloc, a persistent malloc, to get a range of persistent memory where I can
now store variables and data structures using normal assignment, right, in regular programs.
Now, if I do that, I can build data structures right in persistent memory.
But suppose, then, that I want to put in one data structure a pointer to another persistent
memory data structure.
How will that pointer work? Now, everything that you've memory mapped has a virtual address,
and normally the application uses the virtual address as a pointer to whatever data it's
trying to access. That's great, but it assumes that if you have, for example,
opened and memory mapped a file, and it's got some pointers to some data that's
either in that file or maybe even some other file,
and then you close the file and later open it again to use it some more,
that you'll get the same virtual addresses for the file that you got the last time.
And there are some systems, and some intentional design choices in some cases, that
either don't guarantee that you'll be able to get the same
virtual address, or that for security purposes ensure that
you'll get a different virtual address.
So, you know, that creates this interesting dichotomy to say,
okay, now if I've got pointers from one structure to another,
is it just a pointer?
Will I get that virtual address back?
Or does it have to be something relocatable?
Perhaps something like an offset from the start of a file
is what your pointer really is.
And that means that the pointer implicitly includes
some sort of reference to the file namespace, perhaps.
So now your compiler, when it's determining how do I dereference a pointer in one data structure,
it may have to say, okay, where's the current base address for the file that that pointer's in,
and what's the offset into that, rather than assuming that you always have the same virtual address.
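One common way to express that is a relocatable, offset-based pointer. The sketch below is illustrative only; the type and helper names are hypothetical, not from any particular library, and a real implementation also has to deal with pointers that cross files:

```c
/* Illustrative sketch of relocatable pointers: instead of storing a
 * virtual address, store an offset from the start of the mapped file
 * and rebase it against whatever address the mapping got this time. */
#include <stdint.h>

typedef struct {
    uint64_t offset;           /* byte offset within the mapped file */
} pm_relptr;

/* Convert a live virtual address into an offset for storing in PM. */
static inline pm_relptr pm_relptr_from(void *base, void *addr)
{
    pm_relptr p = { (uint64_t)((char *)addr - (char *)base) };
    return p;
}

/* Dereference: add the offset to the current base of the mapping. */
static inline void *pm_relptr_to(void *base, pm_relptr p)
{
    return (char *)base + p.offset;
}
```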
So the issue of pointers is interesting.
I think Andy spoke quite a bit about the NVM library, and I've got a reference to it later,
but I don't go deep into it because Andy did.
And these are some of the types of things that can be addressed inside that sort of library, to hide some of these issues from the application, you know,
so that it's less disruptive for the application to use persistent memory.
This is one example of that.
In fact, actually all three of my little dilemmas about persistent memory are things
that you can encapsulate inside a library like the NVM library and make it easier for applications as a result.
The next one is about atomicity, specifically failure atomicity. Now, you know, those of
us who have dealt with caches, non-volatile caches in disk arrays, are fairly familiar with reasoning
about failure atomicity.
But from the point of view of a processor system,
it's not really comprehended in exactly that way.
When we talk about atomicity in a processor,
it's normally trying to guarantee inter-process consistency,
such as what happens with symmetric multi-processing: if you write a piece of data in one process, when will
any other processes see that data, and how do those processes see
it in the correct order?
So that's the concept of atomicity
that's already understood well by processors,
but processors only provide limited atomicity
with respect to failure.
The proposition on failure is,
if I'm merrily writing along,
writing areas of persistent memory,
and suddenly my processor loses power,
my whole system loses power,
when the power comes back, what state will that persistent memory be in?
And will I see that some of my writes have completely completed
and some of them have not completed at all,
but none of them have partially completed, right?
It's another kind of atomicity, but it happens at failure, right?
At a power failure or perhaps even a hardware
failure.
All right.
So we've had to do a fair amount of work surrounding this concept of atomicity, to add it to the
inter-process concept of atomicity that was already well established. And it turns out that there are some failure atomicity properties
offered by some processor architectures,
and normally they apply to aligned fundamental data types.
Think of it as like an integer or a pointer,
so that when the processor is writing a cache line containing a pointer,
it actually guarantees, even across a power loss,
that the pointer will either be completely written or not written,
but never partially written.
So that's an example of failure atomicity
that you actually can get from some processors today,
but it's a little bit architecture specific
and not very well called out.
So I think I've covered, yeah, okay.
So what happens then is if you want to create atomicity
of larger data structures,
then what you end up doing is trying to leverage
the atomicity of a pointer or an integer
into a larger atomic action.
And that's easy to think of if what you've got is a pointer.
You can construct a new piece of data in free space, for example,
for perhaps a linked list or a tree or something like that,
that's not yet referenced by the rest of the data structure,
and then atomically update a pointer that starts referencing that piece of data that you've just constructed,
and as a result that data is atomically incorporated into the rest of the structure.
So that's just an example of how you can take a pointer or integer atomicity and convert
it into larger, arbitrary constructs of atomicity.
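A sketch of that pattern is shown below. The names are hypothetical, and the flush callback stands in for whichever sync or optimized flush mechanism is available; the only step that relies on the processor's failure atomicity is the aligned pointer store.

```c
/* Illustrative atomic-publish pattern: build a node in space that is
 * not yet reachable, persist it, then atomically swing one pointer. */
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

typedef void (*flush_fn)(const void *addr, size_t len);

void publish(struct node *_Atomic *list_head, struct node *new_node,
             flush_fn flush)
{
    /* 1. new_node's fields are assumed already filled in; make the
     *    node itself persistent before anything can point to it. */
    flush(new_node, sizeof(*new_node));

    /* 2. Atomically update the head pointer. An aligned pointer-sized
     *    store is the failure-atomic primitive discussed above. */
    atomic_store(list_head, new_node);

    /* 3. Persist the cache line holding the pointer; only now is the
     *    new node durably part of the structure. */
    flush((const void *)list_head, sizeof(*list_head));
}
```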
If you can't get failure atomicity from your processor or from your system, then
generally the fallback has been to have an additional checksum.
So what you do then is every time you write something, you write a record, let's say, and the reason I say that is because this is very
common in databases, you compute a checksum on the record. And if the checksum after a
power loss does not match, doesn't compute, then you know that that data was affected
during the power loss. It was partially written. It was torn during the power loss.
And now you have to have some way of recovering from that.
So this leads you essentially into the database school of transactions
as to how you recover when things like that happen.
So if you don't get atomicity from your processor architecture,
then the fallback is you have to add it yourself with some kind of additional checksum.
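A minimal sketch of that fallback is below; the record layout and the checksum function are illustrative (a real database would typically use a proper CRC), but the shape of the check is the same.

```c
/* Illustrative torn-write detection: every record carries a checksum
 * written with it; after a power loss, a mismatch means the record
 * was only partially written and must be recovered. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct record {
    uint64_t payload[7];
    uint64_t checksum;         /* covers payload only */
};

static uint64_t simple_sum(const uint64_t *data, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* On write: fill the payload, then append the checksum. */
void record_write(struct record *dst, const uint64_t payload[7])
{
    memcpy(dst->payload, payload, sizeof(dst->payload));
    dst->checksum = simple_sum(dst->payload, 7);
    /* (then flush the whole record to persistence, as above) */
}

/* On recovery: a mismatch indicates a torn write. */
bool record_is_intact(const struct record *r)
{
    return r->checksum == simple_sum(r->payload, 7);
}
```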
The other area I wanted to elaborate a little on is error handling.
This picture does get a little busy, but I think we
should start here with this little sequence of lines, right?
Traditionally, you would just get an exception from memory, and it would go back into the
interrupt handling system of the processor and be delivered to some particular process.
The problem that we have with error handling now is that there may be some recovery
that can occur inside the file system, in addition to the potential need for recovery inside the
application. So what we need is a registration process in which multiple parties, such as a
file system and an application, can register to receive the exception.
All right, and then when the machine exception occurs, at step two here,
those registrations are looked at,
and the exception is delivered to whichever party is registered for it.
So that, for example, if you get a memory exception and you're working with a PM
aware file system that has some means of recovering from that
type of exception on its own, you give it a chance to recover from the exception. And depending on
how that recovery worked, you know, whether it worked, you may or may not need to notify the
application that the recovery occurred. So there are two levels where you
have to deal with exception handling.
One of them is perhaps within something like a PM-aware file system,
or if you're doing some kind of redundancy within that layer.
And the other one is farther up where the application may have to be notified. And then we get into some details
of exactly how processors handle memory errors.
And there are three properties involved there.
An exception is contained
if we know the exact memory location
that generated the exception.
It's precise if we know
that the instruction execution
can be resumed from the point of the exception.
And it's live if we know that we can resume from that point
without doing a restart.
So it takes actually somewhat special handling of exceptions.
And depending on the processor,
it may not have this type of exception handling,
in which case every time you get a memory exception,
you'll perhaps always get a restart.
And then that, once again,
affects how applications recover from errors
because the exception comes along
essentially with an application restart
and special processing
that has to be done during the restart.
So there are dependencies to get graceful error handling.
There are dependencies on the processor itself in order to make that easier.
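As one hedged illustration of the application-side half of that registration: on Linux, an uncorrectable memory error on a mapped persistent-memory page is typically delivered to the process as a SIGBUS, so the application registers a handler for it; the details vary by operating system and processor, and the handler below is a sketch only.

```c
/* Minimal sketch (Linux-specific assumption): register a SIGBUS handler
 * so the application is told when a mapped address takes a memory error. */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <unistd.h>

static void pm_error_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx; (void)info;
    /* info->si_addr identifies the faulting location; a real handler
     * would record it (in an async-signal-safe way) and decide whether
     * the containing data structure can be rebuilt.  Here we simply
     * treat the error as unrecoverable and restart. */
    static const char msg[] = "persistent memory error, restarting\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(1);
}

/* Call once at startup, after the persistent memory has been mapped. */
int register_pm_error_handler(void)
{
    struct sigaction sa = { 0 };
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = pm_error_handler;
    sa.sa_flags = SA_SIGINFO;
    return sigaction(SIGBUS, &sa, NULL);
}
```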
Yeah?
So 1, 2, and 3, are they ongoing, or are they triggered by events?
OK.
Triggered by events.
The one caveat is item 1 is kind of registration for the event.
So that would happen as a file system starts up
or something like that.
But then items 2 and 3 are triggered by the event.
Yeah.
Yeah.
[Audience question, inaudible.]
You have to restart if the error was not live or if it was not precise.
If the error was precise and live but not contained, then there may be some ways that you can avoid the restart,
but they require so much overhead
that you might end up restarting anyway.
So pragmatically, if you don't get all these three properties,
you're probably going to have to restart.
So that's a picture of what the programming model is, how it works for persistent memory, and
some of the interesting aspects where applications need some help in order to do this well. So in ongoing work, I now refer specifically to some of the
NVM library work that Andy talked about this morning.
And that occurs up here.
So the programming model sits right here between
the application and the file system, right, the PM-aware file
system; it would be right in this boundary.
The library, the NVM library,
which allows you to create persistent memory data structures,
sits above the NVM programming model.
And it uses the programming model so that that library can operate
on multiple operating systems and with different kinds of hardware in a consistent manner.
But then the library goes to the effort of hiding some of those
difficult scenarios from the application.
So that's how the library makes it simpler for applications to use the programming model:
by encapsulating these things as, essentially, persistent memory data structure libraries.
We have completed an atomicity white paper about persistent memory libraries and their
potential transactional nature. The NVM library is an example of one of these,
and that white paper is in final review at this point.
So we're hoping within a month or so we'll have it released.
And it's tied to the NVM library as an example.
There are other ways to do it, basically, and in the future there may be other libraries,
but the NVM library is one example.
The other area where we've had ongoing work,
and we released the white paper earlier this year,
is in high availability.
If you're storing your persistent data in persistent memory,
you might want some of those same features that you have with storage solutions such as redundancy, RAID, basically some kind of
high availability solution.
In order to do that and avoid, for example, all single points of failure, you have to
get the data to another server, let's say another node that's separate.
And the way we envision doing that
is to use remote direct memory access as the protocol.
We view it as likely to be the lowest latency protocol
for that purpose.
But what we've found as we start drilling into that
is that today's remote direct memory access protocol lacks some optimization for accessing persistent memory.
In particular, it's actually difficult to know when the data becomes persistent on a remote node when you're using RDMA. So that has led to an interesting white paper on the subject and some additional work across
the industry in the RDMA community to say, you know, how can we resolve this problem
and get, essentially, mirroring of persistent memory across servers using RDMA to be as low latency as possible,
while assuring remote durability
and appropriate error handling, right?
So a number of those same issues now start to come up again
when you talk about how you remotely access persistent memory.
And there's quite a bit of presentation
tomorrow and Tuesday on both of these two topics:
on experiences with the NVM library, and on how we are working towards evolving
persistent memory access using RDMA, and some of the interesting technical aspects that come up in that problem.
So trying to stand back a little bit, especially in the sort of persistent memory library area,
I like to model this as a kind of a journey from where we have been, and for the most part still are,
towards being able to fully use persistent memory. And a
first step, and some of this has already been happening, you know, some in open source and
some elsewhere, is to create a persistent memory aware file system that
can run faster because it knows it's using persistent memory even if the application
is still just using it as a file system.
Another step in that journey is what we're seeing with the NVM library, where the application
inserts a library that manages persistent data structures for
it, and that library is aware of persistent memory and uses a persistent-memory-aware
file system.
The final stage of this evolution which is probably still years away although there's
been some experimentation on this is suppose your compiler is aware of persistent memory and your application
can easily use it directly because the work that used to be done in the persistent memory
library is now built directly into the language that your compiler is using. There are some
people at HP Labs and some people in Oracle who have done experiments
on how would you extend the language in order to do that, and there are some prototypes
of that.
But that type of innovation in language, and usually today it's happening in C or C++,
those take a long time to kind of mature and become standardized and ultimately adopted.
So I view this as a sort of dual stack scenario where you'll have
applications that don't understand persistent memory and use block
access, regardless of whether they're accessing a disk drive or even persistent memory; you
can still access it like a RAM disk. So a lot of applications may never move from that.
On the other hand, applications to get maximum benefit from persistent memory can evolve
into a persistent memory programming model domain, use libraries or language extensions
ultimately to take full advantage of it.
And in fact, even if they don't have persistent memory, there are ways of coordinating RAM
access with disk drives, obviously much slower than persistent memory, but would still allow
you to use the memory map model. So it's a classic dual stack scenario where you have
PM-aware applications and some that aren't, and you have some systems that have persistent
memory and some that don't. And all of those permutations can be made to work, you know, but to different levels of performance, obviously.
Let's see, I think I've already covered this.
And here's a specific detail on that remote access work.
Checking over this slide, I want to highlight this part up here.
What you start to discover, especially when you're dealing with remote access,
is that because of the somewhat looser ordering constraints that you can get from
the sync semantics, you may have to recover to a recent consistency point rather than to the exact
load or store instruction, or store instruction specifically,
where the exception or error occurred, all right? So for that reason, inside the white paper on remote access, we go into a remote access taxonomy of different types
of remote access systems and recoverability requirements, which then really kind of drive
the requirements for how an RDMA, for example, has to behave in order to ensure that ultimately the application can recover from a failure
and get the kind of consistency that it needs after the failure.
So that work, the white paper, generates requirements, essentially, and some modeling for remote access, in particular remote flush, in effect,
in order to make sure that everything volatile in the chain is flushed out.
And as a result, as I mentioned, there are multiple parties in the industry,
including the OpenFabrics Alliance, the InfiniBand Trade Association, and several vendors, who are now looking at this problem and starting to do work on how to optimize
RDMA for persistent memory to solve these problems.
Yeah.
Chris.
When a node fails, even if the data is flushed, PM is not accessible, right?
Yeah, so, you know, you have to think in terms of a...
Okay, the question is, when a node fails,
even if the data is flushed,
the persistent memory in the node is no longer accessible,
at least not at that point.
That's right.
So now you have to start the line of reasoning that you have
when you've got, let's say,
a disk array that has redundancy as a storage analog, right, and you've got application
failover as well.
You can say, okay, I have a system here that can tolerate one failure.
If that failure was a whole server, I've lost one whole copy of my persistent memory, but
I can continue operating with the other one, you know, as long as I've only had one failure.
And then when this one is repaired, I will need to reestablish my redundancy, right?
Or if I lost a piece of my persistent memory,
I may be able to just go to the other, still functioning, server and restore that.
All right.
So, you know, yes, you get into, you know, all of those permutations of, you know, media failure and server failure with failover.
But those are very familiar.
Yeah.
Yeah, their storage does that, right?
Let's see.
It depends on a lot of things.
I think, okay, thank you.
The question is whether you can take persistent memory out of one system and put it in another.
You know, it's in some sense removable.
So let me go through the stack of obstacles. The first is today persistent memory usually appears as NVDIMMs and they're not necessarily
accessible from the edge of a server. They're inside it. So you've got to pull out a server board and manipulate the DIMMs, right?
So it's not quite the same type of repair as you would get with removable media.
Okay, so that said, if you either fix that or ignore that, you say, okay, you know,
it's likely that, you know, your data is not necessarily constrained to one DIMM, right?
You probably are going to have to move a group.
And then you're going to have to move the configuration information
that tells the system how to present that group
in such a way that the application would recognize it.
So if you do enough of those things, yes,
you could probably get that same kind of volume mobility
that you can get from disk arrays.
But actually, the constraints with disk arrays are ultimately similar;
you kind of have to move everything, right?
So, yeah, I think we will get there in terms of solutions,
but I don't think we've gone very far down that path up to this point.
Yeah, a question.
Is there any work being done on plug-in ability on the physical level,
like connectors or something?
Even if I had a failover, I'd still get my system back eventually without having to take it apart.
Yeah.
So the question is whether there's any work being done on hot plug, let's say memory hot plug, right?
It's essentially what it boils down to, so you get that same kind of removability benefit.
It's all proprietary.
So I'd say that there is some work going on there, but I don't know of any that's specifically
memory in a standard environment. So the question is that, you know, it sounds like in this environment when you get a failure
you may have to backtrack is what I call it.
You may have to actually recover from some, you know, historical point, you know, or state
of the system rather than an instantaneously current one and then perhaps roll forward
or move forward from that state.
And yes, that's right.
And that's why we talk so much about transactions in the persistent memory context because basically
as soon as you try and do that sort of thing, you're going to want some kind of transactional
model. Perhaps a lightweight one would work well, and the NVM library provides that sort of thing, right? So yes, you're
going to end up recovering from a recent consistency point, especially if you're in a high availability
situation, and you're probably going to want a transaction formalism as a way of doing
that. So, you know, the bad news is, yes, that means applications
would have to pay attention to it. The good news is that there is actually a lot of prior work done
on transactions. We basically know how to do them. So we just have to apply that.
Any other questions? I think that's it.
All right. Thanks.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further
with your peers in the developer community.
For additional information about the Storage Developer Conference, visit storagedeveloper.org.