Storage Developer Conference - #1: Preparing Applications for Persistent Memory

Episode Date: April 4, 2016

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast. You are listening to SDC Podcast Episode 1. Today we hear from Doug Voigt, Distinguished Technologist with HPE, as he presents Preparing Applications for Persistent Memory from the 2015 Storage Developer Conference.
Starting point is 00:00:50 Probably most of you, or perhaps all of you, went to Andy Rudoff's talk. I'm one of Andy's partners in crime on the NVM programming model stuff. You know, persistent memory is actually a theme for this year's SDC, and so there are a number of talks about it. I'll mention a range of them at the end. And I've looked at some of them. There's a little bit of overlap, but we each take a slightly different perspective. And when we drill down, we drill down in different ways.
Starting point is 00:01:20 So even though a lot of the boilerplate and the topic is the same, there's a lot of information from the various talks that you have available. I'll start with this slide, a variation on a slide that you've seen Jim Handy use this slide. It may have shown up a couple times. But this is how I like to introduce the inflection point, the disruption that we're talking about. And I think if you were at the keynotes this morning, I don't really need to go into this much more. But just to make sure, I kind of leveled the playing field here. For a long time in large, very large, super mini computers and mainframes, it's been acceptable in NUMA, non-uniform memory access systems,
Starting point is 00:02:14 to tolerate latencies of up to about 200 nanoseconds in a large memory system. Of course, some memory is an order of magnitude faster than that, but when you put it in some kind of a memory fabric or network in a large or mainframe type computer, it can get up into that range. So that's been kind of acceptable for a long time, regardless of the persistent memory technology. And then there's another threshold that I put around two microseconds, although it is processor architecture, processor technology specific. As to, you know, if you've asked for something like an I.O.,
Starting point is 00:02:56 you know, something that you think is going to take a little bit of time, at what point will you wish that you had context switched? You can sit there and wait for it and block one core or one thread, waiting for something that you think is going to be pretty quick. But if it actually takes longer, you'll wish you'd context switched. Or if you made the wrong decision the other way, you'll say, darn, I just context switched and it's already done. So that's this threshold here.
Starting point is 00:03:24 When I talk about context switch, people hopefully realize that that's when you give up the processor to wait for an I.O. or something like that. So in my view, this is a key disruptor when persistence comes below this range. Because that's where you'll say, well, not only do I not need to do a context switch and I'm willing to give up one core
Starting point is 00:03:52 while waiting for this thing to finish, but I'm actually willing to block my whole right pipeline while waiting for this thing to finish, for example. That's when it's down in the NUMA range. And when you're in between, and I think we will see some technologies that are in between, that's an interesting trade-off.
Starting point is 00:04:12 Because it's kind of a somewhat more unknown territory when you're kind of in that middle range. So I think people can perhaps see how this makes a lot of difference to the way you might write an application if you're going to try to get full benefit from persistent memory
Starting point is 00:04:30 that is down in this sort of lower range. So I think people realize that's the motivation here. That's why we think there's a disruption. And in response to that that we've realized that, this was also in the keynotes, that there's some work that applications might need to do to get more and more benefit. This is where Andy talked about moving up the stack, right, getting more disruptive for higher return. So to realizing that, we've been working on the programming model, and this is just a quick rundown of where we're at on the CNVM programming model. We got a version 1.1 out this year.
Starting point is 00:05:17 It contains content about both block and file, and with or without persistent memory. And here are just a few of the things that are addressed in the specification. And then it makes a strong case or drives really this idea that memory map files are a good way to use to access persistent memory, at least as a bridge until other methods perhaps are developed. But even so, it's a way for applications in the context of almost all the existing infrastructure to get that direct access to persistent memory. And this is the same thing that Andy was talking about. And he mentioned the idea of a programming model as opposed to an API.
Starting point is 00:06:11 So I think people understand that. Still going through the basic background, the same picture that Andy used, with the same reasoning. The behavioral model of the programming model is at these squiggly lines. And here I've got some additional notes about the kinds of things that are now in the persistent memory part of the programming model. This volume part is a kind of discovery abstraction.
Starting point is 00:06:42 It's the way the kernel would say, tell me about the persistent memory capacity that you have and where it is. That's what this is for. And then this says, allow someone to memory map the persistent memory based on the authorization that they have. So I think this has been reasonably well covered, especially if people have gone to the keynote. So going a bit further into what the programming model says and what we further have discovered while working through a whole bunch of use cases
Starting point is 00:07:22 and some boundary conditions and stuff like that that you run into when you try to do this thing that seems relatively straightforward, perhaps at first. First of all, drilling a little bit into the map and sync paradigm, this is a pre-existing method of doing memory map files. It works better with persistent memory because the old way, when you did a sync, that's when you would write your variables from DRAM out to the disk after having memory mapped the
Starting point is 00:07:49 DRAM. But now with persistent memory, you memory map the persistent memory and you just have to make sure that your data is flushed out of the processor's volatile caches in order to make sure that it is actually persistent. So the mapping associates memory addresses
Starting point is 00:08:06 with the persistent memory that is where your files are stored. The sync ensures that modifications become persistent, but now we run into the first catch, which is that the sync does not preserve order. The processor has kind of a will of its own about its caches. It thinks it owns them. And if they get full, it might start flushing things out of its caches, you know, at will in order to continue operating with the high performance that it, that's the purpose of it having a cache. You know, so just because you haven't said, I want to sync something yet that
Starting point is 00:08:43 I've written, you know, it doesn't mean that it hasn't been written. It may have been written. But other things in this block of memory that you say you want to sync may not have been written yet so they get written during the sync. So you cannot assume that you have any sort of order guarantee
Starting point is 00:08:59 as a result of doing sync. All that the sync semantic says is by the time this sync is over, all of the writes that you've issued in the address range of the sync have made it to persistent memory. That's the whole semantic, right? So that's something to be very careful with,
Starting point is 00:09:19 and I'll show you an example that kind of toys with this order question a little bit more. So that's something to be very aware of. There are proposals out there for alternative sinks that do more, and those are all kind of in the research domain pretty much at this point, and some have been prototyped, and there's not really some sort of industry conclusion about what other sync behaviors and what they should look like other than some of the things I'll say later that are also in Andy's persistent memory library. So another area, and Andy has sometimes touched on this in his presentations,
Starting point is 00:10:06 but I don't think he mentioned it much today, is the question of pointers. So you've got a data structure. It's like any other data structure except that it's being stored in persistent memory. That's kind of the thesis here. And it may have references to other data structures. And in theory, they could be anywhere. So how do you actually make the reference? And there have been sort of two schools of thought about that.
Starting point is 00:10:35 One school of thought says that once you've got universal memory, what you want is one huge global virtual address space so that everything you could ever want to access is in one virtual address space, and every time you go to access something, its virtual address is the same. So you can actually use a virtual address in a persistent memory data structure to point to some other persistent memory data structure, regardless of where it is. So that's one school of thought, is you want a global virtual
Starting point is 00:11:07 address space. There are some current practices that get in the way of that. It turns out that with current operating systems, you don't always necessarily get the same virtual address when you go to memory map something. In fact, there are security related practices that intentionally mix up the virtual addresses. So this all kind of turns into some controversy. The other school
Starting point is 00:11:33 of thought is that you should always use an offset from a relocatable base. So if I memory mapped a file and the file has a bunch of data structures in it, then maybe all of my pointers amongst those data structures should be basically offsets relative to the file has a bunch of data structures in it, then maybe all of my pointers amongst those data structures should be basically offsets relative to the file I'm in. Or you can imagine maybe that's not quite enough, but that sort of gives you the idea of this sort of relocatable base. Continuing through the laundry list, we've done a lot of investigation into failure atomicity. And here's the thing about atomicity.
Starting point is 00:12:16 You might say, well, gee, don't we already have atomicity? We have compare and swap. We have various types of atomic operations but the problem is that historically since we've never had persistent memory all of those atomicity features and even the ones that are built into the processor workflow those are about making sure that processes have a consistent view all the processes have a consistent view all the processes have a consistent view
Starting point is 00:12:46 of all the data so that's actually failure, the atomicity that you can get today is actually about inter-process consistency and not about making sure that your persistent memory was written atomically
Starting point is 00:13:01 so that if you get a power fail at some instant, when you come back, the thing that you wrote, like a pointer that you thought you were writing atomically so that it's either the new pointer or the old pointer, you want that proposition when you come back from a power loss, which has to do with exactly how that variable was written to memory, which never mattered before,
Starting point is 00:13:29 because if you had a power loss before, it would be gone. So there's actually a lot of logic that's starting to emerge about how to achieve failure atomicity as opposed to inter-processed atomicity. And it turns out that the techniques available for getting failure atomicity are process architecture specific. Some processors might guarantee that a cache line or something like that is fully written before they will stop writing in the event of a power loss.
Starting point is 00:14:06 Logic like that gets into the mix. So the phrase that we've started using in the spec is atomicity of fundamental data types, which is actually the most common convention. And it's one that is built into the C language. There's a way for you to declare that a variable update should be atomic with respect to how it's written, up to some fundamental data type.
Starting point is 00:14:32 Fundamental data types are like integers and pointers and basic things like that. So that's kind of the building block that you have for Adamicity. And it really has nothing to do with when you did a sink. It has everything to do with how the data got written to memory by the processor whenever it was written. And then finally, the other area is exception handling.
Starting point is 00:14:59 Of course, if you're using persistent memory, you no longer get a status. You no longer are doing a command with the data transfer and a response. You're just writing some data, let's say. And if there's a failure, you get an exception. And it turns out there's a lot to how exceptions are handled on different processors. If you have a low-end processor, you may have no opportunity to actually intercept that exception and do something about it without basically rebooting, restarting the system. On other processors, they do a lot more for you and give you perhaps enough information to affect a recovery from an application point of view during an exception. So this has been another area where we've started doing a lot more work on exception handling.
Starting point is 00:15:52 And one of the things that comes into that work is the question of, you know, if I've lost something, some piece of data has become inaccessible, perhaps in its current location, but maybe I have some redundancy or I have some options. I can maybe restore a specific piece of data. But then the question is, how does that fit in with all the rest of the data
Starting point is 00:16:16 that was being written in some somewhat loose order? So this leads you into the potential for backtracking. In the event that my redundancy is not quite up to date with where my processor currently is, you start having to account for the possibility that the application may have to do something like abort a transaction in order to use the restored data because of the way
Starting point is 00:16:45 it's managing its consistency. So these are some of the forefront issues that we've uncovered in the process of working through the spec. And several of these are continuing to get attention as we refine the spec and we're writing additional white papers about different aspects. So I go through this to give you a little bit more detail on one of the things Andy was saying is that we really can't afford to turn the whole industry into a bunch of deep system programmers. What we really need to do is make this easier for applications.
Starting point is 00:17:31 So there's one way to illustrate that, and this is, I think, consistent with the messages that we've talked about already this morning, is that we're on an interesting journey from this kind of a stack that everybody already knows and loves already this morning is that we're on an interesting journey from this kind of a stack that everybody already knows and loves to something where we have persistent memory available.
Starting point is 00:17:53 And the first and easiest thing to do is to accelerate the middleware. Since the file system was already written by file system programmers, they may know enough esoteric stuff to get some acceleration from persistent memory at their level, and then present their current functionality to a file system without modification from the abstraction point of view. So this is already happening. Actually, all of these, little pieces of all of these things are already happening. Actually all of these little pieces of all of these things are already happening.
Starting point is 00:18:26 But then the next step would be to say, okay, let's have a persistent memory library that encapsulates a whole bunch of those issues that I listed earlier and allows the application to ignore them because it's accessing a library that implements a persistent memory data structure that's native. It's fully aware of how persistent memory works. It gets full advantage of it, but it's encapsulated inside a data structure library, which may use persistent memory directly. It may still use the file system. The application can still use the file system.
Starting point is 00:19:00 So here we start to see the dual stack. This is a classic dual stack scenario. Some applications will never learn how to use persistent memory, so even in the end there's still a file system, for example. Now this is independent of the evolution to things like object storage and stuff. I'm not writing those off, I just kept it simple for this talk. It's a classic, just like the IPv4, IPv6 transition. It will take a long, long time. In fact, it will never actually finish. It's a classic dual-stack scenario.
Starting point is 00:19:37 I think it makes sense to anticipate it, thinking about it that way. Finally, what makes it really easy for applications is when the language is evolved to the point where you can do things like open a block of code with a declaration that you want the modifications that occurred inside that block of code to be atomic, all of them together.
Starting point is 00:20:01 And then the application can just go on and write its code the way it always did, having made this declaration. The compiler, perhaps in combination with some libraries, does what's needed to make that work to the application's expectation.
Starting point is 00:20:17 That's what's happening in the languages category. I think you can see how this is just another model of the horizons, the evolution that we've sort of started to embark on as we make it easier and easier for applications to get benefit from persistent memory. So I want to talk quite a bit more about what is this persistent memory library. Andy talked about it, he over viewed it. I'm going to drill into a couple of things and I'm going
Starting point is 00:20:53 to not back up right now. Thank you. This is the same picture that Andy used, one of them, where here's where we're talking about. We've got a PMWare file system. It can do memory mapping. We're sticking a library in right there. So let me give a really trivial example that I think will shed some light for people who don't already visualize what I'm talking about, about what's different
Starting point is 00:21:28 about a persistent memory data structure. This example is an append-only log, and there There is one of these inside Andy's PMEM library. And you may have allocated some fixed amount of persistent memory for the log, just to keep it simple. And part of it is filled. So here I have, let's say, some sort of integer or something that points out to what part of the log is filled. And the way I append something to the log is I create a new log entry in this free space. So that's this
Starting point is 00:22:08 work in progress here in the free part of the log. And since int hasn't changed yet, if I lose power or something this will just disappear because it was never committed. And then the act of committing is
Starting point is 00:22:23 bump my fill pointer. So now all of a sudden when I was done with my work in progress I bumped my fill pointer. And one of the tricks that I think we're going to see used more and more at least at first is to do a sync after having filled in my work in progress to make sure it's persistent. And to make sure it's persistent. And then knowing that it's persistent, have a section of code where I only update the pointer,
Starting point is 00:22:54 and then I sync again. So I know for sure that nothing happened between these two syncs other than this change. So I've now used two syncs to create a type of, to leverage the atomicity of my update to my field pointer into something bigger, which is a whole record in a log. So when you say sync,
Starting point is 00:23:19 you really mean cache line cache from a CPU? Yeah. Now, there may be fancier sinks, you know, but yes, that's basically the minimum. So, hopefully this gives people a concrete idea and although this is kind of
Starting point is 00:23:37 a lame example, it's real. You know, you could imagine building link lists and trees and hash tables that leverage this basic concept that says you want to do your work not in place. You want to do your work in a place that is not exposed and then do an update to a fundamental data type that commits it. To the extent that you can do that, this is like the oldest trick in the book. I'm an old array guy.
Starting point is 00:24:08 I've built embedded systems that have persistent memory for a long time in my cache. And this is how they work. So let's go further, though. So it's not that you always have to avoid updating in place. The other way to do it is to take a pre-image, the classic rollback scenario where you say, OK, I'm about to update something,
Starting point is 00:24:36 but I want to be able to abort a transaction that has this update in it. So I take a snapshot of my physical memory in this area before I start modifying it, and I record that to persistent memory. And now if I have a power loss before I commit, I can use that to set whatever work I was doing back to the way it was when it started. This is another old trick.
Starting point is 00:25:00 Databases have used this forever. So that's the other way to do it, but it may be more efficient when you can to avoid modifying in place using newly allocated place space, but that means that the allocation itself has to be in some sense transactional,
Starting point is 00:25:18 right? Because the trick I did in my simple example is the allocation was just a simple move, so both the allocation and the commit were in the same change to my integer. They were together. Now, if you're actually allocating new space,
Starting point is 00:25:34 it's not just an append-only log. You had to go get some new persistent memory space. It got allocated. There's some records about the allocation of the space, and those records now have to be tied to your transactional behavior. Because if you lose power, you can't
Starting point is 00:25:50 afford to have that space not get reclaimed. Somebody has to remember. So this is another thing that now gets wrapped up in the PMEM library. It understands how to do atomic allocation and it couples that allocation to the persistent memory data structures that it encapsulates. So another reason to use that type of approach. It also allows you to form groups of data structures. I talked earlier about the pointer problem, you know, and the way the approach that the persistent memory library uses there is to say okay I'm gonna have a pool of persistent memory
Starting point is 00:26:33 that I know about and I'm going to make sure that I know how to resolve pointer references within that pool. Alright so it may be a file, you know, that's the easiest way to think of it. It doesn't have to be only a file, but it's something that you can catalog, that you can manage with some notion of some common root. So you know that all of the references within this range of data structures are tied together through some root and so other addresses can be resolved. So that's another thing that's encapsulated inside the persistent memory library. And then finally, it's great that I have
Starting point is 00:27:12 modifications to individual data structures, atomic, and I may be able to do that without some kind of formal transaction, such as I did with my append-only log. But then I want to write a lot of programs and manipulate multiple data structures atomically together. So I still need something larger to track scenarios where I've done that. So several persistent memory data structures
Starting point is 00:27:37 implemented by the library that are each atomic in their own right can now be made atomic together by participating in a transaction that is also known to and built into the persistent memory library. What that transaction does is it takes care if you did need to modify something in place you can declare that and it will take care of your pre-images and rollbacks for you and it takes care of when to do your syncs or flushes to get into the persistence domain at the right time. So this is how the persistent memory library makes it easier for applications, right?
Starting point is 00:28:12 It's by taking over that group of issues. So use that. That kind of gives you a clue about where we think we're headed. Here's a pointer to where the persistent library exists in open source. And I just kind of summarized, and Andy already flashed up a more complete summary of the operations, the objects in the persistent memory library. It's got some basic assist functions if you want to do these things more on your own. It's got some whole persistent memory data structures,
Starting point is 00:28:51 and then it's got the notion of a persistent memory object that takes care of those additional things and starts to introduce a notion of type safety among the different manipulations of different persistent memory objects. There is research going on about language extensions. There's actually probably quite a lot of it. I would highlight two. Why do language extensions? I mentioned the example earlier. It should be more convenient. The logic that can take place inside a block of code that has some sort of language structure around it
Starting point is 00:29:33 can actually be more sophisticated. Compilers can be pretty smart. They can track a lot of stuff for you. You actually might be surprised at how much stuff they're already tracking in order to make things happen correctly. And you don't have to pay any attention to that. And ultimately, it may be safer, you know, because, you know, it can do type checking built in.
Starting point is 00:29:54 You know, it's a much more, you know, direct extension of the language that you already know. All right, so these are some of the motivations for ultimately moving to a language environment. There are two, I point out here, pointers to two public works. One of them came from HP Labs in which language support is added for failure atomic code sections that are based on existing critical sections. The logic here says, you know, in order to do correct concurrent programming,
Starting point is 00:30:28 I had to lock something anyway. I have a critical section. What if I say that critical section is also the thing that contains my critical modifications to my data structures? So it couples those things and makes it easy to do both at the same time. Then there's another from Oracle. Gee, that thing is pretty persistent. I should have told it not to remind me again.
Starting point is 00:31:00 In which there's a set of language extensions, and sometimes these are built into precompilers that allow you to do some management of NVM regions using files, implement transactions and locks and heat management in the context of a language. So there are some common themes. There are some differences in implementation and the direction that people are trying out. So there's activity in all of those columns,
Starting point is 00:31:33 right? Accelerating things like file systems or as Andy mentioned JVMs, creating libraries that make it easier and working towards what kind of language extensions would make sense ultimately. So I'm going to segue to another area that we've been working on. We have a white paper out from the NBM Programming Model Group called Remote Access for High
Starting point is 00:32:06 Availability. There's a work in progress draft of that paper available on the SNEA website. And it starts going into more of the high availability problem and it was referenced at least once or twice this morning
Starting point is 00:32:22 and it was referenced in the Microsoft talk yesterday on the subject. So here, what we suggest is that through a combination, and Andy showed one of these, right? A combination of a library and an improved PM-aware file system, you achieve RAID or erasure coding is for redundancy. The advantage of doing some activity in the library
Starting point is 00:32:50 is that you don't have to switch into kernel space in order to get it done. But you can't necessarily do everything in user space depending on what you've run into, what you're trying to recover from. I want to talk for a minute about high durability versus high availability. I think high availability is actually a pretty well-defined
Starting point is 00:33:14 industry term. High durability may not be quite as well-established in the industry. But what I mean is that if you have high durability, you have the ability to recover data after a failure that affected your data. But you have no guarantee about your ability to access that data shortly after the failure.
Starting point is 00:33:39 You can get it back eventually. And local mirroring is an example of that. If the failure was an NBDIM failure and you're doing something like a store that has some sort of a machine right here nearby that mirrors NBDIMs, then you have redundancy. But if your whole
Starting point is 00:33:58 server fails, you may not have access to whatever you have in that pair of DIMMs until the server comes back. But you do have two copies, and you might even be able to move those DIMMs to another server and get your data back that way. So it is highly durable,
Starting point is 00:34:16 but you may not be able to access it. On the other hand, with high availability, it's both. It's highly durable, and you have a method of assuring that you can access it in spite of some number of failures, perhaps only one. And for that, you have to basically get to another node, another server. So that involves network communication, and thus higher overhead. So one of the things that we're anticipating is that people may start leveraging that sync operation to achieve some level of higher availability,
Starting point is 00:34:57 especially since there's some network overhead today in achieving that. So that's what we've started to elaborate on in this white paper. And say, okay, why not during the sync use RDMA to do a remote sync? Now, this is agnostic to how you did RDMA, and in fact, there's debate about whether the optimal implementation is in fact RDMA or adheres to today's RDMA specification. It may be some extension to RDMA, but we're using RDMA as the way of exploring this use case.
Starting point is 00:35:42 So in the persistent memory programming model, there's this function called optimized flush, which extends the sync. It's really pretty much the same as a sync, except that it allows you to list multiple ranges. The current sync semantics that are backward compatible only have one memory range. So in optimized flush, I can say I want to do a sync of all of these memory ranges. It's still not atomic, but it applies the sync semantics to multiple ranges. So the really nice thing would be if I had a remote-optimized flush,
Starting point is 00:36:20 which could do the same thing while hopefully minimizing network latency. However, today the problem is achieving remote durability. The way RDMA works is you don't know how far the write has gotten by the time you get a response to the write. If you surround it with a command response queue
Starting point is 00:36:38 and you send a command and then you do an RDMA interaction and then you get a response which is the way the whole thing was really conceived in the first place, that's well defined and you can do a lot of stuff that way. But that has a lot of round trips. Here all we want to do is blast
Starting point is 00:36:53 something over there and get it persistent by the time we get to the end of the sink and move on. How fast can we do that? So that's kind of the thought experiment here. And when you get to the remote side, even if your RDMA has made it to the remote RNIC on the other side, now how do you assure that it has made it all the way to the persistent memory on the other side? And it turns out in today's processor architecture, you have to go through the processor in order to get
Starting point is 00:37:21 from PCI, where the RNM is connected to the memory. Who is going to make sure that anything the processor did on that path caching wise is in fact flushed? Well, there are some configuration things you can do and I think there's a talk later tomorrow where some of that type of stuff may be described.
Starting point is 00:37:43 But ideally we would want to be able to say, get this RDMA right flush to persistent memory, just like you did on the optimized flush locally. But you would like to do that as efficiently as possible remotely. Today, you have to interrupt the processor to do that, unless you've used some configuration options that may also be architecture specific.
Starting point is 00:38:06 So exploring that space about how to achieve remote durability is kind of the first order requirement here. There has been a proposal of a new RDMA completion type in the stack, the RDMA protocol stack library. And this new completion type would delay the RDMA completion until the data is, in fact, persistent in the remote domain. So this is now becoming a sort of industry-recognized problem, to say how do we do efficient remote durability in order to manage high availability for persistent memory. So people are starting to pitch in.
Starting point is 00:38:56 The industry is starting to innovate as only it can in terms of how to solve this problem. So even if you do solve that problem, we still have this order issue, the lack of order statement or semantics in sync. So if you think of it at the application level, you want to recover from failure, you're going to have to have both robust local and remote error handling to do that. the application ends up having to be aware of how to recover from data that was retrieved from another node. And it comes back to that same ordering problem.
Starting point is 00:39:57 I mentioned the potential for backtracking recovery. And if you're trying to do high availability, right now we don't actually have a way to do that without teaching the application how to do backtracking recovery. Because the thing it got back, well okay, here's the reason, ultimately the reason is because it's way too expensive to only do one write at a time and always know the order it was in, right? The sync semantic that disregards order is more efficient. So to the extent that that's the case, you may have to do backtracking, which gets you into the question of consistency.
Starting point is 00:40:33 When you start talking about consistency, you're really talking about an application-specific assertion. There are a number of ways of treating consistency that have grown out of the storage remote replication capability, the disaster recovery capability, except those are on millisecond-to-second time frames. So now... Okay, I'll do the right thing this time.
Starting point is 00:41:00 All right. So there's a line of thinking here that says, if my application knows how to do backtracking to a recent consistency point and is willing to lose a little bit of work in that process, then I can start applying the concept that I already know about for disaster recovery, but on several orders of magnitude faster time scale. In order to do that type of recovery, perhaps have a recovery point objective for even my local, perhaps in a rack or even in a blade chassis type of high availability. So you're very quickly running down a slippery slope into basically applications that use transactions and are able to recover to a recent consistency point
Starting point is 00:41:55 in order to make the flow of high availability information to a nearby node efficient. So that's where this slope kind of leads. I think I've already pretty much talked about this. The next question is, is the application capable of handling backtracking during an exception or does it basically have to restart? Did it have any choice? If the problem is that I need
Starting point is 00:42:28 high availability and I'm mirroring across servers and one of the servers failed, I'm basically going to have to now restart that application on another server. So depending on the failure scenario, you may you might have a choice of doing it without restarting or you might
Starting point is 00:42:44 not actually have a choice. The it without restarting, or you might not actually have a choice. The scenario may have forced a restart. And this is where transactions start coming in really handy in terms of how to bury all of these problems inside something that's well understood to be a reliable way to recover. So there is a deeper tour of this little journey. Some of
Starting point is 00:43:10 the things we've discovered, some of the responses, the work in progress. It's really interesting to see the industry start responding to the issues that we're having, to rally around a problem.
Starting point is 00:43:27 So this is what we've kind of embarked on, and my belief is that this NBM programming model has helped to stimulate some of this thinking, helped to sort of organize it, describe it perhaps in a more and more consistent way so that players in the industry can develop an ecosystem that actually solves this range of problems.
Starting point is 00:43:55 Questions? Discussion? Oh, related talks. Go ahead with questions first. So, with some of the paleo-moves you're talking about recovering from the loss of a DIN, that's kind of like you have an incorrect memory error in traditional RAM, which you don't normally try and recover from.
Starting point is 00:44:17 So obviously you're thinking that these devices have a higher failure rate than the regular system. Is that not true? No, I'm not necessarily assuming that. What I'm assuming is that when someone wants 99.9999% high availability, that RAM doesn't give you that. Yeah. I'm measuring between systems for the availability,
Starting point is 00:44:41 but within a system, because you were kind of showing both models there. Okay, so the, yeah, you probably wouldn't use them both at the same time. No. Alright, so I was just illustrating kind of the raising of the bar from durability to availability.
Starting point is 00:44:58 And I have one other little question. Okay. I presume you would tend to keep lock-type structures out of the system. Probably, yeah. I presume you would tend to keep lock type structures out of the system probably yeah because well first of all to the extent
Starting point is 00:45:14 that they're extremely dynamic you may not want them in that type of memory and also you may not want them to be persistent because if anybody's ever started thinking through the persistent lock problem it's like gosh I might not want that. So yes, yeah. So there's a fundamental here that I assumed that applications will need to know what memory is persistent and what memory is not from their abstraction point of view, right? As Andy pointed out, you can use persistent memory as if it were volatile.
Starting point is 00:45:46 That's fine, but the application is going to end up with two pools, one that it knows will come back and one that it knows will not come back. So that's a fundamental assumption that I think is pretty well accepted in the industry is the application is going to have to at least know that. Your last chart, the evolution chart. Does it seem like things are getting simpler as you go along, or does it just seem to me that things are getting more complicated programming? It depends on where you stand. It depends on where you stand in this picture. What I'm suggesting is that it gets simpler to do very high-performance transactional manipulations from the point of view of an application,
Starting point is 00:46:37 or at least not any more complex. But if you stand down here in the files or in the libraries that do those things, or in the compiler itself, there is more complexity that you're dealing with. So that's why I would frame it that way. But the idea is to relegate that complexity to the people who are steeped in that knowledge, rather than the whole world. Right. the whole world
Starting point is 00:47:07 as far as we know every section this is the the the so uh... Is that correct? So the simple answer is yes, but it's not clear that the application itself ever deals with the driver.
Starting point is 00:47:35 We have to be very careful when we involve software. There's a sync implementation. Would you say the sync implementation is part of a driver? Right now it's part of the file system protocol. Would you say that doing a flush during a sync implementation, would you say the sync implementation is part of a driver? Right now, it's part of the file system protocol. Would you say that doing a flush during a sync involved a driver? Today, you wouldn't really say that. So we have to be very careful
Starting point is 00:47:55 exactly what the role of the driver is. So that's why the simple answer is yes, but you have to tease it apart a little bit. I had a question on the RDMA for failure recovery. Of course, the further away something gets, the higher the latency. So for what you're talking about here, I think what's reasonable in terms of remote... Okay, what's not reasonable is persistent memory-based disaster recovery across the continent. Forget it. No, that will never happen.
Starting point is 00:48:30 Well, somebody could try it, but it wouldn't be competitive. Definitely. Well, that's true. Yeah, that's right. If you, as a man, are an island, then perhaps. So let's talk about rack scale or maybe a couple of rack scale. Those are the sort of things that people are thinking in terms of.
Starting point is 00:48:53 Gotcha, okay. There's more in other talks. You know, there are talks talks about NVDIMM hardware. That one's coming up today I believe. Persistent memory management is next. So a lot of related talks here. This is a regular theme for us this year. We have a talk tomorrow that Dominic and I are giving on how to measure persistent memory performance with a pure workload generator. And there are talks related to remote access and failure recovery
Starting point is 00:49:35 from Tom Talpe and Chet Douglas. Those are tomorrow. So you can learn a lot more about this area. We also have a couple of other application-related talks from Intel and Pure. Well, Pure is not as, well, it's an application of persistent memory. It depends on where you stand as to what you view as an application. Obviously, Andy's keynote, and you'll be able to pick up those slides.
Starting point is 00:50:04 And then there was a bunch of stuff about persistent memory application. Obviously, Andy's keynote, and you'll be able to pick up those slides. And then there was a bunch of stuff about persistent memory from the pre-conference that took place on Sunday, so that stuff's out there too. So we're hitting it hard this year. There was a lot of information that you can get. I didn't even mention them all. I didn't realize that Microsoft was going to do a talk that was related to it. They have several, actually. So, you know, that's not even all. Okay, thanks. Thanks for listening. If you have questions about the material presented in this podcast, be sure
Starting point is 00:50:40 and join our developers mailing list by sending an email to developers-subscribe at sneha.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storage-developer.org.
