Storage Developer Conference - #17: Solving the Challenges of Persistent Memory Programming

Episode Date: August 16, 2016

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast. You are listening to SDC Podcast Episode 17. Today we hear from Sarah Jelinek, Senior Software Engineer with Intel, as she presents Solving the Challenges of Persistent Memory Programming from the 2015 Storage Developer Conference.
Starting point is 00:00:46 I'm Sarah Jelinek from Intel. Unfortunately, Sarah has come down with a horrible case of bronchitis, and her doctor tells her not to get on an airplane and fly out here. But Sarah and I are in the same group at Intel, and so we worked on this project together. We worked on the slides together. So here I am. I'm Andy Rudoff.
Starting point is 00:01:14 I'm your backup Sarah for today. And what we're going to talk about here is what's so hard about using persistent memory? Gosh, it sounded pretty easy based on Andy's talk yesterday. So we're going to cover these topics. I will do a little bit of review talking about what I'm really talking about with persistent memory. What do I really mean? And some of the problems, the context of the work that we're doing. You'll notice that when I get to the actual detail of the work, I never actually mention the piece of software that we're modifying to be persistent memory aware. That's because it's something that hasn't been announced yet,
Starting point is 00:01:51 that we're working on with one of our partners. But other than that, all the information is going to be there. So what is persistent memory? And I mentioned this if you saw my talk yesterday. I have kind of my own definition of persistent memory. It's byte addressable. It's something where it's so fast that you probably wouldn't context switch away. You'd probably just stall a load from the CPU in order to get the data.
Starting point is 00:02:20 So it's not NAND. NAND is too slow. NAND, slow stuff. We don't even use that anymore. Now we use this cool stuff. I mean, how could you not want to use something that has that really cool picture? That's 3D XPoint. So, you know, computers don't really access things in bytes.
Starting point is 00:02:40 They access things in cache lines. So really think of persistent memory as something that is able to speak these cache lines. On Intel, the cache line is 64 bytes, and all the transfers to and from a DIMM are these little 64-byte transfers. The other cool thing about persistent memory is that you can DMA to it, so you can do things that you can't really do with persistence otherwise. You can DMA directly into your persistent memory. That's a lot of what Tom's talk was about in the previous hour. And coming up in the next hour, or the next talk after this one,
Starting point is 00:03:11 Chet's going to talk about how to make that work. It's actually quite cool. We're very careful how we phrase things because we're a bunch of engineers and we're not allowed to announce things. At IDF, somebody announced that we expect the capacities of this 3D XPoint memory to be up to 6
Starting point is 00:03:31 terabytes on a two-socket system. So that gives you an idea of what kinds of capacities we're talking about and why we're so interested in this, why we're so excited about this stuff. We're paying all this attention to it. So we created a programming model for this.
Starting point is 00:03:48 It's based on memory mapped files, and I have some kind of Linux-y words up here, but a couple days ago Neal Christiansen from Microsoft also talked about how the same kind of programming model is going to be exposed in Windows. We have other OSes working on this as well.
Starting point is 00:04:04 And there is kind of something that hits people right away when they look at the programming model. You open up a file, you memory map it, and now you have the memory right there in your process, ready to use, do loads and stores to it. You do some stores, and now everybody else that has that file mapped for sharing can see your stores,
Starting point is 00:04:25 and they can start using your data and this shared memory and making calculations based off of it, making decisions based off of it. And then the program crashes, and those stores, it turns out, weren't actually persistent yet. Okay, so, my gosh, is that a bug?
Starting point is 00:04:41 How could you have done that, right? And so a lot of people do ask this. They think, oh, we've discovered a horrible hole in our strategy. But, you know, it has always been this way. Memory mapped files have been around for more than 30 years now. And it's always been the case that when you do stores to these files, these memory mapped files, they are visible but they're not durable until you do some sort of flushing operation. They might be durable before this because normal cache pressure can flush things out.
Starting point is 00:05:08 But they're not known to your program to be persistent until you flush things out. On Linux, the call to do this is msync or one of its variants. So for persistent memory, we don't use the page cache. We allow an application to access it directly with loads and stores. I drew a picture of that that I showed yesterday.
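Just to make that concrete, the age-old memory mapped file flow looks roughly like this; a minimal sketch in C, with a made-up path and the error handling left out:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/some/file", O_RDWR);   /* illustrative path */
        char *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);  /* assume a page-sized file */

        strcpy(addr, "hello");       /* visible to other mappers now... */
        msync(addr, 4096, MS_SYNC);  /* ...but not durable until this */

        munmap(addr, 4096);
        close(fd);
        return 0;
    }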
Starting point is 00:05:35 And so msync, where it is normally about flushing the page cache, is actually about flushing other places in the machine now for persistent memory. So if you just think of the data path of a typical machine, a core does a store. A store is called MOV in Intel assembly language. And it goes through the L1 cache, the L2 cache, the L3 cache, the memory controller, and eventually it hits this persistent memory down here.
Starting point is 00:06:01 And so just coloring in red all the hiding places. These are the places where the store could be if your program crashes before you're done flushing or if the machine crashes before you're done flushing. So what do we do to get stuff out of there? Well, we've introduced some new instructions
Starting point is 00:06:17 in Intel to help with this. There's the processor cache part up here, and the instructions to do this are the CLFLUSH variety. And then there's the memory subsystem down here, and the instruction down here is called PCOMMIT. So CLFLUSH, the instruction that flushes a cache line, has been around for years.
Starting point is 00:06:39 But it's a serialized instruction, so if you just wrote 4K of data and you're going to loop through and flush it all out, that little loop will go serially; every CLFLUSH has to wait for the previous one to finish. CLFLUSH has been serialized in every implementation that we've
Starting point is 00:06:57 done, in every processor that Intel has done. We didn't feel like we could just remove that because we thought we'd break some existing software. Somebody somewhere is depending on that serialization, and so that's why we introduced this new one, CLFLUSHOPT. It's optimized. That same loop I talked about, if you just loop around
Starting point is 00:07:14 doing CLFLUSHOPT, those will happen concurrently, and it's not until a little fence instruction that you do at the end of the loop that you're sure they're all done. CLWB, cache line write-back, is similar, only it doesn't invalidate the thing that it flushed. So if you're actually going to expect to use it again and you want that to be a cache hit, you use this. These are all public instructions. They're all documented
Starting point is 00:07:37 in the Intel software development manual on intel.com. You can just Google for these. You'll find them. PCOMMIT is a broadcast operation. It doesn't take an argument. It basically says, if there are any stores that are headed for persistent memory, mark them. And then when you wait at the end of the PCOMMIT,
Starting point is 00:07:57 you put an SFENCE there, a fence instruction. That waits for all the marked stores to drain. So again, I might end up waiting for things I don't care about. I might end up waiting for somebody else's stores. Or my stores might have already been flushed out. Doesn't matter. This is about correctness.
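To give you a feel for it, here's a rough sketch of that optimized flush loop, written with the compiler intrinsics from immintrin.h. Real code also has to check CPUID for CLFLUSHOPT support and fall back to CLFLUSH, and follow up with PCOMMIT where the platform wants it; that's exactly the gory checking libpmem does for you, so treat this as illustration only:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    #define CACHELINE 64

    static void flush_range(const void *addr, size_t len)
    {
        /* round down to the first cache line covering addr */
        uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);

        for (; p < (uintptr_t)addr + len; p += CACHELINE)
            _mm_clflushopt((void *)p);  /* these proceed concurrently */

        _mm_sfence();  /* not sure they're all done until the fence */
    }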
Starting point is 00:08:13 If a program needs to know that its stores are definitely durable, this is the way you do it. You do these flushes and PCOMMITs. Okay, so this is the standard. Here's our architecture diagram, and I just got done telling you that the way you flush things here
Starting point is 00:08:33 is to use those instructions. If you do use msync, if you do use one of the standard ways of making something durable, then you'll tap into the kernel here, and the msync code will use those instructions. But those instructions are available from user space. You can use them up here and save yourself some
Starting point is 00:08:50 code path. These things that are for enabling here in Linux have already been delivered into the Linux kernel. There are links at the end of the presentation on where you can get this stuff, or you can just go to kernel.org and grab the latest kernel
Starting point is 00:09:05 and play around with this stuff if you like. But since it's based on memory mapped files, pretty much all the code that I could show you and all the examples in our library just work on any file system. They just don't happen to be as fast as they would be on persistent memory because paging is happening
Starting point is 00:09:21 underneath. But it allows you to start writing programs now. So in order to make persistent memory programming easier, we wanted to create a library, a set of libraries really, for the different use cases. And we call it NVML, the NVM Library. It's completely open sourced.
Starting point is 00:09:46 It's on GitHub today. You can play with it. We just had our alpha quality release, which means all the serious bugs seem to be taken care of at the moment anyway. And I'm going to quickly walk through what the libraries do, and then I'm going to show you how we took a problem and selected from these libraries different things, which library we should use to solve our problem. So the lowest level library is just called libpmem.
Starting point is 00:10:12 Nice, very simple name. This library is very small. I mean, I've got to tell you, I literally wrote this library over a weekend, and it's just a few lines of code. It's just nice, convenient wrappers around those instructions that I just showed you. So this is for when
Starting point is 00:10:30 you don't want anybody to do anything fancy for you; you just want help flushing your writes to persistent memory out, but you don't want to write the assembly language yourself. And actually,
Starting point is 00:10:42 the kind of gory part is checking to see if the machine supports the certain instructions that you want to use and falling back to other instructions. It's a pain in the butt. You don't want to write that over and over and over again. So that's what libpmem does. It looks up which instructions are available on the platform and picks
Starting point is 00:10:56 the right one for you, the best one for your situation. We have a few entry points in here to do things like a memcpy, copy a range to persistent memory, and it's got some heuristics in it that say, oh, you're copying a range that's so big, I'm going to use special instructions that go around the cache so that I don't have to do the cache flushes at the end. Things like that. Just the simple stuff. So even if
Starting point is 00:11:19 you want to roll your own, you probably still want libpmem, because you don't want to keep writing that code over and over again. libpmemblk: this is basically for arrays of blocks that are all the same size, where you want to be able to update any one block atomically. When you write to that block, you want either the old block or the new block
Starting point is 00:11:39 in the case of failure. You don't want part of the old block and part of the new block. This is, think of it for really big arrays like database caches, things like that. This is actually the exact same code and algorithm that we use in the drivers
Starting point is 00:11:56 down in the kernel when we want to make persistent memory look like a block device. It's very convenient to have a block device also have atomic blocks associated with it. libpmemlog covers just one of our use cases. It's a persistent memory resident log file.
Starting point is 00:12:12 So if you think about an application like a database that has some sort of undo log that it writes. A lot of times they call these append-only files. I think that's what Redis calls it. And the append-only file, it is append-only. You literally just append stuff to this.
Starting point is 00:12:29 You never read it back. You write, write, write, write. The only time you read it back is if the program crashes, and you come back and you look in the end of the log to see what operation you should be undoing. So every time you're doing that append, you're going down into the kernel. You're going down into the file system. The file system maybe even has to do some metadata operations even if it's as mundane as updating the
Starting point is 00:12:48 access time or something. And then you have to go through the block stack, so it's quite a long code path when all you're trying to do is just append something to a little file. So moving that up into persistent memory makes this into a much shorter thing. In pmemlog, when you append, it's a memcpy followed by a couple of these flush instructions. It's very fast.
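Just to sketch the shape of it, the libpmemlog calls look something like this; the pool file name is made up, and real code checks every return value:

    #include <libpmemlog.h>

    void log_example(void)
    {
        /* create the pool file, or open it if it already exists */
        PMEMlogpool *plp = pmemlog_create("/pmem/undo.log",
                                          PMEMLOG_MIN_POOL, 0666);
        if (plp == NULL)
            plp = pmemlog_open("/pmem/undo.log");

        /* the append is a memcpy plus flushes, no trip into the kernel */
        pmemlog_append(plp, "some undo record", 16);

        pmemlog_close(plp);
    }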
Starting point is 00:13:10 But it's really just for a very specific use case, because the log file is just one of these things that has to kind of be append-only, and not replicated, in this use case. Another library we have is libpmemobj. That's a persistent memory resident object store, which is really kind of a fancy way of saying this is just our most generic library.
Starting point is 00:13:33 Object: I'm using object not in the Java object sense; I'm using object in the storage sense, where object really just means a variable-size block. So this is the most flexible library we have. It allows you to begin a transaction, do all sorts of things like allocate from persistent memory,
Starting point is 00:13:51 free, and make a bunch of changes. And all that information goes into an undo log. And then there's a commit. And so if you get interrupted before you do the commit, everything reverts back to the way it was at the beginning of the transaction. So this is our most general library. If you're just starting out playing with NVML, this is probably what you want.
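A rough sketch of what one of those transactions looks like; the 64-byte record size is invented, and a real program checks for errors:

    #include <libpmemobj.h>
    #include <string.h>

    void overwrite_record(PMEMobjpool *pop, PMEMoid rec)
    {
        TX_BEGIN(pop) {
            /* snapshot the range into the undo log before changing it */
            pmemobj_tx_add_range(rec, 0, 64);
            memset(pmemobj_direct(rec), 0, 64);
        } TX_END
        /* crash before the commit and those 64 bytes revert to their
           old contents the next time the pool is opened */
    }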
Starting point is 00:14:12 And we have a whole bunch of tutorials and examples on the GitHub site for this. Libvmem. Okay, so now this is the volatile use of persistent memory. And I know that sounds a little ugly to have a term that has both volatile and persistent in it at the same time. But, you know, in an earlier slide I said,
Starting point is 00:14:29 well, we expect to see 6 terabytes of this stuff on a system. And another part of the IDF announcement was we expect it to be cheaper than DRAM. So that means, you know, to build a system today with 6 terabytes in it out of DRAM, that's pretty expensive. And so we are envisioning a time when you might say, well, I need a lot of memory,
Starting point is 00:14:49 but actually I don't need it all to perform at DRAM speeds. So I'll put in some DRAM, save a bunch of money, and put in some of this non-volatile memory for the rest of the terabytes that I need. And so that's great, but now you have two types of memory in the system. What do you get in a C program if you call malloc? Well, you get the system memory, whatever that is. So that's the DRAM. So how do you ask for the
Starting point is 00:15:12 other kind of memory? That's what libvmem is for. Libvmem just provides another malloc and free. That's all it is. It's just for C programmers to say, I want a malloc from that pool of memory instead of the normal system memory. It's volatile in the sense that if the system crashes, it's forgotten.
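Here's a little sketch of how a C programmer asks for that other kind of memory; it assumes a persistent memory aware file system mounted at a hypothetical /pmem:

    #include <libvmem.h>

    void vmem_example(void)
    {
        /* carve a volatile heap out of a file system in /pmem */
        VMEM *vmp = vmem_create("/pmem", VMEM_MIN_POOL);

        int *p = vmem_malloc(vmp, 100 * sizeof(int)); /* from that pool */
        /* ... use p like any other memory ... */
        vmem_free(vmp, p);  /* must pair with vmem_malloc, not free() */

        vmem_delete(vmp);
    }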
Starting point is 00:15:31 If the system crashes, or the program crashes for that matter, it's all given back to the system, just like the stuff that you get when you call malloc. And then the last library that we have out there is just another version of libvmem. It's called libvmmalloc. And this is one for transparently converting an application's calls to malloc and free into persistent memory mallocs and frees. So this would be a case where, as an administrator, you had a whole bunch of programs running on the system, but you only really needed a subset of them to run out of DRAM.
Starting point is 00:16:07 The rest of them, you want them to be memory-resident. You want them to be faster than if they were actually paged out. So you just use this library when you run them, and it'll just say, oh, this guy gets to allocate from the slower, bigger, cheaper pool, and he has no idea that's even happening, so the application is not even modified in this case. I bet the security guys would hate living that way. Well, actually, so why do you say that?
Starting point is 00:16:39 Because people are going to be writing to something as if it's DRAM, assuming that when the power goes away, it vanishes. There are all kinds of government secrets in that area. So let me ask you this. What do people do for temporary files? Temporary files are, by the way, in use by most of the programs that you're running on your laptops today. Many of them use temporary files. How do you handle that same problem for a temporary file?
Starting point is 00:17:04 I've asked the security guys. I don't know. Well, so I'll tell you a common thing. Of course, you're right that there are more secure environments where they are much more paranoid about it. But for a lot of Unix-y systems, there is a POSIX interface for creating a temporary file. And what it does is it creates the file and allocates as much of it as it wants.
Starting point is 00:17:28 And then it deletes the name from the file system namespace. And so nobody else can attach to it at that point. There have been age-old security bugs where people got at it. But in the general case, no one can attach to it. And when the system crashes or when the program exits, that space is freed up and used again. Now, it still may have the old data in it, but that's how deleting a file works.
Starting point is 00:17:49 If anybody tries to allocate from that free space, it's zero fill on demand. And so that's the same here. So it's the same security as files is what I'm saying. But if you're in an environment that needs better security than that, absolutely, somebody needs to go back and zero that stuff out. Yes, sir? Pull out a DIMM
Starting point is 00:18:09 that's got DRAM on it, and that file is gone. But isn't it the case that this one is persistent? That's a different security situation. So, did you all hear the question? The question was, well, so wait a minute. Since you're using a non-volatile technology
Starting point is 00:18:26 in a volatile way, can somebody come and yank the DIMM out, and now they've got your secrets? And my answer to that is that I think this will be an area for vendors to compete with interesting solutions. So it's not necessarily the case.
Starting point is 00:18:42 There could be solutions. So, for example, going back to my temporary file example: if I'm going to put an SSD in my system and I'm going to use a temporary file, if somebody yanks that SSD out, now do they have my data? Not if it's a self-encrypting drive and I didn't give them the unlock password for it, right? So there are solutions. But there are considerations. I totally agree with you.
Starting point is 00:19:09 Okay. So let's talk about my current work, well, Sarah's current work. So Sarah was partnered up with this application, and the application had a block cache in it.
Starting point is 00:19:23 The block cache basically could use either memory or an SSD. And we looked at it as a way of reducing DRAM because, like I say, we are expecting this emerging technology to be cheaper than DRAM and to allow us to build bigger capacities. Gosh, it does happen to be persistent, so maybe we can leverage that and make the cache
Starting point is 00:19:50 warm when the program starts up. That's a possibility, so we wanted to look into that. We wanted to look into also just using it as volatile and decide which one was better and when we should use which one. We knew that we had the NVM libraries
Starting point is 00:20:06 and this is maybe not quite stated correctly. In reality, we were working on the libraries at the same time as we were working on these ideas, these proofs of concept, and we were feeding, you know, requirements back and forth to each other. So a lot of the design of the library is based on work like
Starting point is 00:20:21 this where we figured out things that we wanted. So, you know, the first challenge you get is, you know, really where to integrate this into your program. You know, I actually get this question a lot, and it comes in many forms, so I'll paraphrase. The question is, are you kidding me? I have to modify my entire program to use
Starting point is 00:20:45 persistent memory? And the answer is, well, probably not. At least in most of the cases that we found, that's not what's going to happen. Usually what's going to happen is, well, here's what: we're not replacing memory. There's still going to be system memory. You know, we're not replacing that with persistent memory. We're not replacing storage. There's still going to be storage. We're not replacing that. We're introducing a new tier. And so probably what's going to happen if you want your program to leverage this stuff
Starting point is 00:21:14 non-transparently, you want to actually make modifications to get the best leverage, is that you're going to figure out what data structure should live in this new tier, and you're going to make a few modules for managing that new tier. You're not going to go through your entire program and rewrite it.
Starting point is 00:21:29 You're mostly going to decide, ah, I do need to comprehend this new tier, just like what happened when people used to have programs that dealt with storage and memory and then SSDs came along and they started thinking, hmm, I could probably do something clever by
Starting point is 00:21:45 putting some of my data on an SSD instead of a hard drive for performance. Yes, sir? So, the persistent memory address range, is that actually addressed through normal page tables? And is it just your particular implementation and modification of the Linux kernel that is
Starting point is 00:22:01 actually separately treating those, or, alternatively, is it actually treating the entire space? Yes, so the question was, here I am, the only way that the program has the ability to get to persistent memory is by opening up a file and then memory mapping it, and then, sure enough, it's there and it's using the MMU, it's using page tables. Could you have an operating system that does this completely differently?
Starting point is 00:22:30 And that's what I was talking about yesterday with the transparency levels. You could certainly have something where it's used only at this level and the application has no idea what's going on. And it could be, you know, maybe the operating system decides sometimes to give you a page of this stuff. But it's persistent.
Starting point is 00:22:46 And usually when you want something to be persistent, you have to associate some sort of name with it so that you can reattach to your data. Otherwise, you know, why think of it as persistent, right? And so that's why we came up with this idea of using file names as the names of the blobs because people were already kind of used to that. And it already has a permission model. So, of course, other models are possible. So access and dirty bits are available, all the usual? Yep. Reference, modify bits, that's all available. Okay.
Starting point is 00:23:19 So the other challenge here is, like I say, picking which data structure you put on persistent memory. And, just kind of skipping down here: what happens if a failure occurs? We spend an awful lot of time thinking about that, especially for the persistent part. I'm going to talk about that when we get in there. Okay, I'm going to skip this slide
Starting point is 00:23:45 and just go to Sarah's picture. This is the picture of the software that is not yet to be named, but will be someday soon open sourced, we think. And the whole idea here isn't that important, but it's obviously got some sort of fan out, just like so many things do in a big data world these days. And these servers here
Starting point is 00:24:07 each get some work sent to them, and they have some big giant existing block cache today, but it's actually not that big or that giant. It's just based off of existing technology. And we want to look at moving it from DRAM to some combination of DRAM and persistent memory. I'm also going to skip this slide. In our software that we were given, there's one block cache per server. There's
Starting point is 00:24:36 a two-stage allocation, we think, because at the first time that you know that you need to use the cache, you can allocate the space, but you don't know the data yet, and so then you come back and you put the data in. So it's kind of a multi-stage cache entry part. At least that's how the software is already designed.
Starting point is 00:24:58 And then there's a sort of a wrapper that goes into the cache that holds both the key and the value in there. There's a least recently used algorithm in the cache. So this describes the cache in so many programs, right? So many different programs have ways of caching things. But the details aren't that important. What's really important is that we've identified a cache that we want to move to persistent memory.
Starting point is 00:25:20 So the first thing we did was this volatile mode version because we thought, well, that'll be easier, it'll be faster for us to implement, and it was. The goal was to reduce DRAM. In other words, take this cache, which was largely using up all the DRAM, and just move it into volatile mode. And this is great because it is really awfully easy to use. You don't care about flushing.
Starting point is 00:25:44 You don't care about persistence. The amount of space that you have is bounded. Now, this is actually kind of interesting. On many Linux distros, malloc will virtually never return null. malloc always succeeds, and it's because of this little feature in Linux called memory overcommit. On one distro, malloc will return null because overcommit is turned off. It's actually the reason why some people use that distro, because they don't
Starting point is 00:26:16 want you thinking that you have memory that you don't. Anybody know the distro? No. Then you don't win the prize. So, you know, Sarah doesn't give out information like that. I don't know. It's in SLES.
Starting point is 00:26:35 So, again, we're kind of using just a volatile mode here. We base this off of libvmem. And really the whole point that you want to know about this slide is we just picked a version of this cache and we changed the places where it does malloc and free to these calls to vmem_malloc and vmem_free, which is what we call them in our library.
Starting point is 00:27:02 It was a pretty straightforward thing. I'm lying just a little bit because the code is C++, so actually we changed constructors and destructors to use these things. But other than that, it was pretty straightforward. This is a very light lift. If you just want to take a data structure and put it into persistent memory as volatile mode, this is an incredibly light
Starting point is 00:27:26 lift. It's a great way to try out the different timings of the different tiers of memory without worrying about the persistence. So we had this done very quickly. The hardest part about it, if you think about it, remember that diagram I showed you where you open up a file and you memory
Starting point is 00:27:42 map it. Well you don't want to go to a programmer and say, here, I'm going to map two terabytes into your address space. Have fun. You want them to have some sort of allocator there, some alloc and free. So that's what the library does for you. So if you had to write your own allocator, it would take some time.
Starting point is 00:27:59 Most of us haven't had to do that since CS101 or whatever when you finally learn how allocators work. So it's a pain in the butt. And if you go out and look at all the third-party allocation programs, memory allocators that are out there, none of them expect to work on their own little pool of memory.
Starting point is 00:28:16 They all expect to, oh, we're a faster allocator. We're going to take over being the allocator for your main memory. And so we took jemalloc, which is the memory allocator on FreeBSD and used on a lot of other systems as well, and we modified it so that, instead of assuming it's the only allocator on the system, it has the ability to operate on multiple independent pools of memory,
Starting point is 00:28:40 and that's exactly what this library is. It's really just a wrapper around jemalloc. We didn't write a whole other memory allocator. Not for volatile. For persistence, we did. So all the tracing things and whatnot that jemalloc has, do they work? They do work. We made them available.
Starting point is 00:28:56 All the tracing in jemalloc is available through our library if you want the stats. So, this is great. Piece of cake, right? Very light lift. So, what did we have to watch out for? Well, this first one actually really, really, really annoyed me. The first
Starting point is 00:29:15 one is, if you get it mixed up, if you get some things with malloc and you get some things with vmem_malloc, and then you call the wrong free, the libraries don't know what to do. I mean, they actually just kind of do the wrong thing, which is not crash most of the time. They put it in the wrong tree or something,
Starting point is 00:29:33 and later on you discover it when you do crash, and you're like, wow, what the heck happened here? So we're fixing this, right? Because it's just totally annoying. You don't want to put that burden on the programmer. And the way we're fixing this is with another library that now is on GitHub again that I think is probably going to replace
Starting point is 00:29:51 libvmem. We're probably going to just sort of unify on one. It's called libmemkind. Intel had already done this library for allocating from different NUMA distances so that you could mix your local and your remote allocations and things like that.
Starting point is 00:30:08 Libmemkind is clever enough. You can pass anything that it allocated to free, and it'll figure it out. And it's willing to take over both the system main memory and these other pools. We went ahead and added the ability for it to allocate from persistent memory as well. So libmemkind seems to be this kind of unifying thing,
Starting point is 00:30:25 and then over time as we see these new kinds of memory emerging over the next few years, we'll just keep adding more kinds to libmemkind, and it gives you a nice unified programming model for volatile memory. So, at least some of us are all focusing on this. There's several groups of us
Starting point is 00:30:41 now. The high performance computing guys and me, the persistent memory group, and several other groups inside of Intel are all just agreeing this is the unifying library for volatile memory allocation. So, yeah. Is libmemkind NUMA aware
Starting point is 00:30:58 still? Yes, libmemkind is still NUMA aware. For the persistent DIMM? Yes. For anything that has NUMA locality, yes. So, you know, libmemkind, I actually added this bullet to Sarah's slide just now before I got up here. Libmemkind wasn't quite ready for her use when she did this. So now she's going to see this slide and say, well, why was I told not to use this? So I think it actually makes a lot of this a little simpler because you get this unified model.
Starting point is 00:31:33 Yeah? So when you do malloc in these new libraries, are the page table entries any different for the process? Or let's say if I'm looking at the process's maps, can I say that, okay, this memory came from the persistent memory and this came from the main memory? So the question is, if I allocate some things from the normal system main memory and some things from the persistent memory pool,
Starting point is 00:32:00 are there differences in the page tables, for example? And there aren't differences in the page tables, but in Linux there are these map files in /proc that you can look at, and it tells you what file your pages came from for memory-mapped files. So it makes it actually quite clear. You see the range of your address space for that process, and you see that it came from a file on a persistent memory aware file system, so you can reverse engineer
Starting point is 00:32:28 and say, oh, this came from persistent memory. And the other ones show up as anonymous memory. So these don't show up as anonymous; they show up as associated with that file system. Yes, sir? You talked about volatile memory versus persistent memory,
Starting point is 00:32:44 and you talked about the cache and an independent non-volatile portion of it. Are you referring to the non-volatile DRAM on the DIMM, or is this outside of the DIMM? Outside of the DIMM. I'm just talking about the normal processor cache hierarchy, the L1, L2, L3 caches in the processor. So let's move into persistent memory here since I'm about, well, I'm more than halfway through my time
Starting point is 00:33:12 and I really wanted to get to this part. So what's so hard about it? I just got through telling you how easy it was to just use this malloc and free interface, right? Well, a couple things are hard. And let me start with an example. Remember I was telling you that these things are like memory mapped files, right? So on Linux, the way you memory map a file is you open it, you memory map it,
Starting point is 00:33:31 and then you have this pointer that you got back from mmap that you can just do stores to. And here I am doing stores with a strcpy, everybody's favorite not quite secure copy command. And I'm copying my name to this persistent memory. And now I want to make it durable. Well, my name, with the null terminator, is 5 bytes. Oh, I know, I know. I really should have changed that,
Starting point is 00:33:56 but then the example didn't quite work. Yes, so I'm copying Andy's name. Because, you know, he's such a fun guy to work with. And with the null terminator, it's five bytes long, and so to make it durable, I can call msync, just like I told you. Or I just got done telling you that we have a libpmem that essentially does the same thing but without having to trap into the kernel.
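Here's roughly the shape of the code on the slide; a sketch, with a made-up path and no error handling:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <libpmem.h>

    void persist_example(void)
    {
        int fd = open("/pmem/name", O_RDWR);  /* illustrative path */
        char *pmem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);

        strcpy(pmem, "Andy");   /* 5 bytes counting the null terminator */
        pmem_persist(pmem, 5);  /* flushed from user space, no kernel trap */
    }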
Starting point is 00:34:20 So this is what it looks like if you use libpmem. It flushes those five bytes out. Piece of cake, right? So, and there's just kind of a picture in the architecture of where libpmem lives. So you can kind of see. But what happens if I copy something bigger? Now my co-worker Andy Rudoff's full name is actually 12 bytes with the null terminator. And so here I am.
Starting point is 00:34:48 I copy out those 12 bytes, and now I'm going to flush them all to durability. Piece of cake, right? This is really easy stuff. But before this call returns, the system or the program crashes. So what are the possible results here? Well, one is, assuming it all started out as all nulls, that none of the data got copied. I crashed before anything got flushed out. Or maybe it all got copied out
Starting point is 00:35:09 there. That's perfectly possible. Or what about this number two and number three here? Maybe it got part of the way flushing things out. It all seems kind of reasonable. Or look at number four here. It's starting to look kind of ugly. The last part of my co-worker Andy Rudoff's
Starting point is 00:35:26 name went out, but the first part didn't. So you can see with this kind of a flush idea, there's nothing transactional, right? Nothing is transactional, and that's what makes it hard. If I'm trying to recover from this, I can't just look to say, oh, how far did the program
Starting point is 00:35:42 get, because I get cases like this. Things can go out of order because cache pressure works that way. On a modern-day set associative cache, things go out in any order. You can't depend on it. So we need transactions, and transactions are not provided by the hardware. Now, wait a minute, you say. I went to the Intel website and I went to the software developer's manual and I searched for the word atomic.
Starting point is 00:36:08 And Spencer did that earlier today. He told me it showed up like dozens of times, didn't you? Sorry, I had to use somebody. So, yeah, you'll find the word atomic in there all the time. But, you know, we didn't have persistent memory for the past N years of Intel development. So they're not talking about atomicity with respect to durability. They're not talking about atomicity with respect to failure, like power failure.
Starting point is 00:36:34 So if you see an instruction that says it atomically stores 16 bytes, like the compare-exchange instruction called CMPXCHG16B, they're talking about visibility. No thread running concurrently to your thread will see part of that store. Cool, right? It has nothing to do with durability. So even though you see these atomic stores that are much bigger in the manual,
Starting point is 00:37:08 like there's a 512-bit one in the AVX-512 instruction set, for power failure, it's still 8 bytes. 8-byte atomicity. If you're storing 8 aligned bytes and you lose power, you'll get the old 8 bytes in your persistent memory or the new 8 bytes in your persistent memory. That's cool. That's kind of like a little mini atomic store. Anything bigger than 8 bytes?
Starting point is 00:37:28 All bets are off. Software has to build transactions out of those 8 byte transactions, those 8 byte atomic instructions. Software has to build bigger transactions in software. What about the TSX instructions? There are these new things that came out
Starting point is 00:37:43 Yeah, thank you. See, you do know this better than I do. So, the TSX instructions, where the T stands for transactional, have instructions called XBEGIN and XEND.
Starting point is 00:37:59 So you do an XBEGIN and you can do a bunch of changes and you do an XEND. And again, it's all transactional with respect to other threads. It's really meant to allow you to do this kind of optimistic locking where you don't have to grab a lock. But if a conflict does happen,
Starting point is 00:38:16 you get this thing called an XABORT, and then you're expected to take a lock and redo it. Well, one of the things that always causes an XABORT is a cache flush. It's just not made for persistent memory. Maybe in the future, but every time I go to the hardware guys and I ask them to make this stuff work for persistent
Starting point is 00:38:31 memory, they give me a very long reply about how hard it is. So it's not going to be anytime soon. I also mentioned compare exchanges. It's used by these lockless algorithms. Also, this is the heart of a lock itself, but these lockless algorithms are all in vogue these days that go around and use, you know,
Starting point is 00:38:49 they call them non-blocking data structures, where they don't grab locks, but instead they use these compare-exchange things. That's fine. It will work, but it will not work as far as durability is concerned. There is no one instruction that says compare, exchange, flush, PCOMMIT, and then let somebody else see it. There's no instruction to do that.
Starting point is 00:39:08 So we have to do that in software. And so, like I said, we have a library that is a general purpose transaction library, and it sits on top of libpmem, and it has operations like begin a transaction, end a transaction, allocate and free.
Starting point is 00:39:24 It's got a memory allocator, but this is more than your garden-variety malloc and free. Now, if you allocate something, but before you get around to using it, you crash, that memory goes back on the free list. Otherwise, it would be not only a memory leak, it would be a persistent memory leak. Well, that's like double bad, right? So that same example of writing my co-worker Andy Rudoff's name to a field, I'm just going to kind of skip
Starting point is 00:39:53 a lot of the details for now, but the point is now there's a way of beginning a transaction, which I put into a macro here in C, just to make the code a little cleaner to look at. There's a begin and an end, and so now this strcpy either completely takes place or doesn't take place at all in the face of
Starting point is 00:40:09 failure. Now our transactions also provide you the opportunity for some multi-thread safety at the same time. What we found when we started converting programs to use these transactions is that everybody does multi-threaded programming today.
Starting point is 00:40:27 The places where you protect data structures for visibility with other threads almost always lined up with the places where you protected things for power failure atomicity, right? It was so common that that's why we made this C macro. It not only begins a transaction, but the underscore lock version of it here offers to grab one of your locks. This is just a lock the programmer defined. It's protecting this data structure to make this code multi-thread safe. And so what happens is,
Starting point is 00:40:58 you begin the transaction and you're also locking out the other threads. You do the transaction, and then the end here commits the change and drops the lock. Turned out that actually a lot of the multi-threaded code got simpler because there were places
Starting point is 00:41:13 in multi-threaded code where you grabbed a lock and you started making your change, and you realized something was wrong. There's an error path. And you had to kind of put things back into a sane state before dropping the lock again. But we have an undo log here. We do that for you. Cool.
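The shape of that code is roughly this; a sketch, where the struct is invented, the exact macro spelling follows an early release of the library (check the man pages), and PMEMmutex is the pmem-aware lock type I'll describe in a second:

    #include <libpmemobj.h>
    #include <stdint.h>

    struct entry {
        PMEMmutex lock;  /* reinitialized each time the pool is opened */
        uint64_t value;
    };

    void update(PMEMobjpool *pop, struct entry *e)
    {
        /* begin the transaction and grab the lock in one step */
        TX_BEGIN_LOCK(pop, TX_LOCK_MUTEX, &e->lock, TX_LOCK_NONE) {
            pmemobj_tx_add_range_direct(&e->value, sizeof(e->value));
            e->value++;  /* undo-logged and thread-safe at the same time */
        } TX_END  /* commits the change and drops the lock */
    }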
Starting point is 00:41:29 Sarah thought that was really cool. Good job, Sarah. Yeah. And the kind of locks that we give you, you know, this is on Linux, so pthread locks are the common kind of locks these days. We offer these wrappers around the pthread locks called PMEM mutexes. A PMEM mutex is just like a normal mutex, like a pthread mutex, except every time the persistent memory file is opened, all the locks are magically initialized.
Starting point is 00:42:04 It's all done with a generation number. So imagine if you have a data structure with potentially millions of locks. It's a tree, and it has a lock in every node of the tree. There are millions of locks. Some of them are held and some of them are not when the program crashes. When you come back up, you don't want those locks to
Starting point is 00:42:19 be in that old state. So we just magically, if they're this type, we magically revert them. And like I say, it's all done by incrementing one generation number, so it's very fast. So in addition to that, I just got done explaining how cool all these transactions are. We also found that outside of a transaction, it's nice to just be able to do allocates and frees and know that they're safe, they're atomic. In other words, you might just want to say, oh, I'm just going to allocate these things and fill them up and start using them.
Starting point is 00:43:00 And if the program crashes, I just need a way when I come back of walking through everything that I allocated. And so we added a type field to our allocator, and you can allocate things and provide a little type. That's a little tag. It's like a data type. And then when you come back up, when you first open the pool, you can say, I need you to walk through everything of this type. I'm going to put it all into a hash table.
Starting point is 00:43:21 I need you to walk through everything of this type. I'm going to put that into a different hash table, and so on. It turned out that for a lot of the problems that we were dealing with, this was all you needed. We didn't actually need those transactions. We just got a lot of simpler code here. The other thing that we found that was handy to have outside of a transaction are list operations.
Starting point is 00:43:42 So we just made a generic doubly linked list, just like in the C++ standard template library, where there's a list type or something like that. It's a doubly linked list that everybody, well, one person had to code it up and optimize it and get it right, and everybody benefits from it. Same idea here. It's a doubly linked list, only it has some cool features to it.
Starting point is 00:44:01 You can allocate a new node and put it on the list atomically. So if you crash, that thing is either completely allocated and on your list, or that space reverts back to the free list. The same thing on the remove side. You can remove something from a list and free it atomically. If you crash, it either goes back to where you found it because you're not done yet, or it completes the operation. Or you can move from one list to another.
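A sketch of what those atomic list operations look like; the struct names and the type number here are invented, and the macros follow the library's POBJ_LIST family:

    #include <libpmemobj.h>
    #include <stdint.h>

    TOID_DECLARE(struct node, 1);  /* illustrative type number */

    struct node {
        POBJ_LIST_ENTRY(struct node) entries;  /* linkage lives in pmem */
        uint64_t payload;
    };

    POBJ_LIST_HEAD(nodelist, struct node);

    struct my_root {               /* the pool's root object */
        struct nodelist head;
    };

    void push(PMEMobjpool *pop)
    {
        struct my_root *r =
            pmemobj_direct(pmemobj_root(pop, sizeof(struct my_root)));

        /* allocate a node and link it in as one failure-atomic step:
           after a crash it is either fully on the list or back on the
           free list, never leaked and never half-linked */
        POBJ_LIST_INSERT_NEW_HEAD(pop, &r->head, entries,
                                  sizeof(struct node), NULL, NULL);
    }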
Starting point is 00:44:32 So the combination of the earlier allocations that are safe and atomic and these lists that are safe and atomic actually allow you to solve a lot of problems without even worrying about transactions. And so, you know, we kind of added these as an afterthought, thinking, you know, this might come in handy. Well, for Sarah's work, she turned out not to use transactions at all. She ended up using this.
Starting point is 00:44:47 And so she made that block cache by just allocating those items and putting them into a hash table that lived in DRAM because the hash table was the critical code. It was the thing that needed the performance of DRAM. And then when the program shuts down, we don't worry about it. When the program comes back up, we just walk through all the lists of stuff that are in the cache,
Starting point is 00:45:09 potentially terabytes of it, and quickly rebuild the hash table. Quickly, that is, for terabytes; we think it's probably tens of seconds per terabyte at least. But it's still not so bad when you have a pretty big cache like that that you're warming up.
Starting point is 00:45:32 So this has just been kind of a summary of Sarah telling you which of these operations that we offer in the library she ended up using, and I just gave it away so you already know. I'm going to skip a few slides because I want to get to the end in time for questions. And, of course, I added the slides that had the word Andy in them, so now I didn't use some of Sarah's slides. So, yes, it is challenging to use persistent memory. We're trying to make it easier with libraries,
Starting point is 00:46:10 but you at least have to comprehend which things are going to live in persistent memory and which things aren't. It's not something you do transparently, at least not in this example, right? We're trying to say to really get leverage, you modify your application. If you want something transparent, then you're down lower in the stack doing this kind of change to middleware
Starting point is 00:46:30 or doing this kind of change in the kernel. It's very critical, though, to be thinking about when something is actually visible versus when it's persistent, like I was saying. You can use our transactions to help with that. And, of course, it's important to consider the state of the persistent memory at any point where a crash could happen. So when I was telling you that Sarah decided not to use our transactions: she allocates something onto a list, and we guarantee a certain initial contents, all zeros.
Starting point is 00:47:01 So she has a valid field, and that valid field is initially zero. Then she goes and fills it in, and she persists it. And then she turns the valid field to one. So in a way, she kind of created her own very lightweight transaction. And on startup, when she's rebuilding the hash table, if she finds anything with a valid field still set to zero, she says, oh, I guess I never finished filling that one in, and she just frees it.
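Her protocol looks roughly like this; a sketch with invented names, leaning on the fact that an aligned 8-byte store is the one thing the hardware does make failure-atomic:

    #include <libpmem.h>
    #include <stdint.h>
    #include <string.h>

    struct cache_entry {
        uint64_t valid;  /* guaranteed zero when freshly allocated */
        char data[256];
    };

    void fill(struct cache_entry *e, const char *buf, size_t len)
    {
        memcpy(e->data, buf, len);
        pmem_persist(e->data, len);  /* make the data durable first */

        e->valid = 1;                /* then flip the 8-byte atomic flag */
        pmem_persist(&e->valid, sizeof(e->valid));
    }
    /* on startup, anything still showing valid == 0 was never finished,
       so it just gets freed */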
Starting point is 00:47:19 So you really have to think that through. Using the NVML library seemed to make it easier. We kind of did some time estimates. We thought if she had to write her own allocator, that's probably the big lift, or if she had to do her own transactions for the persistent version, that's also a big lift.
Starting point is 00:47:39 We thought it would probably have taken her 10 to 12 man months to do that. 10 to 12 Sarah months. Instead, she did it with the NVM library in less than a month. It kind of gives you an idea of what we hope to achieve with this library. We hope to save time. That's the summary. I think I managed to leave myself
Starting point is 00:47:59 with a couple minutes for questions. Any questions? The library is open sourced under a BSD license. I chose the BSD license because you're free to then take the library code or the library itself and put it in a closed source product. You can make a T-shirt out of it, build a bridge out of it. We won't care because it's not a money maker for us
Starting point is 00:48:26 We're not trying to make money from selling the library; we're trying to make persistent memory programming easier. So, yes, sir? That's a great question: wouldn't this be easier if we added a persistent keyword or a transaction syntax to C? And I think eventually that's what's going to happen. I'm not sure it'll ever happen to C.
Starting point is 00:48:55 I expect to see it in higher-level languages first, anyway. But that's a much longer thing. So we wanted to start out making the library handle the interesting operations and get some experience with them. And so today, if you decided, I'm going to have a type of an object in Java that's a persistent object, and you're in there and you're working on the JVM
Starting point is 00:49:16 to make it do that, probably the way that you would make that work is by calling this library, this C library, in your implementation. Right, but it would be very tedious and hard. Yeah, to use this directly, I agree with you. We did make Valgrind
Starting point is 00:49:32 work with it, by the way, and Valgrind is a Linux tool for finding leaks and things like that. One of the things it'll tell you is if you ever write to persistent memory locations and you don't flush, or if you flush multiple times, anything that doesn't seem quite right. So we're working on the tools.
Starting point is 00:49:48 They're out there. They're on GitHub also. But we've got a lot of work to do there. Right? Absolutely. But I see that as a longer term thing because I don't think we know the answer yet. But we are headed there. Definitely. Other questions? Okay, thank you.
Starting point is 00:50:08 Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
