Storage Developer Conference - #41: Breaking Barriers: Making Adoption of Persistent Memory Easier

Episode Date: April 18, 2017

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 41. Today we hear from Andy Rudoff, Architect, Data Center Software, Intel, as he presents Breaking Barriers, Making Adoption of Persistent Memory Easier, from the 2016 Storage Developer Conference.
Starting point is 00:00:49 So I'm Andy Rudoff. I work at Intel. And I'm here to talk more about how we can make persistent memory easier to use. And you've heard a bunch of stuff now about the NVM libraries, if you've been to some of the previous talks. So, thank you. So, you know, I'm just going to, like, cut to the chase. It's about transactions. It's all about transactions. That's the hard part of programming with persistent memory. And so we're trying to do some things to address that. But I just want to be clear: I don't have the answer. I have an answer, and I actually have lots of little answers.
Starting point is 00:01:30 And we're trying to get a lot of things going at once here on a lot of different approaches to see where we end up, which things end up being the things that stick. So when we talk about the NVM library, I just kind of want to make it clear, this is just one idea. If you were in the Druvas talk this morning, that's another idea. And they're both completely valid, compatible ideas.
Starting point is 00:02:01 In fact, you really could technically use them both at the same time. But the point is that I think they're also still a bunch of ideas we haven't come up with yet, right? And so what I'm trying to do is get a lot of these ideas for how to program with persistent memory out into the ecosystem and let the ecosystem do its thing, right? People will try different things. They'll do research. They'll try different ideas. So please don't think that I'm presenting this as the one true solution. It couldn't be further from the truth. This
Starting point is 00:02:30 is a solution. So, you know, from the 50,000 foot level, the way I like to think of it is that for, you know, all of my lifetime and way before that, computers at runtime really primarily put their data in two tiers. It's either sitting in memory or it's out on some sort of storage. And we're all kind of taught this in school, and it's very basic. And in today's architecture, sometimes that storage is used
Starting point is 00:03:00 through a file system, sometimes not. It doesn't really matter. The point is that it's slower out on storage and so you have to kind of bring data in on memory to operate on it. And we're moving into this world where there's three tiers, right? And I'm very adamant about this, that persistent memory is not the replacement for storage. Frankly, it's going to be too expensive initially, right? Because you're paying a premium for it.
Starting point is 00:03:25 So even though we have these kind of exciting announcements about it becoming very large capacity and so on, you're not just going to stop using storage, right? So I'm not trying to replace storage. On the other hand, it's memory accessible, but it's not the replacement for memory. It may not be, depending on the technology, it may not be large enough, it may not be fast enough. We believe it'll be cheaper than memory but it's certainly not a replacement for memory. So that's why I keep making this point when I give these talks about we're moving to three tiers, right? And of course, things that work today like storage, this model here where there's just memory and storage,
Starting point is 00:04:06 this will continue to work and will make it continue to work using persistent memory behind the scenes, right? So there's plenty of transparent ways that you can take advantage of persistent memory. So how can we keep giving all these talks about ways of like hacking up your application to use persistent memory? Because that's probably how you get the most leverage out of it. But we're still going to support all sorts of, you know, legacy models and transparent use cases. I just want to make that clear too that again, we're talking about these ways of programming like all the macros and stuff
Starting point is 00:04:38 that you saw in the previous talk. We're talking about that as a way of exploiting persistent memory but not something that people are forced to move to, right? Okay. So, like I said, you can modify your applications, you can use the programming model itself, the thing that Doug described without any libraries and, you know, what happens? You call mMap and you get this big blob of persistence, right? And so, you know, the announcements from Intel last year talked about having something like 3 terabytes per socket of persistent memory possible. So you can easily imagine on the two socket system, the most popular machine out there in the server space. You can imagine some program calling M-Map and getting back six terabytes. So now what do you do? Right? I mean, just handing somebody a pointer, some
Starting point is 00:05:31 programmer a pointer to six terabytes of persistence is not very friendly. And, you know, they've already had to type M-Map just to figure out what the heck that was, right? Most people don't even know what it is. So it's not very friendly. So we decided, what can we do today without modifying languages and going through language committees and, you know, modifying compilers and stuff? What can we provide today as a way to start being able to use this stuff so that we can now take our time and go through and get the, figure out what is the right way of modifying languages and things like that. So they're not mutually exclusive for ideas. And what we came up with initially is this set of six
Starting point is 00:06:10 libraries. You've seen this slide if you've watched any of the other talks today. It's been up a couple of times. Three of our libraries are transactional and the one that has the most general purpose transactions is this one called libpmemobj. And so, you know, if you want to re-architect your application here, up here, to use the three tiers instead of two tiers, this is meant to help you do that. Okay, so why would you choose to do that? Why would you re-architect an application? Well, one of the things that I look for, for use cases for persistent memory are very large data sets, you know, like terabytes in size data sets. Those things, today they're
Starting point is 00:06:52 sitting in storage and people are paging them into DRAM, right? So that's a nice fit for persistent memory where they can sit there where you can just access them with byte addressable persistence. You need them to be persistent, you need them to be byte addressable, that's what makes all this worth it. Especially if you have lots of small random accesses, because if you think about it, if I have a huge data set, say I have like a big hash table, like I'm a storage vendor and this is a dedupe table, right, those things can be pretty big. But every time I look up something in the hash table, I'm only accessing a few bytes. Well, to do that on a system today where you're paging, every one of those times you touch a page,
Starting point is 00:07:31 you have to bring that whole page in. And the bytes that you don't access, that's wasted bandwidth. But with persistent memory, you get all that bandwidth back because you're touching just these little random things here and there. You're not paging. You're not even imposing a DRAM footprint at all.
Starting point is 00:07:44 OK, so that's one of the reasons reasons if you have a use case like that where you might decide to re-architect your application. What if you want to do DMA directly to your persistence? Today, that's virtually impossible. You do DMA to DRAM and then you DMA from DRAM into your persistence. But with persistent memory, you go right to the persistence. Ooh, that sounds pretty cool. So, you know, bottom line, if you find a performance-critical application that's going to benefit from some of the things I have listed here, that might be enough to motivate you to re-architect your application. But, you know, I don't drop that word, re-architect, lightly. There are plenty of reasons why you might not decide to re-architect your application.
Starting point is 00:08:28 First of all, if one of the many transparent ways to use persistent memory works well enough for you, then why would you spend all that effort to re-architect your application? So, for example, remember I just complained about paging. But, you know, paging, which is something people go to great lengths to avoid right now. If you install one of the major databases on a system today and you call them up and say, my performance is terrible, they'll say, oh, type this command. Are you paging? If so, you're misconfigured. Sorry, we won't even help you. They won't even talk to you, right, because they consider paging like the death of their performance because it's
Starting point is 00:09:05 the surprise I.O. that happens, you know, in the background. But that's going to change because we're talking about the stuff that's gone up three orders of magnitude in performance. So now suddenly paging is not as big of a penalty as it has been for the past 30 years. So maybe paging is now a good solution to that big multi-terabyte data set. So you still keep it out on storage. That's all going to work with persistent memory. It's just going to be transparent. So no work, no change to your app. So maybe that's good enough for you. Or, you know, people will emulate block storage on top of persistent memory and it'll be fast, right? So maybe that's fast enough. Or maybe some middleware is doing the stuff
Starting point is 00:09:47 that I'm about to talk about. It's, they modified, you know, their middle library layer to use persistent memory and you build on top of it. Like maybe somebody modified the Java virtual machine to use persistent memory to make what it does faster. And you're just writing your Java app and you don't care about whether it uses persistent memory or not. Perfectly reasonable explanations of why you might decide not to re-architect your application. But bottom line, again, it's when the cost of re-architecting outweighs the benefit.
Starting point is 00:10:14 And the cost is not just doing the architecture or figuring out where you're going to place data in these three tiers, but the design and the implementation. And of course, the big cost turns out to be validation. So we have these major database apps, for example, that are very careful about when they'll make changes. And they have massive validation cycles. So they're not just going to say, oh, putting this one
Starting point is 00:10:37 data structure in persistent memory does make it faster, and then ship it. They're signing up for a huge amount of validation and support if they do that rearchitecture Okay, so let's take an example now This is a little bit of a kind of a narrow down contrived example Just to serve my purposes of when you might decide that you have an application that you you want to try and use with Persistent memory, so I've taken this example. I kind of haven't named any real applications here. But one you'll probably will come to mind right away
Starting point is 00:11:10 if you're thinking about it. Let's take a database-like application. It's doing transactions. Databases do transactions, right? And it's doing these transactions to its tables. And the tables might actually be in memory. They might actually be sitting in DRAM. So they're very fast Right. So, hmm great. So now we have an application. It's actually quite fast
Starting point is 00:11:32 But you know, it also does care a little bit about persistence So as it's making these these transactions in its tables, it's also doing right-ahead logging It's got a little log that it writes everything to. And you know, like a lot of databases, this log is written and basically never read. Why does it do that? Well, it does it in case of a crash, right? The log is needed for recovery,
Starting point is 00:11:56 but recovery is the.0001% case, right? The most common case is everything's brought up normally, everything's shut down normally, so you write the log all the time and you never bother to use it. And so you're basically, like in Redis, this is called the append-only file. If you guys are familiar with Redis, it's called AOF, append-only file. It's basically a log that just gets appended to constantly. And, you know, the path, if you think about it right here, here's the application,
Starting point is 00:12:21 and he's running, he's going to append something to the log. It goes down into the kernel. It goes through the file system. I'm kind of skipping a lot of layers here, it goes through the block stack and everything and it ends up appending to this little log file here. You know, it might end up writing to metadata because, you know, it's an append so it's maybe allocating other blocks and having to change the file size, things like this. So each one of these things, these appends could take quite a trip through the kernel. So you might think, ah, this is a good example of something where we could say, hey, you know this log file here, let's move that into persistent memory.
Starting point is 00:12:56 And then we'd save ourselves that trip into the kernel. Okay? So you might start out and say, oh, well, I saw this little talk from Andy that said there's some library that allows me to do this. So you go and you look up the libpmemobj API, pmem.io, all the man pages are there and everything. You say, oh, yeah, I can do this. And sure enough, we have our little library there. It is pmemobj.
Starting point is 00:13:16 It's built on top of another little library for flushing stores. And every one of these appends to the log just becomes a little transaction. And now it's much faster, but again your app had to change. So you're saying, gee, thanks Andy, that's faster, and you made me change my app. That's not very nice. And besides that, as somebody pointed out during the previous talk, these macros that you end up using are quite ugly and, you know, not that friendly. And so, you know, before making this change, you just opened up your log file and did a write and did a sync and you're done. That was the append. But now,
Starting point is 00:13:56 you're saying, well, okay, these guys told me this is really cool stuff so I'm going to use it and you had to use this like TxBegin macro, and then you did all the stuff here to make your little, you know, copy your changes to the end of the log and TxEndMacro. And it is cool. It did work. It's now PowerFail safe. But, you know, it is complicated and it's not something that programmers, your average programmer is really going to love using. And, you know, this is kind of intentional. Like I say, we started out what could we implement today without modifying the compilers. So that's why we ended up with a lot of these C macro things going on here. It's meant for early adopters. It's meant for language implementers, right? The guy, we have a guy right now who's putting
Starting point is 00:14:41 transactions into Python, persistent memory transactions into Python. And he has a persistent memory aware dictionary in Python. And when you set a value, it happens transactionally. This is what it's doing down in the C code of the Python interpreter. But the guy writing Python has no idea. He just says, oh, I assigned something to this dictionary and it's transactional. That's my intention, right? That this is like the implementer's language,
Starting point is 00:15:07 like the guy who's down in the nuts and bolts. Probably day to day, I'm not sure that it matters, whether this looks ugly when it has macros and things like that, because day to day, I expect people to be using higher level languages. Okay, but could we do better than this? Like, this is our libpmmobs.jpi right here. Could we at least make it look a little more like this
Starting point is 00:15:26 just to help ease the adoption? Well, I think we could. And so we're working on another library, a seventh library. Remember, the library had six libraries in it. We're working on a seventh library now, which I call libpmfile. And it's libpmfile because it's basically the
Starting point is 00:15:45 file paradigm, the file accesses, only it's all in your little user space pool of persistent memory. And so instead of calling open, the application calls pmemfile open, but with the same arguments. Instead of calling write, he calls pmemfile write. Should there be a pmemfile sync? Well, you know, our library is transactional. Why don't we just make everything transactional because it's not like it's in memory, right? It's in persistent memory. It's not like I'm going to say, oh, I'm not going to write it out. It is written out. Persistent memory is right in place, right? If I just move a byte into persistent memory, it's persistent. So So I guess I don't even need a sync.
Starting point is 00:16:27 But I could provide one that just is a no-op, just so that people... I haven't decided. It makes you feel good. It's like the guy who types sync three times. Which is, I can't type sync once, I have to type it three times, because somebody told me a long time ago that you used to have to do that. It's probably all apocryphal. Okay, so this is now maybe a little easier, right? So it's easier for people to understand because they grew up knowing about how these system calls worked,
Starting point is 00:16:58 and now they can see their PMM file versions here. So that's not so bad, and we just build it right on top of libpmimobj. So internally, libpmimfile is just using the transactions of libpmimobj. So now I just have this stack of three little libraries here. And they look bigger than they are. They're actually pretty, the code paths are pretty short. You know, the Linux version of this would be modeled after POSIX. It's a familiar API. But the app still had to change.
Starting point is 00:17:27 So maybe I made it a little easier. Maybe I lowered the barrier to adoption a little bit. But remember, if you change one line of code, if you change a comment, you still have to revalidate. And that's what's pissing people off. So how can I do this without pissing people off so much? Not pissing people off is a good goal. So one way I could do this is to try and use libpmem file transparently.
Starting point is 00:17:58 So we already do this for our volatile libraries. We have a library for persistent memory for volatile usages, right? And so why would you do that? Well, you know, like I said before, persistent memory is expected to be cheaper than DRAM and, you know, terabytes in capacity. So you might actually not care about the persistence. You just want to use it as another tier of volatile memory. So we have a library that does that and it has has a libvm malic, and it has a libvm free, and they look just like the libc malic and free, only they work on a different pool of memory.
Starting point is 00:18:34 Well, that's pretty cool. But one of those six libraries, and I guess I should maybe have said this when I was back on this slide, one of these six libraries here, libvm malic, just makes this happen transparently. So if you take any application and you use the linker magic, the LD preload environment variable,
Starting point is 00:18:54 you can have all of its malik and free calls kind of magically replaced here. So you can just take an application that, without modifying it, you can just make the binary put all of its dynamic memory into persistent memory, right? So again, that's the cheaper memory so it doesn't impose such a big footprint on your more expensive DRAM. So taking that as a model, I'm thinking, huh, could I do that for libpm file?
Starting point is 00:19:20 I mean, it wasn't too hard to just interpose on, you know, we had to do what, malloc, free, realloc, posix, memeline, there were a couple of others, calloc. But it wasn't hard. In fact, JEMalloc, a popular allocation library, is actually what's inside this. We just used JEMalloc. We didn't have to write a new volatile memory allocator. And it already had this kind of interposing ideas. Sometimes people do this with JEMalloc. They take a binary that was written to use libcMalloc and they interpose
Starting point is 00:19:54 on it. Okay. So if I can do that, couldn't I do that here with my libpmim file. Just using some sort of linker magic, I could preload these libraries onto a binary without modifying it. And the linker magic helps with interception. So let me say a couple of words about that. A long time ago, in the before time, when men were men and we didn't have dynamic libraries, everything was built statically. And libraries were first invented in Unix anyway, when shared libraries were first invented.
Starting point is 00:20:36 Of course, they existed on other systems already, like VMS had them forever and things like that. So this guy who worked at Sun at the time saw shared libraries as this great solution for doing all sorts of things we never dreamed of before. And one of them was called building an interposing debugger. And he was going to make it so that you could build debuggers that interposed on all these library functions, right? And the way that that's why if you look at LD.SO today, it has interfaces for looking up functions and interposing all these RTLD interfaces, DLopen, if you guys ever played with these. And what really happened is it kind of turned
Starting point is 00:21:16 out to be not that interesting. I don't know of any interposing debuggers today. I don't know of anybody who even uses library interposing for anything other than like a couple of fringe cases. And a few years ago, the guys who maintain the GNU libc, the one that comes on Linux, kind of said, boy, you know, if somebody interposes on our library and changes something like that's an internal library call, they would then report a bug and it's not our bug, it's their bug. So they disabled it. So in other words, today in the GNU libc, if you call, let's see, what calls open? So if you call opender, it has internally a call to open. Open is a system call. And if you say, wow, I really want to interpose on that, the library doesn't allow it. Right? What happens is it actually has
Starting point is 00:22:10 an inline call to open inside of OpenDir and prevents you from interposing it because they don't want you to break their OpenDir and then blame it on them. So that was a long story. But the point is that the interposition thing sounds like a pretty good idea, but there are some obstacles. However, it's very compelling because, first of all, you can just do this transparently to the app. The app, the binary is unchanged. And what you do is you just make it an administrative configuration thing, right?
Starting point is 00:22:43 So libpm file, when it starts up, it looks for some sort of configuration thing, right? So libpm file when it starts up, it looks for some sort of configuration information which we actually point it to with an environment variable. And it says things like, well, when this application does most opens, it should just go right down into the kernel. It should go to the normal thing. So if I just open up etsy password, I actually just call into the kernel and the normal thing. So if I just open up ETSI password, I actually just call into the kernel and the normal path happens. But if I open up, you know, slash dollar PMEM slash mylog, that's the got the special token in it and that tells lib PMEM file when you've configured it correctly to put that into persistent memory, okay? So
Starting point is 00:23:22 that's the idea of using the interceptor. And, you know, there's more to it than just saying you didn't have to modify your binary. What if you're thinking of re-architecting your app like I showed in the first couple of slides? But again, it's quite an investment in time. Wouldn't it be nice if you could just try it out? Say, well, you know, I can think of a couple of things that I might re-architect to use libpm to store them. Today, they're storing that stuff in files. So why don't I just do this, you know, in my normal test environment and try it out, do some performance measurements. And if I
Starting point is 00:24:01 find out that, boy, when I use this libpm file, things got a lot faster, then you might think, OK, it's probably worth me doing the re-architecting work. On the other hand, if you put this in, and there was no difference because actually the bottleneck was somewhere else, it's good to know. And you didn't have to do any app re-architecting or revalidating. So this is kind of the great try it know, try it before you buy it mechanism.
Starting point is 00:24:25 Are you buying it? Yeah? Okay, maybe a little. And there's one thing that's even cooler than this. libpmemobj, this library right here, has already built into it the ability to replicate. So today, the shipping libpmemobj, you can write to the API, that transactional begin and API that I mentioned to you before. You can write to that API.
Starting point is 00:24:56 And the application, without changing, you can make some configuration steps, some administrative steps, and it replicates for you. Now, the shipping library today will replicate between two persistent memory files on the same machine. And then we actually, these are out on GitHub already. We just haven't tagged it as a release yet because we're still testing it. But we've added the ability to replicate over RDMA to a remote node. Okay, so again, the API is the same.
Starting point is 00:25:29 It's actually transparent to the app. Whether you replicate or not is transparent to the app. That sounds kind of like the way we do with, you know, with storage. Today with storage, you can just write an app and it thinks it's talking to files on a local disk and then you install a RAID stack and the app doesn't change. It still thinks it's talking to files on a local disk but somebody's actually replicating. So we kind of have the same model here. And so since that works for libpmemobj, it works for a library built on libpmemobj. So my little file thing, not only did it transparently start intercepting everything this application did, but you also get this replication feature for free. Ooh! A hush falls over the crowd. That wasn't a very good hush.
Starting point is 00:26:13 Did you say for free? Yeah, there you go. See, now you guys had two people up here when you were doing your little skit. You know, I just have me. Yeah, thanks. I appreciate that. It's definitely not as cute as the little skit you guys are doing. Okay, so let's talk a little bit more about that interception. Some things just work. They were actually quite easy to get working. Open, close, read, and write. It was easy. Our interception thing worked fine. We were able to get those working pretty much immediately, either by using linker magic or by actually figuring out where the calls are
Starting point is 00:26:53 and just literally patching the code to jump to us. A little gross, but we got it working. A few functions were a little more challenging, but we got them working. So like the OpenDir example I talked about, again, we did get that working, but we got it working by finding the calls, the actual calls that do the system calls inside of libc and then getting control there. Little tricky, but we got it working.
Starting point is 00:27:19 What about version, changing version? Yeah, we didn't do it by like, I mean, we're not like hard coding offsets or anything. We did it by looking for instructions, we actually. Turns out there are a few libraries of people who've already had to do this for debugging and tracing and performance counters. So we leveraged some of those things.
Starting point is 00:27:38 It's still a little gross though, just a little gross. Okay, so maybe it's a lot gross. But actually, it's pretty cool. So the application sees normal files and directories, unless they start with the magic token, and then sometimes those files live in p-mem pool. So if you have a database program, most of those have some sort of config file. MySQL, go look at the comp file. Redis, go look at the comp file.
Starting point is 00:28:06 Whatever. Most of these things have a config file. So all we do is we just say, well, those things that I think are good candidates for living in persistent memory in their config file, we have put the little $pmem thing in their path. So it's pretty easy. Binaries didn't have to change.
Starting point is 00:28:21 So done, right? Piece of cake. Ship it. Or are we done? So, there are a few operations that are kind of problematic. And the worst one is fork when you're not execing. So, here you are. You've got this lovely program binary. It's magically thinking that some of these files are on storage when they're in persistent memory and it decides to fork. So now you have these two copies of it.
Starting point is 00:28:50 They're both kind of operating on the same PMEM pool. They're interfering with each other. It's ugly, right? So we do catch that fork and we try and do the most reasonable thing if only to like error out if it's not safe. But we do have ideas of how we could actually make the library survive before it can continue to work. But so far, we haven't been very motivated to get that working. This is a lot of work. This is our ugliest work right now. Another thing if you think about it, if you're intercepting file operations, where does the
Starting point is 00:29:27 file descriptor come from? If somebody calls open but I didn't do a real open, what file descriptor do I hand them? So yeah, this is pretty ugly. And it turns out, you know, we do a lot of tricks to sort of make up file descriptors that aren't going to get you in trouble and are going to error out if you try and use them for something else. Of course, since they're not real file descriptors, what happens if you call select on them? Well, it turns out that calling select or pull on a file isn't really very common. It's very common on a socket or even sometimes on devices. But it's who calls select on like just a normal file?
Starting point is 00:30:07 That's not so common. So we haven't run into it a lot. And if we did, it would say, I'm sorry, I can't do select on this. Right? You'd get an error. So at least we don't pretend it works when it doesn't. Well, what about MMAP? Huh.
Starting point is 00:30:23 Now, that's starting to hurt my brain because, of course, we got the persistent memory by opening up a file on a persistent memory-aware file system. Our library, libpmemob, called MMAP and got this big giant range of memory. And then he built his own little pretend file system in there that libpmemfile now owns. And now you're using that API, transparently, and you call mmap. Gah! What should we do? So, you know, if we page align everything, if we make our library smart enough to page align all of user data,
Starting point is 00:31:01 the pmem file user data, we think we probably could make mmap mostly work. But right now we just error out. We just say, sorry, it's not a good candidate. So if your application, your unmodified application, if you give it a path name for one of the files and it calls MMAP on it, it'll just error out. So this is something, again, we kind of have a vision of how it could work, but I'm just not sure we should spend the time to make it work.
Starting point is 00:31:28 Think about MMAP with copy on write and stuff like that. It really starts to hurt your brain. How about AIO? So AIO is this set of interfaces that supposedly go off and do the IO and then give you a callback when it's done, right? And so you can launch a whole bunch of asynchronous IOs at once and then get these notifications back. They work on files. All of our IO is synchronous because, again, there's nothing to wait for. The CPU has to put the byte in somewhere and then there's no completion. You're done once you put the byte there. So what would AIO do? Well, what it would probably do is call back immediately. In other words,
Starting point is 00:32:09 you do an AIO write and in AIO write, we go, well, I copied the byte, I'm done and you call the callback. But the guy who called you is still waiting for the return from AIO write. So that doesn't happen today in normal AIO. And when we do it, you get this kind of stack going down and down because the application is too stupid to do anything about it. He's like, oh, I'm done. I'm going to call AIO right again. And he just calls it again and again and again.
Starting point is 00:32:32 So it's not pretty. And we could do all sorts of nasty things like saying, well, I'm going to call it later. There's no good solution to this. We talked about everything, honestly. But they're all so ugly that we just decided that these things should just also be not supported by our library. So we air out on AIO for now unless somebody tells us. And then there are a few rare syscalls. I was actually surprised to find some of these syscalls. I'd never heard of them. You know, I went through the syscall table and I'm like, what is this? What's renameat2? Oh, renameat2.
Starting point is 00:33:06 Okay, yeah, it's not that, not renameat. So anyway, some of these things we were able to figure out and just make them work and some of them are kind of too weird. And then, you know, so part of making this work is going to have to be some sort of tool that helps you decide whether it's appropriate for your application or not. But it's certainly starting to sound a lot more like either you should be calling our API so that you're responsible for what you're calling or you should be using this transparent thing as a try before you buy mechanism. Would you actually ship an application where this
Starting point is 00:33:38 is the transparent solution? I don't know. We'll have to think about that one, but we'll see. Multi-processing is kind of a big open for us. So all of the libraries, all of the NVM libraries right now that are transactional, they don't support multi-process access. They support multi-threaded access. So if you open up one of our PMEM pools with one of our libraries and you fork a thousand threads, it will work and we spend a lot of time making sure it's not just multi-thread safe, it's empty hot. We really are optimized for multi-threads.
Starting point is 00:34:15 Most of our threads can do allocations of persistent memory without grabbing a lock because we keep little caches of per thread caches and stuff like that. But for multiprocess, there's a lot more work to it, a lot more cache flushing and declaring things volatile and all this kind of stuff that has to work between multiple processes. And we haven't been convinced that there's enough need for it yet. Now, I say that, and then yesterday, on Sunday, we were
Starting point is 00:34:40 doing this tutorial about programming with NVML. And somebody in the audience came up with a decent use case for one application, right? But it's, so that's one and I have maybe one other application that where people have told me, oh, we have to have multiple processes access to the same pool. So, I'm still not compelled enough probably to go and do all this work to make multi-processing work. But it's out there. And this is a limitation of all of the libraries, right? So libpmem file suffers from it, just like libpmem-obj does. So if you have requirements for multiprocess access for persistent memory, drop me a line,
Starting point is 00:35:15 because I'd like to hear more about it. I want to keep considering it. At some point, I'll build up a big enough list. It'll push me over, and I'll say, OK, we got to go do this. It's not that we don't think it can be done. It's just that we think it's a lot of work, especially in validation. Yes, sir? Are you documenting somewhere on the site? Yeah. So on PMEM.io, we have a blog area and we have a blog entry on how we might
Starting point is 00:35:37 make multiprocessing work, for example. So in that particular one, we have a blog about it. But there's also an issues database from Kipa, an open. In fact, I think it is one of our open issues right now. So we're just using GitHub issues, you know. Yeah. Anyway, like I said before, I think the key here is to make sure we don't silently do the wrong thing, right? So if the PMM file is going to be useful to someone, it should be sure to bail out if they do something that doesn't support. So that's kind of our, one of our primary design goals here. .
Starting point is 00:36:12 Well, unfortunately, I mean, the whole point is that you're back here. This is a binary that you didn't compile, right? So we thought about making a tool that actually says, oh, I see you're using this syscall, right? So we thought about making a tool that actually says, oh, I see you're using the syscall, right? In other words, like it scans through your binary. The problem is that may have nothing to do with like the one file you're trying to put into PMEM, right? So we're still kind of noodling on that one. Right now, it just bails out at runtime. So in other words, the advice is get your test case ready, then put this library in place and run your test case. And if it bails out, then sorry. I mean, we'll
Starting point is 00:36:51 tell you why we bailed out, but then you'll learn. The question is now, okay, what if it didn't bail out during your test case, but then you actually try and use it in production, right? So that's why I say we're trying to think that one through a little more. You might be interested in your library within two slides. Once you start testing only, then you realize the last slide that all these things doesn't work. Yeah. So, you know, it's really, it's an error-prone part of using this interception thing. And, you know, a lot of times when I tell you, you guys haven't done the shutter yet, but you will if you think about it long enough. A lot of times when I describe the
Starting point is 00:37:27 interception thing, people are just like, what? You can't do that. And well, it turns out, you know, there's really no documented supported way on Linux to do this kind of interception, right? I mean, the closest thing we have is like Ptrace, right? Something that's meant for debugging, but it's just not performant, right? And there's also a, what is it called? Secmod or something? It's a set of auditing. It's basically for auditing or things like that.
Starting point is 00:37:55 So it's not quite right for this, because those things are done by another process. They invoke a context switch. You know, you do one context switch, you're done, right? Because the whole point here was to avoid all that and just do everything with this little short little code path that just goes directly down into your persistent memory. So we're still noodling over this and it may be that we're able to come up with a better, more reliable interception mechanism but only by patching the Linux kernel and adding some sort of new capability. I actually have a
Starting point is 00:38:26 bunch of ideas on how to do that, but we haven't implemented any of them yet. Okay. So what we did do was we do some combination of the linker magic and patching. And like I say, ld.so and libc try to protect the app from unexpected behavior and we have to overcome that a little bit. But there's really no well-specified thing. By the way, in Windows, there is a well-specified documented way of patching and grabbing these things. It's called... Yeah, I forget. Anyway, it's documented on MSDN. It says this is how you intercept things and this is how you chain them. Because you remember the old DOS days, right? That's how you did everything. You intercepted the little ints, the int21s. So it is documented. And
Starting point is 00:39:16 so when we were considering this library, I went over to Paul's team and said, hey, this will be easy on Windows. And they did the shutter. They're like, no, nobody uses that. Don't use that. You know, so I think it's just as bad even on Windows, unfortunately. However, if we can just make a solution that we think is good enough, you know, like frankly, I think the open, close, read, write interception is good enough for a lot of these cases I'm talking about. The Redis append-only file, it does nothing but open and write. It really does
Starting point is 00:39:49 nothing else. It's very simple and it works and we've done that with our library very easily. But, you know, the thing to beat is it's all about performance, right? We wouldn't do any of this if it's not a performance benefit. So the thing to beat is just using a real file system. We have file systems that we can, they're P-MEM aware already, right? EXT4, XFS, NTFS, these things already understand persistent memory. So if you put them in this DAX mode and say, you know, now the application is just going to run, the binary is going to run, It's transparently doing what it's doing. That should just work.
Starting point is 00:40:26 Or we also have the ability to make persistent memory look like a block device. And then we can just put any file system on top of it. But of course, you're back to having a memory footprint when you page things in. But this one didn't have a DRAM footprint when we turned on DAX mode. So if these things are fast enough,
Starting point is 00:40:42 forget the PMM file. Right? That's the thing to beat. Now, if these things are fast enough, forget libpm file. That's the thing to beat. And the code paths for things like append are really the issue here. They go very deeply through the file system code. They do metadata updates.
Starting point is 00:40:56 They do multiple updates, multiple syncs for some operations. Whereas with libpm file, these things just turn into loads and stores and some cache flushes now and then in user space. And coming soon to an Intel platform near you are some cache flush instructions that are more optimal than CL flush.
Starting point is 00:41:17 And so that'll make this even faster by quite a bit, we believe. So we did some proof ofof-concept results. This is kind of an eye chart, so I won't spend too much time on it. It'll be online in the slides that we deliver. But the point is that over here we have a file system. Here's our first try at PMEM file.
Starting point is 00:41:39 It was a little faster than ButterFS when we tried it, but we were really micromanaging all of the allocations and making everything persistent, and we realized we could actually kind of do some more clever allocations where we do bigger allocations. And here's XFS and here's the XT4 and then here's what PMEM file can do today. So this is 64K sized appends to a file and this is how many we can do per second with a little microbenchmark. Okay? And this, we got it up to, you know, 18,000 where the first thing we tried at here was below 2,000. So, we do think that there's some value here. But we're worried about the
Starting point is 00:42:19 interception part because it's error prone trade off, right? And I have a similar set of results here for if you're not appending but you're just updating the file. And so it's not as dramatic but you can see we still were able to get it out on top. So what am I really trying to do here with my seventh library in the NVML suite? Well, you can take an unmodified application and you can just run it on top of persistent memory using the emulated block driver and everything is cool. You will use some DRAM buffers to copy into the non-volatile memory and you're basically
Starting point is 00:43:02 at microsecond level application latency. If you think about like doing a kind of a one 4K block thing, we're in the microseconds, right? On the other hand, you could come over here and use libpm, sorry, you could use our NVML APIs. This means you re-architect for PMM. And now you're doing everything in place. You're not doing these little IOs anymore.
Starting point is 00:43:27 And now you're at the lowest app latencies you can have. I almost put a number here, but the marketing guys made me take it off. But let's just say that it's order of magnitude memory speeds, so pretty fast. So that required you to redesign your application. But then over here is what I've been trying to target with today's talk.
Starting point is 00:43:52 Is there a happy medium between this and this? This is a new library. It emulates the file APIs so the application continues to be unmodified. And now you're back to the low app latency. Not as low as this, but lower than this. Right? So it's somewhere in the middle. And as a bonus, you inherit things like replication that libpm-mops did for you. Ooh. The replication is actually pretty cool. So we have a lot of... Yes, sir? How much is the overhead of the replication?
Starting point is 00:44:28 The overhead of the replication is significant because of RDMA round trip times. And come to Chet Douglas' talk tomorrow. He'll tell you exactly how much that is and what we're going to do to make it better. See, there's a good plug for Chet's talk. He's not here now, but he'll be here in a moment. So we have a lot of ideas for how to use persistent memory transparently, and a lot of them are implemented and they're going to work.
Starting point is 00:44:54 So when we put up this ugly code, these APIs and things like that, don't be afraid. There are lots of ways of using persistent memory without even modifying your app. But we describe one idea here, and there are lots of others coming. And we think that these ideas, these transparent ideas, lower the barrier to adoption, especially for, I call them, tier two applications, right? The people who have applications, but they don't have rows of software engineers to spare to re-architect them.
Starting point is 00:45:28 The big companies are already figuring out what to do for persistent memory. Nobody is claiming they have the one true answer that I'm aware of, of how persistent memory programming should work. I think there are a lot of multiple competing ideas, but I think we want to encourage that. I think we all benefit from that. And we definitely want to get experience with a lot of these different solutions before maybe we settle on a couple of key solutions. I really like the idea of being able to try out persistent memory before we are kicking your app.
Starting point is 00:45:58 So watch for libpmem file. Sometime next year, I didn't want to put a date on here because I really do want to spend some time figuring out how to make the interception less error-prone. It scares me. That's the part that scares me. The performance benefit looked cool though, right? So that's where I am. So any questions?
Starting point is 00:46:15 Libpmm file? MDML? Yes, sir? I'm curious what exactly goes wrong with fork. What exactly goes wrong with fork is two different things. One is our library isn't multiprocess aware right now. And so when it forks, we say, oh, I'm sorry, because we really just don't do the right things in our transactions to account for another process.
Starting point is 00:46:43 Even if we fix that, there are cases like Redis does a very interesting thing. It wants to take a snapshot so it forks and the one process just keeps going but it's marked all the pages copy on right. And so it just keeps going and the other one has the frozen bunch of the pages and it goes and makes a big backup. So it's the copy on write thing that, you know, if it's trying to do that with one of our files, it's not going to get what it thinks. But that case would have been, you know, probably disqualified by one of the other things on my slide. But so it's kind of those two things there, that there's the semantics of what happens after a fork without an exec, and then there's just the multiprocess thing.
Starting point is 00:47:27 Any other questions? All right, it's the end of the day. Thank you. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at sneha.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference,
Starting point is 00:47:57 visit storagedeveloper.org.
