Storage Developer Conference - #41: Breaking Barriers: Making Adoption of Persistent Memory Easier
Episode Date: April 18, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 41.
Today we hear from Andy Rudoff, Architect, Data Center Software, Intel,
as he presents Breaking Barriers: Making Adoption of Persistent Memory Easier,
from the 2016 Storage Developer Conference.
So I'm Andy Rudoff. I work at Intel.
And I'm here to talk more about how we can make persistent memory easier to use.
And you've heard a bunch of stuff now about the NVM libraries,
if you've been to some of the previous talks.
So, thank you. So, you know, I'm just going to cut to the chase. It's about transactions. It's all about transactions.
That's the hard part of programming with persistent memory. And so we're trying to
do some things to address that. But I just want to be clear: I don't have the answer.
I have an answer and I actually have lots of little answers.
And we're trying to get a lot of things going at once here on a lot of different approaches
to see where we end up, which things end up being the things that stick.
So when we talk about the NVM library,
I just kind of want to make it clear,
this is just one idea.
If you were in Dhruva's talk this morning,
that's another idea.
And they're both completely valid, compatible ideas.
In fact, you really could technically use them
both at the same time.
But the point is that I think there are also still a bunch of ideas we haven't come up
with yet, right? And so what I'm trying to do is get a lot of these ideas for how to
program with persistent memory out into the ecosystem and let the ecosystem do its thing,
right? People will try different things. They'll do research. They'll try different ideas.
So please don't think
that I'm presenting this as the one true solution. It couldn't be further from the truth. This
is a solution. So, you know, from the 50,000 foot level, the way I like to think of it
is that for, you know, all of my lifetime and way before that, computers at runtime really
primarily put their data in two tiers.
It's either sitting in memory or it's out on some sort of
storage.
And we're all kind of taught this in school, and it's very
basic.
And in today's architecture, sometimes that storage is used
through a file system, sometimes not.
It doesn't really matter.
The point is that it's slower out on storage, and so you have to kind of bring data into
memory to operate on it.
And we're moving into this world where there's three tiers, right?
And I'm very adamant about this, that persistent memory is not the replacement for storage.
Frankly, it's going to be too expensive initially, right?
Because you're paying a premium for it.
So even though we have these kind of exciting announcements about it becoming very large capacity and so on,
you're not just going to stop using storage, right?
So I'm not trying to replace storage.
On the other hand, it's accessible like memory, but it's not the replacement for memory.
Depending on the technology, it may not be large enough, it may not be fast enough. We believe it'll be cheaper than memory, but it's certainly not
a replacement for memory. So that's why I keep making this point when I give these talks
about we're moving to three tiers, right? And of course, things that work today like
storage, this model here where there's just memory and storage,
this will continue to work and will make it continue to work using persistent memory behind
the scenes, right?
So there's plenty of transparent ways that you can take advantage of persistent memory.
So why do we keep giving all these talks about ways of hacking up your application
to use persistent memory?
Because that's probably how you get the most leverage out of it. But we're still going to support all sorts of,
you know, legacy models and transparent use cases. I just want to make that clear too
that again, we're talking about these ways of programming like all the macros and stuff
that you saw in the previous talk. We're talking about that as a way of exploiting persistent memory but not something that people are forced to move to, right? Okay. So, like I said, you can modify your applications,
you can use the programming model itself, the thing that Doug described without any
libraries and, you know, what happens? You call mmap and you get this big blob of persistence,
right? And so, you know,
the announcements from Intel last year talked about having something like 3 terabytes per
socket of persistent memory possible. So you can easily imagine, on a two-socket system,
the most popular machine out there in the server space, some program calling mmap and getting back
six terabytes. So now what do you do? I mean, just handing some programmer a pointer to six
terabytes of persistence is not very friendly. And, you know, they've already had to type
mmap just to figure out what the heck that was, right? Most people don't even know what it is.
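For reference, the raw model being described is just this, a minimal sketch of mapping a file on a pmem-aware (DAX) file system, with error handling trimmed and the path up to you:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* A minimal sketch of the raw programming model: open a file on a
 * pmem-aware (DAX) file system, mmap it, and you're handed a pointer
 * to a big blob of persistence. No library, no structure. */
void *
map_pmem_blob(const char *path, size_t *sizep)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    *sizep = st.st_size;

    /* With DAX, loads and stores through this pointer reach the
     * persistent media directly; there's no page cache copy. */
    void *addr = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);

    close(fd);    /* the mapping remains valid after close */
    return addr == MAP_FAILED ? NULL : addr;
}
```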
So it's not very friendly. So we decided, what can we do today without modifying languages
and going through language committees and, you know, modifying compilers and stuff? What
can we provide today as a way to start being able to use this stuff so that we can now
take our time and go through and figure out what is the right way of modifying
languages and things like that. So they're
not mutually exclusive ideas. And what we came up with initially is this set of six
libraries. You've seen this slide if you've watched any of the other talks today. It's
been up a couple of times. Three of our libraries are transactional and the one that has the
most general purpose transactions is this one called libpmemobj. And so, you know, if
you want to re-architect your application here, up here, to use the three tiers instead
of two tiers, this is meant to help you do that. Okay, so why would you choose to do
that? Why would you re-architect an application? Well, one of the things that I look for, for
use cases for persistent memory
are very large data sets, you know, like terabytes in size data sets. Those things, today they're
sitting in storage and people are paging them into DRAM, right? So that's a nice fit for
persistent memory where they can sit there where you can just access them with byte addressable
persistence. You need them to be persistent, you need them to be byte addressable, that's what makes all this worth
it. Especially if you have lots of small random accesses, because if you think about it, if
I have a huge data set, say I have like a big hash table, like I'm a storage vendor
and this is a dedupe table, right, those things can be pretty big. But every time I look up
something in the hash table, I'm only accessing a few bytes. Well, to do that on a system today where you're paging,
every one of those times you touch a page,
you have to bring that whole page in.
And the bytes that you don't access,
that's wasted bandwidth.
But with persistent memory, you get all that bandwidth back
because you're touching just these little random things
here and there.
You're not paging.
You're not even imposing a DRAM footprint at all.
OK, so that's one of the reasons, if you have a use case like that, where you
might decide to re-architect your application. What if you want to do DMA directly to your
persistence? Today, that's virtually impossible. You do DMA to DRAM and then you DMA from DRAM
into your persistence. But with persistent memory, you go right to the persistence. Ooh,
that sounds pretty cool. So, you know, bottom line, if you find a performance-critical application
that's going to benefit from some of the things I have listed here, that might be enough to
motivate you to re-architect your application. But, you know, I don't drop that word, re-architect,
lightly. There are plenty of reasons why you might not decide to re-architect your application.
First of all, if one of the many transparent ways to use persistent memory works well enough for you,
then why would you spend all that effort to re-architect your application?
So, for example, remember I just complained about paging.
But, you know, paging, which is something people go to great lengths to
avoid right now. If you install one of the major databases on a system today and you
call them up and say, my performance is terrible, they'll say, oh, type this command. Are you
paging? If so, you're misconfigured. Sorry, we won't even help you. They won't even talk
to you, right, because they consider paging like the death of their performance because it's
the surprise I.O. that happens, you know, in the background. But that's going to change
because we're talking about the stuff that's gone up three orders of magnitude in performance.
So now suddenly paging is not as big of a penalty as it has been for the past 30 years.
So maybe paging is now a good solution to that big multi-terabyte data set. So you
still keep it out on storage. That's all going to work with persistent memory. It's just
going to be transparent. So no work, no change to your app. So maybe that's good enough for
you. Or, you know, people will emulate block storage on top of persistent memory and it'll
be fast, right? So maybe that's fast enough. Or maybe some middleware is doing the stuff
that I'm about to talk about. It's, they modified, you know, their middle library layer to use
persistent memory and you build on top of it. Like maybe somebody modified the Java
virtual machine to use persistent memory to make what it does faster. And you're just
writing your Java app and you don't care about whether it uses persistent memory or not.
Perfectly reasonable explanations
of why you might decide not to re-architect your application.
But bottom line, again, it's when
the cost of re-architecting outweighs the benefit.
And the cost is not just doing the architecture
or figuring out where you're going to place data
in these three tiers, but the design and the implementation.
And of course, the big cost turns out to be validation.
So we have these major database apps, for example,
that are very careful about when they'll make changes.
And they have massive validation cycles.
So they're not just going to say, oh, putting this one
data structure in persistent memory does make it faster,
and then ship it.
They're signing up for a huge amount of validation
and support if they do that re-architecture.
Okay, so let's take an example now. This is a little bit of a narrowed-down, contrived example,
just to serve my purposes, of when you might decide that you have an application that you want to try and use with
persistent memory. So I've taken this example; I haven't named any real applications here.
But one will probably come to mind right away
if you're thinking about it.
Let's take a database-like application.
It's doing transactions.
Databases do transactions, right?
And it's doing these transactions to its tables.
And the tables might actually be in memory.
They might actually be sitting in DRAM, so they're very fast, right?
Great. So now we have an application, and it's actually quite fast.
But, you know, it also does care a little bit about persistence.
So as it's making these transactions in its tables, it's also doing write-ahead logging.
It's got a little log that it writes everything to.
And you know, like a lot of databases,
this log is written and basically never read.
Why does it do that?
Well, it does it in case of a crash, right?
The log is needed for recovery,
but recovery is the .0001% case, right?
The most common case is everything's brought up normally,
everything's shut down normally,
so you write the log all the time and you never bother to use it.
And so you're basically, like in Redis, this is called the append-only file.
If you guys are familiar with Redis, it's called AOF, append-only file.
It's basically a log that just gets appended to constantly.
And, you know, the path, if you think about it right here, here's the application,
and he's running, he's going to append something to the log.
It goes down into the kernel. It goes through the file system. I'm kind
of skipping a lot of layers here, it goes through the block stack and everything and
it ends up appending to this little log file here. You know, it might end up writing to
metadata because, you know, it's an append so it's maybe allocating other blocks and
having to change the file size, things like this. So each one of these things, these appends could take quite a trip through the kernel.
So you might think, ah, this is a good example of something where we could say, hey, you
know this log file here, let's move that into persistent memory.
And then we'd save ourselves that trip into the kernel.
Okay?
So you might start out and say, oh, well, I saw this little talk from Andy that said
there's some library that allows me to do this.
So you go and you look up the libpmemobj API, pmem.io, all the man pages are there and everything.
You say, oh, yeah, I can do this.
And sure enough, we have our little library there.
It is pmemobj.
It's built on top of another little library for flushing stores.
And every one of these appends to the log just becomes a little transaction. And now it's much faster, but again your app had to change.
So you're saying, gee, thanks Andy, that's faster,
and you made me change my app. That's not very nice.
And besides that, as somebody pointed out during the previous talk,
these macros that you end up using are quite ugly and, you know,
not that friendly. And so, you know, before making this change, you just opened up your
log file and did a write and did a sync and you're done. That was the append. But now,
you're saying, well, okay, these guys told me this is really cool stuff so I'm going
to use it, and you had to use this TX_BEGIN macro, and then you did all the stuff here to, you know, copy your changes to the end of
the log, and the TX_END macro. And it is cool. It did work. It's now power-fail safe. But, you
know, it is complicated and it's not something that programmers, your average programmer
is really going to love using. And, you know, this is kind of intentional. Like I say, we started
out what could we implement today without modifying the compilers. So that's why we
ended up with a lot of these C macro things going on here. It's meant for early adopters.
It's meant for language implementers, right? The guy, we have a guy right now who's putting
transactions into Python, persistent memory transactions into Python.
And he has a persistent memory aware dictionary in Python.
And when you set a value, it happens transactionally.
This is what it's doing down in the C code of the Python interpreter.
But the guy writing Python has no idea.
He just says, oh, I assigned something to this dictionary and it's transactional.
That's my intention, right?
That this is like the implementer's language,
like the guy who's down in the nuts and bolts.
Probably day to day, I'm not sure that it matters,
whether this looks ugly when it has macros
and things like that, because day to day,
I expect people to be using higher level languages.
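To make the "ugly" concrete, a power-fail-safe log append with those macros comes out roughly like this. A minimal sketch: TX_BEGIN, TX_ONABORT, TX_END, and pmemobj_tx_add_range are the real libpmemobj interfaces, while the log layout is illustrative:

```c
#include <stddef.h>
#include <string.h>
#include <libpmemobj.h>

/* Illustrative layout, not something the library defines. */
struct log {
    size_t len;
    char data[];        /* appended bytes live here */
};

/* A minimal sketch of a power-fail-safe append using the libpmemobj
 * transaction macros. pop is the pool handle; log_oid is the
 * persistent object holding our struct log. */
static int
log_append(PMEMobjpool *pop, PMEMoid log_oid, const void *buf, size_t count)
{
    struct log *log = pmemobj_direct(log_oid);
    int ret = 0;

    TX_BEGIN(pop) {
        /* Undo-log the ranges we're about to modify; on commit they
         * are flushed to persistence, on abort they're rolled back. */
        pmemobj_tx_add_range(log_oid, offsetof(struct log, len),
            sizeof(log->len));
        pmemobj_tx_add_range(log_oid,
            offsetof(struct log, data) + log->len, count);

        memcpy(&log->data[log->len], buf, count);
        log->len += count;
    } TX_ONABORT {
        ret = -1;       /* interrupted or failed; nothing was appended */
    } TX_END

    return ret;
}
```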
Okay, but could we do better than this?
Like, this is our libpmemobj API right here.
Could we at least make it look a little more like this
just to help ease the adoption?
Well, I think we could.
And so we're working on another library,
a seventh library.
Remember, the NVM Library had six libraries in it.
We're working on a seventh library now,
which I call libpmemfile.
And it's libpmemfile because it's basically the
file paradigm, the file accesses, only it's all in your little user space pool of persistent
memory. And so instead of calling open, the application calls pmemfile_open, but with
the same arguments. Instead of calling write, it calls pmemfile_write. Should there be a
pmemfile_sync? Well, you know, our library is transactional.
Why don't we just make everything transactional because it's not like it's in memory, right?
It's in persistent memory. It's not like I'm going to say, oh, I'm not going to write it
out. It is written out. Persistent memory is write-in-place, right? If I just move a
byte into persistent memory, it's persistent. So I guess I don't even need a sync.
But I could provide one that just is a no-op, just so that people...
I haven't decided.
It makes you feel good. It's like the guy who types sync three times.
Which is, I can't type sync once, I have to type it three times,
because somebody told me a long time ago that you used to have to do that.
It's probably all apocryphal.
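Putting it together, the append path from earlier might look something like this under the new library. A rough sketch: the pmemfile_* names, flags, and the extra pool-handle argument reflect the design being described here, but the actual signatures weren't final at the time:

```c
#include <stddef.h>
#include <libpmemfile-posix.h>   /* assumed header name for the in-progress library */

/* A rough sketch of the same append written against the pmemfile API.
 * The calls mirror their POSIX namesakes with an extra pool handle. */
static void
log_append_pmemfile(PMEMfilepool *pfp, const void *buf, size_t count)
{
    PMEMfile *f = pmemfile_open(pfp, "/mylog",
        PMEMFILE_O_WRONLY | PMEMFILE_O_APPEND);

    /* Internally transactional (built on libpmemobj), so the append
     * is power-fail safe with no sync call, or a no-op sync. */
    pmemfile_write(pfp, f, buf, count);

    pmemfile_close(pfp, f);
}
```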
Okay, so this is now maybe a little easier, right?
So it's easier for people to understand, because they grew up knowing how these system calls worked,
and now they can see their pmemfile versions here.
So that's not so bad, and we just build it right on top of libpmemobj. So internally, libpmemfile is just using the transactions of libpmemobj.
So now I just have this stack of three little libraries here.
And they look bigger than they are.
Actually, the code paths are pretty short.
You know, the Linux version of this would be modeled after POSIX.
It's a familiar API.
But the app still had to change.
So maybe I made it a little easier.
Maybe I lowered the barrier to adoption a little bit.
But remember, if you change one line of code,
if you change a comment, you still have to revalidate.
And that's what's pissing people off.
So how can I do this without pissing people off so much?
Not pissing people off is a good goal.
So one way I could do this is to try and use libpmemfile transparently.
So we already do this for our volatile libraries.
We have a library for persistent memory for volatile usages,
right? And so why would you do that? Well, you know, like I said before, persistent memory
is expected to be cheaper than DRAM and, you know, terabytes in capacity. So you might
actually not care about the persistence. You just want to use it as another tier of volatile
memory. So we have a library that does that, and it has a vmem_malloc and it has a vmem_free,
and they look just like the libc malloc and free,
only they work on a different pool of memory.
Well, that's pretty cool.
But one of those six libraries,
and I guess I should maybe have said this
when I was back on this slide,
one of these six libraries here, libvmmalloc,
just makes this happen transparently.
So if you take any application and you use the linker magic,
the LD_PRELOAD environment variable,
you can have all of its malloc and free calls
kind of magically replaced here.
So you can just take an application that,
without modifying it, you can just make the binary
put all of its dynamic memory into persistent memory, right?
So again, that's the cheaper memory so it doesn't impose such a big footprint on your
more expensive DRAM.
So taking that as a model, I'm thinking, huh, could I do that for libpm file?
I mean, it wasn't too hard to just interpose on, you know, we had to do what, malloc, free,
realloc, posix_memalign, calloc, there were a couple of others.
But it wasn't hard.
In fact, jemalloc, a popular allocation library, is actually what's inside this.
We just used jemalloc.
We didn't have to write a new volatile memory
allocator. And it already had these kinds of interposing ideas. Sometimes people do this
with jemalloc. They take a binary that was written to use the libc malloc and they interpose
on it.
Okay. So if I can do that, couldn't I do that here with my libpmemfile? Using some sort of linker magic, I could preload these libraries onto a binary
without modifying it.
And the linker magic helps with interception.
So let me say a couple of words about that.
A long time ago, in the before time, when men were men and we didn't have dynamic libraries,
everything was built statically.
And when shared libraries were first invented, in Unix anyway (of course, they existed on other
systems already; VMS had them forever and things like that), this guy who worked at Sun at the
time saw shared libraries as this great solution for doing all sorts of things we never dreamed
of before. And one of them was building an interposing debugger. He was going to make it so that
you could build debuggers that interposed on all these library functions, right? And
that's why, if you look at ld.so today,
it has interfaces for looking up functions and interposing, all these RTLD interfaces,
dlopen, if you guys have ever played with these. And what really happened is it kind of turned
out to be not that interesting. I don't know of any interposing debuggers today. I don't
know of anybody who even uses library interposing for anything
other than like a couple of fringe cases. And a few years ago, the guys who maintain
the GNU libc, the one that comes on Linux, kind of said, boy, you know, if somebody interposes
on our library and changes something that's an internal library call, they would then
report a bug, and it's not our bug, it's their bug. So they disabled it. So in other words,
today in the GNU libc, if you call, let's see, what calls open? So if you call opendir,
it has internally a call to open. Open is a system call. And if you say, wow, I really want to interpose on that, the library doesn't allow it. Right? What happens is it actually has
an inline call to open inside of opendir, and it prevents you from interposing on it, because they
don't want you to break their opendir and then blame it on them. So that was a long
story. But the point is that the interposition thing sounds like a
pretty good idea, but there are some obstacles.
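For flavor, here's a minimal sketch of the LD_PRELOAD interposition trick itself, with a hypothetical $PMEM check standing in for whatever libpmemfile actually does:

```c
/* A minimal sketch of LD_PRELOAD-style interposition on open(2).
 * Build as a shared object and run the unmodified binary with
 * LD_PRELOAD pointing at it. The $PMEM check and the pmemfile
 * handoff are hypothetical stand-ins. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>

int
open(const char *path, int flags, ...)
{
    /* Find the real open(2) further down the link chain. */
    static int (*real_open)(const char *, int, ...);
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))
            dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    if (strstr(path, "$PMEM") != NULL) {
        /* Here is where libpmemfile would take over and route this
         * file into its user-space pool instead of the kernel. */
    }

    return real_open(path, flags, mode);
}
```

And as just described, glibc's internal calls, like the open inside opendir, bypass an interposer like this entirely, which is part of why we end up patching.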
However, it's very compelling because, first of all, you can just do this transparently
to the app.
The app, the binary is unchanged.
And what you do is you just make it an administrative configuration thing, right?
So libpmemfile, when it starts up, looks for some sort of configuration information,
which we actually point it to with an environment variable. And it says things like, well, when
this application does most opens, it should just go right down into the kernel, the normal thing.
So if I just open up /etc/passwd, I actually just call into the kernel and the normal path happens.
But if I open up, you know, /$PMEM/mylog, that's got the special token in it, and that tells
libpmemfile, when you've configured it correctly, to put that into persistent memory, okay? So
that's the idea of using the interceptor. And, you know, there's more
to it than just saying you didn't have to modify your binary. What if you're thinking
of re-architecting your app like I showed in the first couple of slides? But again,
it's quite an investment in time. Wouldn't it be nice if you could just try it out? Say,
well, you
know, I can think of a couple of things that I might re-architect to use libpmemobj to store
them. Today, they're storing that stuff in files. So why don't I just do this, you know,
in my normal test environment and try it out, do some performance measurements. And if I
find out that, boy, when I use this libpmemfile, things got a lot faster, then you might think, OK,
it's probably worth me doing the re-architecting work.
On the other hand, if you put this in,
and there was no difference because actually the bottleneck
was somewhere else, it's good to know.
And you didn't have to do any app re-architecting
or revalidating.
So this is kind of the great try-it-before-you-buy-it mechanism.
Are you buying it?
Yeah?
Okay, maybe a little.
And there's one thing that's even cooler than this.
libpmemobj, this library right here, has already built into it the ability to replicate. So today, the shipping libpmemobj,
you can write to the API, that transactional begin/end API
that I mentioned to you before.
You can write to that API.
And the application, without changing,
you can make some configuration steps,
some administrative steps, and it replicates for you.
Now, the shipping library today will replicate between two persistent memory files on the same machine.
And then, these are out on GitHub already, we just haven't tagged it as a release yet
because we're still testing it, but we've added the ability to replicate over RDMA
to a remote node.
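The administrative step is a poolset file; for the local-replica case it might look something along these lines (format per the libpmemobj poolset documentation; sizes and paths here are illustrative):

```
# A sketch of a poolset file with a local replica; sizes and
# paths are illustrative. A remote replica would be named on
# its own REPLICA line pointing at the other node.
PMEMPOOLSET
8G /mnt/pmem0/mylog.pool
REPLICA
8G /mnt/pmem1/mylog-replica.pool
```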
Okay, so again, the API is the same.
It's actually transparent to the app. Whether you replicate or not is transparent to the
app. That sounds kind of like the way we do with, you know, with storage. Today with storage,
you can just write an app and it thinks it's talking to files on a local disk and then
you install a RAID stack and the app doesn't change. It still thinks it's talking to files
on a local disk but somebody's actually replicating. So we kind of have the
same model here. And so since that works for libpmemobj, it works for a library built on
libpmemobj. So my little file thing, not only did it transparently start intercepting everything
this application did, but you also get this replication feature for free. Ooh! A hush falls over the crowd. That wasn't a very good hush.
Did you say for free?
Yeah, there you go. See, now you guys had two people up here when you were doing
your little skit. You know, I just have me. Yeah, thanks. I appreciate that. It's definitely
not as cute as the little skit you guys are doing. Okay, so let's talk a little bit more
about that interception. Some things just work. They were actually quite easy to get
working. Open, close, read, and write. It was easy. Our interception thing worked fine. We were able to get those working pretty much
immediately, either by using linker magic
or by actually figuring out where the calls are
and just literally patching the code to jump to us.
A little gross, but we got it working.
A few functions were a little more challenging,
but we got them working.
So like the opendir example I talked about, again, we did get that working, but we got
it working by finding the calls, the actual calls that do the system calls inside of libc
and then getting control there.
Little tricky, but we got it working.
What about versions, changing versions?
Yeah, we didn't do it by, I mean,
we're not hard-coding offsets or anything.
We did it by looking for instructions, actually.
Turns out there are a few libraries of people
who've already had to do this for debugging and tracing
and performance counters.
So we leveraged some of those things.
It's still a little gross though, just a little gross.
Okay, so maybe it's a lot gross.
But actually, it's pretty cool. So the application sees normal files and directories,
unless they start with the magic token,
and then sometimes those files live in the pmem pool.
So if you have a database program,
most of those have some sort of config file.
MySQL, go look at the conf file. Redis, go look at the conf file.
Whatever.
Most of these things have a config file.
So all we do is we just say, well, for those things that I
think are good candidates for living in persistent memory,
in their config file we put the little $PMEM
thing in their path.
So it's pretty easy.
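For example, hypothetically, a Redis-style config would need only a one-line path change (the appendfilename directive is Redis's; the $PMEM token handling is ours):

```
# Hypothetical: a Redis-style config pointing the append-only
# file at the pmem pool via the magic token in the path.
appendonly yes
appendfilename "/$PMEM/appendonly.aof"
```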
Binaries didn't have to change.
So done, right?
Piece of cake.
Ship it. Or are we done? So, there are a few operations
that are kind of problematic. And the worst one is fork when you're not execing. So, here
you are. You've got this lovely program binary. It's magically thinking that some of these
files are on storage when they're in persistent
memory and it decides to fork.
So now you have these two copies of it.
They're both kind of operating on the same PMEM pool.
They're interfering with each other.
It's ugly, right?
So we do catch that fork, and we try and do the most reasonable thing, if only to
error out if it's not safe. But we do
have ideas of how we could actually make the library survive a fork so it can continue to
work. But so far, we haven't been very motivated to get that working. This is a lot of work.
This is our ugliest work right now. Another thing if you think about it, if you're intercepting file operations, where does the
file descriptor come from?
If somebody calls open but I didn't do a real open, what file descriptor do I hand them?
So yeah, this is pretty ugly.
And it turns out, you know, we do a lot of tricks to sort of make up file descriptors
that aren't going to get you in trouble and are going to error out if you try and use them for something else.
Of course, since they're not real file descriptors, what happens if you call select on them? Well,
it turns out that calling select or poll on a file isn't really very common. It's very
common on a socket, or even sometimes on devices. But who calls select on, like, just a normal file?
That's not so common.
So we haven't run into it a lot.
And if we did, it would say, I'm sorry, I can't do select on this.
Right?
You'd get an error.
So at least we don't pretend it works when it doesn't.
Well, what about mmap?
Huh.
Now, that's starting to hurt my brain, because, of course, we got
the persistent memory by opening up a file on a persistent-memory-aware file system.
Our library, libpmemobj, called mmap and got this big giant range of memory. And then it
built its own little pretend file system in there that libpmemfile now owns. And now you're
using that API, transparently, and you call mmap.
Gah!
What should we do?
So, you know, if we page-align everything, if we make our library smart enough to page-align all of the user data,
the pmemfile user data, we think we probably could make mmap mostly work.
But right now we just error out.
We just say, sorry, it's not a good candidate. So if your application, your unmodified application, if you
give it a path name for one of the files and it calls
MMAP on it, it'll just error out.
So this is something, again, we kind of have a vision of
how it could work, but I'm just not sure we should spend
the time to make it work.
Think about mmap with copy-on-write and stuff like that. It really starts to hurt your brain.
How about AIO?
So AIO is this set of interfaces that supposedly go off and do the IO and then give you a callback
when it's done, right?
And so you can launch a whole bunch of asynchronous IOs at once and then
get these notifications back. They work on files. All of our IO is synchronous because,
again, there's nothing to wait for. The CPU has to put the byte in somewhere and then
there's no completion. You're done once you put the byte there. So what would AIO do? Well, what it would probably do is call back immediately. In other words,
you do an AIO write and in AIO write, we go, well, I copied the byte, I'm done and you
call the callback. But the guy who called you is still waiting for the return from AIO
write. So that doesn't happen today in normal AIO. And when we do it, you get this kind
of stack going down and down because the application
is too stupid to do anything about it.
He's like, oh, I'm done.
I'm going to call AIO write again.
And he just calls it again and again and again.
So it's not pretty.
And we could do all sorts of nasty things like saying, well, I'm going to call it later.
There's no good solution to this.
We talked about everything, honestly. But they're all so ugly that we just decided that these things should
just also be not supported by our library. So we error out on AIO for now, unless somebody
tells us otherwise. And then there are a few rare syscalls. I was actually surprised to find some of these
syscalls. I'd never heard of them. You know, I went through the syscall table and I'm like, what is this? What's renameat2?
Oh, renameat2.
Okay, yeah, it's not that, not renameat.
So anyway, some of these things we were able to figure out and just make them work and
some of them are kind of too weird.
And then, you know, so part of making this work is going to have to be some sort of tool
that helps you decide whether it's appropriate for your application or not.
But it's certainly starting to sound a lot more like either you should be calling our
API so that you're responsible for what you're calling or you should be using this transparent
thing as a try before you buy mechanism. Would you actually ship an application where this
is the transparent solution? I don't know. We'll have to think about that one, but we'll see.
Multi-processing is kind of a big open issue for us.
So all of the libraries, all of the NVM libraries right now that are transactional, they don't
support multi-process access.
They support multi-threaded access.
So if you open up one of our pmem pools with one of our libraries
and you fork a thousand threads, it will work, and we spent a lot of time making sure it's
not just multithread-safe, it's MT-hot. We really are optimized for multiple threads.
Most of our threads can do allocations of persistent memory without grabbing a lock
because we keep little per-thread caches and stuff like that. But for multiprocess, there's a lot
more cache flushing and declaring things volatile and
all this kind of stuff that has to work between multiple
processes.
And we haven't been convinced that there's enough
need for it yet.
Now, I say that, and then yesterday, on Sunday, we were
doing this tutorial about programming with NVML.
And somebody in the audience came up with a decent use case for one application, right? So that's
one, and I have maybe one other application where people have told me, oh, we have
to have multiple processes accessing the same pool. So I'm still probably not compelled enough
to go and do all this work to make multi-processing work. But it's out there. And this is a limitation of all of the libraries, right?
So libpmemfile suffers from it, just like libpmemobj does.
So if you have requirements for multiprocess access
for persistent memory, drop me a line,
because I'd like to hear more about it.
I want to keep considering it.
At some point, I'll build up a big enough list.
It'll push me over, and I'll say, OK, we got to go do this.
It's not that we don't think it can be done. It's just that we think it's a lot of work,
especially in validation. Yes, sir?
Are you documenting somewhere on the site?
Yeah. So on pmem.io, we have a blog area, and we have a blog entry on how we might
make multiprocessing work, for example. So on that particular one, we have a blog about
it. But there's also an issues database out on GitHub, and it's open. In fact, I think it is one of our open issues right
now. So we're just using GitHub issues, you know. Yeah. Anyway, like I said before, I
think the key here is to make sure we don't silently do the wrong thing, right? So if
libpmemfile is going to be useful to someone, it should be sure to bail out if they do something
that it doesn't support.
So that's kind of our, one of our primary design goals here.
Well, unfortunately, I mean, the whole point is that you're back here.
This is a binary that you didn't compile, right?
So we thought about making a tool that actually says, oh, I see you're
using this syscall, right? In other words, it scans through your binary. The problem
is that may have nothing to do with like the one file you're trying to put into PMEM, right?
So we're still kind of noodling on that one. Right now, it just bails out at runtime. So
in other words, the advice is get your test case ready, then put
this library in place and run your test case. And if it bails out, then sorry. I mean, we'll
tell you why we bailed out, but then you'll learn. The question is now, okay, what if
it didn't bail out during your test case, but then you actually try and use it in production,
right? So that's why I say we're trying to think that one through a little more. You might be interested in the library within two slides,
and then once you start testing, you realize from the last slide
that all these things don't work.
Yeah.
So, you know, it's really, it's an error-prone part of using this interception thing.
And, you know, you guys haven't done the shudder yet, but you will if you think about it long enough. A lot of times when I describe the
interception thing, people are just like, what? You can't do that. And well, it turns
out, you know, there's really no documented supported way on Linux to do this kind of
interception, right? I mean, the closest thing we have is like Ptrace, right? Something that's
meant for debugging, but it's just not performant, right?
And there's also a, what is it called?
seccomp or something?
It's basically for auditing syscalls or things like that.
So it's not quite right for this, because those things are
done by another process.
They invoke a context switch.
You know, you do one context switch, you're done, right?
Because the whole point here was to avoid all that and just do everything with this little short little
code path that just goes directly down into your persistent memory. So we're still noodling
over this and it may be that we're able to come up with a better, more reliable interception
mechanism but only by patching the Linux kernel and adding some sort of new capability. I actually have a
bunch of ideas on how to do that, but we haven't implemented any of them yet.
Okay. So what we did was some combination of the linker magic and patching. And like
I say, ld.so and libc try to protect the app from unexpected behavior, and we have to overcome that a little bit. But there's
really no well-specified thing. By the way, in Windows, there is a well-specified documented
way of patching and grabbing these things. It's called... Yeah, I forget. Anyway, it's
documented on MSDN. It says this is how you intercept things
and this is how you chain them. Because you remember the old DOS days, right? That's how
you did everything. You intercepted the little ints, the int21s. So it is documented. And
so when we were considering this library, I went over to Paul's team and said, hey,
this will be easy on Windows. And they did the shudder. They're like, no, nobody uses that.
Don't use that.
You know, so I think it's just as bad even on Windows, unfortunately.
However, if we can just make a solution that we think is good enough, you know,
like frankly, I think the open, close, read, write interception is good enough
for a lot of these cases I'm talking about.
The Redis append-only file, it does nothing but open and write. It really does
nothing else. It's very simple and it works and we've done that with our library very
easily. But, you know, the thing to beat is it's all about performance, right? We wouldn't
do any of this if it's not a performance benefit. So the thing to beat is just using a real file system.
We have file systems that are pmem-aware already, right?
ext4, XFS, NTFS, these things already understand persistent memory.
So if you put them in this DAX mode and say, you know, now the application is just going
to run, the binary is going to run, it's transparently doing what it's doing.
That should just work.
Or we also have the ability to make persistent memory look
like a block device.
And then we can just put any file system on top of it.
But of course, you're back to having a memory footprint
when you page things in.
But this one didn't have a DRAM footprint
when we turned on DAX mode.
So if these things are fast enough,
forget libpmemfile.
Right?
That's the thing to beat.
And the code paths for things like append
are really the issue here.
They go very deeply through the file system code.
They do metadata updates.
They do multiple updates, multiple syncs
for some operations.
Whereas with libpmemfile, these things just
turn into loads and stores and some cache flushes now and then
in user space.
And coming soon to an Intel platform near you
are some cache flush instructions
that are more optimal than CLFLUSH.
And so that'll make this even faster by quite a bit,
we believe.
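Underneath, an append amounts to a copy plus user-space flushes; here's a minimal sketch using the real libpmem calls, assuming the log bytes and the length word both live in mapped persistent memory, with transactional ordering and recovery handled a layer up:

```c
#include <stddef.h>
#include <libpmem.h>

/* A minimal sketch of what an append turns into underneath: a copy
 * into mapped persistent memory plus user-space cache flushes, with
 * no system call on the data path. This shows only the flush
 * mechanics, not the transactional bookkeeping above it. */
static void
raw_append(char *log_data, size_t *log_len, const void *buf, size_t count)
{
    /* Copy the new bytes and flush them (CLFLUSH today, or the newer
     * optimized flush instructions when the CPU supports them). */
    pmem_memcpy_persist(log_data + *log_len, buf, count);

    *log_len += count;
    pmem_persist(log_len, sizeof(*log_len));  /* flush the new length */
}
```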
So we did some proof-of-concept results.
This is kind of an eye chart,
so I won't spend too much time on it.
It'll be online in the slides that we deliver.
But the point is that over here we have a file system.
Here's our first try at pmemfile.
It was a little faster than Btrfs when we tried it,
but we were really micromanaging all of the allocations
and making everything persistent, and we realized we could actually kind of do some
more clever allocations where we do bigger allocations. And here's XFS, and here's
ext4, and then here's what pmemfile can do today. So this is 64K-sized appends to a file,
and this is how many we can do per second with a little microbenchmark.
Okay? And this, we got it up to, you know, 18,000, where the first thing we tried here
was below 2,000. So we do think that there's some value here. But we're worried about the
interception part, because it's an error-prone tradeoff, right?
And I have a similar set of results here for if you're not appending but you're just updating
the file.
And so it's not as dramatic but you can see we still were able to get it out on top.
So what am I really trying to do here with my seventh library in the NVML suite?
Well, you can take an unmodified application and you can just run it on top of persistent
memory using the emulated block driver and everything is cool.
You will use some DRAM buffers to copy into the non-volatile memory and you're basically
at microsecond level application latency.
If you think about like doing a kind of a one 4K block thing, we're in the microseconds,
right?
On the other hand, you could come over here and use libpmem, sorry, you could use our NVML
APIs.
This means you re-architect for pmem.
And now you're doing everything in place.
You're not doing these little IOs anymore.
And now you're at the lowest app latencies you can have.
I almost put a number here, but the marketing guys
made me take it off.
But let's just say that it's order of magnitude memory
speeds, so pretty fast.
So that required you to redesign your application.
But then over here is what I've been trying to target
with today's talk.
Is there a happy medium between this and this?
This is a new library.
It emulates the file APIs so the application continues to
be unmodified.
And now you're back to the low app latency. Not as low as
this, but lower than this. Right? So it's somewhere in the middle. And as a bonus, you
inherit things like replication that libpmemobj did for you. Ooh. The replication is actually
pretty cool. So we have a lot of... Yes, sir? How much is the overhead of the replication?
The overhead of the replication is significant because of RDMA round trip times. And come
to Chet Douglas' talk tomorrow. He'll tell you exactly how much that is and what we're
going to do to make it better. See, there's a good plug for Chet's talk.
He's not here now, but he'll be here in a moment.
So we have a lot of ideas for how
to use persistent memory transparently,
and a lot of them are implemented
and they're going to work.
So when we put up this ugly code, these APIs and things
like that, don't be afraid.
There are lots of ways of using persistent memory
without even modifying your app.
But we describe one idea here, and there are lots of others coming.
And we think that these ideas, these transparent ideas, lower the barrier to adoption,
especially for, I call them, tier two applications, right?
The people who have applications, but they don't have rows of software engineers to spare to re-architect them.
The big companies are already figuring out what to do for persistent memory.
Nobody is claiming they have the one true answer that I'm aware of, of how persistent memory programming should work.
I think there are a lot of multiple competing ideas, but I think we want to encourage that.
I think we all benefit from that.
And we definitely want to get experience with a lot of these different solutions before
maybe we settle on a couple of key solutions.
I really like the idea of being able to try out persistent memory before re-architecting
your app.
So watch for libpmemfile.
Sometime next year, I didn't want to put a date on here because I really do want to spend
some time figuring out how to make the interception less error-prone.
It scares me.
That's the part that scares me.
The performance benefit looked cool though, right?
So that's where I am.
So any questions?
libpmemfile?
NVML?
Yes, sir?
I'm curious what exactly goes wrong with fork.
What exactly goes wrong with fork is two different things.
One is our library isn't multiprocess aware right now.
And so when it forks, we say, oh, I'm sorry, because we really just don't do the right
things in our transactions to account for another process.
Even if we fix that, there are cases like Redis, which does
a very interesting thing. It wants to take a snapshot, so it forks, and the one process
just keeps going, but it's marked all the pages copy-on-write. And so it just keeps going,
and the other one has the frozen bunch of pages, and it goes and makes a big backup. So it's the
copy-on-write thing that, you know, if it's trying to do that with one of our files, it's
not going to get what it thinks. But that case would have been, you know, probably disqualified
by one of the other things on my slide. But so it's kind of those two things there, that
there's the semantics of what happens after a fork without an exec, and then there's just the multiprocess thing.
Any other questions?
All right, it's the end of the day.
Thank you.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe
at snia.org. Here you can ask questions and discuss this topic further with your peers in
the developer community. For additional information about the Storage Developer Conference,
visit storagedeveloper.org.