Storage Developer Conference - #17: Solving the Challenges of Persistent Memory Programming
Episode Date: August 16, 2016...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 17.
Today we hear from Sarah Jelinek, Senior Software Engineer with Intel,
as she presents Solving the Challenges of Persistent Memory Programming
from the 2015 Storage Developer Conference.
I'm Sarah Jelinek from Intel.
Unfortunately, Sarah has come down with a horrible case of bronchitis,
and her doctor tells her not to get on an airplane and fly out here.
But Sarah and I are in the same group at Intel,
and so we worked on this project together.
We worked on the slides together.
So here I am.
I'm Andy Rudoff.
I'm your backup Sarah for today.
And what we're going to talk about here is what's so hard about using persistent memory?
Gosh, it sounded pretty easy based on Andy's talk yesterday. So we're going to cover these topics. I will do a little bit
of review talking about what I'm really talking about with persistent memory. What do I really
mean? And some of the problems, the context of the work that we're doing. You'll notice
that when I get to the actual detail of the work,
I never actually mention the piece of software that we're modifying to be persistent memory aware.
That's because it's something that hasn't been announced yet,
that we're working on with one of our partners.
But other than that, all the information is going to be there.
So what is persistent memory?
And I mentioned this if you saw my talk yesterday.
I have kind of my own definition of persistent memory.
It's byte addressable.
It's something where it's so fast that you probably wouldn't context switch away.
You'd probably just reasonably stall a load from the CPU in order to get the data.
So it's not NAND.
NAND is too slow.
NAND, slow stuff.
We don't even use that anymore.
Now we use this cool, look at the,
how could you not want to use something that has that really cool picture?
That's 3D XPoint.
So, you know, computers don't really access things in bytes.
They access things in cache lines.
So really think of persistent memory as something that is able to speak these cache lines. On Intel,
the cache line is 64 bytes, and all the transfers to and from a DIMM are these little 64-byte transfers.
The other cool thing about persistent memory is that you can DMA to it,
so you can do things that you can't really do with persistence otherwise.
You can DMA directly into your persistence.
That's a lot of what Tom's talk was about in the previous hour.
And come in the next hour, or the next talk after this one,
Chet's going to talk about how to make that work.
It's actually quite cool.
We're very careful how we phrase things because we're a bunch of engineers
and we're not allowed to announce things.
At IDF, somebody announced that
we expect the capacities of
this 3D XPoint memory
to be up to 6
terabytes on a two-socket system.
So that gives you an idea of what kinds of
capacities we're talking about and why we're so
interested in this, why we're so excited about
this stuff. We're paying all this attention to it.
So
we created a programming model for
this.
It's based on memory mapped files
and I have some
kind of Linux-y words up here, but
a couple days ago
Neal Christiansen from Microsoft also
talked about how the same kind of programming model
is going to be exposed in Windows.
We have other OS's working on this as well.
And there is kind of something that hits people right away
when they look at the programming model.
You open up a file, you memory map it,
and now you have the memory right there in your process,
ready to use, do loads and stores to it.
You do some stores,
and now everybody else that has that file mapped for sharing
can see your stores,
and they can start using your data
and this shared memory
and making calculations based off of it,
making decisions based off of it.
And then the program crashes,
and those stores, it turns out,
weren't actually persistent yet.
Okay, so, my gosh, is that a bug?
How could you have done that, right?
And so a lot of people do ask this.
They think, oh, we've discovered a horrible hole in our strategy.
But, you know, it has always been this way.
Memory mapped files have been around for more than 30 years now.
And it's always been the case that when you do stores to these files, these memory mapped files,
they are visible but they're not durable until you do some sort of flushing operation.
They might be durable before this because normal cache pressure can flush things out.
But they're not known to your program to be persistent
until you flush things out.
On Linux, the call to do this is msync
or one of its variants.
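For reference, that decades-old memory-mapped-file model looks roughly like this in C on Linux; the path here is just a placeholder and error handling is omitted:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/some/file", O_RDWR);    /* placeholder path */
        char *base = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);

        strcpy(base, "hello");     /* stores are visible to other mappers... */
        msync(base, 6, MS_SYNC);   /* ...but not known durable until this */

        munmap(base, 4096);
        close(fd);
        return 0;
    }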
So for persistent memory,
we don't use the page cache.
We allow an application to access it directly with loads and stores.
I drew a picture of that that I showed yesterday.
And so msync, where it is normally about flushing the page cache,
is actually about flushing other places in the machine now for persistent memory.
So if you just think of the data path of a typical machine,
a core does a store.
A store is called MOV in Intel assembly language.
And it goes through the L1 cache, the L2 cache, the L3 cache,
the memory controller,
and eventually it hits this persistent memory down here.
And so just coloring in red all the hiding places.
These are the places where the store could
be if your program crashes before
you're done flushing or if the machine
crashes before you're done flushing.
So what do we do
to get stuff out of there?
Well, we've introduced some new instructions
in Intel to help with this.
There's the processor cache part
up here and the instructions
to do this are the CLFLUSH variety.
And then there's the memory subsystem down here,
and the instruction down here is called pcommit.
So CLFLUSH, the instruction that flushes a cache line,
it's been around for years.
But it's a serialized instruction,
so if you just wrote 4K of data and you're going to loop through and flush it all out,
that little loop will go serially;
every CLFLUSH has to wait for the previous
one to finish.
We've made
CLFLUSH serialized
in every implementation that we've
done, in every processor that Intel has done.
We didn't feel like we could just remove that,
because we thought we'd break some existing
software. Somebody somewhere is depending on that
serialization.
And so that's why we introduced
this new one, CLFLUSHOPT, the optimized one.
That same loop I talked about, if you just loop around
doing CLFLUSHOPT, those will
happen concurrently, and it's not until
a little fence instruction that you do
at the end of the loop that you're sure they're all done.
CLWB,
cache line write-back, is similar, only it doesn't invalidate the
line that it flushed. So if you actually expect to use it again and you want
that to be a cache hit, you use this. These are all public instructions. They're all documented
in the Intel software developer's manual on intel.com. You can just Google for these.
You'll find them. PCOMMIT is a broadcast operation.
It doesn't take an argument.
It basically says,
if there are any stores that are headed for persistent memory,
mark them.
And then when you wait at the end of the PCOMMIT,
you put an SFENCE there, a fence instruction.
That waits for all the marked stores to drain.
So again, it's kind of a blunt instrument:
I might end up waiting for things I don't care about.
Might end up waiting for somebody else's stores.
Or my stores might have already been flushed out.
Doesn't matter.
This is about correctness.
If a program needs to know
that its stores are definitely durable,
this is the way you do it.
You do these flushes and PCOMMITs.
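As a sketch of what those flushes can look like from user space, here is a loop using the compiler intrinsics for CLFLUSHOPT and SFENCE; real code, like libpmem, detects at runtime which instructions the CPU actually supports and falls back accordingly:

    #include <immintrin.h>   /* _mm_clflushopt, _mm_sfence; build with -mclflushopt */
    #include <stdint.h>
    #include <stddef.h>

    #define CACHELINE 64

    static void flush_range(const void *addr, size_t len)
    {
        uintptr_t p = (uintptr_t)addr & ~((uintptr_t)CACHELINE - 1);

        /* CLFLUSHOPT flushes are allowed to proceed concurrently... */
        for (; p < (uintptr_t)addr + len; p += CACHELINE)
            _mm_clflushopt((void *)p);

        /* ...and the fence is what tells you they are all done.
         * On the platforms described in this talk, a PCOMMIT plus
         * another fence would follow, to flush the memory subsystem. */
        _mm_sfence();
    }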
Okay, so this is the standard.
Here's our architecture diagram,
and I just got done telling you
that the way you flush things here
is to use those instructions.
If you do use msync,
if you do use one of the standard ways
of making something durable,
then you'll tap into the kernel here,
and the msync code will use those instructions.
But those instructions are available from user space. You can use them up here
and save yourself some
code path.
These things that are
for enabling here
in Linux have already been delivered
into the Linux kernel.
There are links at the end of the presentation
on where you can get this stuff, or you can just go to kernel.org
and grab the latest kernel
and play around with this stuff if you like.
But since
it's based on memory mapped files
pretty much all the code that I could
show you and all the examples in our library
just work on any file system. They just don't happen
to be as fast as they would be
on persistent memory because paging is happening
underneath. But it allows
you to start writing programs now.
So in order to make persistent memory programming easier,
we wanted to create a library,
a set of libraries really for the different use cases.
And we call it NVML, the NVM library.
It's completely open sourced.
It's on GitHub today.
You can play with it.
We just had our alpha quality release,
which means all the serious bugs seem to be taken care of at the moment anyway.
And I'm going to quickly walk through what the libraries do,
and then I'm going to show you how we took a problem and selected from these libraries different things,
which library we should use to solve our problem.
So the lowest level library is just called libpmem.
Nice, very simple name.
This library is very small.
I mean, I've got to tell you, I literally wrote this library over a weekend,
and it's just a few lines of code.
It's just nice, convenient wrappers
around those instructions
that I just showed you.
So this is for when
you don't want anybody
to do anything fancy for you,
you just want help
flushing this stuff,
your writes to persistent memory, out,
but you don't want to write
the assembly language yourself.
And actually,
the kind of gory part
is checking to see
if the machine supports
the certain instructions that you want to use and falling
back to other instructions. It's a pain in the butt. You don't want to
write that over and over and over again. So that's
what libpmem does. It looks up which
instructions are available on the platform and picks
the right one for you, the best one for your situation.
We have a few entry
points in here to do things like
a memcopy, copy a range
to persistent memory, and it's
got some heuristics in it that say, oh, you're copying a range that's so big, I'm going to
use special instructions that go around the cache so that I don't have to do the cache
flushes at the end. Things like that. Just the simple stuff. So libpmem is kind of, if
you want to roll your own, you probably still want libpmem, because you don't want to keep writing that code over and over again.
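A small sketch of how those conveniences get used on a range you've already mapped, assuming the libpmem entry points described here (pmem_is_pmem, pmem_memcpy_persist, pmem_msync):

    #include <libpmem.h>    /* link with -lpmem */
    #include <string.h>

    /* copy a record into an already-mapped range */
    void store_record(void *dst, const void *src, size_t len)
    {
        if (pmem_is_pmem(dst, len)) {
            /* picks the best copy-and-flush strategy for the platform,
             * including going around the cache for large ranges */
            pmem_memcpy_persist(dst, src, len);
        } else {
            /* ordinary memory-mapped file: fall back to msync-style flushing */
            memcpy(dst, src, len);
            pmem_msync(dst, len);
        }
    }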
libpmemblk.
This is basically for
arrays of blocks that are all the same size
where you want to be able to update any one
block atomically.
When you write to that block, you want
either the old block or the new block
in the case of failure. You don't want part
of the old block and part of the new block.
This is,
think of it for really big
arrays like database caches,
things like that.
This is actually the exact same code
and algorithm that we use in the drivers
down in the kernel
when we want to make persistent memory look like a block
device. It's very convenient to have a block
device also have atomic blocks
associated with it.
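A minimal sketch of that idea, assuming the libpmemblk entry points roughly as published; the pool path and block size are placeholders and error handling is omitted:

    #include <libpmemblk.h>   /* link with -lpmemblk */
    #include <string.h>

    #define PATH  "/pmem/blkpool"   /* placeholder pool file */
    #define BSIZE 4096              /* size of every block in the array */

    int main(void)
    {
        PMEMblkpool *pbp = pmemblk_create(PATH, BSIZE, PMEMBLK_MIN_POOL, 0666);
        if (pbp == NULL)
            pbp = pmemblk_open(PATH, BSIZE);   /* pool already existed */

        char buf[BSIZE];
        memset(buf, 0xab, BSIZE);

        /* atomic with respect to failure: after a crash you see the old
         * block or the new block, never a torn mix of the two */
        pmemblk_write(pbp, buf, 5);
        pmemblk_read(pbp, buf, 5);

        pmemblk_close(pbp);
        return 0;
    }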
libpmemlog is just
one of our use cases. It's a
persistent memory resident log file.
So if you think about an application like a database
that has some sort of
undo log
that it writes.
A lot of times they call these append-only files.
I think that's what Redis calls it.
And the append-only file, it is append-only.
You literally just append stuff to this.
You never read it back.
You write, write, write, write.
The only time you read it back is if the program crashes,
and you come back and you look in the end of the log to see what operation you should be undoing.
So every time you're doing that append, you're going down into the kernel.
You're going down into the file system.
The file system maybe even has to do some metadata operations
even if it's as mundane as updating the
access time or something. And then you have to go through the block stack
so it's quite a long code path when all you're trying to do
is just append something to a little file. So moving that up
into persistent memory makes this into a much shorter thing.
In pmemlog, when you append, it's a memcpy
followed by a couple of these flush instructions.
It's very fast.
But it's really just also for a very specific use case
because the log file is just one of these things
that has to kind of be append only
and not replicated in this use case.
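And a minimal sketch of that append-only use case, assuming the libpmemlog entry points roughly as published; the pool path is a placeholder and error handling is omitted:

    #include <libpmemlog.h>   /* link with -lpmemlog */
    #include <stdio.h>
    #include <string.h>

    /* called when we walk the log, e.g. during crash recovery */
    static int process_chunk(const void *buf, size_t len, void *arg)
    {
        fwrite(buf, 1, len, stdout);
        return 0;
    }

    int main(void)
    {
        PMEMlogpool *plp = pmemlog_create("/pmem/log", PMEMLOG_MIN_POOL, 0666);
        if (plp == NULL)
            plp = pmemlog_open("/pmem/log");

        const char *rec = "append-only record\n";
        pmemlog_append(plp, rec, strlen(rec));   /* a memcpy plus flushes */

        pmemlog_walk(plp, 0, process_chunk, NULL);   /* 0 = whole log at once */
        pmemlog_close(plp);
        return 0;
    }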
Another library we have is libpmemobj.
That's a
persistent memory resident object store,
which is really kind of a fancy way
of saying this is just our most generic library.
Object, I'm using object not in
the Java object sense.
I'm using object in the storage sense,
where object really just means a variable-size block.
So this
is the most flexible library we have.
It allows you to begin a transaction,
do all sorts of things like allocate from persistent memory,
free, and make a bunch of changes.
And all that information goes into an undo log.
And then there's a commit.
And so if you get interrupted before you do the commit,
everything reverts back to the way it was at the beginning of the transaction.
So this is our most general library.
If you're just starting out playing with NVML,
this is probably what you want.
And we have a whole bunch of tutorials and examples
on the GitHub site for this.
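A minimal sketch of that transactional style, assuming the libpmemobj pool and transaction entry points roughly as published; the pool path, layout name, and sizes are just for illustration:

    #include <libpmemobj.h>   /* link with -lpmemobj */

    int main(void)
    {
        PMEMobjpool *pop = pmemobj_create("/pmem/objpool", "example",
                                          PMEMOBJ_MIN_POOL, 0666);
        if (pop == NULL)
            pop = pmemobj_open("/pmem/objpool", "example");

        /* everything between TX_BEGIN and TX_END is undo-logged: if we
         * crash before the commit, the allocation is rolled back */
        TX_BEGIN(pop) {
            PMEMoid oid = pmemobj_tx_alloc(1024, 0 /* type number */);
            char *p = pmemobj_direct(oid);
            p[0] = 42;   /* becomes durable when the transaction commits */
        } TX_END

        pmemobj_close(pop);
        return 0;
    }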
Libvmem.
Okay, so now this is the volatile use of persistent memory.
And I know that sounds a little ugly
to have a term that has both volatile and persistent in it at the
same time. But, you know,
in an earlier slide I said,
well, we expect to see 6 terabytes of this
stuff on a system.
And another part of the IDF announcement
was we expect it to be cheaper than DRAM.
So that means, you know,
to build a system today with 6 terabytes
in it out of DRAM, that's pretty expensive.
And so we are envisioning a time when you might say, well, I need a lot of memory,
but actually I don't need it all to perform at DRAM speeds.
So I'll put in some DRAM, save a bunch of money,
and put in some of this non-volatile memory for the rest of the terabytes that I need.
And so that's great, but now you have two types of memory in the system.
What do you get in a C program
if you call malloc? Well, you get the system
memory, whatever that is. So that's the
DRAM. So how do you ask for the
other kind of memory?
That's what libvmem is for.
Libvmem just provides another malloc
and free. That's all it is. It's just for C
programmers to say, I want a malloc from
that pool of memory instead of the normal
system memory.
It's volatile in the sense that if the system crashes, it's forgotten.
If the system crashes, or the program crashes for that matter, it's all given back to the system,
just like the stuff that you get when you call malloc.
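A minimal sketch of libvmem, assuming the persistent memory is exposed through a file system mounted at a placeholder path like /pmem, and the published entry points vmem_create, vmem_malloc, and vmem_free:

    #include <libvmem.h>   /* link with -lvmem */
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        /* carve a volatile heap out of a file created under /pmem */
        VMEM *vmp = vmem_create("/pmem", VMEM_MIN_POOL);

        char *big = vmem_malloc(vmp, 1 << 20);   /* from the pmem pool */
        char *tmp = malloc(4096);                /* from ordinary DRAM */

        memset(big, 0, 1 << 20);

        vmem_free(vmp, big);   /* must pair with the matching allocator */
        free(tmp);
        vmem_delete(vmp);
        return 0;
    }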
And then the last library that we have out there is just another version of libvmem. It's called libvmmalloc.
And this is one for transparently converting an application's calls to malloc and free into persistent memory mallocs and frees.
So this would be a case where, as an administrator, you had a whole bunch of programs running on the system,
but you only really needed a subset of them to run out of DRAM.
The rest of them, you want them to be memory-resident.
You want them to be faster than if they were actually paged out.
So you just use this library when you run them,
and it'll just say, oh, this guy gets to allocate from the slower, bigger, cheaper pool,
and he has no idea that's even happening,
so the application is not even modified in this case.
I bet the security guys would hate living that way.
Well, actually, so why do you say that?
Because people are going to be writing to something as if it's DRAM,
assuming that when the power goes away, it vanishes.
There are all kinds of government secrets in that area.
So let me ask you this.
What do people do for temporary files?
Temporary files are, by the way, in use by most of the programs that you're running on your laptops today.
Many of them use temporary files.
How do you handle that same problem for a temporary file?
I've asked the security guys. I don't know.
Well, so I'll tell you a common thing.
Of course, you're right that there are more secure environments
where they are much more paranoid about it.
But for a lot of Unix-y systems,
there is a POSIX interface for creating a temporary file.
And what it does is it creates the file
and allocates as much of it as it wants.
And then it deletes the name from the file system namespace.
And so nobody else can attach to it at that point.
There have been age-old security bugs where people caught it.
But in the general case, no one can attach to it.
And when the system crashes or when the program exits,
that space is freed up and available again.
Now, it still may have the old data in it,
but that's how deleting a file works.
If anybody tries to allocate from that free space,
it's zero fill on demand.
And so that's the same here.
So it's the same security as files is what I'm saying.
But if you're in an environment that needs better security than that,
absolutely, somebody needs to go back and zero that stuff out.
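For reference, the POSIX idiom being described looks roughly like this; the template path is just an example:

    #include <stdlib.h>
    #include <unistd.h>

    /* create a scratch file, then delete its name so nobody else can
     * attach to it; the space comes back when the program exits or crashes */
    int private_scratch_file(void)
    {
        char path[] = "/tmp/scratch-XXXXXX";   /* filled in by mkstemp */
        int fd = mkstemp(path);
        if (fd >= 0)
            unlink(path);   /* name gone; the fd (and any mmap) still works */
        return fd;
    }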
Yes, sir?
Pull out that DIMM
that's got DRAM on it,
that file is gone.
Or is this not the case here,
since it was persistent?
There's a different security exposure.
So,
did you all hear the question? The question was, well, so
wait a minute. Since you're using a non-volatile technology
in a volatile way, can somebody come
and yank the DIMM out, and now they've got
your secrets? And my answer to that
is that
I think that will be an area for
vendors to compete with
interesting solutions. So
it's not necessarily the case.
There could be solutions.
so for example going back to my temporary file example,
if I'm going to put an SSD in my system and I'm going to use a temporary file,
if somebody yanks that SSD out, now do they have my data?
Not if it's a self-encrypting drive and I didn't give them the unlock password for it, right?
So there are solutions.
But there are considerations.
I totally agree with you.
Okay.
So let's talk about
my current work,
well, Sarah's current work.
So Sarah was
partnered up with this application
and the application
had a block cache in it.
The block cache
basically could use either memory or an SSD.
And we looked at it as a way of reducing DRAM
because, like I say, we are expecting this emerging technology
to be cheaper than DRAM
and to allow us to build bigger capacities.
Gosh, it does happen to be persistent, so maybe we can leverage
that and make the cache
warm when the program starts
up. That's a possibility, so we wanted to look into
that. We wanted to
look into also just using
it as volatile and decide which one
was better and when we should use which one.
We
knew that we had the NVM libraries
and this is maybe not quite
stated correctly. In
reality, we were working
on the libraries at the same time as we were working
on these ideas, these proof of concepts
and we were feeding, you know, requirements
back and forth to each other. So a lot of the
design of the library is based on work like
this where we figured out
things that we wanted.
So, you know, the first challenge you get is, you know,
really where to integrate this into your program.
You know, I actually get this question a lot,
and it comes in many forms, so I'll paraphrase.
The question is, are you kidding me?
I have to modify my entire program to use
persistent memory? And the answer is, well, probably not. At least in most of the cases
that we found, that's not what's going to happen. Usually what's going to happen
is this: we're not replacing memory. There's still going to be system memory.
You know, we're not replacing that with persistent memory. We're not replacing storage.
There's still going to be storage. We're not replacing that.
We're introducing a new tier.
And so probably what's going to happen
if you want your program to leverage this stuff
non-transparently, you want to
actually make modifications to get
the best leverage, is that you're
going to figure out what data
structure should live in this new tier, and
you're going to make a few modules for managing
that new tier. You're not going to go through
your entire program and rewrite it.
You're mostly going to decide, ah,
I do need to comprehend
this new tier, just like what happened
when people used to have programs
that dealt with storage
and memory and then SSDs came
along and they started thinking, hmm,
I could probably do something clever by
putting some of my data on an SSD instead of a
hard drive for performance. Yes, sir?
So, is the
persistent memory address
range, is that actually
addressed through normal page tables, and is it
just your particular implementation
and modification of the Linux kernel that is
actually separately treating those,
or alternatively,
is it actually treating the entire space? Yes, so the question was,
here I am, the only way that the program has the ability to get to persistent memory
is by opening up a file and then memory mapping it,
and then, sure enough, it's there and it's using MMU
mappings, it's using page tables. Could you have
an operating system that does this completely differently?
And that's what I was talking
about yesterday with the transparency levels.
You could certainly have something where
it's used only at this level and the
application has no idea what's going on.
And it could be, you know, maybe the operating
system decides sometimes to give you a page
of this stuff. But it's persistent.
And usually when you want something to be persistent, you have to associate some sort of name with it so that you can reattach to your data.
Otherwise, you know, why think of it as persistent, right?
And so that's why we came up with this idea of using file names as the names of the blobs because people were already kind of used to that.
And it already has a permission model.
So, of course, other models are possible.
So access and dirty bits are available, all the usual?
Yep. Reference, modify bits, that's all available.
Okay.
So the other challenge here is, like I say,
picking which data structure you put on persistent memory.
And we spend an awful lot of time just kind of skipping down here.
What happens if a failure occurs?
We spend an awful lot of time thinking about that,
especially for the persistent part.
I'm going to talk about that when we get in there.
Okay, I'm going to skip this slide
and just go to Sarah's picture.
This is the picture of the software
that is not yet to be named,
but will be someday soon open sourced, we think.
And the whole idea here isn't that important,
but it's obviously got some sort of fan out,
just like so many things do
in a big data world these days. And these servers here
each get some work sent to them, and they have some big giant existing block cache today,
but it's actually not that big or that giant. It's just based off of existing technology.
And we want to look at moving it from DRAM to some combination of DRAM and persistent memory.
I'm also going to skip this slide.
In our
software that we were given, there's one block
cache per server.
There's
a two-stage allocation, we think,
because at the first time that you know
that
you need to use the cache,
you can allocate the space, but you don't know the data yet,
and so then you come back and you put the data in.
So it's kind of a multi-stage cache entry part.
At least that's how the software is already designed.
And then there's a sort of a wrapper that goes into the cache
that holds both the key and the value in there.
There's a least recently used algorithm in the cache.
So this describes the cache in so many programs, right?
So many different programs have ways of caching things.
But the details aren't that important.
What's really important is that we've identified a cache
that we want to move to persistent memory.
So the first thing we did was this volatile mode version
because we thought, well, that'll be easier,
it'll be faster for us to implement, and it was.
The goal was to reduce DRAM.
In other words, take this cache, which was largely using up all the DRAM,
and just move it onto persistent memory in volatile mode.
And this is great because it is really awfully easy to use.
You don't care about flushing.
You don't care about persistence. The amount of space that you have is bounded. Now, this is actually
kind of interesting. On many Linux distros, malloc will virtually never return null. malloc
always succeeds, and it's because of this little feature in Linux called memory
overcommit. On one
distro, malloc will return
null because overcommit is turned off.
It's actually the reason why some people use
that distro, because they don't
want you thinking that you have memory that you don't.
Anybody know the distro?
No. Then you don't
win the prize.
So,
you know, Sarah doesn't give out information like that.
I don't know.
It's in SLES.
The guys from SUSE do that.
So, again, we're kind of using just a volatile mode here.
We base this off of libvmem.
And really the whole point that you want to know about this slide is
we just picked a version of this cache
and we changed the places where it does malloc and free
to these calls to vmem_malloc and vmem_free,
which is what we call it in our library.
It was a pretty straightforward thing.
I'm lying just a little bit because the code is C++,
so actually we changed constructors and destructors to use these things.
But other than that, it was pretty straightforward.
This is a very light lift.
If you just want to take a data structure
and put it into persistent memory as volatile mode,
this is an incredibly light
lift. It's a great way to
try out the different timings of the different
tiers of memory without worrying about the
persistence.
So we had this done very quickly.
The hardest part about it, if you think
about it, remember that
diagram I showed you where you open up a file and you memory
map it. Well you don't want to go to
a programmer and say, here, I'm going to
map two terabytes into
your address space. Have fun.
You want them to have some sort of allocator
there, some alloc and free.
So that's what the library does for you.
So if you had to write your own allocator, it would take some time.
Most of us haven't had
to do that since CS101
or whatever
when you finally learn how allocators work.
So it's a pain in the butt.
And if you go out and look at all the third-party allocation programs,
memory allocators that are out there,
none of them expect to work on their own little pool of memory.
They all expect to, oh, we're a faster allocator.
We're going to take over being the allocator for your main memory.
And so we took jemalloc, which is the memory allocator on FreeBSD
and used on a lot of other systems as well,
and we just modified it to,
instead of assuming it's the only allocator on the system,
we just modified it to have the ability
to operate on multiple independent pools of memory,
and that's exactly what this library is.
It's really just a wrapper around jemalloc.
We didn't write a whole other memory allocator.
Not for volatile.
For persistence, we did.
So all the tracing things and whatnot
that jemalloc has, do they work?
They do work. We made them available.
All the tracing in jemalloc is available
through our library if you want the
stats.
So, this is great. Piece of cake,
right? Very light lift. So,
what did we have to watch out for?
Well, this first one actually really, really,
really annoyed me. The first
one is, if you get it mixed
up, if you get some things with malloc
and you get some things with vmem malloc,
and then you call the wrong free,
the libraries don't know what to do.
I mean, they actually just kind of do the wrong thing,
which is not crash most of the time.
They put it in the wrong tree or something,
and later on you discover it when you do crash,
and you're like, wow, what the heck happened here?
So we're fixing this, right?
Because it's just totally annoying.
You don't want to put that burden on the programmer. And the way
we're fixing this is with another library
that now is on GitHub again
that I think is probably going to replace
libvmem. We're probably going to just
sort of unify on one. It's called libmemkind.
Intel
had already done this library for
allocating from different
NUMA distances so that you could mix
your local and your remote allocations
and things like that.
LibMemkind is clever enough.
You can pass anything that it allocated to free,
and it'll figure it out.
And it's willing to take over both the system main memory
and these other pools.
We went ahead and added the ability for it
to allocate from persistent memory as well.
So LibMemkind seems to be this kind of unifying thing
and then over time as we see
these new kinds of memory emerging
over the next few years, we'll just
keep adding more kinds to LibMemkind
and it gives you a nice unified
programming model for volatile memory.
So, at least some of us are
all focusing on this. There's several groups of us
now. The high performance computing guys
and me, the persistent
memory group and several other groups inside
of Intel are all just agreeing
this is the unifying
library for volatile memory allocation.
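A small sketch of what that unified model looks like with memkind, using its current interface; the pmem directory and sizes are placeholders, and the exact entry points available at the time of this talk may have differed:

    #include <memkind.h>   /* link with -lmemkind */

    int main(void)
    {
        /* a kind backed by a file in a pmem-aware file system;
         * a max_size of 0 means limited only by that file system */
        struct memkind *pmem_kind = NULL;
        memkind_create_pmem("/pmem", 0, &pmem_kind);

        char *hot  = memkind_malloc(MEMKIND_DEFAULT, 4096);  /* ordinary DRAM */
        char *cold = memkind_malloc(pmem_kind, 1 << 20);     /* bigger, cheaper tier */

        memkind_free(MEMKIND_DEFAULT, hot);
        memkind_free(pmem_kind, cold);
        return 0;
    }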
So, yeah.
Is libmemkind NUMA aware
still? Yes, libmemkind is still
NUMA aware. For the persistent DIMMs?
Yes.
For anything that has NUMA locality, yes.
So, you know, libmemkind, I actually added this bullet to Sarah's slide just now before I got up here.
libmemkind wasn't quite ready for her use when she did this.
So now she's going to see this slide and say, well, why was I told not to use this? So I think it actually makes a lot of this a little simpler
because you get this unified model.
Yeah?
So when you do malloc in these new libraries,
are the page table entries any different from the process?
Or let's say if I'm looking at the process's memory map,
can I say that, okay, this memory came from the persistent memory
and this came from the main memory?
So the question is, if I allocate some things from the normal system main memory
and some things from the persistent memory pool,
are there differences in the page tables, for example?
And there aren't differences in the page tables,
but in Linux there are these map files in slash proc that you can look at,
and it tells you what file your pages came from for memory-mapped files.
So it makes it actually quite clear.
You see the range of your address space for that process,
and you see that it came from a file on a persistent memory aware file system
so you can reverse engineer
and say oh this came from persistent memory
and the other ones show up as anonymous memory.
So these don't show up as anonymous,
they show up as associated
with that file system.
yes sir
You talked about
volatile memory versus persistent memory,
and you talked about a cache as an independent portion of it.
Are you referring to the non-volatile DRAM on the DIMM,
or is this outside of the DIMM?
Outside of the DIMM.
I'm just talking about the normal processor cache hierarchy,
the L1, L2, L3 caches in the processor.
So let's move into persistent memory here
since I'm about, well, I'm more than halfway through my time
and I really wanted to get to this part.
So what's so hard about it?
I just got through telling you how easy it was
to just use this malloc and free interface, right?
Well, a couple things are hard.
And let me start with an example.
Remember I was telling you that these things are like memory mapped files, right?
So on Linux, the way you memory map a file is you open it, you memory map it,
and then you have this pointer that you got back from MMAP that you can just do stores to.
And here I am doing stores with a strcpy, everybody's favorite not quite secure copy command.
And I'm copying my name to this persistent memory.
And now I want to make it durable.
Well, my name, with the
null terminator, is 5 bytes.
Oh, I know, I know.
I really should have changed that,
but then the example
didn't quite work.
Yes, so I'm copying Andy's name.
Because, you know, he's such a fun guy to work with.
And with the null terminator, it's five bytes long,
and so to make it durable, I can call msync, just like I told you.
Or I just got done telling you that we have a libpmem
that essentially does the same thing but without having to trap into the kernel.
So this is what it looks like if you use libpmem.
It flushes those five bytes out.
Piece of cake, right?
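That example looks roughly like this in code; the path is a placeholder and error handling is omitted:

    #include <fcntl.h>
    #include <libpmem.h>     /* link with -lpmem */
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/pmem/myfile", O_RDWR);   /* file on a pmem-aware fs */
        char *pmem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);

        strcpy(pmem, "Andy");    /* 5 bytes including the null terminator */

        /* same durability guarantee as msync, but no trap into the kernel */
        pmem_persist(pmem, 5);
        return 0;
    }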
So, and there's just kind of a picture in the architecture of where libpmem lives.
So you can kind of see.
But what happens if I copy something bigger?
Now my co-worker, Andy Rudoff's full name, is actually 12 bytes with the null terminator.
And so here I am.
I copy out those 12 bytes, and now I'm going to flush them all to durability.
Piece of cake, right?
This is really easy stuff.
But before this call returns, the system or the program crashes.
So what are the possible results here?
Well, one is, assuming it all started out as all nulls, that none of the data got copied.
I crashed before anything got flushed out.
Or maybe it all got copied out
there. That's perfectly possible.
Or what about this number two and number
three here? Maybe it got part of the way
flushing things out. It all seems
kind of reasonable. Or look
at number four here. It's starting to look kind of
ugly. The last
part of my co-worker Andy Rudoff's
name went out, but the first part
didn't. So you can see
with this kind of a flush idea, there's
nothing transactional, right?
Nothing is transactional, and that's
what makes it hard. If I'm trying
to recover from this, I can't
just look to say, oh, how far did the program
get, because I get cases like this.
Things can go out of order because cache pressure works that way.
On a modern-day set associative cache, things go out in any order.
You can't depend on it.
So we need transactions, and transactions are not provided by the hardware.
Now, wait a minute, you say.
I went to the Intel website and I went to the software developer's manual
and I searched for the word atomic.
And Spencer did that earlier today.
He told me it showed up like dozens of times, didn't you?
Sorry, I had to use somebody.
So, yeah, you'll find the word atomic in there all the time.
But, you know, we didn't have persistent memory for the past N years of Intel development.
So they're not talking about atomicity with respect to durability.
They're not talking about atomicity with respect to failure,
like power failure.
They're talking about visibility.
So if you see an instruction that says it atomically stores 16 bytes,
like there's a compare-exchange instruction called CMPXCHG16B,
they're talking about visibility.
No thread running concurrently to your thread will see part of that store.
Cool, right?
It has nothing to do with durability.
So even though you see these atomic stores that are much bigger in the manual,
like there's a 512-bit one in the AVX-512 instruction set,
for power failure, it's still 8 bytes.
8-byte atomicity.
If you're storing 8 aligned bytes and you lose power,
you'll get the old 8 bytes in your persistent memory or the new 8 bytes in your persistent memory.
That's cool.
That's kind of like a little mini atomic store.
Anything bigger than 8 bytes?
All bets are off.
Software has to build transactions
out of those 8 byte
transactions, those 8 byte atomic
instructions. Software has to build bigger
transactions in software.
What about the TSX instructions? There are these
new things that came out
with Haswell.
Yeah, thank you. See, you
do know this better than I do.
So,
the TSX instructions,
which stands for transactional,
the T stands for transactional, have
instructions called XBEGIN and XEND.
So you do an XBEGIN and you can do a bunch of changes
and you do an XEND. And again, it's all
transactional with respect to other
threads. But if
you get, it's really meant to
allow you to do this kind of optimistic locking
where you don't have to grab a lock. But if
a conflict does happen,
you get this thing called an XABORT and then you're
expected to take a lock and redo
it. Well, one of the things
that always causes an XABORT is a cache flush.
It's just not made
for persistent memory. Maybe in the
future, but every time I go to the hardware
guys and I ask them to make this stuff work for persistent
memory, they give me a very long reply about how
hard it is. So it's not going to be
anytime soon.
I also mentioned compare exchanges.
It's used by these lockless
algorithms. Also, this is the heart of a lock
itself, but these lockless algorithms are all in vogue these days
that go around and use, you know,
they call them non-blocking data structures
where they don't grab locks,
but instead they use these compare-exchange things.
That's fine.
It will work, but it will not work as far as durability is concerned.
These things, there is no one instruction that says
compare, exchange, flush, PCOMMIT, and then
let somebody else see it. There's no instruction to do that.
So we have to do that in software.
And so, like I said, we have a library
that is a general purpose transaction library
and it sits
on top of libpmem
and it has operations like
begin a transaction, end a transaction,
allocate and free.
It's got a memory allocator, but this is more than your garden-variety malloc and free.
Now, if you allocate something, but before you get around to using it, you crash, that memory goes back on the free list.
Otherwise, it would be not only a memory leak, it would be a persistent memory leak.
Well, that's like double bad, right?
So that same example of writing
my co-worker Andy Rudoff's name
to a field, I'm just
going to kind of skip
a lot of the details for now,
but the point is now there's a way of beginning
a transaction, which I put it into
a macro here in C, just to make the code a little
cleaner to look at.
There's a begin and an end, and so now
this strcpy either completely takes place
or doesn't take place at all in the face of
failure.
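As a sketch, assuming a hypothetical persistent struct living in a libpmemobj pool, that macro usage looks roughly like this:

    #include <libpmemobj.h>
    #include <string.h>

    /* hypothetical persistent structure for this example */
    struct employee {
        char name[64];
    };

    void set_name(PMEMobjpool *pop, struct employee *e)
    {
        TX_BEGIN(pop) {
            /* snapshot the range into the undo log before modifying it;
             * crash before commit and the old contents come back */
            pmemobj_tx_add_range_direct(e->name, sizeof(e->name));
            strcpy(e->name, "Andy Rudoff");   /* 12 bytes with the terminator */
        } TX_END
    }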
Now our transactions also
provide you the opportunity for some
multi-thread safety at the same time.
What we found when we started converting programs
to use these transactions is
that
everybody does multi-threaded programming today.
The places where you protect data structures for visibility with other threads almost always lined up with the places where you protected things for power failure atomicity, right?
It was so common that that's why we made this C macro. It not only begins a transaction,
but the underscore lock version of it here
offers to grab one of your locks.
This is just a lock the programmer defined.
It's protecting this data structure
to make this code multi-thread safe.
And so what happens is,
you begin the transaction
and you're also locking out of the threads.
You do the transaction,
and then the end here commits the change
and drops the threads. You do the transaction, and then the end here commits the change and drops the lock.
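A sketch of that locked variant; the PMEMmutex lives in persistent memory, the struct is made up for illustration, and the lock-parameter spelling here (TX_LOCK_MUTEX, TX_LOCK_NONE) is the NVML style and may differ slightly between versions:

    #include <libpmemobj.h>
    #include <stdint.h>

    /* hypothetical persistent structure for this example */
    struct counter {
        PMEMmutex lock;    /* reinitialized every time the pool is opened */
        uint64_t  value;
    };

    void bump(PMEMobjpool *pop, struct counter *c)
    {
        TX_BEGIN_LOCK(pop, TX_LOCK_MUTEX, &c->lock, TX_LOCK_NONE) {
            pmemobj_tx_add_range_direct(&c->value, sizeof(c->value));
            c->value++;
        } TX_END   /* commits the change and drops the lock */
    }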
Turned out that actually a lot of the multi-threaded
code got simpler
because there were places
in multi-threaded code where you grabbed a lock and you started
making your change, and you realized
something was wrong. There's an error path.
And you had to kind of put things back into a sane state
before dropping the lock again.
But we have an undo log here.
We do that for you.
Cool.
Sarah thought that was really cool.
Good job, Sarah.
Yeah.
And the kind of locks that we give you, now this, you know, this is Linux code,
so pthread locks are the common kind of locks these days.
We offer these wrappers around the Pthread locks called PMEM mutexes.
A PMEM mutex is just like a normal mutex, like a Pthread mutex,
except for every time the persistent memory file is opened, all the locks are magically initialized.
It's all done
with a generation number. So imagine if you
have a data structure with potentially millions
of locks. It's a tree and it has a
lock in every node of the tree. There are millions
of locks. Some of them are held and some of them
are not, and the program crashes. When you come back
up, you don't want those locks to
be in that old state. So we just
magically, if they're this type, we
magically revert them.
And like I say, it's all done by incrementing one generation number, so it's very fast.
So in addition to that, I just got done explaining how cool all these transactions are.
We also found that outside of a transaction, it's nice to just be able to do allocates and frees and know that they're safe, they're atomic.
In other words, you might just want to say,
oh, I'm just going to allocate these things and fill them up and start using them.
And if the program crashes, I just need a way when I come back
of walking through everything that I allocated.
And so we added a type field to our allocator,
and you can allocate things and provide a little type.
That's a little tag. It's like a data type.
And then when you come back up, when you first open the pool,
you can say, I need you to walk through everything of this type.
I'm going to put it all into a hash table.
I need you to walk through everything of this type.
I'm going to put that into a different hash table, and so on.
It turned out that for a lot of the problems that we were dealing with,
this was all you needed.
We didn't actually need those transactions.
We just got a lot of simpler code here.
The other thing that we found that was handy to have outside of a transaction
are list operations.
So we just made a generic doubly linked list,
just like in the C++ standard template library,
where there's a list or something like that.
It's a doubly linked list that everybody,
one person had to code it up and optimize it and get it right,
and everybody benefits from it.
Same idea here.
It's a doubly linked list, only it has some cool features to it.
You can allocate a new node and put it on the list atomically. So
if you crash, that thing is either completely allocated and on your list, or that space
reverts back to the free list. The same thing on the remove side. You can remove something
from a list and free it atomically. If you crash, it either goes back to where you found
it because you're not done yet, or it completes the operation. Or you can move from one list to another.
So the combination of the earlier allocations
that are safe and atomic
and these lists that are safe and atomic
actually allow you to solve a lot of problems
without even worrying about transactions.
And so it turned out, you know,
we kind of added these as an afterthought,
thinking, you know, this might come in handy.
Well, for Sarah's work,
she turned out not to use transactions at all.
She ended up using this.
And so she made that block cache by just allocating those items
and putting them into a hash table that lived in DRAM
because the hash table was the critical code.
It was the thing that needed the performance of DRAM.
And then when the program shuts down, we don't worry about it.
When the program comes back up,
we just walk through all the lists of
stuff that are in the cache,
potentially terabytes of it, and quickly
rebuild the hash table. Quickly.
For terabytes, we think
it's probably
tens of seconds per terabyte at least.
But it's still
not so bad when you have a pretty big cache like that
that you're warming up.
So this has just been kind of a summary of Sarah telling you
which of these operations that we offer in the library she ended up using,
and I just gave it away so you already know.
I'm going to skip a few slides because I want to get to the end in time for questions.
And, of course, I added the slides that had the word Andy in them,
so now I didn't use some of Sarah's slides.
So, yes, it is challenging to use persistent memory.
We're trying to make it easier with libraries,
but you at least have to comprehend
which things are going to live in persistent memory
and which things aren't.
It's not something you do transparently,
at least not in this example, right?
We're trying to say to really get leverage,
you modify your application.
If you want something transparent, then you're down lower in the stack doing this kind of change to middleware
or doing this kind of change in the kernel.
It's very critical, though, to be thinking about when something is actually visible versus when it's persistent,
like I was saying.
You can use our transactions to help with that.
And, of course, it's important to consider the state of the persistent memory at any point in the crash.
So when I was telling you that Sarah decided not to use our transactions,
she allocates something onto a list,
and we guarantee a certain initial contents, all zeros.
So she has a valid field, and that valid field is initially zero.
Then she goes and fills it in, and she persists it.
And then she turns the valid field to one.
So in a way, she kind of created
her own very lightweight transaction.
And on startup, when she's rebuilding
the hash table, if she finds anything with a valid
field still set to zero, she says, oh, I guess
I never finished filling that one in, and she just frees it.
So you really have to think that
through.
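A sketch of that lightweight valid-flag pattern; the struct layout here is made up for illustration, and the entry is assumed to live in persistent memory that starts out zeroed:

    #include <libpmem.h>
    #include <stdint.h>
    #include <string.h>

    struct entry {                /* lives in persistent memory, starts zeroed */
        uint64_t valid;           /* 0 means "never finished filling this in" */
        char     key[56];
        char     data[4096];
    };

    void fill_entry(struct entry *e, const char *key,
                    const void *data, size_t len)
    {
        strcpy(e->key, key);
        memcpy(e->data, data, len);
        pmem_persist(e, sizeof(*e));                 /* payload durable first */

        e->valid = 1;                                /* 8-byte aligned store */
        pmem_persist(&e->valid, sizeof(e->valid));   /* then the commit flag */
    }

    /* on startup, anything whose valid field is still 0 just gets freed */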
Using the NVML library seemed to make it easier.
We kind of did some time estimates.
We thought if she had to write her own allocator, that's probably the big lift,
or if she had to do her own transactions for the persistent version,
that's also a big lift.
We thought it would probably have taken her 10 to 12 man months to do that,
10 to 12 Sarah months.
Instead, she did it with the NVM library in less
than a month. It kind of gives you
an idea of what we hope to achieve with
this library. We hope to save time.
That's the summary.
I think I managed to leave myself
with a couple minutes for
questions. Any
questions?
The library is open sourced under a BSD license.
I chose the BSD license because you're free to then take the library code
or the library itself and put it in a closed source product.
You can make a T-shirt out of it, build a bridge out of it.
We won't care because it's not a money maker for us
we're not trying to make money from selling the library
we're trying to make persistent memory programming easier
so
yes sir
that's a great question
wouldn't this be easier if we added a persistent keyword or a transaction syntax to C?
And I think eventually that's what's going to happen.
I'm not sure it'll ever happen to C.
I expect to see it in higher-level languages first, anyway.
But that's a much longer thing.
So we wanted to start out making the library handle the interesting operations
and get some experience with them.
And so today, if you decided,
I'm going to have a type of an object in Java
that's a persistent object,
and you're in there and you're working on the JVM
to make it do that,
probably the way that you would make that work
is by calling this library, this C library,
in your implementation.
Right, but it would be very tedious and hard.
Yeah, to use this directly, I agree
with you. We did
make Valgrind
work with it, by the way, and Valgrind
is a Linux tool for finding leaks and things
like that. One of the things it'll tell you
is if you ever write to
persistent memory locations and you don't
flush, or if you flush multiple times,
anything that doesn't seem quite right.
So we're working on the tools.
They're out there. They're on GitHub also.
But we've got a lot of work to do there.
Right?
Absolutely.
But I see that as a longer term thing because I don't think
we know the answer yet.
But we are headed there. Definitely.
Other questions? Okay, thank you.
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe
at snia.org. Here you can ask questions and discuss this topic further
with your peers in the developer community.
For additional information about the Storage Developer Conference,
visit storagedeveloper.org.