Storage Developer Conference - #116: Persistent Memory Programming Made Easy with pmemkv

Episode Date: December 10, 2019

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 116. Okay, so I'm Andy Rudoff. I work at Intel, and I'm the persistent memory software architect at Intel.
Starting point is 00:00:48 So all this stuff about how you program with persistent memory and the programming model and the changes that got made to Windows and Linux and DAX and all these things are, at some level, my fault. So here we go. I'm here to talk to you about something that is gaining a lot of traction lately. We have a lot of talks at SDC and at other conferences about different ways of using persistent memory. There's a lot of research coming out. I've noticed there's probably a dozen papers a year now that come out on persistent memory programming, which I think is great. And some of them are pretty complex and pretty involved.
Starting point is 00:01:34 So we set out a little while ago, more than a year ago now, to say what can we do to just make it as simple as possible for someone who still wants to modify their application. But what can we do to make persistent memory programming really easy? So I'm going to talk about why we built this key value store. I'll go through the design and some of the features that it has, and I'll touch on performance at the end. So there are a lot of ways of using persistent memory, right? And by now you've probably heard it if you've been to a lot of these talks. This product that Intel is shipping has a mode that you may have heard of called memory mode, which is just completely transparent to the application.
Starting point is 00:02:13 Apps aren't modified. It's completely transparent to the OS. Even the OS isn't modified. You boot the machine up and it thinks it has a huge amount of, you know, 6 terabytes on a two-socket system. That's great. It's transparent. It's a volatile mode. There's nothing persistent about this mode. And quite frankly, it's cool, but it's not a match for every use case.
Starting point is 00:02:34 It's a match if you're trying to save cost on DRAM. If you need a system with a lot of DRAM in it, and you're saying, well, how can I build that system more cheaply, rip out a bunch of the DRAM and put in the Optane stuff? That's great, but that's just one use case. And on the other hand, this stuff, the media is persistent, so
Starting point is 00:02:53 what if you want to use it as persistence? Well, the storage APIs work, and unless you've never heard of persistent memory before, you've already seen this picture, but the point of this picture is to show you that middle part here. These are the storage APIs. These are unmodified applications right here and here. And this is an unmodified file system. So this is stuff that already has been around for years and years. And with this new driver, it's working on top of
Starting point is 00:03:19 persistent memory, just like it's storage. So that just works. So that's great. There's nothing to do, right? Well, again, this isn't really leveraging fully what persistent memory can do. To do that, we end up on this right-hand side of the diagram. Just to kind of quickly recap it for those who may not be as familiar with it, on the right-hand side of the
Starting point is 00:03:40 diagram, we now have a persistent memory aware file system. It allows the standard APIs to work, and then when you memory map a file you get this arrow on the right, which provides direct access, or DAX. We call it DAX for direct access, and that's really where a lot of the research is happening today. That's where the most complex programming is if you want to do it. You know, it's not a requirement. It's something you get to choose to do if you want to really leverage what byte-addressable persistence can give you.
Starting point is 00:04:20 The thing is, on that right-hand side with this DAX kind of programming come certain responsibilities. I would say with great DAX comes great responsibility. And if you came to any of the hackathons during SDC, you saw these little flowcharts I drew. One was about your responsibility for flushing things to persistence. One was about your responsibility for checking to see if there were failures for RAS kind of responsibilities. And one is about keeping your data structures consistent. And there was a lot of complexity in here. I'm not going to dive into these because we have plenty of this information in the other talks and online. But this stuff starts to get kind of hairy. And we realized this about, well, about seven years ago for me now.
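To make that flushing responsibility concrete, here is a minimal sketch (not code from the talk or the slides) of the raw, libpmem-level style of programming he is describing; the file path and size are invented for illustration:

    // Hedged sketch: map a file on a DAX file system and flush stores yourself,
    // using PMDK's libpmem. Path and size are made up for illustration.
    #include <libpmem.h>
    #include <cstring>

    int main() {
        size_t mapped_len;
        int is_pmem;
        char *addr = static_cast<char *>(pmem_map_file(
            "/mnt/pmem/example", 4096, PMEM_FILE_CREATE, 0666,
            &mapped_len, &is_pmem));
        if (addr == nullptr)
            return 1;

        // Stores are just normal memory accesses...
        std::strcpy(addr, "hello, persistent memory");

        // ...but durability is your job: flush the CPU caches (or msync).
        if (is_pmem)
            pmem_persist(addr, mapped_len);
        else
            pmem_msync(addr, mapped_len);

        pmem_unmap(addr, mapped_len);
        return 0;
    }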
Starting point is 00:05:00 And we started this project to address some of this complexity. In fact, I would say our single goal of the Persistent Memory Development Kit is to make persistent memory programming easier. So this is our slide kind of summarizing what's in PMDK. You can see there's a whole bunch of different libraries, and again, I'm not going to go into them in this talk, but there are some low-level libraries, there are some libraries that support transactions.
Starting point is 00:05:26 And then you can see on that picture on the right where the libraries sort of fit in. But the whole challenge of the libraries, this is one of the funnest projects I've ever had. One of the challenges of the libraries is to still give people that direct access, right? If you look at a lot of libraries that do things like transactions today, they're designed to fit into the storage APIs, the storage stack. And so you pass a buffer containing the data you want to write, and it writes it transactionally or something like that. But these buffer-based APIs are designed that way because storage is designed that way. It's buffer-based. You use a DRAM buffer, and then you tell the storage subsystem
Starting point is 00:06:05 to copy data onto storage or to copy data out of storage into DRAM. We have direct access here. We don't need DRAM. We can access the persistence in place. So we spent a lot of time on these libraries thinking how could we give people both transactions but also still allow them to just directly access this stuff
Starting point is 00:06:23 without making copies all over the place. And that's kind of the beauty of these libraries. But, of course, the libraries themselves, they're a lot of work to use. You have to learn how to use them. You have to train programmers how to use them and so on. Why use libraries? Why do people use any kind of library? Well, people use libraries because they want to benefit from the debugging and the validation and the tuning that somebody else has done, right? You don't
Starting point is 00:06:50 write printf over and over again for every C program you write. You use the library. And just to kind of drive this point home, we took a quick look at the versions of our libpmemobj. We have a btree benchmark that we use just to make sure that we haven't regressed every time we cut a release. You can see there's release 1.0 down there. That was about five years ago now. And then 1.1, 1.2. And just running this B-tree benchmark, which really exercises transactions, you can see what's happened. Taking the first one and just calling it 100%, you can see each time, almost each time,
Starting point is 00:07:27 we've improved the performance of the transactions. We did have a little regression there one time. And it came up. So you can see we're about 550% better than we were five years ago, and we're not done, right? Because every release, we call it improving the performance one cache flush at a time. Basically every release we examine the path, the code path that these libraries
Starting point is 00:07:53 take, we look for places where there are cache flushes or fences even anything that would cause overhead in a transaction and we just are constantly picking through this. And then we have a huge validation team, and we validate it back to production quality
Starting point is 00:08:12 to make sure it's still that way. This is the stuff that you want to leverage about a library. You want to take advantage of the fact that some company, Intel in this case, has decided to spend money to validate this stuff so that you don't have to. So how can I take this benefit and this complexity of all these libraries and combine it into something that's maybe a little easier to use? Well, that's why we invented the key value store.
Starting point is 00:08:39 We have memory mode. We have storage, like I told you. Those things are for legacy software where you don't want to modify your software. We have the low-level libraries in PMDK; libpmem is one of them. That gives you very raw access if you want to do the flushes yourself
Starting point is 00:08:56 and deal with transactions yourself. That's what you use. We have libpmemobj. This provides you all this powerful, flexible stuff: allocations, transactions, containers. We have a bunch of containers in libpmemobj. What we were kind of missing until now, until about a year ago, was just a simple put-get API,
Starting point is 00:09:18 the stuff that honestly is all the rage these days. You look at a lot of these so-called NoSQL databases, the RocksDB, LevelDB, Cassandra, that kind of thing. That's essentially what their interfaces look like now, just puts and gets. And again, we do the job of tuning it and validating it to product quality. So why not just use one of the existing key value stores?
Starting point is 00:09:44 Why not just pick one? Pick RocksDB. RocksDB is very mature. It's been around forever. It's been tuned and tuned and tuned. And the answer, again, is that it's buffer-based. In RocksDB or any of these existing, if I look up a key and I get back the value, the only way I can read that value off storage, since it's not directly addressable,
Starting point is 00:10:04 is to pull it into DRAM. So it's a buffer-based API. So if the value is, say, a 4-gigabyte little clip of video or something, that's copying 4 gigabytes of data, right? I have no choice because the only way to get something off storage is to copy into DRAM. It's the only way storage works, right? It's not directly addressable. On the other hand, if you look on the right, what if I had this in persistent memory? Now I can look up the key and I get back a reference to the value.
Starting point is 00:10:35 And I can just access it right in place. I can do whatever I want with it without copying it. That's the thing that you can do with persistent memory that you can't do without persistent memory. That's what's unique about persistent memory. It's access in place persistence. We thought that was pretty cool. We thought that was worth creating a library around. When we first started this, we made a list of our goals for a persistent memory key value store.
Starting point is 00:11:11 We decided the first thing to do was just make a local key value store. We could build a distributed key value store later using it, but you usually need a local thing first. So there's no networking in this library. The second one was a big one. It's very easy to come up with a C library that does something and tell everybody, you can call C from Go, you can call C from Java, so we just leave that as an exercise. And we didn't want that.
Starting point is 00:11:43 We really wanted the key value store to appear idiomatic for every different language that we support so that somebody didn't have to learn something that was kind of unnatural for programming in the language they wanted to use. We wanted a very simple API. It says bulletproof. Maybe the less polite way to say that is idiotproof. We want something that's just so simple you can use it. We want to be able to extend this with different engines over time. And so there's a plug-in architecture you'll see in a minute.
Starting point is 00:12:06 And really optimize for persistent memory. That was our goal. We also wanted, as you add more engines and different configuration parameters, we wanted to build up a big framework of tests. So it was very easy when you add a new engine to make sure that your engine works correctly. The result is this thing. It's on our GitHub area. It's very liberally licensed. You can take it and copy it into
Starting point is 00:12:35 closed-sourced software if you want. We don't care. We definitely have taken some outside contributions and we welcome them. Really, our goal is at this point to be kind of the gatekeepers of it to make sure it stays at a certain quality and to write engines where we think that there's a need, a missing area. Here's a picture of the architecture. You can see it's very colorful and a lot of components, but actually it's not as complex as it may seem. Down at the bottom are these different engines that we have.
Starting point is 00:13:12 There's some persistent engines that are built on PMDK. We also have some volatile engines that don't worry about persistence at all, and I'll show you the list of engines in a minute. The core library is actually written in C++ and mostly its job is to provide the common API and then we do these bindings in ways that are, like I said, are idiomatic in different languages. So a lot of the work happens on these bindings.
Starting point is 00:13:40 For example, we did an initial version of the JavaScript bindings, and we don't really know anything about JavaScript. And we said, yeah, okay, it works. But then we went to, apparently, Intel is such a big company, there's this whole group that does nothing but JavaScript, and we finally found them. And we said, can you help us? And they rewrote our Node.js implementation,
Starting point is 00:14:01 and now it's actually quite performant and quite nice. And I'll show you what that looks like on a later slide. So, you know, we're not trying to say we're the experts on all these languages, but we're trying to give people a way of making sure that we can support all these languages in an idiomatic fashion. The idea is that for each engine, there's some set of configuration parameters. And so we wanted this kind of flexible configuration mechanism. I've got a little fragment of C++ code here on the right that kind of shows you how it works. You declare this little config struct here and then you start adding
Starting point is 00:14:36 the things that the specific engine you want needs. Like, if the engine takes a path and a size, which most of them do, this is where you set those config options before using the engine. Each engine has basically a man page that tells you what are its configuration options. And nothing there will surprise you, but the configuration options are based on the type, like strings or uint64. And all this configuration data is not persistent, right? It's something that happens on the stack. So I have a little
Starting point is 00:15:09 kind of walkthrough to show you the life of an application as it uses libpmemkv. And the walkthrough starts with nothing. There's no application running yet. It's just starting up. Nothing's in DRAM.
Starting point is 00:15:25 Nothing's in persistent memory. And then the first step is this configuration step. The application creates one of these config structs. It starts filling in. Like here it's saying, oh, okay, I want the key value store to be in this persistent memory file. I've got its size declared here. Size is off screen.
Starting point is 00:15:46 And you can see it builds this little config struct in DRAM as a result of these calls. Moving on, the next thing we do is we create the actual database itself. And here you can see it's constructed this little database object here. Again, it's a volatile object for managing the engine. Open is where most of the important magic happens. It starts up the engine. In this case, we're using the CMAP engine, which is a concurrent hash map engine that we've recently released at Production Quality.
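For reference, the configure-then-open flow he is walking through looks roughly like this with the pmemkv C++ bindings; this is a hedged sketch, the file name and size are invented, and the exact options each engine accepts are listed on that engine's man page:

    // Hedged sketch of pmemkv configuration and open, based on the libpmemkv.hpp
    // C++ API. The path and size below are invented for illustration.
    #include <libpmemkv.hpp>
    #include <cassert>

    int main() {
        pmem::kv::config cfg;                             // volatile, lives in DRAM
        cfg.put_string("path", "/daxfs/kvfile");          // persistent memory file to use
        cfg.put_uint64("size", 1024ull * 1024 * 1024);    // pool size, used on creation
        cfg.put_uint64("force_create", 1);                // create the pool if missing

        pmem::kv::db kv;                                  // volatile object managing the engine
        pmem::kv::status s = kv.open("cmap", std::move(cfg)); // start the cmap engine
        assert(s == pmem::kv::status::OK);

        // ... puts and gets happen here ...

        return 0;   // kv's destructor closes the engine; the data stays in the file
    }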
Starting point is 00:16:21 We're pretty excited about it, so I'll be talking about it a bunch during this. And you can see at this point, everything kind of kicked in on the persistent memory side. The engine started up. In this case, it created the key value store based on the options that I gave it with the size and so on I gave it. And it sort of hooks it into the volatile data structures that are used at runtime. So at that point, you're running, you're up and running, you do all your work here. That's the stuff that I'll talk about later, the puts and the gets. And then when you're ready to close things down,
Starting point is 00:16:54 they're just kind of the opposite of these operations. There's a close that shuts down the engine, and then there's the C++ delete operation that gets rid of the volatile data. So nothing too surprising here. Like I said, it's meant to be simple, but I kind of wanted to walk you through the life cycle. So let's talk about the different engines that we have so far. Here's a table of them all. I put the bottom one in red because we're currently validating it right now.
Starting point is 00:17:31 It's not production yet, but it's coming pretty soon, probably within the next quarter. There's a lot of different ones here. Mostly what you care about are there are some volatile engines. These are honestly quite similar to the DRAM engines that are out there today like RocksDB. But once we got the pluggable engine architecture, we realized, oh, we can put anything there. We can put any kind of engine there, and you'll still have one unified API for it. So we wrote these mostly to test out our framework,
Starting point is 00:17:59 but they actually seem to work pretty well. Concurrent HashMap is the one that's the most exciting. It's a persistent memory engine, and it's highly multithreaded. It scales really well. It's based on... So Intel's had this library called the Threading Building Blocks Library that they've been shipping for years. And it's this library that has been written to be highly multithreaded.
Starting point is 00:18:23 It uses all these kind of lock-free, non-blocking algorithms and very complex stuff. So we went to that group, and we said, we love your concurrent HashMap. Would you work with us to make a persistent version of it? And that's exactly what they did. Actually, during that work, they ended up making requests to the PMDK team to add several new features to libpmemobj, which we did. And that's how we got a lot of the performance we got. So this is kind of the most
Starting point is 00:18:53 exciting one now. And now that same team is working on a sorted version, basically an ordered concurrent hash map. With the current one, when you traverse it, you get back the keys in any order. With the new one, you'll get them back in sorted order. So that one is coming. But you can see there's a good number of engines here, and I would love if those of you who are developers, pull requests welcome. We'd love to see more
Starting point is 00:19:16 engines, right? The more choices people have, the better. And each time we build an engine, we decide, is it persistent? Is it volatile? Is it ordered, is it unordered, is it concurrent? There are a couple single threaded engines in there and so on. But most importantly we have this set of tests that we run on all of them. Stuff that
Starting point is 00:19:35 some of it should seem familiar if you're a developer, you know like the Valgrind kind of checks. But if it's persistent we also have our own persistent memory checking tools where we make sure they're not breaking any rules. There's a question. Are you talking about concurrent cross-process boundaries
Starting point is 00:19:55 or are you talking about multiple threaded inside one process? Yeah, that's a great question. When I'm talking about concurrent, I'm talking about multi-threaded within a single process right now. We just don't have an engine that goes across multiple processes. We could add one,
Starting point is 00:20:12 but we don't currently have one. They're all multi-threaded in a single process. You appreciate that most of the stuff that's SSD-based that you're trying to compete with actually does allow it. I don't accept your premise actually. So the reason why we haven't done something across processes is nobody's asked for it.
Starting point is 00:20:28 All of the major applications that we've been working with are multi-threaded. They're not multi-process. Only one major application that I've been engaged with is multi-process, and they did their own library, so it wasn't a request on us. When I use RocksDB, I'm actually
Starting point is 00:20:45 running multiple processes that are accessing the same common RocksDB database. And we don't know in my world. I see. So it works out really well to have multiple
Starting point is 00:21:01 processors in each of those processes. I think we'd be very open to that kind of a future request. Like I say, no one's asked us yet, but especially if you only want one writer, I think we'd be much happier to do it than if you wanted multiple writers, because that's a lot more work. It's not just that it's a lot more work,
Starting point is 00:21:22 it's that, as you probably know, the kind of lock that you create for a multithreaded app is different for a multiprocess app, and it doesn't perform as well. So that's why we've been sticking with the multithreaded stuff, since that's what people have requested so far. So please, don't hesitate to just make a request, because we'd love to hear about use cases that we're not covering. Is there a read-only mode?
Starting point is 00:21:46 We don't currently have a read-only mode, but again, that would be pretty easy to add. If you put it in a read-only mode, I don't really care whether you go ahead and make it multi-process aware or not. Now I can just go ahead and say, in the file that you've created, it's a consistent memory, but I think you can still let it file. Or in the store, you've created in the persistent memory, but I think you would still let it file or store or you might call it in the persistent memory now. That's a great question. You know, if all of the
Starting point is 00:22:13 processes are in read-only mode, it probably just works, but I'd want to validate it first. But because just for all the reasons that you're thinking, that nobody's changing anything, and so there's nothing to coordinate. I think right now we explicitly forbid the second open, right? Because, like I say, we don't validate it,
Starting point is 00:22:35 and we don't have a read-only mode, but we could do that. That doesn't sound like a hard request even, so please, if it's something interesting to you, make the request. Do you have another question? No, okay. Okay, so one of the things that's not what you just asked, but is another request that we're working on right now, is to add multiple engines within the same memory pool. So we use this term pool in PMDK, and it kind of means file, right?
Starting point is 00:23:04 But actually, PMDK can spread out a pool across multiple files. This can be useful if you have, for example, multiple interleave sets on different sockets, for example. Also, a pool can be a device DAX device, which is something where there's no file system in use. So that's why we came up with the term pool. It really stands for wherever you're going to store all your data. And so if you actually want a bunch of different key value stores to be in one pool, this is a feature that somebody did request, so we're adding that right now. And that will be showing up very soon.
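As an aside, the pool-spread-across-multiple-files case he mentions is normally described to PMDK with a poolset file; a hypothetical example (sizes and paths invented) looks roughly like this:

    PMEMPOOLSET
    100G /mnt/pmem0/pool.part0
    100G /mnt/pmem1/pool.part1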
Starting point is 00:23:44 I did put on the slide some detail of what it looks like, but the code's probably a little too small to tell. But this is somebody using libpmemobj. So they're a more advanced user. They're writing their own transactions and things like that. But they're also mixing in a couple of key value stores into their pool. So we're trying to get multiple use cases to work together. All right, let's talk about the language bindings.
Starting point is 00:24:14 So right now what we have validated are these four languages, in addition to C and C++. I didn't even bother to put those there. But, yeah, so we have C and C++, Java, Node.js, Ruby, and Python. These are all fully validated. And what we found is by keeping the API pretty simple, we could add these without a lot of overhead. They're actually pretty performant.
Starting point is 00:24:38 I'm pretty happy with them. You know, it comes up a lot, especially with Java. The question comes up, should we just rewrite a complete native Java version of these APIs, because the Java community hates non-native stuff. And honestly, the performance hasn't mandated it yet. We're getting great performance out of this. We're not against doing it, but it's something that I'd say we would do as it becomes necessary.
Starting point is 00:25:07 And you can see what we've done is we just put each one of the language bindings into their own repo on GitHub. So you just pull out the stuff that you want for the language that you want to use. Nobody's requested Fortran yet. But it's only a matter of time. APL. Now that would be interesting. So let's talk about our API.
Starting point is 00:25:36 So we really took a careful look at the other key value stores. RocksDB is probably the biggest influence of our very simple API. I already kind of showed you the open and close part, but the real simple part here is that there's a put and a get. There are the obvious things about, you know, removing and checking to see if something exists. And then there are some iterations for going through everything in there. And for the ordered ones, we have, you know, range-based iterators. The kind of interesting part, and I'm going to show you here more detail about this in a second, is, again, the get.
Starting point is 00:26:09 Remember, this is access in place persistence, right? So you don't have to make a copy of something. You can if you want. And so this might be readable back there. I know it's always hard when you put code on a slide. But here's just some C++ showing two versions of the get. If you use get like this, where you pass in, you know, like ampersand, a string, then you're copying.
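Roughly, that copy-out flavor of the API looks like this (a hedged sketch, assuming a pmem::kv::db handle named kv that was opened as in the earlier sketch):

    #include <libpmemkv.hpp>
    #include <cstdio>
    #include <string>

    // Hedged sketch of the buffer-style calls; assumes `kv` is already open.
    void copy_style_example(pmem::kv::db &kv) {
        pmem::kv::status s = kv.put("key1", "value1");   // transactional insert/update

        std::string value;
        s = kv.get("key1", &value);                      // copies the value out into DRAM
        if (s == pmem::kv::status::OK)
            std::printf("copied out: %s\n", value.c_str());

        s = kv.exists("key1");                           // status::OK or status::NOT_FOUND
        s = kv.remove("key1");                           // removes the pair
        (void)s;
    }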
Starting point is 00:26:37 You're getting your own version of that string. You know, we go ahead and just give you your own version. You can do whatever you want with it. But, of course, you're not changing anything that's inside the key value store. And that's okay. But, actually, we think more interesting is when you pass it a function. Here I'm passing it a function. Actually, I'm doing this with get all.
Starting point is 00:26:56 But you can do it with get or get all. When you pass it this function here, you're getting these direct access references, right? So this is not copying data anywhere. And so think about this. In a key value store, if I give you a reference to the value, you don't want that value going away, and you don't want that value changing while I'm in the middle of doing something with it. So this makes it a really simple API.
Starting point is 00:27:20 While you're in this function, between this curly brace here and this curly brace here, you're in this little function, we've got that item locked. So you didn't have to lock it. You didn't have to unlock it. It's just during your callback, you know it won't change on you, and you're accessing it in place. And as soon as your callback exits, it's free to get removed or changed by other threads. So in this way, you've kind of ended up with this program
Starting point is 00:27:47 where it's a very simple API, and you're not grabbing or releasing any locks explicitly. The library's handling that for you. So that really makes for very simplified programming. Yeah? The get all, is that doing a match on a substring? Get all is not. It's getting, that particular
Starting point is 00:28:09 API is getting everything. It's the range-based ones that do the match, and they don't actually show up until we get that range-based, that ordered engine finished. So it's part of what we're doing with the ordered engine right now. Yeah? Yes, sir. So if someone is doing, So two threads cannot read one
Starting point is 00:28:28 value at a time, is that correct? In this case, yes. In other words they can read a value at the same time. They can't write one while someone else has this reference. It's a read-write lock. So 100 threads could read it at the same time. But if a thread wants to delete that value,
Starting point is 00:28:51 it will block until these guys release their reference. Okay, but that would be using the copy that's in the DRAM. No, no, no. I'm talking about this one, direct access. That's the whole idea, is that the reason why we use a closure here, a lambda, is that that's how we know when you're done with your reference. We give you a reference, this thing num.data.
Starting point is 00:29:12 That's a reference to the data in place. We know that somebody has a reference, so we know we can't delete the data. Then as soon as this lambda returns, we know, oh, okay, now we can drop the reference count down. So in this reference, like what was it, say, lambda data is put in the SSL, you could also do that? Could I reassign the value?
Starting point is 00:29:39 Yeah, that's a great question. Could you change the value? This string view thing is C++ for const. So we won't let you modify it. You have to use our API to do that because it's transactional. And if you just start saying star pointer equals value,
Starting point is 00:29:56 if it takes you three instructions to do it and you crash in the middle, then you didn't get a transaction. And so we're trying to give it the element of least surprise. Everything, all updates should be transactional, so you do have to go through our put value to do that, or put API to do that.
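To make the callback flavor concrete, here is a hedged sketch of the same idea with the pmemkv C++ bindings (again assuming an open kv handle; the callback conventions follow the pmemkv documentation, so treat exact signatures as approximate):

    #include <libpmemkv.hpp>
    #include <cstdio>

    // Hedged sketch of the direct-access style; assumes `kv` is already open.
    void callback_style_example(pmem::kv::db &kv) {
        // Read one value in place: the string_view refers to the data in persistent
        // memory, and the item stays read-locked only while the lambda runs.
        kv.get("key1", [](pmem::kv::string_view v) {
            std::printf("value is %.*s\n", (int)v.size(), v.data());
        });

        // Visit every pair without copying; returning 0 keeps the iteration going.
        kv.get_all([](pmem::kv::string_view k, pmem::kv::string_view v) {
            std::printf("%.*s = %.*s\n", (int)k.size(), k.data(),
                        (int)v.size(), v.data());
            return 0;
        });

        // The views are read-only: updates go back through put(), which is what
        // keeps every modification transactional.
        kv.put("key1", "new-value");
    }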
Starting point is 00:30:13 So did I finish answering your question? Good, okay. All right. So I'm going to keep going. I thought it would be good just to show you a JavaScript, for those of you who are JavaScript programmers. Yes? So in case of C, you have to pass a function pointer.
Starting point is 00:30:30 Right. And that also obeys this law of function. It does. Yeah, don't have lambdas in C, but yes. And, of course, you could just use a function pointer here and declare the function at the top. But we found this kind of programming where you use the lambdas. It's particularly cool because inside this lambda, as you probably know, you get access to the other stuff that's in scope here.
Starting point is 00:30:54 And so it's really just like programming, you know, it's like doing a transaction right in the code. I love these lambdas. We've really fallen in love with them. They're a little odd for people who haven't used them before, but I think they're really quite useful. The only thing that I haven't quite gotten my head around is that inside these little square braces
Starting point is 00:31:13 there's a bunch of different operators you can put in there, right? That ampersand and colon or something. I can't even remember what they are, but yeah, those are the things that are... C++ is a little on the complex side, I'm thinking. Just over the edge.
Starting point is 00:31:33 So here's a Node.js example. And, again, this is really just to show you that it's idiomatic. Now the configuration is a JSON string because that's what people like to use in JavaScript. And then nothing here should surprise you. You can see, look, there's a very similar kind of lambda. This is what a lambda looks like in JavaScript, right? Those arrow functions. And I think most people are already familiar with this kind of stuff,
Starting point is 00:31:57 if you've ever used JavaScript. But you can see the whole idea here is it's idiomatic JavaScript. And we're actually getting pretty decent performance out of this. So talking about performance, we've really spent a lot of time trying to think this through and get the performance to be as good as we can. Probably the biggest thing that
Starting point is 00:32:22 we've done is to try and limit the number of round trips that you make between higher level languages and our implementation. It's those transitions like in Java, the JNI transitions are what kill you, right? So everything we could do to try and eliminate that, that means in some cases caching things or doing some kind of aggressive stuff inside of the language bindings to help eliminate some of those round trips. We've noticed there are some kind of annoyances, like translating bytes to UTF-8 because the language requires it, and so on. Those are things you have to kind of learn to live with.
Starting point is 00:32:59 But inside of our core, our pmemkv core, we do a lot of movement and creating data structures in DRAM, again, for performance. This is stuff that's completely transparent to the caller, but it actually gave us quite a bit of performance. We're, of course, very aware of persistent memory. We know a lot about it since that's what we work on. So, for example, if you were in the previous talk where you saw Mike say that accessing things at 256-byte granularity is four times the bandwidth of accessing random 64-byte cache lines, we, of course, detect that and leverage it and go through a lot of effort to try and get that higher bandwidth. Not to mention the difference between reading and writing. We're very aware of that, too.
Starting point is 00:33:54 Our CMAP, our concurrent hash map performance, this is the one that we've been working on the hardest lately. Actually, it just got released into production quality just less than three months ago, so we just sort of hit that production quality landmark, and now we're working on the ordered version. We have found that
Starting point is 00:34:16 this thing really scales great with the number of threads, and I have to say, the benefit is all in the Threading Building Blocks library. Those guys really know what they're doing. They're a big HPC library. They're used to scaling up to hundreds of cores. Our P99
Starting point is 00:34:34 latency, the outliers, it's also quite good. Really impressive. Right before I came here, a couple days before I came here, one of the authors of a popular open source key value store that is volatile got access to one of our machines, and we gave him libpmemkv, and he's using it inside of his key value store to make it persistent.
Starting point is 00:35:01 And so we're going to write a blog or something together. I didn't quite have permission to use his name yet so I just put the text here from the email because I thought it was so great. He said the performance numbers are incredible. Using our persistence engine with full durability, it's running at 80% the speed of DRAM.
Starting point is 00:35:17 So this is a guy who has tuned a key value store for DRAM for years. And it's been volatile. Anytime you set a value, you lose it on a power fail. Now he has his same key value store, and anytime you set a value, it's persistent. Not eventually consistent, it is synchronously persistent. And he got 80% of DRAM.
Starting point is 00:35:40 Try and do that with storage, right? I actually think this is just a huge thing here. We've tried to take some of the other key value stores like RocksDB and make them immediately persistent. RocksDB can do it. It's a config line option. Sync equals one, I think, is the option. No way is it 80% of the DRAM speed.
Starting point is 00:35:59 It really drops. It's more like 2% of the DRAM speed. I just thought this was such a great thing. That's when I said to the guys, let's write a blog together, and we're going to do that. That's kind of the summary of the performance for the CMAP engine.
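For readers wondering which RocksDB knob he is reaching for there, it is most likely the sync setting on writes; a hedged sketch, not code from the talk:

    #include <rocksdb/db.h>

    // Hedged sketch: make each RocksDB write synchronously durable ("sync = 1"),
    // which is the comparison being made above.
    void synchronous_put(rocksdb::DB *db) {
        rocksdb::WriteOptions opts;
        opts.sync = true;                 // fsync the write-ahead log on every write
        db->Put(opts, "key1", "value1");
    }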
Starting point is 00:36:16 We expect pretty good things out of the ordered concurrent map that we have coming up, which is based on a TBB implementation of skip lists. So we're pretty excited about that. That's coming up. But again, the whole point here is that if you decide that you
Starting point is 00:36:32 want to modify your program to leverage what persistent memory can do, you don't want to use these legacy APIs, these storage APIs, you don't have to go all the way to the getting into the transactions and designing your own data structures if you don't want to. Key value store here actually solves, from what I can tell, a lot of the use cases.
Starting point is 00:36:53 That's all people really wanted was a key value store. So now there's one written for you. Any questions? Yes, sir. I'm curious to know what's the size of the map and what's the size of the RAM and what's the size of the interface that you've got on your computer? Yeah, so, you know, we, of course, regularly test, you know, the six terabyte kind of sizes, the really big sizes. To translate your question a little bit, if I had a six terabyte key value store, how much DRAM should I have in the
Starting point is 00:37:26 system? Actually, for this concurrent hash map now, the DRAM usage is quite small compared to the, you know, we really do kind of put most things in persistent memory. I don't have the exact numbers of what we require. Actually, that's kind of such a good question that we should put that into the documentation for each engine. Like, this is how much DRAM you require. But it's, you know, I'd say per terabyte, it's on the order of a few gigs at the most, just for the current engine.
Starting point is 00:37:58 But I'm going to take that as an action for me. That really needs to be on the man page. I read something that I might have misunderstood in that white paper that Intel just recently released about Baidu. Yes. It looked like Baidu used no DRAM whatsoever. I was going to ask you about that, but since you asked the question, I said, maybe I'd ask you now. Well, I will say this. We don't currently support a configuration with no DRAM.
Starting point is 00:38:29 Is it possible for someone to put together a system that we don't support, writing their own BIOS, and validate it? Sure. It is possible. But, you know, I have no knowledge of such operations, and if I did, surely I would have to kill you after telling you. I mean, it's on them, right? It sounds like it would be a really hard to move this system.
Starting point is 00:38:54 You know, quite honestly, it would work just fine. I mean, again, if you rewrote the BIOS so that you were reporting it as main memory, right now the BIOS reports that in the NFIT table, and the OS says, oh, my main memory comes through a different table. And so the OS, you know, today, if you just didn't have any DRAM in there, the OS would say, I can't boot. So you really would have to rewrite the BIOS to do it or modify the BIOS to do it. So you'd have to be someone with those kinds of resources.
Starting point is 00:39:29 And then, you know, you'd be fetching every instruction and every data structure would be in persistent memory. It would work just fine, but it would be working at the, you know, 8 gigabytes a second per DIMM instead of 17 gigabytes per second per DIMM. But it would work. Thank you. So, you know, it's certainly possible. I didn't know that any information about that particular case had been
Starting point is 00:39:48 released. I'm going to go look it up and see what they said publicly, so I know what I can say publicly. That's my trick: rather than go and ask anybody, I google it, and whatever I see with Google, I figure it's okay for me to say. So, yes. I think I might
Starting point is 00:40:04 be seeing the same kind of architectural split you see in SNIA with the KV API and then computational storage. Yeah. Can you imagine what you would do with the basic API and then the engines? Yeah, I think that's a fair
Starting point is 00:40:18 point. I think what we are trying to, yes, we're very much trying to split off things that sort of stay the same between all the different engines and all the different languages, the APIs, and then give people a place where it's easy for them to innovate, like with different engines. So you could easily imagine, just to really drive your point home, I could easily imagine the FPGA accelerated engine that drops in and works here.
Starting point is 00:40:40 In fact, now that I can imagine it, I kind of want to go do it. It actually sounds kind of fun. So, yeah, I think I'm in agreement, I kind of want to go do it. It actually sounds kind of fun. So yeah, I think I'm in agreement that we're really trying to make that architectural line so that these two things can develop asynchronously. Absolutely.
Starting point is 00:40:54 One other follow-up on that. So can your engine, for example, if it wants to do a range query, does it actually do that without pulling all the keys? That is a good question. Yeah.
Starting point is 00:41:18 I think you just have to find the beginning of your range, and then you do the skip list, you know, pointer chasing until you get to the end of it. But I don't know for sure because the skip list is the one that's currently under development. And since I spend most of my time doing PowerPoint now instead of programming, I don't know the details, which is really kind of sad. So yeah, it's high-level programming.
Starting point is 00:41:41 My IDE is PowerPoint. Did you have a question? So in the data structures stored on the system, I'm guessing there are no direct pointers. Oh, this is a great question, yeah. And is there a performance impact from that? Yeah, this is an excellent question. So in the engines that are based on PMDK, we use relative pointers instead of absolute pointers. And we do that, just in case everybody doesn't know, every time you memory map something,
Starting point is 00:42:17 a file, Linux and Windows both, they go out of their way not to give you the same address you had last time, right? They actually randomize it intentionally for security because it makes the guys in the black hats have a harder time trying to find your data structures and smash them. It's a security feature. It's called address space layout randomization, I think. There are two things we could do about that.
Starting point is 00:42:39 We could ask the system to put the mapping in the same place. You are allowed to do that. Or we could just say all of our pointers are going to be relative. So early on, five years ago now in this work, I took a version of our library and I compiled it for absolute pointers. I went in and I modified all the code, and I ran one of these benchmarks, like our B-tree benchmark. And there is a performance overhead. Every time we dereference one of those pointers, we our B-tree benchmark. And there is a performance overhead.
Starting point is 00:43:06 Every time we dereference one of those pointers, we have to add a base pointer to it. But Intel hardware adds two things really fast, right? So I found that the performance overhead was in the lower single digits, like 1%, 2%, that kind of thing, depending on your benchmark. So at that point, I just kind of decided, oh, good, I'm not going to worry about it anymore because the relative pointer solves so many other problems. So, yeah, go ahead. Do you have a follow-up? Yeah, yeah, yeah.
Starting point is 00:43:31 So in that case, for the data structures, also the data structures that are stored in DRAM, if you want to use the same relative pointers, then you could support it by putting it in shared memory across the processes. Yes. This is one of the concerns
Starting point is 00:43:50 about the multiprocess thing, right? That's right. You either have to have each one of them have their own DRAM thing that uses its own runtime pointers, which is probably what we do because it's easier, or we could share the DRAM between multiple processes
Starting point is 00:44:06 and really have all the address spaces laid out the same. Making all those processes have the same address space layout is possible. It's just harder. Because, again, you have to ask the system, okay, I got address 100 here. I was making another point, which is that by using relative pointers in those DRAM data structures as well, you could just support it.
Starting point is 00:44:27 Oh, I see. Sure, we could do that. Yeah, yeah, yeah. That's not a bad idea, actually. In fact, that's really not a bad idea. We could take, you know, we spent a lot of time tuning that stuff. So we could actually take that implementation of our relative pointers and just abstract that out into something. That's not a bad idea at all.
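To illustrate the relative-pointer idea from this exchange (a simplified stand-in, not PMDK's actual implementation): the persistent structure stores an offset from the start of the pool, and dereferencing adds whatever base address the pool happened to get mapped at this run.

    #include <cstdint>

    // Simplified illustration only; PMDK's real persistent pointers carry more
    // information, but the mechanism is the same.
    template <typename T>
    struct relative_ptr {
        uint64_t offset;    // stored persistently: offset from the pool base

        // The pool may be mapped at a different address every run; only the
        // volatile base changes, the stored offset stays valid.
        T *get(void *pool_base) const {
            return reinterpret_cast<T *>(static_cast<char *>(pool_base) + offset);
        }
    };

    // Each dereference costs one extra add, which is the low-single-digit
    // overhead mentioned above.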
Starting point is 00:44:43 I'll bring that one back to the team. Yes? Sorry, just to follow on to his question. Is it not possible for two processes to say, I want to map a particular memory to the same exact address? It's possible for them to request it. Now, what happens when the system does it is tricky. And so it's possible but tricky.
Starting point is 00:45:10 So, you know, shared libraries, most people don't realize that every application they're running calls mmap a zillion times. It calls mmap a zillion times for shared libraries mostly, but also for some files. When you ask the system to map something at a specific address, unless you're really careful, it'll just say, oh, yeah, sure, you can have it, and, by the way, just to help you out, I unmapped whatever was there.
Starting point is 00:45:35 In other words, so if you get a conflict, it doesn't treat it as an error. It just unmaps the previous thing. So if you had some library that was in use that's sitting at that location and you map it there, now you've just introduced a horrible bug in your system. So people write this code that says, okay, well, I can go and check and I can see what's in there, and it's
Starting point is 00:45:53 racy code, right? Because they don't realize that they're calling library functions that are also performing additional M-maps and things like that. So again, I'll say it is possible, but it's tricky. And so a lot of people don't get it right. Yeah. So going back to the issue
Starting point is 00:46:11 of conventional absolute address pointers versus relative address pointers. Yeah, yeah, yeah. It would be nice if PMDK offered the option of using absolute pointers because there's a lot of legacy software out there that expects it. Yeah.
Starting point is 00:46:25 And if some customer has a lot of software that would work with persistent memory, if only there was a malloc and free for persistent memory that operated on conventional pointers. So we do have one. Okay. We do have one we do have one but you know the malloc and free interfaces are not designed for persistence
Starting point is 00:46:52 We do have one, but, you know, the malloc and free interfaces are not designed for persistence,
Starting point is 00:47:13 so the interfaces that we have that provide those same interfaces are volatile. Perhaps that was the wrong word. What I meant was that if you had support for using conventional pointers, that gives you compatibility with a lot of software that might be painful to use.
Starting point is 00:47:32 Yeah, maybe. But what we found is that usually for that other software, it's fine just to have an absolute pointer at runtime. It doesn't have to be persistently absolute. Of course, we convert it to an absolute pointer when you use it. If you pass one of these pointers to sprintf, the right thing happens. The sprintf gets an absolute pointer and it uses it. You didn't have to rewrite sprintf. You might be right, but so far, it's worked out pretty well to have the relative pointers in persistence but then convert them to absolute as needed. You may be right.
Starting point is 00:48:13 All right, Danny, yes, sir. I was just wondering, like, you were comparing that to a product like RocksDB. Yeah. But that's what we're saying. But do you have a possibility for that, or would you use that on an SSD? Or maybe a hybrid of those? Yeah.
Starting point is 00:48:13 You know, there's no reason why you couldn't write an engine, even a RocksDB engine, right, that just plugs into our API, and then you'd have what you're saying. You'd have the same simple API, but it would work on an SSD. We haven't worked on developing anything for an SSD because there are so many of them already. We really wanted to focus on the persistent memory. But there's no reason why we couldn't do that.
Starting point is 00:48:33 Just to get uniformity of APIs. Our performance is better for immediate consistency. So RocksDB and, well, frankly, most of these key value stores, LevelDB, Cassandra, just about everyone I can think of, use this LSM algorithm, the log-structured merge tree, which I think is a brilliant algorithm. Actually, I love it. I think it's brilliant. But it's about eventual consistency.
Starting point is 00:49:04 It's about eventual consistency. Let's do everything in DRAM because I need the performance. And then eventually we stage things out to SSDs. With persistent memory, I don't have to do that. So I have an advantage with persistent memory. So it's a little bit of an apples and oranges thing, right? Because I'm really comparing the kind of persistency I get in persistent memory with something you'd never do in RocksDB because it's just not designed for that. All right. I am out of time. Thank you very much. Hope you enjoyed it.
Starting point is 00:49:35 Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe Thank you.
