Storage Developer Conference - #66: Remote Persistent Memory - With Nothing But Net

Episode Date: March 7, 2018

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast, Episode 66. Hi, I'm Tom Talpey. I'm an architect, a storage architect, with Microsoft. I've been here many times before and I've talked about similar topics. Some of these topics you'll probably say, hey, wait a minute, I've seen him talk about that before.
Starting point is 00:00:59 And that's kind of the point. I come here to talk about these things and sort of set ideas in motion. I'm not here to announce a product. I'm not here to announce what is necessarily the right way to go. But I have some insight into remote persistent memory, into RDMA, into upper layer storage protocols. And I hope to convince others that this idea has merit. I call my presentation Remote Persistent Memory with Nothing But Net. And that's supposed to be a basketball analogy, but it's also supposed to be, the idea is, an RDMA NIC as a persistent memory storage adapter, right? An HCA, basically. These NICs are already performing this type of
Starting point is 00:01:57 function on a proprietary basis, if you will. Some manufacturers have some storage functionality like Fibre Channel or NVMe over Fabrics or things like that. But I believe we can do this with raw RDMA to a persistent memory back-end store. And I'm going to map out five steps that I believe get us there. They are things that we need at the RDMA layer, which basically function as offloads for upper layers and allow the upper layer to manage the connection while RDMA NICs handle the data transfer. And I mean this in a high fidelity way, that we will have all the major attributes that we use upper layer storage protocols to provide. But we will provide them, if you will, as an offload within the RDMA NIC for the data transfer portion. I'm going to talk a little bit about each one of these and what it means. I'm going to point out that some of these steps involve protocol changes to the RDMA layer,
Starting point is 00:03:07 that first one in particular, and to a certain extent the other ones. But there are some, in particular privacy and QoS, which do not require protocol changes. And that might be a little surprising, and that's the sort of spark that I hope to light up here. So at past storage developer conferences, I have given presentations on the first three topics. Remote flush and
Starting point is 00:03:36 SMB3 push mode kind of go hand in hand. But push mode is not limited to SMB3. My talk last year actually, or maybe it was two years ago, explored NFS, iSER, and SMB3, all of them with the same sort of transfer model. And I'm proud to say that these things are all actually working today in prototype form. David and Matthew talked about push mode the other day. The pNFS folks have been working on push mode for NFS. And, you know, it's reality. The remote flush is an RDMA extension, which is moving through the IBTA right now. And there's a very active discussion, and I'm very hopeful that we'll have prototypes of this. We have to get the protocol right before we start calling it done, right? You can't just ship this stuff
Starting point is 00:04:28 willy-nilly. You've got to really think it through. But I believe that those first two items are well underway. And the third one, Storage QoS, is a talk that I gave back in 2014 here. There's a URL for it at the back of the deck. Actually,
Starting point is 00:04:44 I should mention the deck. I don't know if it's online yet. I delivered it fairly late, shame, and it may not have made it to the website, but it should be there very soon. But anyway, storage QoS is very important, and we'll get to that toward the end. We've done a lot of further development, right? As I say, these things are prototyped, and I've done a lot of thinking. And so my goal is just to sort of bring these former concepts together with some new things that I'll talk about today. So let's start with RDMA flush. This was motivated a few years ago by persistent memory and the desire to use RDMA to move data to it. Well, from it as well, but primarily to it. Writes were the most interesting one. It's persistent memory. You
Starting point is 00:05:31 want to store things in it persistently, right? You get it back out later, but storing it is the hard part. So in order to support RDMA flush, as it's become known (I originally called it RDMA commit), we need a remote guarantee of durability on RDMA fabrics, in support of the SNIA NVM Programming TWG's optimized flush interface. It's basically a way for the application to programmatically request this durability. We've talked about that. I had a presentation on that on Wednesday. RDMA write alone is not sufficient for the semantic. You can't just slap in an RDMA NIC and do a write to it and be guaranteed that durability. You need an additional flush operation to push data to the persistent memory, possibly
Starting point is 00:06:16 to even program it into the persistent memory. The completion at the sender of an RDMA write does not mean the data was placed. This is just review. For those of you who understand RDMA, it's probably very familiar. It means only that it was sent on the wire. It doesn't necessarily mean that it was received, although some NICs give you that guarantee. It definitely does not mean that the data actually reached its final destination. It just means that the NIC is doing its best to get it to its final destination.
Starting point is 00:06:45 That might be a sender guarantee. That might be a receiver guarantee. It might be a platform guarantee. But today's platforms actually work against that. The Intel DDIO drops data into the CPU's last level cache on purpose. It does not send it to the memory controller. And so that flush is a critical step. Processing at the receiver,
Starting point is 00:07:09 like how does the receiver know that he's received data? How do you know that fresh data has arrived? There's no signal at the receiver to the peer when a write occurs. You can send an interrupt, but that interrupt only guarantees that it was delivered to the consistency domain. And so that's called placement, and it's not the same as durability.
Starting point is 00:07:35 So we can't fix this problem at either end using the RDMA protocol or the RDMA paradigm as it sits today. And the conclusion from a couple years ago was that an extension is required. The concept is that it's a new wire operation and a new verb. The verb is the RDMA API, quote unquote. It's implementable in any fabric, any existing RDMA fabric. It takes a different approach in each one, but semantically it has the same property. The initiating NIC provides a region, a destination region, and other commit parameters. It's under control of the local API at the client slash initiator, right? The application
Starting point is 00:08:17 requests data be flushed to this region. That's the way the SNIA NVM Programming TWG's API works. The receiving NIC will then queue that flush operation to proceed in order. It's important that it go after the writes, right? You want to make sure the data is there before you request a flush. And so that ordering property is extremely important. And RDMA has such a facility, it's called a non-posted operation, and the examples of that are RDMA Read and RDMA Atomic. They give you certain guarantees with respect to previously written data. They guarantee that they will act on the most recent copy of the data, that the
Starting point is 00:08:55 data can't be lost or crossed or passed by the execution of this following operation. And that's a receiver guarantee. There's a similar sender guarantee which has nothing to do with this. So you have to sort of unscramble the polarity of the operation to get it right. That queuing is also subject to flow control and ordering, which is really important, because these flushes have to be ordered and they take time. They're blocking operations, if you will. You have to wait for data to flow. So the RNIC pushes pending writes to the targeted regions. It basically does that flush on the local PCI bus.
Starting point is 00:09:34 The NIC may simply say, instead of tracking the targeted regions, I'm not tracking regions, I'm just going to flush everything. That's perfectly fine. Your guarantee is that the regions I requested are now durable. If you flushed everything, great. The RNIC performs any necessary persistent memory commit. That's a platform dependent operation. It possibly interrupts the CPU because it may not have sort of a hardware mode to do this. But in the future, it's highly
Starting point is 00:10:02 desirable to perform via the bus that the NIC is plugged into. The RNIC responds only when durability is assured. So it's a blocking operation from the client's perspective. He sends the request, some magic happens, he gets a response. That's not the way RDMA Write works, for instance. So this has been discussed in the InfiniBand Trade Association Link Working Group. I also published an Internet-Draft a ways back that talks about this conceptually. That discussion is kind of idle right now. I think the IBTA is pretty much leading that discussion, which is fine. There's broad agreement pretty much everywhere about the flush approach.
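[Editor's note: a minimal sketch, in libibverbs-style C, of the write-then-flush sequence described above. The RDMA Write, the work request structures, and ibv_post_send are the real verbs API; the flush opcode below is a placeholder for the proposed extension and does not exist in the standard verbs today.]

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Hypothetical opcode standing in for the proposed RDMA Flush. */
#define IBV_WR_RDMA_FLUSH_PROPOSED 0x7f

static int write_then_flush(struct ibv_qp *qp,
                            struct ibv_mr *local_mr, void *local_buf,
                            uint64_t remote_addr, uint32_t rkey, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };

    /* 1. Ordinary RDMA Write: its completion means "handed to the wire",
     *    not "placed" at the peer, and certainly not "durable". */
    struct ibv_send_wr write_wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = 0,                 /* unsignaled; the flush fences it */
        .wr.rdma    = { .remote_addr = remote_addr, .rkey = rkey },
    };

    /* 2. Proposed flush: a non-posted, ordered operation naming the target
     *    region; the responder replies only when durability is assured. */
    struct ibv_send_wr flush_wr = {
        .wr_id      = 2,
        .opcode     = (enum ibv_wr_opcode)IBV_WR_RDMA_FLUSH_PROPOSED,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma    = { .remote_addr = remote_addr, .rkey = rkey },
    };
    write_wr.next = &flush_wr;           /* queue behind the write, in order */

    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(qp, &write_wr, &bad);
}
```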
Starting point is 00:10:44 I mean, it's just a natural approach. It matches the user API, it matches the behavior that people imagine the platforms will have. So it just exposes the requirement that we've already agreed upon. The discussion continues on a number of semantic details, though. That question of per region, per segment (you know, scatter-gather list), or per connection is a key one. How implementable is this in the NIC? And the problem is that that decision, that question mark, has implications on the API. If the API says, sync these five bytes, but the NIC says, I can't sync five bytes.
Starting point is 00:11:22 I'm going to sync everything you've asked. Well, you know, you just ignored an argument, right? You gave the guarantee of the five bytes, but we don't have good harmony, if you will, between the API and the wire operations. So that's open for discussion. It's important to converge those discussions before we finalize this design.
Starting point is 00:11:41 And there's ongoing discussion on this thing called the non-posted write that I'll talk about. But it provides a write-after-flush semantic. So I can write some data, flush it, and then write some more data. And that more data only appears when the first blocks are durable. That's really, really important. Log writers need that. And there's some other RDMA semantics that come into play in the PCI protocol. PCI has no flush operation, right? PCI writes are what's called posted, or fire and forget.
Starting point is 00:12:18 You ship the write and off it goes. You know, maybe it lands soon, maybe it lands later. It's guaranteed to land and it behaves with certain properties, certain ordering properties. But there's no acknowledgement to a write. There's no flush operation to push the write. On some platforms, you can do a read after write, but that only guarantees that it makes it to a visibility space, not to a durability space. And so there's a good bit of discussion that is required for the platform, either in the PCI bus or possibly with additional
Starting point is 00:12:50 sort of hints on the side. There are some PCI TLP hints that might be used. There might be some platform specific signals that can be wiggled. But the idea would be to accelerate the adoption with platform-specific support. It'll take a long time to change the PCI protocol, right? But we could do platform-specific support
Starting point is 00:13:13 more readily, more rapidly. And so it's sort of a mix of these sort of platform-specific questions that'll be looked at over time. And there are a few other RDMA semantics. The idea is that we have these blocking operations, and blocking operations introduce pipeline bubbles, and pipeline bubbles are really bad for throughput, right? They're also bad for applications, which have to block and wait for a completion and get
Starting point is 00:13:43 a signal or an interrupt, or, you know, some layer-to-layer context switching has to occur there. Those are really bad for performance. So we would like to have the ability in the protocol, at the lowest layer, to be able to stream these things, right? To do it without a bubble, to have the protocol handle the ordering, and to break the chain if that ordering can't be guaranteed, if some error occurs. So, you know, the atomically placed write after flush is really good for a log pointer update. I think I have a little example of that. Immediate data to signal the upper layer, right? I just dropped some data.
Starting point is 00:14:19 The CPU had no idea it happened because I did it all on the hardware layer, right? How do I tell the upper layer? But we don't want to tell the upper layer until the data was made durable. So we need to order some signaling with the flushing as well. So immediate data, or an ordered flush following send. They might be implemented as ordered operations on the wire. They might be sort of options to existing operations. It's kind of an open question. Additional processing, an integrity check,
Starting point is 00:14:50 and that's one of the things I'm going to talk about. The semantics in each case will be driven by the workload, by the application's requirements. We can't just sort of make these things up. We have to look at actual APIs or application requirements, figure out what they are, and try to marry them, if you will, try to bring them together. That's really, really important for adoption.
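[Editor's note: for reference, here is roughly what the existing signaling path looks like today using RDMA Write with Immediate, which is a real verb; the comments mark where the proposed flush ordering would have to slot in. This is only a sketch, not a statement of how the extension will look.]

```c
#include <infiniband/verbs.h>
#include <arpa/inet.h>

static int write_with_signal(struct ibv_qp *qp, struct ibv_sge *sge,
                             uint64_t raddr, uint32_t rkey, uint32_t imm)
{
    struct ibv_send_wr wr = {
        .sg_list    = sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE_WITH_IMM,  /* existing verb          */
        .send_flags = IBV_SEND_SIGNALED,
        .imm_data   = htonl(imm),     /* consumes a receive WR at the peer   */
        .wr.rdma    = { .remote_addr = raddr, .rkey = rkey },
    };
    struct ibv_send_wr *bad;

    /* Note: the peer's receive completion only proves placement. To tell
     * the upper layer "this data is durable", the immediate (or a send)
     * would have to be ordered behind an RDMA flush; that ordering is the
     * proposed extension and cannot be expressed with today's verbs. */
    return ibv_post_send(qp, &wr, &bad);
}
```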
Starting point is 00:15:10 You can't just come out with a brand new thing and expect everybody to use it. You have to have it meet their requirements and you have to have a story for adoption that allows this thing to happen in phases over time, in my opinion. So here's a log writer example. Forever, the way log writers work, you're scribbling to a drive, and you make a log record to sort of record what you've scribbled.
Starting point is 00:15:38 And you're going to drop these log records in, and every once in a while, you're going to commit the log. You're going to say, all right, we're going to say this is a clean point in the file system. And we're going to finish writing a bunch of little log records, and then we're going to put a pointer. We're going to update a pointer. We're going to have a big circular buffer of these
Starting point is 00:15:56 things, and it's constantly going to go write, write, write, commit, pointer, commit. Write, write, write, commit, pointer, commit. And the validity of the pointer indicates the validity of that whole pool of log records. It sort of advances the state of the world by dropping the pointer. And so the problem is right here. You can't write the pointer until the record is safe. And so that
Starting point is 00:16:23 little comma right there is a pipeline bubble. And that's bad. You want log writers to be absolutely latency insensitive. Latency will kill the throughput because you can't proceed until the commit is done. And so the goal is to avoid those bubbles. The possible solution, which is what we've been talking about in the IBTA, is this thing called the non-posted write. And basically, this guy would be the non-posted write, the log pointer write.
Starting point is 00:17:00 And that non-posted behavior would make him wait for the commit. If the commit failed, that second write would never occur because it's ordered. If the commit succeeded, that second write would occur only after the commit. And so that gives us exactly the semantic that a log writer needs. And so databases would use this, file systems would use this, a whole lot of very important latency-sensitive applications would immediately be able to use this. So it's really, really desirable. There may be specific size and alignment restrictions. You want that second write to be atomic. You don't want it to be fragmented and
Starting point is 00:17:34 appear in pieces, right? That would be really bad. So there may be size and alignment restrictions. There may be other types of preconditions to make sure this thing is implementable. But that's the idea. And it's being discussed, and everybody likes the idea. There are a couple of open questions. There might be some other ones we want, and we might be able to merge it with another operation, which I'll show right here. Here we show three log records being put, put, put. They are being written.
Starting point is 00:18:07 Here's a log transaction, here's a log transaction, here's a log transaction. And they are written pretty much asynchronously. And they appear in host persistent memory over time. At some point, the log writer will say, I want to flush it. And so it will launch a flush, which turns into an RDMA flush message on the wire and goes to this flush subsystem. The flush subsystem will wait for the previous writes before performing a flush and will acknowledge flush complete only after all the previous writes were placed. So far, so good. Now, how do we write the pointer?
Starting point is 00:18:47 Well, the pointer is another put, possibly with a commit. It doesn't really matter, because it may reach durability whether or not you've committed it explicitly, right? These platforms are constantly sort of moving data that way. But if it does a write after flush, then this little stop sign will cause it to wait for the flush, right, before it performs the write. And so the idea is that you have put, put, put, flush, put, right? And this synchronization is the important one. This is what the non-posted behavior is, okay? And then finally, one open question is, can we merge the two? Can the flush and the put be done together? Could I add that little extra payload to the flush? And that's just something that we've been exploring and thinking about. It saves a wire operation.
Starting point is 00:19:40 We don't have a pipeline bubble, so that's good. I don't really care if it's two operations or one in the end, but it might be nice to consider having one. Something to think about. And I have another merge in mind that might further motivate this. All right, so graphically, that's how those steps work. I guess that thing disappears with another animation step.
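[Editor's note: a sketch of that log-writer sequence in libibverbs-style C, marking where today's pipeline bubble sits and what the proposed non-posted write would eliminate. rdma_flush_proposed() is a hypothetical helper standing in for the flush extension, assumed to post a signaled work request; everything else is the standard verbs API.]

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Hypothetical helper standing in for the proposed RDMA Flush. */
int rdma_flush_proposed(struct ibv_qp *qp, uint64_t raddr, uint32_t rkey);

static int commit_log(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_sge *recs, int nrecs, uint64_t rec_raddr,
                      struct ibv_sge *ptr_sge, uint64_t ptr_raddr,
                      uint32_t rkey)
{
    struct ibv_send_wr *bad;
    uint64_t base = rec_raddr;

    /* 1. Stream the log records: plain RDMA Writes, no waiting. */
    for (int i = 0; i < nrecs; i++) {
        struct ibv_send_wr wr = {
            .sg_list = &recs[i], .num_sge = 1,
            .opcode  = IBV_WR_RDMA_WRITE,
            .wr.rdma = { .remote_addr = rec_raddr, .rkey = rkey },
        };
        if (ibv_post_send(qp, &wr, &bad))
            return -1;
        rec_raddr += recs[i].length;
    }

    /* 2. Flush the record region (the proposed extension). */
    if (rdma_flush_proposed(qp, base, rkey))
        return -1;

    /* 3. Today: block for the flush completion before touching the log
     *    pointer. This poll is the pipeline bubble. With the proposed
     *    non-posted write, steps 3 and 4 collapse: the pointer write is
     *    simply queued behind the flush and the RNIC enforces the order. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    if (wc.status != IBV_WC_SUCCESS)
        return -1;

    /* 4. Write the log pointer, which publishes the records. */
    struct ibv_send_wr ptr_wr = {
        .sg_list = ptr_sge, .num_sge = 1,
        .opcode  = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma = { .remote_addr = ptr_raddr, .rkey = rkey },
    };
    return ibv_post_send(qp, &ptr_wr, &bad);
}
```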
Starting point is 00:20:13 A second example, something I talked about, I don't know, a year or two ago: SMB3 push mode. This is something we've prototyped. David and Matthew showed some numbers and some working prototypes. The thing that we haven't been able to prototype is this orange, this red bar right here that you can't really read, but it's the RDMA commit operation. We don't have that yet. So we do this operation a little differently in our prototype. But the basic steps are to open a DAX-enabled file. You do this with any file system or any file protocol on a DAX-enabled file system.
Starting point is 00:20:43 So Linux could do this, Windows could do this. It's a very sort of generic operation. But DAX is the direct access-enabled file system. There are a number of ones. There's a Windows flavor and a Linux flavor. And the idea is that that file will reside in persistent memory and can be directly addressed locally by a local application. It would be accessed directly by the server or by the RDMA NIC in the remote application. And so you'd obtain
Starting point is 00:21:11 a lease to make sure that the pages don't move. You'd request a push mode registration and while true forever you'd push or pull data, right? You'd either RDMA write or RDMA read the actual pages of the file. And from time to time, you could commit data to durability. So your writes could be made safe from the client without ever invoking the server. At the end, you'd release the registration, or maybe the server would recall it if it had to. You'd drop the lease and close the handle. And so basically, these guys up here, the black lines, are the upper layer protocol. And these lines in here, the gray lines, are the RDMA protocol. And you notice that all the data transfer happens below the hood of the upper layer,
Starting point is 00:21:57 right? So this is where I'm going with this, right? How much more can we do in that box? This is just a slightly more detailed view. This is a bit of a recycled slide. You may have seen it before, but I'm just going to mention that there are basically three ways to drop the data. There's pure upper layer via IO requests to a PMEM-enabled file system. There is a mix of upper layer and load store that moves data via the memory bus on the server, but via upper layer messages to and from it.
Starting point is 00:22:33 This can be RDMA-enabled, but it's not RDMA direct to the buffer cache. It's RDMA to the server's memory, which moves to the buffer cache. It's server-initiated RDMA. And then up here at the top is client initiated RDMA, in which an RDMA read-write goes to the buffer cache directly to PMEM. And this is kind of, you know, phase three, the ideal case that we're looking for, right? So traditional IO,
Starting point is 00:23:01 DAX load/store by the server, DAX load/store by the client. This is where we want to get. There's a problem up here though. There's some stuff missing, and that's where I'm going. So far so good. Only two protocol extensions are needed to do everything I just talked about. Maybe one, if we merge the flush with the write after flush. That's pretty cool.
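[Editor's note: a hypothetical client-side sketch, in C pseudocode, of the push-mode flow described a moment ago. None of the smb3_* or rdma_* helpers below are real APIs; they only name the steps from the slide: open a DAX file, take a lease, register for push mode, then move and commit data over raw RDMA with no server round trips, and tear down through the upper layer.]

```c
struct push_handle;

/* Assumed upper-layer (SMB3) operations -- hypothetical names. */
struct push_handle *smb3_open_dax(const char *path);
int  smb3_request_lease(struct push_handle *h);
int  smb3_register_push(struct push_handle *h);   /* returns remote addr/rkey */
void smb3_release_push(struct push_handle *h);
void smb3_release_lease(struct push_handle *h);
void smb3_close(struct push_handle *h);
/* Assumed RDMA-layer data operations, handled entirely by the NIC. */
int  rdma_push(struct push_handle *h);            /* RDMA Write               */
int  rdma_pull(struct push_handle *h);            /* RDMA Read                */
int  rdma_commit(struct push_handle *h);          /* proposed RDMA Flush      */
int  work_remains(void);

int push_mode_session(const char *path)
{
    struct push_handle *h = smb3_open_dax(path);  /* upper-layer open         */
    if (!h)
        return -1;
    if (smb3_request_lease(h))                    /* pin the file's pages     */
        goto out_close;
    if (smb3_register_push(h))                    /* grant the push handle    */
        goto out_lease;

    while (work_remains()) {                      /* the offloaded hot path   */
        rdma_push(h);
        rdma_pull(h);
        rdma_commit(h);                           /* durable, no server CPU   */
    }

    smb3_release_push(h);                         /* back to the upper layer  */
out_lease:
    smb3_release_lease(h);
out_close:
    smb3_close(h);
    return 0;
}
```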
Starting point is 00:23:31 But we still need a CPU. We still need an upper layer, running on that CPU, for all that other storage processing. You know, let's not forget about the things that storage servers need to do. They need to guarantee the data was good, that it wasn't tampered with, that it wasn't damaged: integrity. That it's private. We're spraying this stuff on networks. People don't like seeing their data
Starting point is 00:23:54 go in the clear over networks anymore. And I don't care what network it is, there's no such thing as a secure network anymore. All networks are, by their very nature, compromisable. So privacy is critical and QoS, quality of service, fairness, management, congestion, right? You can't have 100% of the network. You've got to share the network. You can't attack your neighbors with the network. You need to be controlled, right? You need QoS. You need some protection. And it gets tricky when you push it that far down the stack. So my question is, can we do these three things with only a NIC
Starting point is 00:24:37 and without more protocol extensions than the one I just mentioned? The answer is yes, we can do it with only a NIC, and mostly yes, we can do it without other protocol extensions than the one I just mentioned. There is one other little thing that has to happen, but I think you might be surprised at some things that don't require protocol extensions. That's just kind of what I want to dive into here.
Starting point is 00:25:02 Just keep my timer going so I know what time it is. Remote data integrity. Assuming we have a write and a flush, and then the flush plus commit, well, it is a flush plus commit. I guess I meant write plus flush, all complete with success or failure. We've successfully delivered this data to persistency. How did the initiator know that the data is intact? It basically threw this thing to the NIC and the NIC said I put it to the other end and the other end said I made it durable.
Starting point is 00:25:37 How do you know that those bits are good? Or if there was a failure, which data was not intact? What just got lost? What got damaged? How failure, which data was not intact? You know, what just got lost? What got damaged? How much recovery do I have to do? Normally the CPU would have done that, right? We would have done this with an upper layer protocol
Starting point is 00:25:55 that would call a file system that would all give you end-to-end guarantees that the data was good. Now we've done this pretty much by DMA. So there's some possibilities. We could read it back. Well, that's extremely was good. Now we've done this pretty much by DMA. So there's some possibilities. We could read it back. Well, that's extremely undesirable. We're going to use 100% of bandwidth
Starting point is 00:26:10 in the other direction now, and we might have to read back many gigabytes of stuff. As well, you might not actually read the media. On Intel platforms, you might hit the cache. You didn't read the media. You don't know that the actual bits are good. You just know that the cache was the same as you expected. So reading back is a really bad idea.
Starting point is 00:26:30 You could signal the upper layer and say, can you please do whatever is necessary to tell me this is good? And that's all right, but it's very high overhead, and there might not be an upper layer available. You might be running this, the idea is, on a memory-only appliance. So you don't want to depend on this as an architecture. You want to have a way at the lower layer to be able to do this.
Starting point is 00:26:50 And the same question, you know, integrity applies to things like, you know, ordinary day-to-day operation, like an array scrub or storage management. You know, you may need to reach out and do this. It may not be the sender of the data that wants this guarantee. It might be somebody at a later date, a later time, who wants to come through and verify. So I propose another operation called RDMA Verify. We've talked about this here and there. I'm basically just going to repeat the idea here, but give it maybe a little more meat. The idea is to add integrity hashes to an operation. Integrity hashes that basically guarantee the integrity of some block of data that has been sent. Either has just been sent or was sent in the past.
Starting point is 00:27:41 It could be piggybacked on flush. It may not have to be a new operation. It could be piggybacked, and that's what I want to talk about. This is not unlike SCSI T10 DIF, right, the Data Integrity Field, where you drop a hash along with the data, and the hash accompanies the data on the end device at all times. The algorithms would be negotiated by the upper layers. We don't want to specify hash algorithms down here. That's the province of the upper layer. But the idea is that the NIC would implement them, whatever they are. You would name them or negotiate them or select them. And the engine that does that could be on the platform and the storage device itself. Some devices do that. It could be in RNIC hardware and firmware.
Starting point is 00:28:23 It could be other hardware, the chipset, the memory controller, lots of devices have this type of integrity capability. Or of course it could be in software, the target CPU. You might do that to get going. But the idea is that it'd be as efficient as possible. And there's a couple of options and we don't really have to decide between them right away. But one is that the source requests the hash and gets the hash back. It says, compute the hash and send it to me, and I'll decide.
Starting point is 00:28:51 Which is nice. The other is the software says, this is what I believe these regions should hash to. Tell me yes or no. And the differences are interesting. In one case, you don't have to tell the server anything. He just computes a number and sends it back to you. In the other case, you tell the server something, and the server actually knows that it's good or bad.
Starting point is 00:29:12 At the same time, you know it's good or bad. So you don't have to go back to him and say, oops, bad block, do something. It roughly maps to the SNEA NVMP twigs, optimize, flush flush and verify. Flush and verify has very weak semantics, but this would be a really strong semantic for flush and verify. And here's the basic idea.
Starting point is 00:29:35 We see our put, put, put. We see our flush, right? And we get a flush complete. Now how do we know that the data that landed here is good? And the idea would be that a verify, oh, no, here's a put. It's still in the thing. And then at the bottom, there's a verify. And the verify would follow the flush complete,
Starting point is 00:29:58 and this verify would act on this data and say verify complete. So it would simply be a second operation. But one might consider merging it with the flush. You could say flush and verify, which is after all what the SNIA NVM Programming API does, right? So you could consider two operations on the wire or one operation on the wire. But the verify, ideally, though not necessarily, would introduce
Starting point is 00:30:30 no pipeline bubble after the flush complete. You would want it to synchronize with this same stop sign, ideally. I don't know. Anyway, verify is important. Otherwise, you're back to signaling the upper layer. And if you're really worried about the integrity of your data, that means you're signaling the upper layer on every operation, which means latency, which is not what we want.
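[Editor's note: to make the two options concrete, here is a hypothetical layout for a verify request. Nothing in it is specified anywhere; it only illustrates the parameters the initiator would supply in each mode, "compute and return the hash" versus "compare against this hash and tell both ends the result".]

```c
#include <stdint.h>

enum verify_mode_proposed {
    VERIFY_RETURN_HASH  = 0,   /* responder computes the hash and returns it  */
    VERIFY_COMPARE_HASH = 1,   /* initiator supplies the expected hash; the   */
                               /* responder answers pass/fail, so both ends   */
                               /* learn the outcome at the same time          */
};

struct rdma_verify_req_proposed {
    uint64_t remote_addr;      /* start of the region to verify               */
    uint32_t rkey;             /* memory region key                           */
    uint32_t length;           /* bytes covered by the hash                   */
    uint16_t hash_alg;         /* negotiated by the upper layers, not here    */
    uint16_t mode;             /* enum verify_mode_proposed                   */
    uint8_t  expected[64];     /* valid only in compare mode                  */
};
```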
Starting point is 00:30:57 Yeah, quick question. Do you assume that this... because this flush doesn't necessarily mean pushing it to the media? That's a platform-specific question. But the flush will have additional latency. Yes. Yes? For this verify, could it be that it just reads from the durability domain? Possibly. It could read from the durability domain.
Starting point is 00:31:32 Once again, that's a platform-specific question. This guy on the left doesn't know how to do it. He's going to request this guy, the NIC on the right, to do it. And the NIC, in turn, will have to do some platform-dependent operation to satisfy the guarantee. The SNIA NVM Programming TWG has spent a long time discussing that fine point. Does the verify go all the way to the media?
Starting point is 00:32:08 The answer is probably no, but we can't make a meaningful general statement that's true for all scenarios, all platforms. Frustrating. I mean, it's good. It leaves room for innovation, right? You don't want to over-specify this stuff. That would be my philosophical reply. Privacy. All right. Upper layers protect their send-receive messages today, right? Any storage upper layer worth its salt has some privacy mode. RDMA direct transfers are not protected.
Starting point is 00:32:46 There's no standard for encryption in RDMA. There's some encapsulation. They say, oh, protect it with IPSec or run it through an SSL tunnel. They're waving their hands, in my opinion. That's not an encryption standard. That's a layered solution. In any event, the desire is to protect the user data
Starting point is 00:33:08 with the user's key. In the cloud today, you really don't want the cloud provider to have the key to your data, because if that provider puts your data in, you know, East Who-Knows-Where-istan (I'm sorry, I shouldn't use any sort of geographical pejorative here), if it puts it in another geography, that data is vulnerable in that geography. That provider could be subpoenaed for the key to your data. It's your data, not his data. So some of us have an opinion that the user data must always, always, always be protected with a user-held key
Starting point is 00:33:46 to which the cloud provider and others do not have access, right? So it's not a global key. It's not a machine key. It's not even a connection key because you might share the connection, right? It rules out the use of IPsec, TLS, DTLS, right? These things use a key per flow on the network, if you will. That's a shared key that you don't own; you don't control it all the way down the stack. And so you can't use these traditional techniques. Alternately, why don't we just say, well, let's use the on-disk crypto.
Starting point is 00:34:24 So if the upper layer is encrypting its data, great. Then you've punted the problem up the stack. The upper layer is doing it, and everybody's happy. But very often we have some sort of on-disk crypto that protects the data when it lands on the disk. And that may be protected with a user-specific key, right? But that on-disk crypto has some really awkward properties when you try to float over the network. It's typically a block cipher, right? But that undis crypto has some really awkward properties when you try to float over the network. It's typically a block cipher, right? It's a sector-based cipher.
Starting point is 00:34:51 It'll encrypt a sector at a time. So I can't write or read a small fraction of it. I've got to read or write the whole thing to have a valid crypto behavior, right? And second, these on-disk cryptos rarely, if ever, provide integrity. They usually try to preserve the size of the block, right? So you can just sort of, you know, scramble the block and drop it back down. Integrity requires adding data to the block for the integrity check, right? And so you don't get integrity, meaning you've got to do double computation, right? Just, you know, grabbing that block isn't enough. Decrypting the block might give you gobbledygook if it is a damaged block. So, the on-disk crypto and the traditional connection-based crypto probably don't work.
Starting point is 00:35:38 Upper layers, particularly SMB3, use authenticated stream ciphers. Right? Stream ciphers have some really nice properties. They provide privacy and integrity together. If you don't have the proper key, you can't get the data. But you can also be guaranteed that you know that the data was good or bad. The act of decrypting it tells you whether that was valid. You can do it on an arbitrary number of bytes.
Starting point is 00:36:06 You can protect a couple of bytes. You can protect a whole big long string of bytes. There's a limit to the amount you can encrypt safely with each cipher, but you choose your cipher. You pick the cipher that matches your workload or matches your requirements, and it'll protect exactly what's going on in the wire in both privacy and integrity. That's really cool. It shares the cipher and keying with the upper layer.
Starting point is 00:36:30 If your upper layer is using this, like SMB does, we use CCM and GCM. GCM is preferred nowadays because of the capabilities of modern hardware. But if you share that cipher and keying with the upper layer, well, the upper layer has the keys, right? And the keys came from the user and were under control of the upper layer. So it's very desirable to do that. But how do you plumb that key into the RDMA NIC message processing?
Starting point is 00:36:55 And I propose that you can do this by enhancing RDMA memory regions. And in particular, you don't have to change the RDMA protocol if you do this carefully, I think. Okay, we can talk about that. A decryption coprocessor is another way to do this, but you need to protect it both on the wire and at rest, and those are two separate problems. I guess that's where I'm going, so
Starting point is 00:37:32 Crystal, I'm just going to keep going because I'm a little shy on time. Memory region keys. I propose that we can extend the memory region verb and the NIC TPT, the translation protection table, to include a key. All right, basically add an argument to the memory register verb that provides a key. The keys would be held by the upper layer under user policy and passed down to the NIC with each registration. The NIC will use the key when reading or writing each region for RDMA. So there's a source key and a destination key. And this is just a little picture of it. It shows registering the source buffer with a key. That key is stored in the NIC.
Starting point is 00:38:12 A nonce, which I'll mention in a minute, is used to encrypt a string, secret stuff. It goes over the wire in encrypted form. That same key, plumbed by the upper layer over here, is used to decrypt it with the same nonce. And the data can be written to the host's persistent memory. Whatever it is, it was recovered. It was protected on the wire and recovered here. And the integrity was guaranteed here. We'd use another integrity check on the PMEM if we didn't trust it.
Starting point is 00:38:42 And just in case you don't recognize it, Peg happens to be Klingon for secret stuff. I just thought that was cute. Simple cipher. Actually, I'll mention two other things. You notice that the length of the stuff on the wire is not the same as the length of the stuff that lands in memory. That cipher may have messed it up, may have swizzled it.
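[Editor's note: a sketch of where the key would be plumbed in. The real verb is ibv_reg_mr(pd, addr, length, access); the _proposed variant and the crypto attribute structure below are hypothetical and only show the shape of the suggested extension, in which the key lives next to the address translation in the NIC's TPT so the wire protocol itself does not change.]

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Hypothetical per-region crypto attributes, supplied at registration. */
struct ibv_mr_crypto_attr_proposed {
    uint8_t  key[32];          /* user-held key, e.g. for AES-256-GCM      */
    uint16_t cipher;           /* cipher identifier negotiated up-stack    */
    uint64_t nonce_base;       /* starting point for nonce derivation      */
};

/* Hypothetical registration verb: the NIC stores the key alongside the
 * address translation in its TPT and encrypts/decrypts RDMA payloads that
 * target this region with that key. */
struct ibv_mr *ibv_reg_mr_crypto_proposed(struct ibv_pd *pd,
                                          void *addr, size_t length,
                                          int access,
                                          struct ibv_mr_crypto_attr_proposed *attr);
```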
Starting point is 00:39:06 Sure. Sure. What is the purpose of the data? Is your purpose to secure the data on the network? Privacy on the network, which you don't own. Not on the media. Not on the media, right. An authenticated cipher is not very useful on the media because it changes the size of the data. What is the motivation to do the protocol for RDMA in the payload data and the memory key data? And what is the, for example, in the study of the. What is the reason to invent something new? I'm looking for my slide.
Starting point is 00:39:46 It protects the user data with the user key. And that user key can be rekeyed at any time. IPsec tunnels would be the user... For everything that flows on that association. It protects the identity and integrity of data. It is possible to use IPsec for some of this protection, I grant you. I don't believe IPsec is manageable. You want to secure the network. We saw that on the network.
Starting point is 00:40:29 I'm going to ask that we keep going because I have about 12 minutes to go, and then we'll discuss it in a moment. I'm glad you have that idea. It's important to discuss this stuff. That's my goal. Cipher housekeeping. Authenticated ciphers typically employ nonces. The nonce is a really key thing
Starting point is 00:40:53 because the nonce is paired with the key at each end to encrypt and decrypt each message. Nonces can never be reused for a different payload. These ciphers are easily broken if you send a different payload with the same nonce. Easily broken, like trivially, apparently. I'm not enough of a cipher mathematician. What that means is that upper layers must coordinate
Starting point is 00:41:19 the nonce usage with the RDMA layer. The upper layer is using the same cipher with the same key to transfer its upper layer messages. So the lower layer has to be sure that it doesn't collide in nonce space. The RDMA must consider this when retrying. A lot of RDMA adapters when they retry simply refetch the data from memory.
Starting point is 00:41:39 That could change the data when it retransmits. So there are some implications on the RDMA layer. The NIC may derive the nonce sequence from the RDMA connection. The nonce may not have to be sent on the wire. A lot of protocols put the nonce in front of the payload and just send the nonce. The nonce is not secure, it just changes.
Starting point is 00:41:58 That's what's important about it. And so, one possibility though is that the RDMA layer has a lot of extra data. There's a message sequence number, for instance. And we could use this for nonce management, in my view. Alternatively, we could just stick it in the data buffer and the NIC could strip it out. But the NIC would then have to receive some of the data before it began to decrypt. When the nonce is derived from the connection, all it has to do is look it up in the TPT.
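[Editor's note: a minimal sketch of one way the nonce might be derived from connection state plus a message sequence number, with the space partitioned so the upper layer's own traffic can never collide with the RDMA layer's. The 96-bit layout is an assumption for illustration, not anything negotiated or specified.]

```c
#include <stdint.h>

/* 96-bit nonce: | 31-bit connection salt | 1-bit layer | 64-bit sequence | */
static void build_nonce(uint8_t nonce[12], uint32_t conn_salt,
                        int rdma_layer, uint64_t msg_seq)
{
    /* The layer bit keeps the RDMA layer's nonce space disjoint from the
     * upper layer's, since both share the same cipher and key. */
    uint32_t hi = (conn_salt & 0x7fffffffu) | ((uint32_t)rdma_layer << 31);

    for (int i = 0; i < 4; i++)                  /* big-endian packing     */
        nonce[i] = (uint8_t)(hi >> (24 - 8 * i));
    for (int i = 0; i < 8; i++)
        nonce[4 + i] = (uint8_t)(msg_seq >> (56 - 8 * i));
}

/* Rekey before the sequence space wraps: a nonce must never be reused
 * with the same key for a different payload. */
static int need_rekey(uint64_t msg_seq) { return msg_seq == UINT64_MAX; }
```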
Starting point is 00:42:25 The upper layer consumes nonce space too, so I just want to mention that that coordination is important. Rekeying is necessary when the nonce space is exceeded. When you run out of unique nonces, they're huge, 11 bytes, 88 bits long, for the way SMB3 uses it. Most GCM nonces range up to about 96 bits, somewhere around there.
Starting point is 00:42:45 But they can still run out, because you have to share them between layers, and you might have to partition the space, and for whatever reason you might get tight on nonces, because they can never be reused. So there may be some requirement to rekey. Key management is always the upper layer's responsibility, as it should be, right? The upper layer belongs to the user's identity,
Starting point is 00:43:07 and the user's identity is the owner of that key. So it goes up to the owner of the key. And there's one more thing. Protecting the network. Upper layers today have absolutely no trouble saturating up to 100 gigabits with RDMA. Absolutely no trouble, right? You give them an RDMA adapter,
Starting point is 00:43:29 they can go into a loop, and they can burn the network, right? 100% of the network. That's really undesirable, right? It's cool for benchmarks, but it's really undesirable. So, you know, the memory, the destination device, can sink writes at least that fast.
Starting point is 00:43:43 So the networks will rapidly congest without some control. Rate control is absolutely required in the RDMA NIC if we light up push mode. Absolutely required. Fortunately, we have a number of QoS approaches. Simplistically, probably the simplest way we can do this is just a bandwidth limit, right? And just say, sorry, you can have 100 megabits or whatever. When your rate exceeds that, I'm just going to put my thumb on your sends and I'm not going to let you use it. But a more sophisticated approach is very desirable in any sort of scalable situation. I will, with some pride, point to the classification and end-to-end QoS that
Starting point is 00:44:25 we talked about back in 2014. The resources slide has a link to it. I'll also talk about software-defined network techniques, in particular the generic flow table. The generic flow table is something that a lot of Microsoft technologies use and it basically classifies packets and then applies a policy to things that fall into that classification. It can be used as a firewall as well as a rate limiter. So it's very interesting. There is support for these types of QoS things in existing enterprise class NICs. So it's not like this is new to any network adapter vendor.
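[Editor's note: as a concrete illustration of the simple bandwidth-limit case, here is a token-bucket sketch: classify a flow (by queue pair, say, which is an assumption), then admit or defer each operation against a byte budget. A real generic-flow-table policy would be enforced in the NIC, not in host code like this.]

```c
#include <stdint.h>

struct flow_limit {
    uint64_t rate_bytes_per_sec;   /* policy: e.g. 100 Mb/s => 12500000 B/s */
    uint64_t bucket_bytes;         /* current credit                        */
    uint64_t bucket_max;           /* burst allowance                       */
    uint64_t last_ns;              /* last refill timestamp                 */
};

static int flow_admit(struct flow_limit *f, uint64_t now_ns, uint64_t bytes)
{
    /* Refill credit in proportion to elapsed time, capped at the burst. */
    uint64_t credit = (now_ns - f->last_ns) * f->rate_bytes_per_sec / 1000000000ull;
    f->bucket_bytes += credit;
    if (f->bucket_bytes > f->bucket_max)
        f->bucket_bytes = f->bucket_max;
    f->last_ns = now_ns;

    if (bytes > f->bucket_bytes)
        return 0;                  /* defer: would exceed the policy        */
    f->bucket_bytes -= bytes;
    return 1;                      /* admit this write/flush                */
}
```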
Starting point is 00:45:00 But bringing it into the RDMA space is definitely new. So it's a requirement here. Putting it all together, assuming that we have durability, ordering and atomicity, privacy, integrity, and rate control, then we support the hot path completely in the NIC, right? We have this offloaded hot path in the NIC. Is that all? Are we done? Of course not. And I just want to say that we still need an upper layer. Something has to do connection management. Something has to do authentication, right? You know, authenticate your user to provide a key, to manage nonces for privacy, right? Or just for basic access, authorization. Right, in order to do authorization,
Starting point is 00:45:45 you need an authenticated identity. Granting and revoking of these push handles, right, it's setting up the RDMA flow, basically. Assigning that QOS policy, all the other things that upper layers already do. So I view this as an offload for the persistent memory data handling. The upper layer will still be there,
Starting point is 00:46:04 you'll still go through a traditional connection, but you will delegate the handling of this hot path to the NIC if it has all these features. And you can do that faithfully because these are the key functions of the upper layer, upper storage layer. So the summary with persistent memories of storage media and the above extensions,
Starting point is 00:46:23 we enable that RDMA-only remote storage access method. We avoid the CPU and upper layer processing, and we obtain rock-bottom latencies. There are some resources, and that's it. All right, questions? Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list
Starting point is 00:46:52 by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the Storage Developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
