Storage Developer Conference - #11: Remote Access to Ultra-low-latency Storage

Episode Date: June 13, 2016

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 11. Today we hear from Tom Talpey, architect with Microsoft, as he presents Remote Access to Ultra-Low-Latency Storage from the 2015 Storage Developer Conference.
Starting point is 00:00:45 Hi, I'm Tom Talpey. I'm from Microsoft, architect in the file server team. My responsibilities include SMB3 and SMB Direct, the mapping of SMB to RDMA. And I've been at SDC for many years, and usually when I'm here I talk about SMB. The last couple of years I've talked about things that are SMB related, but not actually SMB. So I'm kind of exploring the boundaries of where SMB enables new things. I'll have a couple of shameless plugs in the middle of the deck about that. But today I want to talk about ultra-low-latency storage, which I'll define, and how the protocols that we know and use today can adapt to these
Starting point is 00:01:34 new storage environments. And after my Bluetooth wakes up, a quick outline. I'll state the problem, okay, and the protocols today. I'm going to focus on certain protocols, not all RDMA storage protocols, but certain ones, and examine briefly the sources of latency that they currently encounter, sort of where the landscape is at. And then I'll launch into my madman hypothetical. You can agree or disagree with my wacky ideas, but I'm going to try to chart a course for the extension of RDMA storage protocols and the protocols they depend on for remote access.
Starting point is 00:02:19 So without further ado, here we go. I want to mention that this talk is related very closely to a number of other talks that are happening at SDC. We've had a new track at SDC this year for persistent memory. This is one of the talks in that track. But really there are a lot of aspects to the persistent memory problem, right? There's the API, there's the physical storage, there's the attributes of how that storage becomes durable, how you recover after an error, how you access it remotely, just, you know, how file systems will use it. There's really, there's a very broad spectrum of topics that fall under this heading of persistent memory.
Starting point is 00:03:07 It's not really about the low-level memory cells at the bottom of the stack. There are lots of different technologies down there. This is really a whole-stack discussion. And what I want to bring to this is the aspect of the remote access to this stuff. But these related talks are closely related. You really ought to keep them in mind as I speak. Monday's talk from Neal Christiansen at Microsoft was about the file system,
Starting point is 00:03:35 the sort of the Windows adoption of this technology. He talked about a file system called DAX, D-A-X in particular, and some block layers that will go above this technology to allow rapid adoption of it. On Tuesday, Jim Pinkerton spoke to the eBOD, the so-called Ethernet bunch of disks, the Ethernet-attached JBOD. That is conceptually very similar to some topics here.
Starting point is 00:04:00 You could imagine an eBOD with persistent memory in it. Okay, just throw that out there. Andy Rudoff talked about NVML and API layers to this thing. All these APIs will flow to the wire. Doug Voigt spoke to the higher level NVM effort in the TWG, in the SNIA TWG, with sort of architectural challenges and new horizons for NVM to explore.
Starting point is 00:04:28 Chet is going to come after me, not immediately after. He's like two talks after. But Chet's going to talk about some practical platform-specific aspects of NVM and how it can be used without some of the things that I'm going to talk to here. So we're going a little bit out of order.
Starting point is 00:04:44 I'll try not to duplicate discussion, but point to these other discussions. And then Thursday, Paul von Behren will also follow up on some more NVM things. So just kind of think of these all together, all right? And I probably left out a couple of talks, but I'm just a little piece down in the bottom about remoting this stuff.
Starting point is 00:05:06 Okay, first, let's state the problem. The focus of this talk, my talk, my angle, my way of viewing this problem, is from the perspective of enterprise and private cloud-capable storage protocols. So these are the sort of tried-and-true protocols that we've used for years. Some of them are file, some of them are block. The point being that they're scalable, they're manageable, they're broadly deployed. These are high-level protocols that are used in the enterprise, that are used by enterprise applications and enterprise operating systems.
Starting point is 00:05:44 Now, we use them in new ways as we move to the brave new world. RDMA, which I'm quite proud to say is establishing itself very well in the industry, works with many of these protocols, in particular SMB, NFS, and iSCSI. So I'm going to talk to SMB3 with SMB Direct. It's my baby, so I'm most proud of it. NFS RDMA is technically also my baby. I wrote the protocol and some Linux implementation, but others are carrying that ball right now. And iSER as well. iSER, the iSCSI mapping to RDMA. So I'm going to just hypothesize a little bit about each one of those. There are many others,
Starting point is 00:06:33 including NVM Fabrics, but I'm not going to speak to that. One of the reasons is that I believe those are still emerging technologies, right? I want to focus, as I said, on, that's not going to show up very well, is it? Enterprise and private cloud capable storage protocols. I don't think NVM Fabrics is there yet, but it may be someday. So watch this space. I'll talk about it soon enough. New storage technologies are emerging. Okay, so this is how the problem statement begins. Advanced block devices, IO bus attached and future block or future byte addressable. They're sort of a basic class of devices, similar to what you may have worked with before, right? They are block devices, but they sit on new buses, right? NVMe sits on PCI lanes, basically, PCI Express
Starting point is 00:07:34 lanes. Solid-state devices, there's a broad array of solid-state devices. They're not all on the storage interconnect, but they have certain behaviors, persistence, and they're memory-oriented. Today they're mostly block-oriented. In the future they'll become byte-oriented. Even these NVMe devices, when they sit on a PCI bus, can be accessed as memory. And so that byte addressability or pseudo-byte addressability will become important for these things. The IO bus-attached ones are purely block oriented,
Starting point is 00:08:06 and they almost require an IO stack to get to them. There's a second class, though, storage class memory, as we've seen in a lot of discussion. I'll just call them PM. I prefer PM to SCM. But SCM does make sense when having certain application-focused dialogues. They're memory bus attached, right? They might be an NVDIMM, whatever.
Starting point is 00:08:28 They're block or byte accessible. When they're block accessible, they're either native block by nature of the implementation, or they've got a block layer layered on top of them, or byte accessible. They're just plain old memory, right? They may not be actually byte accessible.
Starting point is 00:08:42 They may have a cache line behavior. Some very small block, and we saw some discussion about that yesterday. And these encompass emerging persistent memory technologies like Intel's recent 3D XPoint, 3D cross point. PCM, phase change memory, has been talked about quite a bit in this kind of context. But they come in various form factors. So they're all low latency, persistent devices that live on a memory bus. All right, they're byte addressable. Storage latencies, however, as a result, are decreasing. You've seen this slide in a bunch of different ways. I just put it up in a slightly different way, and I'm sorry about the contrast of that table. That's poor.
Starting point is 00:09:27 Write latencies of storage protocols, for instance, SMB3 today, are down in the multiple tens of microseconds when they run on RDMA. Somewhere between 30 and 50 is kind of the best latency of these protocols, which is pretty amazing by storage standards, right? If you look at the underlying technology that these protocols access, you know, hard disk drives
Starting point is 00:09:54 boom, one to ten milliseconds; solid state drives, traditional storage-attached SSDs, 100 mics to one millisecond. NVMe is pushing 10 mics. It might even be better. We've seen some NVM fabric results and some local results that are slightly better than that. But they're in the 10 to 100 mic range. And persistent memory, however, just blows it out by making it memory. So these latencies are coming down by tens of times each row, maybe 100 times at the bottom row. That puts a big challenge
Starting point is 00:10:29 on that 30 to 50 microsecond number. It's a good match to hard disk drives and SSDs. It's a stretch match to NVMe. It's about the same as NVMe, right? That's pretty good. We want to be about the same, the latency of the underlying media to the remote capability. But it is definitely not so much a match to persistent memory. So we have to
Starting point is 00:10:53 do something to these existing protocols to keep up with the latency curve that the devices are giving. Storage workloads traditionally, I just point out, are highly parallel. Storage in general is highly parallel. They issue lots of deep queues. They shoot tons of IOs down the stack. And so that mitigates latency issues. You'd fill a pipeline and you always have a new completion popping out. But that's because the traditional workloads are highly parallel. New workloads are actually not so parallel, and so the latencies are harder to make up for, if you will.
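The latency-hiding effect of deep queues can be sketched with Little's law; the latencies and queue depths below are illustrative values, not numbers from the talk:

```python
# Little's law sketch: throughput = operations in flight / per-op latency.
# Deep queues keep completions popping out despite high per-op latency;
# a serial (queue-depth-1) workload feels every microsecond of it.

def iops(latency_us, queue_depth):
    """IOPS achievable with queue_depth operations always in flight."""
    return queue_depth * 1_000_000 / latency_us

serial = iops(40, 1)      # a lone 40 us remote write: 25,000 IOPS
parallel = iops(40, 32)   # 32 in flight: 800,000 IOPS, latency hidden
```

Same per-operation latency, 32 times the throughput, which is why the traditional, highly parallel workloads tolerated 30 to 50 microsecond remote writes.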
Starting point is 00:11:30 And that's where we're going to go. Those new latency-sensitive workloads are primarily writes. They're small and random: virtualization, enterprise applications. Those are the ones that everybody dealt with in traditional storage, right? The database workloads, the virtualization hard disk drive, you know, the virtual hard disk emulation.
Starting point is 00:11:49 They're all small and random. However, these writes must be replicated and made durable in modern environments, right? It doesn't count until you've created three copies of it. And those three copies have to be physically disjoint, right? They have to be in totally different failure domains or it doesn't count. If you lose the golden copy, the one copy, the few copies you have, you've failed as a provider, as a
Starting point is 00:12:12 storage provider. And so the modern way to do that is to replicate, to spray them around. So a single write creates multiple network writes. And I'll show you a picture in just a second. Reads are also latency sensitive. Small and random are always latency sensitive. Large are more forgiving, but there's some interesting ones that may want to go remote, like recovery and rebuild, right? If you lose a copy, you've got to read all the other
Starting point is 00:12:34 copies to rebuild the missing copy, right? And so there are some interesting latency and or bandwidth questions to be seen with reads. So here's a little animation that I stole from an Azure presentation, Windows Azure. But writes with a possible erasure coding, I want to point out, greatly multiplies the network IO demand. We call it the two-hop or the multi-hop issue. You perform a write down here, and that write has to be placed in three other locations on the network, right? And you have to wait for those three locations to be safe before you can return from the write. Now, if you're a huge high-scale geographically dispersed server like Microsoft Azure, you have to also create erasure coding, right?
Starting point is 00:13:32 And so that one write prior to replying had to produce quite a few more writes. And those writes were to distinct machines within the data center. So latencies are interesting not so much at the front edge of this thing, but in the middle of this thing. Front edge as well. All such copies must be made durable before responding. That's what I just want to stress.
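The fan-out just described can be counted in a toy model; the three-way replication matches the talk, while the erasure-coding fragment count and the per-copy latencies are assumed for illustration:

```python
# One client write fans out to several network writes, and the reply
# cannot be sent until the slowest copy is durable.

def network_writes(replicas=3, ec_fragments=0):
    """Network writes triggered by a single client write."""
    return replicas + ec_fragments

def reply_latency_us(copy_latencies_us):
    """All copies must be durable before responding, so the slowest gates."""
    return max(copy_latencies_us)

fanout = network_writes(replicas=3, ec_fragments=6)  # replication plus a coding pass
gate = reply_latency_us([35, 42, 50])                # assumed per-copy latencies
```

So the single front-edge write became nine network writes in this assumed configuration, and the client waits on the slowest of them.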
Starting point is 00:13:59 APIs and latency are another interesting part of the problem statement. APIs also shift the latency requirement. Traditional block and file, as I mentioned, are also often parallel. Memory mapped and persistent memory aware applications, not so much. They have a load store paradigm. You don't expect load
Starting point is 00:14:18 store to block. You don't code loads and stores to be parallel. Well, maybe you do with certain types of applications, but that type of load store expectation is the expectation. I do a read, I do a write, it doesn't block. It's quick. Low latency.
Starting point is 00:14:37 It has memory style latencies. Possibly expensive commits. The libraries all say, store, store, store, commit. And the commit is a heavier weight operation. That's where you take the penalty. That's where you expect the cost. You don't expect it on those individual stores. And a lot of people try to hide that latency with local caches, which work great for reads.
Starting point is 00:15:00 But for writes, they don't count. There's only one copy if you write to the local cache. That doesn't count. You still have to spray it all over the place. Most of these caches are write-through and run into the same latencies that you get for a traditional write. By the way, I didn't mention it, but feel free to interrupt me if I'm being confusing or if you have a question. Okay. That's roughly the problem space.
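The store, store, store, commit pattern can be modeled as a toy cost ledger; the per-store and per-commit costs below are invented for illustration, not measurements:

```python
# Toy cost model of the load/store + commit paradigm: individual stores
# run at near memory speed, and durability is paid for at commit time.

class PMRegion:
    STORE_US = 0.125   # assumed cost of one cache-line store
    COMMIT_US = 5.0    # assumed cost of the heavier durability point

    def __init__(self):
        self.pending = 0        # stores not yet made durable
        self.elapsed_us = 0.0

    def store(self):
        self.pending += 1
        self.elapsed_us += self.STORE_US

    def commit(self):
        self.pending = 0        # everything pending is now durable
        self.elapsed_us += self.COMMIT_US

r = PMRegion()
for _ in range(8):
    r.store()
r.commit()   # one durability point amortized across eight cheap stores
```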
Starting point is 00:15:25 Latency, latency, latency, right? All right? It's about write latency. RDMA storage protocols today. Many layers are involved. Okay, we start with storage layers, storage protocols that carry some local API across the network to some storage server. Some examples, the ones that I'm going to be focused on today: SMB, NFS, iSCSI.
Starting point is 00:15:54 Those are three well-known remote protocols. There are RDMA layers below those, functioning as low latency, high throughput, low overhead transports. iWarp, that's one. That's the RDMA over TCP mapping. RoCE and RoCE v2, that's an InfiniBand-style protocol placed on Ethernet, right? RDMA over Converged Ethernet. That's what RoCE stands for. And InfiniBand, which is the sort of the custom top-to-bottom fabric used in a lot of HPC, scientific computing, financial high-performance computing data centers. So those are three typical RDMA layers all in use today, along with those storage layers above it.
Starting point is 00:16:38 And finally, the bottom layer is the IO bus itself, which has a lot to do with this. The storage layer, which could be a file system or a block layer, and if it's a block, a file system, you know, lots of different ones. Block, I'll just mention some interesting things. SCSI, random SCSI, you know, SCSI of your choice. SATA and SAS, that type of interconnect, something that's not necessarily switched or routed or shared, right?
Starting point is 00:17:02 It's just kind of a way to get to the low-level storage. But PCI Express is also an IO bus. These NVMe devices sit on it, and memory itself has become an IO bus now that these NVDIMMs are plugged into it. So those are new IO buses. I'm trying to draw the most complicated version of the picture, I guess, before I paint a solution.
Starting point is 00:17:28 SMB3 architecture, shameless plug. Okay, I work for Microsoft. I'm a co-designer of SMB and SMB Direct and blah, blah, blah. SMB3 is the principal Windows remote file sharing protocol. Almost all major Windows services run well over SMB3. They're supported. It operates over RDMA. It's a very rich, mature, highly supported protocol.
Starting point is 00:17:51 Multiple implementations exist in the industry. They're all here testing downstairs. But SMB is also, it's not just a file sharing protocol. It's also a transport protocol. It's an authenticated, secure, multi-channel, RDMA-capable session layer. It's a transport with recovery. It has that session state, right? You've logged in. You've proved who you are. You have access to other things on the server. For instance, it's not just file system operations. It's raw block operations on Windows. It's Hyper-V live migration. I can read and write
Starting point is 00:18:26 the memory of a virtual machine. And RPC, named pipes, I can perform remote procedure calls to my server over SMB. So there's a lot on the back end of SMB as well. And SMB will be a future transport for NVMe storage, persistent memory, etc. And I'm just going to sketch how that might happen. Ooh, the white isn't going to come out very well. SMB3 components. On the left is the client. And over here we have an application sitting on a local API, in this case Win32. It sits over the redirector, which is the SMB client.
Starting point is 00:19:02 And the redirector can access the server via multiple protocols, TCP or RDMA. The server, of course, multi-channel, multiple connections can be used. It flows to the server. The server has a number of storage providers behind it. In SMB, we call this a share. You'll open a share. It might be a disk drive. It might be a file system, a volume of a file system and things like that. The share kind of names the provider. And then the provider, in turn, uses some sort of storage back end. Typically, a file system will use a hard disk or a solid state drive or something to serve data from. But I just want to point out that there's a couple of different layers behind the server.
Starting point is 00:19:46 And so there's some interesting things going on. We can view RDMA in any of these layers. And these ones that are highlighted are sort of new components because to get to NVMe and emulate a disk, you need a little SCSI layer. To support block mode and raw mode, persistent memory, you need drivers, obviously, and maybe a block layer. And a mapped file API such as the DAX file system that Neal Christiansen mentioned yesterday. That's a new paradigm for accessing files.
Starting point is 00:20:21 Okay, contributors to latency. This is probably a review for a bunch of people, but I'm going to say it anyway. The way storage protocols work today is very specifically architected. All three of the protocols that I'm going to discuss have the same basic transfer model here. They use a direct placement model, and I've simplified and sort of optimized it here. I've left a few exchanges out that some protocols will encounter, but this is sort of the best case of the protocol today. The client, to perform a read or a write, always starts by advertising some region of his memory, right? The buffer I want to write or the buffer into which
Starting point is 00:21:03 I want to read. He registers that with the RDMA NIC and he sends it to the server and he says, please do the read or the write for me. The server performs all RDMA. The server either writes the memory for a read or reads the memory for a write. All three of these protocols work that way. They do it for three very important reasons. It's more secure. The server can register its memory and does not expose it to anything, right? It's private to the server. It's performed only for the RDMAs that it requested. Second, it's more scalable. The server doesn't have to pre-allocate a bunch of stuff and pre-reserve it to a given client. And third, it turns out to be faster.
Starting point is 00:21:50 It's faster because the server can schedule the presence of the memory. We don't have the client stuffing it down the server's throat. There's no real congestion control needed. The server simply holds on to the I/O and pulls the data when it's ready for it. And that turns out to greatly improve the performance of the server in the long run for most highly scaled workloads. SMB3 uses it for read and write. All the other protocols do basically the same thing. So the important thing to note is that those red lines are server initiated, right?
Starting point is 00:22:17 They're performed by the server. And the little lightning bolts indicate interrupts and processing required to perform the I/O. The server, in the write case, takes two of them, one for the client to request it and one for the RDMA read to be complete. And in the read case, he takes one of them. In both cases, the client can play with it and probably just take one interrupt on each side.
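The server-side event flow just described can be traced in toy form; this is an illustration of the interrupt counts on the slide, not any real implementation:

```python
# Today's direct placement model: the client advertises a registered
# buffer, and the server initiates all RDMA data movement.

def server_events(op):
    if op == "write":
        return [
            "receive client write request (interrupt)",
            "issue RDMA read of the client's buffer",     # server-initiated pull
            "RDMA read completion (interrupt)",
            "run I/O stack, send response",
        ]
    if op == "read":
        return [
            "receive client read request (interrupt)",
            "issue RDMA write into the client's buffer",  # server-initiated push
            "run I/O stack, send response",
        ]
    raise ValueError(op)

def server_interrupts(op):
    return sum("interrupt" in event for event in server_events(op))
```

Two server interrupts per write and one per read, matching the lightning bolts described above.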
Starting point is 00:22:46 Latencies come from those things, though. There are undesirable latency contributions from the interrupts themselves. It takes time to interrupt the CPU and switch context. And then perform the work request. These are complex upper layer operations that need to be scheduled and processed
Starting point is 00:23:02 in software. So server request processing and also that actual RDMA handling, like, for instance, that RDMA read up here at the top with the double lines. CPU time is required to do it. I/O stack processing is required on the back side. And data copies, potentially. Even in the RDMA case, there may be data copies for buffer management purposes. And the question that I hope to answer is, can we reduce or remove all of the above
Starting point is 00:23:30 when we have a persistent memory device available to us? So I argue that the logical way to do this is to extend the RDMA storage protocols. And interestingly, when you head down this route, you discover that you have to extend some other protocols as well. So I'm going to map a little picture of basically three steps of protocols that may change as we address this. I believe that we can actually achieve success at each step. I think we will have a market increase at each step.
Starting point is 00:24:07 So that's good. We have a roadmap ahead of us. And I'd like to see a lot of protocols do this. I think this is a compelling architectural approach, not specific to SMB. So it starts with something that I will call push mode. Currently, as we saw, the client doesn't do any RDMA. The server performs all the RDMA. So the client actually has to say,
Starting point is 00:24:36 please server, do the RDMA, and the server then performs it. So that's cycles on the server, right? Something has to wake up the server and run them. However, with push mode, this is done by a few other protocols in the past. The reverse happens. The client requests the server pre-register some region
Starting point is 00:24:59 for its use for I/O, right? It's a named region. It probably is a file or a segment of a large persistent memory device. But it says, please register. That's a one-time operation. And the server will register it and return a handle, an RDMA handle, to the client. The client will then proceed to perform all RDMA. Around the first dotted line, he'll do a push, right, in which one or more RDMA writes occur,
Starting point is 00:25:29 and then some sort of commit, in the case of persistent memory, will occur. That's what I'm going to talk about next. In the case of a read, it's really straightforward. If the client has permission, he just reads it, right? It's an RDMA read. Neither one of these, you'll notice, interrupts the server at all. There's just a blank space to the right. Right?
Starting point is 00:25:48 So the server just kind of registered the memory and said, party on. Right? You're allowed to access this memory. I've authenticated you. I've opened this region. I've given you a handle to the region. Party on. However, as that happens, things may happen.
Starting point is 00:26:06 Maybe the client is actually writing a file. Files have metadata, right? They have change times, they have attributes, they have sizes, they have all these weird things. So every once in a while, the client needs to let the server know that something happened in this window that the server opened. So the client will periodically update the server via the master protocol, right?
Starting point is 00:26:28 And SMB, iSCSI, and NFS all do this. You basically set attributes or whatever. But the point being that from time to time, the management of that file object needs to kick in as an upper layer protocol exchange. That's actually not shown here. As well, the server occasionally has to call back to the client. Maybe the server, maybe the NIC is about to be hot-plugged,
Starting point is 00:26:52 or maybe the resource limits on the server mean I can only open a 1 gig window instead of a 4 gig window or something. So the server needs occasionally to call back to revoke those things. Closing the connection, tearing down the handle is kind of rude, but all these protocols already support these callbacks. So we're going to propose overriding or extending some of these callbacks additionally to allow the server to manage this in the presence of a client remote access. But in all cases, the client will signal the server when it's done to simply close.
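A push-mode session as hypothesized here might be sketched like this; every name is invented, since no such protocol extension exists yet:

```python
# Hypothetical push mode: one setup exchange, client-initiated RDMA for
# all data movement (the server is never interrupted), one teardown.

class PushModeSession:
    def __init__(self):
        self.server_wakeups = 0   # upper-layer exchanges the server must process
        self.rdma_ops = 0         # client-initiated, invisible to the server

    def setup(self):              # register region, return an RDMA handle
        self.server_wakeups += 1

    def push(self):               # RDMA write straight into the region
        self.rdma_ops += 1

    def commit(self):             # durability point, ideally also RDMA-level
        self.rdma_ops += 1

    def finish(self):             # close, undoing the setup state
        self.server_wakeups += 1

s = PushModeSession()
s.setup()
for _ in range(1000):
    s.push()
s.commit()
s.finish()
# 1001 data-path operations and only two server wakeups, at setup and finish.
```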
Starting point is 00:27:29 So the idea is that there's one setup exchange and one teardown exchange. There are a bunch of invisible zero overhead data transfer exchanges, and there's some optional metadata exchanges in the middle. And so I can hypothesize a little bit what those look like by showing you three examples. Does a write lock then occur at the NFS level? How do you keep two clients from writing over the same? They'd better be careful. They'd better open that thing. They have direct access to that thing.
Starting point is 00:27:57 So they'd better use that upper layer protocol to exclude one another or to take locks with one another or to somehow coordinate with one another. So it's done at the next layer? It is definitely done at the next layer. There's no way the RDMA layer can or should do such a thing. It's a transport. Its job is to get a bit from point A to point B, not to handle higher layer semantics. So that's what these file protocols are for, these file and/or block. So they can exclude, they can do other things. But once you open that window, it's party on at that point.
Starting point is 00:28:34 There's no synchronization at that point. Yes, question? So basically here, instead of acking for every IO operation, you are basically putting them in a bunch and after a certain number of periods, you will say that, okay, I've done this for you? The question is, would we, in this protocol change, ack each transfer or would we batch the transfers and perform them as a single batch, a single acknowledged batch?
Starting point is 00:29:01 The answer is, it depends. The client can choose to send a single request and then notify the server, or it can choose to send a large number of requests and notify the server. It all depends on the API that's driving it. That's part of the workload discussion that I want to touch on before I'm done. Would the acknowledgments be separate from the RDMA semantics? Absolutely. The ordering of the RDMA is all that manages the presence or durability of the data. The higher layer exchange as to the file state is a matter for the file protocol.
Starting point is 00:29:38 So here are the three examples. SMB3 push mode. It's hypothetical. There's really no such thing. This is just a figment of my own imagination there's a setup operation, remember that setup operation that was the first one, it's a new create
Starting point is 00:29:52 context, that's something we decorate creates with in SMB3, or it's a new FS control, we've opened the file and then we perform some control on the file. Its job is to register and advertise a writeable and/or readable file by handle. You've created this higher level file object
Starting point is 00:30:10 or region object and you've registered it with RDMA and advertised it back to the client. It could be directly to a region of PM. Maybe the name is literally an offset of the PM device. I want PM device 5 offset 2 gig. So whatever it is,
Starting point is 00:30:28 whatever name it would be, you'll get. An example of that latter one is the way we do Hyper-V live migration. There's basically a big long GUID that's tackled and protected. And the Hyper-V client will open the memory image of the destination,
Starting point is 00:30:44 literally by that UUID, that GUID, and write it. The setup operation will take a lease or some sort of lease-like ownership, right? It will reserve the region. It will reserve the right to call back and modify that authority that was granted. Reads and writes are that raw RDMA. Client reads and writes directly via RDMA. That isn't the SMB3 protocol at all. That's raw RDMA under the covers. It's totally invisible to the server. Commit, though, is more important.
Starting point is 00:31:17 Commit is when the client requests durability. That's the little glitch in persistent memory. You have to get to a durable point from time to time. And so there will be a new commit operation, ideally performed via RDMA, but there are similar operations in the SMB protocol already. There's a flush operation that basically writes cached data to disk. That's very similar to what we're talking about here.
Starting point is 00:31:40 It might be cached in the hardware at this point on its way via RDMA, but it needs to be committed. It needs to be placed into stable, durable storage. So one could consider overriding the SMB flush operation. However, that would require interrupting the processor. If you have frequent commits, you might want to get around that at a lower layer. So I'm trying to give a little peek into the future of how we might move stepwise toward an ideal goal. We could start with a message and proceed to a full extended RDMA exchange.
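The cost of frequent commits through an interrupting flush versus a hypothetical lower-layer commit can be compared with assumed numbers; none of these costs are measurements from the talk:

```python
# Sketch of the commit-path tradeoff: an upper-layer flush interrupts the
# server processor on every commit, while an RDMA-level commit would not.

def commit_overhead_us(n_commits, via, flush_us=20.0, rdma_us=2.0):
    """Total commit overhead for n_commits durability points."""
    per_commit = flush_us if via == "flush" else rdma_us
    return n_commits * per_commit

flush_path = commit_overhead_us(100, via="flush")  # interrupt per commit
rdma_path = commit_overhead_us(100, via="rdma")    # hypothetical extension
```

With these assumed costs, a commit-heavy workload pays an order of magnitude more on the flush path, which is why a lower-layer commit is worth pursuing stepwise.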
Starting point is 00:32:12 The callback, server-demanded client access is similar to the current Oplock and lease break. The server says, this handle has to change. You can't have that region anymore or I need to do something to your permissions on that region. Things like that. And finally, the finish, the client access is complete. It could be an SMB close, it could be a lease manipulation of some kind. Whatever state we decided up here at the beginning, that
Starting point is 00:32:35 new create context or FS control would basically be undone down here at the bottom. You can do the same thing with these other two protocols. It smells a little different, and I'm still being completely hypothetical, so don't view this as a protocol proposal. But NFS RDMA can do the same darned thing. It has perhaps a new NFS v4.x operation to set up that thing by opening, registering
Starting point is 00:33:06 and advertising a writable, readable file or region. It may offer a delegation or PNFS could do this. The PNFS has layouts that allow a very different storage protocol to be used under the auspices
Starting point is 00:33:22 of the NFS 4 protocol. You could define an RDMA PM-aware layout protocol, and it could simply speak RDMA, right? It wouldn't actually have any messages that weren't RDMA in it, but it was, if you will, the door to it. So PNFS is kind of a big ball of complexity on top of an already complex NFS 4. It may not be the best implementation approach, but it's probably the best architectural approach
Starting point is 00:33:52 in the way NFSv4 has been implemented. Anyway, the rest of it behaves just like our previous example. Writes and reads are direct RDMA access by the client. The client occasionally requests durability via a commit. If it has an RDMA extension, it can do it. If it doesn't, it can say, dear Mr. NFS server, could you please commit the
Starting point is 00:34:13 data that I've been writing and here are the ranges that I think I've dirtied. It has a callback similar to the current delegation or if it's a PNFS approach, a layout recall that says you can't have that anymore or I'm rearranging them, I'm giving you new memory addresses or whatever it is.
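The dirty-range bookkeeping just described, where the client tracks what it has written via RDMA and hands the server a list of ranges to commit, could look something like this minimal Python sketch. The class and method names here are invented for illustration; nothing like this exists in the NFS specifications.

```python
# Illustrative sketch only: a client-side tracker for byte ranges dirtied by
# direct RDMA writes, drained into a hypothetical commit request. The names
# DirtyTracker, record_write, and drain are invented, not NFS operations.

class DirtyTracker:
    """Accumulates (offset, length) ranges written via RDMA."""

    def __init__(self):
        self.ranges = []  # raw (offset, length) tuples, in write order

    def record_write(self, offset, length):
        # Each direct RDMA write just notes the range it touched.
        self.ranges.append((offset, length))

    def drain(self):
        """Coalesce overlapping/adjacent ranges for a commit request,
        then reset the tracker."""
        out = []
        for off, ln in sorted(self.ranges):
            if out and off <= out[-1][0] + out[-1][1]:
                last_off, last_ln = out.pop()
                end = max(off + ln, last_off + last_ln)
                out.append((last_off, end - last_off))
            else:
                out.append((off, ln))
        self.ranges = []
        return out

tracker = DirtyTracker()
tracker.record_write(0, 4096)
tracker.record_write(4096, 4096)   # abuts the first write: coalesced
tracker.record_write(65536, 512)   # disjoint: kept separate
print(tracker.drain())             # [(0, 8192), (65536, 512)]
```

The drained list is exactly the "here are the ranges that I think I've dirtied" payload the talk imagines the client sending alongside its commit.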
Starting point is 00:34:30 That's old hat for layouts. We do that with block devices on pNFS today. And finally, the finish could be an NFSv4 close or a delegreturn or a layoutreturn, depending on the choice of implementation. Chuck, question? I'm just going to toot my horn a little bit. Tomorrow I'm giving a talk
Starting point is 00:34:50 where we can continue this discussion. I have a slightly different approach to the problem, but this fits right in with some of the things I'll talk about tomorrow. If anyone's interested in following up on this,
Starting point is 00:35:07 come to my talk tomorrow morning at 10.30. Sorry, who are you? I'm Chuck. Chuck Lever from Oracle. And Chuck and I, I will say, have not talked at all about this. So if we're thinking on the same lines, well, great minds think alike. I don't know.
Starting point is 00:35:24 We'll find out tomorrow afternoon. There's a third protocol, iSCSI. So iSCSI is an interesting one. iSER is an adaptation of iSCSI to RDMA. And it basically modifies the data-moving behavior of an iSCSI implementation to be compatible with RDMA. There's a datamover architecture layer, which is kind of a conceptual layer that modifies what in raw SCSI might be SCSI operations,
Starting point is 00:35:53 SCSI messages, and applies RDMA-specific rules to their processing. So this one's a little fuzzy. Some of these are iSCSI or SCSI level operations, and some of these are iSER level abstractions. But I'm going to argue that it runs almost the same way, that setup will be a new iSER operation. And I believe it's an iSER operation because it needs to register memory and return a memory handle. That's an iSER function, not an iSCSI function. Writes are a new datamover model. The datamover currently uses something called solicited data to transfer things via RDMA. Solicited means that, you know, I have the data, but it's in a buffer
Starting point is 00:36:39 somewhere and you need to come fetch it or store it for me, right? Unsolicited is when it's inline and it's just offered as part of the operation. So the architecture currently does not define these operations. There's no such thing as an unsolicited data-in except for things that come, you know, inline to the message. So the datamover will need a little bit of thought, and maybe the SCSI iSER architecture will need a little bit of thought. But the idea is exactly the same. Implement an RDMA write within the initiator, no target involvement whatsoever,
Starting point is 00:37:15 straight into a target buffer. The R2T processing is just not going to occur. There's nothing on the wire for R2T processing besides that operation. That's because the target sends R2Ts, and the target didn't do anything here. Read, same way. It's an unsolicited data-in operation. We implement the RDMA read from the initiator, from the target's memory. We don't tell the target it even happened.
Starting point is 00:37:40 Commit is a new, possibly modified iSCSI, possibly a new iSER operation. It performs a commit. It's kind of like a FUA, right? It's a flush. It's a thing where you've got a bunch of data that you need to commit onto the drive. Somebody needs to think that through. Callback is maybe a SCSI unit attention.
Starting point is 00:37:57 It smells kind of like a unit attention to me. Something happens. Somebody flipped the write-protect bit. Oops. That kind of thing. It seems to me that that signaling could be overloaded here. And finish is almost undoubtedly a new iSER operation, because the setup is one.
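The same six phases have now been walked through for all three protocols. As a compact restatement, they could be tabulated like this; every operation named below is speculative, taken from the hypothetical mappings in the talk, and none of them exist in the published SMB, NFS, or iSCSI/iSER specifications.

```python
# Speculative summary of the talk's hypothetical per-protocol mappings.
# None of these operations are real; they restate the sketch above.

PHASES = ["setup", "write", "read", "commit", "callback", "finish"]

MAPPING = {
    "SMB3/SMB Direct": {
        "setup": "new create context or FSCTL; register and advertise a region",
        "write": "direct RDMA write by the client",
        "read": "direct RDMA read by the client",
        "commit": "overridden flush, or a lower-layer RDMA commit",
        "callback": "oplock or lease break from the server",
        "finish": "close or lease manipulation; undoes the setup state",
    },
    "NFS/RDMA": {
        "setup": "new NFSv4.x operation, delegation, or pNFS layout",
        "write": "direct RDMA write by the client",
        "read": "direct RDMA read by the client",
        "commit": "RDMA commit if available, else commit with dirty ranges",
        "callback": "delegation recall or layout recall",
        "finish": "close, delegreturn, or layoutreturn",
    },
    "iSCSI/iSER": {
        "setup": "new iSER operation: register memory, return a handle",
        "write": "unsolicited data-out via initiator RDMA write, no R2T",
        "read": "unsolicited data-in via initiator RDMA read",
        "commit": "new or modified FUA-like flush operation",
        "callback": "overloaded unit attention",
        "finish": "new iSER operation undoing the setup",
    },
}

# Sanity check: every protocol covers the same six phases, in order.
for proto, ops in MAPPING.items():
    assert list(ops) == PHASES, proto
```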
Starting point is 00:38:18 If we have those, we can actually use them by signaling the processor, but we can avoid more latencies if we start to go down the stack. The first is the RDMA protocol. The RDMA protocol has no concept of durability. It's simply not there. This wasn't on the list when RDMA was being developed. There are things called placement and completion and delivery, which are all about getting the packet to the stack, having the stack put the packet in memory,
Starting point is 00:38:54 not necessarily durable memory, and to complete, to send some signal to the receiver. Well, durability is new, and I argue that RDMA write alone is not sufficient to provide this semantic. In RDMA speak, pardon my geek, a completion at the sender does not mean that any particular
Starting point is 00:39:18 data was placed. The only thing it means at the sender is that the buffer can be reused. The NIC has taken a copy of your buffer and promises to do its best to get it to the destination. So the sender doesn't know anything, literally. All the sender knows is that the NIC tells him he can reuse the buffer. It doesn't tell him where the data is beyond that boundary.
Starting point is 00:39:41 Question on the right. It also tells the sender that the data is visible. No, no, no. Completion at the sender doesn't have anything to do with visibility. What about visibility to the network
Starting point is 00:39:56 and the target? So for example, if you did an RDMA write and got a completion on it, you know that another host can see that RDMA write. That's true for some NICs. Not all.
Starting point is 00:40:13 We can maybe discuss this one offline. I know where it will go. I will assert, perhaps without proof, that the send doesn't tell you much at all. You can't use it.
Starting point is 00:40:29 You definitely can't use it to imply durability on the remote side, which is really all we care about if we're going to use this. So we have to be very careful with this. Processing at the receiver, a completion at the receiver, means something different. The receiver begins to process it when the data is accepted. As an RDMA write comes in, he begins to push it. But the protocol doesn't make a promise for where it is on the bus.
Starting point is 00:40:58 That's implementation defined, how far it's gone. Also, I'll mention that segments can be reordered by either the wire or the bus so they arrive in funny order. You can't just because ABC gets written doesn't mean that they appear as ABC in memory at the pier. There are a lot of open windows in the
Starting point is 00:41:16 RDMA protocol to allow that to happen without buffering. They want to flow these through so if they get reordered on the wire they may get reordered on the bus during write. Only in RDMA completion that the receiver guarantees placement. However, placement does not equal durability. Placement means that it was issued to the memory bus and that it's visible to other devices on that bus.
Starting point is 00:41:39 It doesn't mean that it's actually reached the storage cell. And so we've spent a lot of time talking about this in the SNEA NVM twig, and we're going to continue to talk about this quite a bit more. Certain platform-specific guarantees can be made, but the client cannot know them. That's really important. We want the client to be able to do the same thing no matter what server he's talking to.
Starting point is 00:42:02 And so if the server plays clever tricks, and Chet's going to talk about some of those clever tricks, that's great, but the client can't know them. So what do we need to do? I think there are two obvious possibilities. The first is to extend RDMA write and to add some sort of placement acknowledge, a push bit in RDMA write.
Starting point is 00:42:22 And it seemed like a good idea until I started thinking about it. It has a big disadvantage. The advantage is it's simple, right? You set the push bit. Okay. You know, as an API, I think that's easy, right? But then you say, wait a minute. If I set the push bit, it changes the RDMA write semantic, right? It means that I have to wait for it to appear on the other end and I get an acknowledgement. Well, guess what? RDMA write doesn't actually have an acknowledgement. There's a transport-level ACK, but there's no upper layer ACK in RDMA-Write. RDMA-Write is a one-way stream.
Starting point is 00:42:53 Only operations like RDMA read and atomic have a reply. So it changes the semantic, plus it flow-controls that write. RDMA writes are not flow-controlled in current RDMA designs. So it requires significant changes to the RDMA write hardware design. I believe that makes it very undesirable. Blocking the send work queue, that's really undesirable. So I think the other possibility is a new operation, an RDMA commit. Flow-controlled and acknowledged, like an RDMA read or atomic,
Starting point is 00:43:23 it's actually a two-way trip on the wire, right? A commit and a commit-ack. And the disadvantage, obviously: a new operation requires a more significant protocol extension. But the advantage is it has a very simple API. It's a flush. We can specify exactly what regions to push. It
Starting point is 00:43:39 preserves the write semantic because it doesn't touch the write. And it acks only, well, I'll mention that in the next slide, when the operation is actually complete. So to drill down on the RDMA commit, which I argue is the only sensible way to go: it's a new wire operation. It's implementable in any protocol, I believe.
Starting point is 00:44:01 The initiating RNIC, right, the sender, provides a region list and other commit parameters, TBD, under control of the local API. So I say I want to do optimized flush. That's this NVM operation that we talk about. It provides a list of addresses and ranges and sends it to the remote and says, I want to make this durable. So the receiving NIC gets the operation and queues it in order. It waits for all previous writes that might have touched that region and puts the commit after them.
Starting point is 00:44:33 So it behaves a lot like an RDMA read or atomic. You know that if you perform a write and then a read, you'll read the data you wrote. That's an ordering guarantee. This is similar, that if I perform a write and then a commit, that I've committed the data that I wrote. Very simple. It's subject to flow control and ordering, which is
Starting point is 00:44:51 very important, because these commits may take time. They need to block on the receiver. The RNIC pushes the pending writes. It might flush all writes. It might not track the regions. It might just say, ah, I'm just going to push everything that's dirty. That'll get really painful if we have terabyte-sized DIMMs.
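The ordering rule described here, that a commit is queued behind every earlier write and acknowledged only once the named regions are durable, can be sketched as a toy receiver. This is a hypothetical model of the proposed (and at the time unstandardized) RDMA commit; every name in it is invented.

```python
# Toy model of the proposed RDMA commit ordering rule: the receiving RNIC
# processes its operation stream in order, so a commit observes every write
# queued before it, flushes the named regions to durable media, and only
# then produces a commit-ack. All names are invented for illustration.

from collections import deque

class ToyRNIC:
    def __init__(self, size):
        self.memory = bytearray(size)   # placed, but possibly volatile
        self.media = bytearray(size)    # the durable copy
        self.queue = deque()            # in-order operation stream

    def rdma_write(self, offset, data):
        self.queue.append(("write", offset, data))

    def rdma_commit(self, regions):
        self.queue.append(("commit", regions))

    def process(self):
        """Drain the queue in order; return responses for commits only,
        since RDMA writes produce no upper-layer reply."""
        responses = []
        for op in list(self.queue):
            if op[0] == "write":
                _, offset, data = op
                self.memory[offset:offset + len(data)] = data
            else:
                # Every earlier write has already been processed, mirroring
                # the write-then-read ordering guarantee the talk cites.
                _, regions = op
                for offset, length in regions:
                    self.media[offset:offset + length] = self.memory[offset:offset + length]
                responses.append(("commit-ack", regions))
        self.queue.clear()
        return responses

nic = ToyRNIC(16)
nic.rdma_write(0, b"hello")
nic.rdma_commit([(0, 5)])          # ordered after the write, like read-after-write
acks = nic.process()
assert bytes(nic.media[0:5]) == b"hello"   # durable only once the ack exists
assert acks == [("commit-ack", [(0, 5)])]
```

A region-ignoring implementation would simply flush everything dirty on commit, which, as noted, gets painful at terabyte DIMM sizes.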
Starting point is 00:45:09 Shared regions. The RNIC performs the commit. Well, how does it do that? That's what I'll talk about in a second. And it responds only when durability is assured. I'm running low on time, and I do want to leave a little bit for questions.
Starting point is 00:45:26 There may be some local extensions as well. I'm just going to skip over this, I think. I think there are platform-specific attributes that are only the business of the platform that registered the memory. I don't think that the client has any business knowing what he's writing to.
Starting point is 00:45:41 It might be a PCI device, it might be raw memory, it might be DRAM that doesn't need a commit at all. But there are hints that can be given locally that will allow the protocols not to even care. There is a third piece, there is a third protocol that's important, and that's the PCI protocol. Most RDMA NICs are plugged into a PCI bus. Guess what? There's no commit in PCI. You can issue your rights, but they behave kind of like RDMA. They're fire and forget.
Starting point is 00:46:14 You really don't know what happens to them after they return. And that's awkward if you have to make a durability guarantee. So PCI does need an extension, I believe. If we don't have a PCI extension, however, we can interrupt the processor. The processor knows how to do this. It's undesirable to interrupt the processor, but it might be necessary in the short term.
Starting point is 00:46:38 It depends on the platform into which the RNIC is plugged. Once again, that's the business of the platform, not of the issuing client on the other side of the wire. All he wants to know is: I commit, you return when it's done. And then it's the platform at the other end that figures out how to implement that commit. The expected goal, if we can build this whole stack with an upper-layer file or block protocol, with an RDMA commit extension and with a PCI commit extension to match it, is that we'll get a single-digit-microsecond remote write and commit.
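The "platform decides" point can be made concrete with a short sketch: the client issues one kind of commit, and the server platform chooses between a hypothetical bus-level flush and the near-term processor-interrupt fallback. Both the PCIe flush primitive and every name below are assumptions for illustration; no such PCI commit existed at the time of the talk.

```python
# Sketch only: two server platforms implement the same client-visible commit
# differently. "pcie-flush" stands in for a hypothetical future PCI commit
# primitive; the interrupt path is the fallback the talk describes.

def platform_commit(regions, pcie_flush_supported, log):
    """Make the given regions durable; how is the platform's business."""
    if pcie_flush_supported:
        log.append("pcie-flush")      # hypothetical bus-level commit, no CPU
    else:
        log.append("cpu-interrupt")   # near-term fallback: wake the processor
        log.append("cpu-flushes-caches")
    return "commit-ack"               # the only thing the client ever observes

log_a, log_b = [], []
# Two different server platforms, one identical client request:
assert platform_commit([(0, 4096)], True, log_a) == "commit-ack"
assert platform_commit([(0, 4096)], False, log_b) == "commit-ack"
assert log_a == ["pcie-flush"]
assert log_b == ["cpu-interrupt", "cpu-flushes-caches"]
```

The client-side contract is identical in both runs, which is exactly the point: the semantics live in the protocol, the mechanism lives in the platform.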
Starting point is 00:47:16 The same kind of thing we can get with raw RDMA today. There might be some additional overhead for the actual commit within the device, but we're basically talking about the wire transfer as the biggest chunk of that fixed budget here. Chet's slides in about an hour will give a lot of detail on this. He'll actually pick apart those individual contributions and how they can be provided and what they are, or at least estimate what they are. I'll mention that remote read is also possible, and it'll have an even lower latency than
Starting point is 00:47:48 write with commit, because it doesn't have to do the commit, it just has to read. There's no server interrupt once we get the RDMA and PCI Express extensions in place, and there's only a single client interrupt per operation, which we can even moderate and batch out. So we're talking a very efficient protocol here, all under the auspices of this fairly rich and complex set of protocols up on top. We don't have to introduce a new protocol here. We just have to make small extensions to the protocols we have. Last slide. There are a few open questions. How do we get to the right semantic?
Starting point is 00:48:25 There are discussions in multiple standards groups; how do they get coordinated? Implementation in the hardware ecosystem, this is going to take time. The NIC vendors, the platform vendors, all kinds of systems will have to change to really get where this promise will
Starting point is 00:48:42 lead them. I believe that we need to drive consensus from the upper layers, the storage protocols and these APIs, these new APIs, down to the lower layers. I don't want to let the lower layers tell the upper layers what to do here. I want those upper layers to have a really good idea of what they need. What about new API semantics? Does NVML add new requirements?
Starting point is 00:49:02 We should think that through. Do those underlying file systems on the storage side of it add new requirements? I don't know. DAX, the Linux version, and DAX, the Windows version. And other semantics, are these upper-layer issues? Authentication, integrity, encryption, virtualization. These are really important deployment questions.
Starting point is 00:49:22 I'm done. Question? Can you do multicast with this? I mean, you mentioned reliability. Is it possible to leverage multicast? There are RDMA standards for multicast.
Starting point is 00:49:41 They are typically connectionless, which is not the model that all these protocols use. And the upper layer protocols are really not one-to-many protocols. These are client-server protocols or initiator-target protocols. So while I think you could use multicast, I think it would introduce quite a few challenges
Starting point is 00:49:58 up and down the stack. Question? I think one area that you focused on is how to make the commit to the actual persistent storage. So even the traditional storage arrays that we see, they also do not guarantee that once they ack a write, it's written to the media. They have a huge cache sitting in front, which will cache the writes. We are really talking about this problem
Starting point is 00:50:29 only for RDMA. In particular, I don't think we are talking about it only for RDMA. So let me restate the question: why are we having this discussion about the actual transfer? Why don't we talk about the implementation all the way down to the durable media, how that happens, and what the rules for that are? Way back in here, well,
Starting point is 00:50:50 it's just going to take me a long time to switch all the way back. There are some separate back ends on the right, and they're not all memory, right? Some of them are traditional IO. Some of them are memory. They'll all behave exactly the same way. You may RDMA into memory and then the server may actually move it to the durable medium. That's why commit is so important, right? And maybe you can't bypass commit when it lives on such a back-end media. That's why the mode bit in the memory registration was there. What special processing also needs to happen? If that's the case, you may have to interrupt the server, and the server may have to perform a traditional IO stack operation. Once again, that's
Starting point is 00:51:31 not for the client to know. All the client knows is that, semantically, I've issued some writes, and I've committed them, and they're guaranteed durable, right? The client will issue the same set of requests in either case. The server and the networks below it will figure out how to implement that request. So really, I'm not trying to be weaselly. That's out of scope for the proposal here. That's probably internal magic on the part of the server rather than something that the protocol specifies. All right. Thank you. I'll be available to talk. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe
Starting point is 00:52:17 at snia.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
