Storage Developer Conference - #99: SNIA Nonvolatile Memory Programming TWG - Remote Persistent Memory

Episode Date: June 17, 2019

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 99. So I'm Tom Talpey. I'm with Microsoft. This is not a Microsoft talk. This is a SNIA talk. I'm a member of the SNIA Non-Volatile Memory Programming TWG. I've been contributing to it for several years, and it's been a really rewarding experience.
Starting point is 00:00:54 I'm going to talk about where we're at with the remote flavor of the NVM programming model, which is interestingly different from the local flavor and which has a number of dependencies. You know, Andy talked about the dependencies on operating systems and platforms. Well, now we're going to add networks and network protocols to the picture. So it is, you know, increasingly complex. I'm going to talk about basically four things. The SNIA NVMP TWG activities, what we're doing, what we've done. A little bit of review.
Starting point is 00:01:27 Andy covered this, but I wasn't sure everybody would be in both talks. Remote access for high availability work. That's our remote programming document, the remote access for high availability. It's a separate document. It's a white paper, not a technical specification. The technical specification may incorporate this paper now that we understand it better. It's really come a long way. RDMA requirements and extensions.
Starting point is 00:01:50 I'm going to dive down a layer and talk about the RDMA layer and what implications the NVM programming model has on the RDMA layer. That's where it will really get interesting, I think. And then wrap it up with a little bit of current and future work for the remote programming TWG. So to kick it off, the mission. The mission was to accelerate the availability of software that enables persistent memory hardware, right? This is very generic boilerplate,
Starting point is 00:02:19 but the hardware included SSDs. That's the disks that Andy was talking about. All right, this worked a minute ago. Yeah, but it's red. This is annoying. It just worked. I just tested it. This is the better one that works, I guess.
Starting point is 00:02:40 So SSDs, disk drives, or emulated disk drives, and persistent memory itself, the software spanning applications and OS. Our mission was to create the NVM programming model to describe application-visible behaviors, right? It's an interface. It's way up at the top. And allow APIs to align with OSs
Starting point is 00:02:58 and describe the opportunities. The NVM programming model, the most recent version, version 1.2, was approved by SNIA just a little over a year ago. It's been out there for a year. We enable block and file, that's the disk drive and the memory mapped versions, with things like atomic capability and granularity, thin provisioning, all these things which are important to devices, storage devices. It uses memory map files to expose persistent memory to applications, etc., etc. It's a programming model, not an API, as Andy described. The programming model methods, these guys on the left are the block mode,
Starting point is 00:03:42 and we called it innovation. It doesn't feel like innovation anymore. Andy has repaired the brighter, greener pointer. Thank you. And so I'm not going to talk about these, but this is the disk drive emulation version. This is the so-called RAM disk or similar sorts of things. And the one we're really interested in are these emerging PM technologies where we insert this little PM in here. We have a PM file and a PM volume.
Starting point is 00:04:08 And really all the attention is on PM file. This thing gets brighter as you hold the button down. So anyway, block and file modes use I.O., traditional storage model. Volume and PM modes use a mapped model with load store move. And here's a great big long URL that you can go see the current version of the specification. We're currently working on this quite a bit. We've been very busy for the last year, so that's what I'm going to talk about. The remote access for HA, this is our remote version or remote addendum, if you will. It's a white paper that builds on
Starting point is 00:04:42 the 1.2 model. We actually published it back in 2016, and we've been doing quite a bit of work in the background on it. The specification 1.3, right? We released 1.2, and 1.3 is in development. We're updating that specification, and we're incorporating learning from the remote access white paper. And a couple of the specific things are asynchronous flush and ordering and error handling,
Starting point is 00:05:07 which is what I'm going to talk about. And finally, we have this brand new, just announced in the last six weeks, four weeks, something like that, a collaboration with the OpenFabrics Alliance, the OFIWG. The OFIWG is the OpenFabrics Interfaces working group, if you will, the sort of RDMA verbs or RDMA interface corollary to what we're doing here with persistent memory. And we're going to develop a new version of the programming of
Starting point is 00:05:33 remote access for HA, but we're going to do it in conjunction with this collaboration. So we're going to get feedback from the RDMA community on their lower level APIs. And we're going to expand our use case enumeration. We have a couple of very simple use cases. We're going to try to make these more thorough, complete, deep, et cetera. All right. Persistent memory and adding remote. I'm just going to do this little plus a remote thing. So here's the picture, a little different from what Andy showed, but the same basic concept. We have a user mode persistent memory aware app, and it has two ways. It has file APIs that dive into PM aware file systems, and it has a direct memory load store, right? So it basically m-maps
Starting point is 00:06:18 the file, proceeds to do loads and stores, and goes back to do syncs, all right? This is fairly generic and high level. I'm not going to go into any more detail on that. However, we're going to add remote. And so the basic idea is we add an RDMA data path through an RDMA NIC to a remote device, and we add RDMA operations that flow either from the top-level user interface, maybe hook the file APIs,
Starting point is 00:06:48 maybe a direct API. It can also come from the PM-aware file system. And I drew a dotted line, but it can also come from the PM driver. This is the device DAX model. We don't actually cover this in the SNIA NVM programming TWG right now because we pretty much chose not to. But this path is available. So there are many ways to drive RDMA operations to a remote PM device. And I am going to explore this. And the TWG has been exploring this.
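To make the local map/load-store/sync flow concrete before the remote pieces get added, here is a minimal sketch using plain POSIX mmap and msync. The file path and sizes are made-up examples, and this is only an illustration of the pattern the programming model describes, not the NVM programming API itself.

```c
/* Minimal sketch of the map / load-store / sync pattern.  Plain POSIX
 * calls are used for illustration; on a PM-aware (DAX) file system the
 * mapping goes straight to persistent memory.  The path and size are
 * made-up examples, not part of the specification. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem0/example.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0)
        return 1;
    if (ftruncate(fd, 4096) != 0)
        return 1;

    char *pm = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pm == MAP_FAILED)
        return 1;

    /* "Loads and stores": dirty some bytes directly in the mapping. */
    memcpy(pm, "hello, persistent memory", 25);

    /* "Sync": ask that the dirty range be made durable.  Note the weak
     * error semantics discussed later; msync tells you very little. */
    msync(pm, 4096, MS_SYNC);

    munmap(pm, 4096);
    close(fd);
    return 0;
}
```

The remote case keeps this same application-visible shape; what changes is everything that has to happen underneath the sync.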
Starting point is 00:07:32 PM remote access for HA. High availability. I should have spelled that out in the title. So this is the NVM programming TWG-developed interface for remote persistent memory. It's the analog to our existing persistent memory interface. We are seeking to maximize our alignment with the local PM interface. The local PM interface is what we do today. We want to maximize the alignment. We don't want to write a new version of the remote programming interface, right? That would be a huge mistake. I'm going to write my program to do local,
Starting point is 00:08:02 and then I'm going to write another program to do remote. That's a big mistake. We want to tuck this under the existing interface. And we know we can. We've thought about this long and hard. But that's the goal, to make sure that we have a single interface, a single API. There might be new things that have to happen in the remote case, but we're going to tuck it under that.
Starting point is 00:08:23 We're going to take the remote environment into account, including RDMA semantics and restrictions. We're going to think long and hard about how this can be actually implemented on operating systems or in particular on RDMA layers. The OFIWG will help us a lot with this, check our work, if you will. And we're going to nonetheless scope our interest all the way into the RDMA layer. If the RDMA layer doesn't do something, we're going to think about it and recommend that it do it. And that's actually well established. We're going along. And analyze the error cases. A whole lot of new error cases pop up when you go remote. You not only have a processor bus between you and the device, now you've got a piece of wire between you and the device. So a lot of really interesting things happen.
Starting point is 00:09:14 I'll talk about it in a little bit. So this is the outline of our existing document. I just put it up here so that you get a feel for the way the document is put together and what it looks at. It expands out into some details, but actually some of the more important sections are not expanded, right? Down here, five and six are kind of the meat of it. But it starts with purpose and scope, and we have a taxonomy.
Starting point is 00:09:46 This recoverability definition is really important, though. And if you're going to read the document, I encourage you to think about these a little bit. Durability versus availability. Durability means that we've stored it persistently somewhere, and it'll come back after the lights go out, right? But availability is different, and we try to tease apart the meanings here, right? That the data can be mirrored for availability, and it can be durable in a separate coordinate system, right? Durability is different from availability.
Starting point is 00:10:22 And so we want to talk about it because high availability, after all, is the title of the document. Consistency, recovery, and integrity, these are more things which come into play in highly available situations, highly available scenarios, right? We want to make sure the data is consistent, that everybody sees the same version of the data, right? We want to make sure that it can be recovered, that the error case is well understood. And we want to make sure that the data has good integrity, right?
Starting point is 00:10:52 That when it's stored, we want to make sure that the right bits were stored and we don't fetch bad bits or we don't inadvertently result in bad bits. So all these things are very important. These are the HA extensions. This is where we start to get interesting. RDMA for HA is how we actually implement those extensions. We talk about security and we have some interesting things. This little guy down here at the bottom, workload generation and measurement, is where we're working with the OFIWG as well, right, as I talked about use cases. All right, map and sync.
Starting point is 00:11:30 These are generic slides that are used a lot in the NVM programming presentations, but these first three major bullets are sort of the fundamental points we make when we go over it in all these venues. But I want to point out that map is a function of local PMEM,
Starting point is 00:11:55 and sync is, the way it's expressed right now in the document, is a function of local PMEM, as Andy was talking about. You map the file, you get it in your address space, you do loads and stores. And that's kind of the paradigm that we've modeled after. But all this stuff
Starting point is 00:12:13 really starts to change when you do it remotely. Your stores do not magically become RDMA writes, nor do you want them to. If I do a store of an integer, I don't want to spit a packet out the back of the machine, right? I want to control the way or batch or somehow manage the way these things flow to the wire. And the implementation of such
Starting point is 00:12:37 a thing would be impossible. You'd have to take a fault and invoke some driver and, you know, your latency story would go out the window, which is not exactly the goal of a memory device, right? You want very low latency. Flushing, which is the sync stage, applies to RDMA just the same, but it flows all the way down the RDMA stack across the network and through the IO pipeline of two machines, right? Your local machine and your remote machine. So there's a whole lot more to talk about now, right? We can't just say, well, you magically flushed your processor cache. Oh, no.
Starting point is 00:13:14 You flushed your processor cache. You performed an RDMA operation, which was carried over the wire between two devices that the CPU, you know, wasn't really a part of. And then it did the same thing on the other side. Okay, now we're talking. We've got a problem or a situation on our hands that we have to discuss. And it brings in a very strong motivation for this thing we call asynchronous flush, which is part of what we're doing in 1.3. And asynchronous flush is the key to batching, is the key to sort of say, you're
Starting point is 00:13:45 ready to do some RDMA operations. I'm not flushing yet, but you're ready to do some RDMA operations. And I'll show you how we use that. Another important thing is failure atomicity. The current processor and memory systems guarantee a sort of inter-process consistency, right? That all the processes running on a platform can all see the same memory. When somebody does a store over here on this side of the motherboard, the other guy on the other side sees that same data. That's consistency or the so-called visibility of the data. But the model, the NVMP model, and the platforms as well, provide only limited atomicity with respect to failure.
Starting point is 00:14:26 So there are a lot of interesting things, and that's where we've really spent a lot of our time in the TWG. Well, the same kind of thing is impacted by the network, right? We stuck the network in between. So now when we perform a remote store, a remote sync, these same kind of failure atomicities come into play, but they are multiplied by the network. Because alignment restrictions may change for, you know, failure atomicity of aligned fundamental data types, right? That suddenly changes when we're bus-attached, right? The bus might be narrower than the processor word width, for instance. And that thing then goes
Starting point is 00:15:05 over a wire, you know, and actually once it, before it goes over the wire and after it emerges, it goes over a PCI Express IO bus. And all these things have different atomicities and different behaviors, different ordering requirements and things. And so all these things come back to haunt the failure atomicity. And so we have to look at new remote scenarios and reflect network failure into this. We originally just looked at the local platform behavior. Like Andy said, you do a store that splits across two cache lines. What if the light goes out? Well, now I'm doing a store that not only splits across those cache lines,
Starting point is 00:15:43 it splits across packets and remote operations and things. What happens when maybe the lights don't even go out? Maybe the network just interrupted, right? So we have a whole lot of brand new and very interesting scenarios to explore. And consistency for recoverability. This is another interesting section of the NVM programming spec, the 1.2 spec. As Andy mentioned in response to a question about transactions, there's an application level of consistency as well. The hardware can't magically know what the application thinks is a safe recovery point.
Starting point is 00:16:20 The application has to manage that. And so these sorts of things, these transactions and consistency models of the platform, are really important to the application. How do they change in a remote scenario? And so these are all things that we need to look at and think through. And that's what we do as a TWG. It's really interesting stuff. I'm trying to encourage people to participate here, if you haven't gotten the picture yet. Okay, some key remotable NVM programming interfaces.
Starting point is 00:16:52 Some of them you already know about. In NVMP 1.2, we have this thing called optimized flush. The optimized part of it meant that you could do it from user space. You could do it in a very highly optimized, lightweight way as an application. A remote optimized flush is a little harder to do from user space. When you have an RDMA adapter, you've got to issue some RDMA requests and all, but it directly maps to RDMA. We can easily do this if we only extend RDMA a little bit. I'll talk about what that means in a second.
Starting point is 00:17:21 There's also this second version of Optimized Flush called Optimized Flush and Verify. This and Verify may sound like a really strong thing. It's actually quite weak. It's kind of a best effort verify. It's sort of, if you detected an error while you were flushing, please tell me about it before you complete the operation. And the strength of that Verify is becoming more and more important. We believe it's way too weak. The reason it was weak in the original spec is that the operating systems provide a weak semantic. Msync is void, right? You call msync and it can tell you you had a bad file descriptor or something like that, but it doesn't tell you anything about the data. It basically starts the data flushing, right? And you have to wait
Starting point is 00:18:06 for that data in another way and collect the error in another way. Msync really doesn't give you a response. You don't really know how long to wait for the error, right? You don't know when the last bit of dirty data goes down there. And you want a stronger verify, and I'll talk about how we can do that, at least with RDMA transfers. And the second thing which needs to be added is async flush and async drain. I have a better slide about this in a sec, so I won't go into too much detail. But async flush initiates flushing for a region, and async drain waits for that. And there are some ordering requirements as well; we need things that follow flush in order to implement transactions.
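For a local analogy to that flush/drain split, PMDK's libpmem already separates the two phases with pmem_flush() and pmem_drain(). The sketch below assumes libpmem is available (link with -lpmem) and that pm points into a persistent memory mapping; it illustrates the two-phase idea, not the TWG's AsyncFlush/AsyncDrain definition, which is still being worked out.

```c
/* Illustration of the flush-then-drain split using PMDK's libpmem:
 * pmem_flush() starts cache lines on their way, pmem_drain() waits for
 * them.  An analogy for the AsyncFlush/AsyncDrain split discussed here,
 * not the NVM programming model API itself.  Link with -lpmem. */
#include <libpmem.h>
#include <string.h>

/* pm is assumed to point into a persistent memory mapping
 * (for example one obtained from pmem_map_file()). */
void log_three_records(char *pm, const char *rec, size_t len)
{
    for (int i = 0; i < 3; i++) {
        memcpy(pm + i * len, rec, len);
        pmem_flush(pm + i * len, len);   /* "giddy-up": schedule the flush early */
    }
    pmem_drain();                        /* one wait, once everything is queued */
}
```

Remotely, the same split is what would let a library start RDMA writes early and pay for only one drain at the end.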
Starting point is 00:18:51 The other NVM programming methods are remotable, but only via upper layers. We don't expect to have the full NVMP interface jammed into the RDMA layer, right? And we don't want to create a new protocol as part of the SNIA work. That's not our goal. Our goal is to influence and set requirements,
Starting point is 00:19:13 but not to design new protocols. So we're going to leave a lot of these things to the upper layers, but we're going to put big, bright red paragraphs in our document and say, you know, this is not covered by the NVM programming interface. You need your operating system to solve this, or you need your upper layer protocols to solve this problem.
Starting point is 00:19:32 And so that's the goal of the SNIA, to increase that awareness of what's missing or what's needed on top of what's been specified. Here's the ladder diagram from the document, and it shows a single application call called optimized flush. So we have an application that's mapped the region of memory locally and proceeds to store data in it, right? It does load, load, store, store, blah blah blah. It dirties up a bunch of data. At some point, it's going to say, I want to make that data persistent.
Starting point is 00:20:08 So it calls optimized flush. Our model in the remote access world is that at that point, we say, ah, okay, now we know what data is dirty. The application told us what data is dirty. And he told us we need to move it to the other side and commit it to a persistent medium, right? And so the library, the NVM programming interface library, begins to do a bunch of RDMA writes and starts to shovel that data over the wire. It then does this new
Starting point is 00:20:38 hypothetical flush, right? The ordering of these four operations is what's important. The flush makes sure that it pushes those writes. But the flush has to wait for these writes, this little lasso that we draw, it has to wait for these writes to actually appear on the bus before it responds. And so you can think of this as a fairly heavyweight operation in the large data model. And so this is our motivation for a new version of optimized flush. We want to batch these writes maybe a little earlier so that this flush can happen more promptly. And in particular, we want to use all this time that isn't shown here to drive the network. We want some parallelism and concurrency.
Starting point is 00:21:25 So we define this thing called asynchronous flush. It separates the flush and the drain stages. This is not currently in the document, right? This is in the 1.3 draft document that we're still writing internally. It allows early scheduling of writes without blocking. Optimized flush is a blocking operation, right? We don't like blocking operations.
Starting point is 00:21:44 They're bubbles in the pipeline. We used to call it giddy-up. We've loaded up the horse, and we want the horse to get moving, and there's more horses' worth of load to be moved or whatever, so we say, giddy-up, get out of here, and we start the RDMA writes. It's important for efficient concurrent processing. It turns out that async flush is pretty interesting locally as well. The async flush will allow you to keep your pipelines fairly shallow.
Starting point is 00:22:11 And that's always good. Shallow pipelines are good, at least if they don't hold you up. And we expect that Giddyup is useful both for the applications explicitly and particularly for middleware libraries. If the library hasn't seen a flush in a while but knows there's a bunch of stuff going on, it can do stuff in the background. It can schedule background work using this async flush. Drain is the thing you call after flush. It allows the application to ensure persistence. It's good because if
Starting point is 00:22:45 there's less data remaining to flush, because we've done this early scheduling, then we have less wait latency when it comes time to drain, right? So that's good locally and obviously good remotely. The problem with async flush is the error conditions, right? Now it's much more difficult to figure out when an error occurs which thing it was because you have all this asynchronous stuff flowing in the background. You want to know how local your error is in general. We have to get a little fuzzy when we go async. I think this is a traditional problem
Starting point is 00:23:18 seen in computer science for decades. Whenever you have asynchronous I.O., you've got to collect all these little atoms of work and put them together. And the errors, when everything's sunny and the wind is at your back and it's a beautiful day, everything's great. The problem is that not all days are like that. And you want the error case to be well understood so you know what to do. So that's what we're focusing on a lot, asynchronous flush. And under development, also being worked on in the 1.3 and the new version of the remote access
Starting point is 00:23:54 for HA, we have a number of interesting other things. And these are just a few of the ones that I cherry-picked. One, visibility versus persistence. Andy was describing this. When things become visible on the memory bus, are they persistent? Well, maybe yes, maybe no. And these are two very different things. What we've, like, our example here is you do a compare and swap on persistent memory,
Starting point is 00:24:18 and you do not necessarily yield a persistent lock, meaning that you've done the compare and swap atomically, but you've done it on some piece of memory that hasn't been made persistent yet, right? And the state of visibility is actually different from the state of persistence. And so you have to think of persistent memory in slightly different ways now, right? You have to think of it as something that needs to be committed before it is actually persistent and before you've saved the state. So these locks, like compare and swap, suddenly get new semantics or new side effects or warnings on the side. So what we're thinking right now is consumers of visibility versus consumers of persistence.
Starting point is 00:25:03 And the two are different, right? A consumer of visibility would be like a local lock or somewhere where you're putting something into the system in a multi-threaded application and you want it to be visible so you can write your multi-threaded application in the same way. But when you're making it persistent, you've actually become a different type of consumer. You're expecting a slightly different semantic. We're trying to bring out what that means, right? And it's kind of an academic exercise in a way, but we're trying to plant it in reality and thinking through what they really mean when you're implementing them on the platform.
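A tiny sketch of that compare-and-swap example, assuming an x86 target with CLWB (compile with -mclwb) and GCC/Clang atomic builtins; the lock layout and helper name are made up for illustration.

```c
/* Sketch of the compare-and-swap example: the CAS makes the lock
 * visible to other threads, but the lock word is not persistent until
 * it is explicitly flushed and fenced.  Assumes an x86 target with CLWB
 * and GCC/Clang builtins; the helper name is made up. */
#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>

bool take_persistent_lock(uint64_t *lock_in_pm)
{
    uint64_t expected = 0;

    /* Visibility: other CPUs see the lock as taken once this succeeds. */
    if (!__atomic_compare_exchange_n(lock_in_pm, &expected, 1, false,
                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
        return false;

    /* Persistence: without this, a power failure can forget the lock even
     * though every running thread already observed it as held. */
    _mm_clwb(lock_in_pm);
    _mm_sfence();
    return true;
}
```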
Starting point is 00:25:43 The assurance of persistent integrity. Okay. I've made it persistent. How do I know it's good? Andy mentioned this. The memory will throw an NMI. It'll throw an NMI whenever it discovers the bad bits. I don't know when it discovers the bad bits. Maybe it discovered it when it wrote it to the memory. Maybe a scrubber came by 10 minutes later and found it. All of a sudden, boom, the NMI goes off. And the system goes, oh, my God, I got bad blocks in my PMEM. Well, what blocks were bad? And who wrote them?
Starting point is 00:26:14 And who cares? And, you know, I got to notify the right guy. And so these types of semantics are very difficult. And we do have that optimized flush with verify. Well, that's an interesting angle. So this is what I talk about, the assurance or the strength of the notification you get. So we are looking into this,
Starting point is 00:26:37 and I believe an explicit integrity semantic is one way to go. That's just Tom Talpey's opinion. Scope of flush is another interesting thing. When you flush, what gets flushed? The API says you must flush this range of bytes. Bytes 4, 5, and 6 must be made persistent. It comes back and it says, yes, they've been made persistent. Maybe 4 through 8 were made persistent. Maybe the whole page was made persistent. What's the scope of persistence?
Starting point is 00:27:07 It's never less than the guarantee, but it's often much more than the guarantee. And when you go remote, really interesting things happen because you have queue pairs and sort of ordered packets on the wire. And so we've come up with the concept of streams of stores and this thing called a store barrier or an ordering nexus, which are used to plant flags in the stream of stores and say everything prior to this time
Starting point is 00:27:32 has been made safe. This is very, very useful for implementing in remote environments. We are trying to bring that out in the document. We also have flush hints. Can I flush asynchronously or synchronously? How much do I care about this particular flush? If I'm implementing a transaction, it's usually that last flush that I care about. In other cases, you may have streams of these things and you may want to mark some of them in special ways. We want to model these in the programming interface and also in our binding, if you will, to the RDMA protocol. We're looking, once again, to understand and guide platform implementation. We feel
Starting point is 00:28:09 these are really important concepts that need to be brought out for the industry. I'm going to pause for just a sec. I'm a little over halfway. I'm going to have to speed up as I dive into RDMA. Okay, you've probably seen these before. I started talking about this stuff in, like, 2016, I think. Maybe 2015. RDMA adapter. So remote access for HA, remote PMEM for HA. We immediately knew we needed to do this over RDMA. And I really dug into the SNIA persistent memory effort when RDMA came to the table.
Starting point is 00:28:51 RDMA is my thing. I put a lot of protocols on top of RDMA, and I've done a lot of things with RDMA itself. RDMA adapters provide a connection-oriented, ordered, reliable stream. The memory registration provides a handle to enable remote access. It's described by the handle, which is basically an opaque identifier, and an offset and length
Starting point is 00:29:11 that names the bytes inside that region, right? And we have these operations called send, and in particular, RDMA read and RDMA write, that act on those buffers. And they're broadly implemented across the industry over multiple protocols, okay. There are dozens of vendors and three or four protocols, many more proprietary ones. RDMA protocols provide this semantic called RDMA write,
Starting point is 00:29:37 but what we need is that remote guarantee of persistence, that thing you saw called flush. It's in support of optimized flush or async flush or other things. RDMA-Write doesn't cut it. It guarantees remote visibility, but not durability, right? It provides for delivery to the visibility of applications running on the remote platform.
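Here is roughly what that handle plus offset/length model looks like with libibverbs: register local memory, then post an RDMA write against a remote (rkey, address) pair learned out of band. It assumes an already-connected reliable queue pair, error handling is trimmed, and, as the talk says, the write only gets the data to the point of visibility, not durability.

```c
/* Rough libibverbs shape of the registration + RDMA-write path.
 * Assumes a connected RC queue pair and that the remote rkey/address
 * arrived out of band; the MR is left registered (a real consumer
 * would cache it and ibv_dereg_mr() it later). */
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

int push_write(struct ibv_qp *qp, struct ibv_pd *pd,
               void *local_buf, size_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_mr *mr = ibv_reg_mr(pd, local_buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode     = IBV_WR_RDMA_WRITE,   /* posted: no remote-side completion */
        .sg_list    = &sge,
        .num_sge    = 1,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad = NULL;
    /* Completion means delivered toward remote memory: visible, not durable. */
    return ibv_post_send(qp, &wr, &bad);
}
```

Nothing in standard verbs says those bytes became durable, which is exactly the gap the flush extension targets.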
Starting point is 00:30:00 As we've learned from our previous segments here, that ain't it, right? It needs to be physically in the memory. And RDMA doesn't really go that far, right? RDMA just puts it at the doorstep of the memory and says, okie dokie, I'm sure you'll open the door and let me in pretty soon, right? We need a flush. We need something that pushes it in that door and past the boundary, the dotted red line in Andy's picture. So we desired an extension, and back in 2016 and more recently, we've had these things called RDMA commit,
Starting point is 00:30:38 a.k.a. RDMA flush. They both have a very similar semantic. They execute like an RDMA read. They're basically ordered, flow controlled, and acknowledged. You send a request, it waits for its turn at the head of the queue, and it's acknowledged when it completes. The initiator requests a specific byte range to be made durable. It's basically a subset of that little RDMA triplet. And the responder acknowledges when durability is complete. It looks exactly like optimized flush in the SNIA model, basically.
Starting point is 00:31:08 It just happens to be remotable, right? And there's a bunch of problems implementing these on local platforms, but we can at least write it down. And so it's a new wire operation and a new verb. There are a bunch of details here. I basically talked about them all. But there are a number of ways to actually implement them. And the simple way is to interrupt the CPU and let it do it.
Starting point is 00:31:33 The more complicated way is to change the behavior of the platform so that it happens automatically. And the way we really want to do it is to perform via some sort of explicit operation from the PCI Express bus. And that's still under discussion. But these platform-specific workarounds can certainly do that type of thing. It's just a question of how universally it's implemented. Each NIC will be responsible for it. PMEM subsystem, the PMEM library, doesn't need to understand this. It said, this is
Starting point is 00:32:05 the requirement. I'm going to call this method, and you're going to do it, right? Right. Here's some workloads. This is where it starts to, I show you exactly what's going on. So first workload, basic replication, simple replication, mirroring, right? I got a local write. I want to replicate it to the remote side, okay? That's all I want to do. I just want to copy my stuff someplace else. I want to get it into a different failure domain for basic availability, okay? So I do a write, maybe I do a whole mess of them, and then I do a flush, right? I'm not overwriting data. I'm not really ordered. The ordering dependency is basically around these flushes. Every time I flush,
Starting point is 00:32:50 I want it to be safe, right? I don't really care when the writes showed up. I don't have any control over that locally, so why should I have any control over that remotely? There's no completions at the data sink, and we want to pipeline it, right? We want this to be incredibly efficient. All we want to do is store the data. Push the data, store the data. Push the data, store the data. We're not sending it to another upper layer. We're not processing it when it arrives.
Starting point is 00:33:15 And so we draw this fairly simple picture. It looks exactly like the picture we had before. We call these put, put, put. A put is an unordered sort of a write. They turn into RDMA writes that flow. We follow it with a flush. The flush basically collects all the writes and responds. Really simple.
Starting point is 00:33:37 This is the way optimized flush and the flush part of async flush will basically work on the wire. This is basically how you'd see them on the wire. There's a pause right here while the flush takes place. So it's a blocking operation. These guys don't block. In fact, they don't even get responded to. You notice how they just sort of die out into the PMEM line. These are called posted operations.
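A sketch of that basic replication pattern, with hypothetical post_rdma_write()/post_rdma_flush() wrappers standing in for the real verbs, since the flush is still a proposed extension rather than a shipping RDMA operation.

```c
/* Sketch of the basic replication workload: a stream of posted
 * ("fire and forget") RDMA writes, followed by one flush that collects
 * them all.  post_rdma_write(), post_rdma_flush(), and wait_flush_ack()
 * are hypothetical wrappers, not real verbs. */
#include <stddef.h>
#include <stdint.h>

int post_rdma_write(uint64_t remote_off, const void *src, size_t len);
int post_rdma_flush(uint64_t remote_off, size_t len);  /* ordered, acknowledged */
int wait_flush_ack(void);

int replicate(const void *src, size_t len, uint64_t remote_off, size_t chunk)
{
    /* Puts: unacknowledged writes that simply stream onto the wire. */
    for (size_t done = 0; done < len; done += chunk) {
        size_t n = (len - done < chunk) ? len - done : chunk;
        if (post_rdma_write(remote_off + done, (const char *)src + done, n))
            return -1;
    }
    /* One flush names the whole range; this is the only place we wait. */
    if (post_rdma_flush(remote_off, len))
        return -1;
    return wait_flush_ack();
}
```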
Starting point is 00:33:55 Fire and forget. I posted my write. I went off and I did something else. Well, that's nice. Let's do something more interesting. What about a log writer? A log writer wants to write a log record, 4K, whatever, wants to commit it to persistent memory, to something durable, and then wants to write a pointer that says that log record is good. I just added it to the linked list, basically. You do not want to place that pointer until the log record is successfully durable.
Starting point is 00:34:28 That would be really bad if somebody came up and found a good pointer in a bad record. So you need some sort of transaction. You need to be sure that the log record write was good before you wrote the pointer. You need some sort of acknowledgement or rule in the protocol that implements drain, that waits for this commit, this flush. You need to wait for it. If you do that end-to-end, you have a pipeline bubble, right? I'm going to send my flush. I'm going to wait for my flush. Now I'm going to send my write. Right? That's a big dead time on the wire. You don't like pipeline bubbles.
Starting point is 00:35:10 They impact your operations per second in large ways. So you want to just say, commit write. And if the commit failed, don't do the write. Right? You get the same answer either way. You know it either worked or didn't work. You know, you successfully or unsuccessfully committed your transaction. But it's way more efficient.
Starting point is 00:35:31 So you desire an ordering between commit and a second write. Between that curly brace and that curly brace. Basically, that little comma. You want that comma to be on the data sink, not on the data source. So we have this special thing called an atomic write. An atomic write drops 64 bits of data in an atomic fashion, but it does so only after a commit. And so it is really simple and really powerful.
Starting point is 00:36:02 It looks a lot like other RDMA atomics, in fact. So it's ordered in very similar ways, and it can be implemented very trivially on a lot of NICs. If they already do a compare and swap, they can implement this atomic write. If they do an uncached write, they can almost do it, and probably can on a lot of platforms. So we think this is a fairly lightweight thing.
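A sketch of the log-writer ordering, again with hypothetical placeholders for the proposed flush and 64-bit atomic write; the point is where the ordering lives, at the data sink, not any particular API.

```c
/* Sketch of the log-writer ordering: write the record, flush it, then
 * place the 64-bit pointer with the proposed atomic write.  The
 * ordering "comma" lives at the data sink: the atomic write executes
 * there only after the flush completes, so there is no round-trip
 * bubble at the source.  All post_* names are hypothetical. */
#include <stddef.h>
#include <stdint.h>

int post_rdma_write(uint64_t remote_off, const void *src, size_t len);
int post_rdma_flush(uint64_t remote_off, size_t len);
int post_rdma_atomic_write(uint64_t remote_off, uint64_t value);

int append_log_record(const void *record, size_t len,
                      uint64_t record_off, uint64_t tail_ptr_off)
{
    if (post_rdma_write(record_off, record, len))
        return -1;
    if (post_rdma_flush(record_off, len))          /* make the record durable */
        return -1;
    /* Ordered behind the flush at the responder: a reader can never find
     * a good pointer naming a record that was not durable. */
    return post_rdma_atomic_write(tail_ptr_off, record_off);
}
```

If the flush fails, the connection breaks and the atomic write never executes, which is the transactional behavior described next.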
Starting point is 00:36:23 It's being discussed both in the IETF and the IBTA, the iWARP and InfiniBand/RoCE communities. And there's pretty good agreement on it. A couple of details to work out. Here's what it looks like on the wire. This is an animation. So we start off with put, put, put, and a flush. And the flush, I've sort of interrupted it in midstream.
Starting point is 00:36:43 It's begun the flush. Shortly later, we send the write, right after the flush. We haven't waited for this flush yet. But it stopped. It said, oh boy, there's a flush in progress. I can't continue to execute yet. A moment later, the flush completes. And we respond to the other side.
Starting point is 00:37:05 And then, boop, that little arrow pops in, and it says, you may go now. And that stop sign turns into a go. I should have had another animation, but then it would have looked funny when I flattened it for PDF. And at that point, the write occurs, and we respond to the write. Now we have a transaction. We wrote a blob of data, we committed the blob of data, and we wrote a pointer, and it all happened in the right order. Now if the flush had failed,
Starting point is 00:37:37 we never would have uncorked this thing. We just would have gone, and broken the connection. So that would meet our needs. So, looking good. But now, you notice there was no CPU on the other side checking our work, right? So how do we know the data was good? I'm a paranoid log writer, you know. I do not want to commit a record
Starting point is 00:38:03 that you didn't get every bit right. And most storage layers do this quite faithfully. They check every bit. And they actually ask the disk to check every bit. There's all manner of paranoia up and down the stack. We don't have a lot of paranoia at the RDMA layer. It's just a transport. So what we really want is an integrity check. Now, I talked about this back in 2016, and it's still not reality, but I think it's getting more and more compelling, so I'm just still ringing that bell. In order to check it, there's some really bad ones. Reading it back, right? I don't want to read the data back. Not over the wire, that's for sure. And you might
Starting point is 00:38:43 not actually read the actual data, because there's no such thing as an uncached read. You might be fetching data from somebody's cache on the platform. You could signal the upper layer and say, oh, Mr. CPU, by the way, I just wrote some data. Would you tell me it's good? And you can go off and do his stuff locally. You can call the NVM programming API and do a check error and blah, blah, blah. But that's pretty heavyweight and a lot of latency. What we really want is the lower layer to do it. Oh, there's one more, a couple more interesting things. Maybe we just brought a new volume online. We want to scrub it before we add it to the array. We want to do some storage management recovery, that kind of thing. So some, you know, we want to use this in a couple different ways. So the solution is RDMA verify.
Starting point is 00:39:24 We have a new queued operation just like the commit, but it has integrity hashes. I don't think the RDMA layer needs to negotiate the hashes. The RDMA layer needs to implement them, but I don't think they're really part of the protocol. It's kind of like an RDMA read. You're asked to read the data, but instead of giving me the data,
Starting point is 00:39:41 just give me the hash, please. And the way you compute that hash is somebody else's question. Besides, no two people want the same hash. Some people want a simple CRC. Some people want full SHA-512. I don't want to tell them which one to use. I don't think it's a good use of protocol time to negotiate that, but that's something that can be discussed. The hashing algorithm is going to be implemented in the RNIC or a library somewhere. Some component somewhere on the target platform is going to do it. The semantic, I had a few options when I first talked about this, but I've nailed down, I think, what is right. See if you agree with me. The source
Starting point is 00:40:23 computes the expected hash of the region. You know, I believe this region should have a hash result of 1, 2, 3, 4, 5. So it's previously computed it, right? It sends that hash to the target and says, please compute the hash. The target computes the hash. If it matches, everybody's happy.
Starting point is 00:40:41 We just say, your hash matches my hash. Here it is. You know, I think we're good. If it doesn't match, happy. We just say, your hash matches my hash. Here it is. You know, I think we're good. If it doesn't match, now something interesting has to happen. Maybe that's a fatal error. Or maybe it's just, please tell me which regions are bad. So we have two different semantics. One is return the computed hash value that doesn't match.
Starting point is 00:41:01 And two is, oh my God, fatal error, stop. Fence all further operations. The log writer would use this. And so basically it's this three-way return: successful match, unsuccessful match, fatal mismatch. So here's what it looks like on the wire. Put, put, put, flush. I'm not going to do atomic write this time
Starting point is 00:41:26 just to get it out of the picture. Verify. I've done a flush. Now I want to be sure it's good. It stops, just like the other guy did, and waits for the flush. The flush completes. Boom, it does whatever it did.
Starting point is 00:41:40 It might have failed. If it failed, we're done. But if it succeeded, then the verify is signaled to proceed. Verify goes off to a verify engine, which computes the hash. And if it's equal, we complete. If it's not equal, we either return the value or break the connection. That animation didn't quite work. Yeah, there it is. So not equal might still say verify complete. or break the connection. That animation didn't quite work. Yeah, there it is.
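One way the three-way verify result could be shaped at the API level; every name here is a hypothetical placeholder, since no such verb exists in any standard yet.

```c
/* One possible shape for the three-way verify result.  All names are
 * hypothetical placeholders. */
#include <stdint.h>

enum verify_result {
    VERIFY_MATCH,       /* hashes agree: the data is intact                    */
    VERIFY_MISMATCH,    /* hashes differ: responder returns its computed value */
    VERIFY_FATAL,       /* hashes differ: connection broken, operations fenced */
};

struct verify_reply {
    enum verify_result result;
    uint64_t           computed_hash;   /* meaningful only on VERIFY_MISMATCH */
};

/* Hypothetical verb: queued and ordered like a read, behind any flush. */
int post_rdma_verify(uint64_t remote_off, uint64_t len, uint64_t expected_hash,
                     int fatal_on_mismatch, struct verify_reply *reply);

/* A paranoid log writer asks for the fatal flavor: on mismatch the
 * connection breaks, so nothing after the bad data can ever proceed. */
int flush_and_check(uint64_t off, uint64_t len, uint64_t expected_hash)
{
    struct verify_reply reply;
    if (post_rdma_verify(off, len, expected_hash, 1, &reply))
        return -1;                       /* connection broke: treat as failure */
    return reply.result == VERIFY_MATCH ? 0 : -1;
}
```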
Starting point is 00:42:08 So not equal might still say verify complete, but not equal can also say big red light. So that is a new proposal, not yet in any standard, but I believe very important. There's some implications for this thing at both levels. The NICs have to figure out how to do it. But the upper layer interface also needs to express the behavior. And so all of these prior RDMA extensions need consideration in the RDMA PM for HA document. First, I believe, is that it strongly strengthens the need for async flush. We need this asynchronous behavior to keep the pipeline full, right, to efficiently implement this thing. An implication of this is that we have increased
Starting point is 00:43:13 imprecision of errors. You know, the RDMA connection broke, and we know what sequence number it was executing when it broke, but we really don't know the details of what broke, right? We're not running on that platform. There's no check error call. And so that increased imprecision is a little bit of a worry, right? We need to understand more about the implications of that in the remote scenario. I touched on that earlier, but now I'm hoping you understand a little better what might go wrong here. Atomic write completion. Do we need to know if the atomic write is done? We're going to express the atomic write in the API, maybe. What's the completion semantic of it?
Starting point is 00:44:00 Is it a write, like a store operation, fire and forget? Or is it a sort of a transaction? I perform this write, I want to know that it's done. Is it, does it have a fence or a barrier behavior? That's kind of the core question. We don't know that. That's a local question. It's not a wire protocol question, but it's one we haven't really dug into yet. Asynchronous verify. Verifies can be pretty expensive. Maybe we need to run this asynchronously like we run flush. I don't know. We haven't decided. And verify fail imprecision. Verify fail, when we've actually gone off and looked for an error and it failed, you know, it's kind of imprecise in the connection break model. But it's very precise in the non-connection break model. So how do we harmonize those two behaviors from one method?
Starting point is 00:44:56 Other ongoing NVMP TWG work. The core 1.3 update is a current work in progress. We've been working on it since we shipped the last one a little over a year ago. I don't know when we're going to be done, but we're moving forward on it, and we meet very regularly to discuss it. Asynchronous flush and working out the details of it,
Starting point is 00:45:17 we're pretty far along in that one. We did a lot of this work last year. We knew it was coming, but it's maybe not done. We do have implementation learnings on all these things. There's a really interesting interaction with the C memory model. The flush command, you can't reorder instructions after the flush command. Compilers and processors
Starting point is 00:45:41 love to rearrange code for efficiency. Flush is really special. It's a barrier. And so we need to teach the compilers and perhaps the processors that it's a barrier. There's a small sketch of what that barrier looks like below. And that's interesting. I hadn't expected we'd become language engineers, but we might have to. And visibility versus persistence, as I described early on, right? They're two very different semantics
Starting point is 00:46:06 and two very different goals. We're going to continue the work. We're going to have greater RDMA mapping detail and maybe these extensions. We're going to have efficient remote programming models. We're going to call out the error handling, and we're going to depend on the Open Fabrics Alliance's OFIWG to help us get this stuff right.
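To make the flush-as-a-barrier point concrete, here is a minimal sketch for one plausible x86 implementation (CLWB plus SFENCE) with an explicit compiler-only fence; the cache-line size, the alignment assumption, and the intrinsics are assumptions about a particular platform, not something the programming model mandates.

```c
/* The flush-as-a-barrier point for one plausible x86 implementation
 * (CLWB + SFENCE).  The flag store below must not be hoisted above the
 * flush of the data, so a compiler-only fence is spelled out.  The
 * cache-line size (64) and the alignment of the data region are
 * assumptions.  Compile with -mclwb. */
#include <immintrin.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

void persist_then_mark_valid(char *data, size_t len, uint64_t *valid_flag)
{
    /* Flush every cache line of the (assumed cache-line-aligned) data. */
    for (size_t off = 0; off < len; off += 64)
        _mm_clwb(data + off);
    _mm_sfence();                               /* order the flushes before later stores */
    atomic_signal_fence(memory_order_seq_cst);  /* a common way to also pin the compiler */

    *valid_flag = 1;                            /* safe recovery point: data is durable */
    _mm_clwb(valid_flag);
    _mm_sfence();
}
```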
Starting point is 00:46:23 So we're opening it up big time. Scope of flush and flush on fail. That was what Andy described. Those are things we don't talk about yet. But flush on fail is kind of that emergency flush. What if the emergency flush didn't work? What if somebody yanked the super cap off my machine? How do I know and what do I do?
Starting point is 00:46:42 So interesting stuff. And that is it. All right, thank you very much. I'll be around. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Starting point is 00:47:07 Here you can ask questions and discuss this topic further with your peers in the Storage Developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
