Storage Developer Conference - #55: Low Latency Remote Storage: A Full-stack View
Episode Date: August 17, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 55.
Today we hear from Tom Talpey, architect with Microsoft,
as he presents Low Latency Remote Storage,
a full stack view from the 2016 Storage Developer Conference.
So, hi, I'm Tom Talpey.
I'm an architect with Microsoft.
You have probably seen me here before giving a similar presentation.
I hope to sort of pivot this one.
I've been sort of pivoting this discussion every time.
I started with sort of high and lofty goals,
and then I kind of grounded them in RDMA,
grounded them in SMB, grounded them in all kinds of things.
So I've kind of come all the way around
to what I might call a full-stack view,
where I want to talk about the components
and how they fit together
and how we're actually doing
something in this area within Microsoft. I'm not making a product announcement. I'm not
really making a whole lot of specific detail here. I have to leave that to others. But
I think you'll see that we're beginning to make this a reality and so I want to sort of draw the stack a little bit.
It's still a long ways off, right? I guess I should just sort of, this is a bit of a
disclaimer. This is a long process and you'll see there's a bunch of components. It's like
three to five years, I think, to realize a lot of this stuff. But the interesting thing
is that I'm trying to propose phases in which we can get there. And there are benefits at each phase.
And some of these benefits are orders of magnitude.
So it's a pretty remarkable journey if we can manage to take it.
All right.
Outline.
I'm going to give a problem statement, which is largely just a review.
I'll point to last year's presentation for more detail.
Platform support, RDMA support,
SMB3 support. We're going to walk the stack, right? And then at the very end, I'm going
to tell you something about Windows Server 2016. Okay, the problem statement. Storage
class memory. This is a Microsoft-adopted, or Microsoft-favored, term. Others have used
it, some have not. But storage class memory, we intend it as a sort of a general term. It's a new disruptive class of storage. It's storage
first. It starts with the word storage. It is storage. It's a non-volatile medium with
RAM-like performance. It has low latency. It has high throughput. It has good density,
high capacity, all right? It resides on the memory bus, therefore it's byte addressable.
It has byte semantics.
You can read and write a single byte or maybe a cache line, but
something really small.
It also has block semantics when you layer a driver on top of it, and
that's how Windows deploys it.
There's sort of two personalities to this type of device in Windows.
It can also reside on the PCI Express bus, where it usually has block semantics. You think of an NVMe device, but a lot of NVMe devices, for instance the device Stephen Bates was describing yesterday from his company, have a memory BAR as well as a PCI Express I/O BAR, an I/O personality, I guess I should call it. And so it also behaves in two similar ways, but predominantly block semantics when it's on the PCIe bus.
And to storage class memory, I add remote access, right?
I'm a file system developer.
I'm a remote file system developer, SMB3 and SMB direct.
That's what I do.
So local only storage lacks resiliency. It really doesn't
count in today's world to make one copy of your data. Everybody brags about it. Everybody
brags about replicating locally. It's still one copy of your data. If your system goes
down, both copies are lost. So it's required to have resiliency in the modern world for
storage. New features are also needed throughout the storage network and platform. So those are things we have to add to accomplish this and that's what
I'm talking about. So we're going to explore the full stack view. Quick review, in the
2000 timeframe, we had HDDs, 5 millisecond latencies. 2010, SSDs predominated. They have
100 microsecond latencies. That's a 50x improvement in 10 years.
Just want to remind people of that.
2016, now it's the beginning of storage class memory.
Less than 1 microsecond local latency,
less than 10 microseconds remote latency
because of the network round trip, right?
100x improvement on top of where we were.
So it's a 5000x change over 15 years.
5000x.
When was the last time you saw 5000x?
Storage latencies and the storage API.
This is interesting because the API is driving the adoption of storage class memory.
HDDs were very slow.
You would always use
an async programming paradigm, right? You'd launch an I/O, you'd block, you'd wait for the I/O to complete. SSDs were closer, but we still use the traditional block API, right? We perform a read, it interrupts us, we get the answer. Storage class memory starts to push this line back.
The latencies have improved.
Doug Voigt has a presentation like this.
But the API shifts when latencies get down
to a certain point.
It's better to poll, better to wait than to block.
And so the 500x reduction in latency and increase in IOPS changed the behavior, where applications suddenly program differently. Applications, to finally take advantage of this device, have to change their API. Now, there are some APIs that we
use today and applications use today, but that's going to drive this adoption. All right?
And so, eventually, we'll get down, we'll get SCM down to DRAM speeds and we'll never
use async. We'll do loads and stores. There are local file systems on top of this device
that are exposing it to applications today. The first and foremost is called DAX. It's
the same thing, almost the same thing, in both Windows and Linux. It stands for direct access, a direct access
file system in the sense that it exposes a file or a mapped API, right? It has sort of
two personalities. You can open a file, read it and write it like you normally do or you
can open it, map it and load and store it and then commit it like you do. Windows and
Linux implementations are very similar, and
you're gonna see a lot of applications moving to a memory mapped API to
take advantage of these things.
NVML is an explicit non-volatile memory aware library.
Andy and others have been talking about this for years.
It's real, it's open source, included in Linux, and
in the future, it'll be included in Windows.
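To make that concrete, here's a minimal sketch of the local durability sequence that NVML's libpmem layer exposes; the path and sizes are placeholders, and this is only a sketch of the pattern, not code lifted from the library's documentation.

```c
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Map a file on a DAX-mounted file system directly into our address space.
     * "/mnt/pmem/log" is a placeholder path; PMEM_FILE_CREATE creates it if needed. */
    char *base = pmem_map_file("/mnt/pmem/log", 4096,
                               PMEM_FILE_CREATE, 0666,
                               &mapped_len, &is_pmem);
    if (base == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    /* Byte semantics: this is just a store into mapped memory. */
    strcpy(base, "hello, persistent world");

    /* Make it durable locally: flush CPU caches if this is real pmem,
     * otherwise fall back to msync on the mapping. */
    if (is_pmem)
        pmem_persist(base, mapped_len);
    else
        pmem_msync(base, mapped_len);

    pmem_unmap(base, mapped_len);
    return 0;
}
```

Note that everything here is strictly local: the store and the persist only guarantee one copy on one machine, which is exactly the gap the rest of this talk is about.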
Chandra, my colleague,
had a presentation yesterday afternoon about this along with the Intel developer, Paul Luse. There's also interest in specialized interfaces: databases, transactional libraries,
and even language extensions. So these are other ways that storage class memory will
appear. But we have to take these APIs remote, right?
As I said, local copy doesn't count.
You have to replicate in the modern world.
One copy just doesn't count, except for temp data.
That's about the only thing I care about.
So we need a remote durability guarantee to do this.
We need to be able to make it durable locally, which the API
supports, and we need to push it to a durability guarantee remotely. And so that remote durability
has to happen all the way through the network. At the platform and the device, we need to
be sure it's durable. Across the network, we need to be sure it arrived intact and durably.
And then the storage protocol itself needs to reflect that back. So I'm going to walk the stack to show you what I think the world will look like.
You can feel free to interrupt me for questions.
I predict we'll have a full 45-minute segment though.
Consistency and durability platform. Okay, so we're going to start at the platform and
I just want to say a few things about the platform. I can only wish that the platform
does things, right? I don't build hardware. But I can at least tell you what I wish and
what I see. Okay, I mentioned this a minute ago. We need a remote guarantee.
Wait a minute, is that in the right spot?
Yeah, well, let's call the RDMA protocol the platform.
Sorry, this one's a little bit out of order,
but it's not out of place to talk about it.
The RDMA protocol is what we use to carry data from the originating node to one of the replica nodes, all right? And the RDMA write alone is not sufficient to provide the semantic. Idan went through this with his ladder diagram. But historically, the completion means only that the data was sent, all right? In RoCE and InfiniBand, it means a little more. It does mean that it was received in RoCE and InfiniBand. But at the verb level, and in iWARP,
it only means that it was sent,
or not even that it was sent,
that it was accepted for sending.
Okay, so it's a very weak guarantee,
but it's what RDMA has provided.
Some NICs do give a stronger guarantee,
but they never guarantee that data was stored remotely.
This is where I completely agree with him.
It really doesn't matter if it made it to the remote NIC.
It only matters when it makes it to remote durability.
Processing at the receiver, additionally, was a problem.
If you could say, well, when I see the packet arrive, if I take an interrupt, I know it's
there.
Well, no.
Things can be reordered, and lots of funny things can happen.
You need actually a completion.
You need to actually take a completion in the RDMA world.
And all these things add up to latencies.
This is our argument for requiring to extend the RDMA protocol.
Once we extend the RDMA protocol, we actually need a platform-specific method to implement the commit. And there's actually no PCI commit operation, right? PCI
writes are fire and forget. They're called posted. You post the operation. It behaves
just like an RDMA write. You post it and some hardware accepts it to put it there. But there's
no completion. There's no acknowledgment that the data actually arrived at the endpoint, no end-to-end completion on the PCI bus.
So that's bad.
We go through all this trouble to implement the RDMA protocol
and now what does the NIC do when he's plugged into the PCI
bus?
So I believe that down the road there will be a PCI extension of
some kind to support commit.
I'm hopeful that it's going to be a very simple extension.
Something like, well, I think I'll mention it in a minute, but some very simple extension.
But there are, in the meantime, workarounds.
And I'm going to mention a couple things from my friend Chet Douglas.
But the idea is to avoid CPU involvement and
the workarounds largely require CPU involvement. So this is one of Chet's diagrams that I swiped.
It's about ADR and it draws this little red box called the ADR domain. And the idea is that you can form an operation
that makes it into the ADR domain and is therefore persistent. ADR is a hardware domain, right?
It requires a specific motherboard facility with a super cap or a special power supply
or something. But it is a solution that's available today. There's a couple of glitches.
It doesn't include the cache of the processor.
It doesn't include caches that appear down here.
And so the workarounds all involve either bypassing or writing through these caches.
And they get a little tricky.
They're expensive.
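To give a feel for the CPU-involved flavor of these workarounds, here's a minimal sketch of a server-side routine that writes a just-received region back through the caches on a CLWB-capable x86 processor; this is my own illustration, not one of Chet's specific sequences, and the 64-byte line size and the function name are assumptions.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed x86 cache-line size */

/* Flush a region that the NIC just RDMA-wrote into persistent memory,
 * pushing it out of the CPU caches toward the ADR/persistence domain.
 * Must be compiled with CLWB support (e.g. gcc -mclwb). */
void flush_to_persistence(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHE_LINE)
        _mm_clwb((void *)p);   /* write back each line, leave it cached */

    _mm_sfence();              /* order the write-backs before reporting done */
}
```

The cost the talk is pointing at is exactly this loop: some server CPU has to run it, or something equivalent, on every remote write, which is what the protocol extensions below are meant to avoid.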
There are other workarounds that basically push the data to various places and then do an operation that signals things to go where they go.
Like you notice this one on the right has no ADR domain, right?
And so you can go look at Chet's presentations.
But by sending messages or performing other operations,
you can push the data out of these caches and into the non-volatile domain.
All right?
These are all very expensive, right?
They are required today, so it's reasonable to consider them.
But in the future, we don't want them.
And I would like to suggest that we may not actually need them if we carefully leverage our existing protocols,
which actually implement many of the messages that are proposed down here. So I'm not disagreeing with the workarounds,
but I'm saying the workarounds can be implemented in today's environment without really radically
changing the environment while we wait for this PCI commit and this RDMA write or this RDMA commit. And so what I'm going to explore today a little
bit is SMB3 push mode. And I talked about push mode last year very generically. I actually
talked about push mode in three different protocols. I talked about it in SMB3, NFS, and iSCSI/iSER. Push mode, as Idan mentioned earlier, is
required, I think, to overcome the latency issue. A round trip, well, a message round
trip, an actual send-receive or some sort of acknowledged operation at the upper layer
will add significant latency through
both the network round trip and the interrupt completion handling at the server, right?
We just can't afford it.
The speed of memory, the speed of storage class memory just doesn't allow us that.
We have tens of microseconds, 30 to 50 in today's implementations, sort of best case round trip
latency at that layer. Whereas we can get sub 10 microsecond latency at the RDMA layer.
So by push mode, what I mean is an operation which, if you will, bypasses the upper layer. And so first, I'll talk about the traditional upper layer
commit, how we commit data.
And it's basically a file operation.
We open the file.
We obtain a handle and a lease, for instance, in SMB.
And we perform a write.
We can also perform a read.
There's a bunch of dispositions on the write,
but the server performs operations, either
to a PMEM file or an ordinary file.
And then from time to time, the initiator, the client, will either perform a flush or
just close the file, right, which implicitly flushes.
And that flush is an explicit operation that crosses the wire and requests the server to make the file data durable, right?
Which on a disk drive means, you know, flush it to the disk, put it in the file system, do
whatever you have to do at the file system layer to make it safe. But on a raw PMEM device, it would
actually perform a local commit. It would perform one of these local durability operations
in response to a flush. And that's just what the DAX file system does. There might be some
recalls, which are interesting, and lots of other little details that I'll dig into in a minute. But the advantage is it's a traditional API. It's fully functional today and it literally just works. The only cost is you have to take interrupts and run that server CPU, so it does cost you a bit of latency; the existing latencies are all still there.
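As a concrete picture of that traditional path, here's a minimal Win32 sketch of a client writing a file on an SMB3 share and explicitly flushing it; the UNC path is a placeholder, and on a DAX-backed share the flush is what causes the server to perform the local durability operation described above.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Open (or create) a file over SMB3; the share name is a placeholder. */
    HANDLE h = CreateFileA("\\\\server\\pmemshare\\log.dat",
                           GENERIC_WRITE, 0, NULL,
                           OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    const char record[] = "log record";
    DWORD written = 0;

    /* Traditional write: travels as an SMB3 WRITE to the server. */
    WriteFile(h, record, sizeof(record), &written, NULL);

    /* Explicit flush: travels as an SMB3 FLUSH; on a DAX file the server
     * performs the local durability operation on our behalf. */
    FlushFileBuffers(h);

    CloseHandle(h);   /* close also implies a flush */
    return 0;
}
```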
Push mode with explicit commit is where we really want to get.
And if you will, the idea is to tunnel below the file system layer, to tunnel below SMB,
right?
Instead of performing an SMB write, we're going to perform an RDMA write below it, right?
We're going to set up the I/O using SMB, but we're going to perform the I/O
with an RDMA write. And we can do that when we have a PMEM array over here on the side.
So once again, we would use SMB3 in this case to open a DAX-enabled file, one of these direct
access files. It's literally persistent memory mapped into the host's address space.
So when you write it,
you're just loading and storing in memory, okay?
Then, instead of loading and storing from the processor,
we're gonna load and store from the NIC.
We're gonna perform RDMA writes to push.
We can also pull.
Pull is kind of rare; it happens during rebuild, happens with some weird application things. It's primarily about write. And when we do write, we do a commit. We do one of
these RDMA commits, which is this guy with the little diamonds on it. So we do write,
write, write, commit, write, write, write, commit, write, write, write, commit. And the
commits would basically serve the function that a flush does at the SMB3 layer,
but only for one of these DAX-enabled files where we're actually writing the memory, right?
Otherwise we have to signal the CPU.
So that's the rough schematic of how it works.
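Here's a rough sketch of what that data path could look like at the verbs layer, assuming the server has already returned a remote address and rkey for a registered region of the DAX file; the RDMA write posting is standard ibverbs, while the commit is deliberately left as a hypothetical placeholder for the proposed extension.

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Push one buffer into the server's registered PMEM region.
 * qp, the local memory region, remote_addr and rkey are assumed to have been
 * set up already (the rkey comes back from the push-mode registration). */
int push_write(struct ibv_qp *qp, struct ibv_mr *mr,
               void *buf, size_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr;
    struct ibv_send_wr *bad = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;   /* standard one-sided write */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = 0;                   /* unsignaled; no remote CPU involved */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad);
}

/* Hypothetical commit: no such verb exists in standard ibverbs at the time
 * of this talk; this prototype stands in for the proposed RDMA commit/flush,
 * which would complete only once the remote region is durable. */
/* int push_commit(struct ibv_qp *qp, uint64_t remote_addr,
                   uint32_t len, uint32_t rkey); */
```

The write, write, write, commit pattern is then just several of these unsignaled writes followed by one commit, and the commit's completion is the remote durability guarantee.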
There's a little bit more about leases and things. Let's drill down. So push mode. First, you have to open
the file. This is called SMB2 create, the operation. Create creates a handle. It doesn't
create a file. That's different. You create a handle. And if that results in creating
a file, that's a disposition which is a little more involved.
In any event, opening the file does a lot of things.
You created a connection, you've created a session, you've authenticated yourself to
the server, you've requested to open the file so the server checks authorization based on
your credentials.
All that good stuff has been done by the upper layer, which is really convenient, right?
That's a lot more than you get from some low level block
driver, for instance.
How is the block layout?
Yeah, well, actually, this isn't the block layout yet.
That's a different layer of SMB, which
is a bit of an extension.
But it's just like an NFS v4 open or an SMB2 create.
So once you send a create, in SMB2,
you decorate it with things called create contexts.
And create contexts are these little blobs
that accompany the create to tell the server
how to handle the create.
And I hypothesize that there may be some sort of push mode create context.
It would say, I want to open this file in push mode, right?
Alternatively, there could be an ioctl later.
SMB2 supports ioctls, called FSCTLs (file system controls), on an open handle.
So the idea is to set up push mode.
And an important side effect of this is that if you're going to lock the file in memory and read it and write it,
you probably need to kind of own that lock, right?
And that's something we call a lease in SMB2.
So I believe that it's in conjunction with requesting a lease.
And by establishing the lease in push mode, we can establish that mapped file for some range of the file, perhaps the whole file, lock it in memory, and treat it as a PMEM segment or scatter gather.
So we return the create context to indicate push mode is available.
There's some acknowledgement back from the server that push mode is available.
All right, so the lease provides us with other things that allow us to recall
and manage the mapping. All this happens at create time. This is like one operation. It's
just one big decorated create. After that, the regions may be remotely written and possibly read.
So that requires a RDMA registration.
This is where we would clearly have some sort of FS control.
The client would say, I want to read and write this region of the file, or perhaps the whole file.
So the file is in memory, ready to map.
Now it needs to be registered so the RDMA layer can access
it, right? And we do that today in SMB Direct. We actually do it per I/O in SMB Direct. For each read or write, the server will, well, the client will register its memory. The difference is we're doing it here on the server side. So we sort of reverse the polarity of the operation. Anyway, I suppose that it might be at least an offset, a length, and a mode. I want access from zero to infinity, write-only, for instance.
The server would pin the mapping, register the pages, and return a region handle,
maybe multiple region handles.
The client could request multiple regions,
but the result is that he'll be able to remotely read and write the file.
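To make the shape of that exchange concrete, here's a sketch of what such a registration FSCTL's input and output buffers might look like; nothing like this exists in MS-SMB2 today, and it's purely illustrative of "an offset, a length, and a mode" going in and a region handle coming back.

```c
#include <stdint.h>

/* Hypothetical FSCTL_SMB_PUSH_REGISTER input: which part of the open,
 * DAX-backed file the client wants to access directly, and how. */
struct push_register_in {
    uint64_t offset;        /* start of the region within the file */
    uint64_t length;        /* length of the region (could mean "to EOF") */
    uint32_t access;        /* e.g. 0x1 = remote write, 0x2 = remote read */
    uint32_t reserved;
};

/* Hypothetical output: everything the client's RNIC needs to address the
 * server-side mapping, plus a handle for later invalidation or recall. */
struct push_register_out {
    uint64_t region_handle; /* identifies this registration to the server */
    uint64_t remote_base;   /* RDMA-addressable base of the pinned mapping */
    uint32_t rkey;          /* remote protection key for the region */
    uint32_t flags;         /* e.g. whether RDMA commit is supported */
};
```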
Recall is important because the server may have to manage the mappings. The client will begin to perform its reads and writes directly within
the file, entirely within the RDMA layers. There's no server processing. Only the NIC knows that it's happening. If there's a commit
operation available, the client commits via RDMA. Otherwise, the client commits with flush,
right? If there's no commit, the data is there and ready to go. The client just has to tell
the DAX file system to commit so it can send a flush operation. So this is kind of this weird hybrid, right, where you might be able to
push the data with RDMA but you might have to ask the CPU to commit. And that could be
a win if you're merging a lot of RDMA writes. The client may periodically update the file
metadata.
There's an ioctl, FileBasicInformation, which is basically like an NFS SETATTR; it sets the attributes of the file.
So for instance,
if the client is writing the file
and the server CPU doesn't know it,
how does the timestamp get updated, right?
It's the client's responsibility
to do that at this point.
The flush can do it.
That might be an advantage.
But the client can also do it with one of these things.
And there's a few things that the client can't do that he must use SMB for.
And that is adding blocks, appending to the file, for instance.
Punching a hole in the file, right?
You've got a mapping.
That's all you've got.
You can't add to the mapping without changing the mapping.
You can't punch a hole in the mapping without requesting that that be done.
So a lot of these things will definitely be done with traditional SMB3 operations.
So you can see a mix.
You can see the data transfers being done via RDMA.
You can see the metadata operations being done with SMB.
And now here's the recall. I started to talk about this at the beginning of the last one.
Sorry.
But the server may have to manage the sharing and
this registration.
And the client, I propose that we use the lease to do this.
It'd be recalled upon sharing violation if another process
wants to open the file.
That would be an ordinary SMB3 level sharing, right?
One said, I wanna be the only writer of this file, and another one comes in.
So it recalls.
File doesn't have to move, but the permission has to change.
Caches have to be flushed, God knows what.
There might be a mapping change. Maybe the CPU's maps are full and we have to rearrange things.
So the file system may request that the data be unmapped and
remapped in another location or maybe rearranged due to bad blocks or
I don't know what, some platform specific event occurred.
A registration changed.
For instance, let's say the server started to run low on RDMA resources and
said, I gotta throw everybody out and
rebalance my resources, that kind of thing.
So this recall is very useful for
that kind of server callback to the client.
And when recalled, the client performs the usual.
If it has any dirty buffers on its side of the wire,
it flushes them.
It probably commits them, because usually it cares about the durability. It returns the lease and probably goes back and gets another lease, or maybe it's happy with the
new lease level. Maybe it got dropped down from read-write to read-only and maybe it's
only reading. So he might be okay with that. But that's standard lease recovery. Once again,
that's a sort of an ownership or a sharing metadata operation.
That didn't advance. There we go.
And push mode, when you're done, you're
going to probably close the file.
You will return the lease as part of closing the file.
So that's the SMB3 level operation.
There may be the push mode registration that needs to be cleaned up. There might be
an explicit FS control for that. It may just happen automatically on close. Well, it will
certainly happen automatically on close. But there might not be another way. You might
have to close the file. I don't know. This is a semantic that we need to think through.
What's the most useful application behavior for this type of push mode? I think this depends on the way we decide to enable push mode. If we do it with the create context, it probably happens
at close. If we do it with an FS control, it probably happens dynamically or, you know,
explicitly. I want to point out that push mode registration needs to be reset during
handle recovery. SMB3 has a feature called CA, continuous availability.
And continuous availability allows the server
to hold on to handles when the client disconnects.
And the client can come back
with exactly the same state he left in.
It's really important for enterprise applications
like Hyper-V, virtualized environments, right?
If they get migrated, they need to connect
and they need to pick up exactly where they left off with no loss, right?
We want this to be completely transparent to the application.
So the push mode registration, let's say it moves,
the client moves and has to reconnect to the same server.
That registration goes away and has to be rebuilt.
So it must be reobtained after handle recovery.
You'll get a new handle.
If you try it on the old handle, well, it will fail because the connection was torn
down. So that's good. RDMA protection helps us.
All right. So that's the quick trip through SMB3. Now, a quick trip through the protocol
extension. This is going to duplicate some of what Idan was discussing. I'll say it in my own words, so maybe I'll have something to add, but I might skip over a few
things here. Doing it right. To do it right, I think we have to do it right in the RDMA
layer. We need a remote guarantee of durability. We want it to be highly efficient.
And we want to standardize it across all RDMA protocols, right?
The protocols themselves all have their own quirks and implementation details.
That's fine.
But we want the operation to be similar across all RDMA fabrics and all RDMA protocols.
We have that today for storage, right?
SMB3 runs equally well over iWARP, RoCE, and InfiniBand.
You name it.
It does so because it requires very little of the fabric.
It wants a send or receive, an RDMA read, an RDMA write,
some ordering guarantees, some connection management,
IP addressing, that's about it.
Doesn't need the fancy stuff.
This is another element of the fancy stuff list.
And it's really important that it not be proprietary or it not be specific.
It can be extended, but that core behavior needs to be
countable by all the storage protocols that need it.
So my concept is that there's an operation that I call RDMA commit. It's being discussed in the internet draft, in the IBTA, and by a lot of people. I've talked about this before, so there's probably nothing new here to people who've seen that.
It's also been called RDMA flush.
It's also been called, I don't know what.
It's a new wire operation in my view and a new verb.
It's only one operation and it's only one verb, okay?
It's a very simple point enhancement.
You can conceive of other natures to it, but
I view it as one operation, implementable in all the protocols.
I believe that the initiator provides a region list
and other commit parameters
under control of a local API at the client.
And I'll point to the SNIA NVM technical working group,
the non-volatile memory technical working group
as leading the way in that type of API.
We have this thing called optimized flush.
And optimized flush has a bunch of parameters
that I think should map directly here. But, you know, I'm open to discussion on that.
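For a sense of what "a region list and other commit parameters" could look like at that API boundary, here's a sketch loosely modeled on the shape of the NVM Programming Model's optimized flush, with every name here being illustrative rather than a defined interface.

```c
#include <stdint.h>
#include <stddef.h>

/* One region the client wants made durable on the remote peer. */
struct commit_region {
    uint64_t remote_offset;  /* offset within the registered region */
    uint64_t length;
};

/* Hypothetical parameters carried by the proposed RDMA commit operation:
 * an rkey identifying the registration plus a scatter list of regions,
 * in the spirit of optimized flush's address/length pairs. */
struct rdma_commit_params {
    uint32_t rkey;                      /* which registration to flush */
    uint32_t region_count;
    const struct commit_region *regions;
    uint64_t immediate;                 /* optional small payload, e.g. a
                                           64-bit log pointer, discussed below */
};
```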
The receiving RNIC has to queue the operation in order. I think it behaves like an RDMA
read or an atomic. It's subject to flow control and ordering. Very important. Both of those
things are very important. Ordering because you need to know what data is becoming durable.
And then flow control because it's gonna take time.
It's kind of a blocking operation on the wire, right?
When PCommit was still on the list,
PCommit was a pretty heavyweight operation.
PCommit may not be required for some solutions now.
Maybe it's just a flush, but it's still a flush.
There's still an end-to-end acknowledgement within the platform.
So that blocking operation is gonna require flow control.
The NIC pushes the pending writes, performs the commit,
possibly interrupting the CPU.
I wanna talk about that with respect to the API.
And the NIC responds only when durability is assured for
the region the client wanted.
There's some other interesting semantics though.
I think one of the key scenarios is called the log pointer update.
Where you write a log record, you make it durable.
Then once it's durable, you write the log pointer and make it durable.
It's like two durable commits.
One is a little bit bulk, right?
Might be 4K data.
The other one is literally a pointer, 64-ish bits, right? And so I believe that making that two operations
on the wire might be expensive and might lead databases, for instance, to think twice about
using it. But I think if those two operations can be merged in some clever way, and I have
some cleverness in my internet draft, that can be a big benefit for this thing, for adoption
of this thing.
I also think it would just be basically useful.
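Here's the shape of that log-pointer scenario as a C sketch, using hypothetical helpers (remote_write and remote_commit, standing in for the RDMA write and the proposed commit) just to show the ordering constraint and where a merged operation saves a round trip.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers, standing in for the RDMA write and the proposed
 * RDMA commit; both are assumed to return 0 on success. */
int remote_write(uint64_t remote_off, const void *buf, size_t len);
int remote_commit(uint64_t remote_off, size_t len);

/* Unmerged form: two commits on the wire, strictly ordered.
 * The pointer must not become durable before the record it points to. */
int append_log(const void *record, size_t len,
               uint64_t record_off, uint64_t ptr_off, uint64_t new_tail)
{
    if (remote_write(record_off, record, len))             return -1;
    if (remote_commit(record_off, len))                    return -1; /* record durable */

    if (remote_write(ptr_off, &new_tail, sizeof(new_tail))) return -1;
    return remote_commit(ptr_off, sizeof(new_tail));                  /* pointer durable */
}

/* Merged form explored in the internet draft: a commit that carries a
 * small (64-bit-ish) payload, placed and made durable only after the
 * bulk region is durable. One round trip instead of two. */
int remote_commit_with_immediate(uint64_t bulk_off, size_t bulk_len,
                                 uint64_t ptr_off, uint64_t immediate);
```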
Second, you may need to signal the upper layer, right?
You may need to tell the upper layer that something happened.
We have a scenario in database server that I have in mind where the log is replicated and safely
replicated but the idea of replicating it is for quick takeover, right? Should one of
the database instances fail, the other one should be ready to take over. So you want
to replicate it but you want to notify the peer that there's dirty log data. And so the
peer can actually stay very
close behind you, not synchronously. You don't want him to actually replay that log, but
you want him to know that there's dirty data to be replayed so he can keep close. And signaling
him, either with a little tiny datagram or a whole message, is a really interesting idea. So if
you put these two things together, I think you have some interesting, you know,
merged semantics.
This one I'm really interested in, an integrity check.
If I'm writing data, how do I know it's good, right?
I called commit and he said it was committed.
How do I know that it's really good, right?
We have all kinds of storage integrity at the upper layer.
Can we do this at the RDMA layer?
I don't know.
I don't know exactly how.
Does the NIC have to read it back?
Is there some other way to do this?
But an integrity check is right up there on my list.
The choice of them will be driven by the workload.
We have to have this dialogue with the applications.
We can't just sit here and say, I think I have a good idea.
We need to motivate this by some request or some need.
And finally, the expectation for latency is that push mode will land at much less than 10 microseconds.
Idan mentioned it, right?
He said there's 0.7 microseconds on the wire.
There's no way push mode's completing in 0.7 microseconds, right?
There's more overhead on that thing.
But it's certainly 2 to 3 microseconds, okay?
Which is an order of magnitude,
maybe better than an order of magnitude,
better than we get for a write
with a traditional storage RDMA transport today.
Like I said, most of them are in the 30 to 50 microsecond range, okay?
So that's huge, right?
An order of magnitude is always something to go after
with everything you got. Remote read is also possible. So he was asking why would you need
this? There aren't a lot of scenarios that need it. That's why I sort of put it as a
footnote, right? But it is interesting, certainly for rebuild, right? You lost a copy, you got
to read that copy over the network, just suck it in, right?
The reason we get this is that there's no server interrupt, there's a single client
interrupt. We can even optimize it when we have multi-channel and flow control and things
like that. We know how to do that. That's local. That's local magic, right? We can all innovate in ways that are important. All right. Push mode considerations. I've
got a list of interesting things. Then I got a quick update on Windows Server 2016. I called
these fun facts when I talked about this a couple of months ago at Samba XP. I've decided to name each fun fact a little more meaningfully.
The first is buffering.
This is true certainly of Windows and I believe of Linux.
If you open a file in buffered mode on a server
with a persistent memory device, buffered mode
has a new meaning.
Buffered mode means that you've actually
mapped the device directly when you open a file in
buffered mode.
It used to be that the buffer was this bounce buffer in between you and the device, right?
You would write to the buffer and the server could acknowledge that quickly and then lazily
it could write it out unless you flushed it.
Buffering means the opposite with PMEM.
Buffering means you actually own the device right now, at least if everything's working well. There
might still be a buffer in there, but there's no reason to buffer just to drop it back in
memory, right? So the idea is to put it directly in the device. So that's really interesting.
But it enables both load and store and RDMA read-write when you have this, right? And
so it's pretty interesting that you sort of want to invert the meaning of buffered
and unbuffered.
It used to be unbuffered kind of went faster because it went straight to the device.
Well, buffered went faster, but you didn't get the guarantee.
So it's kind of the opposite now.
You want buffered.
So the server can actually hide this from you.
And that's what I'm going to talk about in a minute. We experimented with Windows on this.
Recalls. NFSv4 has this, SMB has this. Push mode management requires new server plumbing,
right? The up call from the file system for sharing violations remains the same. But,
you know, relocation, file
extension, all these things, these are new up calls that don't come from disk drives,
right? And so the server has to be prepared for some interesting behaviors. And RDMA resource
constraints are another reason for recall. Now you're explicitly giving RDMA resources
on the server to the client. None of the protocols today, including NVMe over Fabrics, exposes the server buffer. So when you go to push mode, you're going
to have to manage those resources. So this is really important. You're going to have
to change the plumbing of your server. So an implementer will have to move toward this
over time. I don't know what all these recalls will be just yet. But the idea is, I'm in
trouble. You need to let me fix this thing and
then we'll get started again.
Congestion. This is the thing I'm most worried about. The big thing about write operations in storage protocols today is that they naturally flow control themselves, right? You send the write to the server and it's just a request to push data,
right? The server says, okay, I'll take that request and it pulls the data, right? And
what that means is that the server schedules that pull, in particular when it has the memory
to receive it and when the I/O, you know, the I/O has reached the top of its
queue basically, right? And so it flow controls the network very nicely,
as well as flow controls the queues on client and server very nicely.
When we go to push mode, we're going to congest all over again.
We're just jamming data down the server's throat, right?
The software doesn't have to deal with it.
It's out of the loop.
But the RNIC, the network, the interface, the DRAM bus,
all this stuff is
going to have to deal with it. I'm really worried about that. RDMA itself does not provide
congestion control, certainly not for RDMA writes, right? They are unconstrained. You can send as many of them as you want. You can fill the wire from one node. But congestion
control is provided by the upper storage layers. And there are credits in SMB3.
There's also something called storage QoS.
My colleague Matt Kurjanowicz gave a presentation yesterday about this and
their use in Hyper-V.
So I'm going to point out that there are existing client side behaviors,
such as this QoS, that can mitigate this problem.
And we're going to need to think those through,
because I think they're going to be really, really, really important.
And RNICs can help.
More thinking is needed.
We're going to have to solve that question.
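As a trivial illustration of the kind of client-side behavior that could help, here's a toy pacing structure that bounds how much pushed-but-uncommitted data a client keeps outstanding; this is my own sketch, not an SMB3 credit or storage QoS mechanism, though it plays a similar role.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy client-side pacing state: cap the bytes written since the last
 * acknowledged commit, so one client can't fill the wire unchecked. */
struct push_pacer {
    size_t outstanding;   /* bytes written but not yet committed/acked */
    size_t limit;         /* budget granted by policy (think: credits or QoS) */
};

/* Returns nonzero if the caller may post another RDMA write of 'len'
 * bytes; otherwise the caller should issue a commit and wait for it. */
static int pacer_may_write(struct push_pacer *p, size_t len)
{
    if (p->outstanding + len > p->limit)
        return 0;
    p->outstanding += len;
    return 1;
}

/* Call when a commit completes: the durably written bytes no longer
 * count against the budget. */
static void pacer_on_commit_done(struct push_pacer *p)
{
    p->outstanding = 0;
}
```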
Tom, is that congestion at the network or at the end node?
Both. Absolutely both.
Primarily the end node, right?
But the switches, the exit ports, suffer from that fan-in congestion.
I mean, we see this every day in any sort of scaled network deployment.
And integrity, I mentioned this before, but data in flight protection,
well the network provides that.
Data at rest, that is not provided by the network.
How do we do that?
SMB3 has a means, but we're not using SMB3 to transfer
the data in push mode. So I think we're going to need to think about remote integrity as
time goes on here. I don't think people will be willing to push their data quickly only
to discover that it didn't quite make it the way it was sent, right? That's usually not
a good way for a storage provider to make money.
To wrap up, I have a few slides on where we're at today in the Windows implementation.
Yeah?
Sorry.
Yeah?
You gave the example of the log pointer update.
Yeah.
And this was a motivation, you said, that you want basically the two operations to be one?
I want them to be ordered with respect to one another,
and ideally I would like to issue them without two round trips.
That's a subtle difference but...
No, no. There would actually be two flushes. Well, you could pipeline them, that's one
approach. You could have an optional payload in the message, a small 64-bit payload, that's what I explored in the internet draft. Because I think 64
bits is enough for the scenarios I have in mind.
Well, could you do it as one operation where the log update effectively carries a commit
list which forces commits for all the previous log writes?
Tons of ways to think about this.
That's a protocol implementation question.
But the idea is you have one probably bulk piece that's placed and made durable followed
by another which is placed and made durable only after the successful placement and durability
of the first piece.
That's the rule, right?
I can't update the pointer until the log is safe.
And I can't write the pointer until the log is underway.
But specifically you mentioned
that you don't want it to be two operations,
you want it to be one.
Ideally I wanna keep the latency minimal.
If I have to do two operations, so be it.
But I'm gonna point out that maybe it's easy to,
as long as the semantics are well defined, to merge them. Right? A lot of upper layers
will have things like compound or fused or whatever operations where you can send more
than one operation. The important thing is that the two operations are ordered with respect
to one another. If one fails, the other one cannot proceed. That's what's critical.
Yeah. I know. Yeah. Oh, I'm sorry. Repeat the question.
And I'm saying that smashing operations together will have a cost and we'll have to measure
that cost, right?
Maybe it will be too expensive to do it in the NIC.
Maybe the protocol will change in some ungainly way if we try this. And so I agree with that, right? We need to think it through, right?
If the cost outweighs the benefit, forget it. Benefit outweighs the cost? Oh, now we're
talking. Okay, Windows Server 2016. Tech Preview 5 has been available since this summer and
general availability is imminent.
I'm not here to announce anything.
Oh, I have more time than I thought.
Good.
Just a couple of minutes.
It's imminent.
It's very soon.
I'll give you a hint.
We have Microsoft Ignite in Atlanta next week.
But we'll see. The Tech Preview 5 and Server 2016 GA both support DAX.
DAX is really great.
And Remote SMB works over DAX, I just want to mention.
It's a file system.
It's exportable.
It works great.
And so we get all the expected performance advantages, right? DAX is great
because there's no media, it's just memory, right? And it supports all the reads and loads
and stores and commits that we expect from a memory-based file system. It's basically
a RAM disk, right? It has a block semantic and we put a file system on the block semantic
and we open files and we read them and write them.
And if we map them cleverly, we avoid data copies, which is really cool.
However, Windows Server 2016 Tech Preview 5 does not implement full push mode.
Full push mode, as you can see, has protocol implications.
Generally speaking, when we develop Windows Server, we spend a year in design, a year in development, and a year in test.
I can't slide a new feature in in that last year.
It's very difficult.
If it wasn't committed way up front, it ain't going in.
And a new protocol change, forget about it, okay?
So just damping expectations perhaps a little.
It's not gonna have push mode.
I hope it will in the future.
And we have a new release vehicle that might let me do it, but not yet, okay?
So there's no reliance on extending the protocols
that we discussed.
We did, however, consider the direct mapping mode,
and that was that picture where I showed
the DAX file system in that little dotted line
that went from the PMEM straight to the file's buffer cache copy, right?
And that direct mapping is really interesting because the Windows SMB
server basically can operate on that mapping, right, in a defined way,
in a file system managed way, and can read and
write those pages directly out of the PMEM, right?
And so we thought to ourselves,
why don't we just RDMA straight out of the buffer cache,
which would effectively RDMA straight out of the PMEM device.
And we learned something very surprising.
This is kind of small and I'm sorry, but basically these green boxes of IOPS and latency are
basically within 1% of one another when we tried the two models.
We tried the traditional model where we DMA'd to the buffer cache and then copied the buffer
cache to the PMEM or when we mapped the PMEM into the buffer cache and DMA'd directly there.
On the right is that second model, RDMA direct in "buffered" mode, air quotes. And on the left, RDMA traditional unbuffered mode
where we do the data copy.
We got the same darned result.
This was a 40 gigabit network,
so basically the bandwidth was almost completely occupied.
But the latency didn't budge, right?
It took the same length of time,
processing was the same length of time on the server.
So that was pretty surprising.
We spent a lot of time
and we do understand that the reasons go kind of deeply and DAX delivers. This performance
was considered quite good. This is actually not the most interesting platform. We have
a much more interesting platform that we're going to get some really neat results from
soon. This was done back in May, June timeframe, I think. There's no advantage. And so, we basically
decided not to do it. We basically stepped back and we said, okay, we're not actually
disappointed here, right? It was a good idea, but it isn't ready in the code. We basically discovered that memory manager and cache manager, these are MM and CC in Windows,
are not quite ready to do it at this IOP rate, in other words,
to map and unmap pages from a file at this kind of rate.
They're used to doing persistent mappings or long-lived mappings,
something that lasts for seconds, minutes, whatever, not a few microseconds, right?
And so they need some work
or some sort of persistent FS control
that just makes one mapping
and hangs on to it for a long period of time.
By not changing anything,
stability and performance are maintained, right?
But we can improve this in a future update. And so I'm going to beg you to watch this space. There's a
lot of horsepower under the hood here that we haven't lit up yet.
So if you happen to have Windows Server 2016 or Technical Preview 5, I just want to mention that if you stick an NVDIMM in it, and if you stick an RDMA NIC in it, and you can choose any type you like, we support Chelsio and Mellanox primarily.
There's a couple of others available from vendors who've built it and certified it for Windows.
Configure a DAX file system, which is basically formatting your volume with the /DAX option.
Here's a little link to help you learn the new format option.
Create an SMB3 share and you're off to the races.
So it's literally out of the box ready to go.
You just need two pieces of hardware, a DIM and a NIC.
And finally, external RDMA efforts.
We mentioned this.
The requirements and the protocol need to be done in a very broad way through standards
bodies, right?
We need everybody to know what to do and to be able to build interoperable and useful implementations.
So this moves to various standards organizations.
I didn't actually mention the PCI Express folks here, but
the PCI SIG is an important one as well.
IBTA Link Working Group, specifying InfiniBand and
RoCE protocols for this.
The IETF STORM working group, which I was co-chair of, has unfortunately recently closed. We basically completed all our work. And this train didn't arrive
at the station early enough to save it. So it unfortunately closed. The Working Group
is actually still there. It's just inactive, closed. So the mailing list is still there.
And I submitted this internet draft to it. This draft needs to be updated.
My co-author helpfully left the company, so I have to kind of pick up a flag over there.
But there's a lot of what I'm discussing in here in a little more detail.
It's also being discussed in the SNIA NVM TWG, the Non-Volatile Memory Programming Technical Working Group.
And Open Fabrics has shown some interest in
surfacing APIs for this. And here are some links to resources when you see the presentation
online. These two are really cool. This was Neal Christiansen back in January of 2016, eight or nine months ago. Neal Christiansen gave a talk just after Jeff Moyer from Red Hat. Neal is my colleague at Microsoft; Jeff Moyer has a similar role at Red Hat. They could have
given each other's presentations. They could have swapped decks. It was the funniest thing.
They talked about the same architectural problems, the same architectural solutions. It was really
astonishing. Everybody was really surprised by that. But check those two out. They went
back to back and it was quite an experience. All right, questions if any time remains.
We chatted last year even about push versus pull and the resources that are tied up with one versus the other and the buffers.
Right.
The timeline associated with that.
Have you explored that at all?
Have I explored the timeline of buffer ownership or buffer management across the push versus
pull mode?
Not a lot.
It is a concern. It's kind of an application-dependent
concern though because it's sort of like the working set of the application, right? How
many of those buffers need to be kept around, right? And how many recalls have to occur
for instance, right, to manage limited space. So I believe there's actually
several layers of issues. There might be limited map space in the processor of the target.
There might be limited memory handles and resources within the NIC of the target. There
may be other memory pressures from for instance the cache manager in the target. All these
things are going to impact how many mappings can be held
and how long they can be hung out. So I don't think until we really start to see those workload
profiles that we can answer that question.
For next year.
It's a new paradigm, right?
Yes.
Yeah. Yeah? Have you considered in the design the impact of the server having to recover from
or prevent the interaction with a buggy or malicious client?
Because it seems like there were several aspects of this that would be required to interact.
Have we thought through server protection of its data and its actual data paths with respect to malicious clients?
Some, some.
First thing: because we use SMB3, we have an authenticated attacker.
That automatically limits the scope of the vulnerability, okay?
We have a special class of threats called authenticated threats, right?
So that's number one.
We have the RDMA layer protecting us to a certain extent. The server can manage things. If the client attempts to guess addresses, the RDMA layer will close the connection. That will be detected by both sides. Third,
because it's an authenticated access to this mapping, the client has access to all the
data that he's able to write. He can't write data that he's not allowed to write. He can only write data that he was allowed to write. So he's literally
only stepping on his own data. He can't damage somebody else's data. He was given, you know,
he had the authority to damage the data. He went ahead and did it. So primarily, the threat
is that he can hog up the bus, you know, the physical bus and things like that. And we
can mitigate that with local, you know, NIC-based QoS, things like that, we believe. That's where
we have to think it through, right? That's why I said this is part of that congestion
question, I think, in many ways.
Well, there's no explicit congestion in the RDMA layer.
The RDMA layers depend on transports that have congestion control.
iWARP uses TCP, RoCE uses the InfiniBand, you know, transport.
They all have a certain level of congestion control.
But that's transport-level congestion control, not necessarily RDMA-write-level congestion. So it solves a slightly different set of problems.
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe
at snia.org.
Here you can ask questions and discuss this topic further
with your peers in the developer community.
For additional information about the Storage Developer Conference,
visit storage-developer.org.