Storage Developer Conference - #29: Low Latency Remote Storage: A Full-stack View

Episode Date: December 2, 2016

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast. You are listening to SDC Podcast Episode 29. Today we hear from Tom Talpey, architect at Microsoft, as he presents Low Latency Remote Storage, a Full Stack View, from the 2016 Storage Developer Conference.
Starting point is 00:00:46 So, hi, I'm Tom Talpey. I'm an architect with Microsoft. You have probably seen me here before giving a similar presentation. I hope to sort of pivot this one. I've been sort of pivoting this discussion every time. I started with sort of high and lofty goals and then I kind of grounded them in RDMA, grounded them in SMB, grounded them in all kinds of things. So, I've kind of come all the way around to what I might call a full stack view where I want to talk about the components and how they fit together and how we're actually doing something in this area within Microsoft. I'm not making a product announcement. I'm
Starting point is 00:01:26 not really making a whole lot of specific detail here. I have to leave that to others. But I think you'll see that we're beginning to make this a reality. And so I want to sort of draw the stack a little bit. It's still a long ways off, right? I guess I should just sort of, this is a bit of a disclaimer. This is a long process and you'll see there's a bunch of components. It's like three to five years, I think, to realize a lot of this stuff. But the interesting thing is that I'm trying to propose phases in which we can get there. And there are benefits at each phase. And some of these benefits are orders of magnitude. So it's a pretty remarkable journey if we can manage to take it.
Starting point is 00:02:10 All right. Outline. I'm going to give a problem statement which is largely just a review. I'll point to last year's presentation for more detail. Platform support, RDMA support, SMB3 support. We're going to walk the stack, right? And then at the very end, I'm gonna tell you something about Windows Server 2016. Okay, the problem statement.
Starting point is 00:02:32 Storage class memory, this is a Microsoft adopted or Microsoft favored term. Others have used it, some have not. But storage class memory, we intend it as a sort of a general term. It's a new disruptive class of storage. It's storage first. It starts with the word storage. It is storage. It's a non-volatile medium with RAM-like performance. It has low latency. It has high throughput. It has good density, high capacity, all right? It resides on the memory bus. Therefore, it's byte addressable.
Starting point is 00:03:02 It has byte semantics. You can read and write a single byte or maybe a cache line, but something really small. It also has block semantics when you layer a driver on top of it and that's how Windows deploys it. There's sort of two personalities to this type of device in Windows. It can also reside on the PCI Express bus. It usually has block semantics. You think of an NVMe device, but a lot of NVMe devices, for instance, Stephen Bates yesterday was describing his company's device has a memory bar as well as a PCI Express IO bar. And so it, or an IO personality, I guess I should call it. And so it behaves also in two similar ways, but predominantly block semantics when it's on the PCIe bus. And to storage class memory, I add remote access, right? I'm a file system developer. I'm a remote
Starting point is 00:03:53 file system developer, SMB3 and SMB direct. That's what I do. So local only storage lacks resiliency. It really doesn't count in today's world to make one copy of your data. Everybody brags about it. Everybody brags about replicating locally. It's still one copy of your data. If your system goes down, both copies are lost. So it's required to have resiliency in the modern world for storage. New features are also needed throughout the storage network and platform. So those are things we have to add to accomplish this and that's what I'm talking about. So, we're going to explore the full stack view. Quick review, in the 2000 timeframe, we had HDDs, 5 millisecond latencies. 2010, SSDs predominated.
Starting point is 00:04:38 They have 100 microsecond latencies. That's a 50x improvement in 10 years. Just want to remind people of that. 2016, now it's the beginning of storage class memory. Less than one microsecond local latency, less than 10 microseconds remote latency because of the network round trip, right? 100x improvement on top of where we were. So it's a 5,000x change over 15 years.
Starting point is 00:05:03 5,000x. When was the last time you saw 5000x? Storage latencies and the storage API. This is interesting because the API is driving the adoption of storage class memory. HDDs were very slow. You would always use an async programming paradigm, right? You'd launch an I.O., you'd block, you'd wait for the I.O. to complete. SSDs were closer but you'd still, we still use the traditional block API, right? We perform a read, it interrupts us, we get the answer.
Starting point is 00:05:38 Storage class memory starts to push this line back. The latencies have improved. Doug Voigt has a presentation like this. But the API shifts when latencies get down to a certain point. It's better to pull, better to wait than to block. And so, you know, ignoring that, the 500X reduction in latency in IOPS changed the behavior where applications suddenly program differently. Applications finally to take advantage of this device have to change their API. Now, there are some APIs that we use today and applications use today, but that's going to drive this adoption, all right? And so, eventually eventually we'll get down we'll get SCM down to DRAM speeds and we'll never use async We'll do loads and stores There are local file systems on top of this device that are exposing it to applications today
Starting point is 00:06:36 The first and foremost is called DAX. It's the same thing almost the same thing in both Windows and Linux. It stands for direct access, direct access file system in the sense that it exposes a file or a mapped API, right? It has sort of two personalities. You can open a file, read it, and write it like you normally do, or you can open it, map it, and load and store it, and then commit it like you do. Windows and Linux implementations are very similar, and you're going to see a lot of applications moving to a memory mapped API to take advantage of these things. NVML is an explicit non-volatile memory aware library. Andy and others have been talking about this for years. It's real, it's open source, included in Linux and in the future it will be included in Windows.
Starting point is 00:07:19 Chandra, my colleague, had a presentation yesterday afternoon about this along with the Intel developer, Paul Luce. Specialized interfaces are also interested in this, databases, transactional libraries and even language extensions. So these are other ways that storage class memory will appear. But we have to take these APIs remote, right? As I said, local copy doesn't count. You have to replicate in modern, in the modern I said, local copy doesn't count. You have to replicate in the modern world. One copy just doesn't count except for temp data. That's about the only thing I care about. So, we need a remote durability guarantee to do this. We need to
Starting point is 00:07:56 be able to make it durable locally, which the API supports. We need to push it to a durability guarantee remotely. And so that remote good durability has to happen all the way through the network. At the platform and the device, we need to be sure it's durable. Across the network, we need to be sure it arrived intact and durably. And then the storage protocol itself
Starting point is 00:08:17 needs to reflect that back. So I'm going to walk the stack to show you what I think the world will look like. You can feel free to interrupt me for questions. I predict we'll have a full 45-minute segment, though. Consistency and durability platform. Okay, so we're going to start at the platform, and I just want to say a few things about the platform.
Starting point is 00:08:49 I can only wish that the platform does things, right? I don't build hardware. But I can at least tell you what I wish and what I see. Okay, I mentioned this a minute ago. We need a remote guarantee. Is that in the right spot? Yeah, well, let's call the RDMA protocol the platform. Sorry, this one's a little bit out of order, but it's not out of place to talk about it. The RDMA protocol is what we see to carry data from the originating node to one of the
Starting point is 00:09:23 replica nodes, all right? And the RDMA write alone is not sufficient to provide this semantic. Adan went through this with his ladder diagram. But historically, the completion means only that the data was sent, all right? In Rocky and InfiniBand, it means a little more. It does mean that it was received in Rocky and InfiniBand, it means a little more. It does mean that it was received in Rocky and InfiniBand. But at the verb level and IWARP, it only means that it was sent or not even that it was sent, that it was accepted for sending. Okay. So it's a very weak guarantee but it's what RDMA has provided. Some NICs do give a stronger guarantee but they never
Starting point is 00:10:01 guarantee the data was stored remotely. This is where I completely agree with Adam, right? It really doesn't matter if it made it to the remote NIC. It only matters when it makes it to remote durability. Processing at the receiver, additionally, was a problem. If you could say, well, when I see the packet arrive, if I take an interrupt, I know it's there. Well, no. Things can be reordered and lots of funny things can happen. You need actually a completion. You need to actually take a completion in the RDMA world. And all these things add up to latencies. This is our argument for requiring to extend the RDMA protocol.
Starting point is 00:10:40 Once we extend the RDMA protocol, we actually need a platform-specific method to implement the commit. And there's actually no PCI commit operation, right? PCI writes or fire and forget. They're called posted. You post the operation. It behaves just like an RDA may write. You post it and some hardware accepts it to put it there, but there's no completion. There's no acknowledgement that the data actually arrived at the end completion on the PCI bus. So that's bad. We go through all this trouble to implement the RDMA protocol,
Starting point is 00:11:11 and now what does the NIC do when he's plugged into the PCI bus? So I believe that down the road, there will be a PCI extension of some kind to support commit. I'm hopeful that it's going to be a very simple extension. Something like, well, I think I mentioned it in a minute, but some very simple extension. But there are, in the meantime, workarounds. And I'm going to mention a couple things from my friend Chet Douglas.
Starting point is 00:11:38 But the idea is to avoid CPU involvement. And the workarounds largely require CPU involvement. So this is one of Chet's diagrams that I swiped. It's about ADR and it draws this little red box called the ADR domain. And the idea is that you can form an operation that makes it into the ADR domain and is therefore persistent. ADR is a hardware domain, right? It requires a specific motherboard facility with a super cap or a special power supply or something. But it is a solution that's available today. There's a couple of glitches. It doesn't include the cache of the processor. It doesn't include
Starting point is 00:12:23 caches that appear down here. And so the workarounds all involve either bypassing or writing through these caches. And they get a little tricky. They're expensive. There are other workarounds that basically push the data to various places and then do an operation that signals things to go where they go. Like you notice this one on the right has no ADR domain, right? And so you can go look at Chet's presentations. But by sending messages or performing other operations, you can push the data out of these caches and into the non-volatile domain.
Starting point is 00:13:01 These are all very expensive, right? They are required today today so it's reasonable to consider them. But in the future, we don't want them. And I would like to suggest that we may not actually need them if we carefully leverage our existing protocols which actually implement many of the messages that are proposed down here. So I'm not disagreeing with the workarounds, but I'm saying the workarounds can be implemented in today's environment without really radically changing the environment while we wait for this PCI commit and this RDMA right or this RDMA commit. And so And so, what I'm going to explore today a little bit is SMB3 push mode.
Starting point is 00:13:49 And I talked about push mode last year very generically. I actually talked about push mode in three different protocols. I talked about it in SMB3, NFS, and iSCSI, iSCIR. Push mode, Idan mentioned this earlier, is required, I think, to overcome the latency issue. A round trip, well, a message round trip, an actual send-receive or some sort of acknowledged operation at the upper layer will add significant latency through both the network round trip and the interrupt completion handling
Starting point is 00:14:25 at the server, right? We just can't afford it. The speed of memory, the speed of storage class memory just doesn't allow us that. We have tens of microseconds, 30 to 50 in today's implementations, sort of best case round trip latency at that layer, whereas we can get sub 10 microsecond latency at the RDMA layer. So by push mode, what I mean is an operation which, if you will, bypasses the upper layer. And so first, I'll talk about the traditional upper layer commit, how we commit data. And it's basically a file operation.
Starting point is 00:15:07 We open the file, we obtain a handle and a lease, for instance, in SMB. And we perform a write, we can also perform a read. There's a bunch of dispositions on the write, but the server performs operations, either to a PMM file or an ordinary file. And then from time to time, the initiator, the client, will either perform a flush or just close the file, right, which implicitly flushes. And that flush is an explicit operation that crosses the wire and requests the server to make the file data durable, right, which on a disk drive means, you know, flush it to the disk,
Starting point is 00:15:46 put it in the file system, do whatever you have to do at the file system layer to make it safe. But on a raw PMEM device, it would actually perform a local commit. It would perform one of these local durability operations in response to a flush. And that's just what the DAX file system does. There might be some recalls which are interesting. Lots of other little details that I'll mention in a minute. I'll sort of dig into in a minute. But the advantage is it's a traditional API. It's fully functional today and it literally just works. Literally just works.
Starting point is 00:16:21 The only cost is you have to take interrupts. You have to run that server CPU. So it does cost you a bit of latency. The existing latencies are all still there. Push mode with explicit commit. This is where we really wanna get. And if you will, the idea is to tunnel below the file system layer, to tunnel below SMB.
Starting point is 00:16:51 Instead of performing an SMB write, we're going to perform an RDMA write below it. We're going to set up the I.O. using SMB, but we're going to perform the I.O. with an RDMA write. And we can do that when we have a PMEM array over here on the side. So once again, we would use SMB3 in this case to open a DAX-enabled file, one of these direct access files. It's literally persistent memory mapped into the host's address space. So when you write it, you're just loading and storing in memory.
Starting point is 00:17:27 Then, instead of loading and storing from the processor, we're going to load and store from the NIC. We're going to perform RDMA writes to push. We can also pull. Pull is kind of rare. Happens during rebuild. Happens with some weird application things. It's primarily about write.
Starting point is 00:17:44 And when we do write, we do a commit. We do one of these RDMA commits, which is this guy with the little diamonds on it. So we do write, write, write, commit, write, write, write, commit, write, write, write, commit. And the commits would basically serve the function that a flush does at the SMB3 layer,
Starting point is 00:18:03 but only for one of these DAX-enabled files where we're actually writing the memory, right? Otherwise, we have to signal the CPU. So that's the rough schematic of how it works. There's a little bit more about leases and things. Let's drill down. So push mode. First, you have to open the file. This is called SMB2 create the operation. Create creates a handle. It doesn't create a file. That's different. You create a handle. And if that results in creating a file, that's a disposition which is a little more involved. In any event, opening the file does a lot of things. You've created a connection.
Starting point is 00:18:47 You've created a session. You've authenticated yourself to the server. You've requested to open the file so the server checks authorization based on your credentials. All that good stuff has been done by the upper layer, which is really convenient, right? That's a lot more than you get
Starting point is 00:19:03 from some low-level block driver, for instance. How is the block layout? Yeah, well, actually, this isn't the block layout yet. That's a different layer of SMB, which is a bit of an extension. But it's just like an NFS v4 open or an SMB2 create, all right? So once you send a create, in SMB2 you decorate it with things called create contexts. And create contexts are these little blobs that accompany
Starting point is 00:19:33 the create to tell the server how to handle the create. And I hypothesize that there may be some sort of push mode create context that said, I want to open this file in push mode create context. It said, I want to open this file in push mode, right? Alternatively, there could be an iOctl later. SMB2 supports iOctl, it's called FS controls on an open handle. So the idea is to set up push mode. And an important side effect of this is that if you're going to lock the file in memory and read it and write it, you probably need to kind of own that lock, right?
Starting point is 00:20:09 And that's something we call a lease in SMB2. So I believe that it's in conjunction with requesting a lease. And by establishing the lease in push mode, we can establish that mapped file for some range of the file, perhaps the whole file, lock it in memory and treat it as a PMEM segment or scatter gather. So we return the create context to indicate push mode is available. There's some acknowledgement back from the server that push mode is available. All right.
Starting point is 00:20:37 So the lease provides us with other things that allow us to recall and manage the mapping. All this happens at create time. This is like one operation. It's just one big decorated create. After that, the regions may be remotely written and possibly read, okay? So that requires a RDMA registration, all right? This is where we would clearly have some sort of FS control. The client would say, I want to read and write this region of the file or perhaps the whole file, right? So the file is in memory ready
Starting point is 00:21:16 to map. Now it needs to be registered so the RDMA layer can access it, right? And we do that today in SMB Direct. We do it actually per I.O. in SMB direct. For each read or write, the server will, the client will register his memory. The difference is we're doing it here on the server side. So we sort of reverse the polarity of the operation. Anyway, I suppose that it might be at least an offset, a length, and a mode. I want to act from zero to infinity write only, for instance. The server would pin the mapping, register the pages, and return a region handle, maybe multiple region handles.
Starting point is 00:21:56 The client could request multiple regions, but the result is that he'll be able to remotely read and write the file. Recall is important because the server may have to manage the mappings. The client will begin to perform its reads and writes directly within the file, entirely within the RDMA layers. There's no server processing, only the NIC knows that it's happening. If there's a commit operation available, the client commits via RDMA. Otherwise, the client commits with flush, right? If there's no commit, the data's there and ready to go.
Starting point is 00:22:36 The client just has to tell the DAX file system to commit so it can send a flush operation. So this is kind of this weird hybrid, right, where you might be able to push the data with RDMA but you might have to ask the CPU to commit. And that could be a win if you're merging a lot of RDMA writes. The client may periodically update the file metadata. There's an ioctal file basic information is basically like NFS set at or it sets the attributes of the file. So for instance, if the client is writing the file and the server CPU doesn't know it, how does the timestamp get updated, right? It's the client's responsibility to do that at this point.
Starting point is 00:23:18 The flush can do it. That might be an advantage. But the client can also do it with one of these things. And there's a few things that the client can't do that he must use SMB4. And that is adding blocks, appending to the file, for instance, punching a hole in the file, right? You've got a mapping. That's all you've got. You can't add to the mapping without changing the mapping.
Starting point is 00:23:44 You can't punch a hole in the mapping without, you know, requesting that that be done. So a lot of these things will definitely be done with traditional SMB3 operations. So you can see a mix. You can see the data transfers being done via RDMA and you can see the metadata operations being done with SMB. And now here's the recall. I started to talk about this at the beginning of the last one, sorry. But the server may have to manage the sharing and this registration. And the client, we use the, I propose that we use the lease to do this. It would be recalled upon sharing violation if another process wants to open the file. That would be an ordinary SMB3 level sharing, right? One said, I want to the only writer of this file, and another one comes in.
Starting point is 00:24:25 So it recalls. File doesn't have to move, but the permission has to change. Caches have to be flushed, God knows what. There might be a mapping change. Maybe the CPUs maps are full and we have to rearrange things, and there might be a mapping change. So the file system may request that the data be unmapped and remapped in another location or maybe rearranged due to bad blocks or
Starting point is 00:24:50 I don't know what, some platform specific event occurred. A registration changed. For instance, let's say the server started to run low on RDMA resources and said, I gotta throw everybody out and rebalance my resources, that kind of thing. So this recall is very useful for that kind of server callback to the client. And when recalled, the client performs the usual. If it has any dirty buffers on its side of the wire, it flushes them. It probably commits them because usually it cares about the durability. It returns the lease and probably goes back and gets another lease or maybe it's happy with the new lease level.
Starting point is 00:25:25 Maybe it got dropped down from read-write to read-only and maybe it's only reading. So he might be okay with that. But that's standard lease recovery. Once again, that's a sort of an ownership or a sharing metadata operation. That didn't advance. There we go. And push mode, when you're done, you're going to probably close the file. You will return the lease as part of closing the file.
Starting point is 00:25:54 So that's the SMB3 level operation. There may be the push mode registration that needs to be cleaned up. There might be an explicit FS control for that. It may just happen automatically on close. Well, it'll certainly happen automatically on close. But there might not be another way. You might have to close the file. I don't know.
Starting point is 00:26:12 This is a semantic that we need to think through. What's the most useful application behavior for this type of push mode? I think this depends on the way we decide to enable push mode. If we do it with a create context, it probably happens at close. If we do it with an FS control, it probably happens dynamically or explicitly. I want to point out that push mode registration needs to be reset during handle recovery. SMB3 has a feature called CA, continuous availability. And continuous
Starting point is 00:26:42 availability allows the server to hold on to handles when the client disconnects. And continuous availability allows the server to hold on to handles when the client disconnects. And the client can come back with exactly the same state he left in. It's really important for enterprise applications like Hyper-V, virtualized environments, right? If they get migrated, they need to connect and they need to pick up exactly where they left off with no loss, right? We want this to be completely transparent to the application. So the push mode registration, let's say it moves, you
Starting point is 00:27:09 know, the client moves and has to reconnect to the same server. That registration goes away and has to be rebuilt. So it must be reobtained after handle recovery. You'll get a new handle. If you try it on the old handle, well, it will fail because the connection was torn down. So that's good. RDMA protection helps us. All right. So that's the quick trip through SMB3. Now, a quick trip through the protocol extension. This is going to duplicate some of what Adan was discussing. It's, you know, I'll say it in my own words so maybe I'll have something to add, but I might skip over a few things here. Doing it right. To do it right, I think we have to do it right in the RDMA layer.
Starting point is 00:27:55 We need a remote guarantee of durability. We want it to be highly efficient, and we want to standardize it across all RDMA protocols, right? The protocols themselves all have their own quirks and implementation details. That's fine. But we want the operation to be similar across all RDMA fabrics and all RDMA protocols. We have that today for storage, right? SMB3 runs equally well over IWARP, Rocky, and FinnaBand. You name it. It does so because it requires very little of the fabric. It wants a send or receive, an RDMA read, an RDMA write, some ordering guarantees, some connection management, IP addressing. That's about it.
Starting point is 00:28:38 It doesn't need the fancy stuff. This is another element of the fancy stuff list. And it's really important that it not be proprietary or it not be specific. It can be extended but that core behavior needs to be countable by all the storage protocols that need it. So my concept and, you know, it's being discussed in the internet draft and in the IBTA and by a lot of people. I've talked about this before.
Starting point is 00:29:09 There's probably nothing new here to people who've seen that. Is that there's an operation that I call RDMA commit. It's also been called RDMA flush. It's also been called, I don't know what. It's a new wire operation in my view and a new verb. It's only one operation and it's only one verb, okay? It's a very simple point, you know, enhancement. You can conceive of other natures to it but I view it as one operation, implementable in all the protocols. I believe
Starting point is 00:29:38 that the initiator provides a region list and other commit parameters under control of a local API at the client. And I'll point to the SNEA NVM technical working group, the non-volatile memory technical working group, as leading the way in that type of API. We have this thing called optimized flush. And optimized flush has a bunch of parameters that I think should map directly here.
Starting point is 00:30:03 But, you know, I'm open to discussion on that. The receiving RNIC has to queue the operation in order. I think it behaves like an RDMA read or an atomic. It's subject to flow control and ordering. Very important. Both of those things are very important. Ordering because you need to know what data is becoming durable.
Starting point is 00:30:21 And then flow control because it's going to take time. It's kind of a blocking operation on the wire, right? When pcommit was still on the list, pcommit was a pretty heavyweight operation. pcommit may not be required for some solutions now. Maybe it's just a flush, but it's still a flush. There's still an end-to-end acknowledgement within the platform. So that blocking operation is going to require flow control. The NIC pushes the pending rights, performs the commit,
Starting point is 00:30:49 possibly interrupting the CPU. I wanna talk about that with respect to the API. And the NIC responds only when durability is ensured for the region the client wanted. There's some other interesting semantics though. I think one of the key scenarios is called the log pointer update, where you write a log record, you make it durable, then once it's durable, you write the log pointer and make it durable.
Starting point is 00:31:14 It's like two durable commits. One is a little bit bulk, right? Might be 4K data. The other one is literally a pointer, 64-ish, bits-ish, right? And so I believe that making that two operations on the wire might be expensive and might lead databases, for instance, to think twice about using it. But I think if those two operations can be merged in some clever way, and I have some cleverness in my Internet draft, that can be a big benefit for this thing, for the adoption of this thing. I also think it would just be basically useful.
Starting point is 00:31:48 Second, you may need to signal the upper layer, right? You may need to tell the upper layer that something happened. We have a scenario in database server that I have in mind where the log is replicated and safely replicated. But the idea of replicating it is for quick takeover, right? Should one of the database instances fail, the other one should be ready to take over. So you want to replicate it, but you want to notify the peer
Starting point is 00:32:17 that there's dirty log data. And so the peer can actually stay very close behind you. Not synchronously, you don't want him to actually replay that log, but you want him to know that there's dirty data to be replayed so he can keep close. And signaling him either with a little tiny gram or a whole message is a really interesting idea. So if you put these two things together I think you have some interesting, you know, merged semantics. This one I'm really interested in, an integrity check.
Starting point is 00:32:48 If I'm writing data, how do I know it's good, right? I called commit and he said it was committed. How do I know that it's really good, right? We have all kinds of storage integrity at the upper layer. Can we do this at the RDMA layer? I don't know. I don't know exactly how. Does the NIC have to read it back? Is there some other way to do this? But an integrity check is right up there on my list. The choice of them will be driven by the workload. We have to have this dialogue with the applications. We can't just sit here and say, I think I have a good idea. We need to motivate this by some request or some need. And finally, the expectations of this latency are that push mode will land much less than 10 microseconds.
Starting point is 00:33:31 Idan mentioned it, right? He said there's 0.7 microseconds on the wire. There's no way push mode's completing in 0.7 microseconds. There's more overhead on that thing. But it's certainly 2 to 3 microseconds, which is an order of magnitude, maybe better than an order of magnitude, better than we get for a write with a traditional storage RDMA transport today. Like I said, most of them are in the 30 to 50 microsecond, okay?
Starting point is 00:33:56 So that's huge, right? An order of magnitude is always something to go after with everything you've got. Remote read is also possible. So I was asking why would you need this? There aren't a lot of scenarios that need it. That's why I sort of put it as a footnote, right? But it is interesting, certainly for rebuild, right? You lost a copy, you got to read that copy over the network, just suck it in, right?
Starting point is 00:34:24 The reason we get this is that there's no server interrupt. There's a single client interrupt. We can even optimize it when we have multi-channel and flow control and things like that. We know how to do that. That's local. That's local magic, right? We can all innovate in ways that are important.
Starting point is 00:34:41 All right. Push mode considerations. I've got a list of interesting things. Then I got a quick update on Windows Server 2016. I called these fun facts when I talked about this a couple of months ago at Samba XP. I've decided to name each fun fact a little more meaningfully. The first is buffering. This is true certainly of Windows and I believe of Linux. If you open a file in buffered mode on a server with a persistent memory device,
Starting point is 00:35:15 buffered mode has a new meaning. Buffered mode means that you've actually mapped the device directly when you open a file in buffered mode. It used to be that the buffer was this bounce buffer in between you and the device, right? You would write to the buffer, and the server could acknowledge that quickly, and then lazily it could write it out unless you flushed it.
Starting point is 00:35:37 Buffering means the opposite with PMEM. Buffering means you actually own the device right now, at least if everything's working well. There might still be a buffer in there, but there's no reason to buffer just to drop it back in memory, right? So the idea is to put it directly in the device. So that's really interesting, but it enables both load and store and RDMA read-write when you have this, right?
Starting point is 00:35:59 And so it's pretty interesting that you sort of want to invert the meaning of buffered and unbuffered. It used to be unbuffered kind of went faster because it went straight to the device. Well buffered went faster but you didn't get the guarantee. So it's kind of the opposite now. You want buffered. So the server can actually hide this from you, right?
Starting point is 00:36:20 And that's what I'm going to talk about in a minute. We experimented with Windows on this. Recalls. NFS v4 has this. SMB has this. Push mode management requires new server plumbing. The up call from the file system for sharing violations remains the same. But relocation, file extension, all these things,
Starting point is 00:36:42 these are new up calls that don't come from disk drives, right? And so the server has to be prepared for some interesting behaviors. And RDMA resource constraints are another reason for recall. Now you're explicitly giving RDMA resources on the server to the client. None of the protocols today, including NVMe over Fabric, exposes the server buffer. So when you go to push mode, you're going to have to manage those resources. So this is really important. You're going to have to change the plumbing of your server. So an implementer will have to move toward this over time. I don't know what all these recalls will be just yet. But the idea is
Starting point is 00:37:19 I'm in trouble. You need to let me fix this thing and then we'll get started again. Congestion. This is the thing I'm most, most worried about. The big thing about storage protocols in write, write operations in storage protocols today is that they natively, naturally flow control themselves, right? You send the write to the server and it's just a request to push data, right? The server says, okay, I'll take that request and it pulls the data, right? And what that means is that the server schedules that pull, in particular when it has the memory to receive it and when it has the I.O., you know, the I.O. has reached the top of its queue basically, right? And so it flow controls the network very nicely as well as flow controls
Starting point is 00:38:03 the queues on client and server very nicely. When we go to push mode, we're going to congest all over again. We're just jamming data down the server's throat, right? The software doesn't have to deal with it. It's out of the loop. But the RNIC, the network, the interface, the DRAM bus, all this stuff is going to have to deal with it. I'm really worried about that.
Starting point is 00:38:23 RDMA itself does not provide congestion control, certainly not for RDMA rights, right? They are unconstrained. You can send as many of them as you want. You can fill the wire from one note. But congestion control is provided by the upper storage layers. And there are credits in SMB3. There's also something called Storage QoS.
Starting point is 00:38:44 My colleague Matt Kujanowicz gave a presentation yesterday about this and their use in Hyper-V. So I'm going to point out that there are existing client-side behaviors, such as this QoS, that can mitigate this problem. And we're going to need to think those through because I think they're going to be really, really, really important. And ARNICs can help. More thinking is needed.
Starting point is 00:39:04 We're going to have to solve that question. How is that congestion at the network or at the end node? Both. Absolutely both. Primarily the end node, right? But the switches, the exit port from that fan-in congestion. I mean, we see this every day in any sort of scaled network deployment. And integrity, I mentioned this before, but, you know, data in-flight protection, well, the network provides that. Data at rest, that is not provided by the network. How do we do that?
Starting point is 00:39:37 SMB3 has a means, but we're not using SMB3 to transfer the data in push mode. So I think we're going to need to think about remote integrity as time goes on here. I don't think people will be willing to push their data quickly only to discover that it didn't quite make it the way it was sent. That's usually not a good way for a storage provider to make money. All right. And to wrap up, I have a few slides on where we're at today in the Windows implementation. Yeah, sorry, yeah. Yeah. I want them to be ordered with respect to one another, and ideally I would like to issue
Starting point is 00:40:35 them without two round trips. That's a subtle difference, but... No. No. There would actually be the flushes? No. No. There would actually be two flushes. Well, you could do it that way. You could pipeline them. Well, you could pipeline them. That's one approach. You could have an optional payload
Starting point is 00:40:59 in the message, a small 64-bit payload. That's what I explored in the Internet draft. Because I think 64 bits is enough for the scenarios I have in mind. Well, could you do it as one operation where the log update effectively carries a commit list which forces commits for all the previous log writes? Tons of ways to think about this. That's a protocol implementation question.
Starting point is 00:41:23 But the idea is you have one probably bulk piece that's placed and made durable followed by another which is placed and made durable only after the successful placement and durability of the first piece. That's the rule, right? I can't update the pointer until the log is safe and I can't write the pointer until the log is safe and I can't write the pointer until the log is underway. But you mentioned that you don't want it to be two operations, you want it to be one. Ideally I want to keep the latency minimal.
Starting point is 00:41:55 If I have to do two operations, so be it. But I'm going to point out that maybe it's easy to, as long as the semantics are well defined, to merge them. Right, a lot of upper layers will have things like compound or fused or whatever operations where you can send more than one operation. The important thing is that the two operations are ordered with respect to one another. If one fails, the other one cannot proceed. That's what's critical. Yeah. Oh, I'm sorry.
Starting point is 00:42:45 Repeat the question. And I'm saying that smashing operations together will have a cost and we'll have to measure that cost, right? Maybe it will be too expensive to do it in the NIC. Maybe the protocol will change in some ungainly way if we try this. And so I agree with that, right? We need to think it through, right? If the cost outweighs the benefit, forget it. Benefit outweighs the cost, oh, now we're
Starting point is 00:43:11 talking. Okay, Windows Server 2016. Tech Preview 5 has been available since this summer and general availability is imminent. I'm not here to announce anything. Oh, I have more time than I thought. Good. Just a couple of minutes. It's imminent. It's very soon. I'll give you a hint. We have Microsoft Ignite in Atlanta next week.
Starting point is 00:43:34 But we'll see. The TechPreview 5 and Server 2016 GA both support DAX. DAX is really great. And Remote SMB works over DAX, I just want to mention. It's a file system. It's exportable. It works great. And so we get all the expected performance advantages, right?
Starting point is 00:44:00 DAX is great because there's no media. It's just memory, right? And it supports all the reads and loads and stores and commits that we expect from a memory based file system. It's basically a RAM disk, right? It has a block semantic and we put a file system on the block semantic and we open files and we read them and write them. And if we map them cleverly, we avoid data copies, which is really cool. However, Windows Server 2016 Tech Preview 5, do not implement full push mode. Full push mode, as you can see, has protocol implications. Generally speaking, when we develop Windows
Starting point is 00:44:37 Server, we spend a year in design, a year in development, and a year in test. I can't, like, slide a new feature in that last year. It's very difficult. If it wasn't committed way up front, it ain't going in. And a new protocol change, forget about it, okay? So just damping expectations perhaps a little, it's not going to have push mode. I hope it will in the future when we have a new release vehicle that might let me do it but not yet, okay? However, so there's no reliance on extending the protocols that we discussed. We did, however, consider the direct mapping mode.
Starting point is 00:45:10 And that was that picture where I showed the DAX file system in that little dotted line that went from the PMEM straight to the file's buffer cache copy. And that direct mapping is really interesting, because the Windows SMB server basically can operate on that mapping, right, in a defined way, in a file system managed way and can read and write those pages directly out of the PMEM, right? And so we thought to ourselves, why don't we just RDMA straight out of the buffer cache,
Starting point is 00:45:41 right, which would effectively RDMA straight out of the PMEM device. And we learned something very surprising. This is kind of small, and I'm sorry. But basically, these green boxes of IOPS and latency are basically within 1% of one another when we tried the two models. We tried the traditional model where we DMA'd to the buffer cache and then copied the buffer cache to the two models. We tried the traditional model where we DMA'd to the
Starting point is 00:46:05 buffer cache and then copied the buffer cache to the PMEM or when we mapped the PMEM into the buffer cache and DMA'd directly there. On the right is that second model, RDMA direct in quote buffered mode, air quote buffered mode. And on the left, RDMA traditional unbuffered mode where we do the data copy. We got the same darned result. This was a 40 gigabit network so basically the bandwidth was almost completely, you know, occupied. But the latency didn't budge, right? It took the same length of time, processing was the same length of time on the server. So that was pretty surprising. We spent a lot of time and we do understand that the reasons go kind of deeply.
Starting point is 00:46:46 And DAX delivers. This performance was considered quite good. This is actually not the most interesting platform. We have a much more interesting platform that we're going to get some really neat results from soon. This was done back in May, June time frame, I think. There's no advantage. And so we basically decided not to do it. We basically stepped back and we said, okay, we're not actually disappointed here, right? It was a good idea, but it isn't ready in the code.
Starting point is 00:47:20 We basically discovered that memory manager and cache manager, these are MM and CC and Windows, are not quite ready to do it at this IOP rate, in other words, to map and unmap pages from a file at this kind of rate. They're used to doing persistent mappings or long-lived mappings, something that lasts for seconds, minutes, whatever, not a few microseconds, right? And so they need some work or some sort of persistent FS control that just makes one mapping and hangs on to it for a long period of time.
Starting point is 00:47:54 By not changing anything, stability and performance are maintained, right? But we can improve this in a future update. And so I'm going to beg you to watch this space. There's a lot of horsepower under the hood here that we haven't lit up yet. So if you happen to have Windows Server 2016 or Technical Preview 5, I just want to mention that if you stick an MVDM in it and if you stick an RDMA NIC in it, and you can choose any type you like.
Starting point is 00:48:27 We support Chelsea on Melanox primarily. There's a couple of others available from vendors who've built it and certified it for Windows. Configure a DAX file system, which is basically format your thing slash DAX. Here's a little link to help you learn the new format option. Create an SMB3 share and you're off to the races. So it's literally out of the box ready to go. You just need two pieces of hardware, a DIM and a NIC. And finally, external RDMA efforts. We mentioned this.
Starting point is 00:49:08 The requirements and the protocol need to be done in a very broad way through standards bodies, right? We need everybody to know what to do and to be able to build interoperable and useful implementations. So this moves to various standards organizations. I didn't actually mention the PCI Express folks here, but the PCI SIG is an important one as well. IBTA LINC working group, specifying InfiniBand and Rocky
Starting point is 00:49:35 protocols for this. The ITF Storm working group, which unfortunately, I was the co-chair of it, has recently closed. We basically completed all our work. And this train didn't arrive at the station early enough to save it. So it unfortunately closed. The working group is actually still there. It's just inactive, closed. So the mailing list is still there. And I submitted this internet draft to it. This draft needs to be updated. My co-author
Starting point is 00:50:02 helpfully left the company. So I have to kind of pick up a flag over there. But there's a lot of what I'm discussing in here in a little more detail. It's also being discussed in the SNEA NVM TWIG, the Non-Volatile Memory Programming Working Group. And Open Fabrics has shown some interest in servicing APIs for this. And here are some links to resources when you see the presentation online. These two are really cool. This was Neil Christensen back in January of 2016, eight or nine months ago. Neil Christensen gave a talk and just after Jeff Moyer from Red Hat, Neil Christensen is my colleague at Microsoft, Jeff Moyer has
Starting point is 00:50:43 a similar role at Red Hat. They could have given each other's presentations. They could have swapped decks. It was the funniest thing. They talked about the same architectural problems, the same architectural solutions. It was really astonishing. Everybody was really surprised by that.
Starting point is 00:50:58 But check those two out. They went back to back and it was quite an experience. All right, questions if any time remains. Chat. . Right. . one versus the other and buffers. Right. Timeline associated with that is, have you explored that at all? Have I explored the timeline of buffer ownership or
Starting point is 00:51:31 buffer management across the push versus pull mode? Not a lot. It is a concern. It's an application dependent concern though. Because it's sort of like the working set of the application, right? How many of those buffers need to be hung around, right? And how many recalls have to occur, for instance, right, to manage limited space. So I believe there's actually several layers of issues.
Starting point is 00:52:02 There might be limited map space in the processor of the target. There might be limited memory handles and resources within the NIC of the target. There may be other memory pressures from, for instance, the cache manager in the target. All these things are gonna impact how many mappings can be held and how long they can be hung out. So I don't think until we really start to see those workload profiles that we can answer that question.
Starting point is 00:52:31 It's a new paradigm, right? Yeah. Have you considered in the design the impact of the server having to recover from or prevent the interaction with a buggy or malicious client? Because it seems like there were several aspects of this that were there. Why did it interact with them? Have we thought through server protection of its data and its actual, you know, data paths with respect to malicious clients? Some, some. First thing, because if we use SMB3,
Starting point is 00:53:07 we have an authenticated attack. That automatically limits the scope of the vulnerability. We have a special class of threats called authenticated threats. So that's number one. We have the RDMA layer protecting us to a certain extent. The server can manage things. If the client attempts to guess addresses, the RDMA layer will close its connection. That will be detected by both
Starting point is 00:53:28 sides. Third, because it's an authenticated access to this mapping, the client has access to all the data that he's able to write. He can't write data that he's not allowed to write. He can only write data that he was allowed to write. So he's literally only stepping on his own data. He can't damage somebody else's data. He was given, you know, he had the authority to damage the data. He went ahead and did it. So primarily, the threat is that he can hog up the bus, you know, the physical bus and things like that. And we can mitigate that with local, you know, NIC-based QoS, things like that, we believe. That's where we have to think it through, right?
Starting point is 00:54:07 That's why I said this is part of that congestion question, I think, in many ways. You're done. So, about the congestion, you said that there is no supposed congestion mechanism in RDMA. Could you elaborate on that? Well, there's no explicit congestion in the RDMA layer. The RDMA layers depend on transports that have congestion control. IORP uses TCP.
Starting point is 00:54:31 Rocky uses the InfiniBand transport. They all have a certain level of congestion control. But that's transport-level congestion, not necessarily these RDMA-right-level congestions. So it solves a slightly different set of problems. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at sneha.org. Here you can ask questions and discuss this topic further with your peers in
Starting point is 00:55:04 the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
