Storage Developer Conference - #55: Low Latency Remote Storage: A Full-stack View
Episode Date: August 17, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 55.
Today we hear from Tom Talpey, architect with Microsoft,
as he presents Low Latency Remote Storage,
a full stack view from the 2016 Storage Developer Conference.
So, hi, I'm Tom Talpey.
I'm an architect with Microsoft.
You have probably seen me here before giving a similar presentation.
I hope to sort of pivot this one.
I've been sort of pivoting this discussion every time.
I started with sort of high and lofty goals,
and then I kind of grounded them in RDMA,
grounded them in SMB, grounded them in all kinds of things.
So I've kind of come all the way around
to what I might call a full-stack view,
where I want to talk about the components
and how they fit together
and how we're actually doing
something in this area within Microsoft. I'm not making a product announcement. I'm not
really making a whole lot of specific detail here. I have to leave that to others. But
I think you'll see that we're beginning to make this a reality and so I want to sort of draw the stack a little bit.
It's still a long ways off, right? I guess I should just sort of, this is a bit of a
disclaimer. This is a long process and you'll see there's a bunch of components. It's like
three to five years, I think, to realize a lot of this stuff. But the interesting thing
is that I'm trying to propose phases in which we can get there. And there are benefits at each phase.
And some of these benefits are orders of magnitude.
So it's a pretty remarkable journey if we can manage to take it.
All right.
Outline.
I'm going to give a problem statement, which is largely just a review.
I'll point to last year's presentation for more detail.
Platform support, RDMA support,
SMB3 support. We're going to walk the stack, right? And then at the very end, I'm going
to tell you something about Windows Server 2016. Okay, the problem statement. Storage
class memory. This is a Microsoft-adopted, or Microsoft-favored, term. Others have used
it, some have not. But storage class memory, we intend it as a sort of a general term. It's a new disruptive class of storage. It's storage
first. It starts with the word storage. It is storage. It's a non-volatile medium with
RAM-like performance. It has low latency. It has high throughput. It has good density,
high capacity, all right? It resides on the memory bus, therefore it's byte addressable.
It has byte semantics.
You can read and write a single byte or maybe a cache line, but
something really small.
It also has block semantics when you layer a driver on top of it, and
that's how Windows deploys it.
There's sort of two personalities to this type of device in Windows.
It can also reside on the PCI Express bus, where it usually has block semantics. You think of an NVMe device, but a lot of NVMe devices, for instance the device Stephen Bates was describing yesterday from his company, have a memory BAR as well as a PCI Express I/O BAR, an I/O personality, I guess I should call it. And so it also behaves in two similar ways, but predominantly block semantics when it's on the PCIe bus.
And to storage class memory, I add remote access, right?
I'm a file system developer.
I'm a remote file system developer, SMB3 and SMB direct.
That's what I do.
So local only storage lacks resiliency. It really doesn't
count in today's world to make one copy of your data. Everybody brags about it. Everybody
brags about replicating locally. It's still one copy of your data. If your system goes
down, both copies are lost. So it's required to have resiliency in the modern world for
storage. New features are also needed throughout the storage network and platform. So those are things we have to add to accomplish this and that's what
I'm talking about. So we're going to explore the full stack view. Quick review, in the
2000 timeframe, we had HDDs, 5 millisecond latencies. 2010, SSDs predominated. They have
100 microsecond latencies. That's a 50x improvement in 10 years.
Just want to remind people of that.
2016, now it's the beginning of storage class memory.
Less than 1 microsecond local latency,
less than 10 microseconds remote latency
because of the network round trip, right?
100x improvement on top of where we were.
So it's a 5000x change over 15 years.
5000x.
When was the last time you saw 5000x?
Storage latencies and the storage API.
This is interesting because the API is driving the adoption of storage class memory.
HDDs were very slow.
You would always use
an async programming paradigm, right? You'd launch an I/O, you'd block, you'd wait for the I/O to complete. SSDs were closer, but we still use the traditional block API, right? We perform a read, it interrupts us, we get the answer. Storage class memory starts to push this line back.
The latencies have improved.
Doug Voigt has a presentation like this.
But the API shifts when latencies get down
to a certain point.
It's better to poll, better to wait than to block.
And so the 500x reduction in latency and increase in IOPS changed the behavior, where applications suddenly program differently. Applications, to finally take advantage of this device, have to change their API. Now, there are some APIs that we
use today and applications use today, but that's going to drive this adoption. All right?
And so, eventually, we'll get down, we'll get SCM down to DRAM speeds and we'll never
use async. We'll do loads and stores. There are local file systems on top of this device
that are exposing it to applications today. The first and foremost is called DAX. It's
the same thing, almost the same thing, in both Windows and Linux. It stands for direct access, a direct access
file system in the sense that it exposes a file or a mapped API, right? It has sort of
two personalities. You can open a file, read it and write it like you normally do or you
can open it, map it and load and store it and then commit it like you do. Windows and
Linux implementations are very similar, and
you're gonna see a lot of applications moving to a memory mapped API to
take advantage of these things.
NVML is an explicit non-volatile memory aware library.
Andy and others have been talking about this for years.
It's real, it's open source, included in Linux, and
in the future, it'll be included in Windows.
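To make that concrete, here's a minimal sketch of the local durability sequence that NVML's libpmem layer exposes; the path and sizes are placeholders, and this is only a sketch of the pattern, not code lifted from the library's documentation.

```c
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Map a file on a DAX-mounted file system directly into our address space.
     * "/mnt/pmem/log" is a placeholder path; PMEM_FILE_CREATE creates it if needed. */
    char *base = pmem_map_file("/mnt/pmem/log", 4096,
                               PMEM_FILE_CREATE, 0666,
                               &mapped_len, &is_pmem);
    if (base == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    /* Byte semantics: this is just a store into mapped memory. */
    strcpy(base, "hello, persistent world");

    /* Make it durable locally: flush CPU caches if this is real pmem,
     * otherwise fall back to msync on the mapping. */
    if (is_pmem)
        pmem_persist(base, mapped_len);
    else
        pmem_msync(base, mapped_len);

    pmem_unmap(base, mapped_len);
    return 0;
}
```

Note that everything here is strictly local: the store and the persist only guarantee one copy on one machine, which is exactly the gap the rest of this talk is about.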
Chandra, my colleague,
had a presentation yesterday afternoon about this along with the Intel developer, Paul Luse. There's also interest in specialized interfaces: databases, transactional libraries,
and even language extensions. So these are other ways that storage class memory will
appear. But we have to take these APIs remote, right?
As I said, local copy doesn't count.
You have to replicate in the modern world.
One copy just doesn't count, except for temp data.
That's about the only thing I care about.
So we need a remote durability guarantee to do this.
We need to be able to make it durable locally, which the API
supports, and we need to push it to a durability guarantee remotely. And so that remote durability
has to happen all the way through the network. At the platform and the device, we need to
be sure it's durable. Across the network, we need to be sure it arrived intact and durably.
And then the storage protocol itself needs to reflect that back. So I'm going to walk the stack to show you what I think the world will look like.
You can feel free to interrupt me for questions.
I predict we'll have a full 45-minute segment though.
Consistency and durability platform. Okay, so we're going to start at the platform and
I just want to say a few things about the platform. I can only wish that the platform
does things, right? I don't build hardware. But I can at least tell you what I wish and
what I see. Okay, I mentioned this a minute ago. We need a remote guarantee.
Wait a minute, is that in the right spot?
Yeah, well, let's call the RDMA protocol the platform.
Sorry, this one's a little bit out of order,
but it's not out of place to talk about it.
The RDMA protocol is what we use to carry data from the originating node to one of the replica nodes, all right? And the RDMA write alone is not sufficient to provide the semantic. Idan went through this with his ladder diagram. But historically, the completion means only that the data was sent, all right? In RoCE and InfiniBand, it means a little more. It does mean that it was received in RoCE and InfiniBand. But at the verb level, and in iWARP,
it only means that it was sent,
or not even that it was sent,
that it was accepted for sending.
Okay, so it's a very weak guarantee,
but it's what RDMA has provided.
Some NICs do give a stronger guarantee,
but they never guarantee that data was stored remotely.
This is where I completely agree with him.
It really doesn't matter if it made it to the remote NIC.
It only matters when it makes it to remote durability.
Processing at the receiver, additionally, was a problem.
If you could say, well, when I see the packet arrive, if I take an interrupt, I know it's
there.
Well, no.
Things can be reordered, and lots of funny things can happen.
You need actually a completion.
You need to actually take a completion in the RDMA world.
And all these things add up to latencies.
This is our argument for requiring to extend the RDMA protocol.
Once we extend the RDMA protocol, we actually need a platform-specific method to implement the commit. And there's actually no PCI commit operation, right? PCI
writes are fire and forget. They're called posted. You post the operation. It behaves
just like an RDMA write. You post it and some hardware accepts it to put it there. But there's
no completion. There's no acknowledgment that the data actually arrived at the endpoint, no end-to-end completion on the PCI bus.
So that's bad.
We go through all this trouble to implement the RDMA protocol
and now what does the NIC do when he's plugged into the PCI
bus?
So I believe that down the road there will be a PCI extension of
some kind to support commit.
I'm hopeful that it's going to be a very simple extension.
Something like, well, I think I'll mention it in a minute, but some very simple extension.
But there are, in the meantime, workarounds.
And I'm going to mention a couple things from my friend Chet Douglas.
But the idea is to avoid CPU involvement and
the workarounds largely require CPU involvement. So this is one of Chet's diagrams that I swiped.
It's about ADR and it draws this little red box called the ADR domain. And the idea is that you can form an operation
that makes it into the ADR domain and is therefore persistent. ADR is a hardware domain, right?
It requires a specific motherboard facility with a super cap or a special power supply
or something. But it is a solution that's available today. There's a couple of glitches.
It doesn't include the cache of the processor.
It doesn't include caches that appear down here.
And so the workarounds all involve either bypassing or writing through these caches.
And they get a little tricky.
They're expensive.
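To give a feel for the CPU-involved flavor of these workarounds, here's a minimal sketch of a server-side routine that writes a just-received region back through the caches on a CLWB-capable x86 processor; this is my own illustration, not one of Chet's specific sequences, and the 64-byte line size and the function name are assumptions.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed x86 cache-line size */

/* Flush a region that the NIC just RDMA-wrote into persistent memory,
 * pushing it out of the CPU caches toward the ADR/persistence domain.
 * Must be compiled with CLWB support (e.g. gcc -mclwb). */
void flush_to_persistence(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHE_LINE)
        _mm_clwb((void *)p);   /* write back each line, leave it cached */

    _mm_sfence();              /* order the write-backs before reporting done */
}
```

The cost the talk is pointing at is exactly this loop: some server CPU has to run it, or something equivalent, on every remote write, which is what the protocol extensions below are meant to avoid.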
There are other workarounds that basically push the data to various places and then do an operation that signals things to go where they go.
Like you notice this one on the right has no ADR domain, right?
And so you can go look at Chet's presentations.
But by sending messages or performing other operations,
you can push the data out of these caches and into the non-volatile domain.
All right?
These are all very expensive, right?
They are required today, so it's reasonable to consider them.
But in the future, we don't want them.
And I would like to suggest that we may not actually need them if we carefully leverage our existing protocols,
which actually implement many of the messages that are proposed down here. So I'm not disagreeing with the workarounds,
but I'm saying the workarounds can be implemented in today's environment without really radically
changing the environment while we wait for this PCI commit and this RDMA write or this RDMA commit. And so what I'm going to explore today a little
bit is SMB3 push mode. And I talked about push mode last year very generically. I actually
talked about push mode in three different protocols. I talked about it in SMB3, NFS, and iSCSI/iSER. Push mode, as Idan mentioned earlier, is
required, I think, to overcome the latency issue. A round trip, well, a message round
trip, an actual send-receive or some sort of acknowledged operation at the upper layer
will add significant latency through
both the network round trip and the interrupt completion handling at the server, right?
We just can't afford it.
The speed of memory, the speed of storage class memory just doesn't allow us that.
We have tens of microseconds, 30 to 50 in today's implementations, sort of best case round trip
latency at that layer. Whereas we can get sub 10 microsecond latency at the RDMA layer.
So by push mode, what I mean is an operation which, if you will, bypasses the upper layer. And so first, I'll talk about the traditional upper layer
commit, how we commit data.
And it's basically a file operation.
We open the file.
We obtain a handle and a lease, for instance, in SMB.
And we perform a write.
We can also perform a read.
There's a bunch of dispositions on the write,
but the server performs operations, either
to a PMEM file or an ordinary file.
And then from time to time, the initiator, the client, will either perform a flush or
just close the file, right, which implicitly flushes.
And that flush is an explicit operation that crosses the wire and requests the server to make the file data durable, right?
Which on a disk drive means, you know, flush it to the disk, put it in the file system, do
whatever you have to do at the file system layer to make it safe. But on a raw PMEM device, it would
actually perform a local commit. It would perform one of these local durability operations
in response to a flush. And that's just what the DAX file system does. There might be some
recalls, which are interesting, and lots of other little details that I'll dig into in a minute. But the advantage is it's a traditional API. It's fully functional today and it literally just works. The only cost is you have to take interrupts and run that server CPU, so it does cost you a bit of latency; the existing latencies are all still there.
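As a concrete picture of that traditional path, here's a minimal Win32 sketch of a client writing a file on an SMB3 share and explicitly flushing it; the UNC path is a placeholder, and on a DAX-backed share the flush is what causes the server to perform the local durability operation described above.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Open (or create) a file over SMB3; the share name is a placeholder. */
    HANDLE h = CreateFileA("\\\\server\\pmemshare\\log.dat",
                           GENERIC_WRITE, 0, NULL,
                           OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    const char record[] = "log record";
    DWORD written = 0;

    /* Traditional write: travels as an SMB3 WRITE to the server. */
    WriteFile(h, record, sizeof(record), &written, NULL);

    /* Explicit flush: travels as an SMB3 FLUSH; on a DAX file the server
     * performs the local durability operation on our behalf. */
    FlushFileBuffers(h);

    CloseHandle(h);   /* close also implies a flush */
    return 0;
}
```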
Push mode with explicit commit is where we really want to get.
And if you will, the idea is to tunnel below the file system layer, to tunnel below SMB,
right?
Instead of performing an SMB write, we're going to perform an RDMA write below it, right?
We're going to set up the I/O using SMB, but we're going to perform the I/O
with an RDMA write. And we can do that when we have a PMEM array over here on the side.
So once again, we would use SMB3 in this case to open a DAX-enabled file, one of these direct
access files. It's literally persistent memory mapped into the host's address space.
So when you write it,
you're just loading and storing in memory, okay?
Then, instead of loading and storing from the processor,
we're gonna load and store from the NIC.
We're gonna perform RDMA writes to push.
We can also pull.
Pull is kind of rare; it happens during rebuild, happens with some weird application things. It's primarily about write. And when we do write, we do a commit. We do one of
these RDMA commits, which is this guy with the little diamonds on it. So we do write,
write, write, commit, write, write, write, commit, write, write, write, commit. And the
commits would basically serve the function that a flush does at the SMB3 layer,
but only for one of these DAX-enabled files where we're actually writing the memory, right?
Otherwise we have to signal the CPU.
So that's the rough schematic of how it works.
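Here's a rough sketch of what that data path could look like at the verbs layer, assuming the server has already returned a remote address and rkey for a registered region of the DAX file; the RDMA write posting is standard ibverbs, while the commit is deliberately left as a hypothetical placeholder for the proposed extension.

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Push one buffer into the server's registered PMEM region.
 * qp, the local memory region, remote_addr and rkey are assumed to have been
 * set up already (the rkey comes back from the push-mode registration). */
int push_write(struct ibv_qp *qp, struct ibv_mr *mr,
               void *buf, size_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr;
    struct ibv_send_wr *bad = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;   /* standard one-sided write */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = 0;                   /* unsignaled; no remote CPU involved */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad);
}

/* Hypothetical commit: no such verb exists in standard ibverbs at the time
 * of this talk; this prototype stands in for the proposed RDMA commit/flush,
 * which would complete only once the remote region is durable. */
/* int push_commit(struct ibv_qp *qp, uint64_t remote_addr,
                   uint32_t len, uint32_t rkey); */
```

The write, write, write, commit pattern is then just several of these unsignaled writes followed by one commit, and the commit's completion is the remote durability guarantee.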
There's a little bit more about leases and things. Let's drill down. So push mode. First, you have to open
the file. This is called SMB2 create, the operation. Create creates a handle. It doesn't
create a file. That's different. You create a handle. And if that results in creating
a file, that's a disposition which is a little more involved.
In any event, opening the file does a lot of things.
You created a connection, you've created a session, you've authenticated yourself to
the server, you've requested to open the file so the server checks authorization based on
your credentials.
All that good stuff has been done by the upper layer, which is really convenient, right?
That's a lot more than you get from some low level block
driver, for instance.
How is the block layout?
Yeah, well, actually, this isn't the block layout yet.
That's a different layer of SMB, which
is a bit of an extension.
But it's just like an NFS v4 open or an SMB2 create.
So once you send a create, in SMB2,
you decorate it with things called create contexts.
And create contexts are these little blobs
that accompany the create to tell the server
how to handle the create.
And I hypothesize that there may be some sort of push mode create context.
It would say, I want to open this file in push mode, right?
Alternatively, there could be an ioctl later.
SMB2 supports ioctls, called FSCTLs (file system controls), on an open handle.
So the idea is to set up push mode.
And an important side effect of this is that if you're going to lock the file in memory and read it and write it,
you probably need to kind of own that lock, right?
And that's something we call a lease in SMB2.
So I believe that it's in conjunction with requesting a lease.
And by establishing the lease in push mode, we can establish that mapped file for some range of the file, perhaps the whole file, lock it in memory, and treat it as a PMEM segment or scatter gather.
So we return the create context to indicate push mode is available.
There's some acknowledgement back from the server that push mode is available.
All right, so the lease provides us with other things that allow us to recall
and manage the mapping. All this happens at create time. This is like one operation. It's
just one big decorated create. After that, the regions may be remotely written and possibly read.
So that requires a RDMA registration.
This is where we would clearly have some sort of FS control.
The client would say, I want to read and write this region of the file, or perhaps the whole file.
So the file is in memory, ready to map.
Now it needs to be registered so the RDMA layer can access
it, right? And we do that today in SMB Direct. We actually do it per I/O in SMB Direct. For each read or write, the server will, well, the client will register its memory. The difference is we're doing it here on the server side. So we sort of reverse the polarity of the operation. Anyway, I suppose that it might be at least an offset, a length, and a mode. I want access from zero to infinity, write-only, for instance.
The server would pin the mapping, register the pages, and return a region handle,
maybe multiple region handles.
The client could request multiple regions,
but the result is that he'll be able to remotely read and write the file.
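To make the shape of that exchange concrete, here's a sketch of what such a registration FSCTL's input and output buffers might look like; nothing like this exists in MS-SMB2 today, and it's purely illustrative of "an offset, a length, and a mode" going in and a region handle coming back.

```c
#include <stdint.h>

/* Hypothetical FSCTL_SMB_PUSH_REGISTER input: which part of the open,
 * DAX-backed file the client wants to access directly, and how. */
struct push_register_in {
    uint64_t offset;        /* start of the region within the file */
    uint64_t length;        /* length of the region (could mean "to EOF") */
    uint32_t access;        /* e.g. 0x1 = remote write, 0x2 = remote read */
    uint32_t reserved;
};

/* Hypothetical output: everything the client's RNIC needs to address the
 * server-side mapping, plus a handle for later invalidation or recall. */
struct push_register_out {
    uint64_t region_handle; /* identifies this registration to the server */
    uint64_t remote_base;   /* RDMA-addressable base of the pinned mapping */
    uint32_t rkey;          /* remote protection key for the region */
    uint32_t flags;         /* e.g. whether RDMA commit is supported */
};
```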
Recall is important because the server may have to manage the mappings. The client will begin to perform its reads and writes directly within
the file, entirely within the RDMA layers. There's no server processing. Only the NIC knows that it's happening. If there's a commit
operation available, the client commits via RDMA. Otherwise, the client commits with flush,
right? If there's no commit, the data is there and ready to go. The client just has to tell
the DAX file system to commit so it can send a flush operation. So this is kind of this weird hybrid, right, where you might be able to
push the data with RDMA but you might have to ask the CPU to commit. And that could be
a win if you're merging a lot of RDMA writes. The client may periodically update the file
metadata.
There's an ioctl, FileBasicInformation, which is basically like an NFS SETATTR; it sets the attributes of the file.
So for instance,
if the client is writing the file
and the server CPU doesn't know it,
how does the timestamp get updated, right?
It's the client's responsibility
to do that at this point.
The flush can do it.
That might be an advantage.
But the client can also do it with one of these things.
And there's a few things that the client can't do that he must use SMB for.
And that is adding blocks, appending to the file, for instance.
Punching a hole in the file, right?
You've got a mapping.
That's all you've got.
You can't add to the mapping without changing the mapping.
You can't punch a hole in the mapping without requesting that that be done.
So a lot of these things will definitely be done with traditional SMB3 operations.
So you can see a mix.
You can see the data transfers being done via RDMA.
You can see the metadata operations being done with SMB.
And now here's the recall. I started to talk about this at the beginning of the last one.
Sorry.
But the server may have to manage the sharing and
this registration.
And the client, I propose that we use the lease to do this.
It'd be recalled upon sharing violation if another process
wants to open the file.
That would be an ordinary SMB3 level sharing, right?
One said, I wanna be the only writer of this file, and another one comes in.
So it recalls.
File doesn't have to move, but the permission has to change.
Caches have to be flushed, God knows what.
There might be a mapping change. Maybe the CPU's maps are full and we have to rearrange things.
So the file system may request that the data be unmapped and
remapped in another location or maybe rearranged due to bad blocks or
I don't know what, some platform specific event occurred.
A registration changed.
For instance, let's say the server started to run low on RDMA resources and
said, I gotta throw everybody out and
rebalance my resources, that kind of thing.
So this recall is very useful for
that kind of server callback to the client.
And when recalled, the client performs the usual.
If it has any dirty buffers on its side of the wire,
it flushes them.
It probably commits them, because usually it cares about the durability. It returns the lease and probably goes back and gets another lease, or maybe it's happy with the
new lease level. Maybe it got dropped down from read-write to read-only and maybe it's
only reading. So he might be okay with that. But that's standard lease recovery. Once again,
that's a sort of an ownership or a sharing metadata operation.
That didn't advance. There we go.
And push mode, when you're done, you're
going to probably close the file.
You will return the lease as part of closing the file.
So that's the SMB3 level operation.
There may be the push mode registration that needs to be cleaned up. There might be
an explicit FS control for that. It may just happen automatically on close. Well, it will
certainly happen automatically on close. But there might not be another way. You might
have to close the file. I don't know. This is a semantic that we need to think through.
What's the most useful application behavior for this type of push mode? I think this depends on the way we decide to enable push mode. If we do it with the create context, it probably happens
at close. If we do it with an FS control, it probably happens dynamically or, you know,
explicitly. I want to point out that push mode registration needs to be reset during
handle recovery. SMB3 has a feature called CA, continuous availability.
And continuous availability allows the server
to hold on to handles when the client disconnects.
And the client can come back
with exactly the same state he left in.
It's really important for enterprise applications
like Hyper-V, virtualized environments, right?
If they get migrated, they need to connect
and they need to pick up exactly where they left off with no loss, right?
We want this to be completely transparent to the application.
So the push mode registration, let's say it moves,
the client moves and has to reconnect to the same server.
That registration goes away and has to be rebuilt.
So it must be reobtained after handle recovery.
You'll get a new handle.
If you try it on the old handle, well, it will fail because the connection was torn
down. So that's good. RDMA protection helps us.
All right. So that's the quick trip through SMB3. Now, a quick trip through the protocol
extension. This is going to duplicate some of what Idan was discussing. I'll say it in my own words, so maybe I'll have something to add, but I might skip over a few
things here. Doing it right. To do it right, I think we have to do it right in the RDMA
layer. We need a remote guarantee of durability. We want it to be highly efficient.
And we want to standardize it across all RDMA protocols, right?
The protocols themselves all have their own quirks and implementation details.
That's fine.
But we want the operation to be similar across all RDMA fabrics and all RDMA protocols.
We have that today for storage, right?
SMB3 runs equally well over iWARP, RoCE, and InfiniBand.
You name it.
It does so because it requires very little of the fabric.
It wants a send or receive, an RDMA read, an RDMA write,
some ordering guarantees, some connection management,
IP addressing, that's about it.
Doesn't need the fancy stuff.
This is another element of the fancy stuff list.
And it's really important that it not be proprietary or it not be specific.
It can be extended, but that core behavior needs to be
countable by all the storage protocols that need it.
So my concept is that there's an operation that I call RDMA commit. It's being discussed in the internet draft, in the IBTA, and by a lot of people. I've talked about this before, so there's probably nothing new here to people who've seen that.
It's also been called RDMA flush.
It's also been called, I don't know what.
It's a new wire operation in my view and a new verb.
It's only one operation and it's only one verb, okay?
It's a very simple point enhancement.
You can conceive of other natures to it, but
I view it as one operation, implementable in all the protocols.
I believe that the initiator provides a region list
and other commit parameters
under control of a local API at the client.
And I'll point to the SNIA NVM technical working group,
the non-volatile memory technical working group
as leading the way in that type of API.
We have this thing called optimized flush.
And optimized flush has a bunch of parameters
that I think should map directly here. But, you know, I'm open to discussion on that.
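For a sense of what "a region list and other commit parameters" could look like at that API boundary, here's a sketch loosely modeled on the shape of the NVM Programming Model's optimized flush, with every name here being illustrative rather than a defined interface.

```c
#include <stdint.h>
#include <stddef.h>

/* One region the client wants made durable on the remote peer. */
struct commit_region {
    uint64_t remote_offset;  /* offset within the registered region */
    uint64_t length;
};

/* Hypothetical parameters carried by the proposed RDMA commit operation:
 * an rkey identifying the registration plus a scatter list of regions,
 * in the spirit of optimized flush's address/length pairs. */
struct rdma_commit_params {
    uint32_t rkey;                      /* which registration to flush */
    uint32_t region_count;
    const struct commit_region *regions;
    uint64_t immediate;                 /* optional small payload, e.g. a
                                           64-bit log pointer, discussed below */
};
```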
The receiving RNIC has to queue the operation in order. I think it behaves like an RDMA
read or an atomic. It's subject to flow control and ordering. Very important. Both of those
things are very important. Ordering because you need to know what data is becoming durable.
And then flow control because it's gonna take time.
It's kind of a blocking operation on the wire, right?
When PCommit was still on the list,
PCommit was a pretty heavyweight operation.
PCommit may not be required for some solutions now.
Maybe it's just a flush, but it's still a flush.
There's still an end-to-end acknowledgement within the platform.
So that blocking operation is gonna require flow control.
The NIC pushes the pending writes, performs the commit,
possibly interrupting the CPU.
I wanna talk about that with respect to the API.
And the NIC responds only when durability is assured for
the region the client wanted.
There's some other interesting semantics though.
I think one of the key scenarios is called the log pointer update.
Where you write a log record, you make it durable.
Then once it's durable, you write the log pointer and make it durable.
It's like two durable commits.
One is a little bit bulk, right?
Might be 4K data.
The other one is literally a pointer, 64-ish bits, right? And so I believe that making that two operations
on the wire might be expensive and might lead databases, for instance, to think twice about
using it. But I think if those two operations can be merged in some clever way, and I have
some cleverness in my internet draft, that can be a big benefit for this thing, for adoption
of this thing.
I also think it would just be basically useful.
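Here's the shape of that log-pointer scenario as a C sketch, using hypothetical helpers (remote_write and remote_commit, standing in for the RDMA write and the proposed commit) just to show the ordering constraint and where a merged operation saves a round trip.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers, standing in for the RDMA write and the proposed
 * RDMA commit; both are assumed to return 0 on success. */
int remote_write(uint64_t remote_off, const void *buf, size_t len);
int remote_commit(uint64_t remote_off, size_t len);

/* Unmerged form: two commits on the wire, strictly ordered.
 * The pointer must not become durable before the record it points to. */
int append_log(const void *record, size_t len,
               uint64_t record_off, uint64_t ptr_off, uint64_t new_tail)
{
    if (remote_write(record_off, record, len))             return -1;
    if (remote_commit(record_off, len))                    return -1; /* record durable */

    if (remote_write(ptr_off, &new_tail, sizeof(new_tail))) return -1;
    return remote_commit(ptr_off, sizeof(new_tail));                  /* pointer durable */
}

/* Merged form explored in the internet draft: a commit that carries a
 * small (64-bit-ish) payload, placed and made durable only after the
 * bulk region is durable. One round trip instead of two. */
int remote_commit_with_immediate(uint64_t bulk_off, size_t bulk_len,
                                 uint64_t ptr_off, uint64_t immediate);
```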
Second, you may need to signal the upper layer, right?
You may need to tell the upper layer that something happened.
We have a scenario in database server that I have in mind where the log is replicated and safely
replicated but the idea of replicating it is for quick takeover, right? Should one of
the database instances fail, the other one should be ready to take over. So you want
to replicate it but you want to notify the peer that there's dirty log data. And so the
peer can actually stay very
close behind you, not synchronously. You don't want him to actually replay that log, but
you want him to know that there's dirty data to be replayed so he can keep close. And signaling
him, either with a little tiny datagram or a whole message, is a really interesting idea. So if
you put these two things together, I think you have some interesting, you know,
merged semantics.
This one I'm really interested in, an integrity check.
If I'm writing data, how do I know it's good, right?
I called commit and he said it was committed.
How do I know that it's really good, right?
We have all kinds of storage integrity at the upper layer.
Can we do this at the RDMA layer?
I don't know.
I don't know exactly how.
Does the NIC have to read it back?
Is there some other way to do this?
But an integrity check is right up there on my list.
The choice of them will be driven by the workload.
We have to have this dialogue with the applications.
We can't just sit here and say, I think I have a good idea.
We need to motivate this by some request or some need.
And finally, the expectation for latency is that push mode will land at much less than 10 microseconds.
Idan mentioned it, right?
He said there's 0.7 microseconds on the wire.
There's no way push mode's completing in 0.7 microseconds, right?
There's more overhead on that thing.
But it's certainly 2 to 3 microseconds, okay?
Which is an order of magnitude,
maybe better than an order of magnitude,
better than we get for a write
with a traditional storage RDMA transport today.
Like I said, most of them are in the 30 to 50 microsecond range, okay?
So that's huge, right?
An order of magnitude is always something to go after
with everything you got. Remote read is also possible. So he was asking why would you need
this? There aren't a lot of scenarios that need it. That's why I sort of put it as a
footnote, right? But it is interesting, certainly for rebuild, right? You lost a copy, you got
to read that copy over the network, just suck it in, right?
The reason we get this is that there's no server interrupt, there's a single client
interrupt. We can even optimize it when we have multi-channel and flow control and things
like that. We know how to do that. That's local. That's local magic, right? We can all innovate in ways that are important. All right. Push mode considerations. I've
got a list of interesting things. Then I got a quick update on Windows Server 2016. I called
these fun facts when I talked about this a couple of months ago at Samba XP. I've decided to name each fun fact a little more meaningfully.
The first is buffering.
This is true certainly of Windows and I believe of Linux.
If you open a file in buffered mode on a server
with a persistent memory device, buffered mode
has a new meaning.
Buffered mode means that you've actually
mapped the device directly when you open a file in
buffered mode.
It used to be that the buffer was this bounce buffer in between you and the device, right?
You would write to the buffer and the server could acknowledge that quickly and then lazily
it could write it out unless you flushed it.
Buffering means the opposite with PMEM.
Buffering means you actually own the device right now, at least if everything's working well. There
might still be a buffer in there, but there's no reason to buffer just to drop it back in
memory, right? So the idea is to put it directly in the device. So that's really interesting.
But it enables both load and store and RDMA read-write when you have this, right? And
so it's pretty interesting that you sort of want to invert the meaning of buffered
and unbuffered.
It used to be unbuffered kind of went faster because it went straight to the device.
Well, buffered went faster, but you didn't get the guarantee.
So it's kind of the opposite now.
You want buffered.
So the server can actually hide this from you.
And that's what I'm going to talk about in a minute. We experimented with Windows on this.
Recalls. NFSv4 has this, SMB has this. Push mode management requires new server plumbing,
right? The up call from the file system for sharing violations remains the same. But,
you know, relocation, file
extension, all these things, these are new up calls that don't come from disk drives,
right? And so the server has to be prepared for some interesting behaviors. And RDMA resource
constraints are another reason for recall. Now you're explicitly giving RDMA resources
on the server to the client. None of the protocols today, including NVMe over Fabrics, exposes the server buffer. So when you go to push mode, you're going
to have to manage those resources. So this is really important. You're going to have
to change the plumbing of your server. So an implementer will have to move toward this
over time. I don't know what all these recalls will be just yet. But the idea is, I'm in
trouble. You need to let me fix this thing and
then we'll get started again.
Congestion. This is the thing I'm most worried about. The big thing about write operations in storage protocols today is that they naturally flow control themselves, right? You send the write to the server and it's just a request to push data,
right? The server says, okay, I'll take that request and it pulls the data, right? And
what that means is that the server schedules that pull, in particular when it has the memory
to receive it and when the I/O, you know, the I/O has reached the top of its
queue basically, right? And so it flow controls the network very nicely,
as well as flow controls the queues on client and server very nicely.
When we go to push mode, we're going to congest all over again.
We're just jamming data down the server's throat, right?
The software doesn't have to deal with it.
It's out of the loop.
But the RNIC, the network, the interface, the DRAM bus,
all this stuff is
going to have to deal with it. I'm really worried about that. RDMA itself does not provide
congestion control, certainly not for RDMA writes, right? They are unconstrained. You can send as many of them as you want. You can fill the wire from one node. But congestion
control is provided by the upper storage layers. And there are credits in SMB3.
There's also something called storage QoS.
My colleague Matt Kurjanowicz gave a presentation yesterday about this and
their use in Hyper-V.
So I'm going to point out that there are existing client side behaviors,
such as this QoS, that can mitigate this problem.
And we're going to need to think those through,
because I think they're going to be really, really, really important.
And RNICs can help.
More thinking is needed.
We're going to have to solve that question.
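As a trivial illustration of the kind of client-side behavior that could help, here's a toy pacing structure that bounds how much pushed-but-uncommitted data a client keeps outstanding; this is my own sketch, not an SMB3 credit or storage QoS mechanism, though it plays a similar role.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy client-side pacing state: cap the bytes written since the last
 * acknowledged commit, so one client can't fill the wire unchecked. */
struct push_pacer {
    size_t outstanding;   /* bytes written but not yet committed/acked */
    size_t limit;         /* budget granted by policy (think: credits or QoS) */
};

/* Returns nonzero if the caller may post another RDMA write of 'len'
 * bytes; otherwise the caller should issue a commit and wait for it. */
static int pacer_may_write(struct push_pacer *p, size_t len)
{
    if (p->outstanding + len > p->limit)
        return 0;
    p->outstanding += len;
    return 1;
}

/* Call when a commit completes: the durably written bytes no longer
 * count against the budget. */
static void pacer_on_commit_done(struct push_pacer *p)
{
    p->outstanding = 0;
}
```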
Tom, is that congestion at the network or at the end node?
Both. Absolutely both.
Primarily the end node, right?
But the switches, the exit ports, suffer from that fan-in congestion.
I mean, we see this every day in any sort of scaled network deployment.
And integrity, I mentioned this before, but data in flight protection,
well the network provides that.
Data at rest, that is not provided by the network.
How do we do that?
SMB3 has a means, but we're not using SMB3 to transfer
the data in push mode. So I think we're going to need to think about remote integrity as
time goes on here. I don't think people will be willing to push their data quickly only
to discover that it didn't quite make it the way it was sent, right? That's usually not
a good way for a storage provider to make money.
To wrap up, I have a few slides on where we're at today in the Windows implementation.
Yeah?
Sorry.
Yeah?
You gave the example of the log pointer update.
Yeah.
And this was a motivation, you said, that you want basically the two operations to be one?
I want them to be ordered with respect to one another,
and ideally I would like to issue them without two round trips.
That's a subtle difference but...
No, no. There would actually be two flushes. Well, you could pipeline them, that's one
approach. You could have an optional payload in the message, a small 64-bit payload, that's what I explored in the internet draft. Because I think 64
bits is enough for the scenarios I have in mind.
Well, could you do it as one operation where the log update effectively carries a commit
list which forces commits for all the previous log writes?
Tons of ways to think about this.
That's a protocol implementation question.
But the idea is you have one probably bulk piece that's placed and made durable followed
by another which is placed and made durable only after the successful placement and durability
of the first piece.
That's the rule, right?
I can't update the pointer until the log is safe.
And I can't write the pointer until the log is underway.
But specifically you mentioned
that you don't want it to be two operations,
you want it to be one.
Ideally I wanna keep the latency minimal.
If I have to do two operations, so be it.
But I'm gonna point out that maybe it's easy to,
as long as the semantics are well defined, to merge them. Right? A lot of upper layers
will have things like compound or fused or whatever operations where you can send more
than one operation. The important thing is that the two operations are ordered with respect
to one another. If one fails, the other one cannot proceed. That's what's critical.
Yeah. I know. Yeah. Oh, I'm sorry. Repeat the question.
And I'm saying that smashing operations together will have a cost and we'll have to measure
that cost, right?
Maybe it will be too expensive to do it in the NIC.
Maybe the protocol will change in some ungainly way if we try this. And so I agree with that, right? We need to think it through, right?
If the cost outweighs the benefit, forget it. Benefit outweighs the cost? Oh, now we're
talking. Okay, Windows Server 2016. Tech Preview 5 has been available since this summer and
general availability is imminent.
I'm not here to announce anything.
Oh, I have more time than I thought.
Good.
Just a couple of minutes.
It's imminent.
It's very soon.
I'll give you a hint.
We have Microsoft Ignite in Atlanta next week.
But we'll see. The Tech Preview 5 and Server 2016 GA both support DAX.
DAX is really great.
And Remote SMB works over DAX, I just want to mention.
It's a file system.
It's exportable.
It works great.
And so we get all the expected performance advantages, right? DAX is great
because there's no media, it's just memory, right? And it supports all the reads and loads
and stores and commits that we expect from a memory-based file system. It's basically
a RAM disk, right? It has a block semantic and we put a file system on the block semantic
and we open files and we read them and write them.
And if we map them cleverly, we avoid data copies, which is really cool.
However, Windows Server 2016 Tech Preview 5 does not implement full push mode.
Full push mode, as you can see, has protocol implications.
Generally speaking, when we develop Windows Server, we spend a year in design, a year in development, and a year in test.
I can't slide a new feature in in that last year.
It's very difficult.
If it wasn't committed way up front, it ain't going in.
And a new protocol change, forget about it, okay?
So just damping expectations perhaps a little.
It's not gonna have push mode.
I hope it will in the future.
And we have a new release vehicle that might let me do it, but not yet, okay?
So there's no reliance on extending the protocols
that we discussed.
We did, however, consider the direct mapping mode,
and that was that picture where I showed
the DAX file system in that little dotted line
that went from the PMEM straight to the file's buffer cache copy, right?
And that direct mapping is really interesting because the Windows SMB
server basically can operate on that mapping, right, in a defined way,
in a file system managed way, and can read and
write those pages directly out of the PMEM, right?
And so we thought to ourselves,
why don't we just RDMA straight out of the buffer cache,
which would effectively RDMA straight out of the PMEM device.
And we learned something very surprising.
This is kind of small and I'm sorry, but basically these green boxes of IOPS and latency are
basically within 1% of one another when we tried the two models.
We tried the traditional model where we DMA'd to the buffer cache and then copied the buffer
cache to the PMEM or when we mapped the PMEM into the buffer cache and DMA'd directly there.
On the right is that second model, RDMA direct in "buffered" mode, air quotes. And on the left, RDMA traditional unbuffered mode
where we do the data copy.
We got the same darned result.
This was a 40 gigabit network,
so basically the bandwidth was almost completely occupied.
But the latency didn't budge, right?
It took the same length of time,
processing was the same length of time on the server.
So that was pretty surprising.
We spent a lot of time
and we do understand that the reasons go kind of deeply and DAX delivers. This performance
was considered quite good. This is actually not the most interesting platform. We have
a much more interesting platform that we're going to get some really neat results from
soon. This was done back in May, June timeframe, I think. There's no advantage. And so, we basically
decided not to do it. We basically stepped back and we said, okay, we're not actually
disappointed here, right? It was a good idea, but it isn't ready in the code. We basically discovered that memory manager and cache manager, these are MM and CC in Windows,
are not quite ready to do it at this IOP rate, in other words,
to map and unmap pages from a file at this kind of rate.
They're used to doing persistent mappings or long-lived mappings,
something that lasts for seconds, minutes, whatever, not a few microseconds, right?
And so they need some work
or some sort of persistent FS control
that just makes one mapping
and hangs on to it for a long period of time.
By not changing anything,
stability and performance are maintained, right?
But we can improve this in a future update. And so I'm going to beg you to watch this space. There's a
lot of horsepower under the hood here that we haven't lit up yet.
So if you happen to have Windows Server 2016 or Technical Preview 5, I just want to mention that if you stick an NVDIMM in it, and if you stick an RDMA NIC in it, and you can choose any type you like, we support Chelsio and Mellanox primarily.
There's a couple of others available from vendors who've built it and certified it for Windows.
Configure a DAX file system, which is basically formatting your volume with the /DAX option.
Here's a little link to help you learn the new format option.
Create an SMB3 share and you're off to the races.
So it's literally out of the box ready to go.
You just need two pieces of hardware, a DIM and a NIC.
And finally, external RDMA efforts.
We mentioned this.
The requirements and the protocol need to be done in a very broad way through standards
bodies, right?
We need everybody to know what to do and to be able to build interoperable and useful implementations.
So this moves to various standards organizations.
I didn't actually mention the PCI Express folks here, but
the PCI SIG is an important one as well.
IBTA Link Working Group, specifying InfiniBand and
RoCE protocols for this.
The IETF STORM working group, which I was co-chair of, has unfortunately recently closed. We basically completed all our work. And this train didn't arrive
at the station early enough to save it. So it unfortunately closed. The Working Group
is actually still there. It's just inactive, closed. So the mailing list is still there.
And I submitted this internet draft to it. This draft needs to be updated.
My co-author helpfully left the company, so I have to kind of pick up a flag over there.
But there's a lot of what I'm discussing in here in a little more detail.
It's also being discussed in the SNIA NVM TWG, the Non-Volatile Memory Programming Technical Working Group.
And Open Fabrics has shown some interest in
surfacing APIs for this. And here are some links to resources when you see the presentation
online. These two are really cool. This was Neal Christiansen back in January of 2016, eight or nine months ago. Neal Christiansen gave a talk just after Jeff Moyer from Red Hat. Neal is my colleague at Microsoft; Jeff Moyer has a similar role at Red Hat. They could have
given each other's presentations. They could have swapped decks. It was the funniest thing.
They talked about the same architectural problems, the same architectural solutions. It was really
astonishing. Everybody was really surprised by that. But check those two out. They went
back to back and it was quite an experience. All right, questions if any time remains.
We chatted last year even about push versus pull and the resources that are tied up with one versus the other and the buffers.
Right.
The timeline associated with that.
Have you explored that at all?
Have I explored the timeline of buffer ownership or buffer management across the push versus
pull mode?
Not a lot.
It is a concern. It's kind of an application-dependent
concern though because it's sort of like the working set of the application, right? How
many of those buffers need to be kept around, right? And how many recalls have to occur
for instance, right, to manage limited space. So I believe there's actually
several layers of issues. There might be limited map space in the processor of the target.
There might be limited memory handles and resources within the NIC of the target. There
may be other memory pressures from for instance the cache manager in the target. All these
things are going to impact how many mappings can be held
and how long they can be hung out. So I don't think until we really start to see those workload
profiles that we can answer that question.
For next year.
It's a new paradigm, right?
Yes.
Yeah. Yeah? Have you considered in the design the impact of the server having to recover from
or prevent the interaction with a buggy or malicious client?
Because it seems like there were several aspects of this that would be required to interact.
Have we thought through server protection of its data and its actual data paths with respect to malicious clients?
Some, some.
First thing: because we use SMB3, we have an authenticated attacker.
That automatically limits the scope of the vulnerability, okay?
We have a special class of threats called authenticated threats, right?
So that's number one.
We have the RDMA layer protecting us to a certain extent. The server can manage things. If the client attempts to guess addresses, the RDMA layer will close the connection. That will be detected by both sides. Third,
because it's an authenticated access to this mapping, the client has access to all the
data that he's able to write. He can't write data that he's not allowed to write. He can only write data that he was allowed to write. So he's literally
only stepping on his own data. He can't damage somebody else's data. He was given, you know,
he had the authority to damage the data. He went ahead and did it. So primarily, the threat
is that he can hog up the bus, you know, the physical bus and things like that. And we
can mitigate that with local, you know, NIC-based QoS, things like that, we believe. That's where
we have to think it through, right? That's why I said this is part of that congestion
question, I think, in many ways.
Well, there's no explicit congestion in the RDMA layer.
The RDMA layers depend on transports that have congestion control.
iWARP uses TCP, RoCE uses the InfiniBand, you know, transport.
They all have a certain level of congestion control.
But that's transport-level congestion control, not necessarily RDMA-write-level congestion. So it solves a slightly different set of problems.
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe
at snia.org.
Here you can ask questions and discuss this topic further
with your peers in the developer community.
For additional information about the Storage Developer Conference,
visit storage-developer.org.