Storage Developer Conference - #11: Remote Access to Ultra-low-latency Storage
Episode Date: June 13, 2016...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 11.
Today we hear from Tom Talpey, architect with Microsoft,
as he presents Remote Access to Ultra-Low Latency Storage
from the 2015 Storage Developer Conference.
Hi, I'm Tom Talpey.
I'm from Microsoft, architect in the file server team.
My responsibilities include SMB3 and SMB Direct, the mapping of SMB to RDMA.
And I've been at SDC for many years, and usually when I'm here I talk about SMB.
The last couple of years I've talked about things that are SMB related, but not actually SMB.
So I'm kind of exploring the boundaries of where SMB enables new things.
I'll have a couple of shameless plugs in the middle of the deck about that. But today I want to talk about ultra low latency
storage, which I'll define, and how the protocols that we know and use today can adapt to these
new storage environments. And after my Bluetooth wakes up, a quick outline. I'll state the
problem, okay, and the protocols today. I'm going to focus
on certain protocols, not all RDMA storage protocols, but certain ones, and examine briefly
the sources of latency that they currently encounter, sort of where the landscape is at.
And then I'll launch into my madman hypothetical.
You can agree or disagree with my wacky ideas,
but I'm going to try to chart a course for the extension of RDMA storage protocols
and the protocols they depend on for remote access.
So without further ado, here we go.
I want to mention that this talk is related very closely to a number of other talks that are happening at SDC.
We've had a new track at SDC this year for persistent memory.
This is one of the talks in that track.
But really there are a lot of aspects to the persistent memory problem, right? There's the
API, there's the physical storage, there's the attributes of how that storage becomes durable,
how you recover after an error, how you access it remotely, just, you know, how file systems will
use it. There's really, there's a very broad spectrum of topics that fall under this heading of persistent memory.
It's not really about the low-level memory cells at the bottom of the stack.
There are lots of different technologies down there.
This is really a whole-stack discussion.
And what I want to bring to this is the aspect of the remote access to this stuff.
But these talks are all really closely related.
You really ought to keep them in mind as I speak.
Monday's talk from Neil Christensen at Microsoft
was about the file system,
the sort of the Windows adoption of this technology.
He talked about a file system called DAS, D-A-S in particular,
and some block layers that will go above this technology
to allow rapid adoption of it.
On Tuesday, Jim Pinkerton spoke to the eBOD,
the so-called Ethernet bunch of disks,
the Ethernet-attached JBOD.
That is conceptually very similar to some topics here.
You could imagine an eBOD with persistent memory in it.
Okay, just throw that out there.
Andy Rudoff talked about NVML and API layers to this thing.
All these APIs will flow to the wire.
Doug Voigt spoke to the higher-level NVM effort in the TWG,
the SNIA NVM Programming TWG,
with sort of architectural challenges
and new horizons for NVM to explore.
Chet is going to come after me,
not immediately after.
He's like two talks after.
But Chet's going to talk about
some practical platform-specific aspects of NVM
and how it can be used
without some of the things that I'm going to talk to here.
So we're going a little bit out of order.
I'll try not to duplicate discussion,
but point to these other discussions.
And then Thursday, Paul von Behren will also follow up
on some more NVM things.
So just kind of think of these all together, all right?
And I probably left out a couple of talks,
but mine is just a little piece down at the bottom,
about remoting this stuff.
Okay, first, let's state the problem.
The focus of this talk, my talk, my angle, my way of viewing this problem,
is from the perspective of enterprise and private cloud-capable storage protocols.
So these are the sort of tried-and-true protocols that we've used for years.
Some of them are file, some of them are block.
The point being that they're scalable, they're manageable, they're broadly deployed.
These are high-level protocols that are used in the enterprise,
that are used by enterprise applications and enterprise operating systems.
Now, we use them in new ways as we move to the brave new world.
RDMA, which I'm quite proud to say is establishing itself very well in the industry,
works with many of these protocols, in particular SMB, NFS, and iSCSI.
So I'm going to talk to SMB3 with SMB Direct.
It's my baby, so I'm most proud of it.
NFS RDMA is technically also my baby. I wrote the protocol and some Linux implementation,
but others are carrying that ball right now. And iSER as well. iSER, the iSCSI mapping to RDMA.
So I'm going to just hypothesize a little bit about each one of those. There are many others,
including NVM Fabrics, but I'm not going to speak to that. One of the reasons is that I believe those are still emerging technologies, right? I want to focus, as I said, on, that's not going to show up
very well, is it? Enterprise and private cloud capable storage protocols. I don't think
NVM Fabrics is there yet, but it may be someday. So watch this space. I'll talk about it soon
enough. New storage technologies are emerging.
Okay, so this is how the problem statement begins.
Advanced block devices: IO bus attached, block addressable or, in the future, byte addressable.
They're sort of a basic class of devices, similar to what you may have worked with before, right? They are
block devices, but they sit on new buses, right? NVMe sits on PCI lanes, basically, PCI Express
lanes. Solid-state devices, there's a broad array of solid-state devices. They're not all on the
storage interconnect, but they have certain behaviors, persistence, and they're memory-oriented.
Today they're mostly block-oriented.
In the future they'll become byte-oriented.
Even these NVMe devices, when they sit on a PCI bus, can be accessed as memory.
And so that byte addressability or pseudo-byte addressability
will become important for these things.
The IO bus-attached ones are purely block oriented,
and they almost require an IO stack to get to them.
There's a second class, though, storage class memory,
as we've seen in a lot of discussion.
I'll just call them PM.
I prefer PM to SCM.
But SCM does make sense when having certain application-focused dialogues.
They're memory bus attached, right?
They might be an NVDIMM, whatever.
They're block or byte accessible.
When they're block accessible,
they're either native block
by nature of the implementation,
or they've got a block layer layered on top of them.
When they're byte accessible,
they're just plain old memory, right?
They may not be actually byte accessible.
They may have a cache line behavior.
Some very small block, and we saw some discussion about that yesterday. And these encompass
emerging persistent memory technologies like Intel's recent 3D XPoint.
PCM phase change memory has been talked about quite a bit in this kind of context. But they
come in various form factors. So they're all low-latency, persistent devices that live on a memory bus. All right, they're byte addressable.
Storage latencies, however, as a result, are decreasing. You've seen this slide in a bunch
of different ways. I just put it up in a slightly different way, and I'm sorry about the contrast of that table.
That's poor.
Write latencies of storage protocols,
for instance, SMB3 today,
are down in the multiple tens of microseconds
when they run on RDMA.
Somewhere between 30 and 50
is kind of the best latency of these protocols,
which is pretty amazing by storage standards, right?
If you look at the underlying technology that these protocols access, you know, hard disk drives,
boom, 1 to 10 milliseconds; solid-state drives, traditional storage-attached SSDs,
100 microseconds to 1 millisecond. NVMe is pushing 10 microseconds. It might even be better.
We've seen some NVM fabric results and some local results that are slightly better than that.
But they're in the 10 to 100 microsecond range.
And persistent memory, however, just blows it out by making it memory.
So these latencies are coming down by tens of times each row,
maybe 100 times at the bottom row.
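To make that gap concrete, here is a back-of-the-envelope sketch using single representative values per row, taken from the approximate ranges in the talk (illustrative figures, not measurements):

```python
# Rough media latencies from the talk, in seconds. Each value is a single
# representative point from the quoted range, purely for illustration.
MEDIA_LATENCY = {
    "hdd": 5e-3,                   # ~1-10 ms
    "sata_sas_ssd": 300e-6,        # ~100 us - 1 ms
    "nvme": 10e-6,                 # ~10 us, sometimes better
    "persistent_memory": 100e-9,   # memory-class latency
}

# Best-case RDMA storage-protocol write latency quoted in the talk.
PROTOCOL_RTT = 40e-6               # ~30-50 us

def remote_overhead_ratio(media: str) -> float:
    """Protocol round trip expressed as a multiple of the media latency."""
    return PROTOCOL_RTT / MEDIA_LATENCY[media]
```

For a hard disk the protocol round trip is under 1% of the media time, for NVMe it is roughly the same order, and for persistent memory it is hundreds of times larger: which is exactly the challenge the talk describes.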
That puts a big challenge
on that 30 to 50 microsecond number.
It's a good match to hard disk drives and SSDs.
It's a stretch match to NVMe.
It's about the same as NVMe, right?
That's pretty good.
We want to be about the same,
the latency of the underlying media to the
remote capability. But it is definitely not so much a match to persistent memory. So we have to
do something to these existing protocols to keep up with the latency curve that the devices are
giving. Storage workloads traditionally, I just point out, are highly parallel. Storage in general
is highly parallel. They issue lots of deep queues. They shoot tons of IOs down the stack.
And so that mitigates latency issues. You'd fill a pipeline and you always have a new
completion popping out. But that's because the traditional workloads are highly parallel.
New workloads are actually not so parallel, and so
the latencies are harder to
make up for, if you will.
And that's where we're going to go.
Those new latency-sensitive workloads are primarily writes.
They're small and random: virtualization, enterprise applications.
Those are the ones that everybody dealt with
in traditional storage, right? The database
workloads, the virtualization hard disk drive, you know, the virtual hard disk emulation.
They're all small and random.
However, these writes must be replicated and made durable in modern environments, right?
It doesn't count until you've created three copies of it.
And those three copies have to be physically disjoint, right?
They have to be in totally different failure domains or it doesn't count.
If you lose the golden copy,
the one copy, the few copies you have,
you've failed as a provider, as a
storage provider. And so the modern
way to do that is to replicate, to spray them
around. So a single write creates multiple
network writes. And I'll show you
a picture in just a second. Reads are
also latency sensitive. Small and random
are always latency sensitive. Large are more forgiving. But there's some interesting ones that may want to go
remote, like recovery and rebuild, right? If you lose a copy, you've got to read all the other
copies to rebuild the missing copy, right? And so there are some interesting latency and or bandwidth questions to be seen with reads.
So here's a little animation that I stole from an Azure presentation, Windows Azure.
But writes, with possible erasure coding, I want to point out, greatly multiply the network IO demand.
We call it the two-hop or the multi-hop issue. You perform a write down here, and that write has to be placed in three other locations
on the network, right? And you have to wait for those three locations to be safe before you can
return from the write. Now, if you're a huge high-scale geographically dispersed server
like Microsoft Azure,
you have to also create erasure coding, right?
And so that one write prior to replying
had to produce quite a few more writes.
And those writes were to distinct machines within the data center.
So latencies are interesting not so much at the front edge of this thing,
but in the middle of this thing.
Front edge as well.
All such copies must be made durable before responding.
That's what I just want to stress.
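The fan-out just described can be put in a tiny model. The replica count of three follows the talk's "three physically disjoint copies" rule; the erasure-coding fragment count is an invented example, not a statement about how Azure actually codes data:

```python
# Toy model of the two-hop write fan-out: one client write becomes several
# network writes, and the reply must wait for all of them to be durable.

def network_writes_per_client_write(replicas: int = 3,
                                    ec_fragments: int = 0) -> int:
    """Network writes triggered by a single client write: the replica
    copies plus any erasure-coded fragments produced before replying."""
    return replicas + ec_fragments

def ack_latency(copy_latencies):
    """All copies must be durable before responding, so the slowest copy
    (in its own failure domain) sets the reply latency."""
    return max(copy_latencies)

# Three replicas alone: 3 network writes.
# Add a hypothetical 6+3 erasure coding pass: 12 network writes total.
writes = network_writes_per_client_write(replicas=3, ec_fragments=9)
```

Note that `ack_latency` is a `max`, not a sum: the copies proceed in parallel, but a single slow machine in the middle of the data center still stalls the reply, which is why latency "in the middle" matters as much as at the front edge.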
APIs and latency are another interesting part of the problem statement.
APIs also shift the latency requirement. Traditional
block and file, as I mentioned, are also
often parallel.
Memory mapped and persistent
memory aware applications,
not so much. They have a load store
paradigm. You don't expect load
store to block.
You don't code
loads and stores to be parallel.
Well, maybe you do with certain types of applications,
but that type of load store expectation is the expectation.
I do a read, I do a write, it doesn't block.
It's quick.
Low latency.
It has memory style latencies.
Possibly expensive commits.
The libraries all say, store, store, store, commit.
And the commit is a heavier weight operation.
That's where you take the penalty.
That's where you expect the cost.
You don't expect it on those individual stores.
And a lot of people try to hide that latency with local caches, which work great for reads.
But for writes, they don't count.
There's only one copy if you write to the local cache.
That doesn't count. You still have to spray it all over the place.
Most of these caches are write-through and run into the same latencies that you get for a traditional write.
By the way, I didn't mention it, but
feel free to interrupt me if I'm being confusing or if you have a question.
Okay.
That's roughly the problem space.
Latency, latency, latency, right?
All right?
It's about write latency.
RDMA storage protocols today.
Many layers are involved.
Okay, we start with storage layers,
storage protocols that carry some local API across the network to some storage server.
Some examples, the ones that I'm going to be focused on today, SMB, NFS, iSCSI.
Those are three well-known remote protocols.
There are RDMA layers below those, functioning as low latency, high throughput, low overhead transports. iWarp,
that's one. That's the RDMA over TCP mapping. RoCE and RoCE v2, that's an InfiniBand-style
protocol placed on Ethernet, right? RDMA over Converged Ethernet. That's what RoCE stands
for. And InfiniBand, which is the sort of custom top-to-bottom fabric used in a lot of HPC, scientific computing,
financial high-performance computing data centers.
So those are three typical RDMA layers all in use today,
along with those storage layers above it.
And finally, the bottom layer is the IO bus itself,
which has a lot to do with this.
The storage layer could be a file system or a block layer.
And if it's a file system, you know, there are lots of different ones.
Block, I'll just mention some interesting things.
SCSI, random SCSI, you know, SCSI of your choice.
SATA and SAS, that type of interconnect,
something that's not necessarily switched or routed or shared, right?
It's just kind of a way to get to the low-level storage.
But PCI Express is also an IO bus.
These NVMe devices sit on it,
and memory itself has become an IO bus
now that these NVDIMMs are plugged into it.
So those are new IO buses.
I'm trying to draw the most complicated version of the picture, I guess,
before I paint a solution.
SMB3 architecture, shameless plug.
Okay, I work for Microsoft.
I'm a co-designer of SMB and SMB Direct and blah, blah, blah.
SMB3 is the principal Windows remote file sharing protocol.
Almost all major Windows services run well over SMB3.
They're supported.
It operates over RDMA.
It's a very rich, mature, highly supported protocol.
Multiple implementations exist in the industry.
They're all here testing downstairs.
But SMB is also, it's not just a file sharing protocol.
It's also a transport protocol.
It's an authenticated, secure, multi-channel, RDMA-capable
session layer. It's a transport with recovery. It has that session state, right? You've logged in.
You've proved who you are. You have access to other things on the server. For instance, it's not just
file system operations. It's raw block operations on Windows. It's Hyper-V live migration. I can read and write
the memory of a virtual machine. And RPC, named pipes, I can perform remote procedure calls to
my server over SMB. So there's a lot on the back end of SMB as well. And SMB will be a future
transport for NVMe storage, persistent memory, etc. And I'm just going to sketch how that might happen.
Ooh, the white isn't going to come out very well.
SMB 3 components.
On the left is the client.
And over here we have an application sitting on a local API, in this case Win32.
It sits over the redirector, which is the SMB client.
And the redirector can access the server via multiple
protocols, TCP or RDMA. The server, of course, multi-channel, multiple connections can be used.
It flows to the server. The server has a number of storage providers behind it. In SMB, we call
this a share. You'll open a share. It might be a disk drive. It might be a file system, a volume of a file system and things like that.
The share kind of names the provider.
And then the provider, in turn, uses some sort of storage back end.
Typically, a file system will use a hard disk or a solid state drive or something to serve data from.
But I just want to point out that there's a couple of different layers behind the server.
And so there's some interesting things going on.
We can view RDMA in any of these layers.
And these ones that are highlighted are sort of new components
because to get to NVMe and emulate a disk, you need a little SCSI layer.
To support block mode and raw mode, persistent memory,
you need drivers, obviously, and maybe a block layer.
And a mapped file API such as the DAS file system that Neil Christensen mentioned yesterday.
That's a new paradigm for accessing files.
Okay, contributors to latency.
This is probably a review for a bunch of people, but I'm going to say it anyway.
The way storage protocols work today is very specifically architected.
All three of the protocols that I'm going to discuss have the same basic transfer model here.
They use a direct placement model, and I've simplified and sort of optimized
it here. I've left a few exchanges out that some protocols will encounter, but this is sort of the
best case of the protocol today. The client, to perform a read or a write, always starts by
advertising some region of his memory, right? The buffer I want to write or the buffer into which
I want to read. He registers that with the RDMA NIC and he sends it to the server and he says,
please do the read or the write for me. The server performs all RDMA. The server either
writes the memory for a read or reads the memory for a write. All three of these protocols work
that way. They do it for three very important reasons. It's more secure. The server can register
its memory and does not expose it to anything, right? It's private to the server. It's performed
only for the RDMAs that it requested. Second, it's more scalable. The server doesn't have to
pre-allocate a bunch of stuff and pre-reserve it to a given client.
And third, it turns out to be faster.
It's faster because the server can schedule the presence of the memory.
We don't have the client stuffing it down the server's throat.
There's no real congestion control needed.
The server simply holds on to the I.O. and pulls the data when it's ready for it. And that turns out to greatly improve the performance of the server in the long run
for most highly scaled workloads.
SMB3 uses it for read and write.
All the other protocols do basically the same thing.
So the important thing to note is that those red lines are server initiated, right?
They're performed by the server.
And the little lightning bolts indicate interrupts and processing required to perform the I.O.
The server, in the write case, takes two of them,
one for the client to request it and one for the RDMA read to be complete.
And in the write case, he takes one of them.
I'm sorry, in the read case, he takes one of them.
In both cases, the client can play with it
and probably just take one interrupt on each side.
Latencies come from those things, though.
There are undesirable latency contributions
from the interrupts themselves.
It takes time to interrupt the CPU
and switch context. And then perform
the work request. These are
complex upper layer operations
that need to be scheduled and processed
in software. So server
request processing and also that actual RDMA handling,
like, for instance, that RDMA read up here at the top with the double lines.
CPU time is required to do it.
IO stack processing is required on the back side.
And data copies, potentially.
Even in the RDMA case, there may be data copies for buffer management purposes.
And the question that I hope to answer is, can we reduce or remove all of the above
when we have a persistent memory device available to us?
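The server-initiated model and its interrupt cost can be sketched as a toy simulation. Nothing here is a real RDMA API; the function names and the interrupt counters are invented purely to mirror the lightning bolts on the slide:

```python
# Toy model of today's "direct placement" exchanges. The client advertises
# a registered buffer and the SERVER performs all the RDMA, taking the
# interrupts the talk lists as latency contributors.

def server_pull_write(client_buffer: bytes):
    """Client write: server takes two interrupts (request + RDMA Read
    completion) before running its IO stack and replying."""
    server_interrupts = 1            # interrupt 1: write request arrives
    data = bytes(client_buffer)      # server-issued RDMA Read (simulated)
    server_interrupts += 1           # interrupt 2: RDMA Read completes
    return data, server_interrupts   # then IO stack processing + response

def server_pull_read(length: int, server_buffer: bytes):
    """Client read: server takes one interrupt for the request, then
    pushes the data with an RDMA Write into the advertised buffer."""
    server_interrupts = 1            # read request arrives
    return server_buffer[:length], server_interrupts
```

Every operation wakes the server CPU at least once: that is precisely what push mode, described next, tries to eliminate.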
So I argue that the logical way to do this is to extend the RDMA storage protocols.
And interestingly, when you head down this route,
you discover that you have to extend some other protocols as well.
So I'm going to map a little picture of basically three steps
of protocols that may change as we address this.
I believe that we can actually achieve success at each step.
I think we will have a market increase at each step.
So that's good.
We have a roadmap ahead of us.
And I'd like to see a lot of protocols do this.
I think this is a compelling architectural approach, not specific to SMB.
So it starts with something that I will call push mode.
Currently, as we saw, the client doesn't do any RDMA.
The server performs all the RDMA.
So the client actually has to say,
please server, do the RDMA,
and the server then performs it.
So that's cycles on the server, right?
Something has to wake up the server and run them.
However, with push mode,
this is done by a few other protocols in the past.
The reverse happens.
The client requests that the server pre-register some region
for its use for I.O., right?
It's a named region.
It probably is a file or a segment of a large persistent memory device.
But it says, please register.
That's a one-time operation.
And the server will register it and return a handle, an RDMA handle, to the client.
The client will then proceed to perform all RDMA.
Around the first dotted line, he'll do a push, right, in which one or more RDMA writes occur,
and then some sort of commit, in the case of persistent memory, will occur.
That's what I'm going to talk about next.
In the case of a read, it's really straightforward.
If the client has permission, he just reads it, right?
It's an RDMA read.
Neither one of these, you'll notice, interrupts the server at all.
There's just a blank space to the right.
Right?
So the server just kind of registered the memory and said, party on.
Right?
You're allowed to access this memory.
I've authenticated you.
I've opened this region.
I've given you a handle to the region.
Party on.
However, as that happens, things may happen.
Maybe the client is actually writing a file.
Files have metadata, right?
They have change times, they have attributes,
they have sizes, they have all these weird things.
So every once in a while, the client needs to let the server know
that something happened in this window that the server opened.
So the client will periodically update the server
via the master protocol, right?
And SMB, iSCSI, and NFS all do this.
You basically set attributes or whatever.
But the point being that from time to time,
the management of that file object
needs to kick in as an upper layer protocol exchange.
That's actually not shown here.
As well, the server occasionally has to call back to the client.
Maybe the server, maybe the NIC is about to be hot-plugged,
or maybe the resource limits on the server mean I can only open a 1 gig window
instead of a 4 gig window or something.
So the server needs occasionally to call back to revoke those things.
Closing the connection, tearing down
the handle is kind of rude, but all these protocols already support these callbacks. So we're going to
propose overriding or extending some of these callbacks additionally to allow the server to
manage this in the presence of a client remote access. But in all cases, the client will signal
the server when it's done to simply close.
So the idea is that there's one setup exchange and one teardown exchange.
There are a bunch of invisible zero overhead data transfer exchanges,
and there's some optional metadata exchanges in the middle.
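That one-setup, many-invisible-transfers, one-teardown shape can be modeled as a toy. The `PushModeRegion` class and its method names are invented for illustration; they correspond to the hypothetical flow, not to any existing protocol:

```python
# Toy model of hypothetical push mode: one registration exchange up front,
# then client-initiated RDMA writes that never interrupt the server, with
# an occasional commit that (in this simple model) is server-visible.

class PushModeRegion:
    def __init__(self, size: int):
        self.memory = bytearray(size)  # server-registered, advertised region
        self.server_interrupts = 1     # the one-time setup exchange
        self.durable = False

    def rdma_write(self, offset: int, data: bytes):
        """Pure RDMA: placed by the NIC, invisible to the server CPU."""
        self.memory[offset:offset + len(data)] = data
        self.durable = False

    def commit(self):
        """Durability request; modeled here as a message the server sees."""
        self.server_interrupts += 1
        self.durable = True

region = PushModeRegion(64)
for off in range(0, 32, 8):
    region.rdma_write(off, b"12345678")  # four writes, zero interrupts
region.commit()
# Only two server interrupts total: setup and commit.
```

Compare this with the pull model, where every single write costs the server two interrupts; here an arbitrary number of writes cost none until the client asks for durability.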
And so I can hypothesize a little bit what those look like by showing you three examples.
Does a write lock then occur at the NFS level? How do you keep two clients from writing over the same?
They'd better be careful.
They'd better open that thing.
They have direct access to that thing.
So they'd better use that upper layer protocol
to exclude one another or to take locks with one another
or to somehow coordinate with one another.
So it's done at the next layer?
It is definitely done at the next layer. There's no way the RDMA layer can or should do such a thing. It's a transport. Its job is to get a bit from point A to point B, not to handle
higher layer semantics. So that's what these file protocols are for, these file and or block.
So they can exclude, they can do other things.
But once you open that window, it's party on at that point.
There's no synchronization at that point.
Yes, question? So basically here, instead of acking every IO operation,
you are basically putting them in a bunch,
and after a certain number of operations,
you will say, okay, I've done this for you?
The question is, with this protocol change,
would we ack each transfer, or would we batch the transfers
and perform them as a single batch, a single acknowledged batch?
The answer is, it depends.
The client can choose to send a single
request and then notify the server, or it can choose to send a large number of requests and
notify the server. It all depends on the API that's driving it. That's part of the workload
discussion that I want to touch on before I'm done. Would the acknowledgments be separate from the RDMA semantics?
Absolutely.
The ordering of the RDMA is all that manages the presence or durability of the data.
The higher layer exchange as to the file state is a matter for the file protocol.
So here are the three examples.
SMB3 push mode.
It's hypothetical.
There's really no such thing.
This is just a figment of my own imagination
There's a setup operation.
Remember that setup operation that was the first one?
It's a new create context,
that's something we decorate creates with in SMB3,
or it's a new FS control:
we've opened the file
and then we perform some control on the file.
Its job is to register and advertise
a writeable and/or readable file by handle.
You've created this higher level file object
or region object
and you've registered it with RDMA
and advertised it back to the client.
It could be directly to a region of PM.
Maybe the name is literally an offset
of the PM device.
I want PM device 5 offset 2 gig.
So whatever it is,
whatever name it would be,
you'll get.
An example of that latter one
is the way we do Hyper-V live migration.
There's basically a big long GUID
that's tackled and protected.
And the Hyper-V client will open
the memory image of the destination,
literally by that UUID, that GUID,
and write it. The setup operation will take a lease or some sort of lease-like ownership,
right? It will reserve the region. It will reserve the right to call back and modify that
authority that was granted. Reads and writes are that raw RDMA. Client reads and writes directly via RDMA.
That isn't the SMB3 protocol at all.
That's raw RDMA under the covers.
It's totally invisible to the server.
Commit, though, is more important.
Commit is when the client requests durability.
That's the little glitch in persistent memory.
You have to get to a durable point from time to time.
And so there will be a new commit operation,
ideally performed via RDMA,
but there are similar operations in the SMB protocol already.
There's a flush operation that basically writes cached data to disk.
That's very similar to what we're talking about here.
It might be cached in the hardware at this point on its way via RDMA,
but it needs to be committed. It needs to be placed into stable, durable storage. So one could consider overriding
the SMB flush operation. However, that would require interrupting the processor. If you have
frequent commits, you might want to get around that at a lower layer. So I'm trying to give a
little peek into the future of how we might move stepwise toward an ideal goal.
We could start with a message
and proceed to a full extended RDMA exchange.
The callback, server-demanded client access
is similar to the current Oplock and lease break.
The server says, this handle has to change.
You can't have that region anymore
or I need to do something to your permissions on that region.
Things like that.
And finally, the finish, the client access is complete. It could be an SMB close, it could be
a lease manipulation of some kind. Whatever state we decided up here at the beginning, that
new create context or FS control would basically be undone down here at the bottom.
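As a thought experiment only, the hypothetical setup exchange might carry shapes like these. Every field, value, and class name here is invented for illustration; none of this comes from the actual MS-SMB2 specification:

```python
from dataclasses import dataclass

# Hypothetical wire shapes for SMB3 push-mode setup. All names invented.

@dataclass
class PushModeCreateContext:
    """Decorates an SMB2 CREATE, or rides an FS control on an open file."""
    context_name: str = "PUSH_MODE"  # invented context tag
    writable: bool = True            # request write access to the region
    offset: int = 0                  # start within the file / PM device
    length: int = 0                  # bytes to register and advertise

@dataclass
class PushModeResponse:
    """Server's reply: the RDMA handle the client will use directly."""
    rdma_token: int = 0              # steering tag / memory handle
    granted_length: int = 0          # may be less than requested
    lease_granted: bool = True       # server retains the right to revoke

# A client might ask for a 1 GiB window and receive a token for it.
req = PushModeCreateContext(length=1 << 30)
rsp = PushModeResponse(rdma_token=0x1234, granted_length=1 << 30)
```

The `lease_granted` flag mirrors the callback story in the talk: the server hands out direct access but keeps a lease-like hook so it can later shrink or revoke the window.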
You can do the same thing with these other two protocols.
It smells a little different,
and I'm still being completely hypothetical,
so don't view this as a protocol proposal.
But NFS RDMA can do the same darned thing.
It has perhaps a new NFS v4.x operation
to set up that thing by opening, registering
and advertising a writable, readable
file or region. It may
offer a delegation or
PNFS
could do this.
The PNFS has layouts that allow
a very different storage
protocol to be used under the auspices
of the NFS 4 protocol.
You could define an RDMA PM-aware
layout protocol, and it could simply speak RDMA, right? It wouldn't actually have any messages
that weren't RDMA in it, but it was, if you will, the door to it. So PNFS is kind of a big ball of
complexity on top of an already complex NFS v4.
It may not be the best implementation approach,
but it's probably the best architectural approach
in the way NFS v4 has been implemented.
Anyway, the rest of it behaves just like our previous example.
Writes and reads are direct RDMA access by the client.
The client occasionally requests durability via a commit. If it has an
RDMA extension, it can do it. If it doesn't,
it can say, dear Mr. NFS
server, could you please commit the
data that I've been writing and here are
the ranges that I think I've
dirtied. It has a callback
similar to the current delegation
or if it's a PNFS approach, a layout
recall that says you can't have that anymore
or I'm rearranging them, I'm giving you new memory addresses
or whatever it is.
That's old hat for layouts.
We do that with block devices on PNFS today.
And finally, the finish could be an NFS v4 close
or a delegation return or a layout return,
depending on the choice of implementation.
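The "here are the ranges I think I've dirtied" commit message the talk imagines benefits from coalescing adjacent ranges so the request stays small. A minimal sketch (the function is illustrative, not from any NFS implementation):

```python
# Sketch of dirty-range tracking for a commit request: the client records
# (offset, length) ranges it has RDMA-written, then merges overlapping or
# adjacent ranges before telling the server what to make durable.

def coalesce(ranges):
    """Merge overlapping/adjacent (offset, length) dirty ranges."""
    merged = []
    for off, length in sorted(ranges):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            end = max(merged[-1][0] + merged[-1][1], off + length)
            merged[-1] = (merged[-1][0], end - merged[-1][0])
        else:
            merged.append((off, length))
    return merged

# Two adjacent 4-byte writes collapse into one 8-byte commit range.
dirty = coalesce([(0, 4), (4, 4), (16, 8)])  # -> [(0, 8), (16, 8)]
```

The same bookkeeping works for any of the three protocols: the transfer is invisible RDMA, and only this compact range list crosses the wire in the master protocol at commit time.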
Chuck, question?
I'm just going to toot my horn a little bit.
Tomorrow I'm giving a talk
where we can
continue this discussion.
I have a slightly
different approach to the problem,
but this fits right in
with some of the things I'll talk
about tomorrow.
If anyone's interested in following up on this,
come to my talk tomorrow morning at 10.30.
Sorry, who are you?
I'm Chuck.
Thank you, sir.
Chuck Lever from Oracle.
And Chuck and I, I will say, have not talked at all about this.
So if we're thinking on the same lines, well, great minds think alike.
I don't know.
We'll find out tomorrow afternoon.
There's a third protocol, iSCSI.
So iSCSI is an interesting one.
iSER is an adaptation of iSCSI to RDMA.
And it basically modifies the data moving behavior of an iSCSI implementation to be compatible with RDMA.
There's a data mover architecture layer,
which is kind of a conceptual layer
that modifies what in raw SCSI might be SCSI operations,
SCSI messages,
and applies RDMA-specific rules to their processing.
So this one's a little fuzzy.
Some of these are iSCSI or SCSI-level operations,
and some of these are iSER-level abstractions. But I'm going to argue that it runs almost the
same way: setup will be a new iSER operation. And I believe it's an iSER operation because it
needs to register memory and return a memory handle. That's an iSER function, not an iSCSI function.
Writes are a new data mover model. The data mover currently uses something called solicited data to transfer things via RDMA. Solicited means that, you know, I have the data, but it's in a buffer
somewhere and you need to come fetch it or store it for me, right? Unsolicited is when it's inline
and it's just offered as part of the operation. So the architecture currently does not define
these operations. There's no such thing as unsolicited data-in except for things that come,
you know, inline to the message. So the data mover will need a little bit of thought, and maybe the
SCSI/iSER architecture will need a little bit of thought.
But the idea is exactly the same.
Implement an RDMA write within the initiator,
no target involvement whatsoever,
straight into a target buffer.
The R2T processing is just not going to occur.
There's nothing on the wire for R2T processing beside that operation.
That's because the target sends R2Ts, and the target didn't do anything here.
Read, same way.
It's an unsolicited SCSI out operation.
We implement the RDMA read from the initiator, from the target's memory.
We don't tell the target it even happened.
Commit is a new, possibly modified iSCSI, possibly a new iSER operation.
Performs a commit.
It's kind of like a FUA, right?
It's a flush.
It's a thing where you've got a bunch of data
that you need to commit onto the drive.
Somebody needs to think that through.
Callback is maybe a SCSI unit attention.
It smells kind of like a unit attention to me.
Something happens.
Somebody flipped the write protect bit.
Oops.
That kind of thing.
It seems to me that that signaling could be overloaded here.
And finish is almost undoubtedly a new iSER operation
because the setup is one.
If we have those,
we can actually use them by signaling the processor,
but we can avoid more latencies if we start to go down the stack.
The first is the RDMA protocol.
The RDMA protocol has no concept of durability.
It's simply not there.
This wasn't on the list when RDMA was being developed. There are things called placement and completion and delivery,
which are all about getting the packet to the stack, having the stack put the packet in memory,
memory, not necessarily durable, and to complete, to send some signal to the receiver.
Well, durability is new, and I argue that RDMA
write alone is not sufficient to provide
this semantic.
In RDMA speak,
pardon my geek,
the completion at the sender does not mean
data was placed, any particular
data was placed.
The sender, the only thing it means
at the sender is that the
buffer can be reused.
The NIC has taken a copy of your buffer and promises to do its best to get it to the destination.
So the sender doesn't know anything, literally.
All the sender knows is that the NIC tells him he can reuse the buffer.
It doesn't tell him where the data is outside that boundary.
Question on the right.
It also tells the sender that the data is visible.
No, no, no. Completion at the sender doesn't have anything to do with visibility.
With visibility toward the network at the target. So for example, if you did an RDMA write and got a completion on it, you know that another host can see that RDMA write.
That's true for some NICs. Not all. We can maybe discuss this one offline. I know where it will go.
I will assert,
perhaps without proof,
that the send completion doesn't tell you much at all.
You can't use it.
You definitely can't use it to imply durability on the remote side,
which is really all we care about if we're going to use this.
So we have to be very careful with this.
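As a sketch of that rule, with invented names (this is not any real verbs API): local send completion licenses exactly one thing, buffer reuse. Remote durability is only learned from an explicit commit acknowledgment.

```python
# Illustrative model of what a sender may infer at each event.
# All names are hypothetical, not from any real RDMA verbs API.

def on_send_completion(state):
    """Local completion of an RDMA write: the NIC has copied the buffer."""
    state["buffer_reusable"] = True   # the ONLY thing completion means
    # Nothing about remote placement or durability is set here.
    return state

def on_commit_ack(state):
    """Response to the proposed RDMA commit: durability is now assured."""
    state["remote_durable"] = True
    return state

def known_durable(state):
    # Treating buffer_reusable as evidence of durability would be a bug.
    return state.get("remote_durable", False)
```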
Processing at the receiver, a completion at the receiver,
means something different.
The receiver begins to process it when the data is accepted.
As an RDMA write comes in, he begins to push it.
But the protocol doesn't make a promise for where it is on the bus.
That's implementation defined, how far it's gone.
Also, I'll mention that segments can be reordered
by either the wire or the bus so they
arrive in funny order. Just because ABC gets written
doesn't mean that they appear as ABC
in memory at the peer. There are a lot
of open windows in the
RDMA protocol to allow that to happen without
buffering. They want to flow
these through so if they get reordered on the wire
they may get reordered on the bus during write.
Only an RDMA completion at the receiver guarantees placement.
However, placement does not equal durability.
Placement means that it was issued to the memory bus
and that it's visible to other devices on that bus.
It doesn't mean that it's actually reached the storage cell.
And so we've spent a lot of time talking about this in the SNIA NVM TWG,
and we're going to continue to talk about this quite a bit more.
Certain platform-specific guarantees can be made,
but the client cannot know them.
That's really important.
We want the client to be able to do the same thing
no matter what server he's talking to.
And so if the server plays clever tricks,
and Chet's going to talk about some of those clever tricks,
that's great, but the client can't know them.
So what do we need to do?
I think there are two obvious possibilities.
The first is to extend RDMA write
and to add some sort of placement acknowledge,
a push bit in RDMA write.
And it seemed like a good idea until I
started thinking about it. It has a big disadvantage. The advantage is it's simple,
right? You set the push bit. Okay. You know, as an API, I think that's easy, right? But then you say,
wait a minute. If I set the push bit, it changes the RDMA write semantic, right? It means that I
have to wait for it to appear on the other end and I get an acknowledgement. Well, guess what?
RDMA write doesn't actually have an acknowledgement.
There's a transport-level ACK, but there's no upper-layer ACK in RDMA write.
RDMA write is a one-way stream.
Only operations like RDMA read and atomic have a reply.
So it changes the semantic, plus it flow-controls that write.
RDMA writes are not flow controlled in current RDMA designs.
So it requires significant changes to the RDMA write hardware design.
I believe it makes it very undesirable.
Blocking the send work queue, that's really undesirable.
So I think the other possibility is a new operation, an RDMA commit.
It's flow controlled and acknowledged, like an RDMA read or an atomic;
it's actually a two-way trip on the wire, right? A commit
and a commit-ack. And the
disadvantage, obviously, is a new operation
requires a more significant protocol
extension. But the advantage is it
has a very simple API. It's a flush.
We can specify exactly what regions
to push. It
preserves the write semantic because it doesn't touch
the write. And it
acts only, well, I'll mention that in the next slide,
when the operation is actually complete.
So to drill down on the RDMA commit,
which I argue is the only sensible way to go,
it's a new wire operation.
It's implementable in any protocol, I believe.
The initiator, the initiating RNIC, right, the sender, provides a region list and other commit parameters, TBD, under control of the local API.
So I say I want to do optimized flush.
That's this NVM operation that we talk about.
Provides a list of addresses and ranges and sends it to the remote and says, I want to make this durable.
So the receiving NIC gets the operation and queues it in order.
It waits for all previous writes that might have touched that region
and puts the commit after them.
So it behaves a lot like an RDMA read or atomic.
You know that if you perform a write and then a read,
you'll read the data you wrote.
That's an ordering guarantee.
This is similar, that if I perform a write
and then a commit, that I've committed the data that I
wrote. Very simple.
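A toy model of that ordering guarantee, with made-up names (no real NIC exposes this interface): RDMA writes achieve placement only, while the proposed RDMA commit, ordered after them, is acknowledged only once the named regions are durable.

```python
# Toy model of a receiving RNIC for the proposed commit operation.
# Purely illustrative; the names are invented for this sketch.

class ModelRNIC:
    def __init__(self):
        self.placed = {}    # addr -> data visible on the memory bus
        self.durable = {}   # addr -> data guaranteed durable

    def rdma_write(self, addr, data):
        # Placement: issued to the memory bus, not yet durable.
        self.placed[addr] = data

    def rdma_commit(self, regions):
        # Queued in order, behind all previously accepted writes.
        # (A lazy NIC might ignore `regions` and flush everything dirty.)
        for addr, data in self.placed.items():
            if any(lo <= addr < lo + length for lo, length in regions):
                self.durable[addr] = data
        return "commit-ack"   # sent only when durability is assured
```

A write followed by a commit of its region is durable; a write outside the committed regions is merely placed.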
It's subject to flow control and ordering, which is
very important, because these commits may take time.
They need to block on the receiver.
The RNIC
pushes the pending writes. It might flush
all writes. It might not track
the regions. It might just say, ah, I'm just going to push
everything that's dirty. That'll get really painful
if we have terabyte-sized DIMMs.
Shared regions.
The RNIC performs the commit. Well, how
does it do that? That's what I'll talk about
in a second. And it responds
only when durability is
assured. I'm running
low on time, and I do want to leave a little bit for questions. There may be some local extensions
as well. I'm just going to skip over this, I think.
I think there are platform specific
attributes that are only
the business of the platform
that registered the memory.
I don't think that the client has any business
knowing what he's writing to.
It might be a PCI device,
it might be raw memory, it might be
DRAM that doesn't need a commit at all. But there are hints that can be given locally that will
allow the protocols not to even care. There is a third piece, there is a third protocol that's
important, and that's the PCI protocol. Most RDMA NICs are plugged into a PCI bus.
Guess what? There's no commit in PCI.
You can issue your writes, but they behave kind of like RDMA.
They're fire and forget.
You really don't know what happens to them after they return.
And that's awkward if you have to make a durability guarantee.
So PCI does need an extension, I believe.
If we don't have a PCI extension, however,
we can interrupt the processor.
The processor knows how to do this.
It's undesirable to interrupt the processor,
but it might be necessary in the short term.
It depends on the platform into which the RNIC is plugged.
Once again, that's the business of the platform,
not of the issuing client on the other side of the wire.
All he wants to know is, I commit, you return when it's done.
And then it's the platform at the other end
that figures out how to implement that commit.
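A hedged sketch of that division of labor, with invented names: the client only says "commit"; the platform behind the RNIC picks whichever mechanism it actually has, and the client never needs to know which one ran.

```python
# Hypothetical server-side dispatch for an incoming RDMA commit.
# The mechanism names are invented for illustration; the point is that
# the choice is the platform's business, never the remote client's.

def handle_commit(regions, platform):
    if platform.get("pci_commit"):
        # Future PCI extension: the RNIC can flush to durability itself.
        return "pci-commit"
    if platform.get("dram_only"):
        # Plain DRAM registration that needs no commit at all.
        return "no-op"
    # Short-term fallback: interrupt the processor, which knows how to
    # flush caches to persistent memory (or run a traditional IO stack).
    return "cpu-interrupt-flush"
```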
The expected goal, if we can build this whole stack
with an upper-layer file or block protocol, with an RDMA commit extension, and with a PCI commit extension to match it, is that we'll get a single-digit-microsecond remote write and commit.
The same kind of thing we can get with raw RDMA today.
There might be some additional overhead for the actual commit within the device, but we're basically talking about the wire transfer
as the biggest chunk of that fixed budget here.
Chet's slides in about an hour will give a lot of detail on this.
He'll actually pick apart those individual contributions
and how they can be provided and what they are,
or at least estimate what they are.
I'll mention that remote read is also possible, and it'll have an even lower latency than
write with commit, because it doesn't have to do the commit, it just has to read.
There's no server interrupt, once we get the RDMA and PCI Express extensions in place,
and there's only a single client interrupt per operation, which we can even moderate and
batch out.
So we're talking a very efficient protocol here, all
under the auspices of this fairly rich and complex set of protocols up on top. We don't
have to introduce a new protocol here. We just have to make small extensions to the
protocols we have. Last slide. There's a few open questions questions how do we get to the right semantic
there are discussions in multiple
standards group, how do they get coordinated
implementation in the
hardware ecosystem, this is going to take time
the NIC vendors, the platform vendors
all kinds of systems will have to change
to really get where this
promise will
lead them. I believe that we need
to drive consensus from the upper layers,
the storage protocols and these APIs, these new APIs,
down to the lower layers.
I don't want to let the lower layers tell the upper layers what to do here.
I want those upper layers to have a really good idea of what they need.
What about new API semantics?
Does NVML add new requirements?
We should think that through.
Do those underlying file systems on the storage side of it add new requirements?
I don't know.
DAX, the Linux version, and DAX, the Windows version.
And other semantics.
Are these upper layer issues?
Authentication, integrity, encryption, virtualization.
These are really important deployment questions.
I'm done.
Question?
Can you do multicast with this?
I mean, you mentioned reliability.
I don't know if you can make it multicast.
Is it possible to leverage multicast?
Can you do multicast with this?
There are RDMA standards for multicast.
They are typically connectionless,
which is not the model that all these protocols use.
And the upper layer protocols
are really not one-to-many protocols.
These are client-server protocols
or initiator-target protocols.
So while I think you could use multicast,
I think it would introduce quite a few challenges
up and down the stack.
Question?
I think one area that you focused on is how to make the commit to the actual persistent storage.
So even the traditional storage arrays that we see,
they also do not guarantee that once they ack a write, it's committed to the media.
They have a huge cache sitting in front, which will
cache the writes.
Are we really talking about this problem
only for RDMA, not
for NVM?
In particular, I don't think we are talking
about it only for RDMA. So let me restate the question.
Why are we having this
discussion about the actual transfer?
Why don't we talk about the implementation all the way down
to the durable media, how that happens and what the rules for that are. Way back in here, well,
it's just going to take me a long time to switch all the way back. There are some separate back
ends on the right, and they're not all memory, right? Some of them are traditional IO. Some of
them are memory. They'll all behave exactly the same way. You may RDMA into memory
and then the server may actually move it to the durable medium. That's why commit is so important,
right? And maybe you can't bypass commit when it lives on such a back-end media. That's why the
mode bit in the memory registration was there. What special processing also needs to happen?
If that's the case, you may have to interrupt the
server, and the server may have to perform a traditional IO stack operation. Once again, that's
not for the client to know. All the client knows is that semantically, I've issued some writes, and
I've committed them, and they're guaranteed durable, right? The client will issue the same set of
requests in either case. The server and the networks below it will
figure out how to implement that request. So really, I'm not trying to be weaselly. That's
out of scope for the proposal here. That's probably internal magic on the part of the server
rather than something that the protocol specifies. All right. Thank you. I'll be available to talk.
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe
at snia.org. Here you can ask questions and discuss this topic further with your peers in
the developer community.
For additional information about the Storage Developer Conference, visit storagedeveloper.org.