Storage Developer Conference - #99: SNIA Nonvolatile Memory Programming TWG - Remote Persistent Memory
Episode Date: June 17, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 99. So I'm Tom Talpey. I'm with Microsoft.
This is not a Microsoft talk. This is a SNIA talk.
I'm a member of the SNIA Non-Volatile Memory Programming TWG.
I've been contributing to it for several years, and it's been a really rewarding experience.
I'm going to talk about where we're at with the remote flavor of the NVM programming model,
which is interestingly different from the local flavor
and which has a number of new dependencies.
Andy talked about the dependencies on operating systems and platforms; well,
now we're going to add networks and network protocols to the picture. So it is
increasingly complex. I'm going to talk about basically four things. The SNIA NVMP TWG
activities, what we're doing, what we've done.
A little bit of review.
Andy covered this, but I wasn't sure everybody would be in both talks.
Remote access for high availability work.
That's our remote programming document, the remote access for high availability.
It's a separate document.
It's a white paper, not a technical specification.
The technical specification may incorporate this paper now that we understand it better.
It's really come a long way.
RDMA requirements and extensions.
I'm going to dive down a layer and talk about the RDMA layer
and what implications the NVM programming model has on the RDMA layer.
That's where it will really get interesting, I think.
And then wrap it up with a little bit of current and future work for the remote programming TWG.
So to kick it off, the mission.
The mission was to accelerate the availability of software
that enables persistent memory hardware, right?
This is very generic boilerplate,
but the hardware included SSDs.
That's the disks that Andy was talking about.
So SSDs, disk drives, or emulated disk drives,
and persistent memory itself,
the software spanning applications and OS.
Our mission was to create the NVM programming model
to describe application-visible behaviors, right?
It's an interface.
It's way up at the top.
And allow APIs to align with OSs
and describe the opportunities.
The NVM programming model, the most recent version,
version 1.2, was approved by SNIA just a little over a year ago.
It's been out there for a year.
We enable block and file, that's the disk drive and the memory-mapped versions,
with things like atomic capability and granularity, thin provisioning, all these things which are important to storage devices. It uses memory-mapped files to expose persistent memory to applications, etc., etc.
It's a programming model, not an API, as Andy described.
The programming model methods, these guys on the left are the block mode,
and we called it innovation.
It doesn't feel like innovation anymore.
And so I'm not going to talk about these, but this is the disk drive emulation version.
This is the so-called RAM disk or similar sorts of things.
And the one we're really interested in are these emerging PM technologies where we insert this little PM in here.
We have a PM file and a PM volume.
And really all the attention is on PM file.
So anyway, block and file modes use I/O, the traditional storage model.
Volume and PM modes use a mapped model with load/store/move.
And here's a great big long URL that you can go see the
current version of the specification. We're currently working on this quite a bit. We've
been very busy for the last year, so that's what I'm going to talk about. The remote access for HA,
this is our remote version or remote addendum, if you will. It's a white paper that builds on
the 1.2 model. We actually published it back in 2016,
and we've been doing quite a bit of work in the background on it.
The specification 1.3, right?
We released 1.2, and 1.3 is in development.
We're updating that specification,
and we're incorporating learning from the remote access white paper.
And a couple of the specific things are asynchronous flush
and ordering and error handling,
which is what I'm going to talk about.
And finally, we have this brand new,
just announced in the last six weeks, four weeks,
something like that,
a collaboration with the Open Fabrics Alliance, the OFIWG.
OFIWG is the OpenFabrics Interfaces working group, if you will,
the sort of RDMA verbs or RDMA interface corollary to what we're
doing here with persistent memory. And we're going to develop a new version of the programming of
remote access for HA, but we're going to do it in conjunction with this collaboration. So we're
going to get feedback from the RDMA community on their lower level APIs. And we're going to expand
our use case enumeration. We have a couple of very simple use cases. We're going to try to make these more thorough, complete, deep,
et cetera. All right. Persistent memory and adding remote. I'm just going to do this little
plus a remote thing. So here's the picture, a little different from what Andy showed, but the same basic concept.
We have a user mode persistent memory aware app, and it has two ways. It has file APIs that dive
into PM-aware file systems, and it has direct memory load/store, right? So it basically mmaps
the file, proceeds to do loads and stores, and goes back to do syncs, all right? This is fairly
generic and high level.
I'm not going to go into any more detail on that.
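To make that picture concrete, here is a minimal sketch of the local PM-file model just described, assuming a Linux-style DAX mount; the path and sizes are purely illustrative, and a real application would use whatever its platform's NVM programming library provides.

```c
/* Minimal sketch of the local PM-file model: mmap a file on a PM-aware (DAX)
 * file system, store through the mapping, then sync the dirty range. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem/example.dat", O_RDWR);   /* hypothetical DAX-mounted file */
    if (fd < 0) return 1;

    size_t len = 4096;
    char *pm = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pm == MAP_FAILED) return 1;

    memcpy(pm, "hello, persistent world", 24);         /* plain loads and stores */

    /* The "sync" step: ask the OS/library to make the dirty range durable. */
    msync(pm, len, MS_SYNC);

    munmap(pm, len);
    close(fd);
    return 0;
}
```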
However, we're going to add remote.
And so the basic idea is we add an RDMA data path
through an RDMA NIC to a remote device,
and we add RDMA operations that flow
either from the top-level user interface, maybe hook the file APIs,
maybe a direct API. It can also come from the PM-aware file system. And I drew a dotted line,
but it can also come from the PM driver. This is the device DAX model. We don't actually cover this
in the SNIA NVM programming TWG right now
because we pretty much chose not to.
But this path is available.
So there are many ways to drive RDMA operations to a remote PM device.
And I am going to explore this.
And the twig has been exploring this.
PM remote access for HA.
High availability.
I should have spelled that out in the title.
So this is the NVM Programming TWG-developed interface for remote persistent memory.
It's the analog to our existing persistent memory interface.
We are seeking to maximize our alignment with the local PM interface. The local PM interface is what
we do today. We want to maximize the alignment. We don't want to write a new version of the remote
programming interface, right? That would be a huge mistake. I'm going to write my program to do local,
and then I'm going to write another program to do remote.
That's the big mistake.
We want to tuck this under the existing interface.
And we know we can.
We've thought about this long and hard.
But that's the goal, to make sure that we have a single interface, a single API.
There might be new things that have to happen in the remote case,
but we're going to tuck it under that.
We're going to take the remote environment into account, including RDMA semantics and restrictions. We're going
to think long and hard about how this can be actually implemented on operating systems
or in particular on RDMA layers. The OFIWG will help us a lot with this; they'll check our work, if you will. And we're going to nonetheless scope our interest
all the way into the RDMA layer. If the RDMA layer doesn't do something, we're going to think about
it and recommend that it do it. And that's actually well established. We're going along.
And analyze the error cases. A whole lot of new error cases pop up when you go remote. You not only have a processor bus between you and the device,
now you've got a piece of wire between you and the device.
So a lot of really interesting things happen.
I'll talk about it in a little bit.
So this is the outline of our existing document.
I just put it up here so that you get a feel for the way the document is put together
and what it looks at.
It expands out into some details,
but actually some of the more important sections are not expanded, right?
Down here, five and six are kind of the meat of it.
But it starts with purpose and scope, and we have a taxonomy.
This recoverability definition is really important, though. And if you're going to read the
document, I encourage you to think about these a little bit. Durability versus availability.
Durability means that we've stored it persistently somewhere, and it'll come back after the lights go out, right?
But availability is different,
and we try to tease apart the meanings here, right?
That the data can be mirrored for availability,
and it can be durable in a separate coordinate system, right?
Durability is different from availability.
And so we want to talk about it
because high availability, after all,
is the title of the document. Consistency, recovery, and integrity, these are more things which come
into play in highly available scenarios, right? We want to make sure
the data is consistent, that everybody sees the same version of the data, right?
We want to make sure that it can be recovered,
that the error case is well understood.
And we want to make sure that the data has good integrity, right?
That when it's stored, we want to make sure that the right bits were stored
and we don't fetch bad bits or we don't inadvertently result in bad bits.
So all these things are very important.
These are the HA extensions. This is
where we start to get interesting. RDMA for HA is how we actually implement those extensions.
We talk about security and we have some interesting things. This little guy down here at the bottom,
workload generation and measurement, is where we're working with the OFIWG as well, right, as I talked about with use cases.
All right, map and sync.
These are generic slides that are used a lot
in the NVM programming presentations,
but these first three major bullets are sort of the fundamental points we make
when we go over it in all these venues.
But I want to point out that map
is a function of local PMEM,
and sync is, the way it's expressed right now
in the document, is a function of local PMEM,
as Andy was talking about.
You map the file, you get it in your address space,
you do loads and stores.
And that's kind of the paradigm
that we've modeled after.
But all this stuff
really starts to change
when you do it remotely.
Your stores do not
magically become RDMA writes,
nor do you want them to.
If I do a store of an integer,
I don't want to spit a packet out the back of the machine, right? I want to control the way
or batch or somehow manage the way these things flow to the wire. And the implementation of such
a thing would be impossible. You'd have to take a fault and invoke some driver and, you know,
your latency story would go out the window, which is not exactly the goal
of a memory device, right? You want very low latency. Flushing, which is the sync stage,
applies to RDMA just the same, but it flows all the way down the RDMA stack across the network
and through the IO pipeline of two machines, right? Your local machine and your remote machine.
So there's a whole lot more to talk about now, right?
We can't just say, well, you magically flushed your processor cache.
Oh, no.
You flushed your processor cache.
You performed an RDMA operation, which was carried over the wire between two devices
that the CPU, you know, wasn't really a part of.
And then it did the same thing on the other side.
Okay, now we're talking. We've got a problem or a situation on our hands that we have to discuss.
And it brings in a very strong motivation for this thing we call asynchronous flush,
which is part of what we're doing in 1.3. And asynchronous flush is the key to batching;
it's a way of saying: I'm not flushing yet, but you're
ready to do some RDMA operations.
And I'll show you how we use that.
Another important thing is failure atomicity. The current processor and memory systems guarantee
a sort of inter-process consistency, right? That all the processes running on a platform can all see the
same memory. When somebody does a store over here on this side of the motherboard, the other guy on
the other side sees that same data. That's consistency or the so-called visibility of the
data. But the model, the NVMP model, provides only limited atomicity, well, the platforms as well,
with respect to failure.
So there are a lot of interesting things, and that's where we've really spent a lot of our time in the Twig.
Well, the same kind of thing is impacted by the network, right?
We stuck the network in between.
So now when we perform a remote store, a remote sync, these same kinds of failure atomicities come into play, but they are
multiplied by the network. Because alignment restrictions may change for, you know, failure
atomicity of aligned fundamental data types, right? That suddenly changes when we're bus-attached,
right? The bus might be narrower than the processor word width, for instance. And that
thing then goes
over a wire, you know, and actually once it, before it goes over the wire and after it emerges,
it goes over a PCI Express IO bus. And all these things have different atomicities and different
behaviors, different ordering requirements and things. And so all these things come back to
haunt the failure atomicity. And so we have to look at new remote scenarios and reflect network failure into this.
We originally just looked at the local platform behavior.
Like Andy said, you do a store that splits across two cache lines.
What if the light goes out?
Well, now I'm doing a store that not only splits across those cache lines,
it splits across packets and remote operations and things.
What happens when maybe the lights don't even go out?
Maybe the network just interrupted, right?
So we have a whole lot of brand new and very interesting scenarios to explore.
And consistency for recoverability.
This is another interesting section of the NVM programming spec, the 1.2 spec.
As Andy mentioned in response to a question about transactions, there's an application level of consistency as well.
The hardware can't magically know what the application thinks is a safe recovery point.
The application has to manage that. And so these sorts of things, these transactions and consistency models of the platform,
are really important to the application.
How do they change in a remote scenario?
And so these are all things that we need to look at and think through.
And that's what we do as a TWIG.
It's really interesting stuff.
I'm trying to encourage people to participate here, if you haven't gotten the picture yet.
Okay, some key remotable NVM programming interfaces.
Some of them you already know about.
In NVMP 1.2, we have this thing called optimized flush.
The optimized part of it meant that you could do it from user space.
You could do it in a very highly optimized,
lightweight way as an application.
A remote optimized flush is a little harder to do from user space. When you have an RDMA adapter,
you've got to issue some RDMA requests and all, but it directly maps to RDMA. We can easily do
this if we only extend RDMA a little bit. I'll talk about what that means in a second.
There's also this second version of optimized flush called optimized flush and verify. This "and verify" may sound like a really strong thing. It's actually
quite weak. It's kind of a best effort verify. It's sort of, if you detected an error while you
were flushing, please tell me about it before you complete the operation. And the strength of that
verify is becoming more and more important. We believe
it's way too weak. The reason it was weak in the original spec is that the operating systems
provide a weak semantic. msync is void, right? You call msync and it can tell you you had a bad
file descriptor or something like that, but it doesn't tell you anything about the data.
It basically starts the data flushing, right? And you have to wait
for that data in another way and collect the error in another way. msync really doesn't give you a
response. You don't really know how long to wait for the error, right? You don't know when the last
bit of dirty data goes down there. And you want a stronger verify, and I'll talk about how we can do
that, at least with RDMA transfers.
And the second thing which needs to be added is async flush and async drain. I have a better slide about this in a sec, so I won't go into too much detail. But async flush initiates
flushing for a region, and async drain waits for that. And there are some ordering guarantees as well
that we need for things that follow the flush
in order to implement transactions.
The other NVM programming methods are remotable,
but only via upper layers.
We don't expect to have the full NVMP interface
jammed into the RDMA layer, right?
And we don't want to create a new protocol
as part of the SNEA work.
That's not our goal.
Our goal is to influence and set requirements,
but not to design new protocols.
So we're going to leave a lot of these things
to the upper layers,
but we're going to put big, bright red paragraphs
in our documents that say:
this is not covered by the NVM programming interface.
You need your operating system to solve this,
or you need your upper layer protocols to solve this problem.
And so that's the goal of the SNIA,
to increase that awareness of what's missing or what's needed
on top of what's been specified.
Here's the ladder diagram from the document, and it shows a single application call called
optimized flush. So we have an application that's mapped a region of memory locally
and proceeds to store data in it, right? It does load, load, store, store, blah blah blah; it dirties
up a bunch of data. At some point, it's going to say,
I want to make that data persistent.
So it calls optimized flush.
Our model in the remote access world
is that at that point, we say,
ah, okay, now we know what data is dirty.
The application told us what data is dirty.
And it told us we need to move it to the other side
and commit it to a persistent medium, right? And so the library, the NVM programming interface library, begins to
do a bunch of RDMA writes and starts to shovel that data over the wire. It then does this new
hypothetical flush, right? The ordering of these four operations is what's important. The flush makes sure that it pushes those writes.
But the flush has to wait for these writes, this little lasso that we draw,
it has to wait for these writes to actually appear on the bus before it responds.
And so you can think of this as a fairly heavyweight operation in the large data model.
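A rough sketch of what that ladder looks like in code, with entirely hypothetical helper names (rdma_write_async, rdma_flush_async, and rdma_wait_flush are placeholders, not part of any spec or library): the library pushes the dirty range as RDMA writes, posts the flush behind them, and only the flush is waited on.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK (1u << 20)                          /* arbitrary 1 MiB chunking */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

struct rpm_conn;                                  /* hypothetical connection handle */
void rdma_write_async(struct rpm_conn *c, uint64_t off,
                      const void *src, size_t len);        /* posted, no response   */
void rdma_flush_async(struct rpm_conn *c, uint64_t off, size_t len);
int  rdma_wait_flush(struct rpm_conn *c);                  /* block for the ack     */

/* One way a library might implement a remote OptimizedFlush for a dirty range. */
int remote_optimized_flush(struct rpm_conn *c, const char *local,
                           uint64_t remote_off, size_t len)
{
    /* 1. Shovel the dirty bytes over as RDMA writes (fire and forget). */
    for (size_t done = 0; done < len; done += CHUNK) {
        size_t n = MIN((size_t)CHUNK, len - done);
        rdma_write_async(c, remote_off + done, local + done, n);
    }

    /* 2. Post the flush for the same range; it is ordered behind the writes
     *    on the connection, so it pushes them all the way to durability.   */
    rdma_flush_async(c, remote_off, len);

    /* 3. The only round trip the application waits for. */
    return rdma_wait_flush(c);
}
```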
And so this is our motivation for a new version of optimized flush.
We want to batch these writes maybe a little earlier so that this flush can happen more promptly.
And in particular, we want to use all this time that isn't shown here to drive the network.
We want some parallelism and concurrency.
So we define this thing called asynchronous flush.
It separates the flush and the drain stages.
This is not currently in the document, right?
This is in the 1.3 draft document
that we're still writing internally.
It allows early scheduling of writes without blocking.
Optimized flush is a blocking operation, right?
We don't like blocking operations.
They're bubbles in the pipeline.
We used to call it giddy-up.
We've loaded up the horse, and we want the horse to get moving,
and there's more horses' worth of load to be moved or whatever,
so we say, giddy-up, get out of here, and we start the RDMA writes.
It's important for efficient concurrent processing.
It turns out that async flush is pretty interesting locally as well.
The async flush will allow you to keep your pipelines fairly shallow.
And that's always good.
Shallow pipelines are good, at least if they don't hold you up.
And we expect that Giddyup is useful both for the applications explicitly
and particularly for middleware
libraries. If the library hasn't seen a flush in a while but knows there's a bunch of stuff
going on, it can do stuff in the background. It can schedule background work using this
async flush. Drain is the thing you call after flush. It allows the application to ensure
persistence. It's good because if
there's less data remaining to flush, because we've done this early scheduling, then we have
less wait latency when it comes time to drain, right? So that's good locally and obviously good
remotely. The problem with async flush is the error conditions, right? Now it's much more difficult
to figure out when an error occurs which thing it was because you have all this asynchronous stuff
flowing in the background.
You want to know how local your error is in general.
We have to get a little fuzzy when we go async.
I think this is a traditional problem
seen in computer science for decades.
Whenever you have asynchronous I.O.,
you've got to collect all these little
atoms of work and put them together, along with the errors. When everything's sunny and the wind
is at your back and it's a beautiful day, everything's great. The problem is that not
all days are like that. And you want the error case to be well understood so you know what
to do. So that's what we're focusing on a lot, asynchronous flush.
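A small sketch of how an application might use the flush/drain split being described; async_flush and async_drain are placeholder names, not the spec's final methods.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical NVMP-style calls; placeholders only. */
void async_flush(void *addr, size_t len);   /* start pushing this range, don't wait */
void async_drain(void);                     /* wait for everything started so far   */

void append_records(char *pm, const char *recs, size_t rec_size, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        memcpy(pm + i * rec_size, recs + i * rec_size, rec_size);  /* plain stores */
        async_flush(pm + i * rec_size, rec_size);   /* giddy-up: schedule it now   */
    }

    /* Only block here; most of the data is already on its way, so the drain
     * latency covers just the tail. */
    async_drain();
}
```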
And under development, also being worked on in the 1.3 and the new version of the remote access
for HA, we have a number of interesting other things. And these are just a few of the ones that
I cherry-picked. One, visibility versus persistence. Andy was describing this.
When things become visible on the memory bus,
are they persistent?
Well, maybe yes, maybe no.
And these are two very different things.
Our example here is:
you do a compare and swap on persistent memory,
and you do not necessarily yield a persistent lock,
meaning that you've done the compare and swap atomically,
but you've done it on some piece of memory that hasn't been made persistent yet, right? And the state of visibility
is actually different from the state of persistence. And so you have to think of persistent
memory in slightly different ways now, right? You have to think of it as something that needs to be
committed before it is actually persistent and before you've saved the state.
So these locks, like compare and swap, suddenly get new semantics or new side effects or warnings on the side.
So what we're thinking right now is consumers of visibility versus consumers of persistence.
And the two are different, right? A consumer
of visibility would be like a local lock or somewhere where you're putting something into
the system in a multi-threaded application and you want it to be visible so you can write
your multi-threaded application in the same way. But when you're making it persistent,
you've actually become a different type of consumer. You're expecting a slightly different semantic.
We're trying to bring out what that means, right?
And it's kind of an academic exercise in a way, but we're trying to plant it in reality
and thinking through what they really mean when you're implementing them on the platform.
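A tiny C illustration of the visibility-versus-persistence point, assuming a hypothetical persist_range() that stands in for whatever writeback-plus-fence the platform or library provides.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

void persist_range(void *addr, size_t len);   /* hypothetical: writeback + fence */

bool take_persistent_lock(_Atomic uint64_t *pm_lock, uint64_t owner)
{
    uint64_t expected = 0;

    /* Visible to other threads as soon as the CAS succeeds... */
    if (!atomic_compare_exchange_strong(pm_lock, &expected, owner))
        return false;

    /* ...but only durable after this step; a crash in between leaves the
     * visible state and the persistent state disagreeing. */
    persist_range((void *)pm_lock, sizeof(*pm_lock));
    return true;
}
```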
The assurance of persistent integrity. Okay. I've made it persistent.
How do I know it's good? Andy mentioned this. The memory will throw an NMI. It'll throw an NMI
whenever it discovers the bad bits. I don't know when it discovers the bad bits. Maybe it discovered
it when it wrote it to the memory. Maybe a scrubber came by 10 minutes later and found it.
All of a sudden, boom, the NMI goes off.
And the system goes, oh, my God, I got bad blocks in my PMEM.
Well, what blocks were bad?
And who wrote them?
And who cares?
And, you know, I got to notify the right guy.
And so these types of semantics are very difficult.
And we do have that optimized flush with verify.
Well, that's an interesting angle.
So this is what I talk about,
the assurance or the strength of the notification you get.
So we are looking into this,
and I believe an explicit integrity semantic
is one way to go.
That's just Tom Talpey's opinion.
Scope of flush is another interesting thing. When you flush, what gets flushed? The API says you
must flush this range of bytes. Bytes 4, 5, and 6 must be made persistent. It comes back and it
says, yes, they've been made persistent. Maybe 4 through 8 were made persistent. Maybe the whole
page was made persistent.
What's the scope of persistence?
It's never less than the guarantee,
but it's often much more than the guarantee.
And when you go remote, really interesting things happen
because you have queue pairs
and sort of ordered packets on the wire.
And so we've come up with the concept of streams of stores
and this thing called a store barrier or an ordering nexus,
which are used to plant flags in the stream of stores and say everything prior to this time
has been made safe. This is very, very useful for implementing in remote environments.
We are trying to bring that out in the document. We also have flush hints. Can I flush asynchronously
or synchronously? How
much do I care about this particular flush? If I'm implementing a transaction, it's usually
that last flush that I care about. In other cases, you may have streams of these things
and you may want to mark some of them in special ways.
We want to model these in the programming interface and also in our binding, if you
will, to the RDMA protocol. We're looking, once again, to understand and guide platform implementation. We feel
these are really important concepts that need to be brought out for the industry.
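Purely as a sketch of the kind of hints under discussion, something like the following; none of these names or flags are in the specification.

```c
#include <stddef.h>

/* Hypothetical flush hints: how much the caller cares about this particular
 * flush, and where the ordering barriers fall in the stream of stores. */
enum flush_hint {
    FLUSH_HINT_ASYNC   = 1 << 0,   /* schedule it, don't wait               */
    FLUSH_HINT_SYNC    = 1 << 1,   /* I need this one complete now          */
    FLUSH_HINT_BARRIER = 1 << 2,   /* everything prior to this must be safe */
    FLUSH_HINT_LAST    = 1 << 3,   /* the transaction-closing flush         */
};

int pm_flush(void *addr, size_t len, enum flush_hint hints);   /* hypothetical */
```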
I'm going to pause for just a sec. I'm a little over halfway. I'm going to have to speed up
as I dive into RDMA.
Okay, you've probably seen these before.
I started talking about this stuff in, like, 2016, I think.
Maybe 2015.
RDMA adapter. So remote access for HA, remote PMEM for HA.
We immediately knew we needed to do this over RDMA. And I really dug into the SNIA persistent memory effort when RDMA came to the table.
RDMA is my thing.
I put a lot of protocols on top of RDMA, and I've done a lot of things with RDMA itself.
RDMA adapters provide a connection-oriented, ordered, reliable stream.
The memory registration provides a handle
to enable remote access.
It's described by the handle,
which is basically an opaque identifier,
and an offset and length
that names the bytes inside that region, right?
And we have these operations called send
and, in particular, RDMA read and RDMA write,
that act on those buffers.
And they're broadly implemented across the industry
over multiple protocols, okay. There are dozens of vendors and three or four protocols,
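For readers who haven't touched the verbs interface, here is a minimal libibverbs fragment showing the registration handle and a write that names a remote region by rkey, offset, and length; all queue pair and connection setup is assumed to have happened elsewhere.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Register a local buffer and post one RDMA write to a remote region that was
 * advertised out of band as (remote_addr, rkey). */
int push_bytes(struct ibv_pd *pd, struct ibv_qp *qp,
               void *buf, size_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) return -1;

    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge, .num_sge = 1,
        .opcode     = IBV_WR_RDMA_WRITE,      /* posted, fire-and-forget */
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;     /* the "handle + offset" triplet */
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(qp, &wr, &bad);      /* completion polled elsewhere */
}
```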
many more proprietary ones. RDMA protocols provide this semantic called RDMA write,
but what we need is that remote guarantee of persistence, that thing you saw called flush.
It's in support of optimized flush
or async flush or other things.
RDMA-Write doesn't cut it.
It guarantees remote visibility,
but not durability, right?
It provides for delivery to the visibility
of applications running on the remote platform.
As we've learned from our previous segments here, that ain't it, right? It needs
to be physically in the memory. And RDMA doesn't really go that far, right? RDMA just puts it at
the doorstep of the memory and says, okie dokie, I'm sure you'll open the door and let me in pretty
soon, right? We need a flush. We need something that pushes it in that door and past the boundary,
the dotted red line in Andy's picture.
So we desired an extension,
and back in 2016 and more recently,
we've had these things called RDMA commit,
a.k.a. RDMA flush.
They both have a very similar semantic.
They execute like an RDMA read.
They're basically
ordered, flow controlled, and acknowledged. You send a request, it waits for its turn at the head
of the queue, and it's acknowledged when it completes. The initiator requests a specific
byte range to be made durable. It's basically a subset of that little RDMA triplet. And the
responder acknowledges when durability is complete. It looks exactly like optimized flush in the SNIA model, basically.
It just happens to be remotable, right?
And there's a bunch of problems implementing these on local platforms,
but we can at least write it down.
And so it's a new wire operation and a new verb.
There are a bunch of details here.
I basically talked about them all.
But there are a number of ways to actually implement them.
And the simple way is to interrupt the CPU and let it do it.
The more complicated way is to change the behavior of the platform so that it happens automatically.
And the way we really want to do it is to perform via some sort of explicit operation from the PCI Express bus.
And that's still under discussion.
But these platform-specific workarounds can certainly do that type of thing.
It's just a question of how universally it's implemented.
Each NIC will be responsible for it.
The PMEM subsystem, the PMEM library, doesn't need to understand this.
It says: this is
the requirement, I'm going to call this method, and you're going to do it, right?
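A sketch of the shape such a flush/commit operation might take; this is not a real verb at the time of this talk, just an illustration of the fields the proposal carries.

```c
#include <stdint.h>

/* Hypothetical shape of the proposed flush/commit operation. It carries the
 * same region triplet as a read or write, executes in order like an RDMA
 * read, and completes only when the named bytes are durable on the responder. */
struct rdma_flush_wr {
    uint64_t remote_addr;   /* start of the byte range to make durable */
    uint32_t rkey;          /* registered-region handle                */
    uint32_t length;        /* number of bytes                         */
    /* Response: a simple acknowledgement returned after durability, or an
     * error/termination if the flush could not be performed.            */
};
```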
Here are some workloads. This is where I show you exactly what's going on. So
first workload, basic replication, simple replication,
mirroring, right? I got a local write. I want to replicate it to the remote side, okay? That's all
I want to do. I just want to copy my stuff someplace else. I want to get it into a different
failure domain for basic availability, okay? So I do a write, maybe I do a whole mess of them,
and then I do a flush, right? I'm not overwriting data. I'm not really
ordered. The ordering dependency is basically around these flushes. Every time I flush,
I want it to be safe, right? I don't really care when the writes showed up. I don't have
any control over that locally, so why should I have any control over that remotely? There's
no completions at the data sink, and we want to pipeline it, right? We want this to be
incredibly efficient. All we want to do is store the data.
Push the data, store the data.
Push the data, store the data.
We're not sending it to another upper layer.
We're not processing it when it arrives.
And so we draw this fairly simple picture.
It looks exactly like the picture we had before.
We call these put, put, put.
A put is an unordered sort of a write.
They turn into RDMA writes that flow.
We follow it with a flush.
The flush basically collects all the writes and responds.
Really simple.
This is the way optimized flush and the flush part of async flush will basically work on the wire.
This is basically how you'd see them on the wire.
There's a pause right here while the flush takes place.
So it's a blocking operation.
These guys don't block.
In fact, they don't even get responded to.
You notice how they just sort of die out into the PMEM line.
These are called posted operations.
Fire and forget.
I posted my write.
I went off and I did something else.
Well, that's nice.
Let's do something more interesting. What about a log writer?
A log writer wants to write a log record, 4K, whatever, wants to commit it to persistent memory,
to something durable, and then wants to write a pointer that says that log record is good.
I just added it to the linked list, basically. You do not want to place that pointer until the log record is successfully durable.
That would be really bad if somebody came up and found a good pointer in a bad record.
So you need some sort of transaction.
You need to be sure that the log record write was good before you wrote the pointer. You need some sort of acknowledgement or rule
in the protocol that implements drain, that waits for this commit, this flush. You need to wait for
it. If you do that end-to-end, you have a pipeline bubble, right? I'm going to send my flush. I'm
going to wait for my flush. Now I'm going to send my write.
Right? That's a big dead time on the wire.
You don't like pipeline bubbles.
They impact your operations per second in large ways.
So you want to just say, commit write.
And if the commit failed, don't do the write.
Right?
You get the same answer either way.
You know it either worked or didn't work.
You know, you successfully or unsuccessfully committed your transaction.
But it's way more efficient.
So you desire an ordering between the commit and a second write.
Between that curly brace and that curly brace.
Basically, that little comma.
You want that comma to be on the data sink, not on the data source.
So we have this special thing called an atomic write.
An atomic write drops 64 bits of data in an atomic fashion,
but it does so only after a commit.
And so it is really simple and really powerful.
It looks a lot like other RDMA atomics, in fact.
So it's ordered in very similar ways,
and it can be implemented very trivially on a lot of NICs.
If they already do a compare and swap,
they can implement this atomic write.
If they do an uncached write, they can almost do it,
and probably can on a lot of platforms.
So we think this is a fairly lightweight thing.
It's being discussed both in the IETF and the IBTA,
the iWARP and InfiniBand/RoCE communities.
And there's pretty good agreement on it.
A couple of details to work out.
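Here is the log-append pattern as a hedged sketch, with hypothetical helpers standing in for the three wire operations; the point is that all three can be posted back to back, because the ordering lives at the data sink.

```c
#include <stddef.h>
#include <stdint.h>

struct rpm_conn;   /* hypothetical connection handle */

/* Hypothetical helpers for the three wire operations discussed above. */
void rdma_write_async(struct rpm_conn *c, uint64_t off, const void *src, size_t len);
void rdma_flush_async(struct rpm_conn *c, uint64_t off, size_t len);
void rdma_atomic_write_after_flush(struct rpm_conn *c, uint64_t off, uint64_t value);

struct log { uint64_t tail; uint64_t head_ptr_off; };

void append_log_record(struct rpm_conn *c, struct log *log,
                       const void *rec, size_t rec_len)
{
    uint64_t rec_off = log->tail;

    rdma_write_async(c, rec_off, rec, rec_len);   /* the record body */
    rdma_flush_async(c, rec_off, rec_len);        /* make it durable */

    /* Executed on the responder only after the flush succeeds: readers can
     * never see a good pointer to a record that isn't durable. */
    rdma_atomic_write_after_flush(c, log->head_ptr_off, rec_off);

    log->tail += rec_len;
}
```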
Here's what it looks like on the wire.
This is an animation.
So we start off with put, put, put, and a flush.
And the flush, I've sort of interrupted it in midstream.
It's begun the flush.
Shortly later, we send the write after the flush.
We haven't waited for this flush yet.
But it stopped.
It said, oh boy, there's a flush in progress.
I can't continue to execute yet.
A moment later, the flush completes.
And we respond to the other side.
And then, boop, that little arrow pops in, and it says, you may go now.
And that stop sign turns into a go.
I should have had another animation, but then it would have looked funny when I flattened it for PDF.
And at that point, the write occurs, and we respond to the write.
Now we have a transaction.
We wrote a blob of data, we committed the blob of data,
and we wrote a pointer, and it all happened in the right order.
Now if the flush had failed,
we never would have uncorked this thing.
We just would have gone, and broken the connection.
So that would meet our needs.
So, looking good.
But now, you notice there was no CPU on the other side checking our work, right?
So how do we know the data was good?
I'm a paranoid log writer, you know.
I do not want to commit a record
that you didn't get every bit right. And most
storage layers do this quite faithfully. They check every bit. And they actually ask the
disk to check every bit. There's all manner of paranoia up and down the stack. We don't
have a lot of paranoia at the RDMA layer. It's just a transport.
So what we really want is an integrity check. Now, I talked about this back in 2016,
and it's still not reality, but I think it's getting more and more compelling, so I'm just
still ringing that bell. In order to check it, there's some really bad ones. Reading it back,
right? I don't want to read the data back. Not over the wire, that's for sure. And you might
not actually read the actual data, because there's no such thing as an uncached read. You might be fetching data from somebody's cache
on the platform. You could signal the upper layer and say, oh, Mr. CPU, by the way, I just wrote
some data. Would you tell me it's good? And he can go off and do his stuff locally. You can call
the NVM programming API and do a check error and blah, blah, blah. But that's pretty heavyweight
and a lot of latency. What we really want is the lower layer to do it. Oh, there's one more, a couple more interesting
things. Maybe we just brought a new volume online. We want to scrub it before we add
it to the array. We want to do some storage management recovery, that kind of thing. So
some, you know, we want to use this in a couple different ways. So the solution is RDMA verify.
We have a new queued operation just like the commit,
but it has integrity hashes.
I don't think the RDMA layer needs to negotiate the hashes.
The RDMA layer needs to implement them,
but I don't think they're really part of the protocol.
It's kind of like an RDMA read.
You're asked to read the data,
but instead of giving me the data,
just give me the hash, please.
And the way you compute
that hash is somebody else's question. Besides, no two people want the same hash. Some people
want a simple CRC. Some people want full SHA-512. I don't want to tell them which one to use.
I don't think it's a good use of protocol time to negotiate that, but that's something that can be discussed. The hashing
algorithm is going to be implemented in the RNIC or a library somewhere. Some component somewhere
on the target platform is going to do it. The semantic, I had a few options when I first talked
about this, but I've nailed down, I think, what is right. See if you agree with me. The source
computes the expected hash of the region.
You know, I believe this region should have a hash result
of 1, 2, 3, 4, 5.
So it's previously computed it, right?
It sends that hash to the target and says,
please compute the hash.
The target computes the hash.
If it matches, everybody's happy.
We just say, your hash matches my hash.
Here it is.
You know, I think we're good.
If it doesn't match, now something interesting has to happen.
Maybe that's a fatal error.
Or maybe it's just, please tell me which regions are bad.
So we have two different semantics.
One is return the computed hash value that doesn't match.
And two is, oh my God, fatal error, stop.
Fence all further operations.
The log writer would use this.
And so basically this three-way return,
a successful match, an unsuccessful match that returns the value, and a fatal mismatch.
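A sketch of the responder-side decision just described; region_hash() and all the names here are hypothetical, and the hash algorithm itself (CRC, SHA, whatever) stays outside the protocol.

```c
#include <stddef.h>
#include <stdint.h>

uint64_t region_hash(const void *region, size_t len);   /* hypothetical helper */

enum verify_result { VERIFY_MATCH, VERIFY_RETURN_HASH, VERIFY_FATAL };

/* The initiator supplied the hash it expects for the region; the responder
 * recomputes it and picks one of the three outcomes described above. */
enum verify_result handle_verify(const void *region, size_t len,
                                 uint64_t expected_hash, int fatal_on_mismatch,
                                 uint64_t *computed_out)
{
    uint64_t computed = region_hash(region, len);

    if (computed == expected_hash)
        return VERIFY_MATCH;                       /* everybody's happy */

    *computed_out = computed;
    return fatal_on_mismatch ? VERIFY_FATAL        /* fence further operations      */
                             : VERIFY_RETURN_HASH; /* report; let the caller decide */
}
```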
So here's what it looks like on the wire.
Put, put, put, flush.
I'm not going to do atomic write this time
just to get it out of the picture.
Verify.
I've done a flush.
Now I want to be sure it's good.
It stops, just like the other guy did,
and waits for the flush.
The flush completes.
Boom, it does whatever it did.
It might have failed.
If it failed, we're done.
But if it succeeded, then the verify is signaled to proceed.
Verify goes off to a verify engine, which computes the hash.
And if it's equal, we complete.
If it's not equal, we either return the value or break the connection.
That animation didn't quite work. Yeah, there it is.
So not equal might still say verify complete,
but not equal can also say big red light.
So that is a new proposal, not yet in any standard, but I believe very important.
There's some implications for this thing at both levels.
The NICs have to figure out how to do it. But the upper layer interface also needs to express the behavior. And so all of these prior RDMA
extensions need consideration in the RDMA PM for HA document. First, I believe, is that
it strongly reinforces the need for async flush. We need this asynchronous behavior to keep the pipeline full, right,
to efficiently implement this thing. An implication of this is that we have increased
imprecision of errors. You know, the RDMA connection broke, and we know what sequence
number it was executing when it broke, but we really don't know the details of what broke, right? We're not running on that platform. There's no check error call. And so that increased imprecision is a little bit of a worry,
right? We need to understand more about the implications of that in the remote scenario.
I touched on that earlier, but now I'm hoping you understand a little better what might go wrong here.
Atomic write completion.
Do we need to know if the atomic write is done?
We're going to express the atomic write in the API, maybe.
What's the completion semantic of it?
Is it a write, like a store operation, fire and forget?
Or is it a sort of a transaction?
I perform this write, I want to know that it's done.
Does it have a fence or a barrier behavior? That's kind of the core question.
We don't know that. That's a local question. It's not a wire protocol question, but it's one we haven't really dug into yet. Asynchronous verify. Verifies can be pretty expensive. Maybe we need to run this asynchronously like we run flush. I don't know. We haven't decided. And verify fail imprecision. Verify fail,
when we've actually gone off and looked for an error and it failed, you know, it's kind
of imprecise in the connection break model. But it's very precise in the non-connection break model.
So how do we harmonize those two behaviors from one method?
Other ongoing NVMP TWG work.
The core 1.3 update is a current work in progress.
We've been working on it since we shipped the last one
a little over a year ago.
I don't know when we're going to be done,
but we're moving forward on it,
and we meet very regularly to discuss it.
Asynchronous flush and working out the details of it,
we're pretty far along in that one.
We did a lot of this work last year.
We knew it was coming, but it's maybe not done.
We do have implementation learnings on all these things.
There's a really interesting interaction
with the C memory model and the flush command:
you can't reorder instructions
around the flush. Compilers and processors
love to rearrange code for efficiency.
Flush is really special.
It's a barrier.
And so we need to teach the compilers and perhaps the processors that it's a barrier.
And that's interesting.
I hadn't expected we'd become language engineers, but we might have to.
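As a concrete example of why flush behaves like a barrier, here is an x86-flavored persist routine, assuming CLWB support and compiling with -mclwb; other platforms use different instructions, and this is illustrative rather than the model's definition.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* Write back every cache line covering [addr, addr+len), then fence so the
 * writebacks are ordered with respect to the stores that preceded them. */
void pm_persist(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clwb((void *)p);        /* write back each dirty cache line */

    _mm_sfence();                   /* the barrier: neither the compiler nor the
                                       CPU may reorder stores across this point */
}
```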
And visibility versus persistence, as I described early on, right?
They're two very different semantics
and two very different goals.
We're going to continue the work.
We're going to have greater RDMA mapping detail
and maybe these extensions.
We're going to have efficient remote programming models.
We're going to call out the error handling,
and we're going to depend on the Open Fabrics Alliance's
OFIWG to help us get this stuff right.
So we're opening it up big time.
Scope of flush and flush on fail.
That was what Andy described.
Those are things we don't talk about yet.
But flush on fail is kind of that emergency flush.
What if the emergency flush didn't work?
What if somebody yanked the super cap off my machine?
How do I know and what do I do?
So interesting stuff.
And that is it.
All right, thank you very much.
I'll be around.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.