Storage Developer Conference - #56: Samba and NFS Integration
Episode Date: September 8, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 56.
Today we hear from Stephen French, Principal Systems Engineer Protocols with Primary Data Corporation,
as he presents Samba and NFS Integration from the 2016 Storage
Developers Conference.
Okay, well I want to thank you all for coming.
I think this is always one of the more interesting talks because you get a mix of NFS people
and SMB people and I think that's always a useful thing because we have many of the same
problems.
But if you add clustering, we go from problems to psychoses, or from problems to something very difficult;
I don't know the right word for it, other than it is enough to keep us busy for a hundred years.
Just a reminder, I work for Primary Data.
We have some wonderful NFS people.
We have some wonderful people like Richard Sharp here in the audience.
But I'm not particularly talking about our product. Obviously we share many of these same challenges, but so do many of you guys.
I think Red Hat and others deal with many of the same problems every day.
I maintain the kernel driver, cifs.ko, for SMB3 and CIFS enablement in the kernel.
I'm also a member of the Samba team, and I've been doing this for a long time, but it does seem
like some of these same problems come back year after year in slightly more interesting
ways.
I'm going to talk about NFS 4.2 and SMB3, how to integrate them multiple ways, one on
top of the other, then both together exporting, and on top of a cluster file system.
Some of this you all will be very familiar with, and hopefully some of this will be useful information.
So why do we care about this?
So let's see if...
Yeah, so let's try this and see if it works better. This one's off, actually.
Yeah.
Let's see if there is a way.
Yeah. You'd think it has an on-off button, wouldn't you?
No.
Maybe not.
Okay.
That's weird.
You would think.
You know, we deal with networking problems, but they're different.
Okay, sounds great.
Okay, so why do we care about this?
And I think one of the reasons is that the performance and stability differs a lot.
Now, did any of you sit through Ira's talk yesterday?
He talked about, in particular, Ceph. Ceph, right?
Now many of you know Gluster, many of you know PNFS.
Windows has its own clustering model.
I think all of us have seen at some point GPFS or Lustre; they all share different problems,
headaches with performance, with stability, with compatibility.
But we all know that our Windows clients and our Mac clients
are going to do pretty well over SMB3.
We all know that there are workloads that are,
I mean, I think, if I remember correctly,
Ronnie still every day deals with NFS, right?
There are workloads where we have to support these two protocols.
Now, it's been a long time, but pretty much everyone else died.
And I think you guys know that, right?
You remember this?
You know, 1984 was not just George Orwell.
1984 was the birth of two protocols that ate all the rest.
And I was thinking, why do we have dinosaurs on our shirt again?
But if you guys were downstairs at the Plugfest,
you know we still deal with these.
Now, the nice thing is that 30 years of improvements have created some pretty impressive things.
And even in the NFS world, we're seeing new things being developed.
We're seeing new implementations of copy range coming in.
We're seeing new layouts proposed.
The feature sets overlap, but they create kind of unique problems.
But they also create kind of strengths and weaknesses for particular workloads.
And I think it's obvious when you think about the RDMA discussions, right?
We, Tom gave a talk on RDMA, right?
SMB3 RDMA has done a very nice job.
Both these protocols are very well tested and they're more common than all of the others
combined.
Now, some of this is review I want to go through fairly quickly.
Early versions of SMB, sorry, early versions of NFS 4 had some security features.
They were layered in an interesting way that made it possible to do some nice security things.
They had a uniform namespace. They were stateful. The original NFS v3, except for the byte range
locks, was stateless, which was kind of odd. They have compound operations support, but with NFS 4.1, they
added parallel NFS and trunking, they added this concept of a layout, and there's a good overview
of this in some of the SNIA presentations, and I think over the years people like Alex and Tom
Haynes have given good overviews of that. But we also added, not long ago, NFS 4.2. To give a 30-second reminder of what
NFS 4.2 included: it added sparse file support so we can better do fallocate, space reservation,
labeled NFS so you can do SELinux, IO_ADVISE, server-side copy (copy file, copy range,
clone file, clone range), and application data holes.
Now, when you think NFS, most people actually think of NFS v3.
Stateless, simple reads and writes,
very different traffic patterns.
But NFS 4.2 actually does add some useful features,
and I don't want to underestimate its usefulness.
You can go out to the IETF site and look at the spec.
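To make a couple of those features concrete, here is a minimal, illustrative sketch of how they surface through ordinary Linux system calls on an NFS 4.2 mount; the mount path is made up, and exactly which operations get offloaded to the server depends on the client and server versions.

    /* Hypothetical sketch: NFS 4.2 features via ordinary syscalls. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/nfs42/example.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Space reservation: maps to the NFS 4.2 ALLOCATE operation. */
        if (fallocate(fd, 0, 0, 1024 * 1024) < 0)
            perror("fallocate (reserve)");

        /* Hole punching: maps to DEALLOCATE, keeping the file sparse. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      4096, 64 * 1024) < 0)
            perror("fallocate (punch hole)");

        /* Sparse-file awareness: SEEK_DATA/SEEK_HOLE map to the SEEK operation. */
        off_t data = lseek(fd, 0, SEEK_DATA);
        printf("first data byte at %lld\n", (long long)data);

        close(fd);
        return 0;
    }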
Now, what's the status on some of these things?
This is just review to sort of understand what's going on with NFS because many of us are in the SMB3 world.
There are major layout types.
NetApp guys love files, right?
We have object and block.
Each of these are developed by different vendors.
The kernel server has added some support for PNFS. Layout stats and a new layout, flex files, were added in the last two years.
Flex files have been in since kernel version 4.0. Linux actually had sparse files for over a year and a half in NFS.
Space reservations, labeled NFS.
And the addition of copy offload is quite recent.
Last I checked, IO_ADVISE and the application data holes were not in.
I didn't look today, but I don't think they're in.
Okay, so what are these layout types?
It's kind of weird.
It's almost like every vendor added their own,
but we have file, now flex files.
So you can see Tom Haynes' presentation.
Flex files is sort of file V2,
if you want to look at it that way, layout type.
Object and block.
And there are others that have been proposed.
So what is flex files?
It has a lot of improvements on top of the file layout. It lets you take Red Hat servers that are NFS or Isilon servers, all these different types of servers, whether V3, V4, V4.1, and you can spray your I/O to them.
You can have the client do the mirroring, have the client get the layout from the metadata server, the NFS client gets it and is able to write it to different places.
So it allows you in a sense to use NFS as sort of NFS v3 or v4 as your sort of standard access protocol over the wire
or just to do the reads and writes of data and you have a metadata server that understands NFS 4.2
and understands how to give out a layout for this data.
And the clustered file system,
existing clustered file systems don't map real well to this,
but it does allow you to create clustered file systems sort of out of PNFS.
So here's a picture of it,
stealing a slide from Alex McDonald here.
And, you know, you can do fencing, and I think it's an interesting view here, right? Your client goes to
the NFS metadata server, gets a layout,
and then is able to write one or more copies of that data
to boring servers that really know nothing about
PNFS.
So that's a quick review of the NFS stuff.
Now, why do we care about both?
Because they have unique and interesting features.
Well, what are some of those unique and interesting?
What's different?
Obviously, NFS is more POSIX compatible.
We're trying to fix that with the talk.
As a matter of fact, immediately following,
we have a session where we'll get a chance to discuss your requirements in more detail
on that.
Things like advisory range locking don't map to SMB directly.
You have to map through mandatory locks.
Unlink behavior, good example, works better over NFS.
There are Unix extensions that Jeremy and I and others did for CIFS, but not for SMB3. There's no equivalent of PNFS
and the ability to query a layout
and get layout stats in SMB3.
It could be added,
but we don't have such an equivalent.
Also, NFS, for good or bad,
tends to be layered in a way that
in some cases makes it easier,
in some cases makes it much harder.
That layering on top of SunRPC means that it's harder for some features,
some security features, but in other ways,
it's somebody else's problem, most of the security things like Kerberos.
Which is a nice thing, right?
Samba guys lose a lot of sleep over Kerberos, like every day.
Labeled NFS, we could do that over SMB,
but right now the
attributes that flow over
SMB are all the user attributes,
not security or
trusted or any of the other
categories used by
SELinux here. SMB 3.1
though includes things like a global namespace.
There have been proposals for a global namespace
in NFS.
They have never been accepted.
Claims-based ACLs, obviously the clustering features,
witness protocol, the RDMA is much better, I think, in SMB3.
And like I said, I think Tom gave a talk on that earlier.
And, of course, there's many management protocols that are layered on top of SMB in useful ways
and that match very nicely when
you're trying to manage a Windows server. You sort of get a whole set of features listing
servers, managing servers, getting group information, user information that is almost always considered
present at the same time. Branch cache, shadow copy, sort of SCSI over SMB with MS-RSVD, these
don't have equivalents in NFS.
And I think the multi-channel is really neat.
No one wants to set up the headache of bonding multiple adapters together in NFS,
but man, it's easy with SMB3.
Adding adapters just works.
Okay, so what's the best way to get these to integrate together?
We've got to have Windows clients, Mac clients.
I think Jeremy, you mentioned at Samba XP, we now even have Google shipping SMB client.
SMB client even on your phone, right?
SMB3, all these protocols.
What's the best way to get these?
Chromebook.
Chromebook.
Okay, so your Chromebook laptops, right?
We have a Linux-like OS
that's shipping user space libraries to access this.
And I think some of our marketing guys
were using SMB apps on their phone to access data.
It's not just the Xbox and weird appliances,
but weird stuff uses SMB and NFS
like routers and NATs.
Should we do NFS over SMB?
Should we do SMB over NFS?
You have choices, right?
We could have PNFS on the bottom
and just Samba sitting on top.
We could have NFS sitting on top of an SMB client.
These things are all possible in theory.
It's funny, if you Google some of this stuff, you end up with Hadoop discussions where I
really wish those Hadoop guys actually came to these conferences, because they would learn
how to do this much more easily.
But they have the same issues like, how do I get Samba on Hadoop?
The more likely solution we're going to get to is a dual gateway over something like PNFS.
Now if you're in Ira's world, you're running on top of a different cluster, you're running
on top of Ceph, and there are lots of people who would be running on top of Gluster.
The cluster file system underneath varies and it has some of the same problems, but the most likely
solution is we have Samba or something like
it running in user
space. I know we've had talks in this conference
about kernel space servers.
There have been a couple proposed for
kernel space servers, but here let's talk about Samba.
Then of course you have NFS servers
like the kernel server or Ganesha
to serve your v3 clients.
If you're doing PNFS, you can go directly to the back end.
So what are the problems you have to solve?
I think all of us have dealt with at least one of these problems.
You have to deal with creation time.
You have to deal with these crazy DOS attributes.
DOS attributes you don't think matter.
Actually, they do.
So we also have to have ways of dealing with security. I hate to
state the obvious, but I remember vaguely some news stories about North Korea.
You guys remember those? I vaguely remember they broke into SMB3 servers, or no,
sorry, CIFS servers, right? They broke into CIFS servers and did something. I mean,
these are crazy stories, right?
Security matters.
People actually care about ACLs.
They actually care about this.
And we can't just say, well, it doesn't matter.
The lowest common denominator
is fine. Directory
leases, metadata performance matters.
Leases, when you're opening
a file and it's not heavily contested,
you should be able to cache it.
Quotas, auditing, we should be able to allow our administrators some flexibility, no matter
what protocol they're using, to configure easy to use quotas.
Right mouse button on their Mac, right mouse button in Windows in Explorer, click on something.
We have to deal with the fact that open has lots of things that happen; it's not atomic. We have to deal with the differences in byte range locking, and we
have to deal with the problem that NFS can spray data across lots of different
servers and it's kind of invisible to us when we're running in this kind of
environment. The data may be spread across 10 or 20 or more servers. In
addition we have to be able to deal with UIDs. Ronnie may be UID 1000 because he was the first one added on that server.
Ira may be UID 1000 on server 2, and UID 1003 on something else, and 1000 on a different one.
So how are we going to map these things?
Obviously we can use WinBind for these sorts of things, but it is painful.
We deal
with these problems all the time. We also have to take into account that there were
significant security improvements added in Windows. Share encryption is very useful.
There's a reason that share encryption is required to access some of these very remote
servers, some of these cloud-based SMB servers. Secure negotiate is better. What do we introduce
in terms of security problems when we're running in this environment?
And we talked about the how do we get consistent UID mapping.
You know, we have three separate ways of naming you.
You could be david at microsoft.com.
You could be UID 1000.
Or you could be some long SID in the Windows world.
We have to get these all right.
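As a rough illustration of that mapping problem, here is a hedged sketch using Samba's libwbclient, which asks winbindd to turn a Windows-style name into a SID and then into a UID; the domain and user names below are made up, and this assumes a configured, running winbindd.

    /* Hedged sketch: name -> SID -> UID via Samba's libwbclient.
     * Requires a running winbindd; link with -lwbclient.
     * The EXAMPLE domain and "david" user are made up. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <wbclient.h>

    int main(void)
    {
        struct wbcDomainSid sid;
        enum wbcSidType type;
        uid_t uid;

        if (wbcLookupName("EXAMPLE", "david", &sid, &type) != WBC_ERR_SUCCESS) {
            fprintf(stderr, "name lookup failed\n");
            return 1;
        }
        if (wbcSidToUid(&sid, &uid) != WBC_ERR_SUCCESS) {
            fprintf(stderr, "SID to UID mapping failed\n");
            return 1;
        }
        printf("EXAMPLE\\david -> uid %ld\n", (long)uid);
        return 0;
    }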
Okay, so what if we have KNFSD or Ganesha exporting the same data? We need a good cluster file system.
We need something like CTDB to handle starting and stopping services to help
with the non-POSIX state that the cluster file system can't handle. Okay, so
what about CTDB and NFS? Can they run together? Does CTDB do anything with NFS? Yes.
I think you guys probably
who were in
one of the talks yesterday
probably got a chance to see
the config file. Here's a CTDB config file.
Notice it manages NFS at the bottom. It can turn
that on. When it manages NFS, what does that mean?
Not as much as I'd
like, unfortunately. But what it does mean is it can
start and stop NFS automatically in the cluster.
So when cluster nodes go down or go up,
you can manage IPs, you can move IPs.
There are about 15
distinct CTDB NFS helper and event
scripts and about 40 test
files for testing
CTDB NFS related events,
starting and stopping, address takeover,
grace period.
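For reference, here is a minimal, hypothetical sketch of the kind of lines involved (not the exact file from the slide); the variable names are from CTDB configurations of that era, the paths are made up, and the exact file location and variable set vary between CTDB versions.

    # Cluster basics
    CTDB_RECOVERY_LOCK=/clusterfs/.ctdb/reclock
    CTDB_NODES=/etc/ctdb/nodes
    CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses

    # Let CTDB start, stop and monitor the file services on each node
    CTDB_MANAGES_SAMBA=yes
    CTDB_MANAGES_NFS=yes

    # Which NFS implementation the NFS event scripts should drive
    CTDB_NFS_CALLOUT=/etc/ctdb/nfs-linux-kernel-callout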
There are additional links here on the presentation to look up more information if you care about using CTDB in the NFS world. Of course, what we would
like is things like deny modes. What we would like is for
things like advisory locks to be able to be reflected in something like CTDB
state or at least a call out so Samba and NFS could keep them closer.
Now, let's go back to something
very interesting. The CIFS client.
Well, forget this NFS
stuff. Just run over a CIFS mount.
You can actually do this,
sort of. At least I did for a while.
It does need some work, though.
If you want to follow up on this,
the file systems documentation,
NFS exporting talks about it.
If you tried it, as you saw today, it gets an error.
Well, the reason it gets an error is because I have CIFS NFS export turned off and so the
export ops are not exposed, but there are tiny versions of those.
You really, really, really, really should not NFS export a network share.
There are many surprises.
Yeah, yeah, I mean, NFS v3 is weird,
but the things you have to,
if you look at what export ops have to do,
they have to deal with the mapping of the NFS file ID,
right, they have to deal with the get parent,
and there's like four or five functions
you have to be able to export well.
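For anyone who wants to look, those hooks live in the kernel's export_operations table; a fragment like the following (with made-up foo_* placeholders, member names as in include/linux/exportfs.h) is roughly what a filesystem has to fill in before knfsd can export it.

    #include <linux/exportfs.h>

    /* The foo_* functions are placeholders a real filesystem would implement. */
    extern int foo_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
                             struct inode *parent);
    extern struct dentry *foo_fh_to_dentry(struct super_block *sb, struct fid *fid,
                                           int fh_len, int fh_type);
    extern struct dentry *foo_get_parent(struct dentry *child);
    extern int foo_get_name(struct dentry *parent, char *name,
                            struct dentry *child);

    static const struct export_operations foo_export_ops = {
        .encode_fh    = foo_encode_fh,    /* inode -> opaque NFS file handle */
        .fh_to_dentry = foo_fh_to_dentry, /* decode a file handle back to a dentry */
        .get_parent   = foo_get_parent,   /* find ".." for a disconnected dentry */
        .get_name     = foo_get_name,     /* name of a child within its parent */
    };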
Although it's technically possible to do, there's a reason that this is turned off by
default.
The reason is explained very, very well in that documentation.
But if a client can see the underlying CIFS share as well,
there are surprises; you can get corruption.
That is the main problem we have.
I mean, what's nice about this NFS over CIFS
is that it is kind of an interesting experiment because a lot
of the cluster file systems have worse POSIX semantics than CIFS or looser semantics than
CIFS. It's interesting, if you want to play some time, to do a grep of the kernel and see
which ones implement export operations. It's a little scary. Here's the BTRFS example of
the... Okay, so let's forget that for a minute and let's go back to thinking about how we would
export KNFSD and Samba over a cluster file system, what's necessary.
NFS v3 and v4 can go to KNFSD or Ganesha and Samba can take care of the rest.
Now oversimplified, the problems that Volker and Jeremy deal with every day,
Michael deals with every day.
If you can get it from a POSIX API,
like the file size or mtime,
you go to the POSIX API to get it.
You may have a wrapper for that POSIX call,
but basically Samba has no problem with this
over something weird, a weird cluster file system.
If it's file system specific,
you can have a VFS module return.
You can have something like Ceph
that runs all in user space.
So you don't even have to go down the kernel at all.
So you can get a file system specific VFS, NFS, something, and you can return it.
Now if that fails, if you don't have such a thing installed or if you don't find it
there, you can get out of an XADDR.
So Samba heavily relies on things like ext4, writing to xattrs for DOS attributes, among other things,
and creation time, for example. Go ahead, Jeremy.
If this were really a good idea, why
wouldn't somebody have written a standard VFS module that is purely an NFS translator?
Oh, actually, somebody did.
Is that you?
No, it wasn't me. Yeah, so his comment was,
well, why wouldn't you write a VFS module?
You could...
Yep.
Yep.
Okay, so repeating the question and the answer,
if this was a good idea to do NFS this way,
why wouldn't you just, like with Ceph,
why wouldn't you just have a user space NFS?
We have pynfs in user space.
We have libnfs, right?
Ronnie has libnfs, very nice library.
Why wouldn't we just put a PNFS-capable user space module?
And the answer was, well, somebody did.
But the kernel NFS actually gets a little bit more love.
And to be honest, there are performance reasons why you do want to go through the kernel.
I think that despite all of our headaches about the kernel,
the kernel does some things reasonably well.
Now, it's an interesting question,
but I don't think libNFS has PNFS support right now, right?
There are other reasons for it as well.
But you could, in theory, write a user space NFS module that plugged in here.
But the general thing is if you can't find it in a file system specific module for Gluster
or Ceph or NFS, we're going to try xattrs.
But what if you don't have xattrs in your file system?
This makes life difficult.
You have to guess or emulate in other ways sometimes, or you have to emulate xattrs.
Well, Samba can emulate xattrs.
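To make the xattr dependence concrete, here is a small illustrative sketch that just peeks at the xattr Samba uses for DOS attributes on a POSIX filesystem; the path is made up, and the blob format is Samba-internal, so this only shows where the data lives, not how to parse it.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/xattr.h>

    int main(void)
    {
        char buf[256];

        /* Samba stores DOS attributes in the "user.DOSATTRIB" xattr. */
        ssize_t len = getxattr("/srv/share/file.doc", "user.DOSATTRIB",
                               buf, sizeof(buf));
        if (len < 0) {
            perror("getxattr user.DOSATTRIB");
            return 1;
        }
        printf("Samba DOS attribute blob: %zd bytes\n", len);
        return 0;
    }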
What about getting these with better system calls?
Those of you who were at the File System Summit or Vault
probably heard the discussions on XStat,
now called StatX, to make life confusing.
And I have a link here to the LWN article on it
that has the patches.
Looked like it had general agreement.
The last comment was basically, you know,
interchange between Dave Howells and Christoph
that said, hey, shouldn't we get
glibc to
agree on this? Yes.
I think. But he could never
get a response from them. On the other hand, there wasn't really any
disagreement about the patches.
This is a little bit frustrating because
they patched the file systems to get birth time
back and to get simple attributes back.
Some people would argue
that it should be simpler or more complicated, but basically
it's annoying
because we're so close. Go ahead.
I believe that the patch set is moving again,
because Jeff Layton published a
version of the patch set
that actually implements statx
inside Ceph.
You can see this in
GitHub. You can see the set of patches
that are required for it.
Yeah.
So the point
was that
statx or xstat
is not just for
EXT4 and NFS
and CIFS, but it's also for
some of these cluster file systems, and they
see activity on the mailing list and in the Git repos that show that some of these cluster
file systems are implementing StatX.
It's been frustrating, of course, because for three or four years we've seen everybody
agree on StatX, and somebody had a feature and it just stops for a while.
This would be so helpful for us in our world because now NFS and SMB can both
get the birth time of a file without relying on extended attributes. And that actually kind of
makes a difference because right now if you can cut down the stuff that's emulated metadata that
you have to write on every file, it really does help performance. And needless to say the birth
time should be on every file.
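For the curious, here is a sketch of what the xstat/statx interface being discussed looks like from user space, roughly as it later landed in the kernel and glibc; the path is made up.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct statx stx;

        /* Ask only for the birth time; no xattr or TDB emulation involved. */
        if (statx(AT_FDCWD, "/srv/share/file.doc", 0, STATX_BTIME, &stx) < 0) {
            perror("statx");
            return 1;
        }
        if (stx.stx_mask & STATX_BTIME)
            printf("birth time: %lld\n", (long long)stx.stx_btime.tv_sec);
        else
            printf("filesystem did not report a birth time\n");
        return 0;
    }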
What about this creation time? There was another set of patches that Anne Marie
Merritt proposed.
They're very small, actually, to turn on all
of these fields in the NFS client and add a simple ioctl. Here's a patch series
from back in May, late May. And this was interesting because it's the classic example. This would help everybody,
these small little things to get an ioctl to get these attributes out. And Christoph's comment was
use xstat instead. Well, I agree. We have two different ways to get this. If we don't get it, then every single file has to have stuff in an xattr or a TDB. Now, there's a whole set of attributes,
right? Here's the whole list, unless you guys added one yesterday.
Things like sparse we can probably get other ways. Compressed, I'd like. Encrypted, I'd like. But realistically,
the ones that matter the most we have mostly mapped already through xstat.
Now what about ACLs? We talk about ACLs in a painful, disparaging way because we lose
sleep over it, but the ACL model between NFS 4.1, not 4.0, but the NFS 4.1 ACL model and
SMB is close enough. There are
problems and it is different. I mean, username and domain is not the same as SIDs. When we're
talking about RichACL, we're usually talking about UIDs, not username and domain.
Rich ACL does make it easier. The last update on the rich ACL patch drama was about a month ago. I have a link to it here. Go ahead.
I remember working with you at IBM about eight years ago, there was some really smart developer
in India who put out a really good set of patches.
Andreas took those over.
Pretty much every year it came up in the Linux file system summit.
I think at least three or four times it came this close.
So the question is, how do we get it?
And suggestions are obviously welcome.
Red Hat and SUSE probably have some leverage here.
One of the things that would help a lot is having more file systems implemented.
One thing I was very happy about, so we can get very cynical about certain political, philosophical things in the kernel.
Deny ACEs are evil, for example.
Deny ACEs are evil, therefore we're not going to do something that's needed for Mac and
Windows and NFS compatibility.
Without being cynical, we should recognize Andreas got a lot of stuff in.
I can tell you from experience he got stuff in because I'm sitting down at the Plugfest trying to
get some code written, and my code was gone because he removed some dead code and
cleaned up the xattr and ACL stuff.
So his cleanup patches went in; unfortunately, it cost me about a half
hour of extra work trying to find where this code was.
But that's good, right?
He cleaned up some dead code, he was making stuff better across the file system
so his patch series is smaller.
So there are fewer reasons to deny it,
but we still have this philosophical issue
of deny bit reordering
and just for completion
I put the Microsoft blog entry
saying why they think
deny entries belong in that order.
Go ahead.
Have people stopped laying their body down on train tracks to stop this yet?
No.
Okay.
By the way, Jeremy is aware of more of this, I think.
I'm sure you'd love to talk to more people about this, right?
I don't know.
If you want, we can make the road better. I don't know. And to be fair,
there are kernels out there
and there are products out there
that ship RichACL.
And yes, we, you know,
there is testing on RichACL.
So this is, you know,
it's getting better.
But it would be nicer
if it weren't just everybody
downloading a patch set
from Andreas's Git tree.
And by the way,
there's a VFS RichACL module for Samba.
Ronnie, go ahead.
So basically, eight years later,
we're still at the lay-your-body-on-the-train-tracks kind of negotiation.
I really only...
Yes, yes, but there are very, very few bodies in the way.
There are very few bodies in the way?
Body, yes, body, yes.
Okay, no comment.
In any case, from my perspective,
I have to be...
The best thing I can do as a developer
is to get all of the rich ACL prep done
in my file system that I have some control over.
The best thing Samba can do is update VFS rich ACL
to make sure it works well with Samba 4.5 and master.
The best thing the NFS developers can do
is make sure VFS rich ACL continues to integrate well
in those patches.
And they're actually fairly well tested.
There were some changes that Trond and others merged in
to make RichACL
a little cleaner.
As individual developers, we can
get all the other stuff out of the way.
And, of course,
we can rely, in our products, on just
patching in RichACL.
On the other hand, I really would like to
get some agreement about
why the heck somebody thought it was a good idea to have
allow, allow, deny, allow, deny
as the ordering of the mode bits.
I don't know about you, but I sort of think that mode bits should be allow, allow, allow.
Go ahead, Jeremy.
It's the only way to have exact POSIX semantics.
So Jeremy's...
It's the only way.
And the reason for that is,
in POSIX, you can have an owner
with less rights than a group or vice versa.
So you can have...
Because of that specific order,
you have to deny them before you allow them.
Yeah.
So Jeremy's comment was that for POSIX,
the only semantic way to do it is because the owner can have
less permission than the group is to have deny bits that are,
in a sense, out of order, that are not all at the beginning.
You would think, though,
that you could put all the denies at the beginning.
I've heard that claimed,
and there may be a good reason for that.
On the other hand, what it means in practice
is that every file that's had a mode change
by some evil NFS client,
every file that had a mode change by some
utility running on the server, is going to pop up with some warning in Windows Explorer
when you edit it.
If we use cacls or we use smbcacls, it's much more polite.
We don't get warnings with smbcacls.
It is sort of annoying.
It's not a problem for files that are created with SMB, but if we use this evil CHMOD
tool, you'll
end up with
warnings.
It's
not a big deal, but it may confuse some users.
The goal, I think, in a lot of this is not
to confuse users.
Ideas are welcome on that one.
We talked about XStat
integration. What about alternative data streams?
I think Jeremy's favorite topic in the world is spreading viruses through alternate data streams, right?
So we can emulate these.
We could do what Macs do. We could put them in xattrs.
But NFS doesn't have xattrs.
So maybe the best thing is just not to support them on NFS.
But there are a few apps that do require alternate data streams.
What about witness protocol integration and clustering?
Well, that's kind of a topic that's in progress.
If you listen to Michael's presentation and Volker's presentation,
you've got a feeling for a little bit of the progress
that's been made in witness.
But I think that in a mixed world,
CTDB and witness have to play better together.
And in addition to this,
there are some things we could do that are kind of
cool with PNFS,
allowing Witness protocol events
when the metadata
servers go down or are moved.
Okay, just DFS, global
namespace. I think it's largely independent
of any of this. DFS should be
okay in this kind of world.
Okay, so what about the Samba activity?
What are some things that would help?
Merging rich ACL. That would make our life
a little bit easier. Merging the
XStat, or StatX actually.
Updating VFS rich ACL.
It's a little bit out of date.
I think we've sent some patches.
Dros Adamson may have sent some patches
to Jeremy on that.
There's a couple little things where we have to do some more testing on that.
But it's actually not that bad.
And then, you know, xattr_tdb. I think we've talked on the mailing list
about various lock ordering issues with xattr_tdb that show up.
That may not be an issue in 4.5, though.
So what about clone and copy range? Right now I think David
Disseldorf did a really good job with clone and copy
but it was BTRFS specific, right? Now
you have NFS, you have other file systems, XFS and others
with patches for this kind of stuff, right? So we're going to need to figure out a way to extend
the copy offload.
You know, they're all kind of using the same
iOctal, so maybe
it's not going to be an issue, but
this was one that I was thinking about
in terms of performance features
that enabling it across a broader set of
file systems. And that may already be done in 4.5, but
I didn't notice it.
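As a rough sketch of what the user-space side of copy offload looks like, independent of which filesystem does the work underneath: copy_file_range() lets the kernel, or a server that supports server-side copy, move the bytes without pumping them through the client. The paths are made up, and older glibc versions need syscall(SYS_copy_file_range, ...) instead of the wrapper.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int in = open("/mnt/share/src.img", O_RDONLY);
        int out = open("/mnt/share/dst.img", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) { perror("open"); return 1; }

        ssize_t copied;
        do {
            /* NULL offsets: use and advance the normal file positions. */
            copied = copy_file_range(in, NULL, out, NULL, 1024 * 1024, 0);
        } while (copied > 0);

        if (copied < 0)
            perror("copy_file_range");
        close(in);
        close(out);
        return copied < 0 ? 1 : 0;
    }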
Okay, and then the XStat enablement would allow us to
significantly reduce the amount of traffic
to these little tiny database files and stored metadata that we can't put in the file system.
Alternatively, that NFS ioctl that Anne Marie proposed
should be very non-controversial. Those patches were very small, actually.
I think they just got kind of forgotten about
back in late May.
In my view, there's no harm in having two ways
of getting at the same data.
If you had an xattr or you had an ioctl,
well, that's fine.
So what are the key features that we think about
in SMB3 and performance?
Obviously, Tom thinks about RDMA every day.
The copy offload, I think, is really cool.
The compounding operations, large file I/O, file leases, directory leases,
and then various Linux specific protocol optimizations that you could do with NFS
and also some of the F allocate features.
Now are any of these affected in the gateway environment where you're running NFS and SMB?
And the answer is yes, obviously.
And what are the big ones?
Well, I think the big one is leasing. But just at a very high level, what are the things that you see
in terms of generally in this environment? You see more traffic, obviously, if you're going in
a gateway environment, because you're seeing at least twice as much traffic, actually more,
because you're seeing traffic to your Samba server and then traffic out the back to your cluster.
If you're writing directly to your cluster, you're obviously seeing less than half as
much traffic.
But most of that would go away for less contended files if we had lease support.
What's the problem with lease support?
Big problem with lease support is NFS only supports it on the wire.
It doesn't expose the API on the client.
Now what we did in CIFS for this was
if you asked for a lease and you already
had a lease, we'd give you a lease.
And then of course if we broke a lease, if
Oplock break came in, we'd break the lease.
NFS could do the same thing and we've discussed
that kind of patch before.
But I think they wanted something
a little different. So that'll
be an interesting thing to argue about because
realistically
90% of the time you have uncontested files. You already have a lease. Some app like Samba
asks for a lease, give it to it and then break the lease if needed. Also, copy offload should
be relatively non-controversial to deal with.
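To make the lease point concrete, here is a hedged sketch of the Linux lease API that an SMB server can use today against a local filesystem, and the kind of interface an NFS client would need to expose for the same trick to work over NFS end to end; the path is made up and error handling is minimal.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t lease_broken;

    static void on_lease_break(int sig) { (void)sig; lease_broken = 1; }

    int main(void)
    {
        int fd = open("/srv/share/file.doc", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        signal(SIGIO, on_lease_break);   /* lease breaks arrive as a signal */
        fcntl(fd, F_SETOWN, getpid());   /* deliver that signal to us */

        if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0) {   /* take a read lease */
            perror("F_SETLEASE");
            return 1;
        }

        pause();                         /* ...serve cached data meanwhile... */
        if (lease_broken)
            fcntl(fd, F_SETLEASE, F_UNLCK);   /* give the lease back */

        close(fd);
        return 0;
    }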
Now, here's the big thing.
I don't know of any way to deal with directory delegations.
Directory delegations, the Microsoft guys in one of the presentations I remember showed saving about a third on performance.
The metadata operations were so much faster.
So if you had things that had large directories,
you had things that involved things that Samba doesn't do particularly well.
Samba does not like million-file directories.
Directory leases would help a lot,
but we don't really have an equivalent of that in NFS.
Well, NFS will cache.
It'll cache for seconds,
but it really doesn't have a concept of directory leases,
and that makes it hard for metadata caching.
And that's something that we...
It does metadata caching.
No, I mean some directory delegation.
Or not one.
I don't see it.
It has it, but it doesn't implement it, right?
Yes, but it's limited.
Right, right, right.
Yes, thank you.
So a minor correction to what I was saying.
The NFS 4.1 spec has directory delegations.
The NFS client and server don't implement directory delegations.
If we implement directory delegations,
it helps in our world more than it might in other workloads
because Samba doesn't handle large directories and metadata queries well.
It doesn't scale as well.
Now, there are other things that are kind of interesting to think about in this world.
Notify. I think all of you who've used Samba notice that when you launch Samba, you don't
just get smbd, you get smbd and notifyd. You get a process that's sitting there looking
for notify events. That's kind of cool in a cluster. It'd be interesting to think if
that could help in the NFS world as well, how we might optimize that. In terms of CTDB traffic, apparently NotifyD has some effect
on the CTDB traffic. Figuring out in a clustered world where we're exporting
NFS v3 and Samba, there are optimizations maybe.
It's certainly worth looking at. I think one of the things that could be done
that would make a lot of sense is just simply look at wire traffic more.
Drill down one level on the wire traffic when you do a typical operation, launching vi, let's say,
or launching some application over NFS v3 and looking at it over PNFS; that's not our
problem, right? But for the Samba guys, now let's bring up Word, look at it, open a
file, and then see what NFS traffic is sent on the far end.
So seeing the PNFS traffic that comes out underneath Samba from the NFS client when you bring up Word and open a document.
This is useful stuff.
I've done some of this.
But certainly when we have support for leases and when we have some of the additional features that we talked about,
like the iOctal to query these additional attributes
that the spec supports and the client can support,
but we don't have an API to get to right now
until we have XStat or Anne Marie's API.
When those are in place, we can start talking
about drilling down one level
and just optimizing the traffic a little bit better.
Because right now, this is a workload we can't ignore.
We can't ignore the fact that SMB and NFS are
run together. And it's important that we better optimize this and of course also put pressure on
the kernel developers to fix the stupid APIs. I think all of us here have particular kernel
features we need and guys like Jeff Layton have been very good about driving some of these
gradually over time.
If you think about the kernel VFS API, five or six of those APIs came because of Samba or network file systems.
We do have some influence despite our cynicism about it being eight years sometimes.
Nine years.
Testing.
Here's one of the things I really like about this.
XFS test runs over SIFS mounts.
XFS test runs over NFS mounts.
Ronnie's plug-ins, libNFS, right,
runs underneath multiple...
So you can run infrastructure like yours
over multiple protocols.
So some of the scalable infrastructure for testing load
over NFS and SMB is actually possible,
even if you had an all Linux world.
But on top of this, you have a nice Microsoft functional test suite.
You have PyNFS.
You have SMB Torture.
The test suites are actually reasonably mature.
What we don't have is a couple of those little pieces I mentioned earlier
that would make life easier. Obviously, it works. Obviously we can do lots of wonderful things, but this could be
much better. And I think a lot of these same things we discussed will also help Ceph. We had
that discussion earlier in the talk where the activity Jeff was doing for StatX in Ceph will
also be helping other file systems.
So many of these are shared problems,
and I think very interesting problems.
And I think we need to think about, like, take Ronnie's example,
how to keep extending this infrastructure for better scalability.
I think we had a talk yesterday on NAS testing.
Some of this will apply to this kind of environment particularly well,
because right now most of our test cases are focused on one protocol and contended data.
It's a little tougher to test. Now some of the guys at EMC remember Pike.
They gave a presentation I think three years ago on Pike, and they I think used that infrastructure on both NFS and SMB,
if I remember their presentation correctly.
But there hasn't been a lot of infrastructure out there that's open source that does protocol-specific operations in one protocol
and then does a protocol-specific operation in another protocol
to try to contend or break.
What they tend to do is things like, I open a file here, I open a file here,
using the standard POSIX API,
which is harder sometimes to get at some of these difficult problems with compatibility.
So we could have some improvement there. And I think that
the good news is that it tests out pretty well. The bad news is the cross
protocol stuff could be expanded. Go ahead, Ira.
Wouldn't you have to define what success and failure was in order to test?
Yes.
So you'd have to define what success and failure was for the test, was his point.
And I think that's true.
Because I'll give you a great example.
Advisory locks.
Do you want an advisory lock to fail if you have a mandatory lock?
Do you want a mandatory lock to fail if you have an advisory lock?
If your Windows client has a mandatory lock,
should the POSIX client's advisory lock fail?
Jeremy says yes, and I agree with him.
I'm sure there are people who disagree,
but I agree with Jeremy on that, that that should fail.
I tend to disagree.
So,
now, if you run with Ceph,
and no, never mind.
But seriously, there are file systems with
different expectations about
loose coherence versus strong coherence.
There's file systems with different expectations about what should happen when you try to delete
an open file.
And one thing that I think was mentioned maybe in Jeremy's talk was we're going to a world
where cloud storage matters.
In the cloud world,
you have to have really loose semantics sometimes
to get anything to work.
There are problems when we're in that kind of world, right?
Because we can't,
the coherence is much looser
in that world.
We tend to write a whole bunch of stuff at once.
So, you know, mtime consistency,
maybe not, you know, I don't know.
These are interesting problems.
But yes, we have to define what success means
in some of these cases.
And in IETF RFCs or Microsoft documents, we tend to say,
this is what the spec says;
we don't tend to say, this is what we kind of recommend
you do if you're thinking of testing this.
Maybe we should.
And this has come up before.
What do we think best practice would be
if you tried this kind of operation?
We mentioned in the Microsoft specs,
we mentioned Windows 10 client returns 128 here.
But we don't tend to say best practice would be you should try to do this.
Sometimes, but a lot of times
we don't. And in the case of
contended locks cross protocol,
that was one that we've come up with in the past.
And certainly at Connectathon we've had
these arguments before.
Okay, we're reaching the end of the talk
and soon we're going to be following with
the panel discussion about POSIX extensions.
Do you guys know how much time before that one starts?
About ten minutes?
Yeah, so we've got a couple minutes for questions if you guys wanted to talk about it.
I think some of you guys may have even more experience with this world.
Go ahead.
So one of the problems we have seen when dealing with cross-protocol issues...
Right. That's not really a problem, because we're looking at a grace period, which is a response
to a similar function
like this.
And we,
they have to honor
each other's grace period
to do it right.
So that's something
that I don't see as possible.
Yeah, so that grace period
discussion is really interesting
for a couple reasons.
Imagine that you're trying
to do lock recovery
like you're discussing.
CTDB can trigger this; it has a little hook for Ganesha for grace recovery, right?
But that doesn't help you if you're going directly
with PNFS to the back-end file system.
And also, we don't support...
We support persistent handles, sort of,
but this could relate to persistent handles.
Grace lock recovery on a file
which has a persistent handle opened by Windows
might act differently than one that doesn't.
And I think Volker was about to respond
on maybe some more detail on what he thinks.
I mean, there are...
This is what I also see,
that share modes, variations, and so on,
we need to talk about how to do those things. Yep. So we need... Yeah, some kind of locking and so on. Because what we need is really, first, to define the semantics of the share modes.
NFS has their own set of share reservations, whatever they call them.
But for example, NFS v3 doesn't have file sharing modes.
We need to first sit down with the NFS guys and define what it actually means.
What should the error code be for whatever it is that's going on,
then we need this in user space.
And then we can go to the kernel guys
and say we need an API.
So remember the good old days
when we had weird return codes
that were added 20 years ago
like jukebox that made no sense?
Jukebox, yeah.
Yeah, I mean, exactly.
Add EVOLKER,
and that would allow you to communicate some of this information.
One of the things that's so strange is that the conversations between the Ganesha team and KNFSD and Samba...
There is no communication.
Yeah.
There's no conversations.
Yeah, now to be fair...
There's no public conversations.
There are conversations.
I know they exist because I've had them in the background with some
of the Ganesha authors at this point. There are a few people thinking about it, but the problem
comes in when there's a third access method nobody's discussing at this table, and that
access method is called local files, or kernel map, or views.
And once you bring that guy in the picture,
all this user-led integration stuff,
especially if you're talking about, in one case,
set where you're talking about kernel client,
it doesn't play.
For various file systems, you can or can't
try this in user space, or they may not
want to integrate with CTDB, you know, if you're talking about that point.
Yeah, but that shouldn't prevent the Ganesha people from talking to
someone. No, it doesn't, absolutely, that should be happening. There are some people having those
discussions. But the thing is, whenever the people have it in terms of a product they're working on
and the work they're doing, they always end up with this third thing coming in on the
side and hosing up everything.
Well, it's not just the different processes running on one machine.
If you've got a cluster environment, you also have to deal with a fourth.
Well, in fact, I gave a talk on this seven or eight years ago here at SDC about doing cross-node coordination
of grace periods over the multiple nodes in a cluster.
And like you said, the hardest thing
is to coordinate with stuff that's just running locally
on the actual file system,
and we came to the conclusion that we really couldn't
solve that problem.
Who's me?
Oh, I work at on the.
That was me.
Sorry.
You had your hand up? Yeah, sorry. I couldn't really tell where the question was coming from...
We put some code into the file system driver to assist with that. But even then, the point was, well, if you're accessing the same files
and taking the same locks through a remote client
and through a local program,
it's still possible to screw everything up.
I mean, the sad thing about this discussion, right,
is that getting these changes
across 20 different file systems isn't possible.
Getting them across things like Hadoop,
right, where they have no idea of any of this discussion.
They barely know what Samba or PNFS is sometimes, right?
Some of these developers are in a very different world.
They're programming in Java, not in C or Python.
So one of the things that's fascinating about this is,
can we drive tiny changes?
Because we know that big things like RichACL
don't go in. Twenty-line or fifty-line changes to the kernel do go in. So for adding fifty-line kinds of
changes: is there something small that can be done to delay contended I/O from
local access while grace is going on, or something silly like this, or prioritize
differently? You would think that small changes to the kernel might be helpful.
Coming back to that, we did it with a private API because we wanted to ship what we were doing
in a reasonable amount of time,
so we didn't want to take the time
to try and get a native API
in the kernel to change the behavior
of the process device.
Yeah, and then, you know,
when you think about lease behavior,
blocking leases, it was doable,
and it was driven by Samba, right?
It was driven by Samba needs.
But I think you're right that it could take years.
We still don't want to forget these thoughts because we're going to come back next year and have the same problem.
Yeah, and that's the thing I know we're guilty of over and over again is that, oh, well, we want this in the kernel, but we can't get it right now.
We'll do something else, and then we forget about it.
Yep.
We never go back and try and get it into the kernel later.
Well, this is why we need to solve the protocol
problem in user space first.
Yes.
And that's at least my view of things.
Yeah.
I thought, yes, we are lacking the cross-protocol tests.
We are lacking whatever.
This is HD or whatever sum of HDB.
But if we can solve the NFS and HTTP problem,
I think we are pretty far off the cliff.
Because we know what the respective
locking semantics and error codes might be.
Right, and you would at least know
what kind of operation it needs.
Yeah.
What's your concern? And I mean, the other alternative is... So, by the way, Volker's comment, let me echo that.
So, Volker said, we have lots of ways to solve this.
One is we could deal with the user space code,
get the user space code working,
and then drive that broader.
Another way is just to make SMB3 workable for all workloads.
Because, I mean, we really like this.
But that's a great segue to Jeremy getting on stage to talk to you about protocol extensions,
and not just the protocol extensions for POSIX, but hopefully for things that are pragmatic
in making all of our workloads better.
Okay, so back to Jeremy.
Yep, so we got five minutes for coffee breaks or something. Thank you, Steve.
Good job.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org. Here you can
ask questions and discuss this topic further with your peers in the developer community.
For additional information about the storage developer conference, visit storagedeveloper.org.