Storage Developer Conference - #39: SMB3.1.1 and Beyond in the Linux Kernel
Episode Date: April 5, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 39.
Today we hear from Stephen French, Principal Systems Engineer, Protocols,
Samba Team / Primary Data, as he presents SMB 3.1.1 and Beyond in the Linux Kernel,
providing optimal file access to Windows, Mac, Samba, and other file servers, from the 2016
Storage Developer Conference. I'm Steve French. I've given talks at SNIA events before, so many of you
know me, and obviously we have a lot of people from the Samba team and Microsoft here.
I'll be talking about the current status
of the Linux client and also
some of the things we're working on. This
week we have a plug fest going on downstairs
and next week at Microsoft we'll
be continuing to work on these.
So it tends to be super intense time for all
of the Samba developers and all of the kernel developers
because we're rapidly trying to remember the things
we were last working on three months ago
at the last one of the plugfests, or two months ago.
So it's a very interesting time.
But what I want to do is get you guys up to date a little bit
on where we are with the kernel client.
Obviously, the kernel client is largely independent of Samba,
but it does have pieces
in common in user space
and it has many of the same
challenges
Volker has done some great work
with the SMB client, especially with
its performance, and there's been some recent,
actually very exciting, work on
the Samba server we can talk about.
Okay. I work for Primary Data.
And at Primary Data, we have some very interesting NFS work going on with FlexFiles, parallelizing I/O, the follow-on to the file layout.
But I'm not going to be talking about that.
I'll be talking more about SMB stuff today. We had a talk this morning about some of the stuff our company's doing.
Okay.
I'm going to be talking about
some of the file system activity generally,
then the status of some key features,
a little bit of discussion about performance and bugs,
and then
Jeremy's going to be giving a talk later
about
POSIX extensions for
SMB3.
Jeremy and I and others had done extensions for POSIX for CIFS in the past, but we're long overdue, probably three years overdue,
for getting this into SMB3, partly because SMB3 does 80% of what we wanted
and partly because we've been busy doing other SMB3 features. Okay, so I work
on the Linux kernel. Linux kernel is an amazing project.
I think all of you are aware
of just how many people,
I mean, from Christoph Hellwig on out
to some really amazing developers,
are working on the Linux kernel.
The scale of it is staggering.
But to give you some ideas of it,
12 months ago,
we had this crazy Linux version 4.2.
Like clockwork,
every 10 or 11 weeks, we have our new release.
I'm not a real
fan of the name here, but that's what they call it.
Each release has its own
name.
So we are almost at 4.8.
Now,
we had a file system summit. Some of you guys were there.
File system summit has some
impressive developers.
These are not all the people working on Linux file systems, of course,
but it's a good chunk of them.
But it gives you an idea of the kind of people we have to deal with.
So in the Samba world, we have one set of people
we see at SambaXP.
In the Linux file system world, you have another set.
You have other sets you deal with at the Storage Developer Conference.
There is some overlap.
But a lot of the work,
the amazing work you see at Linux
is done by this group.
Now, what are the things
that I hear about in the background
that are driving some of this activity?
I think this conference
is the best time
to actually get a feel for that.
But some of the things you're saying
are the better support for NVMe.
I think Christoph gave a...
Actually, we had a couple talks
about this today, I think. Also, new cheaper RDMA adapters, these low-latency
adapters, and much higher network bandwidth are enabling us to do things with
NAS that wouldn't be possible before. At the file system summit we had
a lot of discussions about how we're going to get rich ACLs. There's some
violent disagreement about rich ACLs in the kernel. SMB has always supported a richer ACL model.
NFS 4.1 and NFS 4.0, copying to some extent SMB, had a similar ACL model,
but there's strong resistance against the idea of having deny ACEs.
So we have the more primitive POSIX ACLs in the Linux kernel,
and we keep trying year after year to push this.
It looks so close.
So close.
We also have violent agreement on XStat.
Everybody seems to be behind this idea of XStat,
as long as there's no bike shedding, as they put it.
As long as we don't keep adding.
But the reason this matters to us is that this would, for the first time,
allow Linux to not only get,
but set, birth time.
It would allow us to get at key information that today there's no POSIX API for.
Volker, Jeremy, all the Samba team guys daily
have to be setting stuff into attributes
and dealing with non-atomic calls
to query and set information
that really should be part of the metadata that comes back.
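To make the xstat discussion concrete: the proposal eventually landed in later kernels as the statx() system call (the get side only; there is still no standard way to set birth time). This is just a minimal sketch, assuming a kernel and glibc new enough to expose statx(), of what it returns that plain stat() cannot:

```c
/* Minimal sketch: reading birth (creation) time via statx(), the form
 * in which the xstat proposal eventually landed (kernel 4.11+ and
 * glibc 2.28+ are assumed).  This is the "get" side only; there is
 * still no standard call to set birth time. */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct statx stx;

    if (argc < 2)
        return 1;
    /* STATX_BTIME asks specifically for the birth-time field. */
    if (statx(AT_FDCWD, argv[1], 0, STATX_BASIC_STATS | STATX_BTIME, &stx) < 0) {
        perror("statx");
        return 1;
    }
    if (stx.stx_mask & STATX_BTIME)
        printf("birth time: %lld.%09u\n",
               (long long)stx.stx_btime.tv_sec, stx.stx_btime.tv_nsec);
    else
        printf("filesystem/server did not report birth time\n");
    return 0;
}
```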
Copy offload, this is fascinating.
To be able to get that Blu-ray disc copied in half a second without the client doing
any work.
We have extensions in Windows 2016 that I'll talk about later, but there's been a lot of
push here.
I think you've seen in NFS, you've seen in XFS, implementations of copy-offload.
We had it in BTRFS.
David did some great work on enabling the APIs for this.
This is so important to cut the network bandwidth
as you move files within a server driven by client activity.
And, of course, virtualization, Hyper-V,
as well as Linux virtualization and VMware
have driven a lot of the activity,
not just in NFS, but also in other protocols.
Being able to improve our support for sparse files and fallocate.
And, of course, a lot of workload shifting.
What was the last talk on?
You know, Swift.
A lot of cloud workloads.
We have new access patterns.
These general things are driving a lot
of the work we have to do. Now,
what happens in Linux? Linux
file system activity is huge.
Many of you guys, you know, 4,000
change sets in the
Linux file system area, that's huge.
These are also very heavily
checked, very heavily reviewed,
especially in some of the core areas like
EXT4 and
XFS, for example.
Now, the Linux kernel activity
is down, though. The file system activity is down about 10%, and probably because
of the gradual maturing of file systems. Still, this is a hugely active
area of development. Over 5% of the kernel
changes... obviously most kernel changes are driven by drivers, new hardware devices, weird embedded devices, which drive kernel activity, but over 5% of it is still file system code. The CIFS code represents
about 42,000 lines of that,
not counting the user space stuff and the Samba pieces
we pull in.
What's the most active thing?
I think many of you guys would not think that BTRFS
is the most active part of the kernel today.
People don't think about BTRFS
sometimes like that. They think of EXT4
perhaps. Notice the EXT4
activity continues to decrease actually.
XFS is very active.
BTRFS is very active.
The NFS client activity increased somewhat.
The NFS server activity decreased,
partly because a year ago there was a lot of scalability work on the NFS server.
Now Samba though is 1,800 change sets in the same time period.
It's much more active.
Well obviously because Samba has
a lot more components than just
file system related stuff.
But it's interesting that Samba is as big as the
top four of those combined in activity.
Okay, by release.
In 4.2, the SMB 3.1.1
support. Can we do the full 3.1.1
support? No. Can you authenticate as a guest? Sure.
Do we support
the more primitive form of the
SMB 3.1.1 secure negotiate
contexts? Yes. We added
support for duplicate extents, the reflink copy
offload that was added in
2016 on ReFS.
That's there in 4.2.
In 4.3, we
added the KRB 5 support for SMB3.
We had it in CIFS, but not in SMB3.
Obviously very important for more secure authentication.
A lot of people were curious in their apps
about how to query information
about whether the mount underneath them is actually,
does it support, you know, is it on an SSD?
Different hardware characteristics.
And, you know, we had procfs kind of pseudo files to get this,
but it was easier to get it from an ioctl.
So I added an ioctl to allow you to query detailed information
on the share properties, the device properties,
and the volume properties underneath you.
4.4: we realized
a lot of times, you're on a server,
and you want to copy a file from your client from share one to share two on the same server.
Well, allow copy offload across shares.
We also added a sort of primitive form of resilient and persistent handle support.
You could mount with the option.
We'd request the durable V2 context.
We'd request the persistent handle if you mounted with persistent handles or if your
share said continuous availability, but we didn't do all of the guarantees
properly. So, I mean, it works, but it's not perfect, and talking with David and
others, we're very close to getting reconnect to be much closer to what
you expect, and we'll talk about that later. So 4.5, you know, there's a lot of tunable stuff. O_DIRECT with cache=loose. So
if you want to write directly to your file but with loose caching semantics, you can
now do it. People in Linux are often running different network topologies where their network
may be very slow or very fast. The echo interval, we're pinging the server to see if it's awake
to decide if we need to reconnect to it.
It was made tunable.
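As a rough illustration of how those tunables are consumed (not from the talk, just a sketch; the server, share, mount point, and credentials are placeholders, and normally you would go through mount.cifs rather than calling mount(2) yourself):

```c
/* Sketch only: mounting a CIFS share with the tunables mentioned above.
 * cache=loose relaxes caching semantics (the O_DIRECT case), and
 * echo_interval (seconds) controls how often the client pings the
 * server to decide whether a reconnect is needed.  The server, share,
 * mount point, and credentials are placeholders. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    const char *opts = "username=testuser,password=secret,"
                       "cache=loose,echo_interval=30";

    if (mount("//server/share", "/mnt/test", "cifs", 0, opts) < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```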
And we began this encryption support.
Now, encryption support is not finished,
but share encryption is probably the biggest
feature that we get
asked about right now.
Encryption support is so helpful
when you're trying to mount to an Azure share
or when you're trying to mount to something at Microsoft.
When you're going across the Internet, this per share encryption is very important.
But I think it's useful for lots of use cases, not just that.
4.7 was dominated by Badlock fixes.
It was a big set of fixes.
Thank you, Samba team guys.
There were some side effects on our guest mount options, even in the kernel for this, and some NetApp-related fixes. Red Hat and SUSE did
some interesting fixes for 4.8. Some of the Red Hat contributed stuff was this prefix
path stuff, and we also had an interesting problem. You know, it's fascinating that you
can still have bugs like this that happen even years and years afterwards. We had a problem with mkdir.
And, you know, fell through a hole in a test.
We didn't notice it.
And there's some stuff that's in progress that's really kind of neat.
Okay, right now we're reviewing fixes for PrefixPath.
So, for example, if you get the slashes,
depending on how you... When you're mounting to a share,
you might not have access to the root directory in the share.
So you might be mounting one or two levels lower than that.
But we have to traverse the whole thing.
Well, we have to get the slashes right, depending on whether your server supports POSIX or not.
Well, that fix is almost done. We were working on that, you know, an hour ago.
We added some fixes recently
for improved POSIX compatibility.
Just crazy things like trailing spaces,
trailing periods, that sort of thing.
That just went in fairly recently.
One of the things that we added last night,
or I added last night,
was, you know, we've had this creation time for a long time.
XStat's not in.
How do we return creation time?
How do we return the file attributes?
Today, the only file attributes you can get back,
other than the boring ones,
are whether it's a file or a directory,
and whether it's compressed.
So I think David added the compressed file support
a while ago.
You can get that flag, but none of
the other flags, like
it's indexed.
It's sparse.
There's no way to get those back. So we added
an xattr for that so you
can have simple user space tools to display that.
These aren't POSIX things.
Obviously POSIX doesn't know anything about anything except...
the EXT2 ioctls allow you to get compressed and encrypted, and that's pretty much it.
The rest of them, things like it's indexed, or archive, system,
hidden, those DOS attributes, we have to get with a pseudo-xattr.
And the name is up for negotiation if you guys don't like cifs.dosattributes.
NTFS tends to put these up in user space,
so it's hard to look at the code because the user space tools are all a little more hidden,
but NTFS has a similar kind of approach to returning some of the NTFS metadata.
But the general problem we're trying to solve here is if your server stores it, we probably want to back it up.
If your server stores it, we probably want to see it sometimes from tools.
So this SMB3 metadata that you can't get out from POSIX matters, and we have to have some way of displaying it.
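A hedged sketch of how such a pseudo-xattr might be read from user space; the attribute name below follows the working name mentioned above and may well differ in whatever finally ships:

```c
/* Sketch: reading the SMB3/DOS attribute bits through a pseudo-xattr.
 * The attribute name below follows the working name discussed in the
 * talk and is a placeholder -- the name that finally ships may differ. */
#include <stdio.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
    char buf[256];
    ssize_t len;

    if (argc < 2)
        return 1;
    len = getxattr(argv[1], "user.cifs.dosattrib", buf, sizeof(buf) - 1);
    if (len < 0) {
        perror("getxattr");
        return 1;
    }
    buf[len] = '\0';
    /* Raw blob; a friendlier tool would decode the archive/hidden/system/
     * sparse/indexed bits into something human readable. */
    printf("%s: %s\n", argv[1], buf);
    return 0;
}
```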
We also need to improve reconnect support.
Networks are terrible.
They go down.
How many times does your network go down today?
And being able to provide best effort
reconnection is nice, but there are guarantees.
One of the things we're looking at
literally right now with David and others
is this: recently, two or three or four months ago,
we fixed it so
if the server goes down
we try to reconnect
proactively to that share.
But we also need to reopen all of the persistent handles immediately.
So we need to improve this HA support, and we're very close on that.
Encrypted share support we talked about.
Most of the core pieces are in there.
It's over half done.
We need to finish this last bit.
And I'm hoping, you know, with these two test events back-to-back,
this will give us a chance
to do that. ACLs
we have for CIFS. So finishing
the xattr
that allows you to view ACLs
with SMB3 would be useful.
I also want to have a way of
viewing auditing
information, quota information,
and claims-based ACLs
through pseudo-xattrs, at least for backup if nothing
else.
And that's something, you know, if you guys have ideas on the
naming for that. Bug status.
We've got about 50 bugs opened
in Samba Bugzilla.
The 50 bugs that are active in Samba
Bugzilla, there's a few that look serious.
Most are not. Most it's cleanup work we need
to do to get it down. There's a smaller number
opened in the kernel bugzilla.
Some of these are stale
and they need
some love to close them off.
Now,
what's the high level view?
SMB support is great for
large file I/O, but it's not fast
for metadata operations. If you're going to do
a directory lookup with 10,000 files, it's going to be
pretty slow, because we're going to do
too many open query closes,
open query closes,
getting metadata over and over again
for individual files.
We need to support directory leases.
That would help a lot.
And we also need to add,
and this is something we talked about
at the File System Summit,
we have this witness protocol prototype in
Samba. The client part of it is all we need. We need a witness protocol client that can
ioctl down into the CIFS kernel code and just wait on a witness protocol event so we know when
to fail over. If we want to move a share from here to here because you're doing some management activity
or some load balancing,
that doesn't require the whole witness protocol implementation
that's kind of stalled a little bit in Samba.
I think they may talk about it a little more tomorrow.
But the client piece, I'm hoping we can get checked in,
because only the client piece,
by ioctling down into our kernel code,
is needed to notify the kernel
when share movement is requested.
POSIX: we emulate it.
We don't have POSIX extensions.
We're going to talk about that Wednesday with Jeremy;
Jeremy's talk is about that.
Today, though, we're starting on POSIX protocol extensions.
Right now, we're doing emulation.
We'll talk about that a little bit at the end, though.
And dialects. We support CIFS, of course, but I really want people to be using SMB 2.1 or later.
There's no reason not to be using SMB 2.1 or later, except for the POSIX things we talked about, and we'll get to that later on.
So there's a set of capability bits that are negotiated. Which capability bits do we know about and we support? DFS, leasing, large MTU.
Sort of persistent handles.
Yes, we negotiate them.
Yes, we reconnect them.
But we don't reconnect them fast.
So we don't reconnect them immediately
when the server comes back up.
And we're fixing that right now.
And of course, the server support
is an interesting story for persistent handles.
This is very important to provide guarantees on data integrity.
CAP encryption we talked about earlier, it's in progress, not complete.
Directory leasing we really need to do.
That metadata performance is bad for the CIFS client compared,
sorry, cifs.ko compared with the Windows client.
That 30% performance boost you get from being able to cache metadata information
longer on a directory for which you have a lease, it's a big deal.
And multichannel, it's started, but it's not finished.
It is a priority.
One of the things that is holding us up is getting the Mellanox guys, getting the guys from these adapter vendors to help us with,
how do you simulate this on your little VM a little bit easier?
Multichannel and RDMA, being able to get the kernel,
you know, looking at the NFS code and how that works
and figuring out how to simulate some of these multi-adapter cases,
especially with RDMA, would help.
Because I think there is significant value in that.
Copy offload, I'm very excited about that.
It's a huge performance win.
You can see here an example of, you know, here we have a 30 meg file.
Now, the normal copy versus the server-side copy; with the server-side copy versus without.
You know, half a second down to one, 18 milliseconds, right?
Big performance win.
So this performance gain is really neat.
So seeing the server-side copy of 3 meg file,
seeing a server-side copy of 30 meg file,
huge performance win.
Notice that you're using --reflink. The default behavior of reflink now calls the
ioctl for duplicate extents, and I know it gets a little confusing because there's at least four ways to do copy offload with SMB.
Duplicate extents is the newer one for Windows 2016 ReFS.
Here's a wire trace of it, and you can get an example.
You can get a feel for what actually goes on.
I'm going to copy a 30-meg file, and you'd normally see at least 30 writes of one megabyte, right?
Do we see any writes?
You see an open, a get info, an ioctl, a set info, an ioctl, and a close.
This is really fast.
It's a heck of a lot better than sending 30 or 300 or 3,000 writes.
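For a feel of what cp --reflink is doing underneath, here is a minimal sketch using the generic clone ioctl (FICLONE, historically BTRFS_IOC_CLONE). On a CIFS mount, translating it to the duplicate-extents FSCTL depends on the client and server versions actually supporting that, which is an assumption here:

```c
/* Sketch: the single clone ioctl that cp --reflink issues instead of a
 * stream of writes.  FICLONE (historically BTRFS_IOC_CLONE) needs a
 * 4.5+ kernel; on a CIFS mount, translating it to the SMB3.1.1
 * duplicate-extents FSCTL depends on the client and server (ReFS on
 * Windows 2016) actually supporting it -- an assumption here. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FICLONE */

int main(int argc, char **argv)
{
    int src, dst;

    if (argc < 3)
        return 1;
    src = open(argv[1], O_RDONLY);
    dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }
    /* One request instead of 30 (or 3,000) one-megabyte writes. */
    if (ioctl(dst, FICLONE, src) < 0) {
        perror("FICLONE");      /* the caller decides whether to fall back */
        return 1;
    }
    close(src);
    close(dst);
    return 0;
}
```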
So what happens when the copy file operation fails on the server? This is interesting.
It's a great question.
Now, I haven't looked at this in the last...
These slides are actually from...
This particular slide is actually from a couple months ago.
CP, the cp command,
its error handling I didn't think was that good for the reflink case.
If you used the same ioctl
yourself, you can do cleanup,
but my impression was, with the cp
command, what you'd see is a 0 byte file
left around, if I remember correctly,
and you're welcome to try it,
just try some cp --reflink. I don't think it's unique to
CIFS, but if the reflink ioctl failed, I think you would leave debris around:
a file, an empty file, a 0 byte file.
I thought it fell back to just regular write I/O, so you can
provide --reflink=always and --reflink=...
Yeah, it was an option.
Yeah, yes, yes.
I thought the default was left to file.
That's a good point.
So he thought there were options on cp --reflink.
There probably are.
And also, to be fair, behind this system here,
this is Ubuntu that I was testing around with today.
This is not the latest Ubuntu.
It doesn't have the latest cp command.
So I have to be a little bit careful
because we in Samba and we in the kernel
tend to write to the latest version of the code,
but we don't tend to download the CP command
that's the latest version of CP.
So this is the problem we have sometimes.
We're really good about bringing the kernel in.
We're really good about bringing Samba in.
We're really good about running the latest Windows 2016,
we're not really good about updating
our two-year-old Ubuntu
to bring in the latest CP command by reinstalling.
I've been getting bugged every other day
about, you know, do you want to upgrade?
So,
but I think this is an important thing for you guys to try
and I think that it would be very useful
to check on that. What is the
error handling that goes on?
Does it match our expectations for error handling?
This is somewhat outside the scope of an ioctl, right?
The ioctl either works or fails.
So these are tool questions, really.
In case the ioctl fails with some error,
should the client fall back to the regular standard copy?
The tool may or may not.
And like David, I think that it's important to realize
that whether you want to retry or fall back
is part of the CP semantics, I believe.
The CP command itself has those options.
Let's say, for argument's sake, it completely fails.
You should allow an option in your tool
to do the boring way, the slow way.
But realize that could be a thousand times slower.
So there may be times if it fails,
you don't want to take up all that bandwidth.
So I understand that there should be options,
and I do understand that users of this tool
could be very narrow applications that aren't using CP.
They could be writing their own calls to the ioctls.
And remember, this is not particularly CIFS-specific,
or SMB3-specific,
because Btrfs and other file systems support this.
Now you'll also see NFS.
I think Christoph also did it for XFS, right?
Do you remember which file systems support it? XFS, NFS,
Btrfs,
and CIFS.
And he thought OCFS2 as well.
There may be others, but it's kind of neat.
Okay, so let's look at it versus copy chunk.
Now, copy chunk is what we're used to.
Copy chunk is more common, especially on NTFS. The problem is
we don't have a lot of tools for this. We really need to write some more CIFS- and Samba-like
tools for this, but I think you wrote Cloner, right? So Cloner allows you to do copy chunk.
Here we're doing a 500 meg file. And you can see how long it takes to do the copy, this 500 meg file.
Not bad.
And we tried it a few times.
And notice the performance can kind of vary dramatically
because here we actually had to do some write-through.
Well, the second time was slower because, you know,
you had an existing file you're writing over.
So it slowed down a lot.
But look at reflink comparatively.
Now, what are we talking about there?
6,000 times faster, is that right?
I mean, it's pretty cool.
I mean, there's a big difference between the performance we got for copy chunk and the performance we got for ref link.
It's kind of neat.
Now remember, this is ReFS,
so your times for NTFS,
those copy chunk times,
were kind of interesting to look at.
We don't have the luxury of duplicate extents for this,
but it's kind of interesting
because you don't get as much variation on NTFS
as we saw with ReFS
when the file existed or didn't exist.
So on ReFS, you got a big penalty if the file
already existed because you're having to clean up
what was there.
I find this very interesting. What's the best way to copy a file?
We have lots of choices.
Unfortunately, we only implement two of the four choices
in CIFS, but
it's a fun question.
What's the best way to copy
lots of data across the network?
These aren't bad ways to do it, but there are other options.
Now, Samba doesn't support all these options,
but across different servers we have these four different options to do this.
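One more generic option from user space, not specific to any of the four SMB mechanisms: the copy_file_range() syscall (merged around the 4.5 timeframe) lets the kernel pick an offloaded copy when the filesystem supports one; whether a given cifs.ko release wires it to copychunk is an assumption you would need to verify. A minimal sketch:

```c
/* Sketch: the generic copy_file_range() syscall (kernel 4.5+, glibc
 * wrapper 2.27+).  The filesystem may turn this into a server-side or
 * offloaded copy; whether a given cifs.ko release routes it to
 * copychunk is version-dependent, so treat the offload as an
 * assumption, not a guarantee. */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    int src, dst;
    struct stat st;
    ssize_t copied;

    if (argc < 3)
        return 1;
    src = open(argv[1], O_RDONLY);
    dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0 || fstat(src, &st) < 0) {
        perror("setup");
        return 1;
    }
    /* Loop in case the kernel copies less than requested per call. */
    for (off_t left = st.st_size; left > 0; left -= copied) {
        copied = copy_file_range(src, NULL, dst, NULL, left, 0);
        if (copied <= 0) {
            perror("copy_file_range");
            return 1;
        }
    }
    return 0;
}
```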
Okay, so what about HA?
We have the new mount options.
We can mount with resilient handles or mount with persistent handles.
We probably don't care about resilient.
If this mount option is specified,
we're always going to try it,
but I don't know what to do if it fails.
I don't think we want to give up on a file
if you refuse to give me a resilient handle.
The server, for some reason, like Azure,
wants to give you a persistent handle.
Great.
If for some reason it said no,
I'll keep going,
but I'm not going to,
it's not a persistent handle, but I'm going to try.
I'm going to try on opening to get a persistent handle
and send that context.
I do need to add the channel sequence number.
This isn't quite as important without multi-channel,
but it is something.
Also, there's a couple failover things that are easy.
I know guys have played around in Samba with exotic ways of doing DFS for load balancing.
DFS is kind of neat.
But one thing that will be relatively easy to do that I think we underestimate and should have finished a while ago
was if a server gives us multiple DFS referrals for the same path,
and one of the servers goes down, well, why not reconnect to the other?
And also this witness protocol we talked about earlier.
If I want to move from your server to your server,
because you're taking your server offline,
I need to be able to be notified about that event.
The client will have no problem reconnecting,
but we need notification.
And writing the witness protocol parsing,
the RPC parsing for that isn't worth it.
It's already there.
We have a prototype for it in Samba.
And so one of the things we talked about
at the file system summit was just
separating out that client daemon,
getting that checked into the Samba tree,
so part of the Samba client tools
can just iOctal down into the kernel
and wait on those events.
Okay, Steve.
Yes.
Question on the previous one.
Those mount options,
so those are opt-in by default?
You will now request that resilient or persistent?
We will always request persistent.
So the use... The mount option is my question.
Yeah, so use persistent.
The flag we keep on the share,
so every share we mount,
we will first mark it as use persistent
if this mount option were set.
We'll try.
And second, we'll set it
if the server set continuous availability.
So if the server set continuous availability, like Azure, right?
Right.
It's a server cap.
It'll tell you.
Right.
But I understand.
So what I'm saying is if the user said you must,
and by the way, if the server says it doesn't support it.
You'll request persistent handles unconditionally.
No, because if there's no cap
there's no point in requesting it.
But if it's not continuous availability
and the server
says it would support
so if you tell me that it's possible
for you to support persistent handles
but you don't mark the share continuous
availability,
then...
Alright, well it sounds like a really weird option. It seems like it's going to be very difficult to tell the administrator how to use it.
Well, the short answer is you don't have to, because if you mark the share as continuous availability, we're implicitly setting this.
So if it's continuous availability, we're doing it.
Well, it just seems like it's a lot of
non-options.
Realistically, what happens is
your server either wants it for the share or it doesn't.
So your server's going to say...
...offers it for the share.
Whether the handle's persistent or not
is the client's decision.
It's usually driven by the type of application.
Persistent handles are really useful
for virtual disks, for instance.
But I agree with you.
If you don't use the handle, you can reconnect and recover
without damaging the disk. Right. I absolutely agree
with you. So his point,
I think, is persistent
handles are very valuable for
specific applications.
So, for argument's sake, let's say that
you have an application, virtualization application, that wants persistent handles and the rest of your
traffic doesn't care. You have two choices. You can either force the whole
share to get it by setting continuous availability and then we'll try, the
client will always try, or you mount it twice. You're trying to get the efficiency of non-persistent handles for apps that don't
care.
But you force it on if you manage.
Yeah, so the lesson we want to learn here, I think, is the lesson of resilient handles.
Having the app request it on open, nobody would do it.
So rather than have the app ask for it on open, if the admin wants apps to get the higher availability,
they'll mount it twice, once with continuous availability,
once without, to different directories.
Right?
OK.
So that's an option.
Yeah?
What does the protocol say?
That is, if CA is not set, can you
deliver persistent handles from a share for which CA, continuous availability, is not
set?
I think so.
I mean, this is a good question.
This is a good question for Tom.
The protocol says what happens when it's operating
as designed.
I think requesting a persistent handle from a server that
didn't tell you it supported it, it's not illegal.
No, no, the other way around.
Simply ignore the context.
I think it's the other way around. So the question...
If CA is not set, but the server returns a persistent handle to you, is that what you're asking?
Well, it can't because the create context, you wouldn't look for the create context.
The server can't; he would have to insert a create context in the reply.
It isn't there in the request.
So basically, the interesting question is here.
From a protocol perspective,
when should the client request a persistent handle
and when shouldn't it?
When he cares.
Yeah, and since we don't have a per-app way in POSIX
for an app to say, give me persistent handles,
what we do is, okay, if your app wants it,
use this mount. If your app doesn't want it, use this mount.
Or, you have your server tell you, you must use it.
Go ahead.
The thing is, if the share is CA, you are always using persistent handles.
We'll always ask for them.
So if the share is CA...
You will always get them,
but...
Well, no,
you're going to open
another persistent handle
on a CA share.
It would be weird,
but...
Yeah, I mean,
we...
A down-level client
will do that,
for instance,
one that didn't support
persistent handles.
Right.
But the point is that
on a continuous availability
share,
if we've negotiated
a high enough dialect,
we will try on the Linux CIPS client
because the server told us
it's continuous available.
We'll try to get a persistent handle.
If we don't get it
and the server wouldn't give it to us,
okay, we tried.
The thing is that
this application doesn't need it.
There's no way to disable it.
And you said you would mount it twice.
Would you get persistent on both?
That's actually an interesting point.
So what happens if the client
absolutely didn't want persistent
handles, but the server had
continuous availability?
So Azure, for example,
you return persistent handles, right?
But only if you give us a durable V2 create context.
It's that persistent flag.
We would give you persistence if you didn't ask for it.
But you always say that you support persistent,
and you always say on continuous availability, this share.
So we had been using the continuous availability flag
as a way of saying, this share has stuff that matters.
Trust me, I'm the server.
This share matters. There's something important here.
Now the capability
I support...
It's not the share.
It's what the application wants.
The files on the share
are no different
whether they're persistent handled or not.
It's the recovery scenario that changes.
So it's also the application,
the thing that wants to open it.
Yes, but your Windows
client will open
persistent
if it says continuous availability.
That's the behavior of the Windows client.
That's one example.
And so will we in Linux.
Right?
And the Windows client will open:
if it says continuous availability,
they'll open persistent,
the Windows client will,
so will our client.
Okay.
But the point is,
I think it's an interesting point that you bring up.
If the application really didn't want
that penalty,
performance penalty,
maybe we should mount no-persistent
and say I don't care what the server supports. I mean,
as an alternative today, you could mount with a dialect that didn't support it, but it's just...
It can be available as the capability that comes at negotiate time, where the server knows that you're connecting to it. It can be available at the share, or globally.
So that's how it's going to be there.
And then when you come into the share,
that's when you can find out if the share is
capable of persistent.
So at that point, he has to choose a default behavior,
just like he says.
If an application is running against a CA share,
it will default to persistent.
If it's against a non-CA share, it'll default to non-persistent.
Now, the only case that's special, in my mind,
is the case where you have an app that wants to run against
a CA share, but doesn't want the overhead
because he's gonna do a lot of metadata operations
and he doesn't want to recharge the neighbors.
So I think the idea of a no-persistent mount
makes a lot more sense.
The idea of having a persistent mount
where you're going to attempt persistent handles
against a share that's explicitly told you
it doesn't do CA, which it sounds like
maybe you might wanna do the options,
that sounds a little bit weird to me
because the server's saying the share
doesn't have the capability,
and I'm not sure what it means to try to use it.
Like, I don't think the server capability was intended,
at least maybe it's not popular,
to override the share capability.
But yeah, it's interesting.
I mean, basically this...
Both the server hint, this share matters,
and the client hint, I don't care.
I don't care. I don't care.
I want better performance.
They're important. And I think that
the no option versus the yes
option, now realistically
it probably doesn't make
as much of a performance hit as we think,
because once again, from the Linux
perspective, a lot of our problem is
metadata performance and things like this.
It's not really
reconnect delays,
and I'm not
as worried about that. But I think that this is something
we should revisit and maybe something we can talk about next
week at the Microsoft event or the test event this week.
Okay.
So
fallocate.
This is actually kind of interesting. There's been a lot of
changes, a year or two years ago, with
different fallocate options.
We support the basic stuff,
punch hole, zero range, keep size,
and we discussed ways
a little over a year ago
on ReFS at least.
Now that we support block ref counting,
we could simulate
a collapse range and insert range.
These are kind of interesting options.
In Linux, you have the ability
to take a file,
remove a chunk out of the middle of the file, and push the
two pieces together.
It's kind of a neat
thing. Then insert range,
same idea. You've got a file.
You want to insert something in the middle of the file.
Keep the first half
and the second half, but move them and stick something in the middle.
With
block ref counting stuff, it actually
isn't that risky
to do something like that by just sort of iterating
through block ref counting, block copies
to do that.
Because you can use
the hint
that the file system supports block ref counting
to know whether it's safe to do this.
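A small sketch of the fallocate(2) modes being discussed: punch hole and zero range are the "basic stuff" mentioned above, while collapse range and insert range are the ones that would lean on block ref counting, so expect EOPNOTSUPP from many servers and client versions today:

```c
/* Sketch of the fallocate(2) modes discussed above.  Punch hole and
 * zero range with keep-size are the "basic stuff"; collapse range and
 * insert range are the ones that could lean on block ref counting,
 * so expect EOPNOTSUPP from many client versions and servers today. */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <linux/falloc.h>

int main(int argc, char **argv)
{
    int fd;
    const off_t off = 4 * 1024 * 1024;   /* 4MB into the file */
    const off_t len = 1024 * 1024;       /* a 1MB chunk */

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Deallocate the chunk without changing the file size. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len) < 0)
        perror("punch hole");

    /* Remove the chunk entirely and splice the two halves together. */
    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, off, len) < 0)
        perror("collapse range");        /* likely unsupported over SMB3 today */
    return 0;
}
```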
Okay, ACLs.
Go ahead. Does CIFS support SEEK_DATA, SEEK_HOLE?
SEEK_DATA, SEEK_HOLE.
So I think that would have to be mapped to
the query allocated ranges.
Yeah, that would be interesting. So the question is, do we support
SEEK_DATA, SEEK_HOLE? I haven't looked at this in over a year.
But that's, yeah, this is interesting.
It's the query allocated ranges
thing you're talking about, right?
So we'd have to
query allocated ranges.
So the SEEK_HOLE thing,
let's look at that.
That's useful to...
So SEEK_HOLE
probably would require
that we add support for querying
the allocated ranges in a file.
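For reference, this is what SEEK_DATA and SEEK_HOLE look like from user space; the point above is that on a CIFS mount they would have to be backed by a query-allocated-ranges request to the server, which the client may not do:

```c
/* Sketch: SEEK_DATA / SEEK_HOLE as seen from user space.  On a CIFS
 * mount this would have to be backed by querying the allocated ranges
 * of the file on the server; a filesystem without real support just
 * reports the whole file as a single data extent. */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd;
    off_t data, hole;

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    data = lseek(fd, 0, SEEK_DATA);       /* first byte of data */
    if (data < 0) {
        perror("SEEK_DATA");
        return 1;
    }
    hole = lseek(fd, data, SEEK_HOLE);    /* end of that data extent */
    printf("first data extent: %lld..%lld\n",
           (long long)data, (long long)hole);
    return 0;
}
```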
Okay, so ACLs.
We have CIFS support for ACLs.
They're really important.
There are cases where the mode bits can be emulated for this.
I was kind of intrigued that Apple, I think, if I remember correctly,
they query the default permissions back so they can return a simulated mode on the file
by figuring out whether the particular user they've got has access
by using the create context.
But ACLs are a nice way to simulate mode, but they're also useful in other ways.
Being able to return the ACLs and set the ACLs is useful for backup,
it's useful for other things,
and especially if rich ACL support makes it
into the kernel. The reason the SMB3 support
isn't there isn't an architectural
problem, it's just
the CIFS code's been around
a long time and the SMB3 code was somewhat
different in its implementation;
we need to finish it up.
Security features, we talked about secure negotiate
it's partially implemented
and the share encryption.
Earlier I had mentioned some very recent work.
Here was the prototype of it that we were testing yesterday.
We're returning the creation time on various files.
So here's a set of files.
You're going to see this one has the archive bit set.
This has archive index.
This has archive read only. But these files were created at slightly different times. You'll
notice that the timestamps differ while, you know, they're all
created the same day, the same hour. But wrapping these in tools so it's actually
human readable would be interesting. This is just the raw blob.
And then, you know, what about metadata?
Here you have the Windows view of a file.
Hey, it's content indexed, archive bit set, read only on this one, system hidden.
And here's the flag set here.
You can see the flag, each of the flags set using the user DOS attribute.
Once again, this could be used for backup. It could be used for, you know, if we allow a set.
I don't know if you guys have opinions
on whether we should allow a set of this,
but it would allow you to do...
There are certain things like sparse we set other ways,
and there are certain things like compressed we set other ways.
But other than sparse and compressed,
you know, there is value to these flags to know them.
We need to build some Python tools
or little Samba-like client tools around those.
Okay, we talked about XStat.
Generally, at the file system summit,
I got the impression that XStat integration
was generally agreed on,
as long as we didn't bike shed
and keep adding new features to it.
But returning at least the birth time and some of the attributes
in a more standardized format is important.
We also, even if we don't want to be encouraging people to use alternative data streams,
alternative data streams matter.
You can open alternative data streams today.
If you open file colon stream one, file colon stream one, you open it,
just a file name, right? The problem is that you don't know that file has stream one and stream two and stream one. File colon stream one, you open it, just a file name, right? The problem is that you don't know that file has
stream one and stream two and stream three. So how do you list that file has
stream one, stream two, and stream three? If you knew that it had those streams, you could open them
and you could query information on those, but you can't because you don't know a way to list the streams.
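Purely hypothetical sketch of how a stream-listing pseudo-xattr could be consumed once something like it exists; as noted just below, this interface was still being prototyped at the time, so the attribute name is invented for illustration and is not a shipping interface:

```c
/* Hypothetical sketch only: how a "list the streams" pseudo-xattr
 * could be consumed once something like it exists.  The attribute
 * name is invented for illustration and is not a shipping cifs.ko
 * interface. */
#include <stdio.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
    char buf[4096];
    ssize_t len;

    if (argc < 2)
        return 1;
    len = getxattr(argv[1], "user.cifs.streams", buf, sizeof(buf) - 1);
    if (len < 0) {
        perror("getxattr");     /* expected today: no such attribute */
        return 1;
    }
    buf[len] = '\0';
    printf("streams of %s: %s\n", argv[1], buf);
    return 0;
}
```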
And, you know, once again, an xattr to list the streams
was what we were working on literally
as I was walking up to this talk. We talked about the clustering and witness protocol
integration. We talked about DFS reconnect. Now, performance. There are some really cool
things that can be done for performance. There are some of the... There are probably more that...
Well, I'm sure half you guys have other ideas
on things that help with performance.
But from my view,
these are some of the more interesting
performance features for SMB to talk about.
Compounding.
Do we do compounding?
No.
Does the Mac do compounding?
Absolutely.
You know, the Mac does way too much compounding sometimes, right?
It takes advantage of that feature really well. So, you know, I have this...
Most operations
in SMB3
go through one routine called
open query close
Well, that could be a pretty obvious compounding
candidate, couldn't it?
You know, right now open query close calls open, query, and close,
but we could just compound that. It was done intentionally,
but we didn't finish the last bit.
That would probably help 10% or 20% at least on metadata performance.
Large file I/O, we do pretty well there.
Our performance can be better than NFS in some cases.
It's really neat.
Performance scales really well for large file I/O.
File leases, yep.
We support leases; can we upgrade them?
No.
Lease upgrades might be nice to reacquire leases from time to time after we lose them.
We don't do that.
Directory leases. Huge performance win.
Shouldn't be too bad. We don't support it.
Copy offload. Yes, we support the two probably most important mechanisms.
But it would be nice to support the T10 style as well.
Multichannel.
We do the basics.
We query.
We know the server has multichannel.
We know information about these adapters,
but we don't take advantage of multichannel.
And this is unfortunate
because Samba has recently added support for multichannel.
And, of course, RDMA,
one of the challenges of RDMA
is getting some good sample code in the kernel
and some sample drivers that we can emulate RDMA when we're running around in presentations without RDMA
in our VMs.
And then Linux-specific protocol optimizations.
I think we should be very aware that every operating system has particular quirks about
IO, and being able to optimize that matters.
And I think that we've spent a lot of time
at these conferences listening to Hyper-V.
Hyper-V has particular I/O requirements.
Probably Azure and other things do too.
We have to be very aware of, in Linux,
what we can do to reduce number of frames sent on the wire.
And hopefully as we go beyond the things
that Jeremy is talking about with
the Unix extensions, we can do that. Okay, when we talk about Linux extensions or Unix
extensions or POSIX extensions, we have to remember that Linux is not POSIX. Linux does
a lot of things that aren't POSIX. It has lots of extra system calls, things that are
Linux specific. But, they matter.
So what do we do today? The best effort compatibility.
We can handle all the reserved characters, mapping them
just like the Mac does. We can support these Minshall-French
symlinks. We can emulate symlinks multiple
ways. We recognize
the multiple ways. We can get most of the information
we need. What can't we do? We can't do advisory locks.
We can't do case sensitivity in opening paths.
And without some sort of CIFS ACL thing, we can't really emulate the mode bits very well.
Apple's approach to this is reasonable and might be worth looking at for returning some of the
mode bits. But the Unix extensions allow you to do this a little bit cleaner. This is actually
good enough to run the vast majority of apps and test cases, though.
So we could query maximal access request,
create context, as we talked about,
to get some of the mode bits.
The case-sensitive volume, unfortunately, isn't...
The servers lie about this.
The servers say they're case-sensitive volume, and they lie.
So there's not a whole lot we can do about that.
To use it as a clue,
hey, the server says it's case sensitive, let's not
worry about case sensitive
mapping. Well,
unfortunately we can't.
The NFS
symlink code in Microsoft
NTFS allows you to create a reparse
point that their NFS server uses
and we could use these same things.
They're nice that only the clients follow them.
They don't have any server security problems
as a server-followed junction might.
We recognize it, but we could clean that up a little bit.
Right now, the Minshall-French ones
have a sort of magic file size and signature in them
that allow you to recognize them as a symlink,
and Apple does that, and our Linux client does that.
Query FS info, the physical bytes per sector, we can map that to an obscure statfs field.
But it doesn't address byte range locking, it doesn't address the case sensitive path
names and of course we have this problem with streams.
There are some things like you download a file with Internet Explorer, it's going to
add a stream name for its zone.
If I have file colon stream,
how do I tell that from a valid POSIX path
that has a colon in the valid POSIX path?
So one of the problems we have is
you're either mapping with POSIX emulation
where colon is mapped to something else,
or not.
And if you're not, then that would be a stream name.
If you are, it's an illegal path to Windows.
It's a legal path to POSIX.
So how do we deal with that colon conflict between POSIX,
where it's a legal character,
and Windows, where it's a separator between that and the stream name?
Okay, second one.
Apple has this create context, AAPL.
And you can see some of the things here.
And it would improve Mac interoperability.
Here you see an example on the wire
of what the AAPL context looks like.
And nothing too magic about it.
It's good enough for a lot of their needs.
We could certainly do it.
We could make it a mount option.
Or we can go finish up the very relatively small changes that Jeremy and I have been talking about for these POSIX extensions.
We'll have a breakout section on it as well as Jeremy's talk.
Performance.
We really need to do some of this compounding, get the directory leases in place.
That should help a lot.
There are cases where we're going to be faster than NFS.
And there's cases where NFS, of course, is going to have fewer operations to get the same metadata back because their readdir
and their query info map a little
bit better to POSIX
but the bottom line is we want SMB
to be good enough
for a lot of different workloads
right now it's good enough for a reasonable set of
workloads
but to broaden that we need this
POSIX support and we also
need to improve the automated testing.
XFS test is wonderful,
but there are a subset of tests that fail.
And some of those we know why.
Some of them need POSIX permission mode bits.
Okay, we can deal with that.
The fallocate missing features
that Dave and I were discussing,
sure, there's a few of those.
There are not that many.
XFS test tests lots of file system specific features.
Things that other file systems don't support,
things like fallocate special features.
We're going to fail at least one test, test 131,
because we don't have advisory locking.
Okay, I can live with that.
And there's a few that relate to network file systems generally.
Getting timestamp coherence between mtime and atime, keeping some of these consistent in a multi-node client-server network,
is in some cases impossible.
You can do it in a local file system, but in a network file system,
it's not always possible.
And there are a couple of cases like this that on NFS or SMB aren't going to work.
But generally, XFS test works reasonably well on Linux,
and especially with the scratch mounts specified.
And I think that as we clean up some of these last few things
we've talked about here,
getting to the point where 90% of the tests
make sense on CIFS and pass
is really going to change a lot
because it means that when you do a fix
or when you're testing your server,
you don't have to think as hard.
When 60% pass,
there's too much thinking involved.
Is this a bug in my server? Is this a bug in the CIFS client? What's going on?
But it's important to get a little bit farther along in the compatibility for POSIX
if for nothing else when NetApp or when any other server vendor wants to do a quick test of their NFS client and CIFS client,
you want to use the same tools.
Being able to use the same tools against different protocol versions, SMB3, SMB3.1.1, NFS3, NFS4,
using the same set of XFS tools, XFS test tools, is helpful.
And I realize XFS test sounds misleading
because it has nothing to do with XFS anymore,
but it's the name for the file system test suite.
And its history obviously leads it to that name
because it came from the XFS development team initially.
But it's now become a kind of a catch-all for all the test tools.
But at these events, specifically at the Plugfest downstairs this week
and then next week at Microsoft,
we have the opportunity to make some progress here.
And this is kind of exciting.
Some of the SUSE developers and Citrix developers and others
have been here working through some of these problems.
I think SMB3 on Linux has a very bright future.
I'm very excited about some of the improvements that were made in quality of service
in our SDC talks earlier. I'm kind of excited about getting additional security features in
and just getting the performance a little bit better each year.
I think a lot of times on Linux we focus too much on local file systems and on maybe iSCSI
and NFS.
There is quite a broad role for SMB3, and not just on Macs, not just on Windows.
I think there's some workloads where it's really exciting on Linux as well.
But I also want to encourage you guys to send patches,
because we definitely need more help here.
And this is a fantastic, very, very interesting challenge.
OK, we have time probably for a few questions.
Anybody have questions?
Yes, go ahead.
You mentioned earlier some of the Linux APIs like RichACL and xstat, and I was
just kind of generically curious if you had much involvement in that or insight into how
those things are going, more than just that they aren't in yet, because I know some of
those, I remember the RichACL patches coming across mailing lists probably six or seven years ago, maybe?
Well, I remember them because on my team at IBM years ago, we were working on it.
This was like 2007 or 2008.
So the RichACL patches have gotten gradual improvements. Some of the NFS work and xattr work
that was done was actually in preparation for RichACL, and cleanup.
So some of the cleanup patches to make RichACL go in cleaner have gone in.
I noticed this yesterday in annoyance
because remember I was talking about the creation time and the DOS attributes?
I was like, where did my code go?
And I realized it went in because they took out
some dead code.
They took out some dead code because, in preparation
for RichACL, they cleaned up some of the xattr code.
So there has been a little bit of the patch
set go in, but realistically, I think
Jeremy Allison's had some offline discussions.
You could find Jeremy and ask him
the progress he made.
There's been strong objection to the
idea of having deny ACEs in Linux
because it does complicate the admin model.
The patch set is just a typical Linux example.
When you get a patch set that's this much complexity,
it takes 10 times more than something
with this much complexity.
On the other hand,
there are products shipping with RichACL support in them.
The tree is available,
and we have the Samba modules to take advantage of it and, you know, NFS patches to take advantage of it. Go ahead.
Yeah, that's actually a good question: is it more people than Christoph that
object to RichACL? Probably, but he's the one that everybody hears about.
I haven't run into anybody other than Christoph recently that objects to it.
He's here.
Yeah.
He's here.
I haven't asked him.
Talk to him.
Yeah.
And to be fair, rich ACL is more complicated.
XStat, on the other hand, shouldn't be a big objection.
And I think that that got more general agreement.
I think we're much closer on XStat.
On the other hand, the problem that was mentioned over and over again at the kernel,
because we had this breakout session, we were talking about this,
and Michael and I, no bike shedding.
The kernel has this really horrible habit of trying to make it 2% better
by adding 5% complexity, and then 100 of those changes,
it just gets a little bit, and then it just falls apart.
The whole xstat could return many things. If we leave it at what it is right now, there's not much objection.
But why it's not in, because, you know, look, we talked about it back in what, April or May,
really is a good question, and it's something I haven't actually followed the mailing list
discussions on to see where it last was. You might be able to go out on linux-fsdevel and see what the last reaction on xstat was, because that was much less
controversial than RichACL.
And I feel a little bit guilty on the RichACL side,
because on the RichACL side I could make life easier
by doing what the NFS guys did and kind of like making a sort of set of patches for CIFS that look like
the NFS ones, so when they turn on RichACL,
it works for one more file system.
Because one of the things that we have to be aware of
is that if it provides more and more value
to more and more file systems, that's good.
But not every file system is enabled for RichACL.
And CIFS supports it on the wire,
so why don't we just add a little bit of glue?
Why don't you just store it as an
extended attribute?
Well, the...
And only those file systems that need it; that's essentially Samba.
I mean, you do it in user space.
So Samba stores
other metadata as well in extended attributes.
And it's fine. There's no harm in storing it.
The problem you get is that
multi-protocol access, it's a little bit more confusing
and it is a little bit more expensive
to store it,
to have Samba store it in user space.
And it's not atomic.
Yes, and this is a big deal actually.
If you think about it,
it's actually not atomic either way.
The rich ACL interface wouldn't be atomic either.
Because one of the problems we have today...
Well, the access checks are atomic, yes.
But what I was...
Something painful is I create a file.
In SMB3, I'm presented with create contexts.
And when we have to go out to various xattrs to process create contexts, you could
have an open succeed but a create context relating to ACLs that we couldn't set.
Yeah, and that's the way of...
So we could...
Yes. So you stick it in a temp file and then do the rename game. Yes? The other thing that I know we've been looking at, what I've been working on, is actually
exporting a local file system over NFS, and with the current Linux NFS server without
the RichACL patches, it's completely faking the ACL support, so it's not really properly compliant.
Yeah, I think that what I think is important about your point about the faking ACL support,
when you have NFS ACLs on top of, you know, there are other platforms obviously that support NFS ACLs a little more natively.
When we, you know, rich ACL integration with NFS is, the patches are fine. But without that, there's a certain amount of complexity
that is hard to understand when you're mapping twice.
You're mapping on the client end and on the server end.
So yes, you lose information.
But more importantly, an administrator trying to be sane
about denying access or allowing access,
it's very hard to get
right when you're mapping twice. And, you know, I think in this room, eventually all of us could
figure that out, but we get it wrong a lot. And the chance for accidentally making data available,
you know, that's very painful. And this complexity, rich ACL may be more complex,
but it's not more complex than mapping twice.
See other questions?
Okay, well, hopefully you guys can help out with patches and testing.
But once again, this test event downstairs is a great opportunity
for us to talk more about optimizations and how to do this.
Let's get some more progress on this stuff.
Thank you guys.
Yeah.
So this is the
old Samba logo, right? Yeah, we need
a better Samba logo.
No, we have one.
We have a better Samba logo, but it's the wrong one
in the test. Yeah, I know. I put the wrong
one because this is stolen from a year-old presentation.
Mea culpa. My fault.
Okay. Thank you, guys.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further
with your peers in the developer community.
For additional information about the Storage Developer Conference,
visit storagedeveloper.org.