Storage Developer Conference - #125: Opening up Linux to the wider world
Episode Date: May 5, 2020...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual
Storage Developer Conference.
The link to the slides is available in the show notes
at snia.org/podcasts.
You are listening to SDC Podcast
Episode 125.
Welcome to another year of the POSIX extensions.
So this is an exciting time.
We get feedback and find out stupid things we did
and find out things that can be done better.
We had a chance to try these out at the...
But first of all, this is Steve French of the Microsoft Azure file-serving team,
for those of you who don't know us.
And I'm Jeremy Allison from Google.
And there we go.
That's what I was waiting for.
And Google don't even know or care that I'm here.
And I want to keep it that way.
So everything I say here is not the opinions of Google,
blah, blah, blah.
And for Steve, I think Microsoft know you're here.
Microsoft knows I'm here, but this isn't Azure code.
This is the kernel code.
There's lots of wonderful things we'll hear about Microsoft tomorrow
because those guys are going to say everything, right?
Okay.
Okay, let's go.
So let's talk about a couple things.
I always like starting with the slide,
and I hope you guys can forgive us for repeating it,
but every once in a while somebody brings up a four-letter word, POSIX.
And some of you, old enough,
have even sat on POSIX committees,
heard screaming matches between different companies
about POSIX standards,
heard, like Jeremy, stories you don't want to know
about byte range locking.
So one of the things I wanted to go,
let's go to the next slide.
You know, it's like Windows.
It doesn't matter what the spec says.
It matters what the app does.
And unfortunately, apps on Linux are written for Linux.
Now, you see this little tiny thing called the GNU C library and those things.
It calls into POSIX.
Let's go to the next slide, and I'll show you how bad it is.
So there are about 100 POSIX API calls. A few minutes ago, I did a git grep SYSCALL_DEFINE. Just in the file
system directory, not any other syscalls, just file systems, there are 222. There are 100 defined in POSIX. Linux has 222 in the file system alone.
The man page only pulls up 400 syscalls.
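The count mentioned here can be roughly reproduced with a few lines of Python. This is a minimal sketch, assuming a kernel source checkout is available; the function name and the choice to look only at .c files are assumptions of this example, and the number you get will vary by kernel version.

```python
import os


def count_syscall_defines(root):
    """Count lines mentioning SYSCALL_DEFINE in .c files under root.

    This mirrors a plain substring grep, so COMPAT_SYSCALL_DEFINE lines
    are counted as well.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".c"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as source:
                total += sum(1 for line in source if "SYSCALL_DEFINE" in line)
    return total
```

Calling count_syscall_defines on the fs directory of a kernel tree (for example a hypothetical checkout at linux/fs) gives a figure comparable to the one quoted in the talk.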
So you can kind of see our problem here, right,
is that we think in terms of this narrow thing,
but it's actually a bigger problem, right?
Linux is evolving.
And the thing to remember, of course,
is that every single one of those syscalls
is there for a reason,
and applications that use it expect to keep working across kernel upgrades, across distributed file systems.
And of course, if you run an application locally and then you run it against a remote
file system like SMB3 and it fails, it's always the remote file system's fault.
Yep.
By definition.
Yeah, and as you guys probably noticed,
there are some well-known Linux file system developers here.
Some of them are two rooms over.
Notice they're not here.
They're not here.
So it's our fault, even if they're not here, right?
So we have to adapt to them.
And this year alone, there have been new syscalls added. In the news
two weeks ago was a follow-on discussion about openat2, one of Jeremy's favorite topics.
So these continue to be discussed. It's not ending here.
So I want to give an example of fallocate, because some of you have actually dealt with sparse files
or allocated with this API call.
I actually had somebody last week say,
well, I don't use ftruncate to set the file size.
I use fallocate to set the file size.
Oh, well, I mean, we can't avoid these things.
This is a real Linux app developer
who just expects it to work. So
there are seven flags in fallocate alone. Rename has three flags. Imagine the combination
of these things. Protocol stuff isn't easy in our case because we're not dealing with
POSIX. This is beyond POSIX. And these are just two example syscalls. Okay? So, let's get apps to work.
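To make the ftruncate versus fallocate point concrete, here is a minimal Python sketch of the two ways an application might set a file size. It uses os.ftruncate and os.posix_fallocate; note that the latter wraps posix_fallocate(3), not the Linux fallocate(2) call with its flags, so it only illustrates the simplest case. The helper names are made up for this example.

```python
import os
import tempfile


def by_ftruncate(fd, nbytes):
    # Sets the logical size; the file system may leave the file sparse.
    os.ftruncate(fd, nbytes)


def by_fallocate(fd, nbytes):
    # Asks the file system to reserve space for the range [0, nbytes).
    os.posix_fallocate(fd, 0, nbytes)


def size_after(set_size, nbytes):
    """Apply set_size to a fresh temp file; return (st_size, st_blocks)."""
    fd, path = tempfile.mkstemp()
    try:
        set_size(fd, nbytes)
        info = os.fstat(fd)
        return info.st_size, info.st_blocks
    finally:
        os.close(fd)
        os.unlink(path)
```

Both calls end with the same st_size, which is why an app developer can treat them as interchangeable. The allocated block count may differ depending on the file system, and a remote file system has to cope with either call.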
Case sensitivity,
you've got to have it if you want to build Linux.
The first test people ask me about is,
can you build the kernel on a CIFS mount?
They want it, whether it's on-premises,
in the cloud, or sitting in Starbucks.
Right?
Sitting in your hotel, running your presentation.
So improve the common situations where they access Linux.
And, by the way, too bad Ned Pyle isn't here, right?
Yeah.
Deprecate CIFS.
This is SMB 3.1.1 only, the most secure only.
We don't want the old stuff.
Because security matters more than everything.
I don't care if you're Azure or Google or whatever else. We can't afford to run less secure stuff.
Okay. So I want to talk to this slide rather than him because we have to brag about him
and Volker and Metze and Ralph. I hope you guys got a chance to see the presentation
this morning about all the progress these last few years in Samba.
Why do we care about Samba?
Some of you remember 1992.
Some of you remember Tridge.
There's a lot of reasons we care about Samba.
It's three and a half million lines of code of stuff that you don't want to rewrite or replace.
It's great stuff. Or at least you want to be able to split it into pieces
and use the pieces you need,
which is what we're trying to get to.
It's like a toolbox that lets you work on the fanciest car
with these really fun tools.
You've got all the tools there.
Okay?
Okay.
Okay.
Anybody recognize any of these people?
So Samba.io Lab last week,
this is a great development community to work with.
I think you'll recognize some of them here this week.
But this is a great group.
And as you may not realize, this is actually up at the Microsoft campus.
Just to show you, and this is where I get to talk about Steve, how Microsoft is very much a changed company and is basically just another member of the open source community, developing Linux, developing open source free software on multiple different platforms.
So over the last year, we've gotten to get together to test four times as a group.
So this has been kind of good.
We've gotten a chance to get some feedback, try some things.
We're able to get some other implementers.
You know, Linux makes a lot of progress.
Linus enjoys crazy names.
A year ago, we had the Merciless Moray.
A week ago, we have the Bobtail Squid.
So he enjoys these crazy names.
Linux continues to evolve.
It's a lot of fun to track.
If you want to experiment with the things
we're talking about in the talk,
you can experiment today.
At the test event in Israel,
in Tel Aviv at SDC back then,
there was a vendor who implemented the server side.
You just have to try.
Here's the patch if you want to backport it.
But for anybody with a 5.1 kernel or later, it just works.
So what terrifies me about Steve is that he keeps pushing this stuff out to the public, into general kernels that people are actually building and using, way before the server side is considered ready to ship.
So we have experimental trees for the server side of things,
but Steve is busily pushing the client code out there to everybody.
There is some precedent in NFS and others for this.
But when you mount, it's not by default.
When you mount, it prints an ugly message in your log
saying this is experimental.
And I didn't say it because Jeremy told me so.
But it's true.
But it's true.
Okay, so let's go.
Okay, so one of the things that Ronnie and Aurelien are probably trying to hide here,
but one of the things that our distro partners and others have pushed
is this stability and regression automation.
And it's a lot of fun, right?
These xfstests, I think we run more than NFS does now, or at least that's what they claim. There's been a lot of good fixes, but this buildbot that
Ronnie, Aurelien, Paulo and some of those guys put together has been fantastic because it allows us
now to start testing against the POSIX tree. So as we get back to work on some of this stuff,
it makes it much safer for us, both client and server changes, because we have the automation for it.
One of the test targets is Azure.
One's Windows.
We have the generic regression target.
If you have your favorite server you want to add, that's great.
But up in Azure, we can spin up these various VMs, including with his tree,
so we can do automated testing against his tree.
Okay, so what could you try today?
This is experimental.
It is not enabled by default unless you type the word posix.
You don't have to actually specify
vers=3.1.1 because vers=3.1.1 is the default now,
but for older distros, you would have to type that in.
You need the mount option posix, that four-letter word.
You need to specify that.
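As an illustration of the mount options just described, the following Python sketch assembles a mount command line. The server, share, mount point and user names are hypothetical placeholders, and the helper itself is an assumption of this example; only the posix and vers=3.1.1 options come from the talk.

```python
def build_mount_command(server, share, mountpoint, username, explicit_vers=False):
    """Return an argv list for mounting an SMB3 share with POSIX extensions."""
    options = ["posix", f"username={username}"]
    if explicit_vers:
        # Only needed on older distros where 3.1.1 is not the default.
        options.insert(0, "vers=3.1.1")
    return [
        "mount", "-t", "cifs",
        f"//{server}/{share}", mountpoint,
        "-o", ",".join(options),
    ]


if __name__ == "__main__":
    # Hypothetical names, printed rather than executed.
    print(" ".join(build_mount_command("fileserver", "build", "/mnt/build", "alice")))
```

The command is only printed here; running it needs root and a server that actually implements the experimental extensions.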
There are very limited protocol features in it.
Now, on the server code,
we'll give you some instructions about that,
but here is a pointer to his tree,
which may change,
but here is his experimental Samba tree
that has the prototype server code.
And as I said, there are some vendors
who have tested this,
and we saw this at the Tel Aviv event.
Now, that tree hasn't been getting
very much love of late, mostly
because of the VFS changes that are going
on elsewhere. But what I'm
hoping is that once all the VFS
turmoil is
finished, hopefully by the end
of the year, then we can start
taking the changes for that and moving them
on top of the modified
Samba and basically
get it into an experimental
version of mainline Samba. So when you pull Samba, you'll get this code by default, just
not turned on. You know, we'll start testing it, but it's just not turned on in a standard production
build of Samba.
Good, cool.
Okay. So why isn't this shipped already?
Well, the problem is we thought we were getting close,
and then all of a sudden the Windows subsystem for Linux
essentially changed the goalposts a little bit.
And this is kind of important
because it turns out that
the way we're implementing in the POSIX extensions,
the way we're implementing the file system objects
that Windows clients don't want to deal with,
that POSIX clients have to deal with,
and these are things like symlinks, FIFOs,
Unix domain sockets, character and block devices,
et cetera. The way we're doing those is in reparse points. We're exposing them as Windows-style
reparse points. And it turns out that everyone has a different idea of what those should look like.
Now, our goal is to basically be as close to what Windows does as possible. But now the Windows
Subsystem for Linux has defined a new method of exporting these
POSIX object types, and they've chosen different
reparse point tags for
exposing these. So we've basically got to sit down and resolve this.
Right now, what we're doing is using the initial NFS reparse point tag,
which is what Windows used to store reparse point data.
Now it looks like we may need to change this,
and this will mean an on-the-wire protocol change.
I mean, actually, that's not true.
It means a change in the interpretation of existing
fields that we have within our protocol changes. So our protocol changes actually already have
the reparse point tag returned as part of our new POSIX info, but the actual meaning of
what gets put in that tag now may need to change.
We'll see the good example two slides from now.
We can show you exactly why this makes even more sense.
And then the other thing is, basically,
the original Samba VFS was built around
the old open group path name-based operations,
and the world is moving to handle-based operations.
And so we really have to build this
on top of the at series of calls,
like openat, mkdirat, rmdir-at, you name it.
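For readers who have not used the at-style calls, here is a small Python sketch of handle-relative operations. Python exposes them through the dir_fd parameter of the os functions; on Linux these correspond to openat, mkdirat and unlinkat (removing a directory is unlinkat with AT_REMOVEDIR). The function and file names are invented for this example and have nothing to do with Samba's actual VFS code.

```python
import os


def demo_at_calls(base):
    """Create and remove entries relative to a directory handle, not a path."""
    dirfd = os.open(base, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.mkdir("sub", dir_fd=dirfd)                        # mkdirat
        fd = os.open("sub/file.txt", os.O_CREAT | os.O_WRONLY,
                     0o644, dir_fd=dirfd)                    # openat
        os.write(fd, b"hello")
        os.close(fd)
        created = sorted(os.listdir(os.path.join(base, "sub")))
        os.unlink("sub/file.txt", dir_fd=dirfd)              # unlinkat
        os.rmdir("sub", dir_fd=dirfd)                        # unlinkat + AT_REMOVEDIR
        return created, os.listdir(base)
    finally:
        os.close(dirfd)
```

Because every operation is anchored to an open directory handle, a concurrent rename of the parent path does not redirect the operation somewhere else, which is the property a file server cares about.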
And if you follow Samba development,
you'll be finding a large number of VFS changes.
Basically, we decided that now was the best time we had.
We just shipped 4.11, so we've got six months.
Let's get all these VFS changes in and finished
before we start building POSIX extensions on top.
Otherwise, we'll end up with trying to retrofit
the VFS changes and the POSIX extensions at the same time.
So unfortunately, the answer to
why isn't this damn thing shipped already
is essentially that this is the thing
that has pushed us back a little bit.
So one of the things that...
Actually, I think that's...
Yes, there we go.
That was the same slide you duplicated by accident.
Yeah, so one of the things that's important
is that we don't want to add a big performance penalty
to get POSIX information.
And that's one of the things the WSL guys mentioned.
They're like.
Yeah, they have exactly the same problem.
Yeah, because if you do a query directory,
you don't want to have to keep querying
every single file in there, right?
So yes you will if they ask for stat information,
but you'd like to be able to find the file type
for these special files.
So you can see an example on the next slide.
So instead of having one tag,
you know, an NFS tag,
where if the tag ever shows up you have to go do a second query on it,
you have the specific tag returned,
and we verified that Windows does this already.
So, and the nice thing about this
is it'll work with Windows.
It doesn't require an extension to the protocol for this.
It requires that when Samba exposes a local FIFO,
that it exposes them the way Windows exposes it.
The beauty of the way Windows exposes it now
is that it doesn't require an extra round trip for me
when you do a query to figure out,
oh, that's a char device.
No, that's a FIFO.
No, that's a symlink.
So it saves round trips.
And for directories with large numbers of files or objects in them,
those round trips are a killer
for performance.
You're having to do an extra round trip to the server
for every single type that you get back
where you know it's a weird object,
but you don't know what it is.
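The round-trip argument can be put into numbers with a toy model. This Python sketch is purely illustrative: it assumes the whole directory listing fits in one query directory response and that, with a single generic tag, every special file costs one follow-up query. The entry kinds and function name are made up for the example.

```python
SPECIAL_KINDS = {"symlink", "fifo", "socket", "char", "block"}


def round_trips(entries, typed_tags):
    """Count server round trips needed to learn every entry's file type.

    entries: list of (name, kind) tuples.
    typed_tags: True if the listing returns a tag that identifies the
    exact type, False if it only says "this is some special object".
    """
    trips = 1  # the query directory request itself
    if not typed_tags:
        # One extra query per special object to find out what it is.
        trips += sum(1 for _name, kind in entries if kind in SPECIAL_KINDS)
    return trips
```

For a directory with three special files, the generic-tag case costs four round trips and the typed-tag case costs one. On a directory with thousands of device nodes or symlinks the difference grows linearly.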
Now, since the last SDC,
we talked about this at SambaXP,
since the last, a year ago,
there was one other change that we made,
and we'll talk about that in a minute.
So if you tried to use it today,
you know, just mounting from your laptop
or whatever you had in current,
you know, 5.1 or later,
easy.
vers=3.1.1 is already there.
You don't even have to type that in.
posix.
We prefer, because we don't want Jeremy to lose any more sleep,
that you use client symlinks, mfsymlinks like Apple,
Apple-style symlinks,
because these are only evaluated in the client.
They make life a little bit easier,
but obviously he can expose server ones too.
And on the server side,
if you wanted to experiment with Samba,
here's mangled names = no, create mask.
But one thing that is a little bit unusual
just for anybody experimenting with this tree
is the bottom line.
That's just a bug.
Yeah.
It's a little bit odd, so I just mention that.
That's a very odd thing.
It took us a little while to figure out why that happened,
and it's a long story.
But if you're experimenting with Jeremy's tree,
those are the four things we'd recommend.
Now, remember we talked about the reparse tags.
So in practice, what it changes,
so here's Jeremy's code.
So this is, you know, during the last talk,
during Namjae's talk, I did this Wireshark capture.
So this is, you know, not fake.
Notice the find response, and notice the tag, okay?
So we're enumerating these files, right, whatever,
and it's not completely parsed all the code on the client,
but notice this tag, that tag changes.
So the tag comes back with a strange name,
this is just the way Windows defined the protocol,
but the tag comes back in EA size.
Anyway, not a big deal.
Just this is the only thing that changes.
It's a very small change.
Okay, go ahead.
Yeah, and you can see the query directory example.
Yeah, so right now, Wireshark doesn't decode these,
but this is basically the POSIX query directory
returning the new info level
that we need for POSIX information.
Yeah, notice the info level?
Yeah.
Okay.
It's the same information that would be returned
in the POSIX create context.
Yeah, info level 100.
Yeah.
Yeah.
Yeah.
Okay, next.
Yep, cool.
Okay, so this is an older slide.
You guys may have seen some of this before.
But, you know,
all those weird path names that you guys love to do.
You know, building the kernel is fun
because you end up with these path conflicts
that wouldn't work with Windows, right?
But there they are.
I mean, they work.
This is, you know, this code's working.
All the weird path names you want to try
with question marks or exclamation points.
Next one.
And, you know, here's what it looks like.
You know, if you want to make sure you mount it with POSIX,
you can see POSIX and POSIX paths.
You can see those in what we display in the mount options.
And you should be able to see some of it
in the debug info as well.
Yep, you can see the POSIX is enabled in the build.
Nothing fancy here,
but just kind of showing you this isn't a trick.
You know, you want to try case sensitivity.
UPPER and upper are two separate files.
Okay?
Case sensitivity works.
You can create directories with different mode bits.
Yay.
You know?
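A quick way to run the same two checks yourself, case sensitivity and mode bits, is sketched below in Python. It is meant to be pointed at a directory on the mount under test; the file and directory names are arbitrary choices of this example, and on a case-insensitive file system the returned name list will simply be shorter.

```python
import os
import stat


def case_and_mode_check(base):
    """Create UPPER/upper and two directories with different modes in base."""
    for name in ("UPPER", "upper"):
        with open(os.path.join(base, name), "w") as handle:
            handle.write(name)
    old_umask = os.umask(0)  # so the requested modes are applied exactly
    try:
        os.mkdir(os.path.join(base, "private"), 0o700)
        os.mkdir(os.path.join(base, "shared"), 0o755)
    finally:
        os.umask(old_umask)
    names = sorted(os.listdir(base))
    modes = {
        name: stat.S_IMODE(os.stat(os.path.join(base, name)).st_mode)
        for name in ("private", "shared")
    }
    return names, modes
```

On a local Linux file system both files exist side by side and the modes come back as requested; the point of the slide is that the same holds over an SMB3 mount with the POSIX extensions.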
Here the POSIX contexts work.
This is Jeremy's code to and from my code, right?
Keep going.
Rename.
This is always fun.
It's a little hard to read this, but what you have here is a, you know,
renaming open files, that kind of thing.
That would fail with Windows; with POSIX,
it works to his code.
It's a little hard to read,
but you'll be able to see it in the handouts.
And this is, I mean, this is basically,
this isn't the sort of silly rename stuff you get with NFS.
This is actually mapping directly onto modifications
onto the server file system.
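The rename-while-open behaviour being demonstrated is easy to see locally. The following Python sketch, with made-up file names, renames a file that is still open for writing and keeps writing through the original handle; this is the POSIX semantic that an application expects to carry over to a remote mount.

```python
import os


def rename_while_open(base):
    """Rename a file that is still open, then keep writing to it."""
    old_path = os.path.join(base, "old.txt")
    new_path = os.path.join(base, "new.txt")
    with open(old_path, "w") as handle:
        handle.write("before ")
        handle.flush()
        os.rename(old_path, new_path)  # the handle stays valid
        handle.write("after")
    with open(new_path) as handle:
        content = handle.read()
    return os.path.exists(old_path), content
```

On a POSIX file system the old name is gone and the new name holds everything written before and after the rename.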
It turns out that stat of a file system
returns more information,
blocks, inode count,
you know, fundamental.
Here's the local and remote view of the same file system.
So stat -f.
Notice it's...
Yeah, the block size is wrong in there.
But the block size is the only thing reported differently.
You've got the inode count and all that.
So anyway, it's interesting getting that right,
so keep going.
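The local versus remote comparison on the slide can be scripted. This Python sketch reads the same kind of information that stat -f reports, via os.statvfs, and lists which fields differ between two views. The field selection and function names are choices made for this example.

```python
import os


def fs_summary(path):
    """Return the file system fields that stat -f style tools report."""
    info = os.statvfs(path)
    return {
        "block_size": info.f_bsize,
        "fragment_size": info.f_frsize,
        "blocks": info.f_blocks,
        "blocks_free": info.f_bfree,
        "inodes": info.f_files,
        "inodes_free": info.f_ffree,
        "name_max": info.f_namemax,
    }


def differing_fields(local, remote):
    """Names of the fields whose values differ between the two summaries."""
    return sorted(key for key in local if local[key] != remote[key])
```

Running fs_summary on the exported directory on the server and on the client mount point, then passing both to differing_fields, shows at a glance whether something like the block size is being reported differently.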
Now, with the POSIX extensions,
you can see what it looks like.
Notice with the POSIX extensions,
things match better.
Okay? This is
the stat of the file system. Okay, so what are the
gory details? I don't know if you want to talk about the
gory details of the negotiate.
So this is
what's changed. Yeah.
So, essentially,
the client sends the 16-byte GUID
on the negotiate saying,
I would like to do POSIX extensions,
and if the server replies with that,
then you know that the server at least speaks
that version of POSIX extensions.
And the thing about this is that
if we ever do need to make
a fundamentally
incompatible protocol change, we can just version the GUID. We can say, okay, this is the old,
you know, I don't do this version anymore. We'll just allocate a new GUID and that's the one that
you get back. So at that point, when you're talking to a server, originally we thought about
just making this a new create context and essentially you would just send it on any create context you wanted.
The benefits of adding it into the negotiate are such that you know that the server is capable of doing POSIX,
and if it denies you a POSIX handle when you request an open, you know that that's deliberate.
It's not that it didn't understand POSIX, it's just that it said,
for this area of the file system, I ain't giving you POSIX semantics. Now, that might
be because you've got an NTFS drive mounted that is actually obeying different semantics
or an EXT or an exFAT drive mounted that you're exporting, but it allows the client to at
least make a sensible decision of whether, hey, I didn't get POSIX semantics because
the server didn't want to give it to me
versus I didn't get POSIX semantics
because the server can't do it.
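The negotiate logic described here can be sketched in a few lines of Python. This is a simplified model, not the kernel or Samba code: the two GUIDs are hypothetical placeholders (the real one is not given in the talk), the context type is the "100" mentioned on the slide read as hexadecimal, and the header layout (type, data length, reserved) follows the general SMB2 negotiate context format as I understand it.

```python
import struct
import uuid

POSIX_CONTEXT_TYPE = 0x100  # "100 is the type"; hex is an assumption here

# Hypothetical GUIDs for illustration only, one per protocol version.
POSIX_EXT_V1 = uuid.UUID("00000000-0000-0000-0000-000000000001")
POSIX_EXT_V2 = uuid.UUID("00000000-0000-0000-0000-000000000002")


def pack_negotiate_context(guid):
    """Header (type, data length, reserved) followed by the 16-byte GUID."""
    data = guid.bytes
    return struct.pack("<HHI", POSIX_CONTEXT_TYPE, len(data), 0) + data


def server_reply(offered, supported):
    """The server echoes the GUID only if it speaks that version."""
    return offered if offered in supported else None


def classify_open(negotiated, got_posix_handle):
    """Tell 'server cannot' apart from 'server chose not to'."""
    if negotiated is None:
        return "server cannot do POSIX"
    if got_posix_handle:
        return "POSIX handle granted"
    return "server declined POSIX for this path"
```

Versioning by GUID means an incompatible change needs no version field: a server that only knows the new GUID simply does not echo the old one, and the client falls back.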
So one quick thing.
Notice that the context we send is not that big.
It basically has a GUID.
You know, 100 is the type.
And then it's not that big.
So it's just sending the GUID in it.
Negotiate contexts should have been GUIDs
in the first place anyway.
Yeah, exactly, exactly.
That was a whole protocol decision.
So this is easy for implementers.
It took less than an hour for the other vendor
to do this part, right?
Yeah.
So let's go to the next one.
Now on create, we actually have to send that GUID.
So you can kind of see the chain.
You've got the durable handle context right underneath it.
You've got your context.
So the POSIX context sent on the request is actually pretty easy.
Yeah, it's just sending that request. It's basically just adding a new create context type
and then parsing the result if you get it back
that says you got it.
Yeah, and you can see it on the response here.
Yeah.
Samba's sending it back on the response,
and there's a little bit of data in it that has some...
Yeah, and it's basically the same kind of information
that you would get back over and above the info level,
the standard info levels that you get back
when you do a Windows query file info.
So the goal, again,
I know this is repeating from last year,
but the goal with the POSIX extensions
was never to send the same information twice.
So if Windows already gives you that information,
you don't duplicate it in any POSIX returns.
You enhance the information
that you could have gotten from Windows.
You don't overwrite it, or you don't duplicate it,
because any time you send back duplicate information
to the client, there's the possibility
that the server can send you back
two different conflicting values,
and then what do you do?
You've no idea where exactly you are.
A good example of this is the inode number.
We can get the inode number back today
in one of the contexts, QFID, right?
So today we query that to get the inode number.
There's no sense duplicating that
because the only code we want to change for POSIX
is stuff that really is different.
So sending this create context,
getting it back,
was a clue that the server said
that the rename semantics and delete semantics
are POSIX style delete semantics.
On this handle.
But notice that many of the other things,
like those reparse points and other things,
had absolutely nothing to do with it.
They would work to a Windows server or a Mac server.
Okay, so let's go.
So here's an example without the POSIX extensions.
We can use, you know,
you can use various ways to map mode bits.
There's special ACEs, for example.
There's cifsacl.
There's various ways to map the mode bits.
But let's go to the next slide.
You know, with the Unix extensions,
we now have a better way to do that.
Now, what was the number one thing wrong with what we had?
The POSIX extensions worked great.
What is it, SCO and Hewlett-Packard 25 years or 20 years ago had done this stuff.
Jeremy and I had even occasionally modified it.
But there was this WannaCry thing.
And you know, SMB 3.1.1 really is pretty good. Now, Apple
did some interesting things with their case sensitivity, but they
didn't handle all the POSIX compatibility issues.
So, although it's a useful experiment,
it didn't solve all the problems we needed.
Well, so, I mean,
Macs basically are interested in talking to
Macs and having the Apple semantics
that macOS needs.
So, essentially, I mean,
the Apple create context is essentially a Mac-to-Mac thing.
Now, having said that,
Ralph has written modules in Samba
that emulate a Mac server.
So Samba now has sort of like triple personality.
It can be a Windows server, it can be a Mac server,
and with the SMB3 POSIX extensions,
it can be something new,
which is sort of a hybrid,
which actually gives
as close to POSIX semantics
as we can make it.
Now, there are still some missing things
that were...
My guess is we probably shouldn't try
and get them done before we officially ship,
because the perfect is the enemy of the good.
And those are things like being able to set mixed case extended attributes.
So right now, essentially, you don't get case sensitive extended attributes,
which is what Linux at least requires.
We just map the extended attributes
you might want to send into the Windows space.
And that's probably good enough.
Yes, there may be some weird applications that fail.
And probably some test cases that check
the extended attribute semantics will fail.
But we haven't found any actual apps that use that.
The one thing we have to be careful about on that is...
So, for security,
SELinux, there are trusted and...
Well, SELinux is the only place
that we may end up having to extend,
and that's basically because right now
all of the EAs, including the POSIX EAs that you send,
they all live in the user namespace.
And so, for SELinux, you have the, what is it, the system and security namespaces.
System security and trusted.
Yeah, you have different namespaces.
Right now, we have no way of mapping those into existing SMB3.
SMB3 doesn't have the concept of different namespaces for EAs.
Yeah.
So that would be an extension maybe that we would have to add.
So one of the things we would love feedback on on this one is,
you know, after talking with, like, the WSL guys,
one option that people have considered is just sending on the wire
user.attribute or system.attribute,
and that doesn't require as much changing.
And as an aside, Ronnie had also done a proposed patch
on the CIFS client for basically encoding case into the whatever.
That's a possibility to consider.
But if you guys have feedback on it, please talk to us at the test event or in the hallways.
What that would mean, of course, is that you could never have a system namespace EA that started user.something.
But, I mean, that's horribly confusing anyway.
So this is one of those things where we may end up getting away with it basically.
And right now, the other thing is
the Windows EA namespace is ANSI only, I believe.
It's not even, you can't have Unicode.
You can't have UTF-8 extended attributes,
which I believe Linux can.
So this may be somewhere
that when you have a POSIX handle,
the EA namespace
changes such that if you
send it without a user, system,
or security
tag, then it goes into user,
or if it detects user, system, or
security as the first part of the name,
then it puts it in the appropriate namespace,
depending on permissions that you've connected with, of course.
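The naming rule being floated here is simple enough to write down. The following Python sketch is one possible reading of it, not an agreed protocol behaviour: a name whose first dot-separated component is a known namespace goes to that namespace, and everything else lands in user. Including trusted in the list is an assumption based on the earlier mention, and the permission check is left out entirely.

```python
KNOWN_NAMESPACES = ("user", "system", "security", "trusted")


def split_ea_name(wire_name):
    """Map an EA name as sent on the wire to (namespace, attribute)."""
    prefix, dot, rest = wire_name.partition(".")
    if dot and rest and prefix in KNOWN_NAMESPACES:
        return prefix, rest
    # No recognised prefix: the whole name is a user attribute.
    return "user", wire_name
```

The sketch also makes the side effect visible: split_ea_name("user.something") always resolves to the user namespace, so a system attribute literally named user.something cannot be expressed.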
My theory is that because support for SELinux was one of the few features in NFS 4.2,
it's probably important enough to do.
But feedback would be welcome.
In addition, there's this evil app, Samba, that sometimes uses EAs.
One of the things we might play around with
is just seeing if Samba's EA usage
if it was sitting in a whatever would...
Basically, it's hard to find apps that use EAs
because there aren't as many as you might think.
But there are some examples like Samba.
Only when we're interoperating with NFS.
We would use the system namespace for storing...
Right.
If we're mapping an incoming Windows ACL
into an NFS v4 ACL,
not that the kernel ever looks at that,
but then we would...
But an example of Samba is it doesn't use...
It doesn't rely on case sensitivity,
just case preserving.
Yeah, that's true.
That's true.
Yeah.
So about once a year, maybe twice a year,
the API changes to the kernel.
There are minor global changes that happen more often.
I think we've talked about some examples.
io_uring, async IO.
The mount API changed.
copy_file_range and clone file range
now support cross-mount copies.
Three or four times last year,
but on a typical year, once or twice the API changes.
One of the things that we wanted to ensure
was that we were able to quickly update the protocol if needed,
because obviously some of these things we can compensate for,
just hack without changing the protocol.
But if we do have to change the protocol,
we don't control what all these guys do in that room over there
arguing about NVMe or whatever.
And if they change it, we have to be able to adapt quickly.
One of the reasons why we decided to do it on this,
it's basically a GUID idea.
It will allow us to do this in the future
because we can't predict the future.
And obviously we need to have much better interaction
with communities.
If you look at what's driving NFS, guess what?
A lot of interaction with the containers community.
What we have to be very aware
of is we have to interact well with the database and
containers community because in a lot of ways, this is
the commodity protocol everybody should be using.
Yeah. And
of course, as soon as Namjae's code
gets into the Linux kernel, and if
they start implementing this, we end
up with two implementations
and having tests
and tracking changes then becomes,
you know, a million times more important at that point. Because it's not just him talking
to the Samba server. It's not just Steve's Linux kernel client talking to the Samba server.
At that point, it becomes an actual ecosystem that we have to make sure keeps working.
Even at the Tel Aviv event, having that one vendor doing this was helpful.
Yeah.
Let's see. So we talked about create, rename.
Obviously, when the files are open,
it fails in Windows.
Obviously, we have to support that.
And that's different semantics in POSIX.
Let's go on to the next one.
There's more and different stuff that comes back.
One of the things that I found fascinating,
we were looking at get info, I think,
with Aurelian last week.
There's a getattr call,
and it takes like four fields,
and most file systems ignore two of the fields.
There's a lot that can come back on getattr.
So we need to be able to, you know,
there's more metadata,
and a lot of file systems ignore this.
And, of course, POSIX locking, your favorite topic.
I mean, Volker has made some fantastic changes in the Samba server to eliminate some of the
SMB1 insanity and weirdness. And so it should be a lot easier as we move forward to map
the POSIX semantics onto our existing SMB2 only backend.
But we should be very careful because one of the things that is so subtle
is what's wrong with the slide?
Do you see the word POSIX shouldn't be there?
Do you know why?
Well, yes.
Because they're not POSIX locks.
Yes, but they're close enough,
or they are close enough that you've got a POSIX handle.
They're OFD locks, right?
Yeah, yeah.
Well, okay, yes.
So it's funny, Jeff Layton,
a guy who's been at some of these events before,
had noted that POSIX locks
are basically useless for most cases.
And so many of the applications now use OFD locks,
which are POSIX in some sense,
but they're stackable.
Or rather, they still use the POSIX
because they're using them in a way
that they think OFD locks behave.
They still set the standard handles.
But over the wire, Samba, by accident, by a very happy and lucky accident, that they think OFD locks behave. They still set the standard handles.
But over the wire, Samba, by accident,
by a very happy and lucky accident,
has always implemented OFD locks.
And that's much friendlier for most Linux use cases than what's recommended.
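
The difference between classic POSIX locks and OFD (open file description) locks shows up when one process opens the same file twice. What follows is a minimal sketch, not anything from Samba or the Linux client. It assumes 64-bit Linux and Python 3.9 or later (where `fcntl.F_OFD_SETLK` is exposed); the helper names are made up for illustration, and the `struct flock` packing assumes the 64-bit Linux layout.

```python
import fcntl
import os
import struct
import tempfile


def _whole_file_write_lock():
    # Assumption: struct flock layout on 64-bit Linux is
    # l_type (short), l_whence (short), l_start (off_t), l_len (off_t), l_pid (pid_t).
    # l_len = 0 means "to end of file"; l_pid must be 0 for OFD locks.
    return struct.pack("hhqqi", fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0)


def try_lock(fd, cmd):
    """Try a non-blocking whole-file write lock; return True if granted."""
    try:
        fcntl.fcntl(fd, cmd, _whole_file_write_lock())
        return True
    except OSError:
        return False


def second_open_can_lock(cmd):
    """Lock via one open of a file, then try again via a second open in the same process."""
    with tempfile.NamedTemporaryFile() as tmp:
        fd1 = os.open(tmp.name, os.O_RDWR)
        fd2 = os.open(tmp.name, os.O_RDWR)
        try:
            assert try_lock(fd1, cmd), "first lock should always succeed"
            return try_lock(fd2, cmd)
        finally:
            os.close(fd1)
            os.close(fd2)


# POSIX locks belong to the process, so the second open is not blocked.
posix_result = second_open_can_lock(fcntl.F_SETLK)
# OFD locks belong to the open file description, so the second open conflicts.
ofd_result = second_open_can_lock(fcntl.F_OFD_SETLK)
```

With per-handle ownership, each open behaves like an independent lock holder, which is the behaviour the speakers say Samba has always had over the wire.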
FS info, there's a few extra fields.
We talked about that with supports.
It works.
Okay.
Your pain points.
So the perspective.
Once the VFS changes are done, we can start.
The actual patch set in that somewhat moribund tree is reasonably small.
So once the VFS changes are done, we can probably move stuff over.
And at that point, the test.
So Steve's plan is great.
It's very easily changeable, we'll work with it, but at that point what we really need
is more and more tests: protocol test suite changes
and, of course, as this is Samba, tests inside
smbtorture. So at that point
we can really nail down, okay, what
exactly are the semantics that a POSIX handle expects
and is willing to grant? Now,
it may turn out that what we thought we were doing isn't what we're
actually doing, in which case we have to either document what we
wanted it to do or what we actually do. And we haven't gotten
to that point yet.
That's where, essentially,
the documentation piece gets really important.
Right now, because the code isn't in upstream master,
I still think it's a little too early to try and write
down exactly what the protocol should be,
especially with the WSL
changes that came through recently.
Had we standardized before then,
we would have standardized too early,
and we would have had those wrong types baked
into the
version of the protocol we were shipping.
I was really excited. Volker, if I remember correctly,
had done some smbtorture
tests, right? Yes, yes. He's added some.
One of the things that was kind of fun is that those were
able to be leveraged by people who were experimenting,
right? Because you have the smbtorture tests.
You have the Linux client.
The Linux client implements half of the stuff.
smbtorture implements more.
So we have more stuff to try against Jeremy's code than you'd think.
So details are super easy.
You've got a POSIX negotiate context, 0x100.
You include a GUID.
Now, on tree connect:
maybe in the future that'll change,
but right now, nothing.
This is very easy.
Nothing much.
An open context and a negotiate protocol context
and one new info level.
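
To make the "negotiate context" concrete, here is a rough sketch of how an SMB2 negotiate context is laid out on the wire: a 2-byte type, a 2-byte data length, 4 reserved bytes, then the data, padded so the next context starts on an 8-byte boundary. The type value is the one mentioned in the talk, read here as 0x100. The 16-byte identifier below is a placeholder of zeros, not the real GUID, and the function and constant names are illustrative only.

```python
import struct

# Context type as mentioned in the talk (read as 0x100); treat it as an assumption.
POSIX_EXTENSIONS_AVAILABLE = 0x100

# Placeholder only: the real 16-byte GUID is defined by the extension itself.
PLACEHOLDER_GUID = bytes(16)


def build_negotiate_context(context_type: int, data: bytes) -> bytes:
    """Pack one negotiate context: type, data length, reserved, data, 8-byte padding."""
    header = struct.pack("<HHI", context_type, len(data), 0)
    body = header + data
    padding = (-len(body)) % 8
    return body + b"\x00" * padding


ctx = build_negotiate_context(POSIX_EXTENSIONS_AVAILABLE, PLACEHOLDER_GUID)
```

A server that does not recognise the context type can simply ignore it, which is what keeps the negotiation backwards compatible.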
Keep going.
We're already showing case sensitivity.
Yeah, now, if you support the POSIX context, what are we expecting?
You support case-sensitive names,
you support POSIX semantics for unlink and rename,
you support advisory
OFD locks, and if you want
a boring description of OFD locks, here it is.
And that the path names are not
remapped. Yeah. They're still
UCS2, but essentially the Windows restrictions
on standard path names go away.
And no streams.
No streams!
Sorry.
If you want streams, open a Windows handle.
Yep.
Hard links are just hard links.
Nothing to do.
Distinct reparse point tags, that's the one change.
Notice the cross out there.
We have an ACE with a special SID
that allows you
to set the mode bits. And, you know, fallocate and other things are just mapped to existing
SMB3 operations where possible. Yeah. I mean, the existing SMB3 operations are actually
rich enough to cover, I think, do they cover everything Linux fallocate does or is it
close? There's a collapse range that we've been thinking about, like where you take a range out of the middle of a file
and smush it together, where it's a two-step process,
so it's a question whether you can do it atomically and safely,
but yet you could emulate it.
But I think those are kind of,
a lot of file systems don't support those,
so I think we can go forward.
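
The two-step emulation of collapse range mentioned above can be sketched on a local file as follows. This is only an illustration of why it is not atomic: another reader can see the file between the copy and the truncate. The function name and chunk size are arbitrary choices, not anything from the client or server code.

```python
import os
import tempfile


def collapse_range_emulated(path, offset, length, chunk=1 << 16):
    """Remove `length` bytes at `offset` by copying the tail down, then truncating."""
    with open(path, "r+b") as f:
        size = f.seek(0, os.SEEK_END)
        if offset < 0 or length < 0 or offset + length > size:
            raise ValueError("range lies outside the file")
        src, dst = offset + length, offset
        # Step 1: shift everything after the removed range down over it.
        while src < size:
            f.seek(src)
            buf = f.read(min(chunk, size - src))
            f.seek(dst)
            f.write(buf)
            src += len(buf)
            dst += len(buf)
        # Step 2: cut off the tail, which is now duplicated.
        f.truncate(size - length)
```

For example, collapsing 4 bytes at offset 3 in a file containing `0123456789` leaves `012789`.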
And by the way, if you want to beat on Jeremy,
it would be really nice to get XFS
and some of the other file systems.
Right now, all of these operations in Samba
require Btrfs.
Yeah, yeah.
But that's another reason
that he's having to rewrite the VFS.
Take a number, join the queue.
So one of the reasons you have to rewrite the VFS
is because some of the things like fallocate
require callouts to file-system-specific stuff
and aren't general operations.
But notice something really cool here.
A lot of this doesn't actually require POSIX extensions.
A lot of this stuff would work to any server.
So that's the goal, small.
So let's see, I think we've already covered that.
How it works: the create context.
You can have POSIX and non-POSIX open
depending on what the handle was that you asked for.
Let's see, yep.
If you want to see the owner,
if you want to pass the UID or GID,
it's the same way, right?
Well, so this is basically, this is the fallback.
If you're talking to a server that
isn't in Active
Directory and has its own
UID and GID space, this is what you would get back.
But this is
yeah.
Let's see.
Actually, Aurelien
had a good link.
I may include it
in my presentation tomorrow.
I'll link to stuff that explains this better.
See, so that's the only other...
Yeah, so this is the only info level.
And notice it's a pretty simple payload.
216 bytes, whatever.
This is what it looks like.
You've got the POSIX create response
plus device ID, I know, whatever.
It's not that much.
Yeah, and it's basically based on top of the Windows
all info level with some extra things.
So, thank you Aurelien over there
because he has a dissector here
and Pike sample test code.
So we have Volker's test code,
we have Aurelien's test code,
we have a Wireshark dissector,
we have two servers, one open source, one closed source.
We have my client.
It's enough to experiment with.
See.
Okay.
So what's the hard part?
We have to examine every single one of the xfstests, there's 560 of these things, and every single failure.
And there's about 200 xfstests
that have no relevance to a network file system,
but 350 of these, we have to go through every single one
and see if there's anything we missed.
So I can probably do some parallel work
on the new reparse point stuff
in the separate tree that's using the old VFS layout
just so that we can experiment
and understand what it is that it needs to look like.
But my
day job, basically, 99%
of my time is basically
cranking out and fixing up the
VFS changes that we need to modernize
the Samba internal VFS.
Okay, yes, I think that's
the last slide. So if anyone has any questions, we'd be very happy to...
Yes?
So if I were to simplify the goal,
like the purpose of the SMB3 POSIX extensions,
would the following be the correct statement?
So the goal of the SMB3 POSIX extensions is to
make native Linux applications
work with a remote SMB server
as if it were an NFSv4 server.
Oh no, no, we already
so be careful about that. Better than that.
Sorry.
So the question was,
is the purpose of the POSIX extensions
to make SMB3 Linux to POSIX work as well as NFS v4?
And, of course, NFS v4 sucks,
so much better than that.
So if I was going to say,
what is my gold standard,
what I would really love POSIX extensions to be,
I would like you to be able to boot a diskless Linux kiosk from SMB3
and have everything mounted from an SMB3 file server.
The device directory, absolutely everything, and everything just work.
Probably SELinux won't work right now,
but eventually we will get there,
so that you'll be able to run everything,
even SELinux-aware apps over SMB3.
So the gold standard is the diskless boot.
The platinum standard is SELinux.
But we should be leaving NFS v4 in the dust.
So one of the things that was interesting,
completely independently,
somebody came up to me and was like,
yeah, Microsoft Distinguished Engineer, VP guy,
tell me when you can boot off SMB3.
Diskless.
And I think I can understand why.
There's a lot of cases where it makes sense
to have only remote storage or a swap remote
or boot remote, whatever.
Now, interestingly, one of Aurelien's colleagues, Paulo,
we just put those changes up there for booting over SMB.
Now, there's two lines of change or three lines.
There's a very tiny bit of other change needed
in one of the network drivers.
But why can't you boot over anything other than SMB1?
It's because of those special files.
So with SMB3, the only thing stopping Paulo,
Aurelien's colleague at SUSE,
there was a little news article published last week about it,
from booting over SMB3 is the special files.
And those special files you can actually do,
technically, without the Unix extensions.
So remember the goals here were
do everything you can with normal SMB,
recommend whatever you can with things like reparse points
and other things that work,
and only the things that you can't do
with things like the reparse point example
or some of these special aces or things like that,
only those special things make part of the extension.
So that POSIX behavior on open...
So here's a question for you, Steve, that I have.
Right now, when you're sending me a symlink,
are you sending me an NFS-style reparse point symlink,
or are you sending one of the new ones?
So you're going to have to change that,
because I'm just storing what you send me.
And so I'm sending the wrong thing on that one.
Although, to be honest, by default,
I don't send either one,
because by default, I use the Apple-style symlinks. And why? Because poor Jeremy, I don't
want him to, like, go crazy. Because if I make a symlink that only the client recognizes,
he has more time left to finish the extension. So the Apple-style ones are client-only evaluated.
Yeah.
But that essentially means that
when you hit a stopped-on-symlink,
when you're parsing a path name
and you hit stopped-on-symlink,
it means right now we are returning
an NFS-style stopped-on-symlink
and we're going to need to change that.
Right, and I recognize those.
WSL.
So this is why I didn't want him
to ship his client code out the door quite yet.
The good news is that only if you created it over NFS
or only if you created that symlink locally
would you ever see that.
Because otherwise it would be like Apple.
I will map it into a reparse point on return.
Question, yes.
Yeah, currently FreeBSD only implements
user and system namespaces for our servers.
Are security and trusted eventually going to be required for proper...
So the question is, FreeBSD right now only uses user and system.
Are you going to require trusted...
So, I mean, I don't think so.
If we do this the simple, naive way of basically saying,
okay, on a POSIX handle, you can send a full namespace qualified
EA name. Then at that point,
essentially, what
you send is what we would store, the
namespace we would try and store into.
If you never send me the
trusted and the other namespaces,
I never try and map into them. I never try
and use them.
Having said that,
once those things are there,
you would probably see them
when you enumerated EAs.
That's the other thing.
Are you going to see these namespaces
when you actually open a handle on a file
and say, get me a list of all EAs
that this file has?
Right now, I think we actually filter out the system
namespace.
I would have to...
This is why this is
the area that we are
least sure about.
I think what we need to do is
actually to write some test
code on a local file system that
supports these EA namespaces
to find out if you're running as a non-root
user and you say enumerate the
EAs on this file, what do you get back?
Do you actually see the system? Do you actually see
trusted? Or do you only see
the user ones that you could access?
Right now, I simply don't
know the answer to that question.
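
The local experiment described above (enumerate the EAs on a file as a non-root user and see which namespaces show up) could be sketched roughly like this. It assumes Linux, where `os.listxattr` is available, and a file system with extended attribute support; the helper names are illustrative and nothing here comes from Samba.

```python
import os
from collections import defaultdict


def group_by_namespace(names):
    """Group namespace-qualified EA names such as 'user.comment' by their namespace."""
    groups = defaultdict(list)
    for name in names:
        namespace, sep, rest = name.partition(".")
        if not sep:
            # No namespace prefix at all; file it under the empty namespace.
            namespace, rest = "", name
        groups[namespace].append(rest)
    return dict(groups)


def visible_ea_namespaces(path):
    """Linux only: list the EAs the calling user can see on `path`, grouped by namespace."""
    return group_by_namespace(os.listxattr(path))
```

Running `visible_ea_namespaces` on the same file as root and as an ordinary user would show directly whether `system`, `security` or `trusted` entries are listed for each.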
I don't think it's
ever going to be required unless
you're running a Linux app
against a FreeBSD server,
but then that would fail in the same way
if you weren't able to store those namespaces anyway.
I think in most cases,
Linux systems work with file systems
that don't store EAs perfectly.
The SELinux stuff may not work.
So SELinux wouldn't be able to enable,
but there are example file systems
that people have booted over
that don't have support for this.
But right now, I think probably NFS
might be the only remote file system
that is even dreaming of supporting SELinux.
Well, and how many people mount with NFS 4.2?
Yeah, exactly.
I mean, so realistically, I think in 90% of the cases,
they're not actually even going to use it.
Are we out of time?
Oh, we've got five minutes.
Okay, so any other questions?
Yeah.
I have another one.
Sure.
We have native NFS v4 ACLs.
We don't have POSIX.1e ACLs.
So I was wondering, in case where a client requests ACL information,
is it better to just creatively lie?
Oh, so the question is you don't have POSIX ACLs,
you only have NT ACLs.
POSIX ACLs do not exist in the SMB3 extensions.
POSIX ACLs are gone!
Because POSIX ACLs don't map very well to SMB,
and they expose UID and GID information. So,
for SMB3, we made a decision
early on, for SMB3 POSIX,
it's Windows ACLs only.
And so Windows ACLs map
very well into NFS v4 ACLs and vice
versa, so you just
use our mapping layer, or
have a mapping layer that says, hey,
it's an NFS v4 ACL, map it into Windows,
there you go, and do it on get and set.
So, yeah, that's no big problem.
Yes?
Just to clarify something,
or to make sure I'm thinking this correctly.
So the new reparse tags,
like the IO reparse tags,
the LX FIFO, character, block,
those would only be returned
if I opened up the directory
with the POSIX create context?
So I think, so the question is, would you see those new reparse point tag types if you opened up with a POSIX handle or not?
Yes, you see them always, because I ran them against Windows.
So some of the Windows
guys behind me, right, they reminded me that
you can see these tags in a directory listing
and whatever. So I
took my laptop and I mounted a Windows
system and created those, and like, oh,
they're right. So they come back as normal.
Yeah, they're just Windows reparse
point types, so you'll see them
on any directory handle.
The thing is, at least going to Samba,
I won't let you create them unless they're those types.
So you can never create...
Right now, we don't support a generic reparse point type,
mainly because Windows reparse points
can be created on directories,
which I think is a nightmare and a terrible idea.
So we never implemented that.
So right now, the only way you can see those tags
will be if you've opened a POSIX handle,
created a zero-length file,
and then you try and stick one of those reparse point types on there,
and that will then be,
okay, so now you're turning this file into a client-seen FIFO,
or now you're turning this file
into a client-seen symlink,
or a client-seen Unix domain socket, et cetera.
So, you know, there's no sort of SMB3 call saying,
oh, create me a symlink.
No, you open, at least for POSIX, you open
a zero-length file and you set
that reparse point type
on it. And that's then your symlink.
You send me the data blob that you want me to
store. And then as a client, when you
blunder into it, I come back and say,
oh, you blundered into a reparse point
that's of type
WSL symlink, and then
you either read it or you hit it
when you run into the stopped-on-symlink error.
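
On the client side, turning one of these reparse tags back into a POSIX file type is essentially a table lookup. The sketch below is illustrative only: the tag values are the WSL reparse tags as I understand them from the Windows and Linux headers, so check them against the official definitions before relying on them, and the function name is made up.

```python
import stat

# Assumed tag values for the WSL special-file reparse points.
WSL_REPARSE_TAGS = {
    0xA000001D: ("IO_REPARSE_TAG_LX_SYMLINK", stat.S_IFLNK),
    0x80000023: ("IO_REPARSE_TAG_AF_UNIX", stat.S_IFSOCK),
    0x80000024: ("IO_REPARSE_TAG_LX_FIFO", stat.S_IFIFO),
    0x80000025: ("IO_REPARSE_TAG_LX_CHR", stat.S_IFCHR),
    0x80000026: ("IO_REPARSE_TAG_LX_BLK", stat.S_IFBLK),
}


def file_type_for_tag(tag, default=stat.S_IFREG):
    """Return the stat file-type bits a client might present for a reparse tag."""
    _name, file_type = WSL_REPARSE_TAGS.get(tag, (None, default))
    return file_type
```

Unknown tags fall back to a regular file here, which is one possible client policy and not necessarily what any real client does.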
Does that help?
It's not.
So because we only discovered this change
on the WSL tags last week,
this only exists in our heads.
This doesn't exist as code yet.
So that's going to take a while. Yes, another question.
I don't know if you missed that slide, but you mentioned those special SIDs with NFS, UID, GID numbers, and mode included.
Are they presentation only? And who does your access permission check?
So the question is the NFS,
the Unix permission modes, et cetera,
they are essentially presentation only because when you set those using the set ACL,
the server gets to decide
what it turns that into
on the file system.
So right now in Samba,
we map the mode bits that you send.
You do a special set ACL with that ACE,
and we will map that into a chmod, right?
But that then goes through our mapping layer
of turning that into an ACL, et cetera, et cetera.
So that's essentially server-dependent.
So when a client does
chmod 755,
you send that, you open
a POSIX handle, you send that
special
SID saying, I want to set
a mode, and then the
server will do what
the server decides to
do with that request.
It may turn it into a Windows ACL.
It may put it as a direct.
If it's underneath a POSIX file system,
it may put it as a POSIX, change the POSIX mode bits.
You don't know what it's doing at that point.
So one of the things that's kind of fun is that, once again,
this doesn't actually require the POSIX extensions.
So one of the things that was discussed among the SMB client developers was:
the server doesn't understand it, so what?
In NFS, a lot of the times,
you're just evaluating permissions on the client.
You really don't care much about the server.
As long as the mode bits persist, it's good enough.
You trust the client.
You just want the mode bits to be evaluated correctly on the client.
So in that model, we have this thing called modefromsid.
That mount parm, modefromsid,
doesn't even require the POSIX extensions.
Oh, yeah.
And the cool thing about that is it sets the special ACE.
The special ACE never matches an existing user,
so it never is relevant.
The only thing that matters is the mode bits are perfect.
The client sees the exact mode bits.
Permissions can be evaluated on the client.
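
A rough sketch of the idea behind the mount option described above (spelled `modefromsid` in the Linux client): the mode bits ride along in an ACE whose SID matches no real user, and the client evaluates permissions from them locally. The SID prefix `S-1-5-88-3-` is the NFS-style "mode" SID as I understand it; treat it, and all the helper names, as assumptions for illustration and not as the client's actual code.

```python
# Assumed NFS-style SID prefix that carries Unix mode bits in its last sub-authority.
MODE_SID_PREFIX = "S-1-5-88-3-"


def mode_to_sid(mode):
    """Encode permission bits as a SID string that matches no real account."""
    return f"{MODE_SID_PREFIX}{mode & 0o7777}"


def sid_to_mode(sid):
    """Decode the mode bits from such a SID, or return None for any other SID."""
    if not sid.startswith(MODE_SID_PREFIX):
        return None
    return int(sid[len(MODE_SID_PREFIX):]) & 0o7777


def client_allows(mode, file_uid, file_gid, uid, gids, want):
    """Classic owner/group/other check; `want` is a mask of 4 (r), 2 (w), 1 (x)."""
    if uid == file_uid:
        granted = (mode >> 6) & 7
    elif file_gid in gids:
        granted = (mode >> 3) & 7
    else:
        granted = mode & 7
    return (granted & want) == want
```

With mode 0755, a user who is neither the owner nor in the group can read and execute but not write, and all of that is decided on the client without the server understanding the ACE at all.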
You can also estimate the ACLK holes on the server,
but it's hard because obviously with inheritance and things.
Because to be honest, the only thing that really matters
is the access decision that the server makes
when you request an open.
So you can set whatever mode bits you like,
and they can maybe be restored,
but it's when you do that, can I open it for this?
That's when the rubber hits the road
and you get the handle back or not.
If you wanted to play without the POSIX extensions,
try the modefromsid mount option.
Oh, okay, I think we're out of time.
I haven't got my glasses on.
Yes, oh,
one last question.
So when you set the mode
on a directory,
let's say 0755,
and then try to create a file
or folder inside the directory,
do we actually derive the mode bits as well
on the newly created folder or file
and then show that in Linux?
So the question is,
if you set the mode bits on a directory
and then you create something inside it,
and what happens about deriving mode bits
from the containing directory?
That's a server decision.
So yes, the server has the information
because you set the mode bits.
The server could, if it's on a POSIX file system,
say, oh, I'm going to do standard POSIX
inheritance of group or whatever
and set the mode bits.
But it doesn't have to do that.
And as a client, you're going to
have to cope with it not doing that anyway.
I mean, at least on Windows, it would ignore it and it would
set whatever the inheritable
ACL flags were anyway.
So, yes, theoretically
you should get...
So, if essentially you're talking
to a Samba server that's on top of
POSIX and he's doing as much POSIX pass-through as it can,
then yes, everything will work perfectly.
But your app, running through Steve's client code,
still has to work against a Windows server,
which will do nothing of the kind
and not even send that.
So, I mean, you know, the goal is that it tries to...
Oh, I think we just dropped off the internet.
Never mind.
The goal is... Plus you need to restart
your Chrome, I think.
Anyway, the goal is that
we try and
do our best effort, but
we can't guarantee that it's going to be perfect.
Because remember, you're talking to a remote server.
The remote server may not even be running your OS.
It may not have the same system calls.
So you're asking
it, please do this
thing but whether it does so or not
is entirely up to the server
does that help?
Got one last one, then they're going to kick us out.
Yeah.
The server always makes the call.
Sorry, the question is, what happens when the server
makes the call?
In the case of the Samba server, does it look at the smb.conf settings?
So the question is,
in Samba, do the smb.conf settings
trump anything the client requests?
Always. If you say
I always want these mode bits, you will
always get those mode bits on the definition
in the share. If you say
I never want to see these mode bits, we will
always remove those mode bits.
The smb.conf parameters
set by the administrator
of the server are always truth.
Just because his sneaky client
says, I want this to be 777
so my friends can see it. If your
Samba server administrator
has said, thou shalt never set
any other mode bits, they'll be wiped out.
And we'll just, you know,
he thinks he's set them, but when he queries them
he'll get back the mode bits we set.
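
The "administrator always wins" behaviour can be pictured as a mask-and-force step applied to whatever the client asks for. The sketch below is modelled on the `create mask` and `force create mode` share parameters in smb.conf (bits not in the mask are removed, forced bits are always added). Which parameters Samba actually applies to a given request is the server's business, so take this as an illustration with a made-up function name.

```python
def apply_share_masks(requested, create_mask=0o744, force_create_mode=0o000):
    """Return the mode bits that survive the share's administrator settings."""
    return (requested & create_mask) | force_create_mode


# The client asks for 0777, the share only allows 0744.
effective = apply_share_masks(0o777)
```

The client's request still succeeds, but a later query returns the masked value, which matches the "he thinks he's set them" remark above.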
Yeah? And that happens in Linux today, by the way.
Yeah. Okay, and I think
given that, we're out of time, so thank you very much.
Thanks a lot.
Thanks for listening.
If you have questions about the material
presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional
information about the Storage Developer Conference, visit www.storagedeveloper.org.