Storage Developer Conference - #125: Opening up Linux to the wider world

Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 125.

Starting point is 00:00:41 Welcome to another year of the POSIX extensions. So this is an exciting time. We get feedback and find out stupid things we did and find out things that can be done better. We had a chance to try these out at the... But first of all, this is Steve French of the Microsoft Azure file-serving team, for those of you who don't know us. And I'm Jeremy Allison from Google.

Starting point is 00:01:04 And there we go. That's what I was waiting for. And Google don't even know or care that I'm here. And I want to keep it that way. So everything I say here is not the opinions of Google, blah, blah, blah. And for Steve, I think Microsoft know you're here. Microsoft knows I'm here, but this isn't Azure code.

Starting point is 00:01:22 This is the kernel code. There's lots of wonderful things we'll hear about Microsoft tomorrow because those guys are going to say everything, right? Okay. Okay, let's go. So let's talk about a couple things. I always like starting with the slide, and I hope you guys can forgive us for repeating it,

Starting point is 00:01:43 but every once in a while somebody brings up a four-letter word, POSIX. And some of you, old enough, have even sat on POSIX committees, heard screaming matches between different companies about POSIX standards, heard, like Jeremy, stories you don't want to know about byte range locking. So one of the things I wanted to go,

Starting point is 00:02:01 let's go to the next slide. You know, it's like Windows. It doesn't matter what the spec says. It matters what the app does. And unfortunately, apps on Linux are written for Linux. Now, you see this little tiny thing called the GNU C library and those things. It calls into POSIX. Let's go to the next slide, and I'll show you how bad it is.

Starting point is 00:02:23 So there are about 100 POSIX API calls. A few minutes ago, I did a git grep SYSCALL_DEFINE. Just in the file system directory, not any other syscalls, just file systems, there are 222. There are 100 defined in POSIX. Linux has 222 in the file system alone. The man page only pulls up 400 syscalls. So you can kind of see our problem here, right, is that we think in terms of this narrow thing, but it's actually a bigger problem, right? Linux is evolving.
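The count quoted here came from a `git grep` over the kernel's file system directory. A rough Python equivalent is sketched below. The regular expression assumes the kernel's `SYSCALL_DEFINE0` through `SYSCALL_DEFINE6` macro naming, and it deliberately skips `COMPAT_` variants, which a plain grep for the same string would also count, so the totals will not necessarily match the 222 mentioned in the talk.

```python
import re
from pathlib import Path

# Matches SYSCALL_DEFINE0( ... SYSCALL_DEFINE6( and captures the syscall name.
# The leading \b keeps COMPAT_SYSCALL_DEFINEn from matching.
SYSCALL_RE = re.compile(r"\bSYSCALL_DEFINE[0-6]\(\s*(\w+)")


def syscalls_in_text(text):
    """Return the syscall names defined in one chunk of C source text."""
    return SYSCALL_RE.findall(text)


def count_syscalls(tree):
    """Count syscall definitions in every .c file under a directory tree."""
    total = 0
    for path in Path(tree).rglob("*.c"):
        total += len(syscalls_in_text(path.read_text(errors="ignore")))
    return total
```

Pointing `count_syscalls` at the `fs` directory of a kernel checkout gives a number comparable to the one on the slide, for whatever kernel version is checked out.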

Starting point is 00:02:58 And the thing to remember, of course, is that every single one of those syscalls is there for a reason, and applications that use it expect to keep working across kernel upgrades, across distributed file systems, you know, and of course, if you run an application locally and then you run it against a remote file system like SMB3 and it fails, it's always the remote file system's fault. Yep. By definition.

Starting point is 00:03:25 Yeah, and as you guys probably noticed, there are some well-known Linux file system developers here. Some of them are two rooms over. Notice they're not here. They're not here. So it's our fault, even if they're not here, right? So we have to adapt to them. And this year alone, there have been new syscalls added. In the news

Starting point is 00:03:50 two weeks ago was a follow-on discussion about openat2, one of Jeremy's favorite topics. So these continue to be discussed. It's not ending here. So I want to give an example of fallocate, because some of you have actually dealt with sparse files or used this API call. I actually had somebody last week say, well, I don't use ftruncate to set the file size. I use fallocate to set the file size. Oh, well, we've got to, I mean, we can't avoid these things.
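The ftruncate versus fallocate remark can be made concrete with a small local sketch. Both calls below leave the file at the requested size, which is why an application author may treat them as interchangeable. Note the assumption: Python's `os` module exposes only `posix_fallocate`, the portable wrapper, not the Linux `fallocate` call with its flags, so this illustrates the size-setting behavior only.

```python
import os
import tempfile


def size_with_ftruncate(path, size):
    """Set the file size with ftruncate; typically leaves a sparse file."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, size)
    finally:
        os.close(fd)
    return os.stat(path)


def size_with_fallocate(path, size):
    """Set the file size with posix_fallocate; asks for blocks to be reserved."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
    return os.stat(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        a = size_with_ftruncate(os.path.join(d, "a"), 1 << 20)
        b = size_with_fallocate(os.path.join(d, "b"), 1 << 20)
        # Same st_size, but st_blocks usually differs between the two.
        print(a.st_size, a.st_blocks, b.st_size, b.st_blocks)
```

A remote file system client has to map both paths onto the wire so that the application sees the same result it would see locally.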

Starting point is 00:04:21 This is a real Linux app developer who just expects it to work. So there are seven flags in fallocate alone. Rename has three flags. Imagine the combination of these things. Protocol stuff isn't easy in our case because we're not dealing with POSIX. This is beyond POSIX. And these are just two example syscalls. Okay? So, let's get apps to work. Case sensitivity, you've got to have it if you want to build Linux. The first test people ask me about is,

Starting point is 00:04:54 can you build the kernel on a CIFS mount? They want it, whether it's on-premises, in the cloud, or sitting in Starbucks. Right? Sitting in your hotel, running your presentation. So improve the common situations where they access Linux. And, by the way, too bad Ned Pyle isn't here, right? Yeah.

Starting point is 00:05:15 Deprecate CIFS. This is SMB 3.1.1 only, the most secure only. We don't want the old stuff. Because security matters more than everything. I don't care if you're Azure or Google or whatever else. We can't afford to run less secure stuff. Okay. So I want to talk to this slide rather than him because we have to brag about him and Volker and Metze and Ralph. I hope you guys got a chance to see the presentation this morning about all the progress these last few years in Samba.

Starting point is 00:05:46 Why do we care about Samba? Some of you remember 1992. Some of you remember Tridge. There's a lot of reasons we care about Samba. It's three and a half million lines of code of stuff that you don't want to rewrite or replace. It's great stuff. Or at least you want to be able to split it into pieces and use the pieces you need, which is what we're trying to get to.

Starting point is 00:06:10 It's like a toolbox that lets you work on the fanciest car with these really fun tools. You've got all the tools there. Okay? Okay. Okay. Anybody recognize any of these people? So the SMB3 IO Lab last week,

Starting point is 00:06:25 this is a great development community to work with. I think you'll recognize some of them here this week. But this is a great group. And as you don't really realize, this is actually up at the Microsoft campus. Just to show you how Microsoft is very much, and this is where I get to talk about, Steve, how Microsoft is very much a changed company and is basically just another member of the open source community developing Linux, developing software, open source free software on multiple different platforms. So over the last four years, we've gotten to get together to test four times over the last year as a group. So this has been kind of good. We've gotten a chance to get some feedback, try some things.

Starting point is 00:07:11 We're able to get some other implementers. You know, Linux makes a lot of progress. Linus enjoys crazy names. A year ago, we had the Merciless Moray. A week ago, we have the Bobtail Squid. So he enjoys these crazy names. Linux continues to evolve. It's a lot of fun to track.

Starting point is 00:07:28 If you want to experiment with the things we're talking about in the talk, you can experiment today. At the test event in Israel, in Tel Aviv, at SDC back then, there was a vendor who implemented the server side. You just have to try. Here's the patch if you want to backport it.

Starting point is 00:07:44 But anybody with a 5.1 kernel or later, it just works. So what terrifies me about Steve is that he keeps pushing this stuff out to the public into general kernels that people are actually building and using way before the server side is considered ready to ship. So we have experimental trees for the server side of things, but Steve is busily pushing the client code out there to everybody. There is some precedent in NFS and others for this. But when you mount, it's not by default. When you mount, it prints an ugly message in your log

Starting point is 00:08:16 saying this is experimental. And I didn't say it because Jeremy told me so. But it's true. But it's true. Okay, so let's go. Okay, so one of the things that Ronnie and Aurelien are probably trying to hide here, but one of the things that our distro partners and others have pushed is this stability and regression automation.

Starting point is 00:08:40 And it's a lot of fun, right? These xfstests, I think we run more than NFS does now, or at least that's what they claim. There's been a lot of good fixes, but this buildbot that Ronnie, Aurelien, Paulo and some of those guys put together has been fantastic because it allows us now to start testing against the POSIX tree. So as we get back to work on some of this stuff, it makes it much safer for us. Both client and server changes because we have the automation for it. One of the test targets is Azure. One's Windows. We have the generic regression target.

Starting point is 00:09:11 If you have your favorite server you want to add, that's great. But up in Azure, we can spin up these various VMs, including with his tree, so we can do automated testing against his tree. Okay, so what could you try today? This is experimental. It is not enabled by default unless you type the word posix. You must mount, you don't have to actually specify vers=3.1.1 because vers=3.1.1 is the default now,

Starting point is 00:09:35 but for older distros, you would have to type that in. You need the mount option posix, that four-letter word. You need to specify that. There are very limited protocol features in it. Now, on the server code, we'll give you some instructions about that, but here is a pointer to his tree, which may change,

Starting point is 00:09:55 but here is his experimental Samba tree that has the prototype server code. And as I said, there are some vendors who have tested this, and we saw this at the Tel Aviv event. Now, that tree hasn't been getting very much love of late, mostly because of the VFS changes that are going

Starting point is 00:10:09 on elsewhere. But what I'm hoping is that once all the VFS turmoil is finished, hopefully by the end of the year, then we can start taking the changes for that and moving them on top of the modified Samba and basically

Starting point is 00:10:24 get it into an experimental version of mainline Samba. So when you pull Samba, you'll get this code by default, just not turned on. You know, we will start testing it, but it's just not turned on in a standard production build of Samba. Good, cool. Okay. So why isn't this shipped already? Well, the problem is we thought we were getting close, and then all of a sudden the Windows Subsystem for Linux

Starting point is 00:10:58 essentially changed the goalposts a little bit. And this is kind of important because it turns out that the way we're implementing in the POSIX extensions, the way we're implementing the file system objects that Windows clients don't want to deal with, that POSIX clients have to deal with, and these are things like symlinks, FIFOs,

Starting point is 00:11:24 Unix domain sockets, character and block devices, et cetera. The way we're doing those is in reparse points. We're exposing them as Windows-style reparse points. And it turns out that everyone has a different idea of what those should look like. Now, our goal is to basically be as close to what Windows does as possible. But now the Windows Subsystem for Linux has defined a new method of exporting these POSIX object types, and they've chosen different reparse point tags for exposing these. So we've basically got to sit down and resolve

Starting point is 00:12:04 this. Right now, what we're doing is using the initial NFS reparse point tag, which is what Windows used to store reparse point data. Now it looks like we may need to change this, and this will mean an on-the-wire protocol change. I mean, actually, that's not true. It means an interpretation of existing fields that we have within our protocol changes. So our protocol changes actually already have

Starting point is 00:12:32 the reparse point tag returned as part of our new POSIX info, but the actual meaning of what gets put in that tag now may need to change. We'll see the good example two slides from now. We can show you exactly why this makes even more sense. And then the other thing is, basically, the original Samba VFS was built around the old Open Group pathname-based operations, and the world is moving to handle-based operations.

Starting point is 00:13:01 And so we really have to build this on top of the at series of calls, like openat, mkdirat, unlinkat, you name it. And if you follow Samba development, you'll be finding a large number of VFS changes. Basically, we decided that now was the best time we had. We just shipped 4.11, so we've got six months. Let's get all these VFS changes in and finished

Starting point is 00:13:29 before we start building POSIX extensions on top. Otherwise, we'll end up with trying to retrofit the VFS changes and the POSIX extensions at the same time. So unfortunately, this is the, why isn't this damn thing shipped already, is essentially that is the thing that has pushed us back a little bit. So one of the things that...
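The move from pathname-based to handle-based operations can be illustrated locally. This is a minimal sketch, not Samba's VFS code: Python's `dir_fd` arguments map onto the at family of calls on Linux, so every operation below is resolved relative to an open directory handle rather than a full path. The file and directory names are made up for the example.

```python
import os
import tempfile


def populate(base):
    """Create and remove entries relative to an open directory handle."""
    dfd = os.open(base, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.mkdir("sub", dir_fd=dfd)  # mkdirat
        fd = os.open("sub/file.txt", os.O_CREAT | os.O_WRONLY, 0o600,
                     dir_fd=dfd)  # openat
        os.write(fd, b"hello")
        os.close(fd)
        before = os.listdir(dfd)  # list through the handle, not the path
        os.unlink("sub/file.txt", dir_fd=dfd)  # unlinkat
        os.rmdir("sub", dir_fd=dfd)  # unlinkat with AT_REMOVEDIR
        after = os.listdir(dfd)
        return before, after
    finally:
        os.close(dfd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(populate(d))
```

Because the handle pins the directory, the operations keep targeting the same directory even if something renames it in the meantime, which is the property pathname-based calls cannot give.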

Starting point is 00:13:52 Actually, I think that's... Yes, there we go. That was the same slide you duplicated by accident. Yeah, so one of the things that's important, that we want to not add a big performance penalty to get POSIX information. And that's one of the things the WSL guys mentioned. They're like.

Starting point is 00:14:11 Yeah, they have exactly the same problem. Yeah, because if you do a query directory, you don't want to have to keep querying every single file in there, right? So yes you will if they ask for stat information, but you'd like to be able to find the file type for these special files. So you can see an example on the next slide.

Starting point is 00:14:30 So instead of having one tag where you have to query if the tag ever shows up, so tag, you know, an NFS tag, and you have to go do a second query on it, having the tag returned, and we verified that Windows does this already. So, and the nice thing about this is it'll work with Windows. It doesn't require an extension to the protocol for this.

Starting point is 00:14:48 It requires that when Samba exposes a local FIFO, that it exposes them the way Windows exposes it. The beauty of the way Windows exposes it now is that it doesn't require an extra round trip for me when you do a query to figure out, oh, that's a char device. No, that's a FIFO. No, that's a symlink.

Starting point is 00:15:06 So it saves round trips. And for directories with large numbers of files or objects in them, those round trips are a killer for performance. You're having to do an extra round trip to the server for every single type that you get back where you know it's a weird object, but you don't know what it is.
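The round-trip argument can be sketched as a lookup table. The tag values below are the ones I understand Microsoft to have published for the NFS and WSL reparse tags; treat them as assumptions and check them against the specification. The function names are hypothetical. A tag that names the file type directly needs no follow-up query, while the generic NFS tag only says "special file" and forces one more request per entry.

```python
import stat

# Assumed values from Microsoft's published reparse tag list.
IO_REPARSE_TAG_NFS = 0x80000014
IO_REPARSE_TAG_LX_SYMLINK = 0xA000001D
IO_REPARSE_TAG_AF_UNIX = 0x80000023
IO_REPARSE_TAG_LX_FIFO = 0x80000024
IO_REPARSE_TAG_LX_CHR = 0x80000025
IO_REPARSE_TAG_LX_BLK = 0x80000026

# Tags that identify the POSIX file type without reading the reparse data.
DIRECT_TAGS = {
    IO_REPARSE_TAG_LX_SYMLINK: stat.S_IFLNK,
    IO_REPARSE_TAG_AF_UNIX: stat.S_IFSOCK,
    IO_REPARSE_TAG_LX_FIFO: stat.S_IFIFO,
    IO_REPARSE_TAG_LX_CHR: stat.S_IFCHR,
    IO_REPARSE_TAG_LX_BLK: stat.S_IFBLK,
}


def file_type_from_tag(tag):
    """Return the S_IF* type if the tag alone identifies it, else None."""
    return DIRECT_TAGS.get(tag)


def extra_round_trips(tags):
    """Count directory entries that would need a second query for the type."""
    return sum(1 for tag in tags if tag not in DIRECT_TAGS)
```

For a directory listing where every special file carries the generic tag, `extra_round_trips` equals the number of special files; with type-specific tags it drops to zero.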

Starting point is 00:15:25 Now, since the last SDC, we talked about this at SambaXP, since the last, a year ago, there was one other change that we made, and we'll talk about that in a minute. So if you tried to use it today, you know, just mounting from your laptop or whatever you had in current,

Starting point is 00:15:43 you know, 5.1 or later, easy. vers=3.1.1 is already there. You don't even have to type that in. posix. We prefer, because we don't want Jeremy to lose any more sleep, that you use client symlinks, mfsymlinks, like Apple, Apple-style symlinks,

Starting point is 00:15:57 because these are only evaluated in the client. They make life a little bit easier, but obviously he can expose server ones too. And on the server side, if you wanted to experiment with Samba, here's mangled names = no, create mask. But one thing that is a little bit unusual just for anybody experimenting with this tree

Starting point is 00:16:13 is the bottom line. That's just a bug. Yeah. It's a little bit odd, so I just mention that. That's a very odd thing. It took us a little while to figure out why that happened, and it's a long story. But if you're experimenting with Jeremy's tree,

Starting point is 00:16:24 those are the four things we'd recommend. Now, remember we talked about the reparse tags. So in practice, what it changes, so here's Jeremy's code. So this is, you know, during the last talk, during Namjae's talk, I did this Wireshark capture. So this is, you know, not fake. Notice the find response, and notice the tag, okay?

Starting point is 00:16:47 So we're enumerating these files, right, whatever, and it's not completely parsed all the code on the client, but notice this tag, that tag changes. So the tag comes back with a strange name, this is just the way Windows defined the protocol, but the tag comes back in EA size. Anyway, not a big deal. Just this is the only thing that changes.

Starting point is 00:17:06 It's a very small change. Okay, go ahead. Yeah, and you can see the query directory example. Yeah, so right now, Wireshark doesn't decode these, but this is basically the POSIX query directory returning the new info level that we need for POSIX information. Yeah, notice the info level?

Starting point is 00:17:32 Yeah. Okay. It's the same information that would be returned in the POSIX create context. Yeah, info level 100. Yeah. Yeah. Yeah.

Starting point is 00:17:43 Okay, next. Yep, cool. Okay, so this is an older slide. You guys may have seen some of this before. But, you know, all those weird path names that you guys love to do. You know, building the kernel is fun because you end up with these path conflicts

Starting point is 00:17:58 that wouldn't work with Windows, right? But there they are. I mean, they work. This is, you know, this code's working. All the weird path names you want to try with question marks or exclamation points. Next one. And, you know, here's what it looks like.

Starting point is 00:18:14 You know, if you want to make sure you mounted it with posix, you can see posix and posixpaths. You can see those in what we display in the mount options. And you should be able to see some of it in the debug info as well. Yep, you can see that POSIX is enabled in the build. Nothing fancy here, but just kind of showing you this isn't a trick.
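Checking the displayed mount options can be scripted. The sketch below parses text in the `/proc/mounts` format and returns the mount points of cifs mounts whose options include `posix`. The option name is taken from the talk; the sample line in the usage example is hypothetical, and the exact set of options a real kernel prints may differ.

```python
def posix_cifs_mounts(mounts_text):
    """Return mount points of cifs mounts that list the posix option."""
    found = []
    for line in mounts_text.splitlines():
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        fields = line.split()
        if len(fields) < 4 or fields[2] != "cifs":
            continue
        if "posix" in fields[3].split(","):
            found.append(fields[1])
    return found


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print(posix_cifs_mounts(f.read()))
```

Splitting the option string on commas, rather than searching for a substring, avoids mistaking `posixpaths` alone for the `posix` option.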

Starting point is 00:18:36 You know, you want to try case sensitivity. Upper and upper are two separate files. Okay? Case sensitivity works. You can create directories with different mode bits. Yay. You know? Here's the POSIX context at work.

Starting point is 00:18:55 This is Jeremy's code to and from my code, right? Keep going. Rename. This is always fun. It's a little hard to read this, but what you have here is, you know, renaming open files, that kind of thing. That would fail with Windows. With POSIX, it works to his code.
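The rename-while-open behavior is easy to see on a local POSIX file system, which is the behavior the extensions aim to reproduce over the wire. This is a local illustration only, with made-up file names; it says nothing about how the client or server implements it.

```python
import os
import tempfile


def rename_while_open(directory):
    """Rename a file while it is open, then keep writing through the handle."""
    old = os.path.join(directory, "old.txt")
    new = os.path.join(directory, "new.txt")
    with open(old, "w") as f:
        f.write("before ")
        f.flush()
        os.rename(old, new)  # the file is still open at this point
        f.write("after")  # the handle follows the file, not the name
    with open(new) as f:
        return f.read(), os.path.exists(old)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(rename_while_open(d))
```

An application written against local Linux file systems takes this for granted, so a remote mount that refuses the rename looks broken to it.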

Starting point is 00:19:13 It's a little hard to read, but you'll be able to see it in the handouts. And this is, I mean, this is basically, this isn't sort of a silly rename stuff like NFS. This is actually mapping directly onto modifications onto the server file system. It turns out that stat of a file system returns more information,

Starting point is 00:19:34 blocks, inode count, you know, fundamental. Here's the local and remote view of the same file system. So stat -f. Notice it's... Yeah, the block size is wrong in there. But the block size is the only thing reported differently. You've got the inode count and all that.
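The fields being compared on the slide are the ones a file system stat returns. A minimal sketch for collecting them, so that the local and remote views of the same export can be compared side by side; the dictionary keys are names chosen for this example.

```python
import os


def fs_summary(path):
    """Collect the block and inode figures that a file system stat reports."""
    st = os.statvfs(path)
    return {
        "block_size": st.f_bsize,
        "fragment_size": st.f_frsize,
        "blocks": st.f_blocks,
        "blocks_free": st.f_bfree,
        "inodes": st.f_files,
        "inodes_free": st.f_ffree,
    }


def differing_fields(local, remote):
    """Return the names of the fields that differ between two summaries."""
    return sorted(k for k in local if local[k] != remote.get(k))
```

Running `fs_summary` on the server's local path and on the client's mount point, then passing both to `differing_fields`, shows which values the protocol mapping changes.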

Starting point is 00:19:59 So anyway, it's interesting getting that right, so keep going. Now, with the POSIX extensions, you can see what it looks like. Notice with the POSIX extensions, things match better. Okay? This is the stat of the file system. Okay, so what are the

Starting point is 00:20:14 gory details? I don't know if you want to talk about the gory details of the negotiate. So this is what's changed. Yeah. So, essentially, the client sends the 16-byte GUID in the negotiate, saying, I would like to do POSIX extensions,

Starting point is 00:20:34 and if the server replies with that, then you know that the server at least speaks that version of POSIX extensions. And the thing about this is that if we ever do need to make a fundamentally incompatible protocol change, we can just version the GUID. We can say, okay, this is the old, you know, I don't do this version anymore. We'll just allocate a new GUID and that's the one that

Starting point is 00:20:55 you get back. So at that point, when you're talking to a server, originally we thought about just making this a new create context and essentially you would just send it on any create context you wanted. The benefits of adding it into the negotiate are such that you know that the server is capable of doing POSIX, and if it denies you a POSIX handle when you request an open, you know that that's deliberate. It's not that it didn't understand POSIX, it's just that it said, for this area of the file system, I ain't giving you POSIX semantics. Now, that might be because you've got an NTFS drive mounted that is actually obeying different semantics or an EXT or an exFAT drive mounted that you're exporting, but it allows the client to at

Starting point is 00:21:40 least make a sensible decision of whether, hey, I didn't get POSIX semantics because the server didn't want to give it to me versus I didn't get POSIX semantics because the server can't do it. So one quick thing. Notice that the context we send is not that big. It basically has a GUID. You know, 100 is the type.
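The negotiate exchange described here can be sketched as follows. Several things are assumptions: the context type is the "100" from the slide read as hexadecimal, the 16-byte GUID is a placeholder rather than the real value, and the header layout (2-byte type, 2-byte data length, 4 reserved bytes, 8-byte alignment) follows my understanding of the SMB2 negotiate context format. The function names are hypothetical.

```python
import struct

POSIX_CONTEXT_TYPE = 0x0100  # assumption: the slide's "100", read as hex
# Placeholder GUID for illustration only, NOT the real POSIX extensions GUID.
EXAMPLE_POSIX_GUID = bytes.fromhex("00112233445566778899aabbccddeeff")


def build_context(ctx_type, data):
    """Pack one negotiate context: type, data length, reserved, data, padding."""
    body = struct.pack("<HHI", ctx_type, len(data), 0) + data
    return body + b"\x00" * ((-len(body)) % 8)


def parse_contexts(buf):
    """Parse a run of negotiate contexts into a {type: data} dictionary."""
    contexts = {}
    offset = 0
    while offset + 8 <= len(buf):
        ctx_type, length, _reserved = struct.unpack_from("<HHI", buf, offset)
        start = offset + 8
        contexts[ctx_type] = buf[start:start + length]
        offset = start + length
        offset += (-offset) % 8  # each context starts on an 8-byte boundary
    return contexts


def server_speaks_posix(reply, guid):
    """True only if the reply echoes the exact GUID version the client sent."""
    return parse_contexts(reply).get(POSIX_CONTEXT_TYPE) == guid
```

Versioning then costs nothing on the wire: a server that has moved to a new GUID simply does not echo the old one, and the client sees `server_speaks_posix` return False for it.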

Starting point is 00:22:03 And then it's not that big. So it's just sending the GUID in it. Negotiate contexts should have been GUIDs in the first place anyway. Yeah, exactly, exactly. That was a whole protocol decision. So this is easy for implementers. It took less than an hour for the other vendor

Starting point is 00:22:20 to do this part, right? Yeah. So let's go to the next one. Now on create, we actually have to send that GUID. So you can kind of see the chain. You've got the durable handle context right underneath it. You've got your context. So the POSIX context sent on the request is actually pretty easy.

Starting point is 00:22:40 Yeah, it's just sending that request. It's basically just adding a new create context type and then parsing the result if you get it back that says you got it. Yeah, and you can see it on the response here. Yeah. Samba's sending it back on the response, and there's a little bit of data in it that has some... Yeah, and it's basically the same kind of information

Starting point is 00:23:01 that you would get back over and above the info level, the standard info levels that you get back when you do a Windows query file info. So the goal, again, I know this is repeating from last year, but the goal with the POSIX extensions was never to send the same information twice. So if Windows already gives you that information,

Starting point is 00:23:22 you don't duplicate it in any POSIX returns. You enhance the information that you could have gotten from Windows. You don't overwrite it, or you don't duplicate it, because any time you send back duplicate information to the client, there's the possibility that the server can send you back two different conflicting values,

Starting point is 00:23:41 and then what do you do? You've no idea where exactly you are. A good example of this is the inode number. We can get the inode number back today in one of the contexts, QFID, right? So today we query that to get the inode number. There's no sense duplicating that because the only code we want to change for POSIX

Starting point is 00:23:58 is stuff that really is different. So sending this create context, getting it back, was a clue that the server said that the rename semantics and delete semantics are POSIX style delete semantics. On this handle. But notice that many of the other things,

Starting point is 00:24:12 like those reparse points and other things, had absolutely nothing to do with it. They would work to a Windows server or a Mac server. Okay, so let's go. So here's an example without the POSIX extensions. We can use, you know, you can use various ways to map mode bits. There's special ACEs, for example.

Starting point is 00:24:35 There's CIFS ACL. There's various ways to map the mode bits. But let's go to the next slide. You know, with the Unix extensions, we now have a better way to do that. Now, what was the number one thing wrong with what we had? The POSIX extensions worked great. What is it, SCO and Hewlett-Packard 25 years or 20 years ago had done this stuff.

Starting point is 00:24:55 Jeremy and I had even occasionally modified it. But there was this WannaCry thing. And you know, SMB 3.1.1 really is pretty good. Now, Apple did some interesting things with their case sensitivity, but they didn't handle all the POSIX compatibility issues. So, although it's a useful experiment, it didn't solve all the problems we needed. Well, so, I mean,

Starting point is 00:25:15 Macs basically are interested in talking to Macs and having the Apple semantics that macOS needs. So, essentially, I mean, the Apple create context is essentially a Mac-to-Mac thing. Now, having said that, Ralph has written modules in Samba that emulate a Mac server.

Starting point is 00:25:33 So Samba now has sort of like triple personality. It can be a Windows server, it can be a Mac server, and with the SMB3 POSIX extensions, it can be something new, which is sort of a hybrid, which actually gives as close to POSIX semantics as we can make it.

Starting point is 00:25:54 Now, there are still some missing things that were... My guess is we probably shouldn't try and get them done before we officially ship, because the perfect is the enemy of the good. And those are things like being able to set mixed case extended attributes. So right now, essentially, you don't get case sensitive extended attributes, which is what Linux at least requires.

Starting point is 00:26:23 We just map the extended attributes you might want to send into the Windows space. And that's probably good enough. Yes, there may be some weird applications that fail. And probably some test cases that check the extended attribute semantics will fail. But we haven't found any actual apps that use that. The one thing we have to be careful about on that is...

Starting point is 00:26:46 So, for security, SELinux, there are trusted and... Well, SELinux is the only place that we may end up having to extend, and that's basically because right now all of the EAs, including the POSIX EAs that you send, they all live in the user namespace. And so, for SELinux, you have the, what is it, the system security names.

Starting point is 00:27:10 System security and trusted. Yeah, you have different namespaces. Right now, we have no way of mapping those into existing SMB3. SMB3 doesn't have the concept of different namespaces for EAs. Yeah. So that would be an extension maybe that we would have to add. So one of the things we would love feedback on on this one is, you know, after talking with, like, the WSL guys,

Starting point is 00:27:31 one option that people have considered is just sending on the wire user.attribute or system.attribute, and that doesn't require as much changing. And as an aside, Ronnie had also done a proposed patch on the CIFS client for basically encoding case into the whatever. That's a possibility to consider. But if you guys have feedback on it, please talk to us at the test event or in the hallways. What that would mean, of course, is that you could never have a system namespace EA that started user.something.

Starting point is 00:28:01 But, I mean, that's horribly confusing anyway. So this is one of those things where we may end up getting away with it basically. And right now, the other thing is the Windows EA namespace is ANSI only, I believe. It's not even, you can't have Unicode. You can't have UTF-8 extended attributes, which I believe Linux can. So this may be somewhere

Starting point is 00:28:26 that when you have a POSIX handle, the EA namespace changes such that if you send it without a user system or security tag, then it goes into user, or if it detects user system security as the first part of the name,

Starting point is 00:28:42 then it puts it in the appropriate namespace, depending on permissions that you've connected with, of course. My theory is that because NFS 4.2, one of the few features in NFS 4.2, was support for SELinux, that probably it's important enough to do. But feedback would be welcome. In addition, there's this evil app, Samba, that sometimes uses EAs. One of the things we might play around with is just seeing if Samba's EA usage

Starting point is 00:29:07 if it was sitting in a whatever would... Basically, it's hard to find apps that use EAs because there aren't as many as you might think. But there are some examples like Samba. Only when we're interoperating with NFS. We would use the system namespace for storing... Right. If we're mapping an incoming Windows ACL

Starting point is 00:29:25 into an NFS v4 ACL, not that the kernel ever looks at that, but then we would... But an example of Samba is it doesn't use... It doesn't rely on case sensitivity, just case preserving. Yeah, that's true. That's true.
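The namespace-prefix idea floated for extended attributes can be sketched as a small parser. This is one possible reading of the proposal, not an agreed design: a name sent on the wire is checked for a leading `user`, `system`, `security`, or `trusted` component (the Linux xattr namespaces), and anything without a recognized prefix falls into `user`. The function name is hypothetical.

```python
KNOWN_NAMESPACES = ("user", "system", "security", "trusted")


def split_ea_name(wire_name):
    """Split an on-the-wire EA name into (namespace, attribute name)."""
    prefix, dot, rest = wire_name.partition(".")
    if dot and rest and prefix in KNOWN_NAMESPACES:
        return prefix, rest
    # No recognized namespace prefix: default to the user namespace.
    return "user", wire_name
```

For example, `security.selinux` maps to the `security` namespace while a bare `comment` maps to `user`. The ambiguity mentioned in the talk is visible here too: a name that happens to start with a namespace word followed by a dot can never be stored under a different namespace with that literal name.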

Starting point is 00:29:37 Yeah. So about once a year, maybe twice a year, the API changes to the kernel. There are minor global changes that happen more often. I think we've talked about some examples. io_uring, async IO. The mount API changed. copy_file_range and clone file range

Starting point is 00:29:57 now support cross-mount copies. Three or four times last year, but on a typical year, once or twice the API changes. One of the things that we wanted to ensure was that we were able to quickly update the protocol if needed, because obviously some of these things we can compensate for, just hack without changing the protocol. But if we do have to change the protocol,

Starting point is 00:30:16 we don't control what all these guys do in that room over there arguing about NVMe or whatever. And if they change it, we have to be able to adapt quickly. One of the reasons why we decided to do it on this, it's basically a GUID idea. It will allow us to do this in the future because we can't predict the future. And obviously we need to have much better interaction

Starting point is 00:30:37 with communities. If you look at what's driving NFS, guess what? A lot of interaction with the containers community. What we have to be very aware of is we have to interact well with the database and containers community because in a lot of ways, this is the commodity protocol everybody should be using. Yeah. And

Starting point is 00:30:53 of course, as soon as Namjae's code gets into the Linux kernel, and if they start implementing this, we end up with two implementations and having tests and tracking changes then becomes, you know, a million times more important at that point. Because it's not just him talking to the Samba server. It's not just Steve's Linux kernel client talking to the Samba server.

Starting point is 00:31:15 At that point, it becomes an actual ecosystem that we have to make sure keeps working. Even at the Tel Aviv event, having that one vendor doing this was helpful. Yeah. Let's see. So we talked about create, rename. Obviously, when the files are open, it fails in Windows. Obviously, we have to support that. And that's different semantics in POSIX.

Starting point is 00:31:35 Let's go on to the next one. There's more and different stuff that comes back. One of the things that I found fascinating, we were looking at get info, I think, with Aurelien last week. There's a getattr call, and it takes like four fields, and most file systems ignore two of the fields.

Starting point is 00:31:53 There's a lot that can come back on getattr. So we need to be able to, you know, there's more metadata, and a lot of file systems ignore this. And, of course, POSIX locking, your favorite topic. I mean, Volker has made some fantastic changes in the Samba server to eliminate some of the SMB1 insanity and weirdness. And so it should be a lot easier as we move forward to map the POSIX semantics onto our existing SMB2 only backend.

Starting point is 00:32:23 But we should be very careful because one of the things that is so subtle is what's wrong with the slide? Do you see the word POSIX shouldn't be there? Do you know why? Well, yes. Because they're not POSIX locks. Yes, but they're close enough, or they are close enough that you've got a POSIX handle.

Starting point is 00:32:39 They're OFD locks, right? Yeah, yeah. Well, okay, yes. So it's funny, Jeff Layton, a guy who's been at some of these events before, had noted that POSIX locks are basically useless for most cases. And so many of the applications now use OFD locks,

Starting point is 00:32:53 which are POSIX in some sense, but they're stackable. Or rather, they still use the POSIX because they're using them in a way that they think OFD locks behave. They still set the standard handles. But over the wire, Samba, by accident,

Starting point is 00:33:09 by a very happy and lucky accident, has always implemented OFD locks. And that's much friendlier for most Linux use cases than what's recommended. FS info, there's a few extra fields. We talked about that with supports. It works. Okay. Your pain points.
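The difference between the two lock flavours discussed above comes down to who owns the lock. The following is a toy in-memory model, not kernel or Samba code: the `OpenFile` and `LockTable` names are made up for illustration, and it only tracks a single exclusive whole-file lock. It assumes the usual Linux behaviour, where a POSIX lock belongs to the process and an OFD lock belongs to the open file description.

```python
from dataclasses import dataclass, field
from itertools import count

_ofd_ids = count(1)


@dataclass(frozen=True)
class OpenFile:
    """One open file description, as created by a single open() call."""
    pid: int
    ofd: int = field(default_factory=lambda: next(_ofd_ids))


class LockTable:
    """A single exclusive whole-file lock under one of two ownership rules."""

    def __init__(self, style):
        if style not in ("posix", "ofd"):
            raise ValueError(style)
        self.style = style
        self.owner = None  # owner key of the current holder, if any

    def _key(self, f):
        # POSIX locks belong to the process; OFD locks belong to the
        # open file description.
        return f.pid if self.style == "posix" else f.ofd

    def try_lock(self, f):
        key = self._key(f)
        if self.owner is None or self.owner == key:
            self.owner = key
            return True
        return False

    def close(self, f):
        # Closing drops whatever this handle's owner key holds. Under POSIX
        # rules that is every lock the process has on the file, even one
        # taken through a different descriptor.
        if self.owner == self._key(f):
            self.owner = None
```

With two opens from the same process, the second `try_lock` succeeds under the "posix" rule and is refused under the "ofd" rule, and closing the second handle silently drops the first handle's lock only in the "posix" case. That surprise is one likely reason applications find plain POSIX locks hard to use safely.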

Starting point is 00:33:23 So the perspective. Once the VFS changes are done, we can start. The actual patch set in that somewhat moribund tree is reasonably small. So once the VFS changes are done, we can probably move stuff over. And at that point, the test. So Steve's plan is great. It's very easily changeable, we'll work with it but at that point what we really need is more and more

Starting point is 00:33:49 protocol test suite changes inside, and of course as this is Samba, inside smbtorture, so at that point we can really nail down, okay, what exactly are the semantics that a POSIX handle expects

Starting point is 00:34:08 and is willing to grant? Now, it may turn out that what we thought we were doing isn't what we're actually doing, in which case we have to either document what we wanted it to do or what we actually do. And we haven't gotten to that point yet. That's where, essentially, the documentation piece gets really important. Right now, because the code isn't in Upstream Master,

Starting point is 00:34:33 I still think it's a little too early to try and write down exactly what the protocol should be, especially with the WSL changes that came through recently. I was really excited. Had we standardized before then, we would have standardized too early, and we would have had those RAM trips baked

Starting point is 00:34:49 into the version of the protocol we were shipping. I was really excited. Volker, if I remember correctly, had done some smbtorture tests, right? Yes, yes. He's added some. One of the things that was kind of fun is that those were able to be leveraged by people who were experimenting, right? Because you have the smbtorture tests.

Starting point is 00:35:06 You have the Linux client. The Linux client implements half of the stuff. smbtorture implements more. So we have more stuff to try against Jeremy's code than you'd think. So details are super easy. You've got a POSIX negotiate context, 100. You include a GUID. Now you send a tree connect.

Starting point is 00:35:30 Maybe in the future that'll change, but right now nothing. This is very easy. Nothing much. An open context and a negotiate protocol context in one new info level. Keep going. We're already showing case sensitivity.
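As a rough illustration of how small the negotiate piece is, here is a sketch that packs one SMB2 negotiate context (type, data length, reserved field, data). It assumes the "100" mentioned above is the context type 0x0100 and uses sixteen zero bytes as a placeholder where the real extension GUID would go. The function and constant names are hypothetical.

```python
import struct

POSIX_NEG_CONTEXT_TYPE = 0x0100  # assumption: the "100" mentioned in the talk

# Placeholder only: the real 16-byte extension identifier is not given here.
PLACEHOLDER_GUID = bytes(16)


def build_negotiate_context(context_type, data):
    """Pack one SMB2 negotiate context: type, data length, reserved, data.

    The result is zero-padded so that a following context would start
    on an 8-byte boundary.
    """
    header = struct.pack("<HHI", context_type, len(data), 0)
    body = header + data
    padding = (-len(body)) % 8
    return body + b"\x00" * padding


ctx = build_negotiate_context(POSIX_NEG_CONTEXT_TYPE, PLACEHOLDER_GUID)
```

The same type-plus-identifier shape is what makes the "GUID idea" mentioned earlier easy to version: a later revision could ship a different identifier without changing the framing.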

Starting point is 00:35:43 Yeah, now, if you support the POSIX context, what are we expecting? You support case-sensitive names, you support POSIX semantics for unlink and rename, you support advisory OFD locks, and if you want a boring description of OFD locks, here it is. And that the path names are not

Starting point is 00:36:01 remapped. Yeah. They're still UCS2, but essentially the Windows restrictions on standard path names go away. And no streams. No streams! Sorry. If you want streams, open a Windows handle. Yep.

Starting point is 00:36:16 Hard links are just hard links. Nothing to do. Distinct reparse point tags, that's the one change. Notice the cross out there. We have an ACE with a special SID that allows you to set the mode bits. And, you know, fallocate and other things are just mapped to existing SMB3 operations where possible. Yeah. I mean, the existing SMB3 operations are actually

Starting point is 00:36:35 rich enough to cover, I think, do they cover everything Linux fallocate does or is it close? There's a collapse range that we've been thinking about, like where you take a hole out of the middle of a file and smush it, where it's a two-step process, so it's a question whether you can do it atomically safely, but yet you could emulate it. But I think those are kind of, a lot of file systems don't support those, so I think we can go forward.
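The two-step emulation mentioned above can be sketched on a local file: shift the tail of the file down over the removed range, then truncate. This is only an illustration of why atomicity is the open question, not how any SMB client or server does it. The function name is hypothetical, and unlike the real Linux collapse-range operation it imposes no block-alignment rules.

```python
import os
import tempfile


def emulate_collapse_range(path, offset, length, chunk=1 << 16):
    """Remove `length` bytes at `offset` by shifting the tail down, then truncating.

    Not atomic: another reader can observe the file between the two steps.
    """
    size = os.path.getsize(path)
    if offset < 0 or length < 0 or offset + length > size:
        raise ValueError("range must lie inside the file")
    with open(path, "r+b") as f:
        # Step 1: copy everything after the removed range down over it.
        src, dst = offset + length, offset
        while src < size:
            f.seek(src)
            block = f.read(min(chunk, size - src))
            if not block:
                break
            f.seek(dst)
            f.write(block)
            src += len(block)
            dst += len(block)
        # Step 2: cut off the now-duplicated tail.
        f.truncate(size - length)


# Small demonstration on a temporary file.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"0123456789")
os.close(fd)
emulate_collapse_range(demo_path, offset=2, length=3)
with open(demo_path, "rb") as f:
    demo_result = f.read()
os.remove(demo_path)
```

Between step 1 and step 2 the file still has its old length and a duplicated tail, which is exactly the window a concurrent reader could see.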

Starting point is 00:37:01 And by the way, if you want to beat on Jeremy, it would be really nice to get XFS and some of the other file systems. Right now, all of these operations in Samba require the BTRFS. Yeah, yeah. But that's another reason that he's having to rewrite the VFS.

Starting point is 00:37:15 Take a number, join the queue. So one of the reasons you have to rewrite the VFS is because some of the things like fallocate require call outs to file system specific stuff and aren't general operations. But notice something really cool here. A lot of this doesn't actually require POSIX extensions. A lot of this stuff would work to any server.

Starting point is 00:37:36 So that's the goal, small. So let's see, I think we've already covered that. How it works to create context. You can have POSIX and non-POSIX open depending on what the handle was that you asked for. Let's see, yep. If you want to see the owner, if you want to pass the UID or GID,

Starting point is 00:37:56 it's the same way, right? Well, so this is basically, this is the fallback. If you're talking to a server that isn't in Active Directory and has its own UID and GID space, this is what you would get back. But this is yeah.

Starting point is 00:38:18 Let's see. Actually, Aurelien had a good link. I may include it in my presentation tomorrow. I'll link to stuff that explains this better. See, so that's the only other... Yeah, so this is the only info level.

Starting point is 00:38:36 And notice it's a pretty simple payload. 216 bytes, whatever. This is what it looks like. You've got the POSIX create response plus device ID, I know, whatever. It's not that much. Yeah, and it's basically based on top of the Windows all info level with some extra things.

Starting point is 00:38:52 So, thank you Aurelien over there because he has a dissector here and Pike sample test code. So we have Volker's test code, we have Aurelien's test code, we have a Wireshark dissector, we have two servers, one open source, one closed source. We have my client.

Starting point is 00:39:09 It's enough to experiment with. See. Okay. So what's the hard part? We have to examine every single, there's 560 of these things, xfstests, and every single failure. And there's about 200 xfstests that have no relevance to a network file system, but 350 of these, we have to go through every single one

Starting point is 00:39:31 and see if there's anything we missed. So I can probably do some parallel work on the new reparse point stuff in the separate tree that's using the old VFS layout just so that we can experiment and understand what it is that needs to look like. But my day job, basically, 99%

Starting point is 00:39:52 of my time is basically cranking out and fixing up the VFS changes that we need to modernize the Samba internal VFS. Okay, yes, I think that's the last slide. So if anyone has any questions, we'd be very happy to... Yes? So if I were to simplify the goal,

Starting point is 00:40:14 like the purpose of the SMB3 POSIX extensions, would the following be the correct statement? So the goal of the SMB3 POSIX extensions is to make native Linux applications work with a remote SMB server

Starting point is 00:40:38 as if it were an NFS v4 server. Oh no, no, we already... so be careful about that. Better than that. Sorry. So the question was, is the purpose of the POSIX extensions to make SMB3 Linux to POSIX work as well as NFS v4?

Starting point is 00:40:55 And, of course, NFS v4 sucks, so much better than that. So if I was going to say, what is my gold standard, what I would really love POSIX extensions to be, I would like you to be able to boot a diskless Linux kiosk from SMB3 and have everything mounted from an SMB3 file server. The device directory, absolutely everything, and everything just work.

Starting point is 00:41:23 Probably SELinux won't work right now, but eventually we will get there, so that you'll be able to run everything, even SELinux-aware apps over SMB3. So the gold standard is the diskless boot. The platinum standard is SELinux. But we should be leaving NFS v4 in the dust. So one of the things that was interesting,

Starting point is 00:41:43 completely independently, somebody came up to me and was like, yeah, Microsoft Distinguished Engineer, VP guy, tell me when you can boot off SMB3. Diskless. And I think I can understand why. There's a lot of cases where it makes sense to have only remote storage or a swap remote

Starting point is 00:41:59 or boot remote, whatever. Now, interestingly, one of Aurelien's colleagues, Paolo, we just put those changes up there for booting over SMB. Now, there's two lines of change or three lines. There's a very tiny bit of other change needed in one of the network drivers. But why can't you boot over anything other than SMB1? It's because of those special files.

Starting point is 00:42:28 So SMB3, the only thing stopping Paolo, Aurelien's colleague at SUSE, there was a little news article published last week about it, from booting over SMB3 is the special files. And those special files you can actually do, technically, without the Unix extensions. So remember the goals here were do everything you can with normal SMB,

Starting point is 00:42:48 recommend whatever you can with things like reparse points and other things that work, and only the things that you can't do with things like the reparse point example or some of these special ACEs or things like that, only those special things make part of the extension. So that POSIX behavior on open... So here's a question for you, Steve, that I have.

Starting point is 00:43:07 Right now, when you're sending me a symlink, are you sending me an NFS-style reparse point symlink, or are you sending one of the new ones? So you're going to have to change that, because I'm just storing what you send me. And so I'm sending the wrong thing on that one. Although, to be honest, by default, I don't send them either one,

Starting point is 00:43:24 because by default, I use the Apple-style symlinks. And why? Because poor Jeremy has, I don't want him to, like, go crazy. Because if I make a symlink that only the client recognizes, he has more time left to finish the extension. So the Apple-style ones are client-only evaluated. Yeah. But that essentially means that when you hit a stopped on symlink, when you're passing a path name and you hit stopped on symlink,

Starting point is 00:43:53 it means right now we are returning an NFS style stopped on symlink and we're going to need to change that. Right, and I recognize those. WSL. So this is why I didn't want him to ship his client code out the door quite. The good news is that only if you created it over NFS

Starting point is 00:44:10 or only if you created that symlink locally would you ever see that. Because otherwise it would be like Apple. I will map it into a reparse point on return. Question, yes. Yeah, currently FreeBSD only implements user and system namespaces for our servers. Are security and trusted eventually going to be required for proper...

Starting point is 00:44:28 So the question is, FreeBSD right now only uses user and system. Are you going to require trusted... So, I mean, I don't think so. If we do this the simple, naive way of basically saying, okay, on a POSIX handle, you can send a full namespace qualified EA name. Then at that point, essentially, what you send is what we would store, the

Starting point is 00:44:51 namespace we would try and store into. If you never send me the trusted and the other namespaces, I never try and map into them. I never try and use them. Having said that, once those things are there, you would probably see them

Starting point is 00:45:09 when you enumerated EAs. That's the other thing. Are you going to see these namespaces when you actually open a handle on a file and say, get me a list of all EAs that this file has? Right now, I think we actually filter out the system namespace.

Starting point is 00:45:27 I would have to... This is why this is the area that we are least sure about. I think what we need to do is actually to write some test code on a local file system that supports these EA namespaces

Starting point is 00:45:43 to find out if you're running as a non-root user and you say enumerate the EAs on this file, what do you get back? Do you actually see the system? Do you actually see trusted? Or do you only see the user ones that you could access? Right now, I simply don't know the answer to that question.
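The local experiment described above is easy to sketch. The helper below groups extended-attribute names by namespace prefix, and a second function applies it to whatever the calling user is allowed to list on a path. Both function names are hypothetical, and `os.listxattr` is only available on Linux, so that part is guarded. Running it once as root and once as an ordinary user on the same file would show which namespaces are filtered, which this sketch makes no claim about.

```python
import os
from collections import defaultdict


def group_by_namespace(names):
    """Group extended-attribute names such as 'user.comment' by their prefix."""
    groups = defaultdict(list)
    for name in names:
        namespace, _, rest = name.partition(".")
        # Names without a "namespace." prefix are collected under "".
        groups[namespace if rest else ""].append(name)
    return dict(groups)


def visible_namespaces(path):
    """Return the EA namespaces the calling user can enumerate on `path`."""
    if not hasattr(os, "listxattr"):
        raise NotImplementedError("os.listxattr is only available on Linux")
    return sorted(group_by_namespace(os.listxattr(path)))
```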

Starting point is 00:46:01 I don't think it's ever going to be required unless you're running a Linux app against a FreeBSD server, but then that would fail in the same way if you weren't able to store those namespaces anyway. I think in most cases, Linux systems work with file systems

Starting point is 00:46:20 that don't store EAs perfectly. The SELinux stuff may not work. So SELinux wouldn't be able to enable, but there are example file systems that people have booted over that don't have support for this. But right now, I think probably NFS might be the only remote file system

Starting point is 00:46:37 that is even dreaming of supporting SELinux. Well, and how many people mount with NFS 4.2? Yeah, exactly. I mean, so realistically, I think in 90% of the cases, they're not actually even going to use it. Are we out of time? Oh, we've got five minutes. Okay, so any other questions?

Starting point is 00:46:53 Yeah. I have another one. Sure. We have native NFS v4 ACLs. We don't have POSIX.1e ACLs. So I was wondering, in case where a client requests ACL information, is it better to just creatively lie? Oh, so the question is you don't have POSIX ACLs,

Starting point is 00:47:10 you only have NT ACLs. POSIX ACLs do not exist in the SMB3 extensions. POSIX ACLs are gone! Because POSIX ACLs don't map very well to SMB, and they expose UID and GID information. So, for SMB3, we made a decision early on, for SMB3 POSIX, it's Windows ACLs only.

Starting point is 00:47:32 And so Windows ACLs map very well into NFS v4 ACLs and vice versa, so you just use our mapping layer to, or have a mapping layer that says, hey, it's an NFS v4 ACL, map it into Windows, there you go, and do it on get and set. So, yeah, that's no big problem.

Starting point is 00:47:48 Yes? Just to clarify something, or to make sure I'm thinking this correctly. So the new reparse tags, like the IO reparse tag LX FIFO, character, block, those would only be returned if I opened up the directory

Starting point is 00:48:04 with the POSIX create context? So I think, so the question is, would you see those new reparse point tag types if you opened up with a POSIX handle or not? Yes, you see them always, because I ran them to Windows. So some of the Windows guys behind me, right, they reminded me that you can see these tags in a directory listing and whatever. So I

Starting point is 00:48:32 took my laptop and I mounted a Windows system and created those, and like, oh, they're right. So they come back to normal. Yeah, they're just Windows reparse point types, so you'll see them on any directory handle. The thing is, at least going to Samba, I won't let you create them unless they're those types.
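For readers who want something concrete, the special-file tags under discussion can be written down as a lookup from Unix file type to reparse tag. The numeric values below are the ones published in Windows headers for the WSL tags as best recalled here. Treat them as assumptions to check against current documentation. The function name is hypothetical.

```python
import stat

# Tag values as published in Windows headers; treat them as assumptions and
# verify against the current documentation before relying on them.
REPARSE_TAG_BY_TYPE = {
    stat.S_IFLNK: 0xA000001D,   # IO_REPARSE_TAG_LX_SYMLINK
    stat.S_IFSOCK: 0x80000023,  # IO_REPARSE_TAG_AF_UNIX
    stat.S_IFIFO: 0x80000024,   # IO_REPARSE_TAG_LX_FIFO
    stat.S_IFCHR: 0x80000025,   # IO_REPARSE_TAG_LX_CHR
    stat.S_IFBLK: 0x80000026,   # IO_REPARSE_TAG_LX_BLK
}


def reparse_tag_for_mode(st_mode):
    """Return the reparse tag for a special file, or None for regular files and directories."""
    return REPARSE_TAG_BY_TYPE.get(stat.S_IFMT(st_mode))
```

A client could use a table like this in both directions: choosing the tag to set when creating a special file, and turning a tag seen in a directory listing back into a file type.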

Starting point is 00:48:53 So you can never create... Right now, we don't support a generic reparse point type, mainly because Windows reparse points can be created on directories, which I think is a nightmare and a terrible idea. So we never implemented that. So right now, the only way you can see those tags will be if you've opened a POSIX handle,

Starting point is 00:49:15 created a zero-length file, and then you try and stick one of those reparse point types on there, and that will then be, okay, so now you're turning this file into a client-seen FIFO, or now you're turning this file into a client-seen symlink, or a client-seen Unix domain socket, et cetera. So, you know, there's no sort of SMB3 call saying,

Starting point is 00:49:41 oh, create me a symlink. No, you open the, at least for POSIX, you open a zero length file and you set that reparse point type on it. And that's then your symlink. You send me the data blob that you want me to store. And then as a client, when you blunder into it, I come back and say,

Starting point is 00:49:58 oh, you blundered into a reparse point that's of type WSL symlink, and then you either read it or you hit it when you run into the stopped on symlink error. Does that help? It's not. So because we only discovered this change

Starting point is 00:50:18 on the WSL tags last week, this only exists in our heads. This doesn't exist as code yet. So that's going to take a while. Yes, another question. I don't know if you missed that slide, but you mentioned those special SIDs with NFS, UID, GID numbers, and mode included. Are they presentation only? And who does your access permission check? So the question is the NFS, the Unix permission modes, et cetera,

Starting point is 00:50:54 they are essentially presentation only because when you set those using the set ACL, the server gets to decide what it turns that into on the file system. So right now in Samba, we map the mode bits that you send. You do a special set ACL with that ACE, and we will map that into a chmod, right?

Starting point is 00:51:18 But that then goes through our mapping layer of turning that into an ACL, et cetera, et cetera. So that's essentially server-dependent. So when a client does chmod 755, you send that, you open a POSIX handle, you send that special

Starting point is 00:51:35 SID saying, I want to set a mode, and then the server will do what the server decides to do with that request. It may turn it into a Windows ACL. It may put it as a direct. If it's underneath a POSIX file system,

Starting point is 00:51:51 it may put it as a POSIX, change the POSIX mode bits. You don't know what it's doing at that point. So one of the things that's kind of fun is that, once again, this doesn't actually require the POSIX extensions. So one of the things that was discussed among the SMB client developers was, server doesn't understand it, so what? In NFS, a lot of the times, you're just evaluating permissions on the client.

Starting point is 00:52:12 You really don't care much about the server. As long as the mode bits persist, it's good enough. You trust the client. You just want the mode bits to be evaluated correctly on the client. So in that model, we have this thing called modefromsid. That mount parm, modefromsid, doesn't even require the POSIX extensions. Oh, yeah.

Starting point is 00:52:27 And the cool thing about that is it sets the special ACE. The special ACE never matches an existing user, so it never is relevant. The only thing that matters is the mode bits are perfect. The client sees the exact mode bits. Permissions can be evaluated on the client. You can also estimate the ACLK holes on the server, but it's hard because obviously with inheritance and things.

Starting point is 00:52:48 Because to be honest, the only thing that really matters is the access decision that the server makes when you request an open. So you can set whatever mode bits you like, and they can maybe be restored, but it's when you do that, can I open it for this? That's when the rubber hits the road and you get the handle back or not.
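To make the "special ACE carries the mode" idea concrete, here is a minimal sketch of encoding and decoding permission bits in a SID string. It assumes the NFS-style convention in which the mode is the last sub-authority under `S-1-5-88-3`. That prefix and both function names are assumptions for illustration, not something stated in the talk.

```python
MODE_SID_PREFIX = "S-1-5-88-3-"  # assumption: NFS-style "mode" SID prefix


def mode_to_sid(mode):
    """Encode permission bits (for example 0o755) as the last sub-authority."""
    if not 0 <= mode <= 0o7777:
        raise ValueError("mode must fit in 12 permission bits")
    # Sub-authorities are written in decimal in the string form of a SID.
    return f"{MODE_SID_PREFIX}{mode}"


def sid_to_mode(sid):
    """Return the mode carried by a mode SID, or None if it is some other SID."""
    if not sid.startswith(MODE_SID_PREFIX):
        return None
    tail = sid[len(MODE_SID_PREFIX):]
    return int(tail) if tail.isdigit() else None
```

Because such a SID never matches a real user, an ACE that names it has no effect on the server's access decision. It simply travels with the file so the client can read the exact mode bits back.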

Starting point is 00:53:05 If you wanted to play without the POSIX extensions, try the modefromsid mount option. Oh, okay, I think we're out of time. I haven't got my glasses on. Yes, oh, one last question. So when you set the mode on a directory,

Starting point is 00:53:21 let's say 0755, and then you try to create a file or folder inside the directory. Do we actually derive the mode bits as well on the newly created folder or file and then show that in Linux? So the question is, if you set the mode bits on a directory

Starting point is 00:53:40 and then you create something inside it, and what happens about deriving mode bits from the containing directory? That's a server decision. So yes, the server has the information because you set the mode bits. The server could, if it's on a POSIX file system, say, oh, I'm going to do standard POSIX

Starting point is 00:54:00 inheritance of group or whatever and set the mode bits. But it doesn't have to do that. And as a client, you're going to have to cope with it not doing that anyway. I mean, at least on Windows, it would ignore it and it would set whatever the inheritable ACL flags were anyway.

Starting point is 00:54:16 So, yes, theoretically you should get... So, if essentially you're talking to a Samba server that's on top of POSIX and he's doing as much POSIX pass-through as it can, then yes, everything will work perfectly. But your app, running through Steve's client code, still has to work against a Windows server,

Starting point is 00:54:34 which will do nothing of the kind and not even send that. So, I mean, you know, the goal is that it tries to... Oh, I think we just dropped off the internet. Never mind. The goal is... Plus you need to restart your Chrome, I think. Anyway, the goal is that

Starting point is 00:54:49 we try and do our best effort, but we can't guarantee that it's going to be perfect. Because remember, you're talking to a remote server. The remote server may not even be running your OS. It may not have the same system calls. So you're asking it, please do this

Starting point is 00:55:06 thing, but whether it does so or not is entirely up to the server. Does that help? Got one last one, then they're going to kick us out. Yeah. Server always makes the call. Sorry, the question is what happens when the server makes the call.

Starting point is 00:55:20 So the question is, the server makes the call. In the case of the Samba server, does it look at the smb.conf settings? So the question is, in Samba, do the smb.conf settings trump any client requests? Always. If you say I always want these mode bits, you will always get those mode bits on the definition in the share. If you say

Starting point is 00:55:41 I never want to see these mode bits, we will always remove those mode bits. The smb.conf parameters set by the administrator of the server are always truth. Just because his sneaky client says, I want this to be 777 so my friends can see it. If your

Starting point is 00:56:00 Samba server administrator has said, thou shalt never set any other mode bits, they'll be wiped out. And we'll just, you know, he thinks he's set them, but when he queries them he'll get back the mode bits we set. Yeah? And that happens in Linux today, by the way. Yeah. Okay, and I think

Starting point is 00:56:16 given that, we're out of time, so thank you very much. Thanks a lot. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
