Storage Developer Conference - #39: SMB3.1.1 and Beyond in the Linux Kernel

Episode Date: April 5, 2017

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 39. Today we hear from Stephen French, Principal Systems Engineer, Protocols, Samba Team / Primary Data, as he presents SMB 3.1.1 and beyond in the Linux kernel, providing optimal file access to Windows, Mac, Samba, and other file servers, from the 2016
Starting point is 00:00:58 Storage Developer Conference. I'm Steve French. I've given talks at SNIA before, so many of you know me, and obviously we have a lot of people from the Samba team and Microsoft here. I'll be talking about the current status of the Linux client and also some of the things we're working on. This week we have a plugfest going on downstairs, and next week at Microsoft we'll
Starting point is 00:01:18 be continuing to work on these. So it tends to be a super intense time for all of the Samba developers and all of the kernel developers, because we're rapidly trying to remember the things we were last working on three months ago at the last one of the plugfests, or two months ago. So it's a very interesting time. But what I want to do is get you guys up to date a little bit
Starting point is 00:01:38 on where we are with the kernel client. Obviously, the kernel client is largely independent of Samba, but it does have pieces in common in user space and it has many of the same challenges Volker has done some great work with SMB client especially with
Starting point is 00:01:56 its performance and there's been some recent work on actually very exciting work on Samba server we can talk about okay I work for primary data Actually, very exciting work on the SAML server we can talk about. Okay. I work for Primary Data. And at Primary Data, we have some very interesting NFS work going on with FlexFiles, parallelizing I.O., the follow-on to the file layout.
Starting point is 00:02:18 But I'm not going to be talking about that. I'll be talking more about SMB stuff today. We had a talk this morning about some of the stuff our company's doing. Okay. I'm going to be talking about some of the file system activity generally, then the status of some key features, a little bit of discussion about performance and bugs, and then
Starting point is 00:02:36 Jeremy's going to be giving a talk later about POSIX extensions for SMB3. Jeremy and I and others had done extensions for POSIX for CIFS in the past, but we're long overdue, probably three years overdue, for getting this into SMB3, partly because SMB3 does 80% of what we wanted, and partly because we've been busy doing other SMB3 features. Okay, so I work on the Linux kernel. The Linux kernel is an amazing project.
Starting point is 00:03:06 I think all of you are aware of just how many people, I mean, from Christoph Hellwig on down, some really amazing developers working on the Linux kernel. The scale of it is staggering. But to give you some ideas of it, 12 months ago,
Starting point is 00:03:20 we had this crazy Linux version 4.2. Like clockwork, every 10 or 11 weeks, we have our new release. I'm not a real fan of the name here, but that's what they call it. Each release has its own name.
Starting point is 00:03:35 So we are almost at 4.8. Now, we had a file system summit. Some of you guys were there. File system summit has some impressive developers. These are not all the people working on Linux file systems, of course, but it's a good chunk of them. But it gives you an idea of the kind of people we have to deal with.
Starting point is 00:03:54 So in the Samba world, we have one set of people. We see each other at SambaXP. In the Linux file system world, you have another set. You have other sets you deal with at the Storage Developer Conference. There is some overlap. But a lot of the amazing work you see in Linux is done by this group.
Starting point is 00:04:10 Now, what are the things that I hear about in the background that are driving some of this activity? I think this conference is the best time to actually get a feel for that. But some of the things you're seeing are better support for NVMe.
Starting point is 00:04:21 I think Christoph gave a... Actually, we had a couple talks about this today, I think. Also, new cheaper RDMA adapters, these low latency adapters, and much higher network bandwidth are enabling us to do things with NAS that wouldn't be possible before. At the file system summit we had a lot of discussions about how we're going to get rich ACLs. There's some violent disagreement about rich ACLs in the kernel. SMB has always supported a richer ACL model. NFS 4.1 and NFS 4.0, copying SMB to some extent, had a similar ACL model,
Starting point is 00:04:52 but there's strong resistance against the idea of having deny ACEs over the more primitive POSIX ACLs in the Linux kernel. We keep trying year after year to push this. It looks so close. So close. We also have violent agreement on xstat. Everybody seems to be behind this idea of xstat, as long as there's no bike shedding, as they put it.
Starting point is 00:05:17 As long as we don't keep adding. But the reason this matters to us is that this would, for the first time, allow Linux to not only get, but set, birth time. It would allow us to get at key information that today there's no POSIX API for. Volker, Jeremy, all the Samba team guys daily have to be setting stuff into attributes and dealing with non-atomic calls to query and set information that really should be part of the metadata that comes back.
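For context, the xstat proposal discussed here eventually landed as the statx(2) system call in Linux 4.11 (glibc wrapper in 2.28), including the birth time. A minimal sketch of that later interface, not of anything that existed at the time of this talk:

```c
/* Query a file's birth (creation) time via statx(2).
 * Requires Linux >= 4.11 and glibc >= 2.28. */
#define _GNU_SOURCE
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct statx stx;

    if (argc < 2)
        return 1;
    /* Ask only for the birth time; the kernel sets STATX_BTIME in
     * stx_mask only if the filesystem could actually supply it. */
    if (statx(AT_FDCWD, argv[1], 0, STATX_BTIME, &stx) != 0) {
        perror("statx");
        return 1;
    }
    if (stx.stx_mask & STATX_BTIME)
        printf("btime: %lld.%09u\n",
               (long long)stx.stx_btime.tv_sec, stx.stx_btime.tv_nsec);
    else
        printf("birth time not available here\n");
    return 0;
}
```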
Starting point is 00:05:41 Copy offload, this is fascinating. To be able to get that Blu-ray disc copied in half a second without the client doing any work. We have extensions in Windows Server 2016 that I'll talk about later, but there's been a lot of push here. I think you've seen in NFS, you've seen in XFS, implementations of copy offload.
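For reference, the syscall-level entry point that grew out of this push is copy_file_range(2), merged in Linux 4.5 (glibc wrapper since 2.27). A minimal sketch; whether the copy is actually offloaded to the server depends on the filesystem, and the kernel quietly falls back to an ordinary data copy when it is not:

```c
/* Copy a file with copy_file_range(2); on filesystems with copy
 * offload, the bytes never have to cross the wire to the client. */
#define _GNU_SOURCE
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct stat st;

    if (argc < 3)
        return 1;
    int in = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0 || fstat(in, &st) != 0) {
        perror("open");
        return 1;
    }
    off_t remaining = st.st_size;
    while (remaining > 0) {
        /* NULL offsets: use and advance the files' own offsets. */
        ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
        if (n <= 0) {
            perror("copy_file_range");
            return 1;
        }
        remaining -= n;
    }
    return 0;
}
```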
Starting point is 00:06:05 We had it in BTRFS. David did some great work on enabling the APIs for this. This is so important to cut the network bandwidth as you move files within a server driven by client activity. And, of course, virtualization, Hyper-V, as well as Linux virtualization and VMware have driven a lot of the activity, not just in NFS, but also in other protocols.
Starting point is 00:06:31 Being able to improve our support for sparse files and fallocate. And, of course, a lot of workload shifting. What was the last talk on? You know, Swift. A lot of cloud workloads. We have new access patterns. These general things are driving a lot of the work we have to do. Now,
Starting point is 00:06:48 what happens in Linux? Linux file system activity is huge. Many of you guys, you know, 4,000 change sets in the Linux file system area, that's huge. These are also very heavily checked, very heavily reviewed, especially in some of the core areas like
Starting point is 00:07:03 EXT4 and XFS, for example. Now, the Linux kernel activity is down, though. The file system activity is down about 10%, probably because of the gradual maturing of file systems. Still, this is a hugely active area of development: over 5% of the kernel changes. Obviously most kernel changes are driven by drivers, new hardware devices, weird embedded devices, but over 5% of it is still file system code. The CIFS code represents about 42,000 lines of that,
Starting point is 00:07:48 not counting the user space stuff and the Samba pieces we pull in. What's the most active thing? I think many of you guys would not think that BTRFS is the most active part of the kernel today. People don't think about BTRFS sometimes like that. They think of EXT4 perhaps. Notice the EXT4
Starting point is 00:08:03 activity continues to decrease actually. XFS is very active. BTRFS is very active. The NFS client activity increased somewhat. The NFS server activity decreased, partly because a year ago there was a lot of scalability work on the NFS server. Now Samba though is 1,800 change sets in the same time period. It's much more active.
Starting point is 00:08:26 Well, obviously, because Samba has a lot more components than just file system related stuff. But it's interesting that Samba is as big as the top four of those combined in activity. Okay, by release. In 4.2, the SMB 3.1.1 support. Can we do the full 3.1.1
Starting point is 00:08:43 support? No. Can you authenticate as a guest? Sure. Do we support the more primitive form of the SMB 311 secure and negotiate contacts? Yes. We added support for duplicate extents, the ref link copy offload that was added in 2016 on REFS.
Starting point is 00:09:00 In 4.3, we added the Kerberos (krb5) support for SMB3. We had it in CIFS, but not in SMB3. Obviously very important for more secure authentication. A lot of people were curious, in their apps, about how to query information about whether the mount underneath them is actually,
Starting point is 00:09:20 does it support, you know, is it on an SSD? Different hardware characteristics. And, you know, we had procfs kind of pseudo files to get this, but it was easier to get it from an ioctl. So I added an ioctl to allow you to query detailed information on the share properties, the device properties, and the volume properties underneath you.
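A rough sketch of calling that query ioctl. The request code and struct here are copied from memory of the kernel's fs/cifs/cifs_ioctl.h (it is not an exported uapi header), so treat every name and field below as an assumption to verify against your own kernel tree:

```c
#include <sys/ioctl.h>
#include <stdint.h>
#include <fcntl.h>
#include <stdio.h>

/* Assumed layout, hand-copied from fs/cifs/cifs_ioctl.h (4.x era). */
struct smb_mnt_fs_info {
    uint32_t version;
    uint16_t protocol_id;
    uint16_t tcon_flags;
    uint32_t vol_serial_number;
    uint32_t vol_create_time;
    uint32_t share_caps;
    uint32_t share_flags;
    uint32_t sector_flags;
    uint32_t optimal_sector_size;
    uint32_t max_bytes_chunk;
    uint32_t fs_attributes;
    uint32_t max_path_component;
    uint32_t device_type;
    uint32_t device_characteristics;
    uint32_t maximal_access;
    uint64_t cifs_posix_caps;
} __attribute__((packed));

#define CIFS_IOCTL_MAGIC      0xCF
#define CIFS_IOC_GET_MNT_INFO _IOR(CIFS_IOCTL_MAGIC, 5, struct smb_mnt_fs_info)

int main(int argc, char **argv)
{
    struct smb_mnt_fs_info info;

    if (argc < 2)
        return 1;
    int fd = open(argv[1], O_RDONLY);   /* any file on the cifs mount */
    if (fd < 0 || ioctl(fd, CIFS_IOC_GET_MNT_INFO, &info) != 0) {
        perror("CIFS_IOC_GET_MNT_INFO");
        return 1;
    }
    printf("volume serial 0x%08x, device characteristics 0x%08x\n",
           info.vol_serial_number, info.device_characteristics);
    return 0;
}
```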
Starting point is 00:09:42 In 4.4, we realized a lot of times you're on a server, and you want to copy a file from your client from share one to share two on the same server. Well, allow copy offload across shares. We also added a sort of primitive form of resilient and persistent handle support. You could mount with the option. We'd request the durable V2 contexts. We'd request the persistent handle if you mounted with persistent handles, or if your share said continuous availability, but we didn't do all of the guarantees
Starting point is 00:10:08 properly. So, I mean, it works, but it's not perfect, and talking with David and others, we're very close to getting reconnect to be much closer to what you expect, and we'll talk about that later. So 4.5, you know, there's a lot of tunable stuff. O_DIRECT with cache=loose. So if you want to write directly to your file, but with loose caching semantics, you can now do it. People in Linux are often running different network topologies where their network may be very slow or very fast. The echo interval, we're pinging the server to see if it's awake to decide if we need to reconnect to it. It was made tunable.
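A sketch of what those 4.5-era tunables look like on a mount. In practice you would use mount.cifs(8), which handles name resolution and credentials; this raw mount(2) call (run as root, with placeholder server and credentials) just shows where the option string goes:

```c
/* Mount a share with loose caching and a tuned echo interval.
 * Server path and credentials below are placeholders. */
#include <sys/mount.h>
#include <stdio.h>

int main(void)
{
    const char *opts = "vers=3.0,cache=loose,echo_interval=30,"
                       "username=guest,password=";

    if (mount("//server/share", "/mnt/smb", "cifs", 0, opts) != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```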
Starting point is 00:10:50 And we began this encryption support. Now, encryption support is not finished, but share encryption is probably the biggest feature that we get asked about right now. Encryption support is so helpful when you're trying to mount to an Azure share or when you're trying to mount to something at Microsoft.
Starting point is 00:11:07 When you're going across the Internet, this per share encryption is very important. But I think it's useful for lots of use cases, not just that. 4.7 was dominated by Badlock fixes. It was a big set of fixes. Thank you, Samba team guys. There were some side effects on our guest mount options, even in the kernel, for this, and some NetApp-related fixes. Red Hat and SUSE did some interesting fixes for 4.8. Some of the Red Hat contributed stuff was this prefix path stuff, and we also had an interesting problem that, you know, it's fascinating you
Starting point is 00:11:41 can still have bugs like this that happen even years and years afterwards. We had a problem with mkdir. And, you know, it fell through a hole in a test. We didn't notice it. And there's some stuff that's in progress that's really kind of neat. Okay, right now we're reviewing fixes for prefix path. So, for example, if you get the slashes, depending on how you... When you're mounting to a share, you might not have access to the root directory in the share.
Starting point is 00:12:12 So you might be mounting one or two levels lower than that. But we have to traverse the whole thing. Well, we have to get the slashes right, depending on whether your server supports POSIX or not. Well, that fix is almost in; we were working on that, you know, an hour ago. We added some fixes recently for improved POSIX compatibility. Just crazy things like trailing spaces, trailing periods, that sort of thing.
Starting point is 00:12:36 That just went in fairly recently. One of the things that we added last night, or I added last night, was, you know, we've had this creation time for a long time. XStat's not in. How do we return creation time? How do we return the file attributes? Today, the only file attributes you can get back,
Starting point is 00:12:54 other than the boring ones, are: is it a file or a directory, and is it compressed. So I think David added the compressed file support a while ago. You can get that flag, but none of the other flags, like it's indexed.
Starting point is 00:13:09 It's sparse. There's no way to get those back. So we added an xattr for that, so you can have simple user space tools to display it. These aren't POSIX things. Obviously POSIX doesn't know anything about these; the EXT2 ioctls allow you to get compressed and encrypted, and that's pretty much it.
Starting point is 00:13:29 The rest of them, things like it's indexed, things like this, or archive, system, hidden, those DOS attributes, we have to get with a pseudo-xattr. And the name is up for negotiation if you guys don't like cifs.dos_attributes. NTFS tends to put these up in user space, so it's hard to look at the code because the user space tools are all a little more hidden, but NTFS has a similar kind of approach to returning some of the NTFS metadata. But the general problem we're trying to solve here is if your server stores it, we probably want to back it up. If your server stores it, we probably want to see it sometimes from tools.
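A sketch of what a tool on top of that pseudo-xattr might look like. The name user.cifs.dos_attributes is the one floated above and explicitly up for negotiation, so both the name and the blob format are assumptions, not a shipped interface:

```c
/* Dump the (hypothetical) DOS attributes xattr as raw hex; a real
 * tool would decode bits like hidden, system, or sparse. */
#include <sys/xattr.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char buf[64];

    if (argc < 2)
        return 1;
    ssize_t n = getxattr(argv[1], "user.cifs.dos_attributes",
                         buf, sizeof(buf));
    if (n < 0) {
        perror("getxattr");
        return 1;
    }
    for (ssize_t i = 0; i < n; i++)
        printf("%02x", (unsigned char)buf[i]);
    printf("\n");
    return 0;
}
```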
Starting point is 00:14:10 So this SMB3 metadata that you can't get at from POSIX matters, and we have to have some way of displaying it. We also need to improve reconnect support. Networks are terrible. They go down. How many times does your network go down today? And being able to provide best effort reconnection is nice, but there are guarantees we want to provide. One of the things we're looking at
Starting point is 00:14:29 literally right now with David and others: recently, three or four months ago, we fixed it so if the server goes down, we try to reconnect proactively to that share.
Starting point is 00:14:46 But we also need to reopen all of the persistent handles immediately. So we need to improve this HA support, and we're very close on that. Encrypted share support we talked about. Most of the core pieces are in there. It's over half done. We need to finish this last bit. And I'm hoping, you know, with these two test events back-to-back, this will give us a chance
Starting point is 00:15:06 to do that. ACLs we have for CIFS. So finishing the xattr that allows you to view ACLs with SMB3 would be useful. I also want to have a way of viewing auditing information, quota information,
Starting point is 00:15:22 and claims-based ACLs through pseudo-xattrs, at least for backup, if nothing else. And that's something, you know, if you guys have ideas on the naming for that. Bug status. We've got about 50 bugs open in the Samba Bugzilla. Of the 50 bugs that are active in the Samba
Starting point is 00:15:37 Bugzilla, there's a few that look serious. Most are not. Mostly it's cleanup work we need to do to get the count down. There's a smaller number open in the kernel Bugzilla. Some of these are stale, and they need some love to close them off. Now,
Starting point is 00:15:53 what's the high level view? SMB support is great for large file I/O, but it's not fast for metadata operations. If you're going to do a directory lookup with 10,000 files, it's going to be pretty slow, because we're going to do too many open/query/closes, open/query/closes,
Starting point is 00:16:08 getting metadata over and over again for individual files. We need to support directory leases. That would help a lot. And we also need to add, and this is something we talked about at the File System Summit, we have this witness protocol prototype in
Starting point is 00:16:26 Samba. The client part of it is all we need. We need a witness protocol client that can ioctl down into the CIFS kernel code, just wait on a witness protocol event so we know when to fail over. If we want to move a share from here to here, because you're doing some management activity or some load balancing, that doesn't require the whole witness protocol implementation that's kind of stalled a little bit in Samba. I think they may talk about it a little more tomorrow. But the client piece, I'm hoping we can get linked in
Starting point is 00:16:57 because only the client piece, by ioctling down into our kernel code, is needed to notify the kernel when share movement is requested. POSIX: we emulate. We don't have POSIX extensions. We're going to talk about that Wednesday with Jeremy. Jeremy's talk gets into that POSIX emulation.
Starting point is 00:17:19 Today, though, we lack POSIX protocol extensions. Right now, we're doing emulation. We'll talk about that a little bit at the end, though. And dialects. We support CIFS, of course, but I really want people to be using SMB 2.1 or later. There's no reason not to be using SMB 2.1 or later, except for the POSIX things we talked about, and we'll get to that later on. So there's a set of capability bits that are negotiated. Which capability bits do we know about and support? DFS, leasing, large MTU.
Starting point is 00:17:50 Yes, we reconnect them. But we don't reconnect them fast. So we don't reconnect them immediately when the server comes back up. And we're fixing that right now. And of course, the server support is an interesting story for persistent handles. This is very important to provide guarantees on data integrity.
Starting point is 00:18:09 Cap encryption we talked about earlier; it's in progress, not complete. Directory leasing we really need to do. The metadata performance is bad for the CIFS client, sorry, cifs.ko, compared with the Windows client. That 30% performance boost you get from being able to cache metadata information longer on a directory for which you have a lease, it's a big deal. And multichannel, it's started, but it's not finished. It is a priority.
Starting point is 00:18:32 One of the things that is holding us up is getting the Mellanox guys, getting the guys from these adapter vendors to help us with, how do you simulate this on your little VM a little bit easier? Multichannel and RDMA, being able to get the kernel, you know, looking at the NFS code and how that works and figuring out how to simulate some of these multi-adapter cases, especially with RDMA, would help. Because I think there is significant value in that. Copy offload, I'm very excited about that.
Starting point is 00:19:02 It's a huge performance win. You can see here an example: here we have a 30 meg file. Now, the normal copy versus the copy with the server-side copy. You know, half a second down to 18 milliseconds, right? Big performance win. So this performance gain is really neat. So seeing the server-side copy of a 3 meg file, seeing a server-side copy of a 30 meg file,
Starting point is 00:19:39 huge performance win. Notice that you're using --reflink. The default behavior of reflink now calls the ioctl for duplicate extents, and I know it gets a little confusing because there's at least four ways to do copy offload with SMB. Duplicate extents is the newer one, for Windows Server 2016 ReFS. Here's a wire trace of it, and you can get an example. You can get a feel for what actually goes on. I'm going to copy a 30-meg file, and you'd normally see at least 30 writes of one megabyte, right? Do we see any writes?
Starting point is 00:20:26 You see an open, a get info, an ioctl, a set info, an ioctl, and a close. This is really fast. It's a heck of a lot better than sending 30 or 300 or 3,000 writes. So what happens if the copy fails on the server? It's a great question. Now, I haven't looked at this in the last... These slides are actually from...
Starting point is 00:20:58 This particular slide is actually from a couple months ago. The cp command, its error handling I didn't think was that good for the reflink case. If you used the same ioctl yourself, you can do cleanup, but my impression was, with the cp command, what you'd see is a 0 byte file left around, if I remember correctly,
Starting point is 00:21:20 and you're welcome to try it. Just try some cp --reflink. I don't think it's unique to CIFS, but if the reflink ioctl failed, I think you would leave debris around: a file, an empty file, a 0 byte file. I thought it fell back to just regular write I/O, so you can provide --reflink=always and --reflink=... Yeah, it was an option. Yeah, yes, yes.
Starting point is 00:21:43 I thought the default was to fall back to a file copy. That's a good point. So he thought there were options on cp --reflink. There probably are. And also, to be fair, behind this system here, this is Ubuntu that I was testing around with today. This is not the latest Ubuntu. It doesn't have the latest cp command.
Starting point is 00:22:03 So I have to be a little bit careful because we in Samba and we in the kernel tend to write to the latest version of the code, but we don't tend to download the CP command that's the latest version of CP. So this is the problem we have sometimes. We're really good about bringing the kernel in. We're really good about bringing Samba in.
Starting point is 00:22:24 We're really good about running the latest Windows 2016, we're not really good about updating our two-year-old Ubuntu to bring in the latest CP command by reinstalling. I've been getting bugged every other day about, you know, do you want to upgrade? So, but I think this is an important thing for you guys to try
Starting point is 00:22:39 and I think that it would be very useful to check on that. What is the error handling that goes on, and does it match our expectations for error handling? This is somewhat outside the scope of an ioctl, right? The ioctl either works or fails. So these are tool questions, really.
Starting point is 00:22:56 In case the ioctl fails with some error, should the client fall back to the regular standard copy? The tool may or may not. And like David said, I think that it's important to realize that whether you want to retry or fall back is part of the cp semantics, I believe. The cp command itself has those options. Let's say, for example, it completely fails.
Starting point is 00:23:21 You should allow an option in your tool to do the boring way, the slow way. But realize that could be a thousand times slower. So there may be times if it fails, you don't want to take up all that bandwidth. So I understand that there should be options, and I do understand that users of this tool could be very narrow applications that aren't using CP.
Starting point is 00:23:43 They could be writing their own calls to the ioctls. And remember, this is not particularly CIFS-specific, or SMB3-specific, because Btrfs and other file systems support this. Now you'll also see NFS. I think Christoph also did it for XFS, right? Do you remember which file systems support it? XFS, NFS, Btrfs, and CIFS. And he thought OCFS2 as well. There may be others, but it's kind of neat.
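A sketch of the tool-side policy being debated here: try the clone ioctl first, then either fall back to a plain read/write copy or clean up the empty destination, roughly mirroring cp's --reflink=auto versus --reflink=always split. This illustrates the discussion; it is not cp's actual implementation:

```c
#define _GNU_SOURCE
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE */
#include <fcntl.h>
#include <unistd.h>

/* Boring fallback: 1 MiB read/write chunks, like the wire trace. */
static int copy_fallback(int src, int dst)
{
    static char buf[1 << 20];
    ssize_t n;

    while ((n = read(src, buf, sizeof(buf))) > 0)
        if (write(dst, buf, n) != n)
            return -1;
    return n < 0 ? -1 : 0;
}

int main(int argc, char **argv)
{
    int always = 0;     /* nonzero would mimic --reflink=always */

    if (argc < 3)
        return 1;
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0)
        return 1;

    if (ioctl(dst, FICLONE, src) == 0)
        return 0;       /* offloaded clone worked */

    if (always || copy_fallback(src, dst) != 0) {
        unlink(argv[2]);    /* don't leave a zero-byte file behind */
        return 1;
    }
    return 0;
}
```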
Starting point is 00:24:08 Okay, so let's look at it versus copy chunk. Now, copy chunk is what we're used to. Copy chunk is more common, especially on NTFS. The problem is we don't have a lot of tools for this. We really need to write some more CIFS Samba-like tools for this, but I think you wrote cloner, right? So cloner allows you to do copy chunk.
Starting point is 00:24:38 Here we're doing a 500 meg file. And you can see how long it takes to do the copy of this 500 meg file. Not bad. And we tried it a few times. And notice the performance can vary dramatically, because here we actually had to do some write-through. Well, the second time was slower because, you know, you had an existing file you're writing over. So it slowed down a lot.
Starting point is 00:25:01 But look at reflink comparatively. Now, what are we talking about there? 6,000 times faster, is that right? I mean, it's pretty cool. I mean, there's a big difference between the performance we got for copy chunk and the performance we got for reflink. It's kind of neat. Now remember, this is ReFS,
Starting point is 00:25:26 so your times for NTFS, those copy chunk times, were kind of interesting to look at. We don't have the luxury of duplicate extents for this, but it's kind of interesting, because you don't get as much variation in NTFS as we saw with ReFS when the file existed or didn't exist.
Starting point is 00:25:46 So on ReFS, you got a big penalty if the file already existed, because you're having to clean up what was there. I find this very interesting. What's the best way to copy a file? We have lots of choices. Unfortunately, we only implement two of the four choices in CIFS, but it's a fun question.
Starting point is 00:26:02 What's the best way to copy lots of data across the network? These aren't bad ways to do it, but there are other options. Now, Samba doesn't support all these options, but across different servers we have these four different options to do this. Okay, so what about HA? We have the new mount options. We can mount with resilient handles or mount with persistent handles.
Starting point is 00:26:25 We probably don't care about resilient. If this mount option is specified, we're always going to try it, but I don't know what to do if it fails. I don't think we want to give up on a file if you refuse to give me a resilient handle. The server, for some reason, like Azure, wants to give you a persistent handle.
Starting point is 00:26:43 Great. If for some reason it said no, I'll keep going; it's not a persistent handle, but I'm going to try. I'm going to try on open to get a persistent handle and send that create context. I do need to add the channel sequence number. This isn't quite as important without multi-channel, but it is something.
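For illustration, this is roughly what the opt-in looks like from the admin side. The option names persistenthandles and resilienthandles match the 4.4-era additions described above, but verify the exact spellings in mount.cifs(8) for your kernel; the server, path, and credentials here are placeholders:

```c
/* Request persistent handles on every open from this mount.
 * Server, path, and credentials are placeholders. */
#include <sys/mount.h>
#include <stdio.h>

int main(void)
{
    if (mount("//server/vhds", "/mnt/vhds", "cifs", 0,
              "vers=3.0,persistenthandles,username=guest,password=") != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```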
Starting point is 00:26:58 Also, there's a couple failover things that are easy. I know guys have played around in Samba with exotic ways of doing DFS for load balancing. DFS is kind of neat. But one thing that will be relatively easy to do, that I think we underestimate and should have finished a while ago, was: if a server gives us multiple DFS referrals for the same path, and one of the servers goes down, well, why not reconnect to the other?
Starting point is 00:27:29 And also this witness protocol we talked about earlier. If I want to move from your server to your server, because you're taking your server offline, I need to be able to be notified about that event. The client will have no problem reconnecting, but we need notification. And writing the witness protocol parsing, the RPC parsing for that isn't worth it.
Starting point is 00:27:50 It's already there. We have a prototype for it in Samba. And so one of the things we talked about at the file system summit was just separating out that client daemon, getting that checked into the Samba tree, so part of the Samba client tools can just ioctl down into the kernel
Starting point is 00:28:08 and wait on those events. Okay, Steve. Yes. Question on the previous one. Those mount options, so those are opt-in by default? You will now request that resilient or persistent? We will always request persistent.
Starting point is 00:28:21 So the use... The mount option is my question. Yeah, so request persistent. So the use... The mount option is my question. Yeah, so use persistent. The flag we keep on the share, so every share we mount, we will first mark it as use persistent if this mount option were set. We'll try.
Starting point is 00:28:39 And second, we'll set it if the server set continuous availability. So if the server set continuous availability, like Azure, right? Right. It's a server cap. It'll tell you. Right. But I understand.
Starting point is 00:28:53 So what I'm saying is, if the user said you must, and by the way, if the server says it doesn't support it... You'll request persistent handles unconditionally? No, because if there's no cap, there's no point in requesting it. But if it's not continuous availability and the server says it would support it...
Starting point is 00:29:15 So if you tell me that it's possible for you to support persistent handles, but you don't mark the share continuous availability, then... Alright, well, it sounds like a really weird option. It seems like it's going to be very difficult to tell the administrator how to use it. Well, the short answer is you don't have to, because if you mark the share as continuous availability, we're implicitly setting this. So if it's continuous availability, we're doing it.
Starting point is 00:29:46 Well, it just seems like it's a lot of non-options. Realistically, what happens is your server either wants it for the share or it doesn't. So your server's going to say... ...offers it for the share. Whether the handle's persistent or not is the client's decision.
Starting point is 00:30:01 It's usually driven by the type of application. Persistent handles are really useful for virtual disks, for instance. But I agree with you. If you don't use the handle, you can reconnect and recover without damaging the disk. Right. I absolutely agree with you. So his point, I think, is persistent
Starting point is 00:30:17 handles are very valuable for specific applications. So, for argument's sake, let's say that you have an application, virtualization application, that wants persistent handles and the rest of your traffic doesn't care. You have two choices. You can either force the whole share to get it by setting continuous availability and then we'll try, the client will always try, or you mount it twice. You're trying to get the efficiency of non-persistent handles for apps that don't care.
Starting point is 00:30:47 But you can force it on if you manage it. Yeah, so the lesson we want to learn here, I think, is the lesson of resilient handles. Having the app request it on open, nobody would do it. So rather than have the app ask for it on open, if the admin wants apps to get the higher availability, they'll mount it twice, once with continuous availability, once without, to different directories. Right? OK.
Starting point is 00:31:16 So that's an option. Yeah? What does the protocol say? That is, if CA is not set, can you deliver persistent handles from a share for which CA, continuous availability, is not set? I think so. I mean, this is a good question.
Starting point is 00:31:31 This is a good question for Tom. The protocol says what happens when it's operating as designed. I think requesting a persistent handle from a server that didn't tell you it supported it, it's not illegal. No, no, the other way around. Simply ignore the context. I think it's the other way around. So the question...
Starting point is 00:31:49 If CA is not set, but the server returns a persistent handle to you, is that what you're asking? Well, it can't, because of the create context; you wouldn't look for the create context. The server can't; he would have to insert a create context in the reply that isn't there in the request. So basically, the interesting question here is: from a protocol perspective, when should the client request a persistent handle, and when shouldn't it?
Starting point is 00:32:16 When he cares. Yeah, and since we don't have a per-app way in POSIX for an app to say, give me persistent handles, what we do is, okay, if your app wants it, use this mount. If your app doesn't want it, use this mount. Or, you have your server tell you, you must use it. Go ahead. The thing is, if the share is CA, you are always using persistent handles.
Starting point is 00:32:44 We'll always ask for them. So if the share is CA... You will always get them, but... Well, no, you can open a non-persistent handle on a CA share.
Starting point is 00:32:52 It would be weird, but... Yeah, I mean, we... A down-level client will do that, for instance, one that didn't support
Starting point is 00:32:57 persistent handles. Right. But the point is that on a continuous availability share, if we've negotiated a high enough dialect, we will try on the Linux CIFS client
Starting point is 00:33:07 because the server told us it's continuously available. We'll try to get a persistent handle. If we don't get it and the server wouldn't give it to us, okay, we tried. The thing is that this application doesn't need it.
Starting point is 00:33:20 There's no way to disable it. And you said you would mount it twice. Would you get persistent on both? That's actually an interesting point. So what happens if the client absolutely didn't want persistent handles, but the server had continuous availability?
Starting point is 00:33:39 So Azure, for example, you return persistent handles, right? But only if you give us a durable V2 create context. It's that persistent flag. We wouldn't give you persistence if you didn't ask for it. But you always say that you support persistent, and you always set continuous availability on this share.
Starting point is 00:34:01 as a way of saying, this share has stuff that matters. Trust me, I'm the server. This share matters. There's something important here. Now the capability I support... It's not the share. It's what the application wants. The files on the share
Starting point is 00:34:20 are no different whether they're persistent handled or not. It's the recovery scenario that changes. So it's also the application, the thing that wants to open it. Yes, but your Windows client will open persistent
Starting point is 00:34:36 if it says continuous availability. That's the behavior of the Windows client. That's one example. And so will we in Linux. Okay. Right? If it says continuous availability,
Starting point is 00:34:49 the Windows client will open persistent, and so will our client. Okay. But the point is, I think it's an interesting point that you bring up. If the application really didn't want that penalty,
Starting point is 00:34:59 performance penalty, maybe we should mount nopersistent, and I don't care what the server supports. I mean, as an alternative. Today, with an older dialect, they just didn't support it. But the capability comes at negotiate time, so the server knows what you're connecting to it with. It can be available at share or global scope. And then when you come into the share, that's when you can find out if the share is capable of persistent.
Starting point is 00:35:32 So at that point, he has to choose a default behavior, just like he says. If an application is running against a CA share, it will default to persistent. If it's against a non-CA share, it'll default to non-persistent. Now, the only case that's special, in my mind, is the case where you have an app that wants to run against a CA share, but doesn't want the overhead
Starting point is 00:35:47 because he's gonna do a lot of metadata operations and he doesn't want the extra overhead. So I think the idea of a nopersistent mount makes a lot more sense. The idea of having a persistent mount where you're going to attempt persistent handles against a share that's explicitly told you it doesn't do CA, which it sounds like
Starting point is 00:36:00 maybe you might want to do with the options, that sounds a little bit weird to me, because the server's saying the share doesn't have the capability, and I'm not sure what it means to try to use it. Like, I don't think the server capability was intended, at least maybe it's not proper, to override the share capability.
Starting point is 00:36:13 But yeah, it's interesting. I mean, basically this... Both the server hint, this share matters, and the client hint, I don't care, I want better performance: they're important. And I think that the no option versus the yes
Starting point is 00:36:32 option, now realistically it probably doesn't make as much of a performance hit as we think, because once again, from the Linux perspective, a lot of our problem is metadata performance and things like this. It's not really reconnect delays perspective, a lot of our problem is metadata performance and things like this. It's not really
Starting point is 00:36:45 reconnect delays, and I'm not as worried about that. But I think that this is something we should revisit, and maybe something we can talk about next week at the Microsoft event, or the test event this week. Okay. So fallocate.
Starting point is 00:37:02 This is actually kind of interesting. There were a lot of changes a year to two years ago, different fallocate options. We support the basic stuff: punch hole, zero range, keep size. And we discussed ways a little over a year ago, on ReFS at least.
Starting point is 00:37:16 Now that we support block ref counting, we could simulate a collapse range and insert range. These are kind of interesting options. In Linux, you have the ability to take a file, remove a chunk out of the middle of the file, and push the two pieces together.
Starting point is 00:37:30 It's kind of a neat thing. Then insert range, same idea. You've got a file. You want to insert something in the middle of the file. Keep the first half and the second half, but move them and stick something in the middle. With block ref counting stuff, it actually
Starting point is 00:37:46 isn't that risky to do something like that, by just sort of iterating through block ref counting block copies. Because you can use the hint that the file system supports block ref counting to know whether it's safe to do this.
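For concreteness, here is what those fallocate(2) modes look like from an application. Punch hole and zero range are the basics the client supports today; the collapse and insert calls are the ones the talk suggests could be emulated over block ref counting. Whether each mode succeeds depends on the kernel, the mount, and the server:

```c
/* Exercise the fallocate modes discussed: punch a hole, then the
 * collapse/insert variants. Offsets and lengths usually must be
 * block aligned for collapse and insert. Needs glibc >= 2.18 for
 * the FALLOC_FL_* constants via <fcntl.h>. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Deallocate 1 MiB at offset 4 MiB, keeping the file size. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  4 << 20, 1 << 20) != 0)
        perror("punch hole");
    /* Remove the range entirely, shifting the tail down... */
    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 4 << 20, 1 << 20) != 0)
        perror("collapse range");
    /* ...or make room in the middle, shifting the tail up. */
    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, 4 << 20, 1 << 20) != 0)
        perror("insert range");
    return 0;
}
```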
Starting point is 00:38:03 Okay, ACLs. Go ahead. Does seek support SEEK_DATA, SEEK_HOLE? SEEK_DATA, SEEK_HOLE. So I think that would have to be mapped to the query allocated ranges call. Yeah, that would be interesting. So the question is, do we support SEEK_DATA, SEEK_HOLE? I haven't looked at this in over a year. But that's, yeah, this is interesting. But it's the query allocated ranges thing you're talking about, right? So we'd have to query allocated ranges. So the SEEK_HOLE thing, let's look at that. That's useful to... So SEEK_HOLE
Starting point is 00:38:40 probably would require that we add support for querying the allocated ranges in a file.
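For reference, this is the application-side interface in question: SEEK_DATA and SEEK_HOLE via lseek(2). A minimal sketch of walking a sparse file; supporting it on the CIFS client would presumably mean mapping these to the query allocated ranges FSCTL, per the discussion above:

```c
/* Walk the data regions of a sparse file with SEEK_DATA/SEEK_HOLE. */
#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;

    off_t end = lseek(fd, 0, SEEK_END);
    off_t pos = 0;
    while (pos < end) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0)
            break;  /* no more data, or the filesystem can't say */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        printf("data: %lld..%lld\n", (long long)data, (long long)hole);
        pos = hole;
    }
    return 0;
}
```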
Starting point is 00:39:01 Okay, so ACLs. We have CIFS support for ACLs. They're really important. There are cases where the mode bits can be emulated from them. I was kind of intrigued that Apple, I think, if I remember correctly, query the default permissions back, so they can return a simulated mode on the file by figuring out whether the particular user they've got has access, using the create context. But ACLs are a nice way to simulate mode, and they're also useful in other ways. Being able to return the ACLs and set the ACLs is useful for backup, it's useful for other things, and especially if rich ACL support makes it into the kernel. The reason the SMB3 support
Starting point is 00:39:28 isn't there isn't an architectural problem; it's just that the CIFS code's been around a long time, and the SMB3 code was somewhat different in its implementation, and we need to finish it up. Security features: we talked about secure negotiate, it's partially implemented,
Starting point is 00:39:43 and the share encryption. Earlier I had mentioned some very recent work. Here was the prototype of it that we were testing yesterday. We're returning the creation time on various files. So here's a set of files. You're going to see this one has the archive bit set. This has archive, index. This has archive, read only. But these files were created at slightly different times. You'll
Starting point is 00:40:08 notice that the timestamps differ, while, you know, they're all created the same day, the same hour. But wrapping these in tools so it's actually human readable would be interesting; this is just the raw blob. And then, you know, what about metadata? Here you have the Windows view of a file. Hey, it's content indexed, archive bit set, read only on this one, system, hidden. And here's the flag set here. You can see each of the flags set, using the user DOS attribute xattr.
Starting point is 00:40:44 Once again, this could be used for backup. It could be used for, you know, if we allow a set. I don't know if you guys have opinions on whether we should allow a set of this, but it would allow you to do... There are certain things like sparse we set other ways, and there are certain things like compressed we set other ways. But other than sparse and compressed, you know, there is value to these flags to know them.
Starting point is 00:41:06 We need to build some Python tools or little Samba-like client tools around those. Okay, we talked about XStat. Generally, at the file system summit, I got the impression that XStat integration was generally agreed on, as long as we didn't bike shed and keep adding new features to it.
Starting point is 00:41:24 But returning at least the birth time and some of the attributes in a more standardized format is important. We also, even if we don't want to be encouraging people to use alternate data streams, alternate data streams matter. You can open alternate data streams today. If you open file colon stream one, you open it just like a file name, right? The problem is that you don't know that the file has stream one and stream two and stream three. So how do you list that the file has stream one, stream two, and stream three? If you knew that it had those streams, you could open them and you could query information on those, but you can't, because you don't know a way to list the streams. And, you know, once again, an xattr to list the streams was what we were working on literally as I was walking up to this talk.
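A sketch of the two halves of that: opening a named stream by its path works today, while the listing side is shown against a purely hypothetical user.cifs.streams xattr, since the real name and format were still being prototyped as this talk was given:

```c
/* Alternate data streams: open-by-path today, hypothetical listing. */
#include <sys/xattr.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    /* Works today on a cifs mount (not a POSIX path, of course). */
    int fd = open("/mnt/smb/file.doc:Zone.Identifier", O_RDONLY);
    if (fd >= 0)
        close(fd);

    /* Hypothetical: enumerate the streams through an xattr. */
    char buf[4096];
    ssize_t n = getxattr("/mnt/smb/file.doc", "user.cifs.streams",
                         buf, sizeof(buf));
    if (n > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    return 0;
}
```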
Starting point is 00:41:51 We talked about the clustering and witness protocol integration. We talked about DFS reconnect. Now, performance. There are some really cool things that can be done for performance. These are some of them... there are probably more. Well, I'm sure half you guys have other ideas
Starting point is 00:42:30 on things that help with performance. But from my view, these are some of the more interesting performance features for SMB to talk about. Compounding. Do we do compounding? No. Does the Mac do compounding?
Starting point is 00:42:39 Absolutely. You know, the Mac does way too much compounding sometimes, right? It takes advantage of that feature really well. So, you know, most operations in SMB3 go through one routine called open/query/close. Well, that could be a pretty obvious compounding
Starting point is 00:42:59 candidate, couldn't it? You know, right now open/query/close calls open, query, and close, but we could just compound that. It was done intentionally, but we didn't finish the last bit. That would probably help 10% or 20% at least on metadata performance. Large file I/O, we do pretty well there. Our performance can be better than NFS in some cases. It's really neat.
Starting point is 00:43:15 Performance scales really well for large file I/O. File leases, yep, we support leases; can we upgrade them? No. Lease upgrades might be nice, to reacquire leases from time to time after we lose them. We don't do that. Directory leases: huge performance win. Shouldn't be too bad. We don't support it.
Starting point is 00:43:36 Copy offload. Yes, we support the two probably most important mechanisms. But it would be nice to support the T10 style as well. Multichannel. We do the basics. We query.10 style as well. Multichannel. We do the basics. We query. We know the server has multichannel. We know information about these adapters, but we don't take advantage of multichannel.
Starting point is 00:43:55 And this is unfortunate because Samba has recently added support for multichannel. And, of course, RDMA, one of the challenges of RDMA is getting some good sample code in the kernel and some sample drivers that we can emulate RDMA when we're running around in presentations without RDMA in our VMs. And then Linux-specific protocol optimizations.
Starting point is 00:44:14 I think we should be very aware that every operating system has particular quirks about IO, and being able to optimize that matters. And I think that we've spent a lot of time at these conferences listening to Hyper-V. Hyper-V has particular I.O. requirements. Probably Azure and other things do too. We have to be very aware of, in Linux, what we can do to reduce number of frames sent on the wire.
Starting point is 00:44:41 And hopefully as we go beyond the things that Jeremy is talking about with the Unix extensions, we can do that. Okay, when we talk about Linux extensions or Unix extensions or POSIX extensions, we have to remember that Linux is not POSIX. Linux does a lot of things that aren't POSIX. It has lots of extra system calls, things that are Linux specific. But they matter. So what do we do today? Best effort compatibility. We can handle all the reserved characters, mapping them
Starting point is 00:45:12 just like the Mac does. We can support these Minshall-French symlinks. We can emulate symlinks multiple ways. We recognize the multiple ways. We can get most of the information we need. What can't we do? We can't do advisory locks. We can't do case sensitivity in opening paths. And without some sort of CIFS ACL thing, we can't really emulate the mode bits very well. Apple's approach to this is reasonable and might be worth looking at for returning some of the
Starting point is 00:45:38 mode bits. But the Unix extensions allow you to do this a little bit cleaner. This is actually good enough to run the vast majority of apps and test cases, though. So we could query the maximal access request create context, as we talked about, to get some of the mode bits. The case-sensitive volume flag, unfortunately, isn't... The servers lie about this. The servers say they're a case-sensitive volume, and they lie.
Starting point is 00:46:03 So there's not a whole lot we can do about that. To use it as a clue, hey, the server says it's case sensitive, let's not worry about case sensitive mapping? Well, unfortunately we can't. The NFS symlink code in Microsoft
Starting point is 00:46:18 NTFS allows you to create a reparse point that their NFS server uses, and we could use these same things. They're nice in that only the clients follow them. They don't have any server security problems as a server-followed junction might. We recognize it, but we could clean that up a little bit. Right now, the Minshall-French ones
Starting point is 00:46:38 have a sort of magic file size and signature in them that allow you to recognize them as a symlink, and Apple does that, and our Linux client does that. Query FS info, the physical bytes per sector, we can map that to an obscure statfs field. But it doesn't address byte range locking, it doesn't address the case sensitive path names, and of course we have this problem with streams. There are some things like, you download a file with Internet Explorer, it's going to add a stream name for its zone.
Starting point is 00:47:07 If I have file colon stream, how do I tell that from a valid POSIX path that has a colon in the valid POSIX path? So one of the problems we have is you're either mapping with POSIX emulation where colon is mapped to something else, or not. And if you're not, then that would be a stream name.
Starting point is 00:47:25 If you are, it's an illegal path to Windows. It's a legal path to POSIX. So how do we deal with that colon conflict between POSIX, where it's a legal character, and Windows, where it's a separator between that and the stream name? Okay, second one. Apple has this create context, AAPL. And you can see some of the things here.
Starting point is 00:47:45 And it would improve Mac interoperability. Here you see an example on the wire of what the AAPL. And you can see some of the things here. And it would improve Mac interoperability. Here you see an example on the wire of what the AAPL context looks like. And nothing too magic about it. It's good enough for a lot of their needs. We could certainly do it. We could make it a mount option. Or we can go finish up the very relatively small changes that Jeremy and I have been talking about for these POSIX extensions.
Starting point is 00:48:07 We'll have a breakout section on it as well as Jeremy's talk. Performance. We really need to do some of this compounding, get the direct releases in place. That should help a lot. There are cases where we're going to be faster than NFS. And there's cases where NFS, of course, is going to have fewer operations to get the same metadata back because their reader and their query info map a little bit better to POSIX
Starting point is 00:48:28 but the bottom line is we want SMB to be good enough for a lot of different workloads right now it's good enough for a reasonable set of workloads but to broaden that we need this POSIX support and we also need to improve the automated testing.
Starting point is 00:48:45 xfstests is wonderful, but there is a subset of tests that fail. And some of those, we know why. Some of them need POSIX permission mode bits. Okay, we can deal with that. The fallocate missing features that Dave and I were discussing, sure, there's a few of those.
Starting point is 00:48:59 There are not that many. xfstests tests lots of file system specific features, things that other file systems don't support, things like fallocate special features. We're going to fail at least one test, test 131, because we don't have advisory locking. Okay, I can live with that. And there's a few that relate to network file systems generally.
Starting point is 00:49:20 Getting timestamp coherence between mtime and atime and some of these consistent in a multi-node client/server network, in some cases, is impossible. You can do it in a local file system, but in a network file system, it's not always possible. And there are a couple of cases like this that on NFS or SMB aren't going to work. But generally, xfstests works reasonably well on Linux, especially with the scratch mounts specified.
Starting point is 00:49:48 we've talked about here, getting to the point where 90% of the tests make sense on CIFS and pass is really going to change a lot because it means that when you do a fix or when you're testing your server, you don't have to think as hard. When 60% pass,
Starting point is 00:50:02 there's too much thinking involved. Is this a bug in my server? Is this a bug in the CIFS client? What's going on? But it's important to get a little bit farther along in the compatibility for POSIX, if for nothing else than when NetApp or any other server vendor wants to do a quick test of their NFS and CIFS support, you want to use the same tools. Being able to use the same tools against different protocol versions, SMB3, SMB 3.1.1, NFSv3, NFSv4, using the same set of xfstests tools, is helpful. And I realize the name xfstests sounds misleading
Starting point is 00:50:34 because it has nothing to do with XFS anymore, but it's the name for the file system test suite. And its history obviously leads it to that name, because it came from the XFS development team initially. But it's now become kind of a catch-all for all the test tools. But at these events, specifically at the Plugfest downstairs this week and then next week at Microsoft, we have the opportunity to make some progress here.
Starting point is 00:50:55 And this is kind of exciting. Some of the SUSE developers and Citrix developers and others have been here working through some of these problems. I think SMB3 on Linux has a very bright future. I'm very excited about some of the improvements that were made in quality of service in our SDC talks earlier. I'm kind of excited about getting additional security features in, and just getting the performance a little bit better each year. I think a lot of times on Linux we focus too much on local file systems, and on maybe iSCSI
Starting point is 00:51:32 and NFS. There is a role for SMB3 that is quite broad, and not just on Macs, not just on Windows. I think there are some workloads where it's really exciting on Linux as well. But I also want to encourage you guys to send patches, because we definitely need more help here. And this is a fantastic, very, very interesting challenge. OK, we have time probably for a few questions. Anybody have questions?
Starting point is 00:52:01 Yes, go ahead. You mentioned earlier some of the Linux APIs like RichACL and xstat, and I was just kind of generically curious if you had much involvement in that, or insight into how those things are going, more than just that they aren't in yet, because I know some of those... I remember the RichACL patches coming across mailing lists probably six or seven years ago, maybe? Well, I remember them because on my team at IBM years ago, we were working on it. This was like 2007 or 2008. So the RichACL patches have gotten gradual improvements.
Starting point is 00:52:45 that was done was actually in preparation for Rich Ackle and cleanup. So some of the cleanup patches to make Rich Ackle go in cleaner have gone in. I noticed this yesterday in annoyance because remember I was talking about the creation time and the DOS attributes? I was like, where did my code go? And I realized it went in because they took out some dead code. They took out some dead code because in preparation
Starting point is 00:53:09 for RichACL, they cleaned up some of the xattr code. So there has been a little bit of the patch set go in, but realistically, I think Jeremy Allison's had some offline discussions. You could find Jeremy and ask him the progress he made. There's been strong objection to the idea of having deny ACEs in Linux
Starting point is 00:53:25 because it does complicate the admin model. The patch set is just a typical Linux example. When you get a patch set with this much complexity, it takes 10 times more than something with that much complexity. On the other hand, there are products shipping with RichACL support in them. The tree is available,
Starting point is 00:53:49 and we have the Samba modules to take advantage of it, and, you know, NFS patches to take advantage of it. Go ahead. Yeah, that's actually a good question: is it more people than Christoph that object to RichACL? Probably, but he's the one that everybody hears about. I haven't run into anybody other than Christoph recently that objects to it. He's here. Yeah. He's here. I haven't asked him.
Starting point is 00:54:17 Talk to him. Yeah. And to be fair, RichACL is more complicated. xstat, on the other hand, shouldn't get a big objection, and I think that got more general agreement. I think we're much closer on xstat. On the other hand, the problem that was mentioned over and over again, because we had this breakout session where we were talking about this,
Starting point is 00:54:49 Michael and I: no bike shedding. The kernel has this really horrible habit of trying to make it 2% better by adding 5% complexity, and then after 100 of those changes, it just falls apart. The whole xstat could return many things; if we leave it at what it is right now, there's not much objection. But why it's not in, because, you know, look, we talked about it back in, what, April or May, in Raleigh, is a good question, and it's something I haven't actually followed the mailing list discussions on to see where the last... You might be able to go out on linux-fsdevel and see what the last reaction on xstat was, because that was much less
Starting point is 00:55:29 controversial than RichACL. And I feel a little bit guilty on the RichACL side, because on the RichACL side I could make life easier by doing what the NFS guys did, and make a sort of set of patches for CIFS that kind of look like the NFS ones, so when they turn on RichACL, it works for one more file system. Because one of the things that we have to be aware of
Starting point is 00:55:51 is that if it provides more and more value to more and more file systems, that's good. But not every file system is enabled for RichACL. And CIFS supports it on the wire, so why don't we just add a little bit of glue? Why don't you just store it as an extended attribute? Well, the
Starting point is 00:56:11 And only those file systems that need it. That's essentially Samba. I mean, you do it in user space. So Samba stores other metadata as well in extended attributes. And it's fine. There's no harm in storing it. The problem you get is that
Starting point is 00:56:27 multi-protocol access, it's a little bit more confusing, and it is a little bit more expensive to have Samba store it in user space. And it's not atomic. Yes, and this is a big deal actually. If you think about it, it's actually not atomic either way.
Starting point is 00:56:43 The RichACL interface wouldn't be atomic either. Because one of the problems we have today... Well, the access checks are atomic, yes. But what I was... Something painful is, I create a file. In SMB3, I'm presented with create contexts. And when we have to go out to various xattrs to process create contexts, you could have an open succeed, but a create context relating to ACLs that we couldn't set.
Starting point is 00:57:11 . Yeah, and that's the way of... So we could... Yes. So you stick it in a temp file and then do the rename game. Yes? The other thing that I know we've been looking at, what I've been working on, is actually exporting a local file system over NFS before and with the current Linux NFS server without the rich Apple patches, it's completely faking the Apple support, so it's not really properly compliant. Yeah, I think that what I think is important about your point about the faking ACL support, when you have NFS ACLs on top of, you know, there are other platforms obviously that support NFS ACLs a little more natively.
Starting point is 00:57:59 When we, you know, rich ACL integration with NFS is, the patches are fine. But without that, there's a certain amount of complexity that is hard to understand when you're mapping twice. You're mapping on the client end and on the server end. So yes, you lose information. But more importantly, an administrator trying to be sane about denying access or allowing access, it's very hard to get right when you're mapping twice. And, you know, I think in this room, eventually all of us could
Starting point is 00:58:32 figure that out, but we get it wrong a lot. And the chance of accidentally making data available, you know, that's very painful. And this complexity... RichACL may be more complex, but it's not more complex than mapping twice. Any other questions? Okay, well, hopefully you guys can help out with patches and testing. But once again, this test event downstairs is a great opportunity for us to talk more about optimizations and how to do this. Let's get some more progress on this stuff.
Starting point is 00:59:08 Thank you guys. Yeah. So this is the old Samba logo, right? Yeah, we need a better Samba logo. No, we have one. We have a better Samba logo, but it's the wrong one in the slides. Yeah, I know. I put the wrong
Starting point is 00:59:24 one because this is stolen from a year-old presentation. Mea culpa. My fault. Okay. Thank you, guys. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further
Starting point is 00:59:48 with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
