Storage Developer Conference - #56: Samba and NFS Integration

Episode Date: September 8, 2017

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 56. Today we hear from Stephen French, Principal Systems Engineer, Protocols, with Primary Data Corporation, as he presents Samba and NFS Integration from the 2016 Storage Developer Conference.
Starting point is 00:00:49 Okay, well I want to thank you all for coming. I think this is always one of the more interesting talks because you get a mix of NFS people and SMB people and I think that's always a useful thing because we have many of the same problems. But if you add clustering, we have more, we go from problems to psychoses or problems to very difficult, I don't know the right word for it, other than it is enough to keep us busy for a hundred years. Just a reminder, I work for Primary Data. We have some wonderful NFS people.
Starting point is 00:01:22 We have some wonderful people like Richard Sharpe here in the audience. But I'm not particularly talking about our product. Obviously we share many of these same challenges, but so do many of you guys. I think Red Hat and others deal with many of the same problems every day. I maintain the kernel driver, cifs.ko, for SMB3 and CIFS enablement in the kernel. I'm also a member of the Samba team, and I've been doing this for a long time, but it does seem like some of these same problems come back year after year in slightly more interesting ways. I'm going to talk about NFS 4.2 and SMB3, how to integrate them multiple ways, one on
Starting point is 00:01:57 top of the other, then both together exporting, and on top of a cluster file system. Some of this you all will be very familiar with, and hopefully some of this will be useful information. So why do we care about this? So let's see if... Yeah, so let's try this and see if it works better. This one's off, actually. Yeah. Let's see if there is a way. Yeah. You'd think it has an on-off button, wouldn't you?
Starting point is 00:02:53 No. Maybe not. Okay. That's weird. You would think. You know, we deal with networking problems, but they're different. Okay, sounds great. Okay, so why do we care about this?
Starting point is 00:03:37 And I think one of the reasons is that the performance and stability differs a lot. Now, did any of you sit through Ira's talk yesterday? He talked about Ceph, right? Now many of you know Gluster, many of you know PNFS. Windows has its own clustering model. I think all of us have seen at some point GPFS or Lustre; they all share different problems, headaches with performance, with stability, with compatibility. But we all know that our Windows clients and our Mac clients
Starting point is 00:04:05 are going to do pretty well over SMB3. We all know that there are workloads that are, I mean, I think, if I remember correctly, Ronnie still every day deals with NFS, right? There are workloads that we have to support these two protocols. Now, it's been a long time, but pretty much everyone else died. And I think you guys know that, right? You remember this?
Starting point is 00:04:27 You know, 1984 was not just George Orwell. 1984 was the birth of two protocols that ate all the rest. And I was thinking, why do we have dinosaurs on our shirt again? But if you guys were downstairs at the Plugfest, I think we now have, we still deal with these. Now, the nice thing is that 30 years of improvements have created some pretty impressive things. And even in the NFS world, we're seeing new things being developed. We're seeing new implementations of copy range coming in.
Starting point is 00:04:53 We're seeing new layouts proposed. The feature sets overlap, but they create kind of unique problems. But they also create kind of strengths and weaknesses for particular workloads. And I think it's obvious when you think about the RDMA discussions, right? We, Tom gave a talk on RDMA, right? SMB3 RDMA has done a very nice job. Both these protocols are very well tested and they're more common than all of the others combined.
Starting point is 00:05:18 Now, some of this is review I want to go through fairly quickly. Early versions of SMB, sorry, early versions of NFS 4 had some security features. They were layered in an interesting way that made it possible to do some nice security things. They had a uniform namespace. They were stateful. The original NFS v3, except for the byte range locks, was stateless, which was kind of odd. There are compound operations supported, but with NFS 4.1, they added parallel NFS, trunking, and this concept of a layout, and there's a good overview of this in some of the SNIA presentations, and I think over the years people like Alex and Tom Haynes have given good overviews of that. But we also added, not long ago, NFS 4.2. To give a 30-second reminder of what
Starting point is 00:06:07 NFS 4.2 included, it added sparse file support so we can better do fallocate, space reservation, labeled NFS so you could do SELinux, IO_ADVISE, server-side copy (copy file, copy range), clone file, clone range, and application data holes. Now, when you think NFS, most people actually think of NFS v3: stateless, simple reads and writes, very different traffic patterns. But NFS 4.2 actually does add some useful features, and I don't want to understate its usefulness.
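To ground a couple of those NFS 4.2 features in the client-side API: on Linux, sparse-file and hole support surface through fallocate() hole punching and lseek() with SEEK_HOLE/SEEK_DATA. The sketch below is not from the talk; whether the punch and the hole map are actually pushed over the wire depends on the file system, the protocol version, and the server.

    /* Punch a hole in a file, then walk its remaining data extents.
     * Servers or file systems without support return EOPNOTSUPP,
     * or simply report the whole file as data. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 2;
        int fd = open(argv[1], O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* Deallocate 1 MiB starting at offset 4 MiB, keeping the file size. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      4 << 20, 1 << 20) < 0)
            perror("fallocate(PUNCH_HOLE)");

        /* Enumerate data regions, skipping holes. */
        off_t end = lseek(fd, 0, SEEK_END), pos = 0;
        while (pos < end) {
            off_t data = lseek(fd, pos, SEEK_DATA);
            if (data < 0)
                break;                    /* only holes remain, or unsupported */
            off_t hole = lseek(fd, data, SEEK_HOLE);
            printf("data: %lld..%lld\n", (long long)data, (long long)hole);
            pos = hole;
        }
        close(fd);
        return 0;
    }

When the kernel NFS client supports it, these calls can map to the NFS 4.2 DEALLOCATE and SEEK operations; on older servers they fall back or fail cleanly.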
Starting point is 00:06:42 You can go out to the IETF site and look at the spec. Now, what's the status on some of these things? This is just review to sort of understand what's going on with NFS because many of us are in the SMB3 world. There are major layout types. NetApp guys love files, right? We have object and block. Each of these are developed by different vendors. The kernel server has added some support for PNFS. Layout stats and a new layout, flex files, were added in the last two years.
Starting point is 00:07:16 Flex files have been in since kernel version 4.0. Linux actually had sparse files for over a year and a half in NFS. Space reservations, labeled NFS. And the addition of copy offload is quite recent. Last I checked, IO_ADVISE and the application data holes were not in. I didn't look today, but I don't think they're in. Okay, so what are these layout types? It's kind of weird.
Starting point is 00:07:45 It's almost like every vendor added their own, but we have file, now flex files. So you can see Tom Haynes' presentation. Flex files is sort of file V2, if you want to look at it that way, layout type. Object and block. And there are others that have been proposed. So what is flex files?
Starting point is 00:08:03 It has a lot of improvements on top of the file layout. It lets you take Red Hat servers that are NFS, or Isilon servers, all these different types of servers, whether v3, v4, or v4.1, and you can spray your I/O to them. You can have the client do the mirroring: the client gets the layout from the metadata server, and the NFS client is then able to write the data to different places. So it allows you, in a sense, to use NFS v3 or v4 as your standard access protocol over the wire just to do the reads and writes of data, and you have a metadata server that understands NFS 4.2 and understands how to give out a layout for this data. Existing clustered file systems don't map real well to this, but it does allow you to create clustered file systems sort of out of PNFS.
Starting point is 00:08:59 So here's a picture of it, stealing a slide from Alex McDonald here. And, you know, you can do fencing, and I think it's an interesting view here, right? Your client goes to the NFS metadata server, gets a layout, and then is able to write one or more copies of that data to boring servers that really know nothing about PNFS.
Starting point is 00:09:28 So that's a quick review of the NFS stuff. Now, why do we care about both? Because they have unique and interesting features. Well, what are some of those unique and interesting? What's different? Obviously, NFS is more POSIX compatible. We're trying to fix that with the talk. As a matter of fact, immediately following,
Starting point is 00:09:43 we have a session where we'll get a chance to discuss your requirements in more detail on that. Things like advisory range locking don't map to SMB directly; you have to map through mandatory locks. Unlink behavior is a good example that works better over NFS. There are Unix extensions that Jeremy and I and others did for CIFS, but not for SMB3. There's no equivalent of PNFS and the ability to query a layout and get layout stats in SMB3.
Starting point is 00:10:11 It could be added, but we don't have such an equivalent. Also, NFS, for good or bad, tends to be layered in a way that in some cases makes it easier, in some cases makes it much harder. That layering on top of SunRPC means that it's harder for some features, some security features, but in other ways,
Starting point is 00:10:33 it's somebody else's problem, most of the security things like Kerberos. Which is a nice thing, right? Samba guys lose a lot of sleep over Kerberos, like every day. Label NFS, we could do that over SMB, but right now the attributes that flow over SMB are all the user attributes, not security or
Starting point is 00:10:53 trusted or any of the other categories used by SELinux here. SMB 3.1 though includes things like a global namespace. There have been proposals for a global namespace in NFS. They have never been accepted. Claims-based ACLs, obviously the clustering features,
Starting point is 00:11:13 witness protocol, the RDMA is much better, I think, in SMB3. And like I said, I think Tom gave a talk on that earlier. And, of course, there are many management protocols that are layered on top of SMB in useful ways and that match very nicely when you're trying to manage a Windows server. You sort of get a whole set of features, listing servers, managing servers, getting group information, user information, that is almost always considered present at the same time. Branch cache, shadow copy, sort of SCSI over SMB (MS-RSVD), these don't have equivalents in NFS.
Starting point is 00:11:47 And I think the multi-channel is really neat. No one wants to set up the headache of bonding multiple adapters together in NFS, but man, it's easy with SMB3. Adding adapters just works. Okay, so what's the best way to get these to integrate together? We've got to have Windows clients, Mac clients. I think Jeremy, you mentioned at Samba XP, we now even have Google shipping SMB client. SMB client even on your phone, right?
Starting point is 00:12:18 SMB3, all these protocols. What's the best way to get these? Chromebook. Chromebook. Okay, so your Chromebook laptops, right? We have a Linux-like OS that's shipping user space libraries to access this. And I think some of our marketing guys
Starting point is 00:12:37 were using SMB apps on their phone to access data. It's not just the Xbox and weird appliances; weird stuff uses SMB and NFS, like routers and NAS boxes. Should we do NFS over SMB? Should we do SMB over NFS? You have choices, right? We could have PNFS on the bottom
Starting point is 00:13:04 and just Samba sitting on top. We could have NFS sitting on top of an SMB client. These things are all possible in theory. It's funny, if you Google some of this stuff, you end up with Hadoop discussions where I really wish those Hadoop guys actually came to these conferences because they would learn how to do this much easier. But they have the same issues like, how do I get Samba on Hadoop? The more likely solution we're going to get to is a dual gateway over something like PNFS.
Starting point is 00:13:31 Now if you're in Ira's world, you're running on top of a different cluster, you're running on top of Ceph, and there are lots of people who would be running on top of Gluster. The cluster file system underneath varies and it has some of the same problems, but the most likely solution is we have Samba or something like it running in user space. I know we've had talks in this conference about kernel space servers. There have been a couple proposed for
Starting point is 00:13:56 kernel space servers, but here let's talk about Samba. Then of course you have NFS servers like the kernel server or Ganesha to serve your v3 clients. If you're doing PNFS, you can go directly to the back end. So what are the problems you have to solve? I think all of us have dealt with at least one of these problems. You have to deal with creation time.
Starting point is 00:14:15 You have to deal with these crazy DOS attributes. DOS attributes you don't think matter; actually, they do. We also have to have ways of dealing with security. I hate to state the obvious, but I vaguely remember some news stories about North Korea. You guys remember those? They broke into SMB3 servers, or no, sorry, CIFS servers, right? They broke into CIFS servers and did something. I mean, these are crazy stories.
Starting point is 00:14:45 Security matters. People actually care about ACLs. They actually care about this. And we can't just say, well, it doesn't matter. The lowest common denominator is fine. Directory leases, metadata performance matters. Leases, when you're opening
Starting point is 00:15:01 a file and it's not heavily contested, you should be able to cache it. Quotas, auditing: we should be able to allow our administrators some flexibility, no matter what protocol they're using, to configure easy-to-use quotas, right mouse button on their Mac, right mouse button in Windows Explorer, click on something. We have to deal with the fact that open has lots of things that happen; it's not atomic. We have to deal with the differences in byte range locking, and we have to deal with the problem that NFS can spray data across lots of different servers and it's kind of invisible to us when we're running in this kind of
Starting point is 00:15:36 environment. The data may be spread across 10 or 20 or more servers. In addition we have to be able to deal with UIDs. Ronnie may be UID 1000 because he was the first one added on that server. Ira may be UID 1000 on server 2, and UID 1003 on something else, and 1000 on a different one. So how are we going to map these things? Obviously we can use WinBind for these sorts of things, but it is painful. We deal with these problems all the time. We also have to take into account that there were significant security improvements added in Windows. Share encryption is very useful.
Starting point is 00:16:14 There's a reason that share encryption is required to access some of these very remote servers, some of these cloud-based SMB servers. Secure negotiate is better. What do we introduce in terms of security problems when we're running in this environment? And we talked about the how do we get consistent UID mapping. You know, we have three separate ways of naming you. You could be david at microsoft.com. You could be UID 1000. Or you could be some long SID in the Windows world.
Starting point is 00:16:43 We have to get these all right. Okay, so what if we have KNFSD or Ganesha exporting the same data? We need a good cluster file system. We need something like CTDB to handle starting and stopping services and to help with the non-POSIX state that the cluster file system can't handle. Okay, so what about CTDB and NFS? Can they run together? Does CTDB do anything with NFS? Yes. I think you guys probably who were in one of the talks yesterday
Starting point is 00:17:11 probably got a chance to see the config file. Here's a CTDB config file. Notice it manages NFS at the bottom. It can turn that on. When it manages NFS, what does that mean? Not as much as I'd like, unfortunately. But what it does mean is it can start and stop NFS automatically in the cluster. So when cluster nodes go down or go up,
Starting point is 00:17:30 you can manage IPs, you can move IPs. There are about 15 distinct CTDB NFS helper and event scripts and about 40 test files for testing CTDB NFS related events, starting and stopping, address takeover, grace period.
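For readers who did not see the slide, the sysconfig-style CTDB configuration of that era looked roughly like the sketch below; the paths are made up, the option names vary across CTDB versions, and newer releases drive the same behaviour by enabling or disabling event scripts rather than CTDB_MANAGES_* switches.

    # /etc/sysconfig/ctdb (sketch)
    CTDB_RECOVERY_LOCK=/clusterfs/.ctdb/reclock       # lock file on the shared cluster fs
    CTDB_NODES=/etc/ctdb/nodes                        # private addresses of all cluster nodes
    CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses  # floating IPs that CTDB moves around

    CTDB_MANAGES_SAMBA=yes
    CTDB_MANAGES_WINBIND=yes
    # the line being pointed at: let CTDB start/stop the NFS server
    # and take over its public IPs when a node fails
    CTDB_MANAGES_NFS=yes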
Starting point is 00:17:48 There are additional links here, which you can get from the presentation, to look up more information if you care about using CTDB in the NFS world. Of course, what we would like is things like deny modes. What we would like is for things like advisory locks to be able to be reflected in something like CTDB state, or at least a call out, so Samba and NFS could keep them closer. Now, let's go back to something very interesting: the CIFS client. Well, forget this NFS stuff. Just run over a CIFS mount.
Starting point is 00:18:15 You can actually do this, sort of. At least I did for a while. It does need some work, though. If you want to follow up on this, the file systems documentation on NFS exporting talks about it. If you tried it, as you saw today, it gets an error. Well, the reason it gets an error is because I have the CIFS NFSD export support turned off, and so the
Starting point is 00:18:32 export ops are not exposed, but there are tiny versions of those. You really, really, really, really should not NFS re-export a network share. There are many surprises. Yeah, yeah, I mean, NFS v3 is weird, but if you look at what export ops have to do, they have to deal with the mapping of the NFS file ID, right, they have to deal with get parent,
Starting point is 00:19:02 and there are like four or five functions you have to implement to be able to export well. Although it's technically possible to do, there's a reason that this is turned off by default, and the reason is explained very well in the documentation. If a client can also see the underlying CIFS share directly, you get surprises; you can corrupt data.
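For reference, the kernel hook table he is describing is struct export_operations; a placeholder sketch of the handful of functions involved is below (field names as in include/linux/exportfs.h of kernels from roughly that era; myfs_* is a made-up file system, and the stubs do nothing).

    #include <linux/exportfs.h>

    static struct dentry *myfs_fh_to_dentry(struct super_block *sb,
                                            struct fid *fid, int fh_len, int fh_type)
    {
        return NULL;   /* map an opaque NFS file handle back to a dentry */
    }

    static struct dentry *myfs_fh_to_parent(struct super_block *sb,
                                            struct fid *fid, int fh_len, int fh_type)
    {
        return NULL;   /* same lookup, but for the parent directory's handle */
    }

    static struct dentry *myfs_get_parent(struct dentry *child)
    {
        return NULL;   /* find "..", so disconnected dentries can be reconnected */
    }

    static const struct export_operations myfs_export_ops = {
        .fh_to_dentry = myfs_fh_to_dentry,
        .fh_to_parent = myfs_fh_to_parent,
        .get_parent   = myfs_get_parent,
        /* .encode_fh and .get_name are optional; generic fallbacks exist */
    };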
Starting point is 00:19:39 I mean, what's nice about this NFS over CIFS is that it is kind of an interesting experiment, because a lot of the cluster file systems have worse POSIX semantics than CIFS, or looser semantics than CIFS. It's interesting, if you want to play around some time, to do a grep of the kernel and see which ones implement export operations. It's a little scary. Here's the BTRFS example of it. Okay, so let's forget that for a minute and let's go back to thinking about how we would export KNFSD and Samba over a cluster file system, and what's necessary. NFS v3 and v4 can go to KNFSD or Ganesha, and Samba can take care of the rest.
Starting point is 00:20:19 Now oversimplified, the problems that Volker and Jeremy deal with every day, Michael deals with every day. If you can get it from a POSIX API, like the file size or mtime, you go to the POSIX API to get it. You may have a wrapper for that POSIX call, but basically Samba has no problem with this over something weird, a weird cluster file system.
Starting point is 00:20:40 If it's file system specific, you can have a VFS module return it. You can have something like Ceph that runs all in user space, so you don't even have to go down into the kernel at all. So you can have a file-system-specific VFS module, for Ceph, NFS, whatever, and it can return it. Now if that fails, if you don't have such a thing installed or if you don't find it there, you can get it out of an xattr.
Starting point is 00:20:59 So Samba heavily relies on things like EXT4, on writing to extended attributes (xattrs) for DOS attributes, among other things; creation time, for example, too. Go ahead, Jeremy. If this were really a good idea, why wouldn't somebody have written a standard VFS module that is purely an NFS translator? Oh, actually, somebody did. Is that you? No, it wasn't me. Yeah, so his comment was, well, why wouldn't you write a VFS module?
Starting point is 00:21:34 You could... Yep. Yep. Okay, so repeating the question and the answer: if this was a good idea to do for NFS, why wouldn't you just, like with Ceph, why wouldn't you just have a user-space NFS? We have pyNFS in user space.
Starting point is 00:21:52 We have your libnfs, right? Ronnie has libnfs, a very nice library. Why wouldn't we just put in a PNFS-capable user-space module? And the answer was, well, somebody did. But the kernel actually gets a little bit more love; the kernel NFS client gets a little bit more love. And to be honest, there are performance reasons why you do want to go through the kernel. I think that despite all of our headaches about the kernel,
Starting point is 00:22:16 the kernel does some things reasonably well. Now, it's an interesting question, but I don't think libnfs has PNFS support right now, right? There are other reasons for it as well. But you could, in theory, write a user-space NFS module that plugged in here. But the general thing is, if you can't find it in a file-system-specific module for Gluster or Ceph or NFS, we're going to try xattrs.
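Concretely, the xattr path being described looks something like the sketch below; user.DOSATTRIB is the name Samba uses for DOS attributes on local file systems, and the blob's internal layout should be treated as opaque and version dependent.

    /* Read the blob Samba stores for DOS attributes on a local file system. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/xattr.h>

    int main(int argc, char **argv)
    {
        char buf[256];

        if (argc < 2)
            return 2;
        ssize_t len = getxattr(argv[1], "user.DOSATTRIB", buf, sizeof(buf));
        if (len < 0) {
            /* ENODATA if never set; ENOTSUP if the file system has no xattrs */
            perror("getxattr(user.DOSATTRIB)");
            return 1;
        }
        printf("user.DOSATTRIB: %zd bytes\n", len);
        return 0;
    }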
Starting point is 00:22:47 But what if you don't have xattrs in your file system? This makes life difficult. You have to guess or emulate in other ways sometimes, or you have to emulate the xattrs; well, Samba can emulate xattrs. What about getting these attributes with better system calls? Those of you who were at the File System Summit or Vault probably heard the discussions on xstat, now called statx, to make life confusing. And I have a link here to the LWN article on it
Starting point is 00:23:15 that has the patches. It looked like it had general agreement. The last comment was basically, you know, an interchange between Dave Howells and Christoph that said, hey, shouldn't we get glibc to agree on this? Yes. I think. But he could never
Starting point is 00:23:31 get a response from them. On the other hand, there wasn't really any disagreement about the patches. This is a little bit frustrating because they patched the file systems to get birth time back and to get simple attributes back. Some people would argue that it should be simpler or more complicated, but basically it's annoying
Starting point is 00:23:47 because we're so close. Go ahead. I believe the patch set is not stalled the way we're suggesting, because Jeff Layton published a version of the patch set that actually lands statx support inside
Starting point is 00:24:03 Ceph. You can see this in GitHub; you can see the set of patches that are required for it. Yeah. So the point was that statx, or xstat,
Starting point is 00:24:18 is not just for EXT4 and NFS and CIFS, but it's also for some of these cluster file systems, and they see activity on the mailing list and in the Git repos that show that some of these cluster file systems are implementing StatX. It's been frustrating, of course, because for three or four years we've seen everybody agree on StatX, and somebody had a feature and it just stops for a while.
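For the record, here is roughly what that interface looks like from user space, assuming statx(2) as it eventually landed in the mainline kernel (4.11) and a glibc new enough to expose the wrapper; file systems that cannot supply a birth time simply leave the bit out of stx_mask.

    /* Fetch birth (creation) time without touching extended attributes. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct statx stx;

        if (argc < 2)
            return 2;
        if (statx(AT_FDCWD, argv[1], 0, STATX_BTIME | STATX_BASIC_STATS, &stx) < 0) {
            perror("statx");
            return 1;
        }
        if (stx.stx_mask & STATX_BTIME)
            printf("btime: %lld.%09u\n",
                   (long long)stx.stx_btime.tv_sec, stx.stx_btime.tv_nsec);
        else
            printf("file system did not report a birth time\n");
        return 0;
    }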
Starting point is 00:24:42 This would be so helpful for us in our world, because now NFS and SMB can both get the birth time of a file without relying on extended attributes. And that actually kind of makes a difference, because if you can cut down the emulated metadata that you have to write on every file, it really does help performance. And needless to say, the birth time should be on every file. What about this creation time? There was another set of patches that Anne Marie Merritt proposed; they're very small, actually, to turn on all of these fields in the NFS client and add a simple ioctl.
Starting point is 00:25:19 Here's a patch series from back in May, late May. And this was interesting because it's the classic example: this would help everybody, these small little things to add an ioctl to get these attributes out. And Christoph's comment was, use xstat instead. Well, I agree; we have two different ways to get this. If we don't get it, then every single file has to have stuff in an xattr or a TDB. Now, there's a whole set of attributes, right? Here's the whole list, unless you guys added one yesterday. Things like sparse probably get handled other ways. Compressed, I'd like. Encrypted, I'd like. But realistically, the ones that matter the most we have mostly mapped already through xstat. Now what about ACLs? We talk about ACLs in a painful, disparaging way, because we lose
Starting point is 00:26:17 sleep over it, but the ACL model between NFS 4.1, not 4.0, but the NFS 4.1 ACL model and SMB is close enough. There are problems and it is different. I mean, username and domain is not the same as SIDs. When we're talking about rich ACL, we're talking about UIDs usually, not username and domain usually. Rich ACL does make it easier. The last update on the rich ACL patch drama was about a month ago. I have a link to it here. Go ahead. I remember working with you at IBM about eight years ago, there was some really smart developer in India who put out a really good set of patches. Andreas took those over.
Starting point is 00:27:13 Pretty much every year it came up at the Linux file system summit. I think at least three or four times it came this close. So the question is, how do we get it? And suggestions are obviously welcome. Red Hat and SUSE probably have some leverage here. One of the things that would help a lot is having more file systems implement it. One thing I was very happy about, so we can get very cynical about certain political, philosophical things in the kernel. Deny ACEs are evil, for example.
Starting point is 00:27:48 Deny ACEs are evil, therefore we're not going to do something that's needed for Mac and Windows and NFS compatibility. Without being cynical, we should recognize Andreas got a lot of stuff in. I can tell you from experience he got stuff in, because I was sitting down at the plugfest trying to get some code written, and my code was gone because he had removed some dead code and cleaned up the xattr and ACL stuff. So his cleanup patches went in; unfortunately it caused me about a half hour of extra work trying to find where this code was.
Starting point is 00:28:21 But that's good, right? He cleaned up some dead code; he was making stuff better across the file systems, so his patch series is smaller. So, less reasons to deny it, but we still have this philosophical issue of deny bit reordering, and just for completion I put the Microsoft blog entry saying why they think deny entries belong in that order. Go ahead.
Starting point is 00:28:41 saying why they think deny entries belong in that order. Go ahead. Have people stopped laying their body down on train tracks to stop this yet? No. Okay. By the way, Jeremy is aware of more of this, I think. I'm sure you'd love to talk to more people about this, right?
Starting point is 00:29:01 I don't know. If you want, we can make the road better. I don't know. And to be fair, there are kernels out there and there are products out there that ship RichACL. And yes, we, you know, there is testing on RichACL. So this is, you know,
Starting point is 00:29:17 it's getting better. But it would be nicer if it weren't just everybody downloading a patch set from Andreas's Git tree. And by the way, there's a VFS richacl module for Samba. Ronnie, go ahead.
Starting point is 00:29:26 So basically, eight years later, we're still at the "over my dead body" kind of negotiation. I really only... Yes, yes, but there are very, very few bodies in the way. There are very few bodies in the way? Bodies, yes. Okay, no comment. In any case, from my perspective,
Starting point is 00:29:58 I have to be... The best thing I can do as a developer is to get all of the rich ACL prep done in my file system that I have some control over. The best thing Samba can do is update VFS rich ACL to make sure it works well with Samba 4.5 and master. The best thing the NFS developers can do is make sure VFS rich ACL continues to integrate well
Starting point is 00:30:20 in those patches. And they're actually fairly well tested. There were some changes that Trond and others merged in to make richacl a little cleaner. As individual developers, we can get all the other stuff out of the way. And, of course,
Starting point is 00:30:36 we can rely, in our products, on just patching in richacls. On the other hand, I really would like to get some agreement about why the heck somebody thought it was a good idea to have allow, allow, deny, allow, deny as the ordering of the mode bits. I don't know about you, but I sort of think that mode bits should be allow, allow, allow.
Starting point is 00:30:57 Go ahead, Jeremy. It's the only way to have exact POSIX semantics. So Jeremy's... It's the only way. And the reason for that is, in POSIX, you can have an owner with less rights than the group, or vice versa. So you can have...
Starting point is 00:31:18 Because of this specific order, you have to deny them before you allow them. Yeah. So Jeremy's comment was that for POSIX, the only semantic way to do it, because the owner can have less permission than the group, is to have deny bits that are,
Starting point is 00:31:41 in a sense, out of order, that are not all at the beginning. You would think, though, that you could put all the denies at the beginning. I've heard that claimed, and there may be a good reason for that. On the other hand, what it means in practice is that every file that's had a mode change by some evil NFS client,
Starting point is 00:32:03 every file that had a mode change by some utility running on the server is going to pop up with some warning in Windows Explorer when you edit it. If we use cacls, or we use smbcacls, it's much more polite; we don't display warnings with smbcacls. It is sort of annoying. It's not a problem for files that are created with SMB, but if we use this evil chmod tool, you'll
Starting point is 00:32:28 end up with warnings. It's not a big deal, but it may confuse some users. The goal, I think, in a lot of this is not to confuse users. Ideas are welcome on that one. We talked about XStat
Starting point is 00:32:44 integration. What about alternate data streams? I think Jeremy's favorite topic in the world is spreading viruses through alternate data streams, right? So we can emulate these. We could do what Macs do; we could put them in xattrs. But NFS doesn't have xattrs. So maybe the best thing is just not to support them on NFS. But there are a few apps that do require alternate data streams. What about witness protocol integration and clustering?
Starting point is 00:33:09 Well, that's kind of a topic that's in progress. If you listen to Michael's presentation and Volker's presentation, you've got a feeling for a little bit of the progress that's been made in witness. But I think that in a mixed world, CTDB and witness have to play better together. And in addition to this, there are some things we could do that are kind of
Starting point is 00:33:28 cool with PNFS, allowing Witness protocol events when the metadata servers go down or whatever are moved. Okay, just DFS, global namespace. I think it's largely independent of any of this. DFS should be okay in this kind of world.
Starting point is 00:33:44 Okay, so what about the Samba activity? What are some things that would help? Merging rich ACL. That would make our life a little bit easier. Merging the XStat, or StatX actually. Updating VFS rich ACL. It's a little bit out of date. I think we've sent some patches.
Starting point is 00:34:00 Dros Adamson may have sent some patches to Jeremy on that. There's a couple of little things where we have to do some more testing on that, but it's actually not that bad. And then, you know, xattr_tdb: I think we've talked on the mailing list about various lock ordering issues with xattr_tdb that show up. That may not be an issue in 4.5, though. So what about clone and copy range? Right now I think David
Starting point is 00:34:28 Disseldorf did a really good job with clone and copy, but it was BTRFS specific, right? Now you have NFS, you have other file systems, XFS and others, with patches for this kind of stuff, right? So we're going to need to figure out a way to extend the copy offload. You know, they're all kind of using the same ioctl, so maybe it's not going to be an issue, but
Starting point is 00:34:53 this was one that I was thinking about in terms of performance features, enabling it across a broader set of file systems. And that may already be done in 4.5, but I didn't notice it. Okay, and then the xstat enablement would allow us to significantly reduce the amount of traffic to these little tiny database files that store metadata that we can't put in the file system.
Starting point is 00:35:16 Alternatively, that NFS ioctl that Anne-Marie proposed should be very non-controversial. Those patches were very small, actually. I think they just got kind of forgotten about back in late May. In my view, there's no harm in having two ways of getting at the same data. If you had an xattr or you had an ioctl, well, that's fine.
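Returning to copy offload for a moment, the entry points in question are the clone ioctl that started life as BTRFS_IOC_CLONE (now FICLONE) and the copy_file_range() system call that went into kernel 4.5. Below is a hedged sketch of how an application, or a file-server VFS layer, might try them in order; whether anything is actually reflinked or offloaded depends on the file system and, behind a gateway, on what the far end supports.

    /* Try a whole-file clone (reflink) first, then fall back to a range copy.
     * FICLONE needs both files on the same clone-capable file system;
     * copy_file_range() (kernel 4.5+, glibc 2.27+ for the wrapper) may be
     * offloaded by the underlying file system or stay a plain kernel copy. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/fs.h>        /* FICLONE */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 3)
            return 2;
        int in = open(argv[1], O_RDONLY);
        int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) { perror("open"); return 1; }

        if (ioctl(out, FICLONE, in) == 0) {
            puts("cloned (reflink)");
            return 0;
        }

        /* Copy the first 1 MiB as a demo; a real tool would loop to EOF. */
        loff_t off_in = 0, off_out = 0;
        ssize_t n = copy_file_range(in, &off_in, out, &off_out, 1 << 20, 0);
        if (n < 0)
            perror("copy_file_range");
        else
            printf("copied %zd bytes\n", n);
        return 0;
    }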
Starting point is 00:35:37 So what are the key features that we think about in SMB3 and performance? Obviously, Tom thinks about RDMA every day. The copy offload, I think, is really cool. The compounding of operations, large file I/O, file leases, directory leases, and then various Linux-specific protocol optimizations that you could do, as with NFS, and also some of the fallocate features. Now, are any of these affected in the gateway environment where you're running NFS and SMB?
Starting point is 00:36:01 And the answer is yes, obviously. And what are the big ones? Well, I think the big one is leasing. But just at a very high level, what are the things that you see generally in this environment? You see more traffic, obviously, if you're going through a gateway, because you're seeing at least twice as much traffic, actually more, because you're seeing traffic to your Samba server and then traffic out the back to your cluster. If you're writing directly to your cluster, you're obviously seeing less than half as much traffic.
Starting point is 00:36:29 But most of that would go away for less contended files if we had lease support. What's the problem with lease support? The big problem with lease support is NFS only supports it on the wire; it doesn't expose the API on the client. Now, what we did in CIFS for this was, if you asked for a lease and we already had a lease, we'd give you a lease. And then of course if we had to break a lease, if an
Starting point is 00:36:51 oplock break came in, we'd break the lease. NFS could do the same thing, and we've discussed that kind of patch before, but I think they wanted something a little different. So that'll be an interesting thing to argue about, because realistically, 90% of the time you have uncontested files: you already have a lease, some app like Samba asks for a lease, give it to it, and then break the lease if needed.
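To make the lease point concrete: against a local file system, a user-space server like Samba can already ask the kernel for a lease with fcntl(F_SETLEASE) and gets a signal when it must give the lease back; what is missing is an equivalent path through the NFS client. A minimal sketch of the existing local API follows (a production server would use F_SETSIG and proper async-signal handling).

    /* Request a read lease; the kernel raises SIGIO when another opener
     * conflicts, and we must release the lease within the grace time
     * configured in /proc/sys/fs/lease-break-time. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t lease_broken;

    static void on_lease_break(int sig) { (void)sig; lease_broken = 1; }

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 2;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        signal(SIGIO, on_lease_break);            /* default lease-break signal */
        if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0) { /* ask for a read lease */
            perror("F_SETLEASE");                 /* not the owner, or unsupported */
            return 1;
        }

        /* ...serve cached data here; when the kernel wants the lease back... */
        while (!lease_broken)
            pause();
        fcntl(fd, F_SETLEASE, F_UNLCK);           /* give the lease back */
        close(fd);
        return 0;
    }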
Starting point is 00:37:12 Also, copy offload should be relatively non-controversial to deal with. Now, here's the big thing: I don't know of any way to deal with directory delegations. Directory delegations, the Microsoft guys in one of the presentations I remember saved about one-third on performance; the metadata operations were so much faster. So it matters if you had things with large directories, things that involved stuff that Samba doesn't do particularly well.
Starting point is 00:37:41 Samba does not like million-file directories. Directory leases would help a lot, but we don't really have an equivalent for that in NFS. Well, NFS will cache; it'll cache for seconds, but it really doesn't have a concept of directory leases, and that makes it hard for metadata caching. And that's something that we...
Starting point is 00:38:00 It does metadata caching. No, I mean directory delegations. Or not? I don't see it. It has it, but it doesn't implement it, right? Yes, but it's limited. Right, right, right.
Starting point is 00:38:15 Yes, thank you. So a minor correction to what I was saying. The NFS 4.1 spec has directory delegations. The NFS client and server don't implement directory delegations. If we implement directory delegations, it helps in our world more than it might in other workloads because Samba doesn't handle large directories and metadata queries well. It doesn't scale as well.
Starting point is 00:38:41 Now, there are other things that are kind of interesting to think about in this world. Notify. I think all of you who've used Samba notice that when you launch Samba, you don't just get SMBD, you get SMBD notify D. You get a process that's sitting there looking for notify events. That's kind of cool in a cluster. It'd be interesting to think if that could help in the NFS world as well, how we might optimize that. In terms of CTDB traffic, apparently NotifyD has some effect on the CTDB traffic. Figuring out in a clustered world where we're exporting NFS v3 and Samba, there are optimizations maybe. It's certainly worth looking at. I think one of the things that could be done
Starting point is 00:39:19 that would make a lot of sense is just simply to look at wire traffic more. Drill down one level on the wire traffic when you do a typical operation: launching vi, let's say, or launching some application over NFS v3. Looking at it over PNFS, that's not our problem, right? But for the Samba guys: now let's bring up Word, look at it, open a file, and then see what NFS traffic is sent on the far end.
Starting point is 00:39:54 So seeing the PNFS traffic that comes out underneath Samba from the NFS client when you bring up Word and open a document: this is useful stuff. I've done some of this. But certainly when we have support for leases, and when we have some of the additional features that we talked about, like the ioctl to query these additional attributes that the spec supports and the client can support, but that we don't have an API to get to right now until we have xstat or Anne Marie's API.
Starting point is 00:40:19 Because right now, this is a workload we can't ignore. We can't ignore the fact that SMB and NFS are run together. And it's important that we better optimize this and of course also put pressure on the kernel developers to fix the stupid APIs. I think all of us here have particular kernel features we need and guys like Jeff Layton have been very good about driving some of these gradually over time. If you think about the kernel VFS API, five or six of those APIs came because of Samba or network file systems. We do have some influence despite our cynicism about it being eight years sometimes.
Starting point is 00:41:00 Nine years. Testing. Here's one of the things I really like about this: xfstests runs over CIFS mounts, xfstests runs over NFS mounts, and Ronnie's plugins, libnfs, right, run underneath multiple... So you can run infrastructure like yours
Starting point is 00:41:21 over multiple protocols. So some of the scalable infrastructure for testing load over NFS and SMB is actually possible, even if you had an all-Linux world. But on top of this, you have a nice Microsoft functional test suite, you have pyNFS, you have smbtorture. The test suites are actually reasonably mature.
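As an illustration of that reuse, the same xfstests tree can be pointed at an NFS mount or at a CIFS/SMB3 mount just by changing local.config and rerunning ./check; the sketch below uses made-up hostnames, exports, and credentials, and variable names may differ slightly between xfstests versions.

    # local.config for an NFS run
    FSTYP=nfs
    TEST_DEV=server1:/export/test
    TEST_DIR=/mnt/test
    SCRATCH_DEV=server1:/export/scratch
    SCRATCH_MNT=/mnt/scratch

    # local.config for a CIFS/SMB3 run against the same server
    FSTYP=cifs
    TEST_DEV=//server1/test
    TEST_DIR=/mnt/test
    SCRATCH_DEV=//server1/scratch
    SCRATCH_MNT=/mnt/scratch
    TEST_FS_MOUNT_OPTS="-o vers=3.0,username=tester,password=x"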
Starting point is 00:41:41 What we don't have is a couple of those little pieces I mentioned earlier that would make life easier. Obviously, it works. Obviously we can do lots of wonderful things, but this could be much better. And I think a lot of these same things we discussed will also help Ceph. We had that discussion earlier in the talk, where the activity Jeff was doing for statx in Ceph will also be helping other file systems. So many of these are shared problems, and I think very interesting problems. And I think we need to think about, like, take Ronnie's example,
Starting point is 00:42:14 how to keep extending this infrastructure for better scalability. I think we had a talk yesterday on NAS testing. Some of this will apply to this kind of environment particularly well, because right now most of our test cases are focused on one protocol and contended data. It's a little tougher to test. Now some of the guys at EMC remember Pike. They gave a presentation I think three years ago on Pike, and they I think used that infrastructure on both NFS and SMB, if I remember their presentation correctly. But there hasn't been a lot of infrastructure out there that's open source that does protocol-specific operations in one protocol
Starting point is 00:42:53 and then does a protocol-specific operation in another protocol to try to contend or break. What they tend to do is things like, I open a file here, I open a file here, using the standard POSIX API, which is harder sometimes to get at some of these difficult problems with compatibility. So we could have some improvement there. And I think that the good news is that it tests out pretty well. The bad news is the cross protocol stuff could be expanded. Go ahead, Ira.
Starting point is 00:43:19 Wouldn't you have to define what success and failure was in order to test? Yes. So you'd have to define what success and failure was, in order to test; that was his point. And I think that's true, because I'll give you a great example: advisory locks. Do you want an advisory lock to fail if you have a mandatory lock? Do you want a mandatory lock to fail if you have an advisory lock?
Starting point is 00:43:42 Should your Windows lock fail if the POSIX client has an advisory lock out there? Jeremy says yes, and I agree with him. I'm sure there are people who disagree, but I agree with Jeremy on that, that it should fail. I tend to disagree. So, now, if you run with Ceph... and no, never mind.
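For anyone less steeped in the distinction being argued about: a POSIX advisory byte-range lock only excludes other processes that also ask for one; nothing in the lock itself stops an uncooperative reader or writer, which is why the cross-protocol mapping onto SMB's mandatory locks is a policy question rather than a mechanical one. A minimal sketch:

    /* Take an advisory write lock on the first 4 KiB of a file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 2;
        int fd = open(argv[1], O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct flock fl = {
            .l_type   = F_WRLCK,   /* exclusive */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 4096,      /* bytes 0..4095 only */
        };

        if (fcntl(fd, F_SETLK, &fl) < 0)
            perror("byte-range lock refused"); /* another cooperating locker holds it */
        else
            puts("got advisory write lock on bytes 0-4095");

        /* a process that never calls fcntl() can still read or write here */
        close(fd);
        return 0;
    }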
Starting point is 00:43:58 and no, never mind. But seriously, there are file systems with different expectations about loose coherence versus strong coherence. There are file systems with different expectations about loose coherence versus strong coherence. There's file systems with different expectations about what should happen when you try to delete an open file. And one thing that I think was mentioned maybe in Jeremy's talk was we're going to a world where cloud storage matters.
Starting point is 00:44:23 In the cloud world, you have to have really loose semantics sometimes to get anything to work. There are problems when we're in that kind of world, right? Because we can't, the coherence is much looser in that world. We tend to write a whole bunch of stuff at once.
Starting point is 00:44:42 So, you know, mtime consistency, maybe not, you know, I don't know. These are interesting problems. But yes, we have to define what success means in some of these cases. And we don't tend in IETF RFCs or Microsoft documents to talk about, this is what the spec says. This is what we kind of recommend
Starting point is 00:45:04 you do if you're thinking of testing this because maybe we should. And this has come up before. What do we think best practice would be if you tried this kind of operation? We mentioned in the Microsoft specs, we mentioned Windows 10 client returns 128 here. But we don't tend to say best practice would be you should try to do this.
Starting point is 00:45:28 Sometimes, but a lot of times we don't. And in the case of contended locks cross protocol, that was one that we've come up with in the past. And certainly at Connectathon we've had these arguments before. Okay, we're reaching the end of the talk and soon we're going to be following with
Starting point is 00:45:44 the panel discussion about POSIX extensions. Do you guys know how much time before that one starts? About ten minutes? Yeah, so we've got a couple minutes for questions if you guys wanted to talk about it. I think some of you guys may have even more experience with this world. Go ahead. So one of the problems we have seen when dealing with cross-protocol issues... Right. Right.
Starting point is 00:46:21 Right. That's not really a problem because we're looking at a grace period which is a response to a similar function like this. And we, they have to honor each other's grace period to do it right. So that's something
Starting point is 00:46:30 that I don't see as possible. Yeah, so that grace period discussion is really interesting for a couple reasons. Imagine that you're trying to do lock recovery like you're discussing. CTDB can trigger
Starting point is 00:46:44 this, it has a little hook for Ganesha for grace recovery, right? CTDB can trigger this. It has a little hook for Ganesha for grace recovery, right? But that doesn't help you if you're going directly with PNFS to the back-end file system. And also, we don't support... We support persistent handles, sort of, but this could relate to persistent handles. Grace lock recovery on a file
Starting point is 00:47:05 which has a persistent handle opened by Windows might act differently than one that doesn't. And I think Volker was about to respond on maybe some more detail on what he thinks. I mean, there are... This is what I also see, that channels, variations, and so on, we need to talk to do those things. Yep. So we need... Yeah. some kind of logging and so on. Because what we need is really, first we need to define the semantics of the channels.
Starting point is 00:47:50 NFS has their own set of channels, whatever they call them. But, for example, NFS doesn't have file share modes. We need to first sit down with the NFS guys and define what it actually means, what the error code should be for whatever is going on; then we need this in user space. And then we can go to the kernel guys and say we need an API. So remember the good old days
Starting point is 00:48:14 when we had weird return codes that were added 20 years ago, like JUKEBOX, that made no sense? Jukebox, yeah. Yeah, I mean, exactly. Add EVOLKER, and that would allow you to communicate some of this information. One of the things that's so strange is that the conversations between the Ganesha team and KNFSD and Samba...
Starting point is 00:48:34 There is no communication. Yeah. There are no conversations. Yeah, now to be fair... There are no public conversations. There are conversations. I know they exist because I've had them in the background with some of the Ganesha authors at this point. There are a few people thinking about it, but the problem
Starting point is 00:48:53 comes in when there's a third access method nobody's discussing at this table, and that access method is called local files, or kernel map, or views. And once you bring that guy into the picture, all this user-land integration stuff, especially if you're talking about, in one case, Ceph, where you're talking about a kernel client, it doesn't play. For various file systems, you can or can't try this in user space, or they may not want to integrate with CTDB if you're talking about
Starting point is 00:49:22 user space, you know, if you're talking about at that point integrating Ganesha, yeah. But that shouldn't prevent the Ganesha people from talking to someone. No, it doesn't, absolutely; that should be happening. There are some people having those discussions. But the thing is, whenever people have it in terms of a product they're working on and the work they're doing, they always end up with this third thing coming in on the side and hosing up everything. Well, it's not just the different processes running on one machine.
Starting point is 00:49:56 If you've got a cluster environment, you also have to deal with failover. Well, in fact, I gave a talk on this seven or eight years ago here at SDC, about doing cross-node coordination of grace periods over the multiple nodes in a cluster. And like you said, the hardest thing is to coordinate with stuff that's just running locally on the actual file system, and we came to the conclusion that we really couldn't solve that part of the problem.
Starting point is 00:50:32 Who, me? Oh, I work at... That was me. Sorry. You had a... Yeah, sorry. I couldn't really tell where that was coming from...
Starting point is 00:51:00 ...put some code into the file system driver to assist with that. But even then, the point was, well, if you're accessing the same files and taking the same locks through a remote client and through a local program, it's still possible to screw everything up. I mean, the sad thing about this discussion, right, is that getting these changes across 20 different file systems isn't possible.
Starting point is 00:51:42 Getting them across things like Hadoop, right, which has no idea about any of this discussion; they barely know what Samba or PNFS is sometimes, right? Some of these developers are in a very different world. They're programming in Java, not in C or Python. So one of the things that's fascinating about this is, can we drive tiny changes? Because we know that big things like richacl
Starting point is 00:52:04 don't go in; 20-line or 50-line changes to the kernel do go in. So, adding 50-line kinds of changes: is there something small that can be done to delay contended I/O from local access while grace is going on, or something silly like this, or prioritize differently? You would think that small changes to the kernel might be helpful. Coming back to that, we did it with a private API, because we wanted to ship what we were doing in a reasonable amount of time, so we didn't want to take the time to try and get a native API in the kernel to change the behavior
Starting point is 00:52:58 of the process device. Yeah, and then, you know, when you think about lease behavior, blocking leases, it was doable, and it was driven by Samba, right? It was driven by Samba needs. But I think you're right that it could take years. We still don't want to forget these thoughts because we're going to come back next year and have the same problem.
Starting point is 00:53:12 Yeah, and that's the thing I know we're guilty of over and over again: oh, well, we want this in the kernel, but we can't get it right now, we'll do something else, and then we forget about it. Yep. We never go back and try and get it into the kernel later. Well, this is why we need to solve the protocol problem in user space. Yes. And that's at least my view of things.
Starting point is 00:53:34 Yeah. I thought, yes, we are lacking the cross-protocol tests. We are lacking whatever, CTDB or whatever, some of the CTDB pieces. But if we can solve the NFS and SMB problem, I think we are pretty far along, because we know what the respective locking semantics and error codes ought to be.
Starting point is 00:53:58 Right, and you would at least know what kind of operation it needs. Yeah. What's your concern? And I mean, the other alternative is... So, by the way, Volker's comment, let me echo that. So, Volker said, we have lots of ways to solve this. One is we could deal with the user space code, get the user space code working, and then drive that out more broadly.
Starting point is 00:54:26 Another way is just to make SMB3 workable for all workloads. Because, I mean, we really like this. But that's a great segue to Jeremy getting on stage to talk to you about protocol extensions, and not just the protocol extensions for POSIX, but hopefully for things that are pragmatic in making all of our workloads better. Okay, so back to Jeremy. Yep, so we got five minutes for coffee breaks or something. Thank you, Steve. Good job.
Starting point is 00:55:14 Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
