Storage Developer Conference - #63: What’s new with SMB 3?

Episode Date: February 7, 2018

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast Episode 63. Hi, good afternoon. And my name is Matthew George,
Starting point is 00:00:44 and I am a developer working at the SMB team in Microsoft. And this is David Kruze. He's the lead for SMB, again, in Microsoft. And today we are here to talk about SMB3. It's more of a two-part talk. The first part giving you a status update on where things are and what are some of the new things which we have done for the upcoming Windows release. And then the second part, Dave is going to talk about some of his experiments, trying out stuff with SMB Direct, new scenarios for SMB Direct. So this will be a two-part talk.
Starting point is 00:01:24 So where are we now? So it's about five years. SMB3 was initially released in 2012, so it's about five years now. So it was designed with three pillars in mind, so scalability, performance, and continuous availability. And though it was designed as a file sharing protocol in mind, over time, people have invented very interesting and new uses for SMB3. So even though much in the
Starting point is 00:01:56 core protocol has not changed since 2012, we have added a few extensions like new signing algorithms and encryption and stuff like that. But the core of the protocol is still the same. But what's interesting is newer applications have used SMB as a data transport protocol or a file sharing protocol in interesting ways. So there are a couple of new scenarios I would like to call out. Storage Spaces Direct is something which you guys might have heard of. It's actually built on top of SMB3 as a transport protocol. They remote disks on multiple machines in a cluster, and they bring a unified view of all the disks on the cluster so that any node can access any disk.
Starting point is 00:02:45 And then you can actually build a file system on top of Storage Spaces Direct. So Storage Spaces Direct uses SMB as a transport protocol for blocks. Think iSCSI replacement. And they did it because SMB3 did provide authentication. It did multichannel and RDMA and all that. So they got all the features for free, and they exposed the disk as a file, like a virtual file.
Starting point is 00:03:12 And more interesting, then of course you know about the Azure files, which is basically a SMB3 front-end to cloud storage. And more interestingly, in the last one or two years, SMB 3 is being used inside Microsoft as inside of containers, a guest to host file access protocol for containers. So if you have a container like a Hyper-V container running inside of Windows on a Windows host, they actually use SMB 3. This is not across a physical TCP IP transport. They actually transport SMB over the VM bus channel between the guest and the host.
Starting point is 00:03:52 So there are some interesting uses for SMB there. And then we are also trying to explore some uses of SMB Direct to provide access to PM persistent memory devices on the server. Now, before we talk about anything else, I just wanted to give you a status update on where SMB1 stands. And we have been kind of heavily talking about removing SMB1 and how it should not be used and so on for the last four or five years. But finally, with the upcoming update of Windows, SMB1 is going to be off by default. So for the last two releases, for the most part, but for the last two releases, they are optional components, so people could uninstall them. So now with the fall 2017 update, the SMB1 server is off by default on all Windows queues.
Starting point is 00:04:51 So there's no more SMB1 server. You can always optionally install it. It's an installable component which shows up in the UI, so you can enable it if you want, but it's off by default. And client will be off by default on all server and enterprise queues. Unfortunately, we can't quite turn it off on home queues because there are still a boatload of NAS boxes and stuff like that out there which need SMB1 and we don't want to break those people.
Starting point is 00:05:18 So it's going to be there on SMB1 clients. But if you disable SMB1 explicitly and then you upgrade, then what we do is we preserve the previous state of SMB1 and then we also did some work to check for SMB1 usage because most people who have home boxes, they don't probably have NASes or whatever, so they have this SMB1 usage, because most people who have home boxes, they don't probably have NASes or whatever, so they have this SMB1 installed by default, but if we do not detect any usage for, like, 15 days,
Starting point is 00:05:52 two weeks, right, 15 days, we would just go ahead and uninstall SMB1. Is it two weeks online? No, I think it's an absolute... Is it two weeks online is the question, but, no, it's an absolute... Is it two weeks online is the question, but no, it's an absolute. It's an on time for the machine. It doesn't include the sleep time.
Starting point is 00:06:11 So, yeah, it's two 15 compute hours or whatever. Yeah. So hopefully this is getting us a little safer, and then over the next couple of releases we can turn this off on home SKUs also. So we are a good way there towards deprecating SMB1. So today's talk, the rest of the talk, focused on two major changes to the protocol to support two new scenarios which we have. First one I call synchronous share redirection.
Starting point is 00:06:46 So let me just give a brief recap on what scale-out shares are. As we know, SMB3 supports continuously available shares and then what are called scale-out shares. So they are shares exposed on all nodes on the cluster. They front the same set of shares. And typically you can have a shared disk with a clustered file system behind backing the share. And that's what I call by...
Starting point is 00:07:13 And there are two types of scale-out shares. One is symmetric, which is backed by typically a shared disk and a clustered file system. And the clients can come on any node and then talk to the file system, access their files, and so on. And this requires a shared disk or a SAN or something like that.
Starting point is 00:07:36 And then we have what are called asymmetric shares. And as we have seen in the last three or four years, the asymmetric shares seem to be the ones which get deployed more, primarily because, especially because of storage spaces, you don't need shared disk infrastructure. You can actually use local disks and then use spaces to kind of move the disks, expose the disks out to all the nodes. So this is a mode in which a disk is, the share is there on all nodes, but the disk is bound to one node, or it could even be a group of nodes, but typically one node. And the client can still connect to any node to do I-O to the share, but behind the scenes, if the client arrives to the wrong node, the file server will redirect traffic the right way to the correct node to do I.O. And you also have the witness protocol to so-called lazily move the clients to the right node
Starting point is 00:08:35 so that the server doesn't have to do the back-end redirection. So this is where we are, and we went a little bit further now to make the client move to the right node synchronously, in the sense that witness was a lazy protocol by which clients could move to the right node, but this is a mode in which you can actually force the client to move to the right node. So how this works is that the server will reject the client's attempt to connect to the wrong node, and it will redirect it to the right optimal node to which it has the direct file system access. And obviously this requires the client and the server both to support this. So the client conveys to the server using this new tree connect flag,
Starting point is 00:09:30 which says the client has the ability to disconnect from the server and move to the new node synchronously. And then the client is connected to the wrong node or the non-optimal node. The server will fail the attempt to connect to the share with an error, and then along with the error, it will return an extended error context. So error contexts were introduced as part of 3.1, I think, SMB 3.1. Servers can return arbitrary error contexts to SMB, and we have added a new error context which will basically tell the client where to move to. It's basically a list of IP addresses to put it simply and
Starting point is 00:10:11 now the client will process the error and then it will reconnect to the right server and then resume IO from there. So this is what we call a synchronous redirect mode. And the new MS-SMB2 spec, which is not yet released, but Tom Talpe tells me that it will be released in three days on Friday, so 15th, just after we leave the conference so we don't get to ask questions. This might be in the preview. There's a preview copy over there. Sure, yeah, there's a... This one might be in there.
Starting point is 00:10:49 Yeah, so another couple of days, and it will get released. Yes? If you second the list of IP addresses final, is that going to cause problems with getting Kerberos tickets that you redirect the server? The question is, if you send back IP addresses, will it cause problems with Kerberos tickets? The question is, if you send back IP addresses, will it cause problems with Kerberos tickets? And no, we still use the same SPN to connect.
Starting point is 00:11:12 It's just like Witness. Witness sends a list of IP addresses, so we do not change the SPN used to authenticate. That's still the same thing, but it's just a hint to the client to connect to this set of IP addresses. So why did we do this? The first thing is we do not have any hard dependency on the witness protocol to move clients, so this is built in line in the SMB protocol.
Starting point is 00:11:40 And the bigger advantage which we see is the server no longer needs to do back-end redirection. So if you have the previous mode, the asymmetric mode which we previously implemented, the server needs a redirection mode at the back-end to redirect traffic, and this makes the file system design on the server a little bit more complex than it needs to be because the client anyway has the ability to move to the right node, so why not just
Starting point is 00:12:08 force the use of it. I think over the long run it will simplify the backend. And then the other thing is failure modes are simpler because you don't have this two-leg I.O. going through, so the client talks to the right node, so it's much easier to diagnose things. And there are a few other things that you can do, like op locks were kind of hard with redirection, so you could potentially get improvements there, too. So it's a small change, and this will be done. So the only new thing here is the new bits for the tree connect request and the error response.
Starting point is 00:12:49 Other than that, there's a behavior change on the client and a little bit of code change on the server to return the error and so on. The second one is a little bit more interesting, what we call identity remoting. So the scenarios which kind of made us do this work are primarily two. So SMB3 is being used currently to support what are called these infrastructure shares, basically as a data share for hosting VHDs, so to say,
Starting point is 00:13:25 for VMs. And this is a multi, if you think about a multi-tenant VM hosting scenario, the VHDs reside on the server, and the VHDs themselves are actually using the tenant's identity of sorts. Now the host, the Hyper-V host, which opens the VHD, they authenticate using the tenant's identity, and access is granted to the VHD based on the tenant's identity, again using the authenticated identity. Now what happens is once the VHD is opened, the files inside the VHD are managed completely by the client. They have no, the server does not really play a part in it and basically access control is maintained, is done by the client itself.
Starting point is 00:14:14 There's no server involvement in that. Now what if we say that we remove the VHD container and let's say put the, create a file share per tenant and put the files for the tenant there. So the share is now equivalent to a VHD. It's provisioned for a tenant. And the tenant's identity will get you access to the share. So you do your normal SMB authentication using tenant credentials. You get access to the share. But once access to the share is granted, the client can potentially do two things.
Starting point is 00:14:51 You have a file system on the server which is very well capable of evaluating ACLs and so on. So one way of doing that is for the client to somehow channel or remote the application identity. So you have a tenant, you have multiple users or applications running inside the tenant's context. You can remote the identity to the server, and the server can actually do evaluation of ACLs and so on using the application identity rather than the channel. So it's what we call by remoted identity or tunneled identity. That's the first word we came up with.
Starting point is 00:15:30 And then we, I think eventually in the document it's called remoted identity because tunneling has various meanings in various contexts. And the second scenario I would like to call out is containers. So container shared volume. So folks who are familiar with Docker know that you can actually mount a share from the – mount a volume or a namespace from the host and then expose it into a container. And you can do this with SMB shares too. So there's a new global mapping functionality inside, sorry, in the redirector which now allows,
Starting point is 00:16:14 it's kind of like a Unix mount where a mount is accessible to everybody. It's visible to everybody and it's mounted using a single set of credentials valid for the machine. So anybody in the machine can access the mount. So here I'm giving you an example where I have mapped the G drive to server share using a specified set of credentials.
Starting point is 00:16:33 And then I can map a directory from G drive into a container. So I've given like a Docker run web server. So the G colon container data is now mapped into C colon app data inside the container. And now with tunneling, you can potentially tunnel the identity or remote the identity inside the container through the SMB mapping to the server. So the server can actually do actual evaluation based on the app identity rather than the container identity. So the first question people would ask is, isn't there a few tunnel identity across without an authentication protocol? Is there a privilege escalation?
Starting point is 00:17:20 You could potentially say that you are admin or local system or whatever. But the thing is, access to the share is granted via a secure authentication protocol. And then the second thing is it's not applicable to all shares. It's only for specific shares which are these kind of container or infrastructure shares. And they are explicitly marked and tagged that way, otherwise server won't do this. And obviously you don't want to do this on like RPC calls or IPC dollar shares or something that's dangerous. And as long as the server guarantees that the client can't escape the share,
Starting point is 00:17:59 you're good because the share is for one tenant. It doesn't matter, really. So let me quickly talk about the protocol. So there are a few changes here. First is a tree connect context. So essentially to tunnel this identity, we require to send the kind of the serialized identity to the server. So we thought the best place to do that was in the TreeConnect request.
Starting point is 00:18:29 And there have been a few other cases where we wanted to send additional parameters on TreeConnect. So along the lines of negotiate context, we decided to extend the TreeConnect request. And this probably is useful for other, like, POSIX kind of negotiation per share so we could potentially extend it and it's done by adding an extension to the tree connect request and it's backward
Starting point is 00:18:55 compatible. There are a couple of things here. There's a new flag which says an extension is present and then the gray box there, the table there in the middle is the extension that sits in between the end of the request and the path name. And since there's a path name offset in the structure, older servers which don't understand anything
Starting point is 00:19:19 about tree connect context should still work because everything still looks the same. There's still a little bit of data in the middle which will be treated as padding, and then there will be tree connect context at the end which will be ignored anyway. So it should be, we have tested out our servers, and I believe, I don't know whether we have explicitly tested, but we believe it should be backward compatible. And following the Preconnect extension, there are a list of Preconnect contexts. These are exactly similar to the Negotiate context.
Starting point is 00:19:52 They have an ID and whatnot. So the usual... So if you have done Negotiate context, this is exactly the same. So you can chain these to the tree connect request. Yes? . The question is why didn't we choose GUIDs rather than 16-bit? Why did we choose 16-bit context? We just copied the same,
Starting point is 00:20:27 picked up the same structures from the negotiate extension. I hate the text. I know, the create context add variable length names. Yes. Well. Admin. Okay. We can talk about it, but that's what we've chosen now. All right. Obviously, if you want to send a remote identity context,
Starting point is 00:21:05 the T-Connect request has to be signed, and it already has to be signed for the pre-auth integrity stuff anyway. So it's just calling it out. And then the context ID, there's a new context ID and it basically contains a serialized representation of the application's identity. So think of it
Starting point is 00:21:20 as all the fields that you need to evaluate an ACL. So most of the fields that you need to evaluate an ACL. So most of the fields there are similar to what an ACL contains. In Windows parlance, I think it would match what a token is, basically, all the fields in a token. So basically, it maps almost directly to an ACL. And the file system can now use this standard context to enforce access control. And the details of this will be, I'm not going to talk about it because it's a fairly big structure. So the protocol documentation has most of the info on it and pointers to other documents like MS, DTYP, and a few other references which give you these structures, but the documentation will cover that.
Starting point is 00:22:17 And how is it used? So I talked about this before, but client establishes a single session to the server using the tenant's credentials by setting up a global mapping. So it's a global map drive for the whole machine. And then every user session on the client will share the same global session, but it will issue a new tree connect for every user, and then it will tunnel the identity across.
Starting point is 00:22:44 And the server grants access to the share based on the global session. So the share access is controlled by the client session and the file system access is controlled by the tunnel identity. So that's the difference. So to get into the share, you need to go through the normal Kerberos and TLM, all the typical other. And nothing changes with respect to signing and all that because they're all based on the session. And this is a quick summary of what I've talked about. So there's increasing use for SMB for tenant infrastructure kind of scenarios,
Starting point is 00:23:22 whether it be like VM hosting or containers. And then SMB is being used for container guest host access over VM bus, and then it uses it even for RPC, I believe. And then one last thing I would like to call out is I don't know whether folks have implemented PKU2U. It's been a supported SP Nego extension since, I think, 2012. And it was primarily used for live ID, like the Hotmail online ID authentication. But it does provide a mechanism for certificate-based authentication. That's what it does. And I would expect to see more scenarios
Starting point is 00:24:09 based on PKU2U for simple client-server scenarios. So think certificate auth between two machines without a live ID or an online ID. So expect to see more use of this. And this kind of ties into the NTLM deprecation strategy also because peer-to-peer right now there's nothing other than NTLM, and NTLM is getting kind of old. So this is possibly going to be a replacement for NTLM
Starting point is 00:24:38 or an alternate way of doing peer-to-peer sharing going forward. So let me hand over to David who is going to talk about some of this. So before we go, are there any questions on the protocol side? Because we could go through those now before we transition. So, yeah. A couple of slides back to SMB1. Do you guys have a replacement for LAN and LAN 7s? There's only one
Starting point is 00:25:06 that still requires SMB1. Yeah, and so the question is, I think what we call the computer browser service, is there a replacement for computer browser for discovering machines on the network? And the answer from my team would be no. There have been other
Starting point is 00:25:21 like, what is it, like UPnP discovery and a link local discovery service. There have been a couple of solutions, it, like UPnP discovery and a link local discovery service. Like there have been a couple solutions, but none of them are really turned on by default in all cases, and not all of them have down level interop and stuff. So that is gonna be an interesting point, because with the SMB1 install stuff,
Starting point is 00:25:40 browser installs with the SMB1 server, based on, that's where announcements come, which means when we said we the SMB1 server based on that's where announcements come, which means when we said we pulled SMB1 server by default, that means the browser is not present on those machines either. Yeah, now they should be able to act as a browser client provided the browser server is running on some device on the network. So they'll get a list of all the devices on the network except for themselves because they won't be announcing. It actually means that for domainless networks, we still need to keep on SMB1. Well, what I would challenge, the question is for domainless networks,
Starting point is 00:26:14 do we still need to keep on SMB1? What I would challenge is, depending on your network, for example, if you're a work environment, the browser is completely insecure in like 100 different ways. So if your enterprise has a dependency on browser, you have bigger issues in general. When I hear about management software
Starting point is 00:26:28 that wants to use browser to enumerate what machines I should manage, that's kind of a scary thing if you think about it. Because it's totally insecure and anyone who has access to the network can easily just elect themselves to be browser. So dependencies on the browser outside of home, peer to peer, like where's my music,
Starting point is 00:26:44 where's my photos and stuff like that, I think are kind of questionable to begin with. For those environments, though, it's where we're talking to the networking teams, trying to understand what other technologies we've gone through with UPnP and LinkLocal Discovery, and also why we're not pushing those more and why the shell isn't integrating with them more. And the pure honest statement is part of the reason is if browser's there and it works, no one thinks about doing anything differently. They don't think about what it's actually doing
Starting point is 00:27:11 under the covers and the attack surface it brings with it and all this other stuff. They just say, well, I click on network neighborhood and it's there. And we can shake them and say, look, we don't want this here. It has significant cost. We want to move away from it.
Starting point is 00:27:21 But until it really starts to go, I don't know how else to derive that. So if you have ideas on where we should be looking, yeah, I'm more than happy to hear. Okay, maybe it's not a methodology discussion. In my opinion, just a methodology call for SRV as a solution. Yeah, we should talk more after. Yeah, Jeremy? Whatever you do, please talk about it for that call as well. Well We can, yeah, we should talk more after. Yeah, Jeremy? Whatever you do, please talk about it for that as well.
Starting point is 00:27:48 Well, and that's what, because they... Probably go after one year or... Yeah, and that's what we've looked at, too. And, well, I shouldn't say we, because there's also, within, there's a debate over whether that's a file system team or a network, like, there's debate about who should go down this path.
Starting point is 00:28:02 And I personally have the plan that it's not me. I'm not saying it's not Microsoft,. I'm not saying it's not Microsoft, but I'm just saying it's not me. But that's... Yeah, so, cool. Yes? Yes? Yes? Well, we kind of, the question is,
Starting point is 00:28:32 is this similar to DFS reference? In some ways, yeah, it redirects you to another node, but DFS, as we think about it, is a separate layer built on top of SMB, and it basically gives you a way to stitch together multiple, so it doesn't give you failover and things like that. But this is inside a cluster, yeah, inside a redirection inside a cluster,
Starting point is 00:28:54 and it gives you all the failover semantics and so on. From a protocol perspective, you could have used it. Somewhat, yes. The note was that from a protocol perspective could we have used it but dfsn if we go back to the auth question dfsn assumes that the identity of the target changes based on the target you select whereas an asymmetric server was sort of saying one of the trust things is someone's saying i want you to go to this node and it's because i can do mutual auth to that node using my original svn that i'm sure I'm not being sent off to some random location. So there are some subtle differences, but yes.
Starting point is 00:29:28 So, cool. Any? Yes? Can you repeat the question? Does the negotiation of the direct exchange because of the new extension for the protocol? No, there is no change to the negotiate because, as I said, the TreeConnect extension is or should be backward compatible. Yes.
Starting point is 00:29:56 Because we have structured it in such a way that older servers should ignore the extensions. Okay. On the TreeConnect part, servers should ignore the extensions. One more question. From the TreeConnect part, I think you mentioned that new authentication also takes place here. Is that correct? The question is, is there a new authenticated session for every TreeConnect? Is that the question? No, that's not correct because we do a single authentication
Starting point is 00:30:25 for the tenant or the in terms of global mappings when we establish a global mapping from the SMB client, there's a single session SMB session using those specified credentials and that will use Kerberos or NTLM and then multiple tree connects refer to the same
Starting point is 00:30:41 session. Is that not in the case? No, not like when say single user connect from the Windows client tree connects refer to the same session? The question is if a single user connects to multiple shares for the same server, we would, so with global mappings, we would actually establish a session for the global mapping separately so that the session will not leak to other shares. So we weren't restricted to the share, yes. Yes, one more. Do you think the APS needs to be CAK-tested in terms of using the APS as a way to... Yeah, so the question is,
Starting point is 00:31:29 are there plans for the client to support CA failover across the DFSN style redirection? And there's nothing official yet, so yeah. Cool. So if you have more questions, find us. We'll be at the Plugfest all week, and we can answer questions on this stuff. I'm going to keep moving on so we can fit this in.
Starting point is 00:31:46 For the other half, we've talked a lot, and we've heard a lot about persistent memory and various ways we can start to support it across the SMB protocol, and we actually had time this past summer to do some explorations. I don't feel the need to explain why this is interesting to us
Starting point is 00:32:03 in terms of storage latencies and performance, but I'm going to show a slide I stole from Tom Talpe from a previous SDC presentation that sort of talks about the three different ways we see that we can start to engage with NVDIMM support. And we sort of outlined this across threes. There's the traditional I.O., where we basically have an NVDIMM, and we're just going to format it as a block device.
Starting point is 00:32:26 And from the SMB server level, everything is going to look the same. We're going to issue reads and writes down the stack. And as far as we know, it's no different than an SSD or a spinning drive. It's just it's going to behave really fast. Then Tom has also talked about the third step, which is sort of the holy grail, where we say, since this is memory and RDMA lets us do memory registrations, it's going to be really great if we can sort of pass these registrations back to the client and let them do RDMA directly to storage. And then there's kind of this middle one right here, which is, he referred to as DAX load store by the SMB server.
Starting point is 00:32:58 And that's a little bit what we're going to dive into today. So like I said, block mode access, you can do this today on Windows platforms. I'm sure you can do it on other ones. You're just creating a normal volume off of one of these devices, creating a share on it, and you're going to execute against it. So the client and the server architecture, there's no code changes required from an SMP perspective,
Starting point is 00:33:20 and you should be able to see the decreased latency and the increased IOPS. But there's still, obviously, server-side processing that's involved to issue requests, have them flow down the storage stack, have completions come back, and all the normal things that go with that. The goal for push mode, which is the one farther out,
Starting point is 00:33:40 is that we can get rid of all that server-side processing by sending the client a registration of the memory directly and letting them interact with it using RDMA so that the SMB server doesn't actually have to do any processing on the requests, ideally. And those are sort of the steps that are there. As we look through it, you know, there are a bunch of challenges with push mode, and I just want to call it Tom's talking on Thursday. I think there might be overlap here, and he has a lot more detail in here,
Starting point is 00:34:11 but one of them is if you need write-through semantics, we need some way of actually committing or flushing the data after we do an Rdma write to ensure that it's cleared out of the PCI bus, the processor cache, and everything, and it's actually been hardened to the chips. Server also, if we're registering memory and sending these memory distributors back to the clients, we need ways to sort of balance resources here. If the actual card itself has a limited number
Starting point is 00:34:31 of resources we can deal out. Multi-channel has some interesting aspects because the memory registrations are bound to a protection domain which is associated with a connection and those are assigned at connect time, but we don't actually know whether two channels are authenticated from the same client until after we receive packets.
Starting point is 00:34:49 There's also questions about what do we do if the client said they wanted signing or encryption because all the data is being transferred outside of the SMB's control. So generally, there's a little bit more investigation we need to do along those lines. But the DAC support is an interesting one because the goal about here is we have a device
Starting point is 00:35:07 where we can ask it for a mapping of the file and get basically a memory mapping of where the data resides, and the server can interact with it directly. So there's ways we can explore this without actually requiring any changes to the client. This is just a server-side implementation. It allows us to try fully bypassing the file system for read or write operations.
Starting point is 00:35:27 So the server's going to basically ask the file system for a mapping. He's going to control the lifetime of that mapping. If the file system needs to change it, we need to come up with a way the file system can ask us to release the mapping, do things like that. And if all else fails, we can also just fall back
Starting point is 00:35:41 to a normal I.O. down the file system stack. And finally, the one other step is if we're doing right through IO for high availability and stuff the server can take responsibility for doing the appropriate flush operations on the regions that have been modified so in the tradition of letting interns do all the interesting work this was Daniel Daniel interned with me this summer a really sharp guy and his goal was to take the outline modifications of the SMB2 server
Starting point is 00:36:07 and see what sort of data we would actually get from it. I do want to call out that all the results that we're doing here based on the code, this is all preliminary prototype unreleased code. None of this is shipping. But here are the changes we actually did. He basically changed it so in our SMB server implementation at open we would query the attributes
Starting point is 00:36:29 of the underlying file system volume and we could mark it if it was done on a DAX volume. On the first read or write we basically had this file map structure which was protected with a rundown and we could, if it wasn't initialized we would initialize the map we would actually map the entire file.
Starting point is 00:36:46 This was pretty simple. We weren't trying to break it up into sections. We were just trying to get a rough cut on what the performance would be. And then every read or write that was going in progress could acquire the rundown to make sure the map didn't change. And on reads, we could just either copy the data or RDMA the data directly out of the buffer.
Starting point is 00:37:00 And on writes, we would copy it from the TCB buffer into the mapping. Or the really cool one was with RDMA, we can do an RDMA read directly from the client buffer across the network directly into the map. And then for write through files, which is all we were really testing, there's a new routine called RTL flush non-molible memory,
Starting point is 00:37:18 which is basically invoking a CL flush across the ranges to ensure that it's hardened and then we send the SMB response back. So we coded all this thing up, and then we got a test set up using two of these HP servers. It had like 120 gig of server class or storage. Is it server class memory or storage class memory? Sorry.
Starting point is 00:37:38 I always call it NB-DIMM. There's like 50 names for it. I don't know what to call it. And they had 200 gig-gig adapters. We disabled hyper-threading. We disabled power states, all the normal stuff we would do for benchmarking. And then we were using disk speed as a load generator. And I decided to limit it down.
Starting point is 00:37:55 He had done a whole... He was doing a whole sweep of stuff, but I decided to limit it to two main scenarios, one of which being synchronous I.O., which is single-thread, single I.O., 4K ops, either read or write, just to test what the latency improvements would be, and then a being synchronous I.O., which is single thread, single I.O., 4K ops, either read or write, just to test what the latency improvements would be, and then a highly parallel I.O., which was spinning up basically one thread per core, issuing eight I.O.s, outstanding.
Starting point is 00:38:15 And we did it all against NTFS volume on the server side, formatted with ISDAX is true, and the volume was created off of a single DIMM. I didn't try striping. I didn't try doing anything fancy. So given all that, we can sort of look at what sort of perf results are possible with this. So synchronous reads. This is, again, like this would be 4K. I think these were sequential reads from a single thread coming through.
Starting point is 00:38:41 So the benefits here from a processing perspective, this is coming over RDMA, is when the read arrives, it comes with a registration from the client, and I can RDMA write the data directly from the mapping back to the client and send the response back. If we compare that to what would normally happen in the DAX volume, is I would call down the file system to NTFS and he would give me an MDL describing the region,
Starting point is 00:39:00 and I would send it back. So we saw about a 20% improvement in IOPS and about a 20% reduction in latency. And if you want to, that's average latency. Greg always used to tell me that he doesn't care about average latency. He cares about your latency curve. So here's what the latency curve actually looks like
Starting point is 00:39:17 going from this is IOs up to the 99th percentile and up to, I think this goes to seven nines. There was one really big IO at the high tail. This chart's hard to see because it doesn't look like anything happened. So here it is as a relative percentage with that last data point clicked off, because if I leave that there as a percentage, it shoots way up. But basically, we're seeing about 15 to 20 percent reductions in latencies all the way up to the 99.9 percentile where we kind of even out. So this was single I.O. If we step on to the right version of that case, right is a little bit more interesting.
Starting point is 00:39:51 So the comparison of what would happen in the right case without us is we would actually, a right arrives, we would go to NTFS and either ask for an MDL describing the region, we'd read it into it, and we'd go back to NTFS and tell it we're done, or we could receive it into a buffer, send it down to NDFS, and they could do a copy.
Starting point is 00:40:07 The potential savings here was much greater. But the actual numbers where we go didn't end up being too different. Part of this is there's so much CPU available on these machines. We'll come to that later.
Starting point is 00:40:23 The latency improvements, let me see how much time we have if we go to the relative numbers um we're seeing about 30 latency improvement for 4k single ios and this is again without the client implementation actually changing so now if we step into lots of ios so this is where we really start to stress the server a little bit more i'm just going to do iops and average latency for this. And here what we see is because it's highly parallel, the overall read performance doesn't change in terms of IOPS too much, which kind of makes sense because in both cases, we're sort of asking NTFS for a pointer to the data and we're allowed to RDMA it right. So really all we're saving down here is a little
Starting point is 00:41:03 call down to the file system to kind of get a pointer to the data that we can then send out. But the write is significant, because again, with the write, we're not receiving into a buffer and then asking NTFS to do a copy into the persistent memory. We're receiving directly into the memory, and then we can do sort of a flush. And then the 50-50 reads writes sits about exactly where you would expect it to be in terms of it's about half the savings or half the improvement we saw from the write case. The other interesting thing if you're looking at
Starting point is 00:41:36 hyperconverged systems, though, is that the actual server CPU savings in the bad case, like the most intense case, which is writes, we were using between about a half and a third about the CPU on the server side in comparison to running on a normal DAX volume. So at this stage, everything looks great. Let's not look at it closely and let's all go get coffee. Sound good? Because there was a kicker that kind of came through, which is why I've been hammering the laptop and talking to Neil and all these guys trying to figure it out. Let's go back to our original diagram here. And all these numbers that I've showed you up to now are baselining using a
Starting point is 00:42:12 MTFS volume that was formatted for DAX operation. There's another option, which is use persistent memory, and you don't format it for DAX operation. You say this is only going to be used in block mode. And in that case, this driver called PMEM, which is a storage driver, is the one that's actually interacting with the persistent memory. If you have more interest in this, Neil also has a talk on Thursday
Starting point is 00:42:32 that I believe is all talking about this, and he can go into great detail. So then we said, what happens if we take our performance numbers and instead of comparing them A, B, on a volume that's formatted for DAX, let's also compare it to a volume that's formatted for DAX, let's also compare it to a volume that's formatted for block I.O.
Starting point is 00:42:48 And the numbers aren't nearly as impressive, and in fact, they're slower in the write cases. Like, we were actually slower in our prototype implementation doing either, like, for largely parallel writes than if we just left the volume format as block and left the architecture the same. Which leaves the question, like, we're using half the CPU,
Starting point is 00:43:12 we eliminated a context switch, but we actually lost IOPS. Yeah. It's a fascinating question. So this whole time we've been digging through, or I've been digging through expert tracing, because I honestly can't say why. What I can see is that we did add a new synchronization primitive,
Starting point is 00:43:31 we said on the map, where we're using a rundown protection. In our case for a rundown protection, we use them a lot in SMB direct, but SMB direct structures are affinitized per NUMA node, whereas we're actually highly parallel, so we're contending on the same rundown across many threads at the same time. So maybe that data structure, there's some active weight that's going on that doesn't show up as CPU consumption, but actually
Starting point is 00:43:51 slows us down. We also are doing CL flush to actually flush out the pages, because these are all right through IOs. I'm trying to talk to the owners of PMEM to figure out how they actually push data out, because they sit much closer to the hardware, and they might be using a what is a non-temporal copy or some other instruction that's just a more efficient way. If I disable all synchronization on the
Starting point is 00:44:15 rundown and I disable CL flushing then I can match block IO but I know that he doesn't have those consistency problems. The other option could be that because we're using so much CPU there might be something across these cores where our worker threads I can see we're just entering weight and resurrecting states and like like maybe there's just something about the way the workload of internet is the truth is I honestly can't say why because everything that I look about these data says that we should be faster than block IO but when it came down to actually running the numbers we were for read but we weren't for writes. So there's a mystery. If you have ideas or theories, I would love to hear it, because I honestly can't. Yes?
Starting point is 00:44:57 So I guess the question is, like, in DAX mode, how does the I.O. travel through the file system? Which would be a question for Neil. In DAX mode, it doesn't. The file system asks for mappings, and after those mappings are granted, there's no, the file system isn't involved at all. So the question was, does the page size add those things up to the package? Oh, I see.
Starting point is 00:45:24 Is it a factor there? Yeah, I see. Yeah, I don't know if there is support for page sizes other than 4K for DAX volumes, at least on our implementation. Actually, I'm going to talk about that. I'll talk about it on Thursday, but we do have large page support now. Yeah, cool. But I don't think you were using it.
Starting point is 00:45:42 No, I wasn't. And like I said, if we go to sort of the final slide, you know, it offers, if you already have a workload that wants DAX volumes for local operation because you have services that want to interact with the memory directly in a byte-addressable mode, definitely doing the server-side integration with DAX support shows huge savings.
Starting point is 00:46:00 But if you're purely looking at remote access scenarios, I just sort of call out the last point that I think there's more work that we have to do. We have to dive in to figure out why the numbers doesn't match sort of what we would actually expect it to be. So, but so far it's been, yeah, fascinating experiment. Experiment, so. Did you try to play with DDAO
Starting point is 00:46:19 like Sable and Angular? No, it's at a hardware level modification? Yeah. Normally you have this enabled. RDMA places the data into the CPU. Right. Right. Enabling DDIO would allow us to not call CL flush for consistency guarantees. That's, we're not just in platform level optimizations here.
Starting point is 00:46:55 This is the protocol. Just as an experiment to see whether performance will change. Yeah. I would be skeptical. I personally feel that full blown push mode to see whether performance will change. Yeah. I would be skeptical. I personally feel that full-blown push mode is the better place to explore, but that requires a very.
Starting point is 00:47:15 Cool. What's the size of your file? It was either, I switched between a two gig and 512 meg. With Daniel's code, we couldn't exceed four gig because he was doing a single mapping. It would be interesting if using large wave support would help. If DLB flushing is somehow interplaying in these random IOs. Yeah, that could be actually because it was for the large files it's full random IOs going through.
Starting point is 00:47:39 So we have about five minutes left for questions if you have it. Don't forget to uninstall SME1 if you yeah and yeah are there any other questions comments yes well so you can think about it as a different way, in that it's not a matter of whether we wanted a session per user. It's a matter of whether we wanted to be able to be used in workloads
Starting point is 00:48:14 where the token being run on the client isn't a network-provable token to the server. So the example I always give people is there were scenarios people would come up and say, I can use iSCSI. Because if I use iSCSI, I provide credentials, I mount a volume, and now once the volume is local to me, I can use local users and it works just fine. But over SMB, if I run as a local user and I try and access it, it fails at logon failure because the server knows no idea of that user. So really the global mounts remote identity is more just saying there are these scenarios where I want to do a network proven mount of a remote resource but then i want the actual acl evaluation with respect to
Starting point is 00:48:49 accessing individual objects to be more in control of the client so that's how i explain it matthew might explain it better so i can hand back to him too so yeah you're essentially delegating a piece of authentication back to the client. Yeah, exactly. So you can think about, like, with Windows containers as well, we ran into this where containers run with a different instance of LSA, so they have different users, and they can't do domain-based auth and stuff. So one of the first things is, how do I let a container access, you know, remote things without providing credentials?
Starting point is 00:49:24 And this was another method of doing that. Oh, I guess it was related to input. You'll be hurt if you don't have multi-channel in that scenario, right? Because in our server, we do best with lots of connections. You can, but at the same time, the note was that you'd be hurt without multi-channel. At the same time, establishing this tree connect with an identity will actually be faster
Starting point is 00:49:43 than establishing a session, because you're not going to have to loop through LSA and do a full-blown auth. So you could argue if we switched our Hyper-V model to do a single machine-based auth and then tunnel the token of the VM, we could actually see a speed-up from using identity rather than doing one session set up per VM
Starting point is 00:49:58 because they're all different log-in IDs. Yeah, the question is, can you still use multi-channel? And the answer is yes. So whenever you establish the session, we'll be striped across all the channels. You'll establish it once for the session, and then the tree connect is valid on any of them. So, yes, I saw you.
Starting point is 00:50:21 I believe he is coming back, although I don't know if he's coming back to the SMB team. So... I believe he is coming back, although I don't know if he's coming back to the SMB team. I think it might be more than that. So, cool. Well, yeah, thank you for your time. Come find us if you have any other questions. Thanks for listening. If you have questions about the material presented in this podcast,
Starting point is 00:50:43 be sure and join our developers mailing list by sending an email to developers-subscribe at sneha.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
