Storage Developer Conference - #8: SMB 3.1.1 Update

Episode Date: May 23, 2016

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 8. Today we hear from Greg Kramer, Senior Software Engineer, and Dan Lovinger, Principal Software Engineer with Microsoft, as they present SMB 3.1.1 Update from the 2015
Starting point is 00:00:47 storage developer conference. So thanks for coming. I'm Greg Kramer; you probably recognize me from years past, I've done the SMB talks and the SMB Direct talks. Dan Lovinger is gonna join me later with some interesting 100 gig RDMA results, and let's get started. So this time last year we were here with the Windows 10 preview and the SMB 3.1 dialect, and this year we're going to cover 3.1.1, which is a minor revision on our preview release dialect. Real quick, our agenda here: talk about some dialect changes we've made, kind of fly through the SMB 3.1.1 features. Not a whole lot has changed since last year, and this is like the third presentation of
Starting point is 00:01:41 this material, I think, depending on which conferences and all that you go to. So I'm going to try to get through it rather quickly. I will be here all week with the rest of the SMB team, so if people have any sort of questions just grab us and we'll get you sorted out. Once we get through the 3.1.1 material, we've got some interesting sort of future directions to talk about, some fun prototypes we've worked on that I'd like to share with you guys and we'll make sure you leave happy. Some of the stuff we're looking at. So let's just get this out of the way quickly. So we've revised how we represent our dialect numbers now. We now use a major-minor revision,
Starting point is 00:02:27 so 2.0.2 instead of 2.002 or whatever. It's not a big deal. This is mostly just fit and finish. We realized that the way we've been doing it gets kind of silly if you start introducing like hexadecimal numbers into the dialect string and it becomes unclear how to write them. So let's just standardize on a format that we all understand.
Starting point is 00:02:51 This should all be updated in the protocol document at this point. And most places in Windows 10 where we represent dialect strings out to the users have been updated to use the new format. So jumping right in, the Windows 10 RTM SMB dialect is 3.1.1. Like I said last year at SDC, we had presented 3.1. Now the good news is for people that had started working on this, it's a very, very minor set of changes. We tweaked just a few things. We required a minor revision update just to retain compatibility. It's not a big deal. If you're a good way through your 3.1 implementation, there's not a whole lot you have to change. If you look at the slides, I will call out the differences in red text there. I'll also speak to them. There's only two, I think. So that's not so bad. The SMB 3.1 dialect is gone now, though. So Windows 10 RTM no longer speaks this dialect. It was an interim engineering dialect. We will not
Starting point is 00:03:59 negotiate it. So you will have to update your dialect strings. And we expect, but of course can't promise, that Server 2016 will also speak 3.1.1 just like Windows 10 RTM. And by the way, if anybody has questions, feel free to interrupt as I'm talking. What could it be, if not 3.1.1? Excuse me? If it is not 3.1.1, what could it be? So the question is, if 2016 doesn't use 3.1.1, what else would it be?
Starting point is 00:04:28 So 2016 would, of course, support 3.1.1, but theoretically, if we were to add any new features, we may end up having to rev, like, minor version or something. Like I said, I very much doubt that will happen, but, of course, we can't promise it at this point. Alright, so new features in the protocol. 3.1.1 was mostly a fit and finish release for us. There was a bunch of loose ends that I think everybody
Starting point is 00:04:58 wanted to tie up and we had the opportunity to do at this release. Over the past several releases, you've noticed we've crammed quite a bit of functionality into SMB, and some of these capabilities that we've added, they start to get kind of complex. And so when the client and server need to negotiate how they're going to use these capabilities, sometimes they're not just simple flags anymore.
Starting point is 00:05:24 Like we actually have to exchange rich information between the client and the server and we're sort of running out of unused bits in our existing negotiate request. So we needed a way to introduce extensible negotiations so that the client and server could actually send rich information back and forth between each other. And so we introduced the idea of negotiate contexts. So if you guys are familiar with the protocol already, this looks very much like create contexts. It's basically the same exact idea. We took two of the remaining unused fields,
Starting point is 00:06:00 and we turned them into a negotiate context offset and count. And then we just create a linked list of these blobs behind the negotiate requests or responses. And you can see what that looks like here. The blobs begin on the first eight byte aligned offset following the usual request or response packet. Each subsequent blob begins on the next eight byte aligned
Starting point is 00:06:26 address after the preceding one. And the contexts are strongly typed. So they have an ID field that tells you what they are. They have a data length. And then the data that the context carries is type-specific. This is very general. You can use this to transport all sorts of information, as
Starting point is 00:06:43 you'll see shortly. The key things to know about this is that your client will only send these negotiate contexts if it supports 3.1.1. The server will only send them back if it selects 3.1.1 for the connection dialect. The receiver must ignore unknown negotiate contexts. So if you're interpreting the ID field, the type field, and you don't know what it is, you have to ignore it. And this is very important and we'll talk about this a little bit later. But the intent here was to allow us to add new features to the protocol without necessarily having to require a dialect revision.
Starting point is 00:07:22 Dialect revisions are sort of a pain, especially if you have to proto-doc them all. Tom Talpey has to go through and make a million changes to the document. So try to be nice to Tom. The one interesting thing to note about this is that since a client doesn't know that he's talking to an SMB 3.1.1 server beforehand,
Starting point is 00:07:46 he just kind of has to assume that he might be, and he'll attach context to his negotiate request. Now, this could be problematic if your server was coded to say, hey, I know the size of an SMB 2 header and the size of an SMB negotiate request, and I'm only going to accept something that's that big because you might now get this linked list of blobs following your request. Now, in practice, we haven't run across anyone during the Plugfest that has this problem. It's just something to be aware of.
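A rough sketch, in Python, of walking the negotiate-context list just described, assuming the layout as I read it in MS-SMB2 (a 2-byte context type, a 2-byte data length, a 4-byte reserved field, then type-specific data, with each context starting on the next 8-byte-aligned offset); the field names here are mine and worth double-checking against the protocol document:

```python
import struct

def parse_negotiate_contexts(buf: bytes, first_offset: int, count: int):
    """Walk the list of negotiate-context blobs appended to a negotiate
    request/response. Unknown context types are kept, but receivers must
    simply ignore the ones they don't understand, per the rule above."""
    contexts = []
    offset = first_offset
    for _ in range(count):
        offset = (offset + 7) & ~7                      # next 8-byte boundary
        ctx_type, data_len = struct.unpack_from("<HH", buf, offset)
        data = buf[offset + 8 : offset + 8 + data_len]  # skip 4 reserved bytes
        contexts.append((ctx_type, data))
        offset += 8 + data_len
    return contexts
```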
Starting point is 00:08:14 Windows will accept up to like 128K or something like that worth of data on the negotiate request. So just be aware of this. Pre-auth integrity. Negotiate contexts introduce a sort of unique problem from a security perspective. So in SMB3, we added negotiate validation to prevent dialect downgrade attacks. So it was an after-the-fact check
Starting point is 00:08:39 that no man in the middle messed with your negotiate request or response, right? And the way that that mechanism worked was that you re-exchange the same information that you exchanged in your negotiate request and response and compared notes afterwards in a signed fashion. And if you detected a difference, then somebody messed with your packets. The negotiate contexts don't fit in this scheme because, of course, the format of the negotiate validation
Starting point is 00:09:05 request only contained the fields that were present in the original negotiate request and response packets, so there's no way to actually represent negotiate contexts under this protection mechanism, which is not good because we want to use these to negotiate all sorts of complex connection properties and we don't want a man in the middle messing with those. The other thing that it doesn't cover is that session setup requests and responses are also pre-auth, right? The final session setup response is the first opportunity to actually sign or encrypt something, well not encrypt but sign something, right? So in the future if we wanted
Starting point is 00:09:45 to extend session setup in any way, we would have to be very careful about how we did that because of course those messages can be modified and nobody would detect it, right? And that's sub-optimal, we don't like that. So Preauth integrity solves these problems. So if you're familiar with TLS, this probably looks a lot like the TLS mechanism. The basic idea is that the client and server compute a rolling hash of every request they send, response they send, receive, you know, etc. And at the end, when you get to session setup, you have a hash value that represents the entire message exchange that you've seen so far.
Starting point is 00:10:26 And then if you use that hash value as an input into your key derivation function, you can derive secret keys that depend on the integrity of the message exchange that you've had, such that if you send the final signed session setup response, the client can only validate the signature if nobody modified the packets on the wire. This is one of our changes from 3.1 is that the client must sign or encrypt the tree connects
Starting point is 00:10:56 when he sends them back to the server. And this sort of closes the loop. The server will always send the final session setup response signed. The client will always sign or encrypt a tree connect. The ability to either decrypt or validate the signature means that you guys computed the same hash value and then nobody tampered with your data. The really nice thing about this system is that it's message agnostic. Like we can change session setup messages, negotiate, we can do whatever we want. It's just a hash. It doesn't interpret any of the fields or anything. So how do we select the actual hash function that we're going to use? And this is where the protocol starts making heavy use
Starting point is 00:11:36 of negotiate contexts. So we introduced a new negotiate context. It's the preauth integrity capabilities. It's basically just a list of hash functions that you support along with the salt value to prevent pre-image attacks on the actual hashing. The client sends his list of supported hash functions to the server. The server selects one, sends a context back that indicates which hash function you'll be using, and then both sides use that to compute their hash value. Currently, SHA-512 is the
Starting point is 00:12:07 only hash function that we support. It will probably be that way for some time, but if we want to shove new hash values in, it's pretty simple now. This is a quick illustration of how we actually compute the hash value. There's a whole write-up in the protocol document. The basic idea is you just start out with a hash value that's all zeros, and then you just concatenate each packet that you send, each packet that you receive to the existing hash value, rehash it again, and store the value. You end up with a hash for the connection, and then use that to build up the hash for all the sessions that are established on that connection.
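A minimal sketch of that rolling computation (my own illustration, not Windows code), with SHA-512 as the negotiated hash function:

```python
import hashlib

HASH_LEN = 64  # SHA-512 digest size

def update_preauth_hash(current_hash: bytes, packet: bytes) -> bytes:
    """Fold one sent or received packet into the rolling hash:
    H_new = SHA-512(H_old || packet)."""
    return hashlib.sha512(current_hash + packet).digest()

# Hypothetical captured packets, just to make the sketch runnable.
negotiate_exchange = [b"<negotiate request>", b"<negotiate response>"]
session_setup_exchange = [b"<session setup request>", b"<session setup response>"]

# The connection hash starts as all zeros and absorbs the negotiate exchange.
connection_hash = bytes(HASH_LEN)
for pkt in negotiate_exchange:
    connection_hash = update_preauth_hash(connection_hash, pkt)

# Each session's hash starts from the connection hash and absorbs that
# session's setup messages; the final value is what feeds the key derivation.
session_hash = connection_hash
for pkt in session_setup_exchange:
    session_hash = update_preauth_hash(session_hash, pkt)
```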
Starting point is 00:12:51 And then you pass them into your key derivation function. So we have not changed the KDF. It remains SP 800-108 counter-mode HMAC-SHA-256. It needs a snappier name. Anyway, that's the same as SMB3. But we have changed the labels that are used to derive the values. And the context value is now the session's final pre-auth
Starting point is 00:13:18 integrity hash value. So if you think about this, then the key derivation function is meant to take a master key and produce child keys such that if the child keys are compromised, you can't recover the master key. So inputting the hash value here produces unique child keys that are predicated upon everybody seeing the same hash value. Key points. Pre-auth integrity is mandatory for 3.1.1.
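As a sketch of that derivation, assuming the SP 800-108 counter construction with HMAC-SHA-256, a single counter iteration, and 128-bit output as in earlier SMB3 dialects; the exact label strings are whatever MS-SMB2 specifies (the one below is my recollection), and the context is the session's final pre-auth integrity hash:

```python
import hashlib
import hmac
import struct

def smb3_kdf(master_key: bytes, label: bytes, context: bytes) -> bytes:
    """SP 800-108 counter-mode KDF with HMAC-SHA-256:
    K = HMAC(master_key, counter(1) || label || 0x00 || context || L),
    truncated to 128 bits."""
    counter = struct.pack(">I", 1)
    out_len_bits = struct.pack(">I", 128)
    prf = hmac.new(master_key,
                   counter + label + b"\x00" + context + out_len_bits,
                   hashlib.sha256).digest()
    return prf[:16]

# Placeholder inputs, just to show the shape of the call; the label string
# here is my recollection of MS-SMB2 and should be verified against the spec.
session_key = bytes(16)          # master key from authentication
final_preauth_hash = bytes(64)   # session's final pre-auth integrity hash
signing_key = smb3_kdf(session_key, b"SMBSigningKey\x00", final_preauth_hash)
```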
Starting point is 00:13:48 It supersedes negotiate validation, so we no longer perform negotiate validation on 3.1.1 connections. And we received a number of questions about this during the Plugfest, so I thought I'd throw this in here too. You have to compute the pre-auth integrity hash for master session setups and binding session setups, but not re-auth. And the reason is because re-auth doesn't result in keys, right? The whole point of the pre-auth integrity hash is to use it to derive child keys, and if you're not going to derive keys, you don't need to do it. So, re-auth is exempt.
Starting point is 00:14:26 Cluster dialect fencing. So, imagine that you have a storage cluster and you're using SMB to serve out application data and you would actually like to update this cluster at some point in time. So, in the past, this was not an easy thing, right? It mostly involved bringing the cluster down, which resulted in downtime. And the reason for that is because if you want to have transparent failover, if a client connects to a node in the cluster and negotiates a dialect, he expects that he'll be able to failover to any other node in that cluster and reclaim that same dialect, because
Starting point is 00:15:01 he probably has handle state that is associated with features that are tied to a particular dialect. If that doesn't happen, the client breaks. During the process of upgrading the OS on your cluster nodes (and in the past, I think, every Windows release has had a new version of SMB), the probability is that the upgraded nodes will have a higher dialect than the non-upgraded nodes, at which point if a client were to connect to one of the upgraded nodes and then fail over to a non-upgraded node, it's out of luck.
Starting point is 00:15:35 It's not getting that dialect back. So solving this is actually fairly simple. We introduced a new concept of a maximum common cluster dialect, and we fence access based on that dialect. So for example, if you had a cluster that was running server 2012 R2, so everybody speaks 3.0.2, and then you start upgrading it to Windows Server 2016, you would define your common maximum dialect to be 3.0.2,
Starting point is 00:16:07 right? And then all of the Server 2016 nodes that are upgraded would be informed by the cluster infrastructure that, hey, you got to pretend to be Server 2012 R2 nodes for now, so don't hand out any 3.1.1 connections to anybody that's trying to access cluster resources. And if somebody comes in and attempts to access Tree Connect to a clustered share, you need to fail them. And we did that by introducing a new status code. You'll get this unique status code back as the client. And the error payload for your failed tree connect
Starting point is 00:16:46 will include the maximum cluster dialect. So if you come in, we'll fail you. Hey, your dialect's too high. The maximum dialect you can use is this. So then you disconnect and reconnect, and everything's good. So there's one minor change to tree connect request that we had to make. We took the old reserved field, turned it into a flags field.
Starting point is 00:17:11 There's sort of an interesting race condition we discovered here because when you finish updating the OS on all of your cluster nodes, the cluster infrastructure has to broadcast a message to all those nodes that says, hey, you guys can all start being Server 2016 nodes now. Like, we're all there. That message isn't necessarily received or processed simultaneously by all the nodes, right? So there's a small period of time where they disagree about what the maximum cluster dialect is.
Starting point is 00:17:40 So the easy way to fix that is that once a client has successfully tree connected to a clustered share, then all subsequent tree connects to the same cluster set this cluster reconnect flag. The flag is the client telling the cluster node, hey, I have previously connected to a clustered share with this dialect, so you've got to let me in. Like, I don't care what you think the maximum cluster dialect is. Now, the thing to realize here is that this isn't a, I mean, the server trusts the client, but this isn't a security
Starting point is 00:18:13 boundary. There's no vulnerability here. A client that would maliciously or erroneously set this flag only hurts himself. He's only establishing handle state that he can't resume if he fails over, at least not safely. So it's not a problem. Key points: dialect fencing only affects clustered shares.
Starting point is 00:18:38 So cluster nodes generally have two personalities. They expose clustered resources, but they're also just standalone file servers. If somebody wants to access the node as a standalone machine, they're free to do so, and they're not subject to the dialect fence since they have no failover. What this does mean is that you can't mix clustered and non-clustered access on the same connection; it won't work. And clients should just be aware that you need to have some sort of protection against malicious or buggy servers, so that you don't go into an infinite retry loop if they just keep sending you the reconnect status.
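A hedged sketch of the client-side behavior just described; the status and flag names below are placeholders for whatever the protocol document actually defines, the transport/connection objects are hypothetical, and the retry cap is exactly the kind of protection against a buggy or malicious server that Greg mentions:

```python
MAX_FENCE_RETRIES = 1   # don't retry forever if the server keeps fencing us

def connect_clustered_share(transport, share, dialect, previously_connected=False):
    """Tree connect to a clustered share, falling back to the cluster's
    advertised maximum common dialect if we get fenced."""
    for _ in range(MAX_FENCE_RETRIES + 1):
        conn = transport.negotiate(dialect)
        # If we've already connected to this cluster at this dialect, set the
        # cluster-reconnect flag so a node that hasn't yet heard the "everyone
        # is upgraded" broadcast still lets us in.
        status, payload = conn.tree_connect(share,
                                            cluster_reconnect=previously_connected)
        if status != "STATUS_SMB_BAD_CLUSTER_DIALECT":   # placeholder name
            return conn, status
        # The error payload carries the maximum common cluster dialect:
        # disconnect and come back speaking that instead.
        dialect = payload.max_cluster_dialect
        conn.disconnect()
    raise RuntimeError("server kept fencing the tree connect; giving up")
```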
Starting point is 00:19:20 All right. Cluster client failover. So just a quick overview. This was to solve cases where the application was actually running on a cluster also. And so you can end up in situations where the client application was connected to a server node and had handle state established, and then the application node dies and the cluster restarts the application on a different node in the application cluster. All of a sudden the server node sees reopens coming in for handles that it still might
Starting point is 00:19:56 think are open by the original client if he hasn't detected the failure yet. And CCF was a way to invalidate those old handles so that you could say, oh, okay, I get it. The guy I was previously talking to is somewhere else now and I'm going to allow him to come in and do this. The one thing that this didn't address was partitions in the application cluster, right? So it effectively handles the situation where the application dies on one node and restarts on another. What it doesn't handle is cases where the cluster believes that one of its nodes has died because of a network partition or some other reason, but that that node can still see storage. At which point the application
Starting point is 00:20:42 cluster is going to restart the application on a new node in its cluster and now the application is running in two places at once. And what would happen previously with CCFv1 is you could get into this tug of war where the new application would contact the server, ask for its old handles to be invalidated and so the server would do that. But then the still running original instance of the application would see errors and would ask for its handles to be recovered and the server would say, sure, here you go. And they would just ping pong back and forth as they fought each other over who gets to be the real application, right?
Starting point is 00:21:19 So CCFv2 addresses this problem. Basically, in addition to the application instance, we just add an instance version, with the idea being that the version is just increased in some way every time the application cluster detects that it has moved the application, such that we can resolve this conflict by saying, hey, if I get two people that both say they're trying to resume handle state, whoever has the higher version number is the most recent guy. So we're going to allow him to recover his handle state, and then we'll fail the other guy with an error code that says, hey, don't retry anymore. You're no longer the application. Somebody else has taken that job. So responsibilities for CCFv2, the client has to pass the new version
Starting point is 00:22:11 alongside the app instance on create. And when you get the new error code, you have to give up. So CCFv2 would prevent the original application from actually recovering handle state, but it would be obnoxious if he just sat there retrying infinitely. So you see the new error, stop trying. And the server has to look at the new instance version on every invalidating open. Higher versions win. And then we fail the lower version with the new error
Starting point is 00:22:43 code. For old client cluster or application clusters, we have some simple rules to deal with them since they don't know about these new versions. So if you don't give us one, we'll assume that your version is zero. Zero always trumps zero and otherwise all the same rules apply. All right, crypto improvements. So we introduced encryption in SMB3 and we mandated the AES-128 CCM cipher. So the question is, what happens if you need a different cipher? So you need different performance characteristics.
Starting point is 00:23:36 What if you're operating in an environment that has particular regulatory requirements? What if the cipher's compromised? For all we know, next week there will be a big leak and we'll find out that there's something wrong with it. You don't know, right? You'd like the ability to have some crypto agility here. You'd like to be able to replace these ciphers if you needed to. So for 3.1.1, we allowed ciphers to be negotiated on a per-connection basis, and we added support for AES-128-GCM.
Starting point is 00:24:04 This is the second change from the 3.1 preview protocol. We had introduced a flag that the client could set in his session set up to tell the server, I don't care if you require encryption, I do. What I realized is that flag is totally unnecessary. You can achieve client-mandated encryption without any protocol changes. To do that, if you're inclined to implement this feature in your implementation, you just have to have your client indicate that he requires signing,
Starting point is 00:24:37 which is something that has existed in SMB since SMB2. And then once you complete session setup, just start emitting encrypted packets. Right? So the SMB protocol requires the server to reply in kind. So if you give him an encrypted packet, he has to give you one back that's encrypted. And you've already indicated that you require signing. So the server will reject anything that doesn't show up at least signed. Crypto trumps signing, so a man in the middle can't inject his own traffic, and as long as you're only emitting encrypted packets, you've got an all-encrypted connection.
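A small sketch of that client-side rule (an illustration of the idea, not something Windows ships; the conn and packet objects are hypothetical): require signing, then emit only encrypted packets once session setup completes, and reject anything inbound that is neither encrypted nor signed.

```python
def client_send(conn, packet, session_setup_complete):
    """Client-mandated encryption without a protocol change: once session
    setup is done, never emit plaintext; the server must reply in kind."""
    if session_setup_complete:
        packet = conn.encrypt(packet)   # wrap in a transform header
    conn.send(packet)

def client_receive(conn, packet):
    """With signing required, anything that arrives neither encrypted nor
    validly signed is dropped, so a man in the middle can't inject traffic."""
    if not (packet.is_encrypted or conn.verify_signature(packet)):
        raise ConnectionError("rejecting unprotected packet")
    return packet
```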
Starting point is 00:25:09 So the protocol change was unnecessary, and we've reverted it. Question? Does either Windows 10 or Windows Server actually implement that? So the question is whether we chose to implement the client-mandated encryption for Windows 10 or Server 2016, and the answer is no, not at this time. Negotiating a cipher. Just like with the preauth integrity hash, we introduced a new negotiate context,
Starting point is 00:25:46 and it works basically the same way. The blob comes in. It contains just an ordered list of the ciphers you support from most to least preferred. The server will select one of those. The policy that you use to select is server's choice. And then the server will respond with a negotiate context indicating the cipher that was chosen for the connection. Something to note is that for 3.1.1 connections the SMB2 encryption capabilities flag is no longer used. It's
Starting point is 00:26:17 not necessary anymore. The presence or absence of the negotiate context is unambiguous. You can tell whether the server supports crypto or not based on the response. We also had to make two minor changes to our transform header, which is the header we use for encryption that precedes the SMB2 header. AES-128-CCM used an 11-byte nonce.
Starting point is 00:26:43 GCM uses a 12-byte nonce, something to be aware of. And the encryption algorithm field was renamed to flags. It used to indicate which algorithm was being used for the connection. I think we realized that that didn't make sense because for pre-3.1.1 connections, it could only be AES-128-CCM. It was hard-coded to the protocol. And in 3.1.1, you now negotiate it. Once you've negotiated it,
Starting point is 00:27:13 you know which cipher is being used for your connection. It doesn't need to be in the field here. So this flags field, the value 1, now just simply indicates that the transform is encryption using whatever cipher was negotiated for your connection. All right, performance. So we decided that we were going to look at large file copy performance.
Starting point is 00:27:42 So this is a workload that historically does not do so well with encryption. So SMB can copy at 10 gig line rate easily when we're plain text, no encryption, no signing. These are the stats for the systems I use. If you've seen my past talks, I've been using the same systems for, you know, several years now. So this is the standard two-NUMA-node, 16-physical-core system. I've got an Intel 10 gig NIC. I've got an NVMe device on both ends. And I'm going to stand up a file copy workload.
Starting point is 00:28:18 So last year, this is where we were sitting. You can see the green bar was AES-128-GCM, the new encryption algorithm. It's a marked improvement over CCM, which was the SMB3 cipher. Over twice as fast. And significantly faster than signing, which is somewhat surprising, right? I mean, encryption provides both integrity and privacy. It's doing more work. You would think it would be more expensive.
Starting point is 00:28:49 But GCM is a particularly optimized algorithm, especially if you have AES-NI support in your chipset. So anyway, this is where we were last year. So the question is, where did we end up this year? And this is where we are. So we did some significant performance optimizations since last year, and you can see that we're basically touching line rate now for file copy workloads using AES-GCM. CCM also improved measurably but
Starting point is 00:29:21 GCM is still much better. And the really interesting thing is the efficiency of these algorithms: cycles-per-byte-wise, GCM uses 33% fewer clock cycles than CCM does, right? What's the overall CPU usage? Only for the encryption, you mean, like just the algorithm itself? I didn't break that out. Yeah. Are you CPU bound?
Starting point is 00:29:54 Am I CPU bound? I don't believe so, no. So, are those due to changes in the implementation of the server or redirector, or just in the BCrypt routines? So the question was, why are we getting these improvements? No, the actual implementation of the crypto routines did not change. We found some pretty significant optimizations in the SMB client. There were some opportunities for exploiting parallelism that we took advantage of,
Starting point is 00:30:26 and it works pretty well. So key points for crypto. CCM remains required because you need it for SMB3 compatibility. GCM provides huge performance increases. If you're looking to implement crypto for SMB, please consider doing so. Please consider using GCM. It works much better. And one thing to be aware of is that
Starting point is 00:30:52 now that you can negotiate the cipher on a per-connection basis, if you're doing session binding, you have to be aware that when you bind one session across multiple connections, all of those connections better have negotiated the same cipher,
Starting point is 00:31:05 because you're going to be using the same keys for them. That would be not good. All right, future directions. So I'm going to give the obligatory disclaimer: I'm about to talk about experimental things that we don't make a promise to ship, and blah, blah, blah. They're fun to talk about. So as we saw in the previous slide, GCM is a lot faster than SMB signing.
Starting point is 00:31:32 And that's interesting, right? I mean, signing is only providing integrity for the packets. It's doing, conceptually, much less work. Why isn't it faster? And you could say, well, GCM is pretty fast now, right? So maybe who cares? Maybe we just use GCM and call it a day and you get privacy for free then. Yay. But, you know, what if we don't want to pay the extra cost? I mean,
Starting point is 00:31:57 what if we only need integrity? So, you know, maybe one example of that would be if you are using SMB for your hypervisor so that he accesses the virtual disks for the VMs he runs. And if those VMs are running full disk encryption, then all the data that we're transmitting for our application for the VM is already encrypted. So why double encrypt? Why pay the cost? It hurts the density for how many VMs you can run on your hypervisor. So maybe we can look at making signing faster and more efficient. So GCM is great. And it turns out that there's this integrity-only variant of GCM that's called GMAC.
Starting point is 00:32:40 So if it's only doing integrity, it should be a lot faster. So meet Aaron. Aaron Friedlander was our summer intern in the SMB team this summer. He's from Carnegie Mellon. And we had him actually prototype integrating GMAC support into SMB 3.1.1 just to see how it went. And I just want to give him a shout-out here. Aaron did a really good job
Starting point is 00:33:05 on this project. I mean, we threw some real hard work at him. He had no prior kernel development experience and he just did a really great job. Unfortunately, he couldn't be here today. We tried to make that happen, but he's back in school now and had conflicts. So what did we have to do to make this work? We took advantage of our negotiate context support and we defined a new context to allow the client and server to negotiate signing algorithms. And the interesting thing about this was that we didn't increment the dialect.
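To make that concrete, here is a sketch of what a signing-algorithm negotiate context might look like; the context type value, algorithm IDs, and layout below are hypothetical (this was a prototype), and the point is simply that a peer which doesn't recognize the context type ignores the blob, so nothing breaks and no dialect bump is needed.

```python
import struct

HYPOTHETICAL_SIGNING_CAPS_TYPE = 0x0042     # made-up context type for this sketch
ALG_HMAC_SHA256, ALG_AES_CMAC, ALG_AES_GMAC = 0, 1, 2

def build_signing_caps(preferred):
    """Encode the client's supported signing algorithms, most preferred first,
    as a negotiate-context blob: 2-byte count, then 2-byte algorithm IDs."""
    data = struct.pack("<H", len(preferred))
    data += b"".join(struct.pack("<H", alg) for alg in preferred)
    return (HYPOTHETICAL_SIGNING_CAPS_TYPE, data)

def server_pick_signing_alg(data, server_supported):
    """A server that knows this context type picks the client's most preferred
    mutually supported algorithm; a down-level server never gets here, because
    it simply ignores the unknown context type."""
    count, = struct.unpack_from("<H", data, 0)
    client_prefs = struct.unpack_from("<%dH" % count, data, 2)
    for alg in client_prefs:
        if alg in server_supported:
            return alg
    return None   # fall back to the classic signing algorithm
```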
Starting point is 00:33:39 So this is the proof that the negotiate contexts actually allow us to slot in new behavior without having to rev the dialect revision. In fact, we showed that our prototype clients and servers interoperated just fine, even with mandated signing, with RTM bits and down-level bits. We refactored the entire encryption code path to be
Starting point is 00:34:00 knowledgeable of the fact that it can now be operating in integrity or integrity and privacy mode. And then we added a new transform header flag to indicate that the payload you received was signed and not actually encrypted. So let's see what that does for us here. So the purple bar is AES-GMAC, so our new signing, and the green bar is still AES-GCM. And you can see that for our file copy workload that we were testing, we didn't get any more throughput,
Starting point is 00:34:28 but that's because we're already pushing up against 10 gig line rate. There's nowhere to go, and I didn't have any 40 gig Ethernet NICs laying around to test with. So maybe I'll try to scrounge that together for next year. The more interesting thing is what it did for our efficiency. So GCM was 4.8 cycles per byte, and GMAC is 3.8.
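The arithmetic behind the number he's about to quote, taken straight from those cycles-per-byte figures:

(4.8 − 3.8) / 4.8 ≈ 0.21, i.e. roughly a 21% drop in cycles spent per byte at the same throughput.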
Starting point is 00:34:47 So we get a 21% reduction in CPU utilization by going to this new signing algorithm, which is pretty interesting. Because if you're trying to pack your VMs in, and you want to protect their data over the wire, but they're already providing privacy for themselves, then you could just pack more VMs onto your hypervisor now. The other thing that I'll call out here too is that the prototype that Aaron worked on was focused entirely on functional correctness, right? We just wanted to get it up and running and see how it went.
Starting point is 00:35:21 We didn't spend any time performance optimizing this, and after he finished, we sat down and took a quick look through the code. And we're aware of several fairly easy improvements that would further reduce the cycles per byte for AES-GMAC. So this is kind of an interesting future direction. Yeah? How exactly is GMAC negotiated? If you negotiate GCM for encryption,
Starting point is 00:35:45 is GMAC then used for signing? So we negotiate signing and encryption separately. So using these negotiate contexts, the prototype client would indicate to the server, hey, I support these ciphers, and I support these signing algorithms now. Right? As far as I remember,
Starting point is 00:36:04 it's only the encryption cipher you negotiate, right? Not signing. The negotiate context doesn't have... So, like I said, for the prototype, we actually had to make changes so that we could negotiate the signing algorithm. Okay. Yeah.
Starting point is 00:36:22 One more question. Which Windows client has the performance improvements for encryption? Is it the version of Windows 10 that is not out yet or the TP3 version of Windows 10? Are you asking about the parallelization? Yeah, so the RTM bits for Windows 10 do not have the parallelization work yet.
Starting point is 00:36:42 I would expect that the next release of the client would include those, and then the next release of the preview server operating system would include those. And is that across the board on protocols, like even SMB2 signing, SMB3 encryption, SMB 3.1.1 encryption? The parallelization improvements would only affect the encryption path, but it would also include the old CCM. So if I skip back real quick, you'll note that CCM used to be, what, 236 megs a second, and with
Starting point is 00:37:14 the parallelization improvements it's up to 973. All right, I'm going to hand it off to Dan now. How's this working in the back of the room? Sounds good? Great. So, I'm here to talk about something that's a little bit fun. We had an opportunity to do a bit of a quick sprint piece of work with some partners over about the last six to eight
Starting point is 00:38:14 weeks to present here today. Greg showed 100. I'll upgrade that slightly by two. So what we did was we stood up what we think is the first multi-vendor dual 100 gigabit testbed configuration just to see what our in-flight Windows Server 2016 is standing up to do, and put some numbers alongside some of the other 100 gigabit results that I think you've started to hear earlier in this conference and probably publicly over the last couple months.
Starting point is 00:38:52 Proud to say we partnered with Arista. They provided us, and we're actually able to effectively pre-announce it for them here, their Arista 7060QX 32-port, 100 gig switch. We're using Mellanox ConnectX-4 NICs, single port. I probably need to pay attention to my volume because it's... Is it going down? Okay. All right, well, I'll just try to keep it managed.
Starting point is 00:39:19 This isn't going to take too long. SW, just raise your hand in the back if I get too quiet. And Mellanox ConnectX-4 NICs on the back side, two per node, single port connected up. So we have four total ports connected up through this switch for the examples we're going to have here. We're also proud we were able to bring HGST's Ultrastar SN150s into this. So I'm going to actually show two cases here. One case, memory to memory between the two nodes, DDR4-2133, and then all the way through to the storage on the other side.
Starting point is 00:40:00 Now if you count my PCIe lanes, you're going to see that this is very much still the crawl-walk-run. This isn't crawl, walk, run in terms of system capability; it's more in terms of how much content we can assemble in a very short sprint time frame. We have 16 lanes of PCIe Gen 3 on each of those cards, so 32 total lanes. We're only able to assemble eight lanes of NVMe flash in the clients. We hope to go much denser in the future and get more of the bandwidth. So basically, just a preview, we're going to show bandwidth in the first slide, and
Starting point is 00:40:36 we're going to show latency all the way through to end-to-end storage in the second. One quick note I want to make about 100 gig; a few folks have mentioned this in other talks. What we're actually looking at here is only a 100 gig logical port. What this is actually constructed with are four bonded lanes of 25 gig Ethernet. So in much the same way that if you're familiar with the
Starting point is 00:41:01 way that 40 gig switches have appeared and been utilized, where you can have a single 40 gig port that's broken out into four 10 gig lanes; in the 100 gig generation, that's how 25 gig Ethernet appears, or at least one good way for it to appear. So you're going to have these 32-port switches on the top pushing, potentially, four-by-32 25 gig lanes out of them. So, an extremely dense, high-speed network. And we got 22 gig a second, which we believe is right about on top of the theoretical limits.
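As a rough sanity check on that ceiling (assuming the figures here are gigabytes per second, aggregated across both 100 gig ports): 2 × 100 Gbit/s is 25 GB/s of raw line rate, and framing and protocol overhead knock that down to the low 20s of usable payload, which is why 22 GB/s is essentially on top of the limit.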
Starting point is 00:41:41 This was basically just after having worked through the basic multi-vendor interop issues. This is all very new gear. This is the first, actually the first time that Arista had met Mellanox. They met them in our lab, which was kind of entertaining. Plugged it up and ran a short 512k workload into them and basically saturated. Theoretically, if you do the math on what's available through the PCIe Gen 3 slot and 100 gig line rate, you might get a little bit, just a touch
Starting point is 00:42:18 above 23 gig a second. But when you get to the edge of a pipe, you start to run into the actual theoretical limits of driving a PCIe slot and everything else. We think we're basically right on top of it with this load here, which is kind of exciting for us to see. And then we take it to the storage on the other side.
Starting point is 00:42:41 These are publicly available NVMe parts. They're rated for about 3 gig a second. And just as a matter of not showing completely overdriven performance, we wanted to focus a little bit on the latencies that the 100 gig fabric and our operating system were introducing. We drove them just a little bit short, so we're getting about 5.7 gig a second. And then measuring the latency distributions edge to edge, and this is actually kind of the point, you really can't see the separation between local and remote performance there on that latency distribution. You see a little bit of fuzz out there. That fuzz is the latency distribution of the local relative to the solid remote. That's why I have the second graph over here, to actually difference them.
Starting point is 00:43:30 And at the median of that latency distribution, we're only introducing 28 microseconds of additional latency taking that to a remote system over dual 100 gig ports, which we think is a pretty spectacular result to be starting from. With that, that really concludes what we have to talk about today. I just wanted to leave off with a quick little picture of what our lab looks like at the moment, which is an interesting historical perspective all in one frame: gig, to 10 gig. This is actually one of those... if you haven't seen them before, this is what one of those octopus cables looks like, breaking out a 40 gig port
Starting point is 00:44:21 into four 10 gig links. The rest of the 40 gig, and then 100 gig. Which is actually an interesting note in and of itself we should make there, that's copper. A couple years ago, if you were talking to folks, we were being told left, right, up and down, front and back, that 100 gig was gonna require optical connections. That's not the case. This is two to three meter copper.
Starting point is 00:44:51 And you can see that we have a lot more ports wired up than I talked about here today, so watch this space. And we hope to show some more interesting results in the near future. I guess that's basically it. Any questions? My part, Greg's part? What's the I/O size you used for the benchmark? I was using all 512K I/O here, all large I/O. Basically, the focus for this sprint: we obviously had some interop work, you know, and all these results basically came together in the last week and a half.
Starting point is 00:45:28 We're going to go the rest of the way, you know, showing you small I/O latencies in future presentations. So we're basically focused on filling the pipes and, you know, filling the capabilities of whatever part we had on the other side. So filling the NVMes and filling the dual 100. Why did you choose the copper cables? That's what Mellanox gave us. It worked.
Starting point is 00:45:55 It's just, like, for smaller I/O it's interesting to see what the latency difference would be. Yeah, you might actually start to appreciate some small difference there. And another thing, hopefully you're able to see this on the slide, is that these are thick. I mean, just in terms of cable management, yeah, you could find yourself thinking that's pretty reasonable for 100 gig, because it's noticeable how much fatter the cable is. In back, yeah. So the question is whether it's a change from 3.1 that you don't do pre-auth integrity hashing for re-auth session setups. No, it's not a change.
Starting point is 00:46:56 The original Windows 10 preview release didn't do that. It could be a doc bug, but it's one that's since been addressed because it's been superseded by the final RTM documentation. For a master session setup or a binding session setup, you must compute the pre-auth integrity hash, because those session setups result in secret keys being generated. But a re-auth is just re-auth. There's no key that's generated as a result. Okay.
Starting point is 00:47:27 Yep. And I'm wondering about the encryption. So today on Windows 10 and Windows Server 2016, would we still be using the flags that are part of the tree connect response to figure out if we want to encrypt the session? So the question is whether we're still using flags in the tree connect response to determine whether we're encrypting.
Starting point is 00:47:48 Yes. Yeah. Nothing about how the server tells the client that he requires encryption has changed. So there was this idea in the preview release that it would be nice to allow a feature because in the original implementation of SMB encryption, the choice to encrypt was entirely the servers, right? It was the server that told the client, like, I require encryption or I don't require encryption. But it seems like it would be nice to also allow a paranoid client to say, like, well, I don't care what you do or don't require. I require encryption, right? And so the preview release was attempting to address that deficiency, right?
Starting point is 00:48:29 What I realized after thinking about it some more was that we didn't need any protocol changes to allow that, right? If a client wants to encrypt and wants to make sure that he's only using encrypted packets, he can do that by simply mandating signing and then only emitting encrypted packets after session setup completes. Yep.
Starting point is 00:49:01 It only looks like it's open, but what if the older client's application continues to send writes through the old handle? So, yeah, so the question is what if the old instance of the application continues to try to do I.O.? So the issue is that he won't be able to because when the new instance spins up on the other
Starting point is 00:49:23 node, he's going to do an invalidating open. He's going to tell the server, hey, I'm the same guy. I'm on a different node. I want my handles back. And we'll kill those old handles so that they're no longer valid. Yeah. Yep.
Starting point is 00:49:40 We're out of time. Any last minute questions? Otherwise, you can find us around at the Plugfest or just around at the conference and ask away. All right, thanks. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list
Starting point is 00:50:00 by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
