Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 8.
Today we hear from Greg Kramer, Senior Software Engineer,
and Dan Lovinger, Principal Software Engineer with Microsoft,
as they present their SMB 3.1.1 update from the 2015
Storage Developer Conference. So thanks for coming. I'm Greg Kramer. You probably
recognize me from years past; I've done the SMB talks and the SMB Direct talks.
Dan Lovinger is going to join me later with some interesting 100 gig RDMA results. Let's get started.
So this time last year we were here with the Windows 10 preview and the SMB 3.1 dialect
and this year we're going to cover 3.1.1 which is a minor revision on our preview release dialect.
Real quick, our agenda here, talk about some dialect changes we've made, kind of fly through
the SMB 3.1.1 features.
Not a whole lot has changed since last year, and this is like the third presentation of
this material, I think, depending on which conferences and all that you go to. So I'm going to try to get through it rather quickly.
I will be here all week with the rest of the SMB team, so if people have
any sort of questions just grab us and we'll get you sorted out.
Once we get through the 3.1.1 material, we've got some interesting
sort of future directions to talk about, some fun
prototypes we've worked on that I'd like to share with you guys, and we'll make sure you
leave happy with some of the stuff we're looking at. So let's just get this out of the way
quickly. So we've revised how we represent our dialect numbers now. We now use a major.minor.revision format,
so 2.0.2 instead of 2.002 or whatever.
It's not a big deal.
This is mostly just fit and finish.
We realized that the way we've been doing it
gets kind of silly if you start introducing
like hexadecimal numbers into the dialect string
and it becomes unclear how to write them.
So let's just standardize on a format that we all understand.
This should all be updated in the protocol document at this point.
And most places in Windows 10 where we represent dialect strings out to the users have been updated to use the new format. So jumping right in, the Windows 10 RTM SMB dialect is 3.1.1.
Like I said last year at SDC, we had presented 3.1. Now the good news is for people that had started working on this,
it's a very, very minor set of changes. We tweaked just a few things. We required a minor revision
update just to retain compatibility. It's not a big deal. If you're a good way through
your 3.1 implementation, there's not a whole lot you have to change. If you look at the
slides, I will call out the differences in red text there. I'll also speak to them. There's only two, I think. So that's not so bad. The SMB 3.1 dialect is gone now, though. So Windows 10
RTM no longer speaks this dialect. It was an interim engineering dialect. We will not
negotiate it. So you will have to update your dialect strings. And we expect, but of course can't promise, that server
2016 will also speak 3.1.1 just like Windows 10 RTM.
And by the way, if anybody has questions, feel free to
interrupt as I'm talking.
What could it be, if not 3.1.1?
Excuse me?
If it is not 3.1.1, what could it be?
So the question is, if 2016 doesn't use 3.1.1, what else would it be?
So 2016 would, of course, support 3.1.1,
but theoretically, if we were to add any new features,
we may end up having to rev, like, minor version or something.
Like I said, I very much doubt that will happen,
but, of course, we can't promise it at this point.
Alright, so new features in the protocol.
3.1.1 was mostly a
fit and finish release for us. There was a bunch of loose ends that I think everybody
wanted to tie up and we had the opportunity to do at this release.
Over the past several releases,
you've noticed we've crammed quite a bit of functionality into SMB,
and some of these capabilities that we've added,
they start to get kind of complex.
And so when the client and server need to negotiate
how they're going to use these capabilities,
sometimes they're not just simple flags anymore.
Like we actually have to exchange rich information between the client and the server and we're
sort of running out of unused bits in our existing negotiate request.
So we needed a way to introduce extensible negotiations so that the client and server
could actually send rich information back and forth between each other. And so we introduced the idea of negotiate contexts.
So if you guys are familiar with the protocol already,
this looks very much like create contexts.
It's basically the same exact idea.
We took two of the remaining unused fields,
and we turned them into a negotiate context offset
and count.
And then we just create a linked list of these blobs
behind the negotiate requests or responses.
And you can see what that looks like here.
The blobs begin on the first eight byte aligned offset
following the usual request or response packet.
Each subsequent blob begins on the next eight byte aligned
address after the preceding one.
And the contexts are strongly typed.
So they have an ID field that tells you what they are.
They have a data length.
And then the data that the context carries is
type-specific.
This is very general.
You can use this to transport all sorts of information, as
you'll see shortly.
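A minimal sketch of how a receiver might walk that context list, assuming the [MS-SMB2] layout of a 2-byte context type, a 2-byte data length, 4 reserved bytes, and then the type-specific data, with each context starting on the next 8-byte boundary:

```python
import struct

def parse_negotiate_contexts(buf, offset, count):
    """Walk the negotiate-context list appended to a negotiate request or
    response. Layout assumed per [MS-SMB2]: ContextType (2 bytes),
    DataLength (2 bytes), Reserved (4 bytes), then DataLength bytes of
    type-specific data; each context begins on an 8-byte-aligned offset."""
    contexts = []
    for _ in range(count):
        offset = (offset + 7) & ~7                      # align to 8 bytes
        ctx_type, data_len = struct.unpack_from('<HH', buf, offset)
        data = buf[offset + 8 : offset + 8 + data_len]
        contexts.append((ctx_type, data))               # unknown types must simply be ignored
        offset += 8 + data_len
    return contexts
```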
The key things to know about this are that your client will only send these negotiate
contexts if it supports 3.1.1.
The server will only send them back if it selects 3.1.1 for the connection dialect.
The receiver must ignore unknown negotiate contexts.
So if you're interpreting the ID field, the type field, and you don't know what it is, you have to ignore it.
And this is very important and we'll talk about this a little bit later.
But the intent here was to allow us to add new features to the protocol without necessarily having to require a dialect revision.
Dialect revisions are sort of a pain, especially if you have
to proto-doc them all.
Tom Talpey has to go through and make a million changes to
the document.
So try to be nice to Tom.
The one interesting thing to note about this is that since
a client doesn't know that he's talking to an SMB 3.1.1 server beforehand,
he just kind of has to assume that he might be,
and he'll attach contexts to his negotiate request.
Now, this could be problematic if your server was coded to say,
hey, I know the size of an SMB 2 header and the size of an SMB negotiate request,
and I'm only going to accept something that's that big
because you might now get this linked list of blobs following your request.
Now, in practice, we haven't run across anyone during the Plugfest that has this problem.
It's just something to be aware of.
Windows will accept up to like 128K or something like that worth of data on the negotiate request.
So just be aware of this.
Pre-auth integrity.
Negotiate contexts introduce a sort of unique problem
from a security perspective.
So in SMB3, we added negotiate validation
to prevent dialect downgrade attacks.
So it was an after-the-fact check
that no man in the middle
messed with your negotiate request or response, right?
And the way that that mechanism worked was that you re-exchange the same information
that you exchanged in your negotiate request and response
and compared notes afterwards in a signed fashion.
And if you detected a difference, then somebody messed with your packets.
The negotiate contexts don't fit in this scheme because, of course,
the format of the negotiate validation
request only contained the fields that were present in the original negotiate request
and response packets, so there's no way to actually represent negotiate contexts under
this protection mechanism, which is not good because we want to use these to negotiate
all sorts of complex connection properties and we don't want a man in the middle messing
with those. The other thing that it doesn't cover is that session setup
requests and responses are also pre-auth, right? The final session setup response is
the first opportunity to actually sign or encrypt something, well not encrypt but sign
something, right? So in the future if we wanted
to extend session setup in any way, we would have to be very careful about how we did that
because of course those messages can be modified and nobody would detect it, right? And that's
sub-optimal, we don't like that. So Preauth integrity solves these problems. So if you're
familiar with TLS, this probably looks a lot like
the TLS mechanism. The basic idea is that the client and server compute a rolling
hash of every request and response they send and receive, you know, etc. And at
the end, when you get to session setup, you have a hash value that represents
the entire message exchange that you've seen so far.
And then if you use that hash value as an input into your
key derivation function, you can derive secret keys that
depend on the integrity of the message exchange that you've
had, such that if you send the final signed session setup
response, the client can only validate the signature if
nobody modified the packets on the wire.
This is one of our changes from 3.1:
the client must sign or encrypt the tree connects
when he sends them back to the server.
And this sort of closes the loop. The server will always send the final session setup response signed. The client will always sign or encrypt a tree connect. The ability to either decrypt or
validate the signature means that you guys computed the same hash value and then nobody
tampered with your data. The really nice thing about this system is that it's message agnostic.
Like we can change session setup messages, negotiate, we can do whatever we want. It's just a hash. It doesn't interpret any of the fields or anything.
So how do we select the actual hash function
that we're going to use?
And this is where the protocol starts making heavy use
of negotiate contexts.
So we introduced a new negotiate context.
It's the preauth integrity capabilities.
It's basically just a list of hash functions
that you support along with the salt value to prevent pre-image attacks on the actual
hashing. The client sends his list of supported hash functions to the server. The server selects
one, sends a context back that indicates which hash function you'll be using, and then both
sides use that to compute their hash value. Currently, SHA-512 is the
only hash function that we support. It will probably be that way for some time, but if
we want to shove new hash functions in, it's pretty simple now.
This is a quick illustration of how we actually compute the hash value. There's a whole write-up in the protocol document.
The basic idea is you just start out with a hash value that's all zeros, and then you
just concatenate each packet that you send, each packet that you receive to the existing
hash value, rehash it again, and store the value.
You end up with a hash for the connection, and then use that to build up the hash for all the sessions
that are established on that connection.
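A minimal sketch of that rolling computation, using Python's hashlib and placeholder byte strings standing in for the raw SMB2 messages:

```python
import hashlib

def update_preauth_hash(previous_hash, packet_bytes):
    # One step of the rolling hash: SHA-512 over (previous value || raw packet).
    return hashlib.sha512(previous_hash + packet_bytes).digest()

# Placeholder byte strings; in a real implementation these are the raw
# SMB2 negotiate and session setup messages as sent on the wire.
negotiate_request = b'...'
negotiate_response = b'...'
session_setup_packets = [b'...', b'...']

# The connection hash starts as 64 zero bytes and folds in the negotiate
# request and response.
connection_hash = bytes(64)
for pkt in (negotiate_request, negotiate_response):
    connection_hash = update_preauth_hash(connection_hash, pkt)

# Each session's hash continues from the connection hash and folds in that
# session's pre-auth session setup messages.
session_hash = connection_hash
for pkt in session_setup_packets:
    session_hash = update_preauth_hash(session_hash, pkt)
```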
And then you pass them into your key derivation function.
So we have not changed the KDF.
It remains SP 800-108 counter mode HMAC-SHA-256.
It needs a snappier name.
Anyway, that's the same as SMB3.
But we have changed the labels that are used to
derive the values.
And the context value is now the session's final pre-auth
integrity hash value.
So if you think about this, then the key derivation
function is meant to take a master key and
produce child keys such that if the child keys are compromised, you can't recover the
master key.
So inputting the hash value here produces unique child keys that are predicated upon
everybody seeing the same hash value.
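A sketch of the single-block counter-mode construction with the pre-auth hash as the context input; the label string, its exact null placement, and the placeholder keys here are illustrative, with the normative values in [MS-SMB2]:

```python
import hmac
import hashlib
import struct

def smb3_kdf(key, label, context, out_bits=128):
    """Single-block SP 800-108 counter-mode KDF with HMAC-SHA-256:
    HMAC(key, counter || label || 0x00 || context || output-length).
    Exact label bytes and null placement should be taken from [MS-SMB2]."""
    msg = (struct.pack('>I', 1) + label + b'\x00' +
           context + struct.pack('>I', out_bits))
    return hmac.new(key, msg, hashlib.sha256).digest()[:out_bits // 8]

# For 3.1.1 the context input is the session's final pre-auth integrity hash.
session_key = bytes(16)           # placeholder master key
session_preauth_hash = bytes(64)  # placeholder final session hash
signing_key = smb3_kdf(session_key, b'SMBSigningKey', session_preauth_hash)
```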
Key points. Pre-auth integrity is mandatory for 3.1.1.
It supersedes negotiate validation, so we no longer perform negotiate validation on
3.1.1 connections.
And we received a number of questions about this during the plug fest, so I thought I'd
throw this in here too.
You have to compute the preauth integrity hash for master session
setup and binding session setups, but not reauth. And the reason is because reauth doesn't
result in keys, right? The whole point of the preauth integrity hash is to use it to
derive child keys and if you're not going to derive keys, you don't need to do it. So, reauth is exempt.
Cluster dialect fencing.
So, imagine that you have a storage cluster and you're using SMB to serve out application data
and you would actually like to update this cluster
at some point in time.
So, in the past, this was not an easy thing, right?
It mostly involved bringing the cluster down, which resulted in downtime. And the reason for that is because if you want to have transparent
failover, if a client connects to a node in the cluster and negotiates a dialect, he expects that
he'll be able to failover to any other node in that cluster and reclaim that same dialect, because
he probably has handle state that is associated with features
that are tied to a particular dialect. If that doesn't happen, the client breaks.
During the process of upgrading the OS on your cluster nodes, and I think
every Windows release we've had has come with a new version of SMB, the probability is that the upgraded nodes will have a
higher dialect than the non-upgraded nodes, at which
point if a client were to connect to one of the
upgraded nodes and then fail over to a non-upgraded node,
it's out of luck.
It's not getting that dialect back.
So solving this is actually fairly simple.
We introduced a new concept of a maximum common cluster
dialect, and we fence access based on that dialect.
So for example, if you had a cluster that was running
server 2012 R2, so everybody speaks 3.0.2, and then you
start upgrading it to Windows Server 2016, you would define
your common maximum dialect to be 3.0.2,
right? And then all of the Server 2016 nodes that are upgraded would be informed by the
cluster infrastructure that, hey, you got to pretend to be Server 2012 R2 nodes for
now, so don't hand out any 3.1.1 connections to anybody that's trying to access cluster resources.
And if somebody comes in and attempts to tree connect
to a clustered share, you need to fail them.
And we did that by introducing a new status code.
You'll get this unique status code back as the client.
And the error payload for your failed tree connect
will include the maximum cluster dialect.
So if you come in, we'll fail you.
Hey, your dialect's too high.
The maximum dialect you can use is this.
So then you disconnect and reconnect,
and everything's good.
So there's one minor change to tree connect request that we had to make.
We took the old reserved field, turned it into a flags field.
There's sort of an interesting race condition we discovered here because when you finish
updating the OS on all of your cluster nodes, the cluster infrastructure has to broadcast
a message to all those nodes that says, hey, you guys can all start being Server 2016 nodes now.
Like, we're all there.
That message isn't necessarily received or processed
simultaneously by all the nodes, right?
So there's a small period of time
where they disagree about what the maximum cluster dialect is.
So the easy way to fix that
is that once a client has successfully tree connected to a clustered share,
then all subsequent tree connects to the same cluster set this cluster reconnect flag.
The flag is the client telling the cluster node,
hey, I have previously connected to a clustered share with this dialect, so you've got to let me in.
Like, I don't care what you think the maximum cluster dialect is.
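A sketch of that server-side fence; the status-code and flag values below are illustrative placeholders, not the normative constants:

```python
STATUS_SMB_BAD_CLUSTER_DIALECT = 0xC05D0001   # illustrative value only
TREE_CONNECT_FLAG_CLUSTER_RECONNECT = 0x0001  # illustrative value only

def fence_tree_connect(connection_dialect, is_clustered_share,
                       tree_connect_flags, max_cluster_dialect):
    """Fail tree connects to clustered shares whose connection dialect is
    above the cluster's common maximum, unless the client asserts it has
    previously connected to a clustered share (cluster reconnect flag)."""
    if not is_clustered_share:
        return None  # standalone access is never fenced
    if tree_connect_flags & TREE_CONNECT_FLAG_CLUSTER_RECONNECT:
        return None  # client says it was already in at this dialect; let it back in
    if connection_dialect > max_cluster_dialect:
        # the error payload also carries max_cluster_dialect so the client
        # can disconnect and renegotiate at or below it
        return STATUS_SMB_BAD_CLUSTER_DIALECT
    return None

# Dialects compare as numbers, e.g. 0x0311 (3.1.1) > 0x0302 (3.0.2).
```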
Now, the thing to realize here is that this isn't a, I mean,
the server trusts the client, but this isn't a security
boundary.
There's no vulnerability here.
A client that would maliciously or erroneously set
this flag only hurts himself.
He's only establishing handle state
that he can't resume if he fails over, at least not safely.
So it's not a problem.
Key points: dialect fencing only affects clustered shares.
So cluster nodes generally have two personalities.
They expose clustered resources, but they're also just standalone file servers.
If somebody wants to access the node as a standalone machine, they're free to do so,
and they're not subject to the dialect fence since they have no failover.
What this does mean is that you can't mix clustered and non-clustered access on the same connection; it won't work.
And clients should just be aware that you need to have some sort of protection against malicious or buggy servers
so that you don't go into an infinite retry loop if they just keep sending you the reconnect status.
All right.
Cluster client failover.
So just a quick overview.
This was to solve cases where the application was actually running on a cluster also.
And so you can end up in situations where the client application was connected to a server node
and had handle state established, and then the application node dies and the cluster
restarts the application on a different node in the application cluster.
All of a sudden the server node sees reopens coming in for handles that it still might
think are open by the original client if he hasn't detected the failure yet.
And CCF was a way to invalidate those old handles so that you could say, oh, okay,
I get it. The guy I was previously talking to is somewhere else now and I'm going to
allow him to come in and do this. The one thing that this didn't address was partitions
in the application cluster, right? So it effectively handles the situation where the application
dies on one node and restarts on another. What it doesn't handle is cases where the
cluster believes that one of its nodes has died because of a network partition or some
other reason, but that that node can still see storage. At which point the application
cluster is going to restart the application on a new
node in its cluster and now the application is running in two places at once. And what
would happen previously with CCFv1 is you could get into this tug of war where the new
application would contact the server, ask for its old handles to be invalidated and
so the server would do that. But then the still running original instance
of the application would see errors and would ask for its handles to be recovered and the
server would say, sure, here you go. And they would just ping pong back and forth as they
fought each other over who gets to be the real application, right?
So CCFv2 addresses this problem. Basically, in addition to the application instance,
we just add an instance version,
with the idea being that the version is just increased in some way
every time the application cluster detects that it has moved the application,
such that we can resolve this conflict by saying,
hey, if I get two people that both say they're trying to resume handle state, whoever has the higher version number is the most recent guy. So we're going to
allow him to recover his handle state, and then we'll fail the other guy with an error code that
says, hey, don't retry anymore. You're no longer the application. Somebody else has taken that job. So responsibilities for CCFv2, the client has to pass the new version
alongside the app instance on create. And when you get the new error code, you have
to give up. So CCFv2 would prevent the original application from actually recovering handle state, but it would be
obnoxious if he just sat there retrying infinitely.
So you see the new error, stop trying.
And the server has to look at the new instance version on
every invalidating open.
Higher versions win.
And then we fail the lower version with the new error
code.
For old client cluster or application clusters, we have some simple rules to deal with them
since they don't know about these new versions.
So if you don't give us one, we'll assume that your version is zero. Zero always trumps zero and otherwise all the same rules apply.
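A sketch of that arbitration rule, with hypothetical return values standing in for the real handle invalidation and error paths:

```python
def arbitrate_invalidating_open(existing_instance_version, incoming_instance_version):
    """CCFv2 arbitration as described above (hypothetical return values).
    A client that sends no version is treated as version zero."""
    existing = existing_instance_version or 0
    incoming = incoming_instance_version or 0
    if incoming >= existing:
        # higher version wins; zero trumps zero, so the newest zero-version
        # instance also gets to invalidate the old handles
        return 'invalidate_old_handles_and_allow_open'
    # lower version loses and is failed with the new "stop retrying" error
    return 'fail_with_stop_retrying_error'
```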
All right, crypto improvements. So we introduced encryption in SMB3 and we
mandated the AES-128 CCM cipher.
So the question is, what happens if you need a different cipher?
So you need different performance characteristics.
If you're operating in an environment that has particular regulatory requirements,
what if the cipher's compromised?
For all we know, next week there will be a big leak and we'll find out that there's something wrong with it.
You don't know, right?
You'd like the ability to have some crypto agility here.
You'd like to be able to replace these ciphers if you needed to.
So for 3.1.1, we allowed ciphers to be negotiated on a per-connection basis,
and we added support for AES-128-GCM.
This is the second change from the 3.1 preview protocol.
We had introduced a flag that the client could set in his
session set up to tell the server, I don't care if you
require encryption, I do.
What I realized is that flag is totally unnecessary.
You can achieve client-mandated encryption
without any protocol changes. To do that, if you're inclined to implement this feature
in your implementation, you just have to have your client indicate that he requires signing,
which is something that has existed in SMB since SMB2. And then once you complete session
setup, just start emitting encrypted packets. Right?
So the SMB protocol requires the server to reply in kind. So if you give him an encrypted
packet, he has to give you one back that's encrypted. And you've already indicated that
you require signing. So the server will reject anything that doesn't show up at least
signed. Crypto trumps signing, so a man in the middle can't inject his own traffic, and as
long as you're only emitting encrypted packets, you've got
an all encrypted connection.
So the protocol change was unnecessary,
and we've reverted it.
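A minimal sketch of that client-mandated-encryption idea, with a hypothetical client and transport object; only the SecurityMode "signing required" bit is taken from the protocol:

```python
SMB2_NEGOTIATE_SIGNING_REQUIRED = 0x0002   # SecurityMode bit

class MandatoryEncryptionClient:
    """Sketch only: require signing at negotiate/session setup, then emit
    only encrypted packets once the session is up. Object and method names
    are hypothetical."""

    def __init__(self, transport):
        self.transport = transport
        self.session_established = False

    def security_mode(self):
        # Advertised in negotiate and session setup so the server rejects
        # anything a man in the middle injects unsigned.
        return SMB2_NEGOTIATE_SIGNING_REQUIRED

    def send(self, packet):
        if not self.session_established:
            return self.transport.send(packet)
        # After session setup, emit only encrypted packets; the server must
        # reply in kind, so the whole connection ends up encrypted.
        return self.transport.send(self.encrypt(packet))

    def encrypt(self, packet):
        raise NotImplementedError("wrap the packet in a transform header here")
```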
Question?
Does either Windows 10 or Windows Server actually implement that?
So the question is whether we chose to implement the client-mandated encryption for Windows
10 or Server 2016, and the answer is no, not at this time.
Negotiating a cipher.
Just like with the preauth integrity hash, we introduced a new negotiate context,
and it works basically the same way.
The blob comes in.
It contains just an ordered list of the ciphers you support from most to least preferred.
The server will select one of those.
The policy that you use to select is server's choice.
And then the server will respond with a negotiate context indicating the cipher
that was chosen for the connection. Something to note is that for 3.1.1
connections the SMB2 encryption capabilities flag is no longer used. It's
not necessary anymore. The presence or absence of the negotiate context is
unambiguous. You can tell whether the server supports crypto or not
based on the response.
We also had to make two minor changes
to our transform header,
which is the header we use for encryption
that precedes the SMB2 header.
AES-128-CCM used an 11-byte nonce.
GCM uses a 12-byte nonce, something to be aware of.
And the encryption algorithm field was renamed to flags.
It used to indicate which algorithm was being used for the connection.
I think we realized that that didn't make sense because for pre-3.1.1 connections,
it could only be AES-128-CCM.
It was hard-coded to the protocol.
And in 3.1.1, you now negotiate it.
Once you've negotiated it,
you know which cipher is being used for your connection.
It doesn't need to be in the field here.
So this flags field, the value 1,
now just simply indicates that the transform is encryption using whatever
cipher was negotiated for your connection.
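A small sketch of that nonce-size difference, assuming the unused tail of the 16-byte nonce field is zeroed and that a per-session counter (never reused with the same key) supplies the nonce:

```python
def build_transform_nonce(cipher, counter):
    """The transform header carries a 16-byte nonce field, but only the
    first 11 bytes are used for AES-128-CCM and the first 12 for
    AES-128-GCM; the remainder is assumed zeroed here."""
    used = 12 if cipher == 'AES-128-GCM' else 11
    nonce = counter.to_bytes(used, 'little')   # monotonic counter, never reused per key
    return nonce + bytes(16 - used)
```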
All right, performance.
So we decided that we were going to look at large file
copy performance.
So this is a workload that historically does not do so
well with encryption. So SMB can copy at 10 gig line rate easily when we're plain text,
no encryption, no signing. These are the stats for the systems I use. If you've seen my past
talks, I've been using the same systems for, you know, several years now. So this is the standard two-NUMA-node, 16 physical
core system.
I've got an Intel 10 gig NIC.
I've got an NVMe device on both ends.
And I'm going to stand up a file copy workload.
So last year, this is where we were sitting.
You can see the green bar was AES-128-GCM, the new encryption algorithm.
It's a marked improvement over CCM, which was the SMB3 cipher.
Over twice as fast.
And significantly faster than signing, which is somewhat surprising, right?
I mean, encryption provides both integrity and privacy.
It's doing more work.
You would think it would be more expensive.
But GCM is a particularly optimized algorithm,
especially if you have AES-NI support in your chipset.
So anyway, this is where we were last year.
So the question is, where did we end up this year?
And this is where we are.
So we did some significant
performance optimizations since last year and you can see that we're basically touching
line rate now for file copy workloads using AES-GCM. CCM also improved measurably, but
GCM is still much better. And the really interesting thing is the
efficiency of these algorithms: cycles-per-byte-wise, GCM uses 33% fewer
clock cycles than CCM does, right?
what's the overall CPU usage?
Only for the encryption, you mean like just the algorithm itself?
I didn't break that out.
Yeah.
Are you CPU bound?
Am I CPU bound?
I don't believe so, no.
So, are those due to changes in the implementation of the server or redirector, or just in the BCrypt
routines?
So the question was why are we getting these improvements?
No, the actual implementation of the crypto routines did not change.
We found some pretty significant optimizations in the SMB client.
There were some opportunities for exploiting parallelism that we took advantage of,
and it works pretty well.
So key points for crypto.
CCM remains required because you need it for SMB3 compatibility.
GCM provides huge performance increases.
If you're looking to implement crypto for SMB, please consider doing so.
Please consider using GCM.
It works much better.
And one thing to be aware of is that
now that you can negotiate the cipher
on a per-connection basis,
if you're doing session binding,
you have to be aware that when you bind
one session across multiple connections,
all of those connections better have negotiated
the same cipher,
because you're going to be using the same keys for them.
That would be not good.
All right, future directions.
So I'm going to have the obligatory disclaimer:
I'm about to talk about experimental things that we
don't promise to ship, and blah, blah, blah.
They're fun to talk about. So as we saw in the previous slide, GCM is a lot faster
than SMB signing.
And that's interesting, right?
I mean, signing is only providing
integrity for the packets.
It's doing, conceptually, much less work.
Why isn't it faster?
And you could say, well, GCM is pretty fast now,
right? So maybe who cares? Maybe we just use GCM and call it a day and you get privacy
for free then. Yay. But, you know, what if we don't want to pay the extra cost? I mean,
what if we only need integrity? So, you know, maybe one example of that would be if you
are using SMB for your hypervisor so that he accesses the virtual
disks for the VMs he runs. And if those VMs are running full disk encryption, then all
the data that we're transmitting for our application for the VM is already encrypted. So why double
encrypt? Why pay the cost? It hurts the density for how many VMs you can run on your hypervisor.
So maybe we can look at making signing faster and more efficient.
So GCM is great.
And it turns out that there's this integrity-only variant of GCM that's called GMAC.
So if it's only doing integrity, it should be a lot faster.
So meet Aaron.
Aaron Friedlander was our summer intern in the SMB team this summer.
He's from Carnegie Mellon.
And we had him actually prototype integrating GMAC support into SMB 3.1.1
just to see how it went.
And I just want to give him a shout-out here.
Aaron did a really good job
on this project. I mean, we threw some real hard work at him. He had no prior kernel development
experience and he just did a really great job. Unfortunately, he couldn't be here today.
We tried to make that happen, but he's back in school now and had conflicts.
So what did we have to do to make this work? We took advantage of our negotiate context support
and we defined a new context to allow the client and server
to negotiate signing algorithms.
And the interesting thing about this was that
we didn't increment the dialect.
So this is the proof that the negotiate contexts
actually allow us to slot in new behavior
without having to rev the dialect.
In fact, we showed that our prototype clients and servers
interoperated just fine, even with mandated signing with RTM
bits and down level bits.
We refactored the entire encryption code path to be
knowledgeable of the fact that it can now be operating in
integrity or integrity and privacy mode.
And then we added a new transform header flag to indicate that the payload you received
was signed and not actually encrypted.
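A purely illustrative sketch of how a receiver in that prototype might branch on the new flag; the flag value, helper names, and key object are hypothetical, and none of this is in the published protocol:

```python
TRANSFORM_FLAG_ENCRYPTED = 0x0001
TRANSFORM_FLAG_SIGNED_ONLY = 0x0002        # hypothetical prototype flag

def verify_gmac_tag(key, payload):
    raise NotImplementedError("AES-GMAC tag verification goes here")

def decrypt_payload(key, payload):
    raise NotImplementedError("AES-GCM/CCM decryption goes here")

def unwrap_transform(flags, payload, keys):
    """Branch on integrity-only vs. integrity-plus-privacy transforms."""
    if flags & TRANSFORM_FLAG_SIGNED_ONLY:
        # Integrity only: verify the AES-GMAC tag; the payload is plaintext.
        verify_gmac_tag(keys.signing_key, payload)
        return payload
    # Integrity + privacy: decrypt with the cipher negotiated for the
    # connection (AES-128-GCM or AES-128-CCM).
    return decrypt_payload(keys.decryption_key, payload)
```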
So let's see what that does for us here.
So the purple bar is AES-GMAC, so our new signing, and the green bar is still AES-GCM.
And you can see that for our file copy workload that we were testing,
we didn't get any more throughput,
but that's because we're already pushing up
against 10 gig line rate.
There's nowhere to go,
and I didn't have any 40 gig Ethernet NICs
laying around to test with.
So maybe I'll try to scrounge that together for next year.
The more interesting thing is what it did for our efficiency.
So GCM was 4.8 cycles per byte, and GMAC is 3.8.
So we get a 21% reduction in CPU utilization by going to
this new signing algorithm, which is pretty interesting.
Because if you're trying to pack your VMs in, and you want
to protect their data over the wire, but they're already
providing privacy for themselves, then you could just
pack more VMs onto your hypervisor now.
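A quick arithmetic check of that figure:

```python
# Going from 4.8 cycles/byte (AES-GCM) to 3.8 cycles/byte (AES-GMAC) is
# roughly a 21% reduction.
gcm_cpb, gmac_cpb = 4.8, 3.8
print(f"{(gcm_cpb - gmac_cpb) / gcm_cpb:.1%}")   # -> 20.8%
```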
The other thing that I'll call out here too is that the prototype that Aaron worked on was focused
entirely on functional correctness, right? We just wanted to get it up and running and see how it went.
We didn't spend any time performance optimizing this and after after he finished, we sat down and took a quick look
through the code.
And we're aware of several fairly easy improvements that
would further reduce the cycles per byte for AES-GMAC.
So this is kind of an interesting future direction.
Yeah?
How exactly is GMAC negotiated?
If you negotiate GCM as the encryption cipher,
is GMAC then what's used for signing?
So we negotiate signing and encryption separately.
So using these negotiate contexts,
the prototype client would indicate to the server,
hey, I support these ciphers, and I support these signing
algorithms now.
Right?
As far as I remember,
it's only the encryption cipher
that you were negotiating, right?
Not signing.
The negotiate context doesn't have...
So, like I said, for the prototype, we actually had to
make changes so that we could negotiate the signing algorithm.
Okay.
Yeah.
One more question.
Which Windows client has the performance improvements
for encryption?
Is it the version of Windows 10 that is not out yet
or the TP3 version of Windows 10?
Are you asking about the parallelization?
Yeah, so the RTM bits for Windows 10
do not have the parallelization work yet.
I would expect that the next release of the client
would include those, and then the next release of the preview server operating system
would include those.
And is that across the board on protocols, like even SMB2 signing, SMB3 encryption, SMB 3.1.1 encryption,
or not really?
The parallelization improvements would only affect the encryption path but it would
also include the old CCM.
So if I skip back real quick, you'll note that CCM used to be, what, 236 megabytes a second, and with
the parallelization improvements it's up to 973.
All right, I'm going to hand it off to Dan now. How's this working in the back of the room?
Sounds good?
Great.
So, I'm here to talk about something
that's a little bit fun.
We had an opportunity to do a bit of a quick sprint piece of
work with some partners over about the last six to eight
weeks to present here today.
Greg mentioned 100 gig.
I'll upgrade that slightly, by a factor of two.
So what we did was we stood up what we think is the first
multi-vendor dual 100 gigabit testbed configuration just to
see what our in-flight Windows Server 2016 is able to stand up and
do, and put some numbers alongside some of the other
100 gigabit results that I think you've started to hear earlier in this conference and probably publicly over the last couple months.
Proud to say we partnered with Arista.
They provided us, and we're actually able to effectively pre-announce it for them here,
their Arista 7060QX 32-port, 100 gig switch.
We're using Mellanox ConnectX-4 NICs, single port.
I probably need to keep an eye on my mic because it's...
Is it going down? Okay.
All right, well, I'll just try to keep it managed.
This isn't going to take too long.
SW, just raise your hand in the back if I get too quiet. And Mellanox ConnectX-4
NICs on the back side, two per node, single port connected up. So we have four total ports
connected up through this switch for the examples we're going to have here. We're also proud
we were able to bring HGST's Ultrastar SN150s into this.
So I'm going to actually show two cases here.
One case, memory to memory between the two nodes, DDR4-2133 and then all the way through
to the storage on the other side.
Now if you count my PCIe lanes, you're going to see that this is very much still the crawl of crawl, walk, run.
This isn't crawl, walk, run in terms of system capability.
It's more in terms of how much content we can assemble in a very short sprint time frame.
We have 16 lanes of PCIe Gen 3 on each of those cards, so 32 total lanes.
We're only able to assemble eight lanes of NVMe flash in the clients.
We hope to go much denser in the future and get more to the bandwidth.
So basically, just a preview, we're going to show bandwidth in the first slide, and
we're going to show latency all the way through to end-to-end storage in the second.
One quick note I want to make about 100 gig; a few folks
have mentioned this in other talks.
What we're actually looking at here is only a 100 gig
logical port.
What this is actually constructed with are four
bonded lanes of 25 gig ethernet.
So it's much the same way that 40 gig switches have appeared and been utilized,
where you can have a single 40 gig port that's broken out into four 10 gig lanes.
In the 100 gig generation, that's how 25 gig Ethernet appears, or at least one good way for it
to appear.
So you're going to have these 32 port switches on the top pushing four by 32 potentially
25 gig lanes out of them.
So extremely dense high speed network.
And we got 22 gig a second, which we believe is right about on top of the theoretical limits.
This was basically just after having worked through the
basic multi-vendor interop issues. This is all very new gear. This is the first,
actually the first time that Arista had met Mellanox. They met them in our lab,
which was kind of entertaining. Plugged it up and ran a short 512k workload into
them and basically saturated.
Theoretically, if you do the math on what's available
through the PCIe Gen 3 slot and 100 gig line rate,
you might get a little bit, just a touch
above 23 gig a second.
But when you get to the edge of a pipe,
you start to run into what's the actual theoretical limits of driving a PCIe slot
and everything else.
We think we're basically right on top of it with this
load here, which is kind of exciting for us to see.
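A back-of-envelope version of that math; the framing-efficiency factor is a rough assumption just to show the shape of the calculation:

```python
# Two 100 Gb/s ports give 25.0 GB/s raw; protocol and framing overhead
# (factor below is assumed, not measured) brings the practical ceiling
# down to roughly the "touch above 23" quoted.
raw_GBps = 2 * 100e9 / 8 / 1e9     # 25.0 GB/s
framing_efficiency = 0.94          # assumed overhead factor
print(raw_GBps * framing_efficiency)   # ~23.5 GB/s
```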
And then we take it to the storage on the other side.
So you can see, these are publicly available NVMe parts. They're rated for about
3 gig a second. And just as a matter of not showing completely overdriven performance,
we wanted to focus a little bit on the latencies that the 100 gig fabric and our operating system
were introducing. We drove them just a little bit short, so we're getting about 5.7 gig a second.
And then measuring the latency distributions edge to edge, and this is actually kind of the point, you really
can't see the separation between local and remote performance there on that latency distribution.
You see a little bit of fuzz out there. That fuzz is the latency distribution of the local relative
to the solid remote. That's why I have the second graph over here, to actually difference them.
And at the median of that latency distribution, we're only introducing 28 microseconds of
additional latency.
Taking that to a remote system over dual 100 gig ports, which we think is a pretty spectacular result
to be starting from.
With that, that really concludes what we have to talk about today.
I just wanted to lead off with a quick little picture of what our lab looks like at the moment, which is both an interesting historical perspective
all in one frame of gig to 10 gig. This is actually one of those. If you haven't seen
them before, this is what one of those octopus cables looks like breaking out a 40 gig port
into four 10 gig links. The rest of the 40 gig, and then 100 gig.
Which is actually an interesting note in and of itself
we should make there, that's copper.
A couple years ago, if you were talking to folks,
we were being told left, right, up and down, front and back,
that 100 gig was gonna require optical connections.
That's not the case.
This is two to three meter copper.
And you can see that we have a lot more ports wired up than I talked about here today,
so watch this space.
And we hope to show some more interesting results in the near future.
I guess that's basically it. Any questions? My part, Greg's part?
What's the I.O. size you used for the benchmark?
I was using all 512K I/O here, all large I/O. Basically, the focus for this sprint,
we obviously, you know, had some interop work to get through, and all these results
were assembled in basically the last week and a half.
We're going to go the rest of the way,
you know, small I/O, showing you small I/O latencies in future presentations.
So we're basically focused on filling the pipes and, you know, filling the capabilities
of whatever part we had on the other side.
So filling the NVMes and filling the dual 100.
Why did you choose those copper cables? Are they from Mellanox?
That's what Mellanox gave us. It worked.
Just for smaller I/O, it would be interesting to see what the latency difference would be.
Yeah, you might actually start to appreciate some small difference there. And another thing that hopefully you're able to see on the slide is that these are
thick.
I mean, just in terms of cable management, yeah, you could find yourself thinking that's
pretty reasonable for 100 gig because it's noticeable how much fatter the cable is.
In back, yeah.
So the question is whether it's a change from 3.1 that you don't do pre-auth integrity hashing for re-auth session setups.
No, it's not a change.
The original Windows 10 preview release didn't do that.
It could be a doc bug, but it's one that's since been addressed, because it's been superseded
by the final RTM documentation.
For a master session setup or a binding session setup, you must compute the pre-auth integrity
hash, because those session setups result in secret keys being generated.
But a re-auth is just re-auth.
There's no key that's generated as a result.
Okay.
Yep.
And I'm wondering about the encryption.
So today on Windows 10 and Windows Server 2016,
would we still be using the flags
that are a part of the tree connect response
to figure out if we want to encrypt the session?
So the question is whether we're still using flags in the tree connect response to determine
whether we're encrypting.
Yes.
Yeah.
Nothing about how the server tells the client that he requires encryption has changed.
So there was this idea in the preview release that it would be nice to allow a feature because
in the original implementation of SMB encryption, the choice to encrypt was entirely the server's, right?
It was the server that told the client, like, I require encryption or I don't require encryption.
But it seems like it would be nice to also allow a paranoid client to say, like, well,
I don't care what you do or don't require. I require encryption, right? And so the preview release was attempting to address that deficiency, right?
What I realized after thinking about it some more was that we didn't need any protocol changes to allow that, right?
If a client wants to encrypt and wants to make sure that he's only using encrypted packets,
he can do that by simply mandating signing and then only emitting encrypted packets after session setup completes.
Yep.
It only looks like it's open, but what if the older client's application continues to
send writes through the old handle?
So, yeah, so the question is what if the old instance of the application continues to try
to do I.O.?
So the issue is that he won't be able to because when the new instance spins up on the other
node, he's going to do an invalidating open.
He's going to tell the server, hey, I'm the same guy.
I'm on a different node.
I want my handles back.
And we'll kill those old handles so that they're no longer
valid.
Yeah.
Yep.
We're out of time.
Any last minute questions?
Otherwise, you can find us around at the Plugfest
or just around at the conference and ask away.
All right, thanks.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the developer community.
For additional information about the Storage Developer Conference, visit storagedeveloper.org.