Storage Developer Conference - #12: Azure File Service: ‘net use’ the cloud
Episode Date: July 29, 2016...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 12.
Today we hear from David Goebel, software engineer with Microsoft,
as he presents Azure File Service from the 2015 Storage Developer Conference.
My name is David Goebel. I'm in the Windows Azure group at Microsoft.
And thank you all very much for coming. This was a really competitive slot.
I was really considering going to one of the other talks myself, but I obviously have to be here
for this one. This is an SMB server
that we've done in the cloud. It's accessible both
on-prem,
if you have encryption enabled with SMB 3,
or within the Azure data center
if you're limited to SMB 2.1.
I'm going to go over the features and the API surface of Azure Files.
Again, it's an SMB server,
but because we leverage the REST primitives
on the back end inside of Azure,
we have coherent access via REST in the same namespace.
So it allows some really interesting combinations
and development schemes for moving applications to the cloud.
I'll cover some of why we enabled it,
why we went and created this SMB server,
And then also the design of it,
which is interesting because most people
create SMB servers on top of regular file systems.
This is created on top of basically a NoSQL table
and then a blob store.
The most important thing to keep in mind,
and one of the biggest confusions
when I try to describe Azure Files to people,
is that it's not the SMB2
srv2.sys driver running on an Azure node at all.
It's a completely new implementation in user mode,
and it uses the table server on the back end
for storing file system metadata, and blobs for the actual payload of the files.
Because the table server is already a REST API that's been used in Azure for many years, it's robust and distributed and a good platform to build this metadata structure on.
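As a rough illustration of that split — this is not Azure's actual code or schema, and every name in it is invented — you can picture metadata rows in a NoSQL-style table and payload in a separate blob store like this:

```python
# Conceptual sketch only: metadata in a table keyed by path, payload in blobs.
metadata_table = {}   # (share, path) -> metadata row
blob_store = {}       # blob id -> file payload bytes

def create_file(share, path):
    blob_id = f"{share}/{path}"
    metadata_table[(share, path)] = {
        "is_directory": False,   # immutable: safe to cache on a front end
        "eof": 0,                # logical file size
        "blob": blob_id,
    }
    blob_store[blob_id] = bytearray()

def write_file(share, path, offset, data):
    row = metadata_table[(share, path)]
    payload = blob_store[row["blob"]]
    if len(payload) < offset + len(data):
        payload.extend(b"\x00" * (offset + len(data) - len(payload)))
    payload[offset:offset + len(data)] = data
    row["eof"] = max(row["eof"], offset + len(data))

create_file("demoshare", "hello.txt")
write_file("demoshare", "hello.txt", 0, b"payload in the blob, metadata in the table")
```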
As I get into the talk,
you'll see that a lot of the things that were really difficult
for the server group to do,
going for continuous availability and persistent handles,
actually were relatively easy for us
because we had this primitive of a durable
and distributed NoSQL table.
So it made things relatively easier
than some of the torturous stuff that those guys had to do
to get srv2.sys running for
SMB 3.0.
The current status:
it's been in preview since last summer,
which is just SMB 2.1.
SMB 3.0,
with encryption and persistent handles,
which are the two big new features,
is in progress.
And I can't say exactly when it's going to ship, unfortunately.
The way that it's architected is that
SMB shares are Azure storage containers,
and a container is another concept that goes back a ways in Azure.
You have Azure accounts,
and then accounts can have multiple containers,
and that's the way that space is partitioned.
SMB clients generally should work completely unmodified
because we basically followed MS-SMB2, the spec,
and just implemented it as it was written down in most cases.
In some cases, when you go into the spec,
you basically see what srv2.sys is doing,
and sometimes, if you want the maximum compatibility,
you have to go and chase down
the subtle undocumented side effects —
the behavior notes at the end of the spec, basically,
all of the behavioral things —
and implement all of those.
It's built on top of, again,
Azure tables and blobs. And because of that, our file system namespace per
share is completely reflected in the REST namespace. And so you're able to
access the files using REST APIs at the same time that you're accessing them via SMB.
And in fact, if a client has an open —
if a client has a read lease
and they're reading files —
and another client attempts to do a REST put operation,
it actually breaks that lease.
And the put blocks until the lease is broken.
And so the client can then go and reread the data.
So the REST APIs really are
worked through in complete
compatibility with the SMB APIs —
if you think of REST as being
kind of FTP, basically.
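As a sketch of what that shared namespace looks like from a client's point of view — the account, share, file name, and SAS token below are invented for illustration, and it assumes the share is reachable and authorized — the same file is addressable both ways:

```python
# Illustrative only: one file, two protocols, same namespace.
import requests

UNC_PATH = r"\\myaccount.file.core.windows.net\demoshare\report.txt"       # SMB path
REST_URL = "https://myaccount.file.core.windows.net/demoshare/report.txt"  # REST URL
SAS = "?sv=...&sig=..."   # placeholder shared-access signature for the REST call

# Read over SMB: on a Windows box where the share is mounted/authenticated
# (e.g. via net use), this is just a normal file open.
with open(UNC_PATH, "rb") as f:
    via_smb = f.read()

# Read the same bytes over REST with a plain HTTP GET.
via_rest = requests.get(REST_URL + SAS).content

assert via_smb == via_rest   # coherent access to the same object
```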
A little history here, if you
look at the evolution of SMB,
SMB 1.x goes way, way, way back.
I mean, IBM and then DOS and LAN Manager,
and it picked up all this stuff along the way.
It would have been very difficult
trying to do this starting with SMB 1.x,
because there's just so much accumulated baggage.
It was a multi-decade effort to basically get it done.
Whereas SMB2 was starting from a clean slate.
And if any of you are familiar with the NT APIs
and you look at the commands in SMB,
it's basically a one-to-one mapping.
I mean, they basically took,
how do I proxy the NT API set over the wire?
Because at the end on the server,
it's going to take those commands
and it's going to send them right down to NTFS
or FastFAT or XFAT or whatever.
And so it was optimized for that.
And then compound commands are a cool thing.
If there are certain commands you know, for instance,
that you execute always in sequence,
it can go and compound those to save network chatter.
Not just on reconnects,
but even things like directory enumeration.
The way directory enumeration works, you always have to
issue one more query to get the "no more files" status. And so you can actually
reduce a little bit of chatter that way.
And the way the spec is written, there's really no
limit on compound commands. So you could actually get
some really interesting and more creative attempts
to decrease chatter.
That's going off on a tangent.
But
because of that, we
had an interesting challenge,
in that we don't have a file system below us.
We have this NoSQL table and blobs.
So in some ways it was harder,
because we can't just take this packet off the wire,
marshal out the parameters from the various fields,
and call NtCreateFile.
We can't do that.
We have multiple tables,
and we have tables that are coherent with other tables.
We have guaranteed coherency within a partition,
which is an architectural detail in Azure.
But what it allows us to do is have transactions
across multiple tables.
So we have a table for leases, a table for file names,
and a table for byte range locks.
And we're able to go and start transactions on those.
And basically, if anything goes wrong,
we simply roll back the transaction.
So it makes it really clean to basically do things
that are a nightmare if you're trying to graft on
these semantics after the fact.
Whereas we simply start a transaction,
and then we just go change all these tables.
And if anything goes wrong, we just abort the transaction.
And it's really nice and clean.
And because it's durable and distributed,
if anything goes wrong but we committed it,
it'll still be there.
So when we reconnect, we'll find it.
If something went wrong and we didn't commit it, it won't
be there. So it's really nice in that
way.
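A toy version of that transactional pattern — assuming nothing about Azure's real table API — might look like this: snapshot the co-located tables, apply every change, and restore the snapshots if anything throws.

```python
# Conceptual sketch: rollback across several co-located tables.
import copy

class TableTransaction:
    def __init__(self, *tables):
        self.tables = tables
        self.snapshots = None
    def __enter__(self):
        self.snapshots = [copy.deepcopy(t) for t in self.tables]
        return self
    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:          # anything went wrong: roll back
            for table, snap in zip(self.tables, self.snapshots):
                table.clear()
                table.update(snap)
        return False                      # re-raise so the caller sees the failure

namespace_table, handle_table, lease_table, lock_table = {}, {}, {}, {}

def open_file(path, handle_id):
    with TableTransaction(namespace_table, handle_table, lease_table, lock_table):
        namespace_table.setdefault(path, {"open_count": 0})
        namespace_table[path]["open_count"] += 1
        handle_table[handle_id] = {"path": path}
        # ... lease and byte-range-lock rows would be updated here too ...
        # If any step raises, every table snaps back to its prior state.
```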
One of the optimizations
we wanted to do comes from
looking at SMB, a very stateful protocol, next to REST, which is completely
stateless. And all of Azure was designed
originally for REST,
probably because it's easier.
When you don't have state,
a lot of problems you'd otherwise worry about kind of melt away.
Whereas SMB is very
stateful, basically because that's
the way file system APIs have worked for the last
40 years: open, read, write, close. And that has a lot of
powerful features as well. So it's not like
it's a bad thing. There are different problem
spaces that work better with different approaches. But the state can be a challenge
when you're working with a system that was designed without any state in mind. So we
wanted to try and segregate state into the state that can change and
the state that's immutable, so that we only have to pay the painful price
of durably committing state, which really
has to survive some sort of failover.
This is a busy slide. I'm going to go jump to the next
one, which is a diagram, and then I'm going to go
back one. So memorize this one, and I'll keep going back
and forth. If I had dual monitors, it would work better.
When a
request comes in,
this is the
scheme of our namespace.
We have an Azure account name, and then file.core.windows.net is a standard suffix there,
and then your share name.
That is a constant DNS name, and we have a dynamic load balancer.
It's a software load balancer that comes in and basically selects one of the front ends it's going to send you to.
But even across failovers, when you crash or anything bad happens,
you never go to a different IP address.
It's all the same to you.
It's all being virtualized by this load balancer inside of Azure.
And it picks a front end for you.
And so this is where our durable state is located.
And the idea is that ephemeral and immutable state we can actually
go and cache in the front end if we want to. Things like, is it a file or a directory? That's
not going to change. It's either going to be one or the other forever. The file ID,
which is the NTFS file ID you get when you query the FileInternalInformation information class —
those we can actually maintain in the front end. We can't keep things
like byte range locks, because those can change. Other nodes can have an
influence on that. So the
arbiter of final truth is actually the durable
storage in the cloud — on the back end,
held by the table server.
Okay, so going
back, remembering that previous one,
we try
to maintain in the front end the things
that will win us some performance.
Things that are purely transient,
like our socket state
on the TCP connection,
don't involve the back end at all,
so those are purely on the front end.
Other things we can cache too,
like the volatile ID —
if you're familiar with an SMB2 file handle,
which is actually a handle ID, not a file ID,
but there's two parts to it.
There's a persistent and volatile.
And the idea is that the persistent part is what you need
in order to fail over to a different front end
when you reestablish a connection.
The volatile part is really some internal information
between the current client and the current node they're connected to. That doesn't need to be persisted either.
So we can maintain that stuff in the front end
and not pay the cost of having to go and durably commit that.
But anything that can change based on other clients
coming in, some other client comes in here and connects to a different front end
and, say, opens a file in an incompatible way
via leases.
Well, that information can't be on the front end.
It's got to all be stored on the back end.
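For reference, that two-part handle is the 16-byte SMB2 FileId, which MS-SMB2 splits into an 8-byte persistent part and an 8-byte volatile part; here's a minimal sketch of pulling it apart (the wire value is made up for illustration):

```python
# Split an SMB2_FILEID into its persistent and volatile halves (little-endian).
import struct

def split_file_id(file_id_16_bytes: bytes):
    persistent, volatile = struct.unpack("<QQ", file_id_16_bytes)
    return persistent, volatile

raw = struct.pack("<QQ", 0x1122334455667788, 0xAABBCCDDEEFF0011)  # example value
persistent, volatile = split_file_id(raw)
print(hex(persistent), hex(volatile))
```

Only the persistent half has to survive a reconnect to a different front end; the volatile half is the kind of state that can safely live only on the current front end.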
What this means is that
if you're considering our performance
versus an on-premises SMB server:
if you're doing an open,
talking to srv2.sys in kernel mode,
and it's all cached,
hits in paged pool and everything,
and it's opening an already-opened file,
the cost is basically some DPCs to read off the wire,
incrementing a memory location, and then you're out.
Consider what we have to do here.
We have to durably commit three copies
on different upgrade domains and different fault domains
so that no matter what happens,
apart from a natural disaster at the data center,
it'll be committed.
And that's a very high bar.
So metadata-heavy operations,
like if you're doing a bunch of opens and closes of files,
you're going to be dramatically slower
compared to an on-prem srv2.sys.
However, if you're doing large reads and large writes —
we haven't implemented RDMA yet;
there are a lot of interesting problems there,
because it wasn't envisioned at all
when Azure was being put together,
so hopefully we'll get there —
we'll be able to get, hopefully,
our large read and write performance
getting closer to what an on-prem server would do
if you're on a VM within the data center.
Now, if you're coming over the internet,
I mean, it's the internet, right?
You're never going to be as fast as RDMA on-prem.
But that's kind of the trade-off, if you think about it.
Yes, go ahead.
So, on that trade-off, I was wondering what happens
if the client sets an unruly, arbitrary,
insanely large allocation size?
Azure storage is all sparse by default.
So, yeah, we remember it — the value they set.
When they query the allocation size,
we return that, but no, we don't actually allocate anything until a write.
Yeah, that wasn't intentional. It's just the way Azure was designed.
We use what are called page blobs
for our blobs, and just the way
they're designed, they're sparsely allocated.
However, having said that, you can have as much
as you want to pay for.
You know, there's actually an interesting
thing about that. Our billing is based
on, I believe, content length. I don't know if we're
actually still dithering on how we
do it, but you can actually get into a pretty bad situation if you
set the file size really high, because there are
some billing implications of that.
So, yeah, caveat
emptor on that one, right?
You should try it, let us know.
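A quick way to see that sparse behavior from a client — assuming Z: is an Azure file share mounted with net use, and keeping the billing caveat above in mind — is to set a large end of file without actually writing the data:

```python
# Declare a large file size without writing the bytes; on a sparse back end,
# nothing is materialized until real writes land. (Check billing rules before
# trying this with a genuinely huge size.)
with open(r"Z:\sparse-test.dat", "wb") as f:
    f.truncate(10 * 1024 ** 3)   # 10 GiB logical size, no data written
    f.seek(0)
    f.write(b"only these bytes are actually written")
```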
Okay, so this is an example showing
how we manage when we have
multiple clients. These are two clients,
and I've abstracted out
whether they're coming from internal or external — they could be either;
it works the same either way. In this slide
they're both internal,
and they're both accessing
the same share, they're both
accessing maybe the same file, reading and writing it,
you know, waiting on byte range locks,
breaking leases — all the sort of
fun stuff that happens when multiple people read and write
the same file,
and that's all fine
because again, all that state is all handled back here.
And effectively, the stuff that's usually
in srv2.sys's memory, a lot of it's in there.
And so if something happens here
where we basically go and...
Either we have some network glitch,
somebody trips over the network cable,
or the front end goes down.
The actual NT node doesn't usually crash very often.
What happens is we actually intentionally
take it down for software upgrades.
We're doing upgrades all the time.
So we're continuously going and killing services
so that we can go and update the software.
And that's fine.
It's not an event that is considered an exception.
It's considered part of normal.
It just happens all the time.
And so in that case, the client reconnects.
Again, the redirector — the Windows client code
or the Samba client code — will reconnect,
again, using the same DNS name.
I didn't draw it here for the sake of space,
but remember we have our load balancer
that's going to come in,
and the load balancer knows that one's down,
and it will go and send the connection to a new
front end,
and you just basically pick up where you left off.
Yes?
Sorry, I'm still thinking the way to the back here.
So you open the top-level share,
a handle on the top-level share,
and you register a change notify on the entire tree.
You do that from a thousand or five thousand clients.
How much does that overload your backend?
Well, the watch tree is a particularly nasty thing
to pass through.
But it basically works because of the way
the paths work. XTable isn't actually hierarchical.
It's actually a flat namespace.
So it's relatively easy in XTable,
when you don't have a hierarchical structure,
to ask: between this key and that key, were there any changes?
Because that's what a watch tree is,
if you think about a non-hierarchical flat namespace.
Yes, we call it a backslash —
actually, it gets turned into a forward slash —
we call that a directory delimiter, but really that's just
a figment. We basically just
have a key range, and
we can specify minimum and maximum, and any change
that happens in that range, boom, that triggers a
tree watch.
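A toy model of that key-range trick — with no claim about Azure's real key format — shows how a "watch tree" on a flat namespace reduces to scanning between a minimum and a maximum key:

```python
# Flat name table: the "directory delimiter" is just part of the key.
names_table = {
    "share1/dirA/file1.txt": 1,
    "share1/dirA/sub/file2.txt": 2,
    "share1/dirB/file3.txt": 3,
}

def keys_in_subtree(prefix):
    lo, hi = prefix, prefix + "\uffff"          # key range covering the subtree
    return sorted(k for k in names_table if lo <= k <= hi)

before = keys_in_subtree("share1/dirA/")
names_table["share1/dirA/new.txt"] = 4          # a change lands inside the range
after = keys_in_subtree("share1/dirA/")
print(set(after) - set(before))                 # this is what would fire the watch
```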
Good
try on that one, though.
And this is the current state.
I don't know if I already did this or not,
but SMB 2.1 is released, and SMB 3.0 is, again, in the works.
That's unfortunately all I can say right now.
These are limits per share
and a limit per file
that have some relation to certain decisions
that were made in underlying Azure architecture.
The specific one per share is one that we're working to rectify.
That has to do with, for simplicity in the first implementation,
shares are limited to a single partition.
And a partition is a range of the table space that a given backend node has mounted at a given time.
And so it was simpler if that was always mounted by one backend partition for obvious reasons.
But that being said, it's not that hard either
to split up into multiple of them,
and we're working on that.
Yes?
Is that IOPS limit reads and writes combined, or does it include opens?
What's that?
Is the IOPS limit just reads and writes?
It's reads and writes, in combination or alone.
So would opens count against...
No, no, I'm sorry.
Read or write, yes.
But opens, no.
You're not going to get 1,000 opens per second.
Yeah, that's what I was trying to tell you.
No.
Now, again, I had that slide about how we go
and segregate state into mutable and immutable forms.
You can imagine, if you were willing to accept some loss
in the case of a total
catastrophic failure — Azure has a
geo-replication feature so that if we lose
an entire data center, we've asynchronously
replicated all the data to another data center on the other side of the
country. So we can actually take a
nuke on one data center and you'll only lose a few
minutes. And if we also say, well, you
lose your open handles too, then we
can go into something like a memcached-type
architecture where, as long as our data center has power,
we won't lose your handles, and then we don't have to
do this thing of actually replicating durably three times.
We just store it in memory on enough different nodes
that they won't all lose power.
We have multiple independent
power supplies coming into our data centers.
We have all these generators, and so we've been
guaranteed that we'll never lose power —
and it only happens about once a year.
It's amazing, the things that happen.
Brownouts, all of these.
We had one guy actually press the big red button.
Didn't know what it was for.
Does he still work there?
I don't know.
He was a contractor.
Okay.
So this is a demo.
And some of you might be old enough to remember something
from like the early 80s.
And, you know, I'm going to have to go and switch over here
so I can see what I'm doing.
There was a commercial that was kind of interesting.
Okay, what does it have to do with anything?
Well, what it has to do with is that right now... Where did my mouse go?
Oh, now my glass is on.
There.
So right now you'll notice this path up here
is whack-whack, plugfest, demo, share, about.
So basically I'm running my talk from an Azure Share.
So the whole time that you've been doing this,
we've actually been running it from an Azure share.
And, yeah,
right now,
this is — I'm running 1.7 —
and it has to be 2.1,
because I have Wireshark running, too.
Otherwise it would be encrypted,
and it's just not that interesting to look at encrypted packets.
There's just nothing there to see.
Generally
we don't allow unencrypted connections
from outside the data center.
This is a special test cluster, which is why it has a slightly different naming scheme,
and for this week only
we've gone and taken and removed that limitation
so anyone can access this from
outside the data center
and let's do some
other kind of fun stuff here.
Let's get some...
So you can see that Change Notify
is working on filling everything in.
And these are all the requests coming in to actually do all the copying.
Unfortunately, I've blocked it —
you can't see the window with the change notifies now.
Let's see.
Let's go back to there.
Yeah, it's basically full.
Well, the screen's full anyway.
But now we can go and do the opposite.
And get rid of them all.
And I think there could be one read-only handle,
read-only file in here, and that
freaked me out. I thought, oh my god, is it a bug?
But I don't know if it's read-only or not.
Because you can't delete a read-only file. You have to change or remove the read-only
attribute or else it won't go away.
So that's it.
So they're all gone. So that was change notify running
in Explorer as the files got deleted.
The idea is it's supposed to just work. It's supposed to be completely
transparent. So let's go on. Yeah, yeah. Because again, all the stuff that srv2.sys generally
keeps in memory, we keep in tables that are all durably committed and are completely
distributed. So all the front ends basically have transactional semantics on all those tables.
And that's how... So we have our main file tables,
and then we have tables for byte range locks and for leases,
and those all match up with the handle tables and the namespace tables,
and we're able to go and create single transactions across all of them.
So yeah, we're able to...
It makes it...
When I was reading about how difficult
it was for the srv2.sys guys —
because they were trying to go and build this on top of an existing system
which didn't anticipate it at all,
and it was very painful —
for me it was like, well, I know what the
end goal is, so I don't have to go through all this pain,
because I know what I'm supposed to have at the end.
And so in a way, it was actually much easier for us.
That being
said, of course, if you're on-prem trying to open
and close files,
it can be orders of magnitude faster.
This is current Linux support that we have.
Generally, Linux supports 2.1.
They negotiate 3.0, but then decline all the optional features.
So what that means is currently you can't use Linux outside the data center until they get encryption done.
And for the really, really perfect transparent failover, they also need to implement persistent handles.
And so if Steve French is somewhere around, yeah, bug him.
And tell him to implement it.
Yeah, that's what he's saying right now as we speak.
Okay. So, yes.
So you're opening port 445. Yes.
Are we going to have problems
with ISPs that are basically
blocking the port?
Yeah, I have it in a slide.
Coming up.
Okay, so this is the one
marketing slide.
Why did we do this?
Well, there's a lot of apps out there
that were written 20 years ago, or they
lost the source code to, or whoever wrote it died or something. But they're mission
critical vertical apps — it's their payroll system — and they just can't rewrite it. Maybe
eventually they could rewrite it all to REST semantics, but for now they basically just need it to run.
They want to move to the cloud for cost reasons, but they can't rewrite their apps. They just
have to run it just the bits as they are.
Because of that, this capability allows them to go
and run all
their applications in an
Azure VM if they want the best performance
talking to an Azure
storage account. Or, if they want
to be a little more timid and go in
more baby steps, they can
actually take their
application servers and
just point the share instead to an Azure share
in the cloud and make sure that's all working
correctly, albeit a lot slower than it
used to be when it was on-prem, to make sure that
there's no bugs there before then trying and going
and moving some of those actual application
servers into cloud servers as well.
So it allows a smooth progression.
Yeah, I saw it coming up, yeah.
Tunneling DCE RPC —
do we talk to things like pipes?
We don't support named pipes.
And we...
What about SID resolution,
looking up the name of the owners, et cetera?
Well, when Azure came out,
Azure never integrated with Active Directory,
and it always had a scheme of a storage account and a storage key,
which is basically a super-user key.
I mean, you have one account, and then you have a super-user key,
and that's kind of it.
Everything's owned by that.
And obviously
that's become more granular over time.
There are features that we don't implement yet.
And at the last slide of the talk,
there's actually a link where it shows all the NTFS features
that we don't implement yet.
And a major one is completely integrating with Active Directory,
because then, yes, we can do all the ACLs,
we can do all the owners, we can do all that stuff.
But we still won't be able to print.
We will never
implement printing over SMB.
So sorry about that. Yes?
Those things are on the top.
The actual policy, especially
multi-protocol, basically just one owner
and everything has operations.
Yes. Yeah. I know.
There's...
You could fake all of that
and present it as the security. Well —
if you're just worried about an app compat thing, fine,
but if you're actually worried about security,
anyone who has that key can access all the files.
Yeah.
Oh, well, yes — right now we return
the same one that FastFat returns,
which is the global Everyone SID,
S-1-1-0 or something like that.
It's the same code, basically,
the code that FastFat uses.
When it gets the query for the security descriptor,
we basically return that.
Because it's in the same boat.
This is an interesting...
When I talked about a feature that encryption enables...
Sorry about this —
I did this slide at 3 o'clock in the morning, so I apologize a little for the little things
that aren't quite right; I realized I don't have enough room to put that one over there.
But what this is showing is that with encryption, you have two different clients. One is like
an Azure VM, so it's actually in the data center. Another one is, you know, some person
at their home or office or whatever. And they're both, again, connected to the same share,
reading and writing the same files.
All the file locks work.
All the leases work.
Even though their TCP connection is to two different nodes.
And again, because anything that actually involves multiple nodes
all goes down to the back end.
But this is a really cool thing that you can do now.
We have this term, common internet file system,
but really, how much of it runs over the internet?
You know, like, none, right?
I mean,
SMB over the internet
really never happened.
But this machine that I'm running on,
this isn't a VM.
This is bare metal.
Well, bare plastic.
And I'm just literally connected over the internet.
The reason this cable's here
is because you might have noticed our network,
our wireless network keeps on dropping out. It's really
crappy. All these retransmissions
and it was killing my demo.
I made sure that they actually had a hardwired
one for me because I really wanted the demo
to go smoothly.
In the previous talk today, I was like,
oh no, things were stalling and dying.
It would have been a good demo in a way because it would have
showed things picking up and running again,
but no, it wouldn't have been as good.
And, you know, especially at the start of Azure,
and with a lot of cloud offerings,
we were really stressing REST,
because there's a lot of benefits to statelessness.
There really are.
And especially if you're starting from scratch,
there are certain workloads
where it makes a lot of sense.
Again, because we implement this on top
of Xtable
and page blobs,
you have this coherent access to the namespace.
You can basically
move an application that's a legacy
application, which is business critical,
and move it into the cloud so you're still continuously
running, and then you can slowly, gradually, over time,
just replace modules.
If it makes sense, if that workload makes sense for REST,
and then transition, so that eventually,
you've basically gotten to where you want to be
with REST implementation without any interruption at all,
even if it takes years.
So this is pretty cool.
And again, think about how we did it:
if a REST put comes in, we basically emulate an open for write access
and then a write.
That's how we got all the oplocks
and leases and the byte range locks
and everything to work correctly —
we basically emulate the SMB operations
that would happen as part of those
gets and puts and lists and things.
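A conceptual sketch of that emulation — not Azure code; the lease and lock objects here are stand-ins — is roughly: break conflicting leases, honor byte range locks, and only then perform the write, just as the equivalent SMB operations would:

```python
class ConflictError(Exception):
    pass

class Lease:
    def break_and_wait(self):
        # Stand-in: the real service notifies the SMB client and blocks until
        # the lease break is acknowledged.
        pass

class Lock:
    def __init__(self, start, length, exclusive):
        self.start, self.length, self.exclusive = start, length, exclusive
    def overlaps(self, offset, length):
        return self.start < offset + length and offset < self.start + self.length

def rest_put(path, offset, data, leases, locks, do_write):
    """Process a REST put as if it were an SMB open-for-write plus a write."""
    for lease in leases.get(path, []):      # 1. break conflicting leases first
        lease.break_and_wait()              #    (the put blocks until acknowledged)
    for lock in locks.get(path, []):        # 2. honor SMB byte range locks
        if lock.exclusive and lock.overlaps(offset, len(data)):
            raise ConflictError("range is locked by an SMB client")
    do_write(path, offset, data)            # 3. only then write the payload
```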
This goes a little bit into
how we wanted to optimize this.
Again, this is taking the idea of deciding what state has to live where,
what we can potentially get away with losing,
and what, for correctness,
we really have to durably commit.
A lot of stuff is really only in memory.
On srv2.sys, it's only in memory,
and only if you specify write-through is it really actually committed to the disk.
Otherwise — I mean, try this sometime for fun:
go and do a huge xcopy to a server
and pull the plug out on the server
and see whether or not what you saw on the client
matches up with what's on the server when you reboot.
It won't.
But for the true active-active,
which you saw in the other slides,
where we have multiple clients coming in
and reading and writing the same data,
connected to multiple front ends,
with completely transparent failover between them,
we really needed to do this...
Yes.
I was getting ahead of myself.
So this is just setting the stage of why we had to do this
fully continuously available
active-active for the failover.
And this is just a flashback
to the other slide
to remind you again
of how the state
was kind of divided up.
And on this one,
again, this is another one where you'll want
to memorize the slide.
And this is an example
of state tiering.
So we have this idea
of ephemeral state. These are things that really only have to live on the front end. We have the volatile
ID, and credits, which are a type of throttling mechanism built into SMB. We have other throttling
on the server — on the back end, we have other throttling over, like, the number
of requests per share — but the actual credit-based throttling we leave completely on the
front end. It's all reset if a session
reestablishes. TCP socket
details. Immutable
state is stuff that really, really, really is never going to change.
Then I kind of came to this.
Then we have two different types of durable state,
and this is a little bit fungible, but I was trying to think
of what's the best way to describe it. You have the
solid stuff, which is
stuff that's basically created by the server itself.
And it can change,
but not really based on you calling a create file or a write file.
It's things that are like losing a connection.
So like session ID is one.
If one of our nodes goes down or you lose a connection,
we'll reestablish the session.
You get a new session ID.
But that's not really the user directly going
and calling an API to change state.
Whereas this fluid durable state, open counts, file names,
file sizes, they're durable in the sense that we still
have to commit them. When we
ack that write or that create, it needs
to be durably committed in three different locations
so that only a nuclear weapon is going to destroy it.
That's kind of our goal, our
metric there.
And it's by far the largest group of states.
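As a sketch of that tiering — the class names here are mine, not Azure's — the buckets look roughly like this, and only the durable tiers have to be committed to the back end before a request is acknowledged:

```python
from dataclasses import dataclass

@dataclass
class EphemeralState:          # lost on reconnect, rebuilt for free
    volatile_handle_id: int
    smb_credits: int
    tcp_socket: object = None

@dataclass
class ImmutableState:          # never changes, safe to cache on any front end
    is_directory: bool
    file_id: int

@dataclass
class SolidDurableState:       # changed by the server itself (e.g. reconnects)
    session_id: int

@dataclass
class FluidDurableState:       # changed by client operations; must be durably
    open_count: int            # committed (three copies) before acknowledging
    file_name: str
    file_size: int
```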
So we really do have to push
most of this stuff to the back end. Yes?
What about administratively disconnecting clients? Let's say — let's be kind and say —
someone on the West Coast opens a file with no share access, and then they go out to lunch.
Like a net file slash close on a server?
Yeah, not yet.
So yes, if somebody else has the correct account keys,
yes, they can open up a file and... and then go away and basically wedge that file.
And it's, yeah, there is right now
no analog of net file slash close.
Yeah, I mean, I could go on the
server itself,
and then I can update the tables
and change open counts and things, but,
yeah, but that doesn't scale.
I like Jeremy's suggestion of terminating the first. But that doesn't scale. That won't help.
The server will still live on.
No, that's very...
And actually, as we're developing this,
we had actual bugs
where we would actually get stuck in this state
and customers would call in and report it,
and it's like, yeah, you know,
we had a bug there or something.
Yeah, we know.
Well, but it's persistent.
That's the problem.
Even if you reboot Azure,
when it comes back up,
that open count is still goddamn there, right?
It's still stuck.
You really need to go in and fix it by
hand, effectively.
Which is what a net file slash close does, right?
I mean, it goes in and fixes it. So that would be good to have.
So in our 2.1 implementation,
you may be familiar with durable handles.
This was created initially just to handle
a network disconnect, right?
A very simple network disconnect where all the state on the
server was still fine. It just was a disconnect.
Somebody tripped over a network cable or something.
And it worked well, but we knew right away
that even in our preview,
because we're constantly rebooting the,
not rebooting, but killing our services on the nodes,
we needed to handle more than just that.
Because we have this load balancer
that basically virtualizes the actual node,
we were actually able to be kind of sneaky here and stretch durable handles,
because it wasn't visible to the client. Even if a whole node dies,
because we persisted everything we need — which normally lives in srv2.sys's memory — on the back end,
when the client reconnected, we were able to say, yeah, okay, we'll reconnect that handle.
And as far as the client thinks, it's talking to the same server, but it's not.
But it doesn't know that.
So we're able to stretch it here, but the problem
is that the spec, MS-SMB2,
basically says things like,
well, if you don't have a handle lease, you have to close
it. You can't keep it.
And it was frustrating for us. It's like, no,
we want to keep it, because we can. I mean, we have all
the state. But we really have to go and, yeah, we have
to close down that handle,
even though we really don't have to.
But for MS-SMB2 compliance, we had to go and actually do this painful self-mutilation
and close these handles that we really didn't want to.
That was with durable handles because that's all we could get away with
without violating the spec.
But with persistent handles in SMB3, well now it's really cool.
This is the promised land. Because now we specifically advertise continuous
availability — it's a share property — and that makes the client request persistent handles,
and now it's just exactly what we want. Not only do we get away from the self-limiting
rules, but there's all this state the client maintains for us,
which is really useful, like these create GUIDs.
So it allows me to detect whether or not a create came down,
and I went and wrote it out, I durably persisted it,
but on the way back to the client, my front end died.
So the client has no idea whether or not it worked,
and there are certain requests that are not idempotent,
like certain create dispositions —
they can only happen once.
I need to be able to know that, oh, I've
already seen this create, and so I can basically just
succeed it, because I know I already did,
and it's just the same client coming back in, not
knowing that. That was all thought about
when the persistent handles were designed.
It was pretty much exactly this scenario.
Basically,
hopefully, it covers all the
holes and all the gaps,
which is pretty cool because now we basically have this fully transparent failover,
which is pretty neat.
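A minimal sketch of that create-GUID replay detection, assuming nothing more than a durable table keyed by the GUID the client sends with its persistent-handle open:

```python
completed_creates = {}    # create_guid -> handle info; durably committed in reality

def handle_create(create_guid, path, disposition, do_create):
    if create_guid in completed_creates:
        # The create already succeeded; the client just never saw the response
        # because the front end died on the way back. Succeed it again instead
        # of failing a non-idempotent disposition such as "create new only".
        return completed_creates[create_guid]
    handle = do_create(path, disposition)
    completed_creates[create_guid] = handle   # commit before replying
    return handle
```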
Why do you actually limit yourself with those rules?
Because if you look at — I wish I could quote the actual lines, you know,
chapter and verse, of MS-SMB2...
Just ignore the MUST.
It's just a golden MUST, I mean.
No, you can't.
I think Dave Kruse is here; at some point you can chat with him about why it is that
if you don't have a handle lease that you can't go and allow a durable handle to survive a disconnect.
If you do, then you could just, yeah.
Yes, we could have just basically... We could have gotten persistent handles.
Well, not all of it.
We could have gotten most of it,
but this other state that the client
thankfully transmits back to us
allowed us to plug a lot of other holes.
So it still wouldn't have been all the way there.
It would have been a lot closer,
but hopefully we're rolling out SMB 3.0 very soon,
including Linux clients,
and so they'll get the same benefit.
OK, so these are three links that are useful.
This Getting Started blog, it's actually about a year old now,
but basically has all the details
you need of how to get the account, how to use
the PowerShell scripts for creating shares,
the actual programmatic ways to do them, and all that.
It's a step-by-step guide. It's pretty useful. As I said, we don't currently support all NTFS features, things like named streams and EAs. You can imagine we had a priority
list. We wanted to support the most important ones first. And it's not unlike if you look
at what ReFS did. You know, it basically picked
which NTFS features were the most important and which were ones
never to support, ever.
And so we basically did the same sort of culling.
But things that are important, we will get to.
I mean, ACLs is a huge one right now.
But we need Active Directory support for that,
and that's coming in via other initiatives in Azure
to get that full, absolutely complete synergy
between an on-premises domain
controller and the cloud. And we know that's a huge
gap right now that we're working on.
There's an interesting
caveat. Because we have this shared namespace
between REST and SMB,
REST has to come in over HTTP,
and so all these RFCs get their
mitts in there.
And when you're coming over
SMB, names are actually not UTF-16; it's actually called UCS-2.
It's an older version,
from back before there were surrogates — just literally a bunch of unsigned shorts.
And that is all that goes over SMB. Unless it's one of those special characters
that shall not be named —
you know, anything below 32, and then a handful of other special characters —
Everything's legal.
And you can actually create these really weird file names
in NTFS, and it's all completely legal.
But when you're coming over HTTP, that's limited.
Because we really wanted to maintain complete coherency,
we put some limits on what you can do,
in terms of character lengths and depths of paths.
I'm not an expert in the HTTP protocol,
but the people who are basically said,
these are the things you have to limit.
And it's not egregious.
I mean, they're much better than if you're calling the Win32 APIs.
If you're calling NT APIs, or you use the special backslash-backslash-question-mark-backslash form —
a special way you can kind of escape out of the Win32 limits —
you can get around those.
But so far, this hasn't really been a problem for us.
But just to note, this document goes into the details
of which characters are legal and which aren't:
if you're talking to an SMB server,
a real srv2.sys server,
these are its limitations, and these are our limitations.
So it's useful if you think you might have a problem.
I don't even know what, is that time?
Wow.
So, yeah, I guess I went kind of fast for that.
But does anybody have any questions?
You must have another one.
Oh, yes.
Yes.
Yes.
It's in here somewhere.
Unless I'm not using the latest version of the slides that I edited this morning.
That's weird.
So what I said was that some ISPs block 445.
Yeah,
home ones
might.
Yeah.
Well, yeah,
we found
it either works or it doesn't, usually.
I mean, either they block 445 or they don't.
No, they sort of notice it for a while.
Really?
They whack.
Okay.
Well, I was trying to be the end.
Oh, okay.
So, yeah, I thought for sure I had,
I remember when I added this thing,
I thought I also added a slide talking about that.
If you could change the
source port on the Windows
API — people have been asking for a long
time —
that would get around it.
There was some decision to use a different port.
Yeah.
Again, for businesses, it's less of an issue
because they can program their own firewalls.
Here, I was...
We didn't know if the hotel blocked 445,
because I knew that the wireless
was just not working at all for me here.
And so we got them to go and plug this in here
and we were like, we had no idea
if it was going to block 445
until after the last talk was over.
And I connected and it's like,
thank God it didn't block 445.
Yeah, yeah, yeah.
So yeah, that is a potential problem.
But again, it's a really cool feature
to be able to do this from on-prem,
but the latency implications are really high,
so we expect it to be an interim thing that people...
I mean, it's kind of cool,
but whether or not actually people use it in production,
some people might.
Yes?
How about the byte range locking?
Yeah, yeah.
Yes, so if you go in...
Well, we don't actually have an implementation...
You can't actually set byte range locks from REST,
but they'll be respected.
So if there's a byte range lock set,
an exclusive lock set on a range of a file,
and you try to do a get for that region,
you'll get back an HTTP error.
I don't remember the exact error we give,
but it's one that means conflict
or something. So yes, I mean,
we basically respect byte range locks with REST.
And then if you try to do a
write, that will fail
as well.
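From the REST side, a client sketch of that behavior — reusing the invented account, share, and SAS placeholder from earlier, and not assuming any particular status code, since the exact error isn't specified here — just needs to handle a non-success response:

```python
import requests

url = "https://myaccount.file.core.windows.net/demoshare/report.txt" + "?sv=..."  # placeholder SAS
resp = requests.get(url)
if resp.ok:
    data = resp.content
else:
    # Region locked by an SMB client (or some other failure): back off and retry later.
    print("REST read rejected:", resp.status_code, resp.text[:200])
```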
Yes?
Do you document the mapping to
the blobs and the tables?
Is that sort of a supported feature, how you're mapping the file system into blobs and tables? Well, no, because it changes all the time internally under the hood.
So, I mean, it's...
Oh, well, because it looks like a table.
So, I mean, it's, yeah.
So it is documented at that level, in the sense that you can do a list
and you get the directories,
and then you can do a get and put
and can get the files.
Can I see what's locked at the bottom?
No, those are what are called —
we call them nested tables,
which is a terrible term,
because "nested tables" sounds like
tables inside other tables.
Actually, what it means is that
we can guarantee atomicity across them,
so it should really be "co-located tables."
But no, via these APIs, you have access only to the directory structure and the actual file data.
You can't go and...
Yeah, otherwise...
Maybe you could fix it yourself if you had a stuck open count.
But that would obviously have other problems.
And can you keep versions of files
just by adding more and more blocks?
Nothing that I can talk about right now. Any other questions?
yes?
I'm not sure if you can talk about that,
but how many opens per second can you get
if you're running in a virtual machine?
It depends on what you're doing.
How many opens can you do per second?
It depends.
Are you doing it in a loop,
or do you have a lot of clients doing it all at the same time?
Those are different questions.
The average latency I've seen for opens
is like five or six milliseconds.
Five or six milliseconds.
If you're talking to an SMB2 server,
it's almost always under a millisecond.
Almost always under a millisecond.
Assuming it doesn't have to fault something in
and hit the disk.
But if stuff is in memory,
it's always under a millisecond.
We have a very busy server called Scratch2,
Scratch, or whatever.
And it's very, very busy,
and I was doing tests on that,
and I always get my opens in under a millisecond.
Whereas we're on the order of five or six for an open.
And it's, again, because we're doing a lot.
And if we can get to a state where
the stuff that doesn't have to survive a natural disaster —
open counts, basically,
the temporal state about the handle —
is kept in memory on enough nodes
that no one node going down would lose it,
then we could really boost that.
It still would never be as fast as srv2.sys,
because that's only updating memory in one location,
but we would only be having to update memory
in different locations, not physically writing.
At this layer, it's actually not writing to spinning media.
It's logged in several SSDs
on different table servers,
but still, it's slower than memory.
But I mean, for 20 seconds, it's good.
Well, it's not as good as one.
So this is a little bit of history about myself.
Before I did this, before I joined Azure to do this,
I had never written user mode code except for tests.
I mean, what else is user mode good for?
You'd write tests in user mode.
And so I had this mindset of,
well, yeah, I'll raise to DPC level —
of course nobody else can run.
I can't do that anymore.
And all these things that,
when you're writing in kernel mode —
because you have to be careful
what you touch in paged pool —
make you extremely cognizant
of how many cycles things take.
You become very cognizant of how long things run,
and you're very efficient.
You don't go and call some constructor
that goes and calls some STL routine
that does God knows what for five million cycles later
and returns to you,
and you didn't even know that you triggered that.
So it's very tight.
And so five milliseconds is forever.
I know, but having the data replicated to three places in that five milliseconds.
Yeah.
That's impressive.
Oh, well, thank you. I didn't do that part. That was all done by these primitives, again,
that I can call. So all I've got to do is just go and just update these tables and it
all just happens.
If you're not doing named pipes, you're not doing the server service. You can't do a net view.
I know.
Yeah, we would love to...
If net view were part of SMB2 as an ioctl, that would be great,
because then we could trivially implement it.
No, it's a...
Yeah, we need named pipes so we can do RPC.
Though someone earlier told me I could do RPC over TCP,
which I wasn't even aware of.
If that could possibly work, if a Windows
client would actually work that way, then
net view would work. Trying to do RPC over TCP —
I always thought it would always run over
named pipes. But yeah, we'd love
to get net view, because the first thing people do
is a net view on the server, and it doesn't work.
Okay, is that it?
Well, I don't want to keep people longer.
If you have any more questions, come and talk to me.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the developer community.
For additional information about the Storage Developer Conference, visit storagedeveloper.org.