Storage Developer Conference - #12: Azure File Service: ‘net use’ the cloud

Episode Date: July 29, 2016


Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at slash podcast. You are listening to SDC Podcast Episode 12. Today we hear from David Goebel, software engineer with Microsoft, as he presents Azure File Service from the 2015 Storage Developer Conference. My name is David Goebel. I'm in the Windows Azure group at Microsoft.
Starting point is 00:00:48 And thank you all very much for coming. This was a really competitive slot. I was really considering going to one of the other talks myself, but I obviously have to be here for this one. This is an SMB server that we've done in the cloud. It's accessible both on-prem, if you have encryption enabled with SMB3,
Starting point is 00:01:12 or within the Azure data center, if you're limited to SMB2.1. I'm going to go over the features and the API surface of Azure Files. Again, it's an SMB server, but because we leverage the REST primitives on the back end inside of Azure, we have coherent access via REST in the same namespace. So it allows some really interesting combinations
Starting point is 00:01:36 and development schemes for moving applications to the cloud. I'll cover some of why we enabled it and why we went and created this SMB server. And then also the design of it, which is interesting because most people create SMB servers on top of regular file systems. This is created on top of basically a NoSQL table and then a blob store.
Starting point is 00:02:01 The most important thing to keep in mind, and one of the biggest confusions when I try to describe Azure Files to people, is that it's not the SMB2 srv2.sys driver running on an Azure node at all. It's a completely new implementation in user mode, and it uses the back end of the table server for storing file system metadata, and blobs for the actual payload of the files.
Starting point is 00:02:31 Because the table server is already a REST API that's been used in Azure for many years, it's robust and distributed and a good platform to build this metadata structure on. As I get into the talk, you'll see that a lot of the things that were really difficult for the server group to do, going for continuous availability and persistent handles, were actually relatively easy for us because we had this primitive of a durable and distributed NoSQL table.
Starting point is 00:03:01 So it made things relatively easier than some of the torture stuff that those guys had to do to get srv2.sys running for SMB 3.0. As for the current status, it's been in preview since last summer, which is just 2.1.
Starting point is 00:03:20 SMB 3.0, with encryption and persistent handles, which are the two big new features, is in progress. And I can't say exactly when it's going to ship, unfortunately. The way that it's architected is that SMB shares are Azure storage containers, and a container is another concept that goes back with Azure.
Starting point is 00:03:42 You have Azure accounts, and then accounts can have multiple containers, and it's the way that space is partitioned. SMB clients generally should work completely unmodified because we basically followed MS-SMB2, the spec, and just implemented it as it was written down in most cases. I mean, in some cases, when you go into the spec, you basically see what srv2.sys is doing,
Starting point is 00:04:09 and then, if you want the maximum compatibility, you obviously have to go and implement some of the subtle, undocumented side effects too: the stuff that's at the end of the spec, basically, all of the behavioral things. It's built on top of, again, the
Starting point is 00:04:27 Azure tables and blobs. And because of that, our file system namespace per share is completely reflected in the REST namespace. So you're able to access the files using REST APIs at the same time that you're accessing them via SMB. In fact, if a client has a read lease and is reading files, and another client attempts to do a REST PUT operation, it actually breaks that lease.
Starting point is 00:04:58 And the PUT blocks until the lease is broken. And so the client can then go and reread the data. So the REST APIs really are worked through in complete compatibility with the SMB APIs, if you think of REST as being
Starting point is 00:05:15 kind of FTP, basically. A little history here: if you look at the evolution of SMB, SMB 1.x goes way, way, way back. I mean, IBM and then DOS and LAN Manager, and it picked up all this stuff along the way. It would have been very difficult trying to do that starting with SMB 1.x
Starting point is 00:05:39 because it's just so much stuff that didn't... It was a multi-decade effort to basically get it done. Whereas SMB2 was starting from a clean slate. And if any of you are familiar with the NT APIs and you look at the commands in SMB, it's basically a one-to-one mapping. I mean, they basically took, how do I proxy the NT API set over the wire?
Starting point is 00:06:00 Because at the end on the server, it's going to take those commands and it's going to send them right down to NTFS or FastFAT or exFAT or whatever. And so it was optimized for that. And then compound commands are a cool thing. If there are certain commands that you know, for instance, you always execute in sequence,
Starting point is 00:06:14 it can go and compound those to save network chatter. Not just on reconnects, but even things like directory enumeration. The way it works is you always have to issue one more query to get the no-more-files status. And so you can actually reduce a little bit of chatter that way. And the way the spec is written, there's really no limit on compound commands. So you could actually get
Starting point is 00:06:31 some really interesting and more creative attempts to decrease chatter. That's going off on a tangent. But so because of that, we had an interesting challenge that we don't have a file system below us.
Starting point is 00:06:47 We have this NoSQL table and blobs. So in some ways it was harder because we can't just take this packet off the wire, marshal out the parameters from the various fields, and call NtCreateFile. We can't do that. We have multiple tables, and we have tables that are coherent with other tables.
Starting point is 00:07:03 We have guaranteed coherency within a partition, which is an architectural, technical detail in Azure. But what it allows us to do is have transactions across multiple tables. So we have a table for leases and a table for file names and one for byte range locks. And we're able to go and start transactions on those. And basically, if anything goes wrong,
Starting point is 00:07:22 we simply roll back the transaction. So it makes it really clean to basically do things that are a nightmare if you're trying to graft on these semantics after the fact. Whereas we simply start a transaction, and then we just go change all these tables. And if anything goes wrong, we just abort the transaction. And it's really nice and clean.
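The all-or-nothing behavior described here can be illustrated with a minimal sketch. It is not the Azure implementation: `Partition`, `Transaction`, `open_with_lease`, and the table names are hypothetical, and the "tables" are plain in-memory dictionaries. The only point is that changes to several tables either all become visible or all disappear.

```python
import copy


class TransactionAborted(Exception):
    """Raised by the caller's logic to abandon a transaction (hypothetical)."""


class Partition:
    """Hypothetical in-memory stand-in for one partition holding several tables."""

    def __init__(self):
        self.tables = {"files": {}, "leases": {}, "byte_range_locks": {}}

    def transaction(self):
        return Transaction(self)


class Transaction:
    def __init__(self, partition):
        self.partition = partition

    def __enter__(self):
        # Work on a private copy; nothing is visible until commit.
        self.staged = copy.deepcopy(self.partition.tables)
        return self.staged

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Commit: all tables are swapped in together.
            self.partition.tables = self.staged
        # On any exception the staged copy is simply dropped (rollback).
        return False  # let the exception propagate to the caller


def open_with_lease(partition, path, client):
    """Record an open handle and a lease for it in one transaction."""
    with partition.transaction() as t:
        t["files"].setdefault(path, {"handles": []})["handles"].append(client)
        if path in t["leases"]:
            raise TransactionAborted("lease conflict on " + path)
        t["leases"][path] = client
```

If a second client calls `open_with_lease` on the same path, the lease conflict aborts the transaction, and the handle it had already appended to the staged `files` table never becomes visible.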
Starting point is 00:07:38 And because it's durable and distributed, if anything goes wrong, if we committed it, it'll still be there. So when we reconnect, we'll find it. If something went wrong and we didn't commit it, it won't be there. So it's really nice in that way. One of the optimizations
Starting point is 00:07:54 we wanted to do came from looking at state. SMB is a very stateful protocol; REST is completely stateless. And all of Azure was designed originally for REST. Probably because it's easier. When you don't have state to worry about, a lot of problems kind of melt away.
Starting point is 00:08:10 Whereas in SMB, it's very stateful. And then basically because that's the way file system APIs have worked for the last 40 years. It's been open until now. And that has a lot of powerful features as well. So it's not like it's a bad thing. There are different problem spaces that work differently for different approaches. But the state can be a challenge
Starting point is 00:08:29 when you're working with a system that was designed without any state in mind. So we want to try and segregate state into state that can change and state that is immutable, so that we only have to pay the painful price of durably committing state that really has to survive some sort of failover. This is a busy slide. I'm going to jump to the next one, which is a diagram, and then I'm going to go back one. So you have to memorize this one, and I'll go back
Starting point is 00:08:56 and keep... If I had dual monitors, it would work better. When a request comes in, this is the scheme of our namespace. We have an Azure account name, and then there is a standard suffix, and then your share name. That is a constant DNS name, and we have a dynamic load balancer.
Starting point is 00:09:17 It's a software load balancer that comes in and basically selects one of the front ends it's going to send you to. But even across failovers, when you crash or anything bad happens, you never go to a different IP address. It's all the same to you. It's all being virtualized by this load balancer inside of Azure. And it picks a front end for you. And so this is where our durable state is located. And so the idea is that ephemeral and immutable state
Starting point is 00:09:44 we can actually go and cache in the front end if we want to. Things like, is it a file or directory? That's not going to change. It's either going to be one or the other forever. Or the file ID, which is the NTFS file ID you get when you query the FileInternalInformation class. So we can actually maintain that state in the front end. We can't keep things like byte range locks there because those can change. Other nodes can have an influence on that. So the arbiter of final truth is actually the durable
Starting point is 00:10:09 storage of the cloud. It's all in the cloud, but specifically in the back end, kept by the table server. Okay, so going back, remembering that previous one, we try to maintain in the front end the things that
Starting point is 00:10:25 will win us some performance, and things that are purely transient. Things about our socket state, for instance, on the TCP connection: that doesn't involve the back end at all, so that's purely on the front end. Other things we can cache:
Starting point is 00:10:42 the volatile ID, if you're familiar with an SMB2 file handle, which is actually a handle ID, not a file ID. There are two parts to it: a persistent part and a volatile part. The idea is that the persistent part is what you need in order to fail over to different front ends or to reestablish a connection.
Starting point is 00:11:00 The volatile part is really some internal information between the current client and the current node they're connected to. That doesn't need to be persisted either. So we can maintain that stuff in the front end and not pay the cost of having to go and durably commit that. But anything that can change based on other clients coming in, some other client comes in here and connects to a different front end and, say, opens a file in an incompatible way via leases.
Starting point is 00:11:27 Well, that information can't be on the front end. It's got to all be stored on the back end. What this means is that if you're considering our performance versus an on-premises SMB server, if you're doing an open, talking to srv2.sys in kernel mode, and it gets cached,
Starting point is 00:11:42 hits on all the page pool and everything, and it's opening an already-opened file, the cost is basically some DPCs to read off the wire, incrementing a memory location, and then you're out. Consider what we have to do here. We have to durably commit three copies on different upgrade domains and different fault domains so that no matter what happens,
Starting point is 00:12:01 apart from a natural disaster at the data center, it'll be committed. And that's a very high bar. So metadata-heavy operations, like if you're doing a bunch of opens and closes of files, are going to be dramatically slower compared to an on-prem srv2.sys. However, if you're doing large reads and large writes,
Starting point is 00:12:19 we haven't implemented RDMA yet. There are a lot of interesting problems there because it wasn't envisioned at all when Azure was being put together. So hopefully, we'll get there, and we'll be able to get our large read and write performance closer to what an on-prem server would do
Starting point is 00:12:36 if you're on a VM within the data center. Now, if you're coming over the internet, I mean, it's the internet, right? You're never going to be as fast as RDMA on-prem. But that's kind of the trade-off, if you think about this, because of the... Yes, go ahead. So, on the trade-off, I was wondering what happens if the client sets an unruly, arbitrary,
Starting point is 00:12:55 insanely large allocation size? Azure storage is all sparse by default. So, no, we remember the value they set, and when they query the allocation size, we return that, but we don't actually allocate anything until a write. Yeah, so that's how,
Starting point is 00:13:18 that wasn't intentional. It's just the way Azure was designed. We use what are called page blobs for our blobs, and the way they're designed, they're sparsely allocated. However, having said that, you can set as much as you want to pay for. You know, there's actually an interesting thing about that. Our billing is based
Starting point is 00:13:34 on, I believe, content length. I don't know if we're actually still dithering on how we do this, but you can actually get into a pretty bad situation if you set the file size really high, because there are some billing implications of that. So, yeah, caveat emptor on that one, right? You should try it, let us know.
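The answer above (the allocation size is remembered and returned on query, but nothing is allocated until a write) can be sketched as follows. This is only an illustration: `SparseBlob` and its methods are hypothetical names, and the 512-byte page size is an assumption of the sketch rather than something stated in the talk.

```python
PAGE_SIZE = 512  # assumed page granularity for this sketch


class SparseBlob:
    """Hypothetical model of a sparsely allocated blob."""

    def __init__(self):
        self.pages = {}            # page index -> bytearray, only for written pages
        self.allocation_size = 0   # remembered and reported, reserves nothing
        self.content_length = 0    # highest byte offset written so far

    def set_allocation_size(self, size):
        # Only the number is stored; no pages are created.
        self.allocation_size = size

    def write(self, offset, data):
        for i, byte in enumerate(data):
            pos = offset + i
            page = self.pages.setdefault(pos // PAGE_SIZE, bytearray(PAGE_SIZE))
            page[pos % PAGE_SIZE] = byte
        self.content_length = max(self.content_length, offset + len(data))

    def read(self, offset, length):
        out = bytearray()
        for pos in range(offset, offset + length):
            page = self.pages.get(pos // PAGE_SIZE)
            # Unwritten ranges read back as zeros.
            out.append(page[pos % PAGE_SIZE] if page else 0)
        return bytes(out)

    def stored_bytes(self):
        return len(self.pages) * PAGE_SIZE
```

Setting a one-terabyte allocation size on a fresh `SparseBlob` leaves `stored_bytes()` at zero; a two-byte write then materializes a single page.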
Starting point is 00:13:53 Okay, so this is an example showing how we manage when we have multiple clients on a share. This is two clients, and I've abstracted out whether they're coming from internal or external; they could be either. It just works out more simply on the slide if they're both internal. They're both accessing
Starting point is 00:14:10 the same share, they're both accessing maybe the same file, reading and writing it, you know, waiting on byte range locks, breaking leases, all the sort of fun stuff that happens when multiple people read and write the same file. And that's all fine because, again, all that state is handled back here.
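A small sketch may help make the division of state concrete. Everything here is hypothetical (`Backend`, `FrontEnd`, the method names), and the lock check is deliberately simplified to treat every byte range lock as exclusive. What it shows is the shape of the design described so far: byte range locks and persistent handle IDs live in the shared back end, volatile IDs live only in a front end, and a client that lands on a different front end can pick its handle back up by persistent ID.

```python
class Backend:
    """Hypothetical stand-in for the durable, shared table server."""

    def __init__(self):
        self.handles = {}      # persistent_id -> path
        self.byte_locks = {}   # path -> list of (start, end, persistent_id)
        self._next_id = 1

    def open(self, path):
        pid = self._next_id
        self._next_id += 1
        self.handles[pid] = path
        return pid

    def lock(self, pid, start, end):
        """Try to lock bytes [start, end); simplified to exclusive locks."""
        path = self.handles[pid]
        for s, e, owner in self.byte_locks.get(path, []):
            if owner != pid and start < e and s < end:
                return False   # overlaps a range held by another handle
        self.byte_locks.setdefault(path, []).append((start, end, pid))
        return True


class FrontEnd:
    """Hypothetical front end; its own state is lost if it goes down."""

    def __init__(self, backend):
        self.backend = backend
        self.volatile = {}     # volatile_id -> persistent_id
        self._next_volatile = 1

    def _attach(self, pid):
        vid = self._next_volatile
        self._next_volatile += 1
        self.volatile[vid] = pid
        return pid, vid

    def open(self, path):
        return self._attach(self.backend.open(path))

    def reconnect(self, pid):
        # Only handles that were durably committed can be resumed.
        if pid not in self.backend.handles:
            raise KeyError("unknown persistent handle")
        return self._attach(pid)

    def lock(self, vid, start, end):
        return self.backend.lock(self.volatile[vid], start, end)
```

Two `FrontEnd` objects built over the same `Backend` see each other's locks, and a third one created after a failure can call `reconnect` with the old persistent ID and continue where the first left off.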
Starting point is 00:14:28 And effectively, the stuff that's usually in srv2.sys memory, a lot of it's in there. And so if something happens here where we basically go and... Either we have some network glitch, somebody trips over the network cable, or the front end goes down. The actual NT node doesn't crash very often.
Starting point is 00:14:47 What happens is we actually intentionally take it down for software upgrades. We're doing upgrades all the time. So we're continuously going and killing services so that we can go and update the software. And that's fine. It's not an event that is considered an exception. It's considered part of normal operation. It just happens all the time.
Starting point is 00:15:06 And so in that case, the client reconnects. Again, this is the redirector, the Windows client code, or the Samba client code; it will reconnect, again, using the same DNS name. I didn't draw it here for the sake of space, but remember we have our load balancer that's going to come in,
Starting point is 00:15:24 and the load balancer knows that one's down, and it will go and send the connection to a new front end, and you just basically pick up where you left off. Yes? Sorry, I'm still thinking the way to the back here. So you open the top-level share, a handle on the top-level share,
Starting point is 00:15:44 and you set a change notify on it with watch tree, so the entire thing is being watched. You do that from a thousand clients or so. How much does that overload your back end? Well, watch tree is a particularly nasty thing to pass through. But it basically works because of the way that the paths...
Starting point is 00:16:04 XTable isn't actually hierarchical in general. It's actually a flat namespace. So it's relatively easy in XTable, when you don't have a hierarchical structure, to say: between this key and that key, tell me about any changes. Because that's what a watch tree is, if you think about a non-hierarchical flat directory where, yes, we have a backslash.
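The "between this key and that key" idea can be sketched with a toy flat table. `FlatTable`, `watch_tree`, and `put` are hypothetical names, and the keys in this sketch are plain strings with a forward slash as the delimiter. A recursive watch on a directory becomes a single comparison against a key range, with no tree to walk.

```python
class FlatTable:
    """Hypothetical flat key/value table with key-range change watches."""

    def __init__(self):
        self.rows = {}
        self.watches = []   # list of (low_key, high_key, callback)

    def watch_tree(self, directory, callback):
        prefix = directory.rstrip("/") + "/"
        # Every key under the directory sorts inside [prefix, prefix + U+FFFF).
        self.watches.append((prefix, prefix + "\uffff", callback))

    def put(self, key, value):
        self.rows[key] = value
        for low, high, callback in self.watches:
            if low <= key < high:
                callback(key)   # a change inside the range fires the watch
```

A watch on `docs` fires for `docs/a.txt` and for `docs/sub/b.txt`, however deep the path, but not for `docs2/c.txt`, because that key sorts outside the range.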
Starting point is 00:16:22 Actually, it gets turned into a forward slash. We call that a directory delimiter, but really that's just a figment. We basically just have a key range, and we can specify minimum and maximum, and any change that happens in that range, boom, that triggers a tree watch. Good
Starting point is 00:16:39 try on that one, though. And this is the current state. I don't know if I already did this or not, but SMB 2.1 is released and SMB 3.0, again, is in the works. That's unfortunately all I can say right now. These are limits per share and a limit per file that have some relation to certain decisions
Starting point is 00:17:03 that were made in underlying Azure architecture. The specific one per share is one that we're working to rectify. That has to do with, for simplicity in the first implementation, shares are limited to a single partition. And a partition is a range of the table space that a given backend node has mounted at a given time. And so it was simpler if that was always mounted by one backend partition for obvious reasons. But that being said, it's not that hard either to split up into multiple of them,
Starting point is 00:17:31 and we're working on that. Yes? Is the IOPS number reads and writes, or opens too? What's that? Is the IOPS number reads and writes, or just reads and writes? No, it's reads and writes in combination, or read or write. So opens and... No, no, I'm sorry.
Starting point is 00:17:46 Read or write, yes. But opens, no. You're not going to get 1,000 opens per second. Yeah, that's what I was trying to tell you. No. Now, again, I had that slide about how we go and segregate state into mutable and immutable forms. You can imagine if you were willing to accept some
Starting point is 00:18:04 in the case of a total catastrophic loss. Azure has a geo-replication feature so that if we lose an entire data center, we've asynchronously replicated all the data to another data center on the other side of the country. So we can actually take a nuke on one data center and you'll lose a few minutes. And if we also say, well, you
Starting point is 00:18:19 lose your open handles too, then we can go into something like a memcached-type architecture where, as long as our data center has power, we won't lose your handles, and then we don't have to do this thing of actually replicating durably three times. We just store it in memory on enough different nodes that they won't all lose power.
Starting point is 00:18:35 We have multiple independent power supplies coming into our data centers. We have all these generators. And so we've been guaranteed that we'll never lose power, and it only happens about once a year. It's amazing, the things that happen. Brownouts, I mean, all these. We had one guy actually press the big red button.
Starting point is 00:18:57 Didn't know what it was for. Does he still work there? I don't know. He was a contractor. Okay. So this is a demo. And some of you might be old enough to remember something from like the early 80s.
Starting point is 00:19:16 And, you know, I'm going to have to go and switch over here so I can see what I'm doing. There was a commercial that was kind of interesting. Okay, what does it have to do with anything? Well, what it has to do with is that right now... Where did my mouse go? Oh, now my glass is on. There. So right now you'll notice this path up here
Starting point is 00:19:56 is whack-whack, plugfest, demo, share, about. So basically I'm running my talk from an Azure share. The whole time, we've actually been running it from an Azure share. And, yeah, right now, this is, I'm running 1.7 so that we can,
Starting point is 00:20:17 it has to be 2.1 because I have Wireshark running too. Otherwise it would be encrypted, and it's just not that interesting to look at encrypted packets. There's just nothing there to see. Generally we don't allow unencrypted connections from outside the data center. This is a special test cluster, which is why it has a slightly different naming scheme,
Starting point is 00:20:34 and for this week only we've gone and removed that limitation so anyone can access this from outside the data center. And let's do some other kind of fun stuff here. Let's get some... So you can see that Change Notify
Starting point is 00:21:04 is working on filling everything in. And these are all the requests coming in to actually do all the copying. Unfortunately, I blocked. You can't see the window now, the change notifies. Let's see. Let's go back to there. Yeah, it's basically full. Well, the screen's full anyway.
Starting point is 00:21:29 But now we can go and do the opposite. And get rid of them all. And I think there could be one read-only handle, read-only file in here, and that freaked me out. I thought, oh my god, is it a bug? But I don't know if it's read-only or not. Because you can't delete a read-only file. You have to change or remove the read-only attribute or else it won't go away.
Starting point is 00:21:56 So that's it. So they're all gone. So that was change notify running in Explorer as the files got deleted. So the idea is it's supposed to just work. It's supposed to be completely transparent. So let's go. Yeah, yeah. Because again, all the stuff that srv2.sys generally keeps in memory, we keep in tables that are all durably committed and are completely distributed. So all the front ends basically have transactional semantics on all those tables.
Starting point is 00:22:27 And that's how... So we have our main file tables, and then we have tables for byte range locks and for leases, and those all match up with the handle tables and the namespace tables, and we're able to go and create single transactions across all of them. So yeah, we're able to... When I was reading about how difficult it was for the srv2.sys guys
Starting point is 00:22:48 because they were trying to go and build this on top of an existing system which didn't at all anticipate it, and it was very painful. And it made me think, well, I know what the end goal is, so I don't have to go through all this pain if I know what I'm supposed to have at the end. And so in a way, it was
Starting point is 00:23:04 actually much easier for us. That being said, of course, if you're on-prem trying to open and close files, it can be orders of magnitude faster. This is the current Linux support that we have. Generally, Linux supports 2.1.
Starting point is 00:23:34 They negotiate 3.0, but then decline all the optional features. So what that means is currently you can't use Linux outside the data center until they get encryption done. And for the really, really perfect transparent failover, they also need to implement persistent handles. And so if Steve French is somewhere around, yeah, bug him. And tell him to implement it. Yeah, that's what he's saying right now as we speak. Okay. So, yes. So you're opening port 445. Yes.
Starting point is 00:24:05 Are we going to have problems with ISPs that are basically blocking the port? Yeah, I've got that in a slide coming up. Okay, so this is the one marketing slide. Why did we do this?
Starting point is 00:24:20 Well, there are a lot of apps out there that were written 20 years ago, or they lost the source code, or whoever wrote it died or something. But they're mission-critical vertical apps, and it's their payroll system. And they just can't rewrite it. Maybe eventually they could rewrite it all to REST semantics, but for now they basically just need it to run. They want to move to the cloud for cost reasons, but they can't rewrite their apps. They just have to run the bits as they are. Because of that, this capability allows them to go
Starting point is 00:24:49 and run all their applications in an Azure VM, if they want the best performance, talking to an Azure storage account. Or, if they want to be a little more timid and go in more baby steps, they can actually take their
Starting point is 00:25:05 application servers and just point the share instead to an Azure share in the cloud and make sure that's all working correctly, albeit a lot slower than it used to be when it was on-prem, to make sure that there are no bugs there before then going and moving some of those actual application servers into cloud servers as well.
Starting point is 00:25:22 So it allows a smooth progression. Yeah, I saw it coming up, yeah. Tunneling DCE/RPC, which talks to things like pipes: we don't support named pipes. And we... What about SID resolution, looking up the name of the owners, et cetera?
Starting point is 00:25:44 Azure never integrated with Active Directory, because it always had a scheme of a storage account and a storage key, which is basically a superuser key. I mean, you have one account, and then you have a superuser key, and that's kind of... Everything's owned by them. And now that's obviously,
Starting point is 00:26:03 because you see how things have become more minor over time? There are features that we don't implement yet. And at the last slide of the talk, there's actually a link where it shows all the NTFS features that we don't implement yet. And a major one is completely integrating with Active Directory, because then, yes, we can do all the ACLs, we can do all the owners, we can do all that stuff.
Starting point is 00:26:21 But we still won't be able to print. So SMB, we will never implement printing over SMB. So sorry about that. Yes? Those things are on the top. The actual policy, especially multi-protocol, basically just one owner and everything has operations.
Starting point is 00:26:37 Yes. Yeah. I know. There's... You fake opposite and present that as security as possible. Well, if you just want, if you're just worried about an app compat thing, but if you're actually worried about security, anyone who has that key can access all the files. Yeah.
Starting point is 00:26:59 Oh, well, yes, right now we're returning the same one that FastFat returns, which is a global Everyone. It's S-1-1-0 or something like that. It's the same code, basically, that FastFat uses. When it gets the query for the security descriptor, we basically use that.
Starting point is 00:27:15 Because it's in the same boat. This is an interesting... When I talked about a feature that encryption enables... Sorry about... I did this slide at 3 o'clock in the morning, so I apologize a little about little things that aren't quite right, and realized I don't have enough room to put that one over there. But what this is showing is that with encryption, you have two different clients. One is like an Azure VM, so it's actually in the data center. Another one is, you know, some person
Starting point is 00:27:41 at their home or office or whatever. And they're both, again, connected to the same share, reading and writing the same files. All the file locks work. All the leases work. Even though their TCP connection is to two different nodes. And again, because anything that actually involves multiple nodes all goes down to the back end. But this is a really cool thing that you can do now.
Starting point is 00:28:00 We have this term, common internet file system, but really, how much of it runs over the internet? You know, like, none, right? I mean, it was, you know, it actually, SMB over the internet really never happened. But this actually,
Starting point is 00:28:13 this machine that I'm running on, this wasn't a VM. This is bare metal. Well, bare plastic. And it's actually, I'm just literally connected. The reason this cable's here is because you might have noticed our network,
Starting point is 00:28:25 our wireless network keeps on dropping out. It's really crappy. All these retransmissions and it was killing my demo. I made sure that they actually had a hardwired one for me because I really wanted the demo to go smoothly. In the previous talk today, I was like, oh no, things were stalling and dying.
Starting point is 00:28:41 It would have been a good demo in a way because it would have shown things picking up and running again, but no, it wouldn't have been as good. And, you know, especially at the start of Azure and a lot of cloud offerings, we were really stressing REST because there are a lot of benefits to statelessness.
Starting point is 00:29:01 There really are. And especially if you're starting from scratch, there are certain workloads where it makes a lot of sense. Again, because we implement this on top of XTable and page blobs, you have this coherent access to the namespace.
Starting point is 00:29:14 You can basically take an application that's a legacy application, which is business critical, and move it into the cloud so you're still continuously running, and then you can slowly, gradually, over time, just replace modules, if that workload makes sense for REST, and transition, so that eventually,
Starting point is 00:29:30 you've basically gotten to where you want to be with a REST implementation without any interruption at all, even if it takes years. So this is pretty cool. And again, if you think about what we did: if a write comes in via REST, we basically emulate an open for write access and then a write. So that's how we got all the oplocks
Starting point is 00:29:47 and leases and the byte range locks and everything to work correctly: we basically emulate the SMB operations that would happen as part of those GETs and PUTs and LISTs and things. This goes a little bit into how we wanted to optimize this. Again, this is taking the idea of deciding what state has to live where
Starting point is 00:30:10 and what we can get away with losing potentially and what we really have to be for correctness. We really have to durably commit. A lot of stuff is really only in memory. And on a serve-to-do system, it's only committed. And then only if you specify write-through is it really actually committed to the disk. Otherwise, I mean, try this sometime for fun. Go and do a huge X copy to a server
Starting point is 00:30:30 and pull the plug out on the server and see whether or not what you saw on the client matches up with what's on the server when you reboot. It won't. But for the true active-active, which you saw there in the other slides, where we have, again, multiple clients coming in and reading and writing the same data
Starting point is 00:30:47 connected to multiple clients with completely transparent failover between them. We really needed to do this... Yes. I was getting ahead of myself. So this is just setting the stage of why we had to do this fully continuously available active-active for the failover.
Starting point is 00:31:08 And this is just a flashback to the other slide to remind you again of how the state was kind of divided up. And on this one, again, this is another situation to memorize this slide.
Starting point is 00:31:22 And this is an example of state tiering. So we have this idea of ephemeral state. These are things that really only have to live on the client. We have the volatile ID, and credits, which are a type of throttling mechanism built into SMB. We have other throttling on the server. So on the back end, we have other throttling over, like, number of requests per share. But the actual credit-based throttling we leave completely on the front end. It's all reset if a session
Starting point is 00:31:47 reestablishes. TCP socket details. Immutable state is stuff that really, really, really is never going to change. Then I kind of came to this. Then we have two different types of durable state, and this is a little bit fungible, but I was trying to think of what's the best way to describe it. You have the solid stuff, which is
Starting point is 00:32:03 stuff that's basically created by the server itself. And it can change, but not really based on you calling a create file or a write file. It's things that are like losing a connection. So like session ID is one. If one of our nodes goes down or you lose a connection, we'll reestablish the session. You get a new session ID.
Starting point is 00:32:20 But that's not really the user directly going and calling an API to change state. Whereas this fluid durable state, open counts, file names, file sizes, they're durable in the sense that we still have to commit them. When we ack that write or that create, it needs to be durably committed in three different locations so that only a nuclear weapon is going to destroy it.
Starting point is 00:32:38 That's kind of our goal, our metric there. And it's by far the largest group of state. So we really do have to push most of this stuff to the back end. Yes? What about management disconnection of clients? Let's say, to be kind, someone on the West Coast opens a file with no share access, and then they go out to lunch. Yeah. Like a net file /close on a server?
Starting point is 00:33:15 Yeah, not yet. So yes, if somebody else has the correct account keys, yes, they can open up a file and then go away and basically wedge that file. And yeah, there is right now no equivalent of net file /close. Yeah, I mean, I could go on the server itself, and then I can update the tables
Starting point is 00:33:37 and change open counts and things, but, yeah, that doesn't scale. I like Jeremy's suggestion of terminating the first. But that doesn't scale. That won't help. The server will still live on. No, that's very... And actually, as we were developing this, we had actual bugs where we would actually get stuck in this state
Starting point is 00:34:01 and clients, you know, customers would call in and report it and it's like, yeah, you know, we had a bias error or something. That would be, yeah, we know. Well, but it's persistent. That's the problem. Even if you reboot Azure,
Starting point is 00:34:20 when it comes back up, that open count is still goddamn there, right? It's still stuck. You really need to go in and fix it by hand, effectively. Which is what a net file /close does, right? I mean, it goes in and fixes it. So that would be good to have. So in our 2.1 implementation,
Starting point is 00:34:36 you may be familiar with durable handles. This was created initially just to handle a network disconnect, right? A very simple network disconnect where all the state on the server was still fine. It just was a disconnect. Somebody tripped over a network cable or something. And it worked well, but we knew right away that even in our preview,
Starting point is 00:34:52 because we're constantly rebooting the, not rebooting, but killing our services on the nodes, we needed to handle more than just that. Because we have this load balancer that basically virtualizes the actual node, we were actually able to be kind of sneaky here and stretch durable handles because it wasn't visible to the server and go in, even if a whole node dies, because we persisted everything we need that normally is in srv2.sys memory on the back end,
Starting point is 00:35:17 when it reconnected, we were able to say, yeah, yeah, okay, we'll reconnect that handle. And as far as the client thinks, it's talking to the same server, but it's not. But it doesn't know that. So we're able to stretch it here, but the problem is that the spec, MS-SMB2, basically says things like, well, but if you don't have a handle lease, you have to close that. You can't keep it.
Starting point is 00:35:37 And it was frustrating for us. It's like, no, but you want to keep it, because we can. I mean, we have all the state. We really have to go and, yeah, we have to close down that handle, even though we really don't have to. But for MS-SMB2 compliance, we had to go and actually do this painful self-mutilation and close these handles that we really didn't want to. That was with durable handles because that's all we could get away with
Starting point is 00:36:00 without violating the spec. But with persistent handles in SMB3, well now it's really cool. This is the promised land. Because now we can go and specifically advertise continuous availability, it's a share property, and that makes the client request persistent handles and now it's just exactly what we want. Not only do we get away from the self-limiting rules, but there's all this state that the client maintains for us, which is really useful, like these create GUIDs. So it allows me to detect whether or not a create came down,
Starting point is 00:36:32 and I went and wrote it out, I durably persisted it, but on the way back to the client, my front end died. So the client has no idea whether it worked or not, and there are certain requests that are not idempotent, like if you specify a create disposition that can only happen once. I need to be able to know that, oh, I've already seen this create, and so I can basically just
Starting point is 00:36:49 succeed it, because I know I already did, and it's just the client coming in again, who doesn't know that. That was all thought about when the persistent handles were designed. It was pretty much exactly this scenario. Basically, hopefully, it covers all the holes and all the gaps,
Starting point is 00:37:06 which is pretty cool because now we basically have this fully transparent failover, which is pretty neat. Why did you actually limit yourself with those rules? Because if you look at, I wish I could quote the actual lines in, you know, chapter and verse of MS-SMB2. Just ignore the must, I mean. No, you can't. I think Dave Kruse is here; at some point you can chat with him about why it is that
Starting point is 00:37:35 if you don't have a handle lease that you can't go and allow a durable handle to survive a disconnect. If you do, then you could just, yeah. Yes, we could have just basically... We could have gotten persistent handles. Well, not all of it. We could have gotten most of it, but this other state that the client goes and thankfully transmits back to us allowed us to plug a lot of other holes.
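Editor's sketch: the replay detection described for persistent handles can be illustrated roughly as follows. This is a minimal toy model, not the Azure Files implementation. The class `CreateLog`, its in-memory dictionary, and the "fail if the file exists" disposition are all assumptions made for illustration; the talk only says that the client supplies a create GUID and that the server durably records creates it has already applied.

```python
import uuid


class CreateLog:
    """Toy stand-in for a durable table of applied creates, keyed by create GUID."""

    def __init__(self):
        self._applied = {}  # create GUID -> handle id (hypothetical layout)

    def handle_create(self, create_guid, path, files):
        """Apply a non-idempotent create once; return (handle, was_replay)."""
        if create_guid in self._applied:
            # Same GUID seen before: the earlier reply was lost, so just
            # return the recorded result instead of running the create again.
            return self._applied[create_guid], True
        if path in files:
            # Genuinely new request against an existing file: fail it.
            raise FileExistsError(path)
        handle = len(self._applied) + 1
        files.add(path)
        self._applied[create_guid] = handle
        return handle, False


log, files, guid = CreateLog(), set(), uuid.uuid4()
first = log.handle_create(guid, "report.txt", files)
retry = log.handle_create(guid, "report.txt", files)  # client resends after failover
```

The retry carries the same GUID, so it succeeds with the original handle, while a different client using a fresh GUID against the same path would still get the "already exists" failure.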
Starting point is 00:37:54 So it still wouldn't have been there. It would have been a lot more there, but hopefully we're rolling out SMB 3.0 very soon, including Linux clients, and so they'll get the same benefit. OK, so these are three links that are useful. This Getting Started blog, it's actually about a year old now, but basically has all the details
Starting point is 00:38:17 you need of how to get the account, how to use the PowerShell scripts for creating shares, the actual programmatic ways to do them, and all that. It's a step-by-step guide. It's pretty useful. As I said, we don't currently support all NTFS features, things like named streams, EAs. You can imagine we had a priority list. We wanted to support the most important ones first. And it's not unlike what ReFS did. You know, it basically picked out which are the most important NTFS features and which are the ones that we never want to support, ever.
Starting point is 00:38:49 And so we basically had the same sort of culling. But things that are important, we will get to. I mean, ACLs is a huge one right now. But we need Active Directory support for that, and that's coming in via other initiatives in Azure to get that full, complete synergy between an on-premises domain controller and the cloud. And we know that's a huge
Starting point is 00:39:08 gap right now that we're working on. There's an interesting caveat. Because we have this shared namespace between REST and SMB, REST has to come in over HTTP, and so all these RFCs get their mitts in there. And when you're coming over
Starting point is 00:39:24 SMB, names are actually not UTF-16; it's actually called UCS-2. It's an older version. It's back before there were special characters, just literally a bunch of ushorts. And that is all that goes over SMB, unless it's one of these special characters that shall not be named. You know, anything below 32, and then there's a handful of other special characters. Everything else is legal.
Starting point is 00:39:47 And you can actually create these really weird file names in NTFS, and it's all completely legal. But when you're coming over HTTP, that's limited. Because we really wanted to maintain complete coherency, we put some limits on names, in terms of characters, lengths, and depths of paths, I guess. I'm not an expert in the HTTP protocol,
Starting point is 00:40:04 but the people that were basically said, these are the things you have to limit. And it's not egregious. I mean, they're much better than if you're calling Win32 APIs. If you're calling NT APIs or you use the special form, backslash backslash question mark, that's a special way you can kind of escape it out effectively. You can get around that.
Starting point is 00:40:20 But so far, this hasn't really been a problem for us. But just to note, this document goes into the details of which characters are legal and which ones aren't. If you're talking to a real srv2.sys server, these are the limitations, and these are our limitations. So it's useful if you think you might have a problem. I don't even know what, is that time?
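Editor's sketch: the naming rule described above (nothing below code point 32, plus a handful of reserved characters) could be checked along these lines. The reserved set used here is the usual Win32 file-name list and is an assumption for illustration; the talk does not enumerate the characters, and the additional length and path-depth limits that apply to the shared REST namespace are not modeled because their values are not given. The function name `is_legal_component` is hypothetical.

```python
# Assumed reserved characters (the common Win32 file-name set); the
# authoritative list is in the naming document the speaker refers to.
RESERVED = set('<>:"/\\|?*')


def is_legal_component(name: str) -> bool:
    """Return True if one path component avoids control and reserved characters."""
    if not name:
        return False
    for ch in name:
        # "Anything below 32" in the talk: control characters are rejected.
        if ord(ch) < 32 or ch in RESERVED:
            return False
    return True


names = ["report.txt", "weird name !@#.txt", "bad?name", "tab\there"]
legal = [n for n in names if is_legal_component(n)]
```

Here `legal` keeps the first two names and drops the ones containing `?` and a tab character.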
Starting point is 00:40:42 Wow. So, yeah, I guess I went kind of fast for that. But does anybody have any questions? You must have another one. Oh, yes. Yes. Yes. It's in here somewhere.
Starting point is 00:41:13 Unless I'm not using the latest version of the slides that I edited this morning. That's weird. So what I said was that some ISPs block 445. Yeah, home ones might. Yeah, well, yeah,
Starting point is 00:41:44 we found, I mean, it either works or it doesn't, usually. I mean, either they block 445 or not. No, they sort of notice it for a while. Really? They whack. Okay. Well, I was trying to be the end.
Starting point is 00:41:53 Oh, okay. So, yeah, I thought for sure I had, I remember when I added this thing, I thought I also added a slide talking about that. If you could change the source port on the window of the IAP, people have been asking for a long time.
Starting point is 00:42:11 It would get around it. There was some decision to use a different port. Yeah. Again, for businesses, it's less of an issue because they can program their own firewalls. Here, I was... We didn't know if the Hyatt blocked 445 because I knew that the wireless
Starting point is 00:42:26 was just not working at all for me here. And so we got them to go and plug this in here and we were like, we had no idea if it was going to block 445 until after the last talk was over. And I connected and it's like, thank God it didn't block 445. Yeah, yeah, yeah.
Starting point is 00:42:41 So yeah, that is a potential problem. But again, it's a really cool feature to be able to do this from on-prem, but the latency implications are really high, so we expect it to be an interim thing that people... I mean, it's kind of cool, but whether or not actually people use it in production, some people might.
Starting point is 00:43:02 Yes? How about the byte range locking? Yeah, yeah. Yes, so if you go in... Well, we don't actually have an implement... You can't actually set byte range locks, but they'll be respected. So if there's a byte range lock set,
Starting point is 00:43:17 an exclusive lock set on a range of a file, and you try to do a get for that region, you'll get back an HTTP error. I don't remember the exact error we give, but it's one that says conflict or something. So yes, I mean, we basically respect byte range locks with REST. And then if you try to do a
Starting point is 00:43:31 write, that will fail as well. Yes? Do you document the mapping to the blocks on the table? Is that sort of a supported feature, how you're mapping the file system into blocks? Well, no, because it changes all the time internally under the hood. So, I mean, it's... Oh, well, because it looks like a table.
Starting point is 00:43:58 So, I mean, it's, yeah. So, that is, it's documented in that level in the sense that you do a list range and you get the directories, and then you can do a get and put and can get the files. Can I see what's locked at the bottom? No, those are what's called, we call them nested tables,
Starting point is 00:44:15 which is a terrible term, because if you think of HTTP, nested tables, other tables. Actually, what it means is that we can guarantee atomicity across them, so it should be co-located tables. But no, via these APIs, you have access only to the directory structure and the actual file data.
Starting point is 00:44:30 You can't go and... Yeah, otherwise... Maybe you could fix it yourself if you had an off open count. But that would obviously have other problems. And can you keep versions of files just by adding more and more blocks? Nothing that I can talk about right now. Any other questions? Yes?
Starting point is 00:44:56 I'm not sure if you could talk about that, but how many opens per second can you get if you're running a virtual... It depends if you're doing... It depends on how many opens can you do per second. So it depends. Are you doing it in a loop, or do you have a lot of clients doing it all at the same time?
Starting point is 00:45:15 Those are different questions. The average latency I've seen for opens is like five or six milliseconds. If you're talking to an SMB2 server, it's almost always under a millisecond,
Starting point is 00:45:30 assuming it doesn't have to fault something in and hit the disk. But if stuff is in memory, it's always under a millisecond. We have a very busy server called Scratch2, Scratch, or whatever. And it's very, very busy, and I was doing tests on that,
Starting point is 00:45:42 and I always get my opens in under a millisecond. Whereas we're on the order of five or six for an open. And it's, again, because we're doing a lot. And if we can get to a state where we take stuff that doesn't have to survive a natural disaster, open counts, basically, the temporal state about the handle,
Starting point is 00:46:03 and keep that in memory on enough nodes that no one node going down would lose it, then we could really boost that. It still would never be as fast as srv2.sys, because that's only updating memory in one location, but we would only have to update memory in different locations, not physically write.
Starting point is 00:46:17 At this layer, it's actually not writing to a spinning media. It's logged in several SSDs on different table servers, but still, it's slower than memory. But I mean, for 20 seconds, it's good. Well, it's not as good as one. So this is a little bit of history about myself. Before I did this, before I joined Azure to do this,
Starting point is 00:46:39 I had never written user mode code except for tests. I mean, what else is it good for? You'd write tests in user mode. And so I had this idea, well, yeah, I raised to DPC level. Of course nobody else can run. I can't do that anymore. And all these things that
Starting point is 00:46:54 when you're writing in kernel mode, because you have to be careful what you touch in paged pool, you become extremely cognizant of how many cycles things take. You become very cognizant of how long things run, and you're very efficient.
Starting point is 00:47:09 You don't go and call some constructor that goes and calls some STL routine that does God knows what for five million cycles later and returns to you, and you didn't even know that you triggered that. So it's very tight. And so five milliseconds is forever. I know, but having the data replicated to three places in that five milliseconds.
Starting point is 00:47:30 Yeah. That's impressive. Oh, well, thank you. I didn't do that part. That was all done by these primitives, again, that I can call. So all I've got to do is just go and just update these tables and it all just happens. If you're not doing named pipes, you're not doing server service. You can't do net view. I know. Yeah, we would love to...
Starting point is 00:47:51 If net view were part of SMB2 as an ioctl, that would be great, because then we could trivially implement it. No, it's a... Yeah, we need named pipes so we can do RPC. Though someone earlier told me I could do RPC over TCP, which I wasn't even aware of. If that could possibly work, if a Windows client would actually work that way, and
Starting point is 00:48:11 net view would work, trying to do RPC over TCP... I always thought it would always run over named pipes. But yeah, we'd love to get net view, because the first thing people do is a net view on the server, and it doesn't work. Okay, is that it? Well, I don't want to keep people longer. If you have any more questions, come and talk to me.
Starting point is 00:48:33 Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit
