Storage Developer Conference - #196: Direct Drive - Azure's Next-generation Block Storage Architecture

Episode Date: September 5, 2023

...

Transcript
Starting point is 00:00:00 Hello, this is Bill Martin, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developers Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast Episode 196. Good morning. My name is Greg Kramer. I'm a software architect in Microsoft Azure Storage, and I'm here today to talk to you about Direct Drive, which is Azure's next-generation block storage architecture. The agenda for the talk, we're going to do a quick introduction, talk about the what and the why, what was our motivation for designing a new architecture. Once we're through that, I want to walk through how the block architecture works,
Starting point is 00:01:15 and then we'll talk about some notable design elements that we thought might be interesting to this audience, and we'll reserve some time at the end for questions. That being said, if you have questions as I'm going, please just shout them out. It'll be easier that way and you'll get tired of listening to me talk at you for 40 minutes. It'll be nice to have a break. All right, let's get into it. So Direct Drive is an internal code name for Azure's new block storage architecture. And it's currently the foundation for a family of disk offerings. UltraDisk was released in the summer of 2019, and Azure Premium Disk V2 just entered preview this summer. So why did we take the time and effort
Starting point is 00:02:10 to come up with a new block storage architecture? So as many of you know, Microsoft has been in the storage game for decades. We have lots of experience on-premises with Windows and Windows Server, and of course, our public cloud experience in Azure. And so we've learned a lot of lessons over the years, seen a lot of things. There's also constantly new workloads emerging, as well as new technologies that we can take
Starting point is 00:02:38 advantage of. And so it became interesting to start thinking about how could we take the things that we've learned over the years and design something that would allow us to go after these new workloads and take advantage of new and emerging technologies? And what would that look like? So I'm going to walk through some of our thinking of what we were trying to build here. So one of the things that we were very interested in is building something that could be deployed across a wide spectrum of environments. I mean, obviously we have Azure at the hyperscale, but it would be ideal if we could also deploy this solution down to the smallest single-server deployment. And in fact, that's how our developers do their inner-ring testing for Direct Drive today.
Starting point is 00:03:30 We can deploy an entire storage solution to one developer desktop, fully virtualized, which is great for testing. But we're targeting everything in between, so we want to be able to scale that up to smaller deployments, and all the way up again to Azure's hyperscale. Another thing we were interested in is simplifying the experience for the end users. So a common pattern that you see in block storage is that you gang up a number of disks to get, you know, the performance that you need or the capacity you need.
Starting point is 00:04:09 And this can be troublesome for customers. There's a lot to manage here. You tend to have to use something like, you know, storage spaces in Windows or the equivalent in Linux to stitch all these together into something that looks like one disk that does what you need. And ideally, we'd like to eliminate that. If a customer needs a disk of a certain size and a certain performance, then that's what they should provision. We also wanted to make sure that we didn't tie people to having to buy capacity or performance in certain units. So if you wanted a small, fast disk, that's fine.
Starting point is 00:04:52 If you need a slightly larger but slower disk for your workload, that's fine too. We wanted to be able to cover the entire spectrum of use cases for our customers. And we also wanted to make sure that people didn't have to provision for the worst case. I'm sure most people here are familiar with this. You have to try to estimate what your peak workload might look like and then make sure that whatever you've provisioned is going to allow you to do that. This can be expensive if you have to run that way just in case you're going to hit your peak.
Starting point is 00:05:28 And so we also wanted to come up with a design that allowed our customers to meet that peak demand and then scale down the performance targets when you were in off-peak hours. The new technologies that I mentioned at the beginning are also something that was front of mind. We're seeing networks are getting faster and faster and faster. We have new storage network protocols. I'm going to defer talking about that until a little later in the talk. And we have new innovations in storage hardware.
Starting point is 00:06:04 Storage class memory is an obvious example of that. And hardware offloads are becoming very, very popular, especially for things like cryptography, CRCs, et cetera. This was something that was very interesting to us because, of course, everything that we store, all of our customer data is encrypted, and we offer end-to-end integrity with CRCs that live with the data throughout its lifetime. And so being able to offload these things is a tremendous advantage. And so we wanted to come up with an
Starting point is 00:06:37 architecture that allowed us to take advantage of all of these new things that are showing up in the market. Lastly, we wanted to simplify the I/O path. So on the left there, we are showing the architecture that preceded us in Azure Storage. There's a lot of layers here. It's a well-architected, multi-layered solution. And the talk that's directly after mine will talk about some of the capabilities that we get out of this multi-layered solution, but if what we're interested in is the
Starting point is 00:07:14 best and most consistent performance, we need to remove those layers and get to a thinner stack. And so we wanted our disk clients to have direct access to their data. That's the "direct" in Direct Drive. So that was the motivation for why we went to design a new architecture. And so next, we'll step into how that architecture actually works. So before we get into that, though, I do want to offer sort of a word of warning or caution. The Direct Drive architecture is incredibly flexible. We have many different ways to deploy it, many form factors, many performance targets, et cetera. This talk is specifically about the architecture, not any of the products that we've built on top of it.
Starting point is 00:08:06 So the architecture does support things that we may not expose today in products. So this should not be construed as making any official statements about the capabilities or limitations of the products that we've built on top of it. All right, let's start with the basics. So we wanted to be able to support two disk types. So we have 4K native disks, with 4K logical and physical sectors, as well as 512E, which presents 512-byte logical sectors on top of 4K physical sectors. Now, with 512E disks, just like with actual physical disks, you'll get the best performance by issuing 4K-aligned I.O., which allows us to avoid the read-modify-write penalty.
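To make the alignment point concrete, here is a minimal sketch, not anything from the actual stack, of the check that decides whether a write to a 512e disk avoids a read-modify-write of its 4K physical sector; the constants and the helper name are just illustrative assumptions:

```python
PHYSICAL_SECTOR = 4096   # 512e: 512-byte logical sectors emulated on 4K physical sectors
LOGICAL_SECTOR = 512

def needs_read_modify_write(offset_bytes: int, length_bytes: int) -> bool:
    """True when a write touches only part of a 4K physical sector, forcing the
    device to read the sector, merge the new 512-byte pieces, and write it back."""
    return offset_bytes % PHYSICAL_SECTOR != 0 or length_bytes % PHYSICAL_SECTOR != 0

# A 512-byte write at logical sector 1 covers only part of a physical sector: RMW.
assert needs_read_modify_write(1 * LOGICAL_SECTOR, LOGICAL_SECTOR)
# A 4K write at a 4K-aligned offset maps exactly onto one physical sector: no RMW.
assert not needs_read_modify_write(8 * LOGICAL_SECTOR, PHYSICAL_SECTOR)
```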
Starting point is 00:08:53 The core feature set that we're going after, other than just reading and writing your disks, includes shared disks. So we want to support both single-writer, multi-reader, as well as multi-writer, multi-reader, so that your disk can be mounted from multiple VMs at once. We need to support crash-consistent and incremental snapshots. So this is the idea that if you take a snapshot and then take a second snapshot, the amount of space that we consume should be proportional to the amount of data that you've written since the last snapshot.
Starting point is 00:09:22 And we need to support disk migration. Now, disk migration is not necessarily an end-user facing feature, but it's an important feature that we use internally. Disk migration allows us to take a disk and move it from one set of storage nodes to a different set of storage nodes while that disk is mounted and taking I.O. We use this for a variety of purposes. We can move the disk closer to the customer. We can move the disk off of an extremely busy tenant to a less busy tenant. Or in the case of decommissioning a set of storage nodes that have reached the end of
Starting point is 00:10:00 their life, we can move the disk to a set of nodes that are still good. So how does a disk look to our back end? Obviously, the customer sees a flat LBA space, the classic disk layout. We actually manage disks in units of fixed-size chunks that we call slices. So a slice size is fixed for a particular disk, but they can vary across disks. In this example, I've shown 64 meg. Now, when might we want to have larger slices? Well, if we had tremendously large disks, there is a certain amount of metadata that we have to maintain per slice.
Starting point is 00:10:44 And so with very, very large disks, we might choose to increase that slice size to lower our metadata overhead. This is just one possible example. Every disk slice is maintained by what we call a replica set. And a replica set consists of a single change coordinator service, the CCS, and multiple block storage services, the BSSs. The CCS is responsible for sequencing and replicating changes to the slice. So we're going to store multiple copies of the data, and they need to be kept in sync, and the CCS performs that function. The CCS is largely a stateless service
Starting point is 00:11:32 in the sense that if it crashes or reboots, there's nothing that it loads on startup again. So the CCS's information is placed into it as it joins the cluster, and it runs that way until it goes away, and then we'll pick a different CCS. We'll talk a little bit more about that later. The block storage services are stateful. So that's where we have nodes with SSDs attached to them. Other than just reading and writing your disk data, which is their primary responsibility, the BSSs are also responsible for scrubbing the data. So I mentioned that we have end-to-end integrity built into our protocols, but if you're not actively doing I.O. to your disk,
Starting point is 00:12:19 then there's no one to check that the data is still good, and so we have background scrubbers that go through and constantly check it, just in case there's no I.O. happening so that we can fix it before it's a problem. The BSSs are also responsible for repairing slice data when necessary. So, for example, if the scrubber were to find a problem, that BSS would raise its hand, report an error, and it would be asked to repair the data from one of the good copies, one of the other BSSs in the replica set. One thing to note here is that the number of BSSs in the replica set is not fixed. So this particular example has four, but the number of BSSs is really a function of the durability guarantee that you want to provide. So we could
Starting point is 00:13:06 have two BSSs. We could have four like in this example. You could have seven. It's really up to you. This is the flexibility in the architecture. Because the BSSs are stateful, we need to make sure that there aren't correlated failures that could take down multiple copies of your data. And so the BSSs have to be split up into multiple fault domains. Now, fault domain is another one of those terms that the definition depends on what exactly you're going for. And the direct drive architecture allows you to sort of pick where on the spectrum you want to be. One common example that you would see would be like the BSSs in a replica set shouldn't all be under the same switch.
Starting point is 00:13:55 If they were, then the switch dies and you lose access to your data. But depending on your environment, you might actually define fault domain to also include power supply, as just another example. The definition is baked into the metadata layer that we'll talk about later, which is aware of these fault domains you define, and it'll make sure that the BSSs are spread out across them.
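As a rough illustration of that placement constraint, and not the actual MDS logic, here is a sketch of picking BSS nodes so that no two replicas share a fault domain; the node and domain names are invented:

```python
from typing import Dict, List

def place_replicas(nodes_by_fault_domain: Dict[str, List[str]], replica_count: int) -> List[str]:
    """Pick one BSS node from each of `replica_count` distinct fault domains, so a
    single switch (or power supply, or whatever the deployment treats as a fault
    domain) can never take out more than one copy of a slice."""
    chosen = []
    for _domain, nodes in sorted(nodes_by_fault_domain.items()):
        if len(chosen) == replica_count:
            break
        if nodes:                         # take the first healthy node in this domain
            chosen.append(nodes[0])
    if len(chosen) < replica_count:
        raise ValueError("not enough fault domains for the requested durability")
    return chosen

# Example: four replicas, each under a different top-of-rack switch.
print(place_replicas({"switch-a": ["bss-0"], "switch-b": ["bss-1"],
                      "switch-c": ["bss-2"], "switch-d": ["bss-3"]}, replica_count=4))
```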
Starting point is 00:14:34 All right, so let's talk about writing to a slice. When you issue a write to your disk, the write goes to the change coordinator, the CCS. The CCS is going to assign a sequence number to that write, an LSN. So in this example, we have an unsequenced write coming in, and the CCS has assigned the next sequence, 107, to this write. The CCS then replicates this in parallel to all the BSSs in the replica set. And you can see that in this example, all of those BSSs are at 106 at the moment. The BSSs will receive that write, they will durably store it, and then they'll do something called promising the write back to the CCS. A promise is exactly what it sounds like. If you came back to me and asked for the data, I promised that I could return it to you. So in this particular example, you can see that BSSs
Starting point is 00:15:16 0, 1, and 3 have successfully promised the write, but BSS 2 is a bit of a laggard in this case. Maybe there was some network congestion, or he's just slow at the moment. Promising happens, or from the CCS's perspective, the operation is promised when N of M BSS instances have successfully promised back to it. This is another one of these flexible areas in the architecture. In this example, I'm using quorum replication, so three out of four.
Starting point is 00:15:49 But we could just as well pick four out of four if we wanted to, or any other value that satisfies your durability requirements. Once the CCS gets the requisite number of promises, it's free to complete that operation back to the initiator. So now we can promise the initiator that that write actually occurred, and that if you were to come back and read it, we would be able to give it to you. So reading. Reads can be issued directly to any BSS in the replica set. So in this example, we've issued a read to BSS0, and it has the data, and so it can succeed the read.
Starting point is 00:16:30 You'll note that the reads come in with a sequence number, unlike writes. This is what allows us to avoid returning stale data in case you have configured quorum-based promising. So we're using three out of four quorum-based promising in this example here. And so if we were to issue a read to our laggard BSS, BSS2, it cannot return the data, or should not, must not, or it would be giving you data from the past. And so the sequence numbers allow that
Starting point is 00:17:00 BSS to fail the I.O. In this case, you'd be able to reissue the read to any other BSS in the replica set, and it would succeed. Now, what we've found is that in production, this need to retry in this fashion is pretty rare. Most of the workloads we see don't have the pattern of issuing writes and then immediately reading the data that was just written.
Starting point is 00:17:26 And the replication is fairly fast, and so it doesn't take very long for all of the replicas to catch up. If this turned out to be a problem, though, you could just switch from quorum-based promising to full promising by setting N and M equal. And in that case, you would never see your write complete until all the members of the replica set had the data.
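Here is a deliberately simplified sketch of the sequencing and promising flow just described. It is not the Direct Drive implementation: replication is shown as a loop rather than in parallel, data is keyed by LSN instead of block address for brevity, and all of the class and method names are made up. It is only meant to show N-of-M promising, and how a read that carries a sequence number lets a lagging BSS refuse to serve stale data.

```python
class BSS:
    """Block storage service replica for one slice (illustration only)."""
    def __init__(self):
        self.promised_lsn = 106        # highest LSN durably stored so far
        self.data = {}                 # lsn -> payload, standing in for durable media

    def replicate(self, lsn, payload):
        self.data[lsn] = payload       # durably store the change ...
        self.promised_lsn = lsn        # ... then promise it back to the CCS
        return lsn

    def read(self, lsn):
        if lsn > self.promised_lsn:    # lagging replica: must not serve stale data
            raise RuntimeError("not promised yet, retry on another BSS")
        return self.data.get(lsn)

class CCS:
    """Change coordinator: sequences writes, completes after N of M promises."""
    def __init__(self, replicas, promises_needed):
        self.replicas = replicas
        self.promises_needed = promises_needed   # e.g. 3 of 4 for quorum promising
        self.lsn = 106

    def write(self, payload):
        self.lsn += 1
        promises = 0
        # The real system replicates in parallel; any replica this loop never
        # reaches simply plays the role of the laggard in the example above.
        for bss in self.replicas:
            if bss.replicate(self.lsn, payload) == self.lsn:
                promises += 1
            if promises >= self.promises_needed:
                return self.lsn                  # complete back to the initiator
        raise RuntimeError("could not gather enough promises")

# Quorum promising, three out of four, as in the talk's example.
replicas = [BSS() for _ in range(4)]
ccs = CCS(replicas, promises_needed=3)
lsn = ccs.write(b"sector payload")               # assigned LSN 107
print(replicas[0].read(lsn))                     # an up-to-date replica serves the read
# replicas[3].read(lsn) would raise: it is the laggard and has not promised 107.
```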
Starting point is 00:18:04 So, replica sets. We've seen how a CCS and multiple BSSs are part of a replica set for a slice. But every CCS and every BSS play that role for multiple slices. So in this example, you can see that slice 0 has CCS 0 and BSS 0, etc. Slice 1 shares the CCS with slice 0 and several BSSs, and so on. So what this is doing is it allows us to spread the load for a particular virtual disk across many back-end storage elements. Now, we've been talking about how we manage disks in terms of slices on the back end, but that's not exactly the view that we expose to the disk clients themselves. If we did it that way, then certain workloads, like sequential workloads, would tend to victimize certain replica sets for extended periods of time until they switched to the next slice and then the next slice. And so we use a very common technique.
Starting point is 00:19:00 We stripe the data from the client's perspective. The striping algorithm is also very flexible. This is just one example. So we have a four-way stripe set. So every stripe set contains four slices, and we've set the width to 256K. And as a result, the first 256K of the disk from the client's perspective is going to be the first 256K of slice 0, and the next 256K of the disk from the client's perspective will be the first 256K from slice 1, and so on.
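To make that mapping concrete, here is a small sketch of the arithmetic for a four-way, 256K-wide stripe over the 64 MB slices from the earlier example; the function name and the exact layout of stripe sets are assumptions for illustration, not the real VDC striping code.

```python
STRIPE_WIDTH = 256 * 1024                  # 256K stripe unit, as in the example
SLICES_PER_STRIPE_SET = 4                  # four-way stripe set
SLICE_SIZE = 64 * 1024 * 1024              # the 64 MB slice size shown earlier

def map_offset(disk_offset: int):
    """Map a client-visible byte offset to (slice index, offset within that slice)."""
    stripe_set_span = SLICES_PER_STRIPE_SET * SLICE_SIZE
    stripe_set = disk_offset // stripe_set_span
    within_set = disk_offset % stripe_set_span
    stripe_unit = within_set // STRIPE_WIDTH
    slice_in_set = stripe_unit % SLICES_PER_STRIPE_SET
    slice_offset = ((stripe_unit // SLICES_PER_STRIPE_SET) * STRIPE_WIDTH
                    + within_set % STRIPE_WIDTH)
    return stripe_set * SLICES_PER_STRIPE_SET + slice_in_set, slice_offset

assert map_offset(0) == (0, 0)                        # first 256K -> start of slice 0
assert map_offset(256 * 1024) == (1, 0)               # next 256K  -> start of slice 1
assert map_offset(4 * 256 * 1024) == (0, 256 * 1024)  # then back to slice 0, offset 256K
```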
Starting point is 00:19:21 So speaking of the disk client, our disk client is referred to as the virtual disk client, VDC. And it's actually broken into two components. So VDC sits below the Hyper-V storage stack. And the first component in the chain is what we call VDC proxy.
Starting point is 00:19:56 VDC proxy is an important element for allowing us to service disks while they're active, so while they are taking I.O. from customers. Most of the time, VDC proxy runs in pass-through mode, and so the I.O.s are simply flowing from the VM through it all the way down to VDC. However, if we need to service the stack below it while there are disks present and I.O.s flowing, VDC proxy will kick in, and it will begin accumulating the IOs from the VMs in its layer, and it will also serialize the state from the VDC below it into memory, allowing us to unload the VDC stack below it, replace it with an updated version, at which point we can reload all of that state into VDC and
Starting point is 00:20:40 allow the IOs through. So at worst the VMs might experience a momentary blip in their IO performance, not knowing that we have replaced the entire storage stack beneath them. VDC is where most of the action is at. VDC is responsible for mounting and dismounting disks. It reports errors. It handles error responses. It does throttling for IOs,
Starting point is 00:21:08 and it's where all of the disk striping logic that we just looked at is present. VDC has two communication channels to the nodes that actually host the storage roles. So we have a separate data plane and control plane. We'll talk about these a little bit more later. So at this point, we've introduced all of the components that we would say are part of the direct drive data plane. So we have our VDC client. We have our change coordinators that sequence operations. We have our block storage services that durably store and retrieve those. And these are all connected
Starting point is 00:21:47 by a custom storage network protocol that we refer to as DDX. I'm going to defer talking about DDX until just a little bit later in this talk. What we haven't talked about so far is the control plane. The primary service in our control plane is known as the metadata service,
Starting point is 00:22:06 or MDS. MDS is the brains of the operation. It's responsible for deciding which CCSs and BSSs are in which replica sets. It handles disk control plane requests like creating disks, deleting disks, growing them, etc. It's also the accumulator of all error reports. So any problems in the system, the agents will report those to MDS. MDS will collect those, examine them, and then determine what corrective action is needed and then issue commands for corrective action to the other storage elements.
Starting point is 00:22:42 MDS also needs to be spread across multiple fault domains to avoid correlated failures, and we use Paxos to replicate state between them. And so there is, at any given time, a primary MDS and several standby or secondary MDSs. If the primary were to fail, an election is held, a new primary is elected, and we carry on. And we use a custom RPC for intra-cluster control traffic. So this is how the MDSs communicate with each other, as well as the BSSs and CCSs. Another important element of the control plane is our gateway service. So the gateway service sits between our control plane, MDS, and the outside world. So all extra-cluster traffic comes through a software load balancer, hits the gateway, is authenticated, and then passed along to the primary MDS using our custom RPC.
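One concrete piece of MDS's bookkeeping, which comes up again in the Q&A at the end, is the slice map it hands back when a client mounts a disk. A rough sketch of its shape, with invented field names, might look like this:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReplicaSet:
    ccs: str             # change coordinator for this slice
    bss: List[str]       # block storage services holding the copies

@dataclass
class SliceMap:
    """Roughly the shape of what MDS hands back on a mount: the disk's
    properties plus, for every slice, who coordinates it and who stores it."""
    disk_size_bytes: int
    sector_size: int
    slice_size_bytes: int
    slices: List[ReplicaSet]

# A tiny two-slice disk, four-way replicated; every name here is hypothetical.
example = SliceMap(
    disk_size_bytes=128 * 1024 * 1024,
    sector_size=4096,
    slice_size_bytes=64 * 1024 * 1024,
    slices=[
        ReplicaSet(ccs="ccs-0", bss=["bss-0", "bss-1", "bss-2", "bss-3"]),
        ReplicaSet(ccs="ccs-0", bss=["bss-1", "bss-2", "bss-4", "bss-5"]),
    ],
)
```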
Starting point is 00:23:50 All right, putting it all together, we finally see the full picture. So this is the high-level block overview of what the Direct Drive architecture looks like. You see we have our data plane up at the top, we have our control plane at the bottom, and we have our Azure resource providers, which would be fronting things like the Azure portal. When you go into the UI and click create disk or whatnot, those requests come in from these resource providers. Any questions so far? All right. So we picked a handful of what we considered notable design elements that we thought might interest this audience, and we're going to walk through a couple of those now.
Starting point is 00:24:41 So earlier I mentioned that we have a custom data plane storage network protocol. It was built for the Direct Drive data plane. And one of the very first questions that we get is, why didn't you just use NVMe over Fabrics, iSCSI, or any of the other off-the-shelf block protocols? Which is a fair question. Our answer there is that there are three primary reasons we decided to implement a custom protocol. So we really want to eliminate middle layers to allow clients direct access to their data. And as we saw earlier, the read and write paths require those sequence numbers to be transmitted back and forth between the storage elements and the disk client. Without those, we can't efficiently provide the features that we're interested in, like consistent reads and writes across distributed slice replica sets, shared disks, crash-consistent distributed snapshots, disk migration, etc. So the client has to be in on the game, and the protocol has to
Starting point is 00:25:44 understand that there are these additional elements that need to be in on the game, and the protocol has to understand that there are these additional elements that need to be transmitted back and forth. Another really important and often overlooked attribute here is that we sort of live and die by our ability to quickly diagnose problems. Especially at Azure scale, there are simply too many machines out there to have developers spending significant time looking at individual things to try to figure out why something may not be working. So one of the ways that we've addressed this is that we've spent significant time baking diagnostic support directly into the protocol. One way that that looks is that we can tag these IOs with what we refer to as activity IDs so that as a customer IO transits through these multiple elements across a big network, we can easily do distributed log searching that will quickly pull up from every relevant node what happened at each node in the transit. This is crucial to us figuring out how things are working and what's going wrong.
Starting point is 00:26:55 But more importantly, we don't want to have to look at the logs if we don't have to. And so one of the important things that we've done in DDX is we've allowed the response messages to carry significant diagnostic content from the responder. Some examples of what that looks like is how long did this request spend in queues on my node? Or from my perspective as the responder, how much time processing this I.O. is attributable to the network itself? Or how much time did I spend waiting on storage media?
Starting point is 00:27:30 If we pass this information back and aggregate it, we can actually write some pretty complex automatic diagnostic tools that will find slow spots in the network. And so this makes it much easier for us to find, for example, the node whose network cable is going bad or is otherwise starting to fail. And so we can evict it from our clusters automatically, without engineers having to look at it.
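As a hypothetical illustration of that idea, and not the real DDX message format, a response might carry responder-side timings that an aggregation job can mine for slow nodes:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class IoResponseDiagnostics:
    """Illustrative fields only; the real DDX diagnostics differ."""
    activity_id: str     # tags one customer I/O across every hop for log correlation
    responder: str       # which node produced this response
    queue_us: int        # time the request sat in queues on the responder
    network_us: int      # time the responder attributes to the network
    media_us: int        # time spent waiting on storage media

def flag_slow_network_nodes(responses, threshold_us=500):
    """Aggregate responder-reported network time and flag nodes whose average is
    suspiciously high, e.g. a node whose cable is starting to fail."""
    samples = defaultdict(list)
    for r in responses:
        samples[r.responder].append(r.network_us)
    return [node for node, values in samples.items()
            if sum(values) / len(values) > threshold_us]
```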
Starting point is 00:28:08 Lastly, we wanted to maintain our agility. Things move fast in the cloud, and we need the ability to evolve our protocol to match our customer demands, to implement new features that are important to us and our customers, and to take advantage of opportunities that are popping up. Now, none of this is to say that the off-the-shelf protocols are bad. They're obviously not. They're super useful as a common standard to allow clients that you do not control to access storage. In this particular case, though, we control the client, and so baking in these additional features is advantageous to us. Now, in the talk that follows this one, we'll be talking about the Azure X-Store architecture,
Starting point is 00:28:46 and we'll return to some of these other protocols there. Another interesting thing is our use of RDMA. So if you've been coming to this conference for a while, you may remember that I've been here for several years talking about RDMA in terms of SMB Direct in the Windows file server. Now, when we designed and built SMB Direct, it was to support the Windows file server at first, and we could have built that functionality directly into the SMB client and server. Instead, we chose to implement it as its own layer that would hopefully
Starting point is 00:29:26 one day be generically useful as an RDMA transport protocol. And so we finally realized that vision with Direct Drive, and we have stolen a page from Server 2012's book, and we have slotted it in underneath our DDX initiator and target. What this means is if RDMA is present, we can take advantage of it. If it's not, or if it is faulty, we can fall back to TCP. And we could do this without having to go implement, again, more custom RDMA protocols. We just slid it right on top of the existing SMB Direct. And in fact, today, the vast majority of Azure disk traffic is transported on RDMA using SMB Direct.
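The transport selection can be pictured with a small sketch like the following; rdma_connect() is a placeholder standing in for an SMB Direct-style channel setup, not a real API, and the fallback policy is deliberately simplified:

```python
import socket

def rdma_connect(endpoint):
    """Placeholder for an SMB Direct-style RDMA channel setup; in this sketch it
    always fails so that the caller demonstrates the TCP fallback."""
    raise ConnectionError("no RDMA path available in this sketch")

def connect(endpoint, rdma_present: bool = True):
    """Prefer RDMA when the fabric is present and healthy; otherwise, or on
    failure, fall back to plain TCP."""
    if rdma_present:
        try:
            return rdma_connect(endpoint)
        except ConnectionError:
            pass                          # missing or destabilized fabric: fall back
    return socket.create_connection(endpoint)

# conn = connect(("bss-0.example", 5005))  # hypothetical BSS endpoint
```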
Starting point is 00:30:23 Another interesting feature is our use of SCM, so non-volatile memory. SSDs are fast, but not as fast as we'd like. And they have some interesting characteristics in terms of how reads and writes interact with garbage collection. And so periodically you will get unlucky and you will observe high-latency IOs as there's interference in the flash translation layer. So consistent performance is very important to us and our customers. And one of the ways that we address these issues with SSDs is that the BSSs are able to promise back to the CCS after they have made data durable in non-volatile memory. Now, in this example, I'm showing that for every slice we have a sequenced log data structure that's built into the SCM.
Starting point is 00:31:11 We'll accept writes into that, and as soon as we've made them durable, you know, using whatever techniques are necessary for the SCM (we have sort of an abstraction layer built into our software), we're able to promise back to the CCS. Now, our goal is to allow these writes to sit in the non-volatile memory for as long as possible, just so long as we don't fill it up. Now, the advantage here is that if reads come in later, we may be able to service them entirely from NVDIMM, which is going to be much faster than SSD. But even in the write path, we have a big win in terms of write aggregation. If you
Starting point is 00:31:51 have workloads that are writing and overwriting data, you know, for example, if you had a workload that, you know, overwrote sector zero a hundred times, we don't actually have to perform a hundred IOs to durable media, right? We just write the last one out. So in fact, all of those get papered over and we do less IO to our SSDs. So when the NVDIMM starts to fill up, that's when we'll perform an operation called destaging, where we will take IOs, we'll drain them to the SSDs, reclaiming space in the SCM and allowing us to accept more operations.
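Here is a toy sketch, not the real BSS code, of the staging behavior described above: promise once the write is in the non-volatile log, coalesce overwrites, serve reads from the log when possible, and destage to SSD only when the log fills. Persistence details are omitted and all names are invented.

```python
class SliceStagingLog:
    """Toy per-slice staging log in non-volatile memory: promise writes once they
    are in the log, coalesce overwrites, and destage to SSD only when the log
    fills. Real durability and flush details are omitted."""
    def __init__(self, capacity_sectors: int, ssd: dict):
        self.capacity = capacity_sectors
        self.pending = {}                 # sector -> latest payload (overwrites coalesce)
        self.ssd = ssd                    # a dict standing in for the backing SSD

    def write(self, sector: int, payload: bytes) -> str:
        self.pending[sector] = payload    # durably logged in SCM, so we can promise now
        if len(self.pending) >= self.capacity:
            self.destage()
        return "promised"

    def read(self, sector: int):
        if sector in self.pending:        # served straight from NVDIMM, the fast path
            return self.pending[sector]
        return self.ssd.get(sector)       # otherwise fall back to the SSD copy

    def destage(self) -> None:
        """Drain only the latest version of each sector to the SSD, then reclaim."""
        self.ssd.update(self.pending)
        self.pending.clear()

log = SliceStagingLog(capacity_sectors=1024, ssd={})
for _ in range(100):
    log.write(0, b"newest data")          # one hundred overwrites of sector zero ...
log.destage()
assert log.ssd[0] == b"newest data"       # ... turn into a single write at destage time
```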
Starting point is 00:32:44 The last notable feature that I wanted to talk about was sort of an interesting throttling system that we have. So before we talk about that, I think it's useful to talk about what traditional IO throttling usually looks like. So usually, you'll receive I.O.s from the VM, and you will then accumulate them in your disk client, and you'll apply your throttle in the initiation path. In other words, you would hold I.O.s and not issue them to the back end until they match whatever rate you had configured for your disk. And then the back end will process them, complete the IOs back to your client, and then you usually complete them back to the VM as quickly as you can. The potential problem with this is that by the time you've issued the I.O. to the back end, you have to process it now. And anything that might slow down the processing of those I.O.s increases your chance that you're going to miss your completion deadline. So we'll take a quick look at an example of that. So in this case, we see that our VM has issued a burst of I.O.s to VDC.
Starting point is 00:33:44 And if we had performed traditional throttling, VDC would hold those and space out the issuing of those to the back end to the configured IOPS, which in this trivial example is four, four IOPS, the world's slowest disk. The back end would complete those, processing those IOs as fast as it can, completes them back to VDC, and then VDC hands them back to the VM as fast as it can.
Starting point is 00:34:09 And you can see that in this example, we're meeting our configured IOPS rate for the disk. Now, the challenge for this traditional type of throttling can be seen here. So I've introduced that latency bubble. Now, this might be a spike in load on the back end. It might be a burst of network congestion. It almost doesn't matter. There are many ways that things can slow down. And you can see that because we've held the I.O. in VDC, the yellow I.O., and issued it on schedule to the back end, any slowdown causes us to miss our completion target for that yellow I.O. What we've done is something called completion-side throttling. So we have flipped
Starting point is 00:34:53 the pattern around. VDC will receive I.O.s from the VM and issue them to the back end as fast as possible. And then we'll delay completion of those IOs to the VM to match the configured IOPS rate. So in effect, we're allowing the backend to sort of work ahead of time. You can get your work out of the way before it's actually due. So this is an example of what that looks like.
Starting point is 00:35:21 The same burst of requests comes in from the VM. In this case, VDC will just issue them as fast as it can to the back end, and the back end processes them at the rate that it can. And then VDC holds those completions and spaces them out to meet the configured IOPS. Now I'm going to introduce the same latency bubble. And in this case, because we started processing the IOs ahead of time, if you will, it doesn't matter. So the completions will still meet the configured IOPS rate for the disk. Now, of course, this doesn't allow us to paper over every delay, but hopefully the design of the system and the way that we have instantiated this architecture in hardware means that delays longer than our tolerable limit are fairly rare.
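A minimal sketch of completion-side throttling, assuming a simple fixed-interval pacing policy rather than whatever VDC actually implements: completions are held until they are due, and a completion that arrives after its due time is passed through immediately, which is also how the question that follows gets answered.

```python
import time

class CompletionSideThrottle:
    """Pace completions to the VM at the configured IOPS; I/Os were already
    issued to the back end as fast as possible. A late back-end completion
    (one that arrives after it was due) is never delayed further."""
    def __init__(self, iops: float):
        self.interval = 1.0 / iops
        self.next_due = time.monotonic()

    def complete(self, backend_result):
        now = time.monotonic()
        # Next completion is due one interval after the previous one, but never
        # in the past: an already-late completion goes straight through.
        self.next_due = max(self.next_due + self.interval, now)
        if self.next_due > now:
            time.sleep(self.next_due - now)   # hold the completion until it is due
        return backend_result

throttle = CompletionSideThrottle(iops=4)          # the talk's four-IOPS example
for i in range(4):
    result = throttle.complete(f"io-{i} data")     # back end already finished this I/O
    print(round(time.monotonic(), 2), result)      # completions arrive ~0.25s apart
```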
Starting point is 00:35:58 All right. So the question is, if the delay... Oh, I see what you're saying. Yes, so the question is that if we move that bubble higher, and the two completions, the green and yellow completions, were to come back after, I think what you're asking is where they should happen, right?
Starting point is 00:37:06 No, I'm saying, the blue bar at the top, the red, and the blue bar at the back, after the red was already done. Right. And then present the red, and the blue bar at the back.
Starting point is 00:37:16 Oh, I see what you're saying, yes. So, yeah, we would know when the IOs were issued and when they were due. And so, if we got a late completion like that, we wouldn't throttle it. Like, in some sense, it's already due. There's no throttling necessary, and we would just allow it through. How does the VDC discover the CCS? Right.
Starting point is 00:37:41 So when... Repeat that. Yes. When the question was, how does VDC discover CCS? So when VDC goes to mount a disk, it's going to issue a control plane request to MDS. And it's going to present a token of authorization that says, I assert that I am authorized to mount this disk, which MDS will check. And assuming that everything checks out, what MDS is going to hand back to VDC is the slice map. So you'll get back a data structure that says, okay, you have a disk that's of such and such size with such and such properties.
Starting point is 00:38:24 And here are all the slices, and for each slice, here's the CCS, and here's all the BSSs. So now VDC has that information, and when it receives an I.O. from the VM, it can consult that slice map and determine, you know, for writes, it has to go to this CCS.
Starting point is 00:38:40 For reads, I have any of these BSSs that I could issue the I.O. to. And if that map changes in the future, it's forced back to the MDS to refresh it? Right. So the question is, if the map changes, how does VDC become aware of that? Right. So if the map changes, which could happen for error handling purposes. So, for example, assume that a BSS dies. If the back-end discovers that first, before the client does, we will have reported that to MDS. MDS will have told us to kick it out of the replica set and replace it with a new BSS, but VDC may not be aware of that. If VDC were to issue an I.O. to that BSS and it simply wasn't there anymore, VDC would report an error to MDS saying, like, hey, I can't talk to this person that's supposed to be there.
Starting point is 00:39:31 If the BSS happened to still be reachable on the network, some of the LSN sequencing that I discussed earlier would allow us to detect that you're talking to a stale member of the replica set. In other words, that BSS would be able to assert to VDC like, hey, you're not up to date. I'm no longer the person you should be talking to, at which point VDC would again go ask for the updated replica set. ...the way you've pushed so much that was traditionally implemented in the client into the back end. So is the replication, the topology, aware of any of that? Sorry, can you repeat that? Is the replication topology aware? Is the replication topology aware?
Starting point is 00:40:20 Well, I mean, the CCS is directly aware of which BSSs are in it, because it has to replicate to them. I think maybe I don't understand the question. Oh, I got you. You're referring to, like, chain replication. No, in our system, at least today, the way that we've configured it, we do parallel replication for performance reasons. Yeah, so the question is, by doing completion-side throttling, do we see increased pressure on the back end
Starting point is 00:41:12 because we're allowing sort of a flood of requests to come in? No, not really. Completion-side throttling is an interesting optimization, but we're not completely tied to it. In fact, I think I had a footnote. Yeah. If it were to become problematic, we can simply start applying traditional initiator side throttling. So if we had a VM that was just like going crazy and issuing just a ton of IO,
Starting point is 00:41:44 we might allow them to use this optimization for a certain length of time, and then we're going to say, all right, that's enough. We're going to wait for you to sort of settle down before we allow you back into this mode. So we do have defenses against that problem. What are the limiting factors for MDS scalability? It's an interesting question.
Starting point is 00:42:15 The MDS in normal operation is not doing very much. It knows the map of which disks exist and which slices there are in the replica sets. But for the most part, it just sits there and it's waiting for disk mount requests to come in so it can return the map, or, you know, for error reports to come in so it can issue corrective action.
Starting point is 00:42:41 Now, even in the error report case, it's not an incredibly CPU-intensive operation, right? It's collating the error report information it gets, and then it's deciding who to kick out of the replica set and who needs to rebuild from who. So we don't see a ton of scalability issues with MDS. That being said, if we started to observe that problem, it would be relatively easy to just allow multiple MDS rings. The CCSs and the BSSs sort of don't care. As long as they get an authenticated message, they're happy to do what they're told. Well, if one MDS becomes a hotspot, how does it get handled? Usually the way that we handle this is by avoiding the problem in the first place. So if we're deploying a unit of Direct Drive storage, we know how much load we
Starting point is 00:43:48 intend to put on it in total, and we would design our MDSs such that they could handle that load. Now, if we had an MDS that, you know, for some reason was misbehaving and was slowing down and not performing the way that we needed it to, we would just fail over to one of the other MDSs in the ring. Sorry, I couldn't hear. Oh, I got you. So you're talking about some value-add processing that you want to do in the back end. I mean, we don't have that today. This is a disk subsystem, and there is no common disk... But if there were a big enough ask to do some particular operation, there's really nothing stopping us from, you know, transmitting that. What I will say is that with the use of RDMA and NVDIMM,
Starting point is 00:45:29 we actually experienced that the load on our back-end nodes is fairly low. You know, these nodes are not running at like 90%, you know, CPU or anything like that. Sorry, I can't hear you. Right. Failover with hotspots. Right. So I had mentioned failover in terms of RDMA and TCP. Is that what you're referring to?
Starting point is 00:46:05 Okay. right. So just like the Windows file server, the DDX layer is aware of the network that sits below it. So if there's RDMA capabilities, we're going to try to use them, right? But if we discover that a certain backend element is unreachable via RDMA, or if the RDMA fabric became destabilized for some reason, there's heuristics and logic built into our DDX initiator that would allow it to detect that and then decide that falling back to
Starting point is 00:46:43 TCP would be the better option. And then it probes and it will attempt to fail back to RDMA. It looks very much like the Windows file server multi-channel capabilities. What kind of block storage are we supporting? Yes, we have 4K native disks and 512E disks. You mean in terms of the VMs being able to attach metadata? No, not today. Right, so the question is what interface do the guest VMs see? Today, the VMs will see a SCSI device. We started with SCSI simply because it's the most widely supported disk interface for all of the operating systems we see our customers bringing to the cloud,
Starting point is 00:47:32 and it's obviously incredibly mature. So, yes, they'll issue SCSI commands. Those SCSI commands will be delivered through VDC proxy to VDC via the Hyper-V virtualized storage stack. And then in VDC, we convert SCSI commands into our own internal command set. So the question is, will we see this disk architecture on Windows in the future?
Starting point is 00:48:01 I don't know. I have no product announcements to make today. Today, all of our production use of Direct Drive actually does use NVDIMM. And even in our test environments, we can use virtualized NVDIMM. So the VMs believe that they really do have NVDIMM there. NVDIMM is not really a stretch anymore. I mean, there was a time, I guess, when it was fairly expensive, but not so much anymore. And you don't need a tremendous amount of it.
Starting point is 00:48:41 Like I had mentioned earlier, we do allow data to remain in NVDIMM for as long as possible, and then we destage just to keep it ahead of filling up. But that's one of those things that you can right-size for your workloads. So if you had an extremely small cluster, you're going to have a commensurately smaller workload placed upon it, and you would need less NVDIMM to service that. So the question is, do we really need NVDIMM, or are we really just looking for fast block storage? Today, we're counting on having byte-addressable non-volatile memory. So the things that we're putting in our logs, obviously we need to put the actual data,
Starting point is 00:49:30 the sectors that customers are writing into NVDIMM, but there's additional context there. Thank you. And so for every I.O., besides the data itself, we also need to store what disk, what slice, what's the sequence number, all this stuff. And if we had to consume whole blocks for those relatively small data records, that would get really inefficient. And so we count on having byte
Starting point is 00:49:54 addressable memory. So folks, we're out of time for today. I'm going to be around the whole conference. If there are still questions, please feel free to grab me. I'm happy to discuss this at length. Thank you. Thanks for listening. For additional information on the material presented in this podcast, be sure to check out our educational library at snia.org slash library. To learn more about the Storage Developer Conference, visit storagedeveloper.org.
