Storage Developer Conference - #189: Behind the Scenes for Azure Block Storage Unique Capabilities

Episode Date: April 25, 2023

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode number 189. Welcome to the session on behind the scenes for Azure block storage unique capabilities. I'm Yiming, Principal Product Manager from Microsoft. I'm very excited to be here.
Starting point is 00:00:52 The reason is that I eventually got to talk about something I've worked on for nine years, since I started my career working on the first storage client driver supporting the SSD-based block storage, premium disks. And I'm here co-presenting with Greg. I'll let Greg do his intro. Hello, everyone. I'm Greg Kramer. I'm a software architect in Azure Storage. I'm actually here today filling in for a colleague who couldn't be here, Malcolm Smith. He works on a related team in Azure Storage. So pardon me, I'm going to do my best to represent him and his team, but I may have to defer some questions till later. All right, let's get started. So we'll do a quick kind of walkthrough of what is Azure Storage, and then we'll jump
Starting point is 00:01:46 directly into the unique capabilities we offer, but more as how we are able to build them on top of our XStore stack. We'll for sure leave some time at the end for Q&A. All right. So for Azure Storage, we offer, you know, a comprehensive portfolio of storage offerings. We have disks, which is the block storage offering. You know, going forward, I'll just use this to refer to Azure block storage. And then we have, you know, Azure Blobs, which is the object storage, and Azure Data Lake Gen2, which is, you know, built on top of Blobs and provides HDFS support. Lastly, we have the file storage and also hybrid storage solutions. So if I actually look at all the storage offerings we have and then overlay them on top of our stack, you could see that we have Azure Blobs, Files, and also part of the Azure block storage
Starting point is 00:02:42 standing on a single storage architecture, which is, we call it XStore. And on top of this architecture, we actually provide optimizations for each specific services so that you have different performance and price point and interfaces for the targeted workloads. And then for the block storage, we also have it on a new stack, which is what Greg delivered the previous talk about direct drive. So as this talk is specifically on block storage, we'll just zoom in and take a
Starting point is 00:03:14 look of our block storage portfolio. So going from the right to the left, you can see that the performance and the scale goes up. So for standard HDDs, standard SSDs, and premium SSDs, those are hosted on our X-Store stack, where the premium SSD version 2 and then UltraDisk is on the Direct Drive stack, which is the newest, latest stack that is optimized for performance. Specifically on today's talk, we are going to talk more about XStore. The reason is that we have evolved the XStore stack for past years and added a lot of comprehensive capabilities on the XStore stack that is best fitted for lift and shift workloads and all that. So I will do a quick kind of high-level overview of the X-Store stack. So for X-Store, it is a three-tier stack, three layers starting from the top, which is our front-end layer.
Starting point is 00:04:11 This is where it's, you know, the front-end actually process all the incoming traffic. It provides the protocol endpoints, authentication, authorization, and then the logging and metrics. Then we have the middle tier, which is the partition layer. This is where you can see that this is the layer that understands the business logics and then
Starting point is 00:04:30 manages all the data abstractions. That's where it presented the data as in blobs, files, and block storage that you see at the different services. You can consider this as a massive scalable index, which actually are able to kind of map each of the objects to the specific streams and the data stored in stream layer. The last layer is on the stream layer. This is where it actually writes the bits into the actual hardware. And this is where it managed the replication
Starting point is 00:04:58 and also the distribution of the data across servers within a cluster. It sits on a JBONT, and then it's an append-only file system. So in the later talk, we're going to talk about one of the specific replication or redundancy options we offer, which is called zonal redundancy storage, and we'll come back to the stream layer. So, you know, for XStore, there is actually a paper on this, so you can find it online. We also did a, you know, a talk in Siena, I think, a couple years ago specifically on this. So there are a lot of materials online about XStore if you're interested to know more.
Starting point is 00:05:36 So for today, we are going to talk about specific unique capabilities that we support on XStore. We picked three, which we think are very representative that only Azure offers compared to any other cloud providers. The first one is regarding of instant snapshot and restore. So this is critical for any enterprise workloads, which they are using block storage. They will want the best RTO and RPO they can offer on any cloud. Second is that we'll talk about the data access over different protocol paths. So we have the single X-Store stack where you can actually have both SCSI and also REST, you know, different paths for you to do data access on your storage. Lastly, we'll talk about the highly available storage, which is our zonal redundancy storage,
Starting point is 00:06:29 which is synchronously replicated across three availability zones in Azure environment. I'll put a disclaimer first, as we will be focusing talking about the capabilities and also the engineering design. So specifically on how these capabilities applies or offered on specific SKUs, we recommend you to refer to our product page for that. All right. So let's talk about instant snapshot create and restore.
Starting point is 00:07:00 So what is a snapshot? So basically snapshot is a point-in-time capture of the state of your disk. We support so-called incremental snapshots, as we only capture the changes between your snapshots and changes on your disks on the snapshot. So you'll be saying that, oh, it's just a snapshot, kind of doing a versioning. what's so fancy about it, you know. What is unique is that we are able to provide snapshots in the choice of your storage, meaning that if, let's say, your disk is hosted on SSD, you will be able to create a snapshot either on SSD or HDD and be made instantaneously available. Similarly, on the restore path, you could have your snapshots in HDD, which you store
Starting point is 00:07:47 for a cost savings, you know, cheapest storage you can buy. And then when you're trying to restore, you'll be able to quickly create a disk in a higher performance tier, which is referencing using the, you know, snapshot in a lower tier storage. also made it instantaneously available. So how do we orchestrate this flow? Let's say the large cylinder on the right, that represents your disks. And then you have the pages written in these disks, which is colored in green. So, you know, we use a coloring just to show that you can write different page ranges, but we are an append-only system, so it's not really that you're actually kind of located in different,
Starting point is 00:08:30 you have kind of gaps in between, but just an indicator as the page ranges you have written. So what we do first is that we create something called a reference blob. What this reference blob has is that it has a shallow copy of the, you know, disks where we capture what is the pages and indexing for the pages you have written and also the properties of the disks. So in this reference page, when it's getting reference blob, when it gets created, it basically have a standalone lifecycle of the disks. At this point, if you go and delete the disk, it's perfectly fine. The reference blob has all the information that is needed to construct a snapshot.
Starting point is 00:09:12 Our next step is to create a copy on the target cluster. So going back to the storage of your choice, your target cluster can be completely different than your source cluster. It can be hosted on HDD versus your source cluster is all on SSDs. So in this case, we create something as a copy blob. What is unique about this copy blob is that we have a technology called copy on read. The moment you create this copy blob, we will be matching, doing a background copy of the data from the source into this copy blob. Similarly, if you have new reads that is made to this copy blob, and if the data has not yet been replicated from the source, we will go back to the source and actually fetch the data,
Starting point is 00:10:00 persist it in this copy, and then return it to the user. So in this case, the copy blob on the target cluster is basically the snapshot that we actually fronted to the user, where the reference blob is something which is the internal implementation, and that will be hidden on the user end. When the copy is completed, at that point, we are good to delete the reference blob. You no longer need it.
Starting point is 00:10:27 So our snapshot is also incremental. So what that means is that you could have, let's say, you wrote the green areas, and then you create a snapshot. Then you make some changes regarding of your disk, and then write the yellow part, which are your changes. The flow is very similar, where you'll go and create a reference blob. And then from a reference blob, you will go and create a copy on the target cluster, where Azure will dynamically identify the changes of the latest snapshot versus your previous snapshot that is made available in the target cluster. So in this case, if you have the green snapshot, which is already available, the new snapshot will only capture the changes, which is the yellow piece. But let's say you don't have a previous snapshot. In that case, the new snapshot will capture both the green and yellow area.
Starting point is 00:11:33 So we just talked about create, where we'll talk about the restore part as well. So when we say snapshot restore, this is where you're able to create a disk from a snapshot. So the restore flow is slightly more complicated. As you can see, the moment the disk is created, you need to handle both read and also write traffic from the user. In this case, the green kind of cylinder is the snapshot you have created. Same first step, you'll go and create a reference blob. Same idea. This is where you have a shallow copy. The moment the reference blob is being created, you can go ahead and delete your snapshot, and that will not impact the lifecycle of reconstructing the disk.
Starting point is 00:12:11 Your next step is to create a copy blob on the target cluster. So your target cluster can be HDD or SSD-based, regardless of where your snapshot is being stored. So in this case, if you're using, you know, HTT for cheap, you know, snapshot copy, you can still immediately actually create a high-performance disk out from that snapshot. So as the disk is getting, as the, you know, snapshot is getting restored, we also create something we call a differencing blob,
Starting point is 00:12:42 which is there to actually handle the new coming write. So both the copy blob plus a differencing blob, which is there to actually handle the new coming write. So both the copy blob plus the differencing blob together composes the new disk. From a user perspective, they will only see a single top-level resource, which is a disk, but in the background, we actually have these two blobs, which is hosting the data for this one disk. And the copy blob and the differencing blob are there for the entire lifetime of the disk. So here you may have a question as, okay, so when you have a differencing blob, you'll be writing data into that differencing blob, and then it will overwrite the pieces of ranges which you have copied over. So you have some inefficiency here where, you know, same range, you can have an old copy and then a new copy.
Starting point is 00:13:31 So we handle this as the moment when you have the page ranges which is overwritten, we will actually mark the original page ranges, and then we have our existing garbage collection process to actually reclaim that resource. So we are, you know, going back to XStore, we are a pen-only system. So this is something which we do it not only for the scenarios, but all the scenarios when you have a pen-only system.
Starting point is 00:13:54 That's where you'll be marking, you know, the overwritten kind of the page ranges or the data sectors and then leveraging a garbage collection process to kind of free up those resources. So in this case, I think if you look at it, you know, we kind of walk through regarding of the snapshot restore flow. So I wanted to kind of just quickly go through the data path so you have a better understanding of how we are actually, you know, serving the read and write traffic
Starting point is 00:14:23 on this newly restored disk. So let's say if you have a read which hits the new disk. What it will do is that it will first actually hit the differencing blob. So it will try to figure out whether it's in the differencing blob as I've been recently written. If it's not available in the differencing blob, we'll go and hit the copy blob. So if the data is not yet being background copied from the source to the copy, we will go back to fetch the data from your source. So that's for read. Write is much simpler.
Starting point is 00:15:00 Write will always go to the differencing blob. Same idea, you know, copy blob and also the differencing blob will live for the entire lifecycle for the disk, where when a copy is being completed, you can go ahead and delete the reference blob. As you can see, that's the way Azure, we are able to support the instance snapshot restore and create. It's all, you know, leveraging the copy-on-read technology that we have,
Starting point is 00:15:26 where in the copy blob, we have the orchestration of reading the data from the source and also being able to fetch the data on the fly if there are incoming reads which are not already being persisted in this copy blob. What are the other use cases for core? So copy on read, we just call it on core going forward. The other use case we leverage core is for provisioning VMs from a single golden image. So let's say you kind of have based a new kind of Linux distro, and then you wanted to do a scale testing from that. What you are most likely going to do is that you have a single golden image, and then you will create multiple VMs from that. What you are most likely going to do is that you have a single golden image and then
Starting point is 00:16:05 you will create multiple VMs from that single image. In this case, all the VMs will be using the same golden image for bootstrapping. And then all the VMs will also need its unique disks for capturing the differences, which are the latest changes. You would also want the scale of these new VMs to be very fast because if, let's say, you're scaling up your VMs for a production workload to handle additional traffic, you don't want it to be there waiting for your VMs to come up. As you can see, this will easily impose a challenge where all the VMs started to read from the same golden image,
Starting point is 00:16:46 where the performance of that image will be limiting on the factors of how your VMs and your disk performs. The way we handle this problem is by constructing a copy on read tree. So on the top, you can see that is the image. And then we will go and spin off new source cores that are pointing back to this image. So the moment the source core is getting created, it will be reading the data from the image and then populating the source core. Then you have child disks, which is pointing back to the source core. So each of the child disks will have their own core and also a differencing blob, just like the disk restore scenario we just talked about. In this case, you can see that, let's say you have a VM that disks, you're trying to read from that disk.
Starting point is 00:17:39 The read to that disk will first go and attempt to find the data in the child core, then going back to the source core, then going back to the source core, and then back to the image. The reason why we structure it this way is that you can see the middle layer regarding of the source core is basically copies of the images where you could create them instantaneously, and then in the background, pulling the data from the image so that you're not having all the childs going back to the source image and then in the background, pulling the data from the image so that you're not having all the trials going back to the source image and then create a bottleneck there. We could scale out the source course as basically adding more of the source course or kind of scale it in a vertical manner.
Starting point is 00:18:19 It depends on what is the scale that we are talking about in the VM provisioning scenario. All right. As you can see, for constructing a core tree for the VM provisioning scenario, we need to also tightly manage the lifetime of the source core. Because when the child has all the data written, you know, all the data, you know, replicated, you don't want it to keep the source core forever because that will add additional cogs for your system. There's inefficiency there. So we need to manage the life science of the source core. To do so, you know, classic kind of a referencing counting issue, which is that you need a reference on the source to actually track the dependent children. So in that example, each of the source core have like six children. So in this case, we need some manners to actually know that,
Starting point is 00:19:17 oh, what are all the children that has dependency on the source core? So you can't go ahead and delete the source core if they are still dependent children. This is quite challenging as if we can't really do a simple reference counting with numerical values, because it's very hard for you to guarantee that the add reference operations is applied exactly once. So think about this in the distributed system, where you could have packet loss, and then your child will go and retry
Starting point is 00:19:41 trying to do this add reference, and then easily lead to incorrect reference count on the source core. The way we address this is that we kind of look at what is our design principle to kind of managing the lifecycle for the source core. First thing we wanted to guarantee is that a child should only have a single reference on the source core. Then, to achieve that, we are implementing an add reference call, which transnationally records the identity of the child. The identity of the child includes the UI of the child and also authentication information for the source to actually make calls against the child.
Starting point is 00:20:23 We ensure that the add reference call is idempotent so that in each of the source, you only have one record for that unique child. The diagram showing here is just an example. As on the source, you have six childrens, and then you will have kind of a list of all the identities of the children that has dependency on the source core.
Starting point is 00:20:48 Still, this is not perfect because how do I ensure that a child would only have a single reference? Because in the creation of the child, you need to solve the problem as whether you wanted to create a child object first or take the reference first. So it's hard to guarantee that both will happen simultaneously. So the decision we have taken is that we will always first take the reference. So what it means is that whenever you're creating a child, which is spinning off from the source core, we'll go ahead and first take the reference on the source, then create the child object.
Starting point is 00:21:26 That means that you will potentially have leaked reference as when you take the reference and then attempt to create a child, the child object may failed or there are reasons that it's just kind of deleted or the call itself, we just don't want it to proceed forward. So you could have leaked references in this case.
Starting point is 00:21:46 We approached this problem to do the reference management from two sides. One is that on the source aspects, what we do is that whenever the source core get a call from any of the child on a delete reference, or it receives a delete for its core itself from an upper level resource manager, the source will explicitly call a break reference across all the children. So no matter whether you have, you know, kind of called the delete reference or not, we'll attempt to break reference. And if the break reference call actually failed due to the reference not found by using the identities that is stored around stored on this child we will treat this as a leaked reference what we'll do is that we'll go
Starting point is 00:22:32 and delete the reference we also look at it from the child aspects where whenever you're trying to delete the child you'll first go and delete the reference on the source. So we have a process in place to retry on the delete reference to ensure that it eventually will succeed. And this is what we do in the way of deleting a child. So you can see that the delete reference is a sole responsibility on the child side. So it's quite a journey where we started first talking about the instant snapshots, restore, and creation. That's where we introduced our copy-on-read technology. Then we talked about how we are using our copy-on-read technology for doing VM provisioning.
Starting point is 00:23:25 And then we shared about in this way of using a core as constructing a core tree, how do we handle specific challenges in doing the reference counting? So I'll wrap this part up and then I'll turn it to Greg to talk about our next unique capability, which is supporting data access over different protocol path.
Starting point is 00:23:53 So one of the interesting capabilities of the X-Store architecture is the ability to support access to your block data, not just through traditional block protocols like SCSI, but also through additional protocols. We'll talk about that a little bit. You can get SCSI out of the way quickly here. Obviously, the traditional way to access your disk data is going to be to attach that disk to a VM running in the cloud. SCSI is probably one of the older and most mature protocols
Starting point is 00:24:23 for giving access to that data. And so, of course, we support that. In the last talk that I gave, someone asked about NVMe. And the answer I gave there was that we don't have anything specific to announce, but in choosing which block protocol we were going to go after first, SCSI was sort of the obvious choice because it is the most widely supported. I mean, one of the things that we see is that customers, you know, bring their solutions from on-prem into the cloud, and oftentimes these can be based on quite old operating systems that may not actually speak NVMe yet.
Starting point is 00:25:06 So SCSI was sort of the obvious choice here to start with. The interesting use case, though, here is the use of REST. And so the XStore architecture allows you to access your block data through HTTP REST requests. Now we can go through an example of why that might be interesting. So if you consider a sort of standard disaster recovery solution, you might want to replicate your data from one region to a completely different region, so in case it gets taken out, you still have access to your data. And the standard way of doing this would be to stand up another VM in the secondary region and then use some sort of, you know, technology to replicate your data. So Windows Server, for example, has block replication built in or rsync or there's a zillion ways that you might do this.
Starting point is 00:26:06 The downside of this approach, though, is that you actually have to run a second VM. VMs cost you money, right? And so sitting there with that other VM replicating the data into your second region, it's just costing you the entire time it's up and running. What you'd really like to do is to be able to move that data to your secondary region without having to run the VM. Now, had we only had block storage protocols like SCSI, that would have been your only choice. But with REST, you can move that data to your secondary region without having to run the VM. So we can take a quick look at what that might look like. In your primary region, you can take snapshots,
Starting point is 00:26:51 which Eman has already spoken about, and the snapshots support incremental data transfer. So these snapshots exist as objects that are accessible through the blob store using REST APIs, which means that you can write software using REST that's capable of taking those snapshots, copying them to your secondary region, and then as new snapshots are created in the primary,
Starting point is 00:27:16 you just pull the diff using the existing REST APIs and then apply those diffs in the secondary location. The way that the REST paths is implemented is in the FE or front-end layer. So the FEs are sort of like gateways to the storage service in XStore, and the FE is broken up into two layers. So the protocol layer is where we can introduce new APIs, new ways of accessing the data. So this would be where our existing network requests come in from the SCSI initiators, but also our REST APIs are implemented here.
Starting point is 00:28:02 Underneath that layer in the FEs is what they call the service layer. The service layer is where we have our internal primitives that allow the XStore storage stamp to get and retrieve data. And so the core idea of the FE is that we can introduce new APIs fairly easily in the protocol layer without anything underneath it needing to be affected or aware of that because it's all using a common internal interface from there on out.
Starting point is 00:28:33 The next thing that we'll talk about is ZRS. So in Azure, we have a replication strategy referred to as LRS. This would be three copies of your data that's created inside a single data center. Now, data centers are fairly survivable entities, but, you know, disasters happen. They are powered by, you know, regional power sources. And for certain workloads, you want to make sure that a disaster that would take out a data center would not prevent you from accessing your data. So different zones are physically separate areas within an Azure region that are fed by independent power sources or have independent cooling and are meant to stand alone so that if one zone goes down, the others are very likely to survive. ZRS replication allows you to create a disk, perform I.O. to it, and have the XStore system replicate that data for you across those zones. So this is showing an example of an HA setup that you might create using ZRS replication.
Starting point is 00:29:48 So you've created a ZRS disk here, and you have a VM that's doing IO to die, you could take one of your secondaries, spin it up, attach the disk, and immediately pick up where you left off before. Now, this is a unique capability to Azure to replicate your data synchronously like this underneath the covers across three zones. The way that ZRS is implemented is that as writes are performed to the ZRS disk, those writes flow through the FE to the partition layer, the table server, where they're passed
Starting point is 00:30:34 off to the stream layer, the ENs. The ENs then handle replicating that data to the other zones. Reads are issued or could be issued to any FE. In this example, we show that the reads are going to an FE in Zone 2 that passes it through to the table server and the EN succeeding the read. But if Zone 2 were to go down, those reads could be forwarded to an FE in either Zone 1 or Zone 3, and you would have immediateed to an FE in either zone one or zone three, and you would have immediate access to the data as it was last written.
Starting point is 00:31:15 I think we can, you know, we wrapped up our talk, and then if you have any questions, feel free to raise your hand and speak up. All right. Very good question. So let me pull it back to here. So I think, okay. Repeat the question. Oh, sorry. Let me repeat the question. So the question is regarding of how does, you know, XStore related to direct Drive and if that's the backend implementation. So I think going back to this, XStore and Direct Drive are two independent stack where Direct Drive is where we are supporting on the premium,
Starting point is 00:31:59 the newer disk offerings, and then we have XStore supporting the standard HDD to the premium SSDs. So they are standalone, but then we also provide data movements for customers for us to move data between those two stacks. So they are not in the isolated stack, but they are kind of two separate. Does that address your question? Okay. Oh, so the question is that does both app provide SCSI
Starting point is 00:32:31 access from Hyper-V? Is that... No, so if the question is specific, so the question is that can these disks be accessed through iSCSI? So today for both, for disks, no matter it's on the XStore or the direct drive stack, from the guest VM side, it can only be accessed to SCSI. It doesn't support iSCSI. But if you go back to what Greg talked about, our ability to add different protocols on our single X-Store stack, in the future, we could
Starting point is 00:33:19 add more protocols, but there's nothing that we support iSCSI today. All right. So the question is that for the ZRS disk, when do we actually acknowledge the write back to the client? So the way that I think we talked about is that for the ZRS disk, the write to these three zones are synchronously being written across the three zones. So the idea is that for the specific zones that you hit, let's say it was zone two, that's where we have the example, where on the stream layer, it will replicate the data to the
Starting point is 00:34:03 zone one and zone three, only until zone one and zone three completed the write, where on the stream layer, it will replicate the data to the zone one and zone three, only until zone one and zone three completed the write, and also the within zone two write has completed, we'll go back and acknowledge the write has been completed. Yeah? Very specific product question. So I think if my memory is right, today we support 200 snapshots per disk.
Starting point is 00:34:32 And these snapshots, just to kind of hold, you know, drill a bit deeper into this, all the snapshot we provide today on disks, they are standalone, meaning that they have an independent lifetime of the, you know lifetime of the disk itself. And for snapshots, you could create whether it's full snapshot or incremental snapshot. But roughly, I think it's 200 per disk.
Starting point is 00:34:55 I'll go with that first. Yeah, that gentleman. Do you mind repeating? Oh, so your question is that how do we handle the zonal replication if a zone goes down? Did I understand it correctly? Okay. So I think this is where on the ZRS cases that our intent is to actually replicate across the three zones. But if one zone actually goes down, we have an implicit logic that we will be replicating against the two available zones,
Starting point is 00:35:33 and then the other zones when it comes back. So we will try to reattempt on the replication. Good question. So your question is about how do we switch between the disk between X-Store and Direct Drive? So today, we actually, from a user experience perspective, we don't support that. But the way which we are looking to support it is kind of leveraging the snapshots, where you could use snapshot as to create a point in time, then create a new disk on whichever the stack you decided to go with based on the disk you choose and then use that to actually do the restore.
Starting point is 00:36:30 Very much like the copy on read that we just talked about. So I think the question is that for the X-Store and then the direct drive stack, what are the key differences? And then whether it's just on the performance or there's anything else that we wanted to, you know, different considerations that we have on this two stack. So right now, I think if you go back, so for the X-Store stack, it is a three-layer architecture where the direct drive stack, I think Greg talks about, so it kind of removed the layer for both on the front end and also the partition layer.
Starting point is 00:37:28 So that is very much optimized for block level access. So from a performance perspective, direct drive will definitely provide better performance and is optimized for that. I'll let Greg also comment on this in a bit. And then on the X-Store, as you can see, for X-Store, we have these layers. The layers are there for a reason. For example on the XStore, as you can see, for XStore, we have these layers. The layers
Starting point is 00:37:45 are there for a reason. For example, they're there to actually provide the additional protocol support on that. So if you look at how we position these two stacks, the XStore stacks is that we have comprehensive capabilities, which are optimized for lift and shift, because a lot of those you need, you know, iSCSI protocol. Actually, on the same stack, we support our blob and file storage, which you have SMB and then NFS and all that, where on the Direct Drive, it's purposely built for highest performance block storage. Anything? Yeah, the other difference that I would call out is that Direct Drive allows you to independently provision IOPS bandwidth and disk capacity. So instead of buying performance in fixed-size units, you're free to choose the disk size that you need
Starting point is 00:38:34 and then select however many IOPS or how much bandwidth you need based on that. Yeah, go ahead. So the question is that does Direct Drive support Snapshot and copy-on-read? So, architecturally, the answer is yes. I'll say what I said in my earlier presentation, which is for any particular product you're interested in, I would direct you to the product pages. Go ahead. Can you talk about the performance differences? The performance differences?
Starting point is 00:39:38 I think, let me try to see if I can address this on the offering perspective to give you a bit of sense. So it's very hard to directly just say from a stack perspective. We can talk about the product that we offer. So let's say we take the premium, you know, but I do want to put a disclaimer. What you get on the product is not exactly what the stack is able to offer, but I do want to provide some data points so you have some idea. So on the premium SSDs, which is on the X-Store stack, roughly from a performance perspective, it can scale to roughly double-digit thousand IOPS in that range. And then the latency is more around the one millisecond ranges, where on the direct drive on our ultra disks, those ones can easily scale to more than 100,000.
Starting point is 00:40:28 And that's not the max we can do. The stack can actually do better. But right now on the product offering, that's where it is. And then from a latency perspective, it's definitely in a consistent some millisecond range. For the locally attached,. . have this locally-attached storage. Is that, for example, you will have a node that has 64-carat of locally-attached TTI storage
Starting point is 00:41:10 with VPN. Do you serve those two cases with direct storage, or do you have separate services for that? Okay, so the question is that for most of the virtual machines or instance, some of the ones you purchase on cloud, you have local attached storage. Usually, I think each cloud provider stated differently like temp storage, local SSDs and all that. Do we serve it by direct drive stack? So today, no.
Starting point is 00:41:40 So today, all the local SSD or temp drives that you get on the virtual machines on Azure, those are locally attached. They are not sitting on the persistent storage stack. So today, for both XStore and Direct Drive, those are all persistent. We have different redundancy options. But in compared to the local SSDs you're talking about, those are kind of with the lifetime of the VM. And those are not persistent storage.
Starting point is 00:42:05 But they still serve a very good purpose that sometimes if you just do kind of large data analytics workflow where you just need local caching, those are perfectly fine. You don't really need it to be persistent, and it is locally attached. Yeah, go ahead. So the question is that do we support any data reduction capabilities or compression? So no, we don't. From a user perspective, we do not support any data reduction or compressions, which is user-facing. As when you buy a disk on Azure and say you have data reduction and compression. No, we don't support that. Do we plan to? I think that is a very good question. I think we hear very mixed feedback from users on whether they want
Starting point is 00:43:07 this. I think definitely in the end, you know, they don't want it because they want it to do data reduction. They want it because they want it to eventually, you know, reduce the cost they need to pay for the disks. So I think it's just different way to kind of scan the cats as in the end, if you have a TCO which hits their needs, whether you take an approach to directly expose data reduction or compression versus you actually just
Starting point is 00:43:34 give a price which is lower on provision wise, it will serve the same purpose. So I don't, right now we don't have any plans to go in doing kind of customer facing compression or dedupe or reduction. we don't have any plans of closing, you know, sharing, doing kind of customer-facing compression or dedupe or reduction. Yeah. So I think... So definitely on the storage stack, we do optimization so that we can reduce the overall cost.
Starting point is 00:44:11 I can't speak more on what we do, but the idea is that we will... The approach which we have taken is that we will be doing optimization so that from a user perspective, you get the lowest dollar you need to pay for this. Yeah? You mentioned on slide 24 that you're supporting SKZ and REST
Starting point is 00:44:30 and this other standard block storage protocol. Any comments? No. I think on this one, what we are attempting to kind of deliver the message is that without your stack on the FE layer, today we have REST supported there. We could easily add a new protocols besides REST. Right now we don't have any, but, you know, this provides, the stack itself provides the, you know, opportunity for us to add
Starting point is 00:45:02 any other standard industry protocols. Good question. So I think for Azure, we, by default, support something called server-side encryption with platform managed key, which Microsoft basically have the key to encrypt all the data. So if you don't do anything, it's always encrypted by Microsoft managed key. So encryption is there. We also expose the capability of encryption with customer managed key. So in that case, you are able to actually have, you know, kind of upload your key into something we call a key vault, and then use that key in the key vault to do the encryption for the disk.
Starting point is 00:45:52 Yeah. So the question, sorry, the question is that where in the stack that we actually perform the encryption. Greg, do you want to? So I think Aaron is our helper on the audience seat. So Aaron's answer was that on the X-Store stack, the encryption is done on the FE layer. And then we'll let Greg comment on the Direct Drive. Right. So for Direct Drive, architecturally, we support encryption happening either at the disk client or in the CCS for the right path. So it's just a configuration option.
Starting point is 00:46:54 Yes, that's true. Yes, the comment there was that the general guarantee we're making is that before any customer data is committed to a durable media, it is guaranteed to be encrypted. Right. Any questions? No. Oh, one more question.
Starting point is 00:47:29 So the question is that what about data in flight? Did we touch about the different protocols? So I think for both XStore and Direct Drive, from the client, which is where your VM sits, to the storage, we all use a proprietary protocol on that. Specifically on Direct Drive, I think Greg talked about in the previous talk, I'll protocol on that. Specifically on direct drive, I think, you know, Greg talked about in the previous talk, I'll not repeat that. Similarly, on the XStore, we have a proprietary kind of protocol, which talk from the client to the backend. Specifically on this area, it is actually going through kind of the network backbone, which it is strictly kind of
Starting point is 00:48:02 managed by Azure from the computer and storage. So we don't have anything. It's not like in the sense that you have, you know, SMB and then SMB does the specific network additional layer of the encryption on that protocol itself. So we kind of are leveraging our network infrastructure to ensure that the security from the compute to the backend storage communication. Yep. Sorry, I got the last piece. Just let me rephrase and make sure I understand it correctly. You are stating that for the storage stack, we have Hyper-Vs and then our storage client drivers. Your question is that is it in the same box? I'm not sure I get it.
Starting point is 00:49:39 Okay, the question is that is storage a separate server? So our storage are in a different cluster than the compute. So basically, the storage servers are on storage clusters where your virtual machines are in the so-called, we call it compute clusters. So they are remotely attached. I think we are. I don't know if we are running out of time or I think it's. We are at time. All right.
Starting point is 00:50:05 One more question. Yeah. Well, two actually. One, I'm frustrated by the possible differences between the presentation you showed and what is currently uploaded. Will you be uploading a new version? Yes, yes.
Starting point is 00:50:16 I think we uploaded a new version. Probably it's not yet being synced yet. I think we will make sure the latest version is getting uploaded. Very good. And I wanted to point out, speaking of your question, you may follow the new IEEE standards, I think we will make sure the latest version is getting uploaded. So this is a recommendation for us to look at the new IEEE kind of encryption. Yeah, we'll take a look.
Starting point is 00:50:40 Yeah. I see. Okay. Oh, that's great. Then I'll go to that session. All right. I think we're good. And then I think Greg and I will stay around for a bit in case you have any other questions.
Starting point is 00:50:58 Thank you. Thank you. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at sneha.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
