Storage Developer Conference - #189: Behind the Scenes for Azure Block Storage Unique Capabilities
Episode Date: April 25, 2023...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode number 189.
Welcome to the session on behind the scenes for Azure block storage unique capabilities.
I'm Yiming, Principal Product Manager from Microsoft.
I'm very excited to be here.
The reason is that I finally get to talk about something I worked on for nine years, when I started my career working on the first storage client driver supporting the SSD-based block storage, premium disks.
And I'm here co-presenting with Greg. I'll let Greg do his intro.
Hello, everyone. I'm Greg Kramer. I'm a software architect in Azure Storage.
I'm actually here today filling in for a colleague who couldn't be here, Malcolm Smith. He works on a related team in Azure Storage. So pardon me, I'm going to do my best to represent him and his team,
but I may have to defer some questions till later. All right, let's get started. So we'll do a quick
kind of walkthrough of what is Azure Storage, and then we'll jump
directly into the unique capabilities we offer, but more as how we are able to build it on top
of our XStore stack. We'll for sure leave some time at the end for Q&A. All right. So for Azure
Storage, we offer, you know, a comprehensive portfolio of all the storage offerings. We have
disks, which is the block storage offering. You know, going forward, I'll just use this to refer
to Azure block storage. And then we have, you know, Azure blobs, which is the object storage,
and Azure Data Lake Gen2, which is, you know, built on top of Blobs and provides HDFS support. Lastly, we have the file storage and also hybrid storage solutions.
So if I actually look at all the storage offerings we have and then overlay them on top of our stack, you can see that we have Azure Blobs, Azure Files, and also part of the Azure block storage standing on a single storage architecture, which we call XStore.
And on top of this architecture, we actually provide optimizations for each specific service so that you have different performance and price points and interfaces for the targeted workloads.
And then for the block storage, we also have it on a new stack, which is what Greg covered in the previous talk about Direct Drive. So as this talk is specifically on block storage, we'll just zoom in and take a look at our block storage portfolio. So going from the right to the left, you can see that the
performance and the scale go up. Standard HDDs, standard SSDs, and premium SSDs are hosted on our XStore stack, whereas Premium SSD v2 and Ultra Disk are on the Direct Drive stack, which is the newest stack, optimized for performance.
Specifically in today's talk, we are going to talk more about XStore. The reason is that we have evolved the XStore stack over the past years and added a lot of comprehensive capabilities on it that are best suited for lift-and-shift workloads and all that.
So I will do a quick kind of high-level overview of the X-Store stack. So for X-Store, it is a three-tier stack,
three layers starting from the top,
which is our front-end layer.
This is where, you know, the front end actually processes all the incoming traffic.
It provides the protocol endpoints,
authentication, authorization,
and then the logging and metrics.
Then we have the middle tier,
which is the partition layer.
This is the layer that understands the business logic and then manages all the data abstractions. That's where it presents the data as the blobs, files, and block storage that you see in the different services. You can consider it as a massive scalable index, which is able to map each of the objects to the specific streams and the data stored in the stream layer.
The last layer is the stream layer. This is where it actually writes the bits onto the actual hardware. And this is where it manages the replication and also the distribution of the data across servers within a cluster. It sits on a JBOD, and then it's an append-only file system. Later in the talk, we're going to talk about one of the specific
replication or redundancy options we offer, which is called zonal redundancy storage, and we'll come
back to the stream layer. So, you know, for XStore, there is actually a paper on this, so you can find it online.
We also did a, you know, a talk at SNIA, I think, a couple years ago specifically on this.
So there are a lot of materials online about XStore if you're interested to know more.
So for today, we are going to talk about specific unique capabilities that we support on XStore. We picked three, which we think are very
representative and that only Azure offers compared to any other cloud provider. The first one is instant snapshot and restore. This is critical for any enterprise workloads that use block storage; they want the best RTO and RPO they can get on any cloud.
Second is that we'll talk about the data access over different protocol paths. So we have the single X-Store stack where you can actually have both SCSI and also REST,
you know, different paths for you to do data access on your storage.
Lastly, we'll talk about the highly available storage,
which is our zonal redundancy storage,
which is synchronously replicated across three availability zones in Azure environment.
I'll put a disclaimer first,
as we will be focusing on the capabilities and also the engineering design. For how these capabilities apply to or are offered on specific SKUs, we recommend you refer to our product page.
All right.
So let's talk about instant snapshot create and restore.
So what is a snapshot?
So basically snapshot is a point-in-time capture of the state of your disk.
We support so-called incremental snapshots, where each snapshot only captures the changes made to your disk since the previous snapshot.
So you may be saying, oh, it's just a snapshot, kind of doing versioning, what's so fancy about it, you know. What is unique is that we are able to provide snapshots in the storage of your choice,
meaning that if, let's say, your disk is hosted on SSD,
you will be able to create a snapshot either on SSD or HDD
and have it made instantaneously available.
Similarly, on the restore path, you could have your snapshots in HDD, which you store
for cost savings, you know, the cheapest storage you can buy. And then when you're trying to restore, you'll be able to quickly create a disk in a higher performance tier which references the snapshot in the lower tier of storage, and have it also made instantaneously available.
So how do we orchestrate this flow?
Let's say the large cylinder on the right, that represents your disks.
And then you have the pages written in these disks, which is colored in green.
So, you know, we use the coloring just to show that you can write different page ranges. We are an append-only system, so it's not that the data is actually located in different places with gaps in between; it's just an indicator of the page ranges you have written.
So what we do first is that we create something called a reference blob.
What this reference blob has is a shallow copy of the disk, where we capture the index of the pages you have written and also the properties of the disk. When the reference blob gets created, it basically has a lifecycle that is standalone from the disk. At this point, if you go and delete the disk, it's perfectly
fine. The reference blob has all the information that is needed to construct a snapshot.
Our next step is to create a copy on the target cluster. So going back to the storage of your
choice, your target cluster can be completely different from your source cluster. It can be hosted on HDD while your source cluster is all on SSDs.
So in this case, we create something as a copy blob.
What is unique about this copy blob is that we have a technology called copy on read.
The moment you create this copy blob, we will be doing a background copy of the data from the source into this copy blob.
Similarly, if new reads are made to this copy blob and the data has not yet been replicated from the source, we will go back to the source, actually fetch the data, persist it in this copy, and then return it to the user. So in this case, the copy blob on the target cluster is basically the snapshot that we actually front to the user, whereas the reference blob is an internal implementation detail that is hidden from the user.
When the copy is completed, at that point,
we are good to delete the reference blob.
You no longer need it.
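To make the copy-on-read mechanism concrete, here is a minimal Python sketch of the idea described above: a copy blob that serves reads immediately, fetching and persisting any page the background copy has not reached yet. The class and method names (SourceBlob, CopyBlob, read_page, background_copy_step) are hypothetical illustrations, not the actual XStore implementation.

```python
# Hypothetical sketch of copy-on-read (COR); not the actual XStore code.
# A "copy blob" serves reads immediately; any page not yet copied from the
# source is fetched on demand, persisted locally, and then returned.

PAGE_SIZE = 4096


class SourceBlob:
    """Stand-in for the source reference blob."""

    def __init__(self, pages):
        self.pages = pages  # dict: page index -> bytes

    def read_page(self, index):
        return self.pages.get(index, b"\x00" * PAGE_SIZE)


class CopyBlob:
    """Target-cluster copy blob with copy-on-read semantics."""

    def __init__(self, source: SourceBlob):
        self.source = source
        self.local_pages = {}              # pages already persisted on the target
        self.pending = set(source.pages)   # pages still to be background-copied

    def read_page(self, index):
        # Serve from the local copy if the page has already been persisted.
        if index in self.local_pages:
            return self.local_pages[index]
        # Otherwise fetch from the source, persist it, then return it.
        data = self.source.read_page(index)
        self.local_pages[index] = data
        self.pending.discard(index)
        return data

    def background_copy_step(self):
        # One step of the background copy; returns False when nothing is left.
        if not self.pending:
            return False
        index = self.pending.pop()
        self.local_pages[index] = self.source.read_page(index)
        return True


src = SourceBlob({0: b"a" * PAGE_SIZE, 7: b"b" * PAGE_SIZE})
snap = CopyBlob(src)
assert snap.read_page(7) == b"b" * PAGE_SIZE   # served via copy-on-read
while snap.background_copy_step():             # drain the background copy
    pass
assert not snap.pending                        # copy complete; the reference blob can be deleted
```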
So our snapshot is also incremental. So what that means is that you could have, let's say,
you wrote the green areas, and then you create a snapshot. Then you make some changes to your disk and write the yellow part, which are your changes.
The flow is very similar, where you'll go and create a reference blob. And then from the reference blob, you will go and create a copy on the target cluster, where Azure will dynamically identify the changes between the latest snapshot and your previous snapshot that is made available in the target cluster. So in this case, if you have the green snapshot, which is
already available, the new snapshot will only capture the changes, which is the yellow piece.
But let's say you don't have a previous snapshot. In that case, the new snapshot will capture both the green and yellow area.
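A rough sketch of that incremental capture decision, under the assumption (from the description above) that a snapshot copies only the ranges changed since the previous snapshot when one exists on the target, and everything written otherwise. The function name and the range representation are invented for illustration.

```python
# Hypothetical sketch of incremental snapshot capture; not the actual Azure logic.
# Azure identifies which page ranges changed since the previous snapshot that is
# already available on the target cluster and copies only that delta.

def ranges_to_capture(all_written, changed_since_last, have_previous_snapshot):
    """Page ranges the new snapshot must copy to the target cluster."""
    if have_previous_snapshot:
        # Previous snapshot exists on the target: only the changes are needed.
        return set(changed_since_last)
    # No previous snapshot: capture everything written so far.
    return set(all_written)


green = {0, 1, 2, 5}    # ranges written before the first snapshot
yellow = {5, 6, 9}      # ranges changed after the first snapshot

print(ranges_to_capture(green | yellow, yellow, have_previous_snapshot=True))   # {5, 6, 9}
print(ranges_to_capture(green | yellow, yellow, have_previous_snapshot=False))  # green and yellow
```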
So we just talked about create; now we'll talk about the restore part as well.
So when we say snapshot restore, this is where you're able to create a disk from a snapshot.
So the restore flow is slightly more complicated.
As you can see, the moment the disk is created, you need to handle both read and also write traffic from the user. In this case, the green
kind of cylinder is the snapshot you have created. Same first step, you'll go and create a reference
blob. Same idea. This is where you have a shallow copy. The moment the reference blob is being
created, you can go ahead and delete your snapshot,
and that will not impact the lifecycle of reconstructing the disk.
Your next step is to create a copy blob on the target cluster.
So your target cluster can be HDD or SSD-based, regardless of where your snapshot is being stored.
So in this case, if you're using, you know, HDD for a cheap snapshot copy,
you can still immediately actually create
a high-performance disk out from that snapshot.
So as the, you know, snapshot is getting restored, we also create something we call a differencing blob, which is there to actually handle the new incoming writes. So the copy blob plus the differencing blob together compose the new disk. From a user
perspective, they will only see a single top-level resource, which is a disk, but in the background,
we actually have these two blobs, which are hosting the data for this one disk. And the copy blob and the differencing blob are there for the entire lifetime of the disk.
So here you may have a question: okay, when you have a differencing blob, you'll be writing data into that differencing blob, and it will overwrite pieces of ranges which you have already copied over. So you have some inefficiency here where, for the same range, you can have an old copy and then a new copy. We handle this by marking the original page ranges the moment they are overwritten, and then we have our existing garbage collection process to actually reclaim that resource.
So, going back to XStore, we are an append-only system. This is something we do not only for these scenarios, but for all the scenarios where you have an append-only system: you mark the overwritten page ranges or data sectors and then leverage a garbage collection process to free up those resources.
So in this case, I think if you look at it, you know, we kind of walked through the
snapshot restore flow. So I wanted to kind of just quickly go through the data path so you have a
better understanding of how we are actually, you know, serving the read and write traffic
on this newly restored disk.
So let's say if you have a read which hits the new disk.
What it will do is that it will first actually hit the differencing blob.
So it will try to figure out whether the data is in the differencing blob because it has been recently written.
If it's not available in the differencing blob, we'll go and hit the copy blob.
So if the data has not yet been background-copied from the source into the copy blob, we will go back and fetch the data from your source.
So that's for read.
Write is much simpler.
Write will always go to the differencing blob.
Same idea: the copy blob and also the differencing blob will live for the entire lifecycle of the disk, whereas when the copy is completed, you can go ahead and delete the reference blob.
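Putting the restore data path together, here is a hedged sketch of the read and write resolution order described above: writes always land in the differencing blob, and reads check the differencing blob first and then fall through to the copy blob (which itself may fetch from the source via copy-on-read). The RestoredDisk class and the stub copy blob are hypothetical, not actual Azure code.

```python
# Hypothetical sketch of the restored-disk data path; not the actual XStore code.
# Reads check the differencing blob first, then the copy blob; writes always go
# to the differencing blob.

class RestoredDisk:
    def __init__(self, copy_blob):
        self.copy_blob = copy_blob      # snapshot data, copy-on-read from the source
        self.diff_blob = {}             # page index -> bytes written after restore

    def read_page(self, index):
        # 1. Recently written data lives in the differencing blob.
        if index in self.diff_blob:
            return self.diff_blob[index]
        # 2. Otherwise fall through to the copy blob, which fetches from the
        #    source on demand if the background copy has not reached this page yet.
        return self.copy_blob.read_page(index)

    def write_page(self, index, data):
        # Writes always land in the differencing blob.
        self.diff_blob[index] = data


class _StubCopyBlob:
    """Tiny stand-in so the sketch runs on its own."""

    def read_page(self, index):
        return b"snapshot-data"         # pretend everything exists in the snapshot


disk = RestoredDisk(_StubCopyBlob())
disk.write_page(3, b"new-data")
assert disk.read_page(3) == b"new-data"        # served from the differencing blob
assert disk.read_page(0) == b"snapshot-data"   # served from the copy blob
```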
As you can see, that's the way that, in Azure, we are able to support the instant snapshot restore and create.
It's all, you know, leveraging the copy-on-read technology that we have,
where in the copy blob, we have the orchestration of reading the data from the source and also being
able to fetch the data on the fly if there are incoming reads which are not already being
persisted in this copy blob. What are the other use cases for COR? So copy-on-read, we'll just call it COR going forward.
The other use case where we leverage COR is for provisioning VMs from a single golden image. So let's say you have built an image based on a new kind of Linux distro, and then you want to do scale testing from that.
What you are most likely going to do is that you have a single golden image,
and then you will create multiple VMs from that single image. In this case, all the VMs will be using
the same golden image for bootstrapping. And then each VM will also need its own unique disk for capturing the differences, which are its latest changes. You would also want the scaling of these new VMs to be very fast,
because if, let's say, you're scaling up your VMs for a production workload
to handle additional traffic,
you don't want to be sitting there waiting for your VMs to come up.
As you can see, this easily imposes a challenge where all the VMs start to read from the same golden image, and the performance of that image becomes the limiting factor for how your VMs and your disks perform. The way we handle this problem is by constructing a copy-on-read tree.
So on the top, you can see that is the image. And then we will go and spin off new source CORs that are pointing back to this image. So the moment a source COR is getting created, it will be reading the data from the image and populating the source COR. Then you have child disks, which point back to the source COR. So each of the child disks will have their own COR and also a differencing blob,
just like the disk restore scenario we just talked about.
In this case, you can see that, let's say you have a VM with a disk, and you're trying to read from that disk. The read to that disk will first go and attempt to find the data in the child COR, then go back to the source COR, and then back to the image. The reason why we structure it this way is that the middle layer, the source CORs, are basically copies of the image that you can create instantaneously and that then, in the background, pull the data from the image, so that you're not having all the children going back to the source image and creating a bottleneck there.
We can scale out the source CORs by basically adding more of them, or kind of scale them in a vertical manner. It depends on the scale that we are talking about in the VM provisioning scenario.
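Here is a small sketch of read resolution through such a COR tree, where each level persists whatever it fetches from its parent so the golden image is not hammered by every child. CorNode and its methods are invented names; this is an illustration of the technique, not the real implementation.

```python
# Hypothetical sketch of read resolution through a copy-on-read (COR) tree;
# not the actual Azure implementation. Each node caches what it fetches from
# its parent, so the golden image is read at most once per page per subtree.

class CorNode:
    def __init__(self, parent=None, pages=None):
        self.parent = parent            # None for the golden image at the root
        self.pages = dict(pages or {})  # page index -> bytes persisted locally

    def read_page(self, index):
        if index in self.pages:
            return self.pages[index]
        if self.parent is None:
            return b"\x00" * 4096       # unwritten page on the golden image
        # Copy-on-read: fetch from the parent, persist locally, then return.
        data = self.parent.read_page(index)
        self.pages[index] = data
        return data


image = CorNode(pages={0: b"boot-sector"})      # golden image (root)
source_cor = CorNode(parent=image)              # intermediate source COR
child = CorNode(parent=source_cor)              # child disk's COR

assert child.read_page(0) == b"boot-sector"     # pulled image -> source COR -> child
assert 0 in source_cor.pages                    # later children hit the source COR, not the image
```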
All right.
As you can see, for constructing a COR tree for the VM provisioning scenario, we also need to tightly manage the lifetime of the source COR. Because when the child has all the data replicated, you don't want to keep the source COR around forever, because that will add additional COGS for your system. There's inefficiency there. So we need to manage the lifecycle of the source COR. To do so, you know, it's a classic kind of reference counting issue, which is that you need references on the source to actually track the dependent children. So in that example, each of the source CORs has like six children. In this case, we need some manner to actually know, oh, what are all the children that have a dependency on the source COR? Because you can't go ahead and delete the source COR if there are still dependent children.
This is quite challenging, as we can't really do simple reference counting with numerical values, because it's very hard to guarantee that the add reference operation is applied exactly once. Think about this in a distributed system, where you could have packet loss, and then your child will go and retry this add reference, which easily leads to an incorrect reference count on the source COR. The way we address this is that we look at what our design principle is for managing the lifecycle of the source COR. The first thing we wanted to guarantee is that a child should only have a single reference on the source COR. To achieve that, we implement an add reference call, which transactionally records the identity of the child. The identity of the child includes the URI of the child and also authentication information for the source to actually make calls against the child.
We ensure that the add reference call is idempotent so that on each source, you only have one record for that unique child. The diagram shown here is just an example: on the source, you have six children, and then you will have kind of a list of all the identities of the children that have a dependency on the source COR.
Still, this is not perfect because how do I ensure that a child would only have a single reference?
Because in the creation of the child, you need to decide whether you want to create the child object first or take the reference first. It's hard to guarantee that both will happen simultaneously.
So the decision we have taken is that we will always first take the reference.
So what it means is that whenever you're creating a child,
which is spinning off from the source COR,
we'll go ahead and first take the reference on the source,
then create the child object.
That means that you will potentially have leaked references: when you take the reference and then attempt to create a child, the child object creation may fail, or there are reasons that it just gets deleted, or we simply don't want the call itself to proceed forward. So you could have leaked references in this case.
We approached this reference management problem from two sides. One is the source side: whenever the source COR gets a delete reference call from any of the children, or it receives a delete for the source COR itself from an upper-level resource manager, the source will explicitly call a break reference across all the children. So no matter whether the delete reference was called or not, we'll attempt the break reference. And if the break reference call actually fails due to the reference not being found, using the identities that are stored for this child, we will treat this as a leaked reference, and what we'll do is go and delete the reference. We also look at it from the child side, where whenever you're trying to delete the child, you'll first go and delete the reference on the source.
So we have a process in place to retry on the delete reference to ensure that it eventually will succeed.
And this is what we do when deleting a child. So you can see that the delete reference is the sole responsibility of the child side.
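A minimal sketch of the reference management scheme just described, assuming the behaviors from the talk: an idempotent add reference keyed by child identity, taking the reference before creating the child object, and a break reference pass on the source that cleans up leaked references. All class and function names here are hypothetical.

```python
# Hypothetical sketch of identity-based reference management; not Azure's code.
# The source COR records each child's identity exactly once (idempotent add),
# the reference is always taken before the child object is created, and the
# source can break references to clean up leaks.

class SourceCor:
    def __init__(self):
        self.child_refs = {}   # child_id -> auth info for calling back into the child

    def add_reference(self, child_id, auth_info):
        # Idempotent: retries after packet loss cannot inflate the count.
        self.child_refs[child_id] = auth_info

    def delete_reference(self, child_id):
        self.child_refs.pop(child_id, None)

    def break_references(self, child_exists):
        """On source delete: probe every recorded child; drop leaked references."""
        for child_id in list(self.child_refs):
            if not child_exists(child_id):      # e.g. the child is gone or never got created
                self.delete_reference(child_id)
        return not self.child_refs              # safe to delete the source only when empty


def create_child(source, child_id, auth_info, create_object):
    # Design choice from the talk: always take the reference first, then create
    # the child object. If creation fails, the reference is leaked until the
    # source's break_references pass cleans it up.
    source.add_reference(child_id, auth_info)
    return create_object(child_id)


source = SourceCor()
create_child(source, "child-1", "token-1", create_object=lambda cid: {"id": cid})
source.add_reference("child-1", "token-1")                       # a retried add is harmless
assert len(source.child_refs) == 1
assert source.break_references(child_exists=lambda cid: False)   # leaked refs get cleaned up
```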
So it's quite a journey where we started first talking about the instant snapshots,
restore, and creation. That's where we introduced our copy-on-read technology.
Then we talked about how we are using our copy-on-read technology for doing VM provisioning.
And then we shared how, when using COR to construct a COR tree, we handle the specific challenges in doing the reference counting.
So I'll wrap this part up
and then I'll turn it to Greg
to talk about our next unique capability,
which is supporting data access
over different protocol path.
So one of the interesting capabilities of the X-Store architecture is the ability to support access to your block data,
not just through traditional block protocols like SCSI,
but also through additional protocols.
We'll talk about that a little bit.
You can get SCSI out of the way quickly here.
Obviously, the traditional way to access your disk data
is going to be to attach that disk to a VM running in the cloud.
SCSI is probably one of the older and most mature protocols
for giving access to that data.
And so, of course, we support that.
In the last talk that I gave, someone asked about NVMe.
And the answer I gave there was that we don't have anything specific to announce,
but in choosing which block protocol we were going to
go after first, SCSI was sort of the obvious choice because it is the most widely supported.
I mean, one of the things that we see is that customers, you know, bring their solutions from
on-prem into the cloud, and oftentimes these can be based on quite old operating systems that may not actually speak NVMe yet.
So SCSI was sort of the obvious choice here to start with.
The interesting use case, though, here is the use of REST. And so the XStore architecture allows you
to access your block data through HTTP REST requests. Now we can go through an example of
why that might be interesting. So if you consider a sort of standard disaster recovery solution,
you might want to replicate your data from one region to a completely different region,
so in case it gets taken out, you still have access to your data. And the standard way of doing this would be to stand up another VM in the secondary
region and then use some sort of, you know, technology to replicate your data. So Windows
Server, for example, has block replication built in or rsync or there's a zillion ways that you might do this.
The downside of this approach, though, is that you actually have to run a second VM.
VMs cost you money, right? And so sitting there with that other VM replicating the data into your
second region, it's just costing you the entire time it's up and running. What you'd really like to do is to be able to move that data to your secondary region
without having to run the VM.
Now, had we only had block storage protocols like SCSI, that would have been your only choice.
But with REST, you can move that data to your secondary region without having to run the VM.
So we can take a quick look at what that might look like.
In your primary region, you can take snapshots,
which Yiming has already spoken about,
and the snapshots support incremental data transfer.
So these snapshots exist as objects
that are accessible through the blob store using REST APIs,
which means that you can write software using REST
that's capable of taking those snapshots,
copying them to your secondary region,
and then as new snapshots are created in the primary,
you just pull the diff using the existing REST APIs
and then apply those diffs in the secondary location.
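As an illustration of that DR pattern, here is a hedged sketch of the control flow: pull only the changed ranges of each new snapshot from the primary region and apply them in the secondary, with no VM running there. The in-memory structures below stand in for the actual snapshot and blob REST calls; none of the names are real Azure SDK or REST API names.

```python
# Hypothetical sketch of the cross-region DR flow described above; not actual
# Azure SDK or REST API calls. The in-memory "regions" stand in for the
# snapshot/blob REST endpoints so the control flow can run end to end.

def replicate_incrementally(primary_snapshots, secondary, last_replicated_index=-1):
    """Apply snapshot diffs from the primary to the secondary, oldest first.

    primary_snapshots: list of dicts, each mapping page offset -> bytes changed
                       since the previous snapshot (what a changed-ranges REST
                       call would return).
    secondary:         dict of page offset -> bytes in the secondary region.
    """
    for index in range(last_replicated_index + 1, len(primary_snapshots)):
        diff = primary_snapshots[index]
        for offset, data in diff.items():
            secondary[offset] = data      # "PUT" the changed range in the secondary region
        last_replicated_index = index
    return last_replicated_index


# Two incremental snapshots in the primary region: a base, then a small diff.
snapshots = [{0: b"base0", 1: b"base1"}, {1: b"new1"}]
secondary_region = {}
replicate_incrementally(snapshots, secondary_region)
assert secondary_region == {0: b"base0", 1: b"new1"}
```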
The way that the REST path is implemented is in the FE, or front-end, layer.
So the FEs are sort of like gateways
to the storage service in XStore,
and the FE is broken up into two layers.
So the protocol layer is where we can introduce new APIs, new ways of accessing the data.
So this would be where our existing network requests come in from the SCSI initiators, but also our REST APIs are implemented here.
Underneath that layer in the FEs is what they call the service layer.
The service layer is where we have our internal primitives
that allow the XStore storage stamp to get and retrieve data.
And so the core idea of the FE is
that we can introduce new APIs fairly easily in the protocol
layer without anything underneath it needing
to be affected or
aware of that because it's all using a common internal interface from there on out.
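A toy sketch of that FE layering: two protocol handlers, one block-style and one REST-style, translating into a single internal service-layer interface, so adding a protocol touches only the protocol layer. The class names are invented for illustration and do not reflect the actual XStore code.

```python
# Hypothetical sketch of the FE layering described above; not actual XStore code.
# Protocol handlers translate requests into one common service-layer interface,
# so a new protocol can be added without changing anything underneath.

class ServiceLayer:
    """Common internal primitives for getting and putting block data."""

    def __init__(self):
        self.store = {}                      # page index -> bytes

    def get(self, index):
        return self.store.get(index, b"")

    def put(self, index, data):
        self.store[index] = data


class ScsiHandler:
    """Protocol layer: block-style read/write commands."""

    def __init__(self, service: ServiceLayer):
        self.service = service

    def read(self, lba):
        return self.service.get(lba)

    def write(self, lba, data):
        self.service.put(lba, data)


class RestHandler:
    """Protocol layer: HTTP-style GET/PUT on the same underlying data."""

    def __init__(self, service: ServiceLayer):
        self.service = service

    def handle(self, method, page, body=None):
        if method == "GET":
            return 200, self.service.get(page)
        if method == "PUT":
            self.service.put(page, body)
            return 201, b""
        return 405, b""


service = ServiceLayer()
ScsiHandler(service).write(0, b"hello")                           # written over the "SCSI" path
assert RestHandler(service).handle("GET", 0) == (200, b"hello")   # read back over "REST"
```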
The next thing that we'll talk about is ZRS. So in Azure, we have a replication strategy
referred to as LRS. This would be three copies of your data that's created
inside a single data center. Now, data centers are fairly survivable entities, but, you know,
disasters happen. They are powered by, you know, regional power sources. And for certain workloads,
you want to make sure that a disaster that would take out a data center would not prevent you from accessing your data.
So different zones are physically separate areas within an Azure region that are fed by independent power sources or have independent cooling and are meant to stand alone so that if one zone goes down, the others are very likely to survive. ZRS replication
allows you to create a disk, perform I/O to it, and have the XStore system replicate that data for
you across those zones. So this is showing an example of an HA setup that you might create using ZRS replication.
So you've created a ZRS disk here, and you have a VM that's doing IO to it. Should that VM die, you could take one of your secondaries,
spin it up, attach the disk,
and immediately pick up where you left off before.
Now, this is a unique capability to Azure
to replicate your data synchronously like this
underneath the covers across three zones.
The way that ZRS is implemented is that as writes are performed to the ZRS disk, those
writes flow through the FE to the partition layer, the table server, where they're passed
off to the stream layer, the ENs.
The ENs then handle replicating that data to the other zones.
Reads are issued or could be issued to any FE.
In this example, we show that the reads are going to an FE in Zone 2 that passes it through
to the table server and the EN servicing the read.
But if Zone 2 were to go down, those reads could be forwarded to an FE in either Zone
1 or Zone 3, and you would have immediate access to the data as it was last written.
I think we can... you know, we've wrapped up our talk, and if you have any questions,
feel free to raise your hand and speak up. All right.
Very good question. So let me pull it back to here. So I think, okay.
Repeat the question.
Oh, sorry. Let me repeat the question. So the question is regarding how, you know, XStore relates to Direct Drive and whether that's the backend implementation.
So I think going back to this, XStore and Direct Drive are two independent stacks, where Direct Drive is where we support the premium, the newer disk offerings, and then we have XStore supporting the standard HDD up to the premium SSDs. So they are standalone, but we also provide data movement for customers to move data between those two stacks. So they are not completely isolated stacks, but they are kind of two separate ones.
Does that address your question?
Okay. Oh, so the question is whether both stacks provide SCSI access from Hyper-V? Is that... No, so if the question is specifically whether these disks can be accessed through iSCSI: today, for disks, no matter whether it's on the XStore or the Direct Drive stack, from the guest VM side it can only be accessed through SCSI. It doesn't support iSCSI. But if you go back to what Greg talked about, our ability to add different protocols on our single XStore stack, in the future we could add more protocols, but we don't support iSCSI today.
All right.
So the question is, for the ZRS disk, when do we actually acknowledge the write back to the client? So the way that works, as I think we talked about, is that for the ZRS disk, the write is synchronously written across the three zones. The idea is that for the specific zone that you hit, let's say it was zone two, as in the example, the stream layer will replicate the data to zone one and zone three. Only when zone one and zone three have completed the write, and also the write within zone two has completed, do we go back and acknowledge that the write has been completed.
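A small sketch of that acknowledgment rule, assuming the behavior described in the answer: the write lands in the zone that received it, is synchronously replicated to the other two zones, and the client is acknowledged only after all three zones report completion. The function and data shapes are hypothetical.

```python
# Hypothetical sketch of the ZRS write acknowledgment rule described above; not
# the actual stream-layer implementation.

def zrs_write(zones, offset, data, received_in="zone2"):
    """zones: dict like {'zone1': {}, 'zone2': {}, 'zone3': {}} of page stores."""
    completed = []
    # Write locally in the zone that received the request.
    zones[received_in][offset] = data
    completed.append(received_in)
    # Synchronously replicate to the remaining zones.
    for zone in zones:
        if zone != received_in:
            zones[zone][offset] = data
            completed.append(zone)
    # Acknowledge only once every zone has completed the write.
    return "ack" if len(completed) == len(zones) else "pending"


zones = {"zone1": {}, "zone2": {}, "zone3": {}}
assert zrs_write(zones, 0, b"data") == "ack"
assert all(store[0] == b"data" for store in zones.values())
```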
Yeah?
Very specific product question.
So I think if my memory is right,
today we support 200 snapshots per disk.
And for these snapshots, just to, you know, drill a bit deeper into this, all the snapshots we provide today on disks are standalone, meaning that they have a lifetime independent of the, you know, lifetime of the disk itself.
And for snapshots, you could create either full snapshots or incremental snapshots.
But roughly, I think it's 200 per disk.
I'll go with that first.
Yeah, that gentleman.
Do you mind repeating?
Oh, so your question is how we handle the zonal replication if a zone goes down?
Did I understand it correctly? Okay.
So I think this is where, in the ZRS case, our intent is to actually replicate across the three zones. But if one zone actually goes down, we have implicit logic where we will keep replicating against the two available zones, and then catch up the other zone when it comes back. So we will reattempt the replication. Good question.
So your question is about how we switch a disk between XStore and Direct Drive? Today, from a user experience perspective, we don't support that. But the way we are looking to support it is kind of leveraging the snapshots, where you could use a snapshot to create a point in time, then create a new disk on whichever stack you decide to go with based on the disk you choose, and then use that to actually do the restore.
Very much like the copy on read that we just talked about. So I think the question is that for the X-Store
and then the direct drive stack,
what are the key differences?
And whether it's just about the performance, or whether there are other, you know, different considerations that we have on these two stacks. So, if you go back, for the XStore stack, it is a three-layer architecture, where the Direct Drive stack, as I think Greg talked about, kind of removed the layers for both the front end and also the partition layer.
So that is very much optimized for block level access.
So from a performance perspective,
direct drive will definitely provide better performance
and is optimized for that.
I'll let Greg also comment on this in a bit.
And then on XStore, as you can see, we have these layers. The layers are there for a reason. For example, they're there to actually provide the additional protocol support. So if you look at how we position these two stacks, on the XStore stack we have comprehensive capabilities, which are optimized for lift and shift, because for a lot of those you need, you know, the iSCSI protocol. Actually, on the same stack, we support our blob and file storage, where you have SMB and then NFS and all that, where on the
Direct Drive, it's purposely built for highest performance block storage. Anything? Yeah, the
other difference that I would call out is that Direct Drive allows you to independently provision IOPS, bandwidth, and disk capacity.
So instead of buying performance in fixed-size units,
you're free to choose the disk size that you need
and then select however many IOPS or how much bandwidth you need based on that.
Yeah, go ahead.
So the question is that does Direct Drive support Snapshot and copy-on-read? So, architecturally, the answer is yes.
I'll say what I said in my earlier presentation,
which is for any particular product you're interested in,
I would direct you to the product pages. Go ahead.
Can you talk about the performance differences?
The performance differences?
I think, let me try to see if I can address this from the offering perspective to give you a bit of a sense.
So it's very hard to directly just say from a stack perspective.
We can talk about the product that we offer.
So let's say we take the premium, you know, but I do want to put a disclaimer.
What you get on the product is not exactly what the stack is able to offer,
but I do want to provide some data points so you have some idea. So on the premium SSDs, which is on the X-Store stack, roughly from a performance perspective,
it can scale to roughly double-digit thousands of IOPS, in that range. And then the latency is more around the one-millisecond range, where on Direct Drive, on our Ultra Disks, those ones can easily scale to more than 100,000 IOPS.
And that's not the max we can do.
The stack can actually do better.
But right now on the product offering, that's where it is.
And then from a latency perspective,
it's definitely in a consistent sub-millisecond range.
For the locally attached... you have this locally-attached storage. For example, you will have a node that has locally-attached storage with the VM. Do you serve those cases with Direct Drive, or do you have separate services for that?
Okay, so the question is that for most of the virtual machines or instances, the ones you purchase on the cloud, you have locally attached storage. Usually, I think each cloud provider states it differently, like temp storage, local SSDs, and all that. Do we serve that with the Direct Drive stack?
So today, no.
So today, all the local SSD or temp drives that you get on the virtual machines on Azure,
those are locally attached.
They are not sitting on the persistent storage stack.
So today, for both XStore and Direct Drive, those are all persistent.
We have different redundancy options.
But in comparison, the local SSDs you're talking about are kind of tied to the lifetime of the VM. And those are not persistent storage.
But they still serve a very good purpose: sometimes, if you just do kind of large data analytics workflows where you just need local caching, those are perfectly fine. You don't
really need it to be persistent, and it is locally attached. Yeah, go ahead.
So the question is that do we support any data reduction capabilities or compression?
So no, we don't.
From a user perspective, we do not support any user-facing data reduction or compression, as in when you buy a disk on Azure it says you have data reduction and compression. No, we don't support that.
Do we plan to? I think that is a very good question. I think we hear very mixed feedback from users on whether they want
this. I think definitely, in the end, you know, they don't want it because they want to do data reduction for its own sake. They want it because they want to eventually, you know, reduce the cost they need to pay for the disks. So I think it's just a different way to kind of skin the cat: in the end, if you have a TCO which hits their needs, whether you take an approach to directly expose data reduction or compression, versus you actually just give a price which is lower on the provisioning side, it will serve the same purpose.
So right now we don't have any plans to go and do customer-facing compression or dedupe or reduction. Yeah. So definitely on the storage stack,
we do optimization so that we can reduce the overall cost.
I can't speak more on what we do,
but the idea is that we will...
The approach which we have taken is that we will be doing optimizations so that, from a user perspective, you pay the lowest dollar amount you need to pay for this.
Yeah?
You mentioned on slide 24 that you're supporting SCSI and REST and these other standard block storage protocols.
Any comments?
No.
I think on this one, the message we are attempting to deliver is that with our stack, on the FE layer, today we have REST supported there. We could easily add new protocols besides REST. Right now we don't have any, but, you know, the stack itself provides the, you know, opportunity for us to add
any other standard industry protocols.
Good question.
So I think for Azure, we, by default, support something called server-side encryption with platform
managed key, where Microsoft basically has the key to encrypt all the data. So if you don't do
anything, it's always encrypted by Microsoft managed key. So encryption is there. We also
expose the capability of encryption with customer managed key. So in that case, you are able to
actually have, you know, kind of upload your key into something we call
a key vault, and then use that key in the key vault to do the encryption for the disk.
Yeah.
So the question, sorry, the question is where in the stack we actually perform the encryption.
Greg, do you want to?
So I think Aaron is our helper on the audience seat.
So Aaron's answer was that on the X-Store stack, the encryption is done on the FE layer.
And then we'll let Greg comment on the Direct Drive.
Right. So for Direct Drive, architecturally, we support encryption happening either at the disk client or in the CCS for the write path.
So it's just a configuration option.
Yes, that's true.
Yes, the comment there was that the general guarantee we're making is that before any
customer data is committed
to a durable media, it is guaranteed to be encrypted.
Right.
Any questions?
No.
Oh, one more question.
So the question is that what about data in flight?
Did we touch on the different protocols?
So I think for both XStore and Direct Drive,
from the client, which is where your VM sits, to the storage, we use a proprietary protocol. Specifically on Direct Drive, I think Greg talked about that in the previous talk, so I'll not repeat it. Similarly, on XStore, we have a proprietary kind of protocol which talks from the client to the backend. Specifically in this area, the traffic is actually going through kind of the network backbone, which is strictly managed by Azure between the compute and the storage. So we don't have anything in the sense that you have, you know, SMB, and then SMB does a specific additional layer of encryption on that protocol itself. So we are kind of leveraging our network infrastructure to ensure the security of the communication from the compute to the backend storage. Yep. Sorry, I didn't quite get the last piece.
Just let me rephrase and make sure I understand it correctly.
You are stating that for the storage stack, we have Hyper-Vs and then our storage client drivers.
Your question is whether it is in the same box?
I'm not sure I get it.
Okay, so the question is whether storage is a separate server?
So our storage is in a different cluster than the compute. So basically, the storage servers are on storage clusters, whereas your virtual machines are in the so-called,
we call it compute clusters.
So they are remotely attached.
I think we are.
I don't know if we are running out of time or I think it's.
We are at time.
All right.
One more question.
Yeah.
Well, two actually.
One, I'm frustrated by the possible differences
between the presentation you showed
and what is currently uploaded.
Will you be uploading a new version?
Yes, yes.
I think we uploaded a new version.
Probably it's just not synced yet.
I think we will make sure the latest version
is getting uploaded.
Very good.
And I wanted to point out, speaking of your question, you may follow the new IEEE standards.
So this is a recommendation for us to look at the new IEEE kind of encryption standards.
Yeah, we'll take a look.
Yeah.
I see.
Okay.
Oh, that's great.
Then I'll go to that session.
All right.
I think we're good.
And then I think Greg and I will stay around for a bit in case you have any other questions.
Thank you.
Thank you.
Thanks for listening. If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information
about the Storage Developer Conference, visit www.storagedeveloper.org.