Storage Developer Conference - #166: Future of Storage Platform Architecture
Episode Date: April 5, 2022...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 166.
Hello, I'm Mohan Kumar. I'm a fellow at Intel Corporation, and I lead the cloud architecture team in Intel's data center group. My co-presenter, Reddy Chagam, is the lead cloud storage architect. Together, we're going to talk about the future of storage platform architecture. A decade back, when PCIe and SSDs came along, they changed the face of storage architecture. We're at a similar inflection point right now with CXL, and in this presentation we want to take you through the possible ways the storage platform could change in the future with the help of CXL.
So we want to give you a quick overview of what CXL is, what the various CXL device types and memory and storage use cases are, and then get into some future storage architecture concepts. First of all, CXL stands for Compute Express Link. It's built on top of the PCIe physical layer, and it's an alternate protocol that allows you to transport memory and cache traffic on top of the same link that used to carry only PCIe.
In order to do that, you needed to change both the processor architecture and the device
architecture.
Processor memory used to be something that was directly available from the integrated
memory controllers in the processor.
Now with CXL, the memory could be from the integrated memory controllers with those DDR
channels, or it could be memory behind the CXL
from a CXL-based device.
And that memory could be a dedicated memory device
or a device that's an accelerator with also
memory present in it.
And similarly, what used to be a traditional I/O device with DMA and interrupt semantics changes into a device that is potentially capable of supporting coherency traffic as well as memory traffic in terms of cache-line accesses.
And because it's built on PCIe, CXL benefits from the same high bandwidth as PCIe. But the transport layer of PCIe was primarily designed to transport I/O device traffic, so it was more block-oriented in terms of its latency characteristics. With CXL, because you have to carry coherency traffic and memory traffic, it's a low-latency fabric designed to transport cache lines of traffic.
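To make the load/store idea concrete, here is a minimal user-space sketch in Python, assuming the CXL-attached memory is surfaced by Linux as a DAX character device at a hypothetical path /dev/dax0.0 (on many systems it may instead appear as a memory-only NUMA node); the point is that accesses are ordinary cache-line loads and stores rather than block I/O:

```python
# Minimal sketch: touch CXL-attached memory from user space.
# The device path and mapping size are assumptions for illustration only.
import mmap
import os

DEVDAX_PATH = "/dev/dax0.0"      # hypothetical CXL-backed devdax device
MAP_LEN = 2 * 1024 * 1024        # map 2 MiB (devdax mappings are alignment-sensitive)

fd = os.open(DEVDAX_PATH, os.O_RDWR)
try:
    # Once mapped, reads and writes are plain CPU loads and stores carried
    # over CXL.mem: no block I/O, no DMA descriptors, no interrupts.
    buf = mmap.mmap(fd, MAP_LEN, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)
    buf[0:8] = b"cxl-mem\x00"    # a store, issued as ordinary memory traffic
    print(buf[0:8])              # a load
    buf.close()
finally:
    os.close(fd)
```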
There are broadly three classifications of CXL devices. Type 1 is an accelerator with a caching device, so it's allowed to cache the lines that it accesses from system memory. Type 2 is an accelerator that also has memory in it; now you have I/O, cache, and memory semantics, and the total memory present in the system is the memory attached to the host CPU as well as the memory on the accelerator. Or it could be a Type 3 device, which is a dedicated memory buffer. It acts as an expansion device, adding more memory to the system through CXL, and this allows you to have more memory bandwidth. Your server's memory bandwidth is no longer limited by the number of DDR channels; you can add additional memory bandwidth by adding additional memory buffers over CXL.
You could also have more memory capacity
by doing the same, by adding more memory buffers.
And so your total memory capacity is no longer limited
by this total memory that's attached to your CPU
directly on the DDR channels.
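For reference, here is a compact sketch summarizing the three device types just described and the protocol mix each one carries, following the CXL specification's grouping:

```python
# Compact summary of the three CXL device types (sketch for reference).
CXL_DEVICE_TYPES = {
    "Type 1": {
        "example": "accelerator with a coherent cache (e.g. smart NIC)",
        "protocols": ["CXL.io", "CXL.cache"],             # caches host memory, exposes none
    },
    "Type 2": {
        "example": "accelerator with its own device-attached memory",
        "protocols": ["CXL.io", "CXL.cache", "CXL.mem"],  # host and device share both memories
    },
    "Type 3": {
        "example": "dedicated memory buffer / expander for bandwidth and capacity",
        "protocols": ["CXL.io", "CXL.mem"],               # no device cache of host memory
    },
}

for name, info in CXL_DEVICE_TYPES.items():
    print(f"{name}: {info['example']} -> {', '.join(info['protocols'])}")
```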
More importantly, for the purpose of this talk, the models shown here are primarily one-to-one: one host and one memory buffer, one host and one I/O device, one host and one accelerator. But it needn't be so. Just like PCIe has switches, you can conceive of a CXL switch that allows you to have a one-to-many configuration where one memory buffer is potentially connected to multiple system hosts. And this type of disaggregation from the system allows us to build some interesting storage topologies in the future.
So we're going to talk about a few of those topologies. The first one is a high-availability architecture for scale-out storage that leverages the benefits of CXL.
Similarly, if you're going to build software-defined storage and you want to speed up its performance, one option is to speed up the metadata accesses. We want to show how, using CXL, you can solve that problem more easily than you could in a host-based environment. We'll also talk about how a converged memory and storage device can unlock the potential for the storage platform architecture of the future. And finally, we will show how CXL accelerators can provide offload for storage in the future.
This picture shows the storage architecture for software-defined storage. In this type of architecture, the storage is replicated across a number of storage servers. Each storage server here essentially stores its data and metadata on SSDs.
In this case, it's a shared-nothing architecture.
So any node failure, any one of the storage servers
or storage nodes failing would cause
a cluster-wide rebuild and rebalancing.
And that could take several hours to complete,
because you need to pull the data from the peer storage
nodes.
And also, of course, you have to rebuild the metadata.
Those are today's problems.
Let's see how it changes in the world of CXL.
On the right-hand side, with the CXL-based mechanism, the storage node reduces to the CPU and local memory, but there is both storage and memory present behind these switches. And these switches allow those devices to essentially be mapped into two different servers, or any number of servers for that matter. So the failure domain of your data is not the failure domain of the server. The server could fail; for example, the storage server 1 shown here could fail without taking away access to the data that's associated with it. The data associated with it may be in the SSDs behind the switch beneath it, and the metadata may be in the CXL memory buffer. But the storage node failure does not mean that the data is no longer available. A storage node failure still allows you access to the data through another storage node that can swap its links and access those SSDs and that metadata through the CXL link. This means a host failure does not trigger cluster-wide rebuilds anymore.
It also means the metadata is stored in CXL memory, which for the purpose of this discussion you can think of as storage class memory, and that helps reduce the rebuild time because you're not completely rebuilding the metadata either.
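A minimal failover sketch of this idea, with purely illustrative names and no real switch or fabric-manager API, could look like the following: on a node failure, the surviving host simply claims the switch-attached SSDs and the metadata-holding CXL memory buffer instead of triggering a cluster-wide rebuild.

```python
# Illustrative failover model: data SSDs and the metadata-holding CXL memory
# buffer live behind a CXL/PCIe switch, so a surviving host can claim them.
from dataclasses import dataclass, field

@dataclass
class SwitchAttachedResources:
    ssds: list                 # e.g. ["nvme0", "nvme1"]; the data lives here
    cxl_mem_buffer: str        # holds the metadata; power-protected on its own
    owner: str = ""            # which host currently has the devices mapped

@dataclass
class Cluster:
    pools: dict = field(default_factory=dict)

    def handle_node_failure(self, failed_node, standby_node):
        for name, pool in self.pools.items():
            if pool.owner == failed_node:
                # Remap the switch links to the standby host: data and metadata
                # stay where they are, so no cluster-wide rebuild or rebalance.
                pool.owner = standby_node
                print(f"pool {name}: remapped to {standby_node}, no rebuild triggered")

cluster = Cluster(pools={
    "pool-1": SwitchAttachedResources(["nvme0", "nvme1"], "cxl-mem-0",
                                      owner="storage-server-1"),
})
cluster.handle_node_failure("storage-server-1", "storage-server-2")
```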
Second, since we talked about metadata in the previous concept, let's look at an option where you have a storage server with the persistent data stored in an NVMe SSD and the metadata stored in DDR-based memory. If you do this, in order to protect against a system failure, you need to make sure that the stored metadata is persistent. And the way to do that is to essentially put the entire server domain into what's called full system persistence, so your metadata is not lost on a server failure. But achieving this type of full system persistence is a platform- and CPU-dependent problem, because there are various things involved. The CPUs have caches, right? So when there is a power failure, the caches have to be flushed. The internal fabric of the CPU has to be flushed. And all of this needs to happen in time, in a manner that ensures none of the data stored in those caches and in the internal fabric is lost.
You could go to storage class memory. But if you go to storage class memory, you've got to change the software semantics. You've got to change your software, because you need to explicitly create durability points to make sure that the data has achieved that durability. And of course, you can always go back to NVMe SSDs for your metadata. But that, as we know, is going to be slow.
So what's the alternative? Our problem is the fact that the full system persistence domain is too big. Is there a way for us to reduce that persistence domain? Here's a concept where we're using CXL and the metadata is stored in CXL memory, the memory behind the CXL memory buffer. Therefore, when there is a power failure of the host node, you really don't have to worry about power-protecting the entire system. All you have to do is power-protect the CXL memory buffer. So your persistent memory domain is much reduced, and this allows for much faster metadata operations, because now you're operating at the latency of DDR memory, which is much faster. And when you have to achieve this persistence, you have no dependence on the server platform or the CPU, right? You don't have to worry about the caches or any of those things. As long as the data has reached the CXL memory buffer and you've acknowledged that write, you have to preserve that data; that's the only problem you need to solve. That allows you to create a much simpler architecture.
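As a rough sketch of what this reduced persistence domain could look like from host software, assume the power-protected CXL memory buffer is exposed through a DAX-capable mount; the path, record layout, and flush granularity below are all illustrative assumptions:

```python
# Sketch: commit a small metadata record into the power-protected CXL buffer.
import mmap
import os
import struct

PATH = "/mnt/cxl-pmem/metadata.bin"   # hypothetical file backed by the CXL memory buffer
fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o600)
os.ftruncate(fd, 2 * 1024 * 1024)
meta = mmap.mmap(fd, 2 * 1024 * 1024, mmap.MAP_SHARED,
                 mmap.PROT_READ | mmap.PROT_WRITE)

def commit_metadata(offset, volume_id, lba_start, lba_count):
    """Store a small (24-byte) metadata record and flush it toward the buffer."""
    record = struct.pack("<QQQ", volume_id, lba_start, lba_count)
    meta[offset:offset + len(record)] = record
    # Flush the page holding the record. Once the data has reached the CXL
    # memory buffer and the write is acknowledged, the buffer's own power
    # protection (not full-system persistence) guarantees durability.
    meta.flush(offset & ~4095, 4096)

commit_metadata(0, volume_id=42, lba_start=1_048_576, lba_count=256)
```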
Now, to take you through the sequence of steps that happen in this persistence model and to describe a few other models in this space, here is my colleague, Reddy Chagam.
Thanks, Mohan, for the nice intro. Let's take a look at how CXL memory buffer persistence can be used to speed up the metadata operations in software-defined storage. Now, why metadata operations? Metadata operations tend to be fairly expensive, specifically the write metadata operations in software-defined storage, primarily because these metadata updates tend to be fairly small in nature, like 64 bytes to 128 bytes. And in order to protect that, you have the content written in DRAM, and then you are essentially logging all the changes in the transaction to be persisted in the NVMe device, which uses a block I/O operation that is fairly expensive from a latency perspective. That in turn results in a throughput reduction for client write I/O operations. So that's one of the reasons why we think having the metadata acceleration using the CXL memory buffer significantly improves the cluster-wide throughput. So let's take a look at how the metadata operations actually play a very critical role in the storage I/O operations.
In this example, we are talking about a write I/O operation that is coming from the storage client. Now, the storage client does not have any insight into exactly which storage node or which NVMe SSD this data belongs to. So it basically operates at a higher level of abstraction, where you have a virtual volume and an offset into that virtual volume, and you issue the write operation against that virtual volume. When the storage server in the cluster receives that write I/O request, it has to go figure out exactly how that virtual volume maps to a specific physical device in the pool of NVMe SSDs that the storage server is managing. That's a metadata lookup. Once it identifies the exact location of the NVMe SSD LBA range, it issues the write operation.
Once that write operation is complete, it needs to make sure that that LBA range is reserved for this virtual volume. So there is a commit operation that happens to protect the metadata integrity. Now, that commit operation, typically if it goes to an NVMe device, means going through the transaction log, like I mentioned before, and then issuing the block I/O. In this flow, the commit operation basically makes a bit flip in the DDR, and then the response comes back to the host storage software. That improves the latency significantly compared to the traditional implementation.
Once the local I/O is committed, then it has to issue the write to the other storage node in the cluster. That itself involves another metadata lookup: mapping the virtual volume to where the second copy belongs, finding that server, and issuing the write operation to that server is really the next logical step. Once that is done and that server acknowledges the write I/O, the host software on the primary storage server responds back to the client, indicating that the operation is successful.
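A condensed, runnable sketch of that write path, using illustrative stand-in classes rather than any particular SDS implementation, might look like this:

```python
# Sketch of the primary server's write path described above.
class MetadataStore:
    """Metadata held in the persistent CXL memory buffer: small updates, no log I/O."""
    def __init__(self):
        self.extents = {}
    def lookup(self, volume, offset, length):
        # Map (virtual volume, offset) to an SSD and LBA range (trivial placement here).
        return "nvme0", (offset // 512, length // 512 + 1)
    def commit(self, volume, offset, ssd, lba_range):
        # With CXL-buffer persistence this is a store plus flush, not a block write.
        self.extents[(volume, offset)] = (ssd, lba_range)

class PeerServer:
    def replicate_write(self, volume, offset, data):
        return "ACK"          # the secondary applies the same lookup/write/commit

def handle_client_write(volume, offset, data, metadata, ssds, peer):
    ssd, lba_range = metadata.lookup(volume, offset, len(data))   # 1. metadata lookup
    ssds.setdefault(ssd, {})[lba_range] = data                    # 2. local NVMe write
    metadata.commit(volume, offset, ssd, lba_range)               # 3. metadata commit
    peer.replicate_write(volume, offset, data)                    # 4. replicate to peer
    return "OK"                                                   # 5. ack the client

print(handle_client_write("vol-1", 4096, b"x" * 4096, MetadataStore(), {}, PeerServer()))
```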
So as you can see in this flow, the metadata operations are the critical ones that gate the actual data reads and writes. Without going through the metadata operations, you won't be able to do the media reads and writes. And that's why metadata operations are fairly critical in nature, and it is important to speed up that bottleneck in the software-defined storage stack.
So having battery-backed, DDR-type persistence in the memory buffer enables significantly higher throughput for write operations at the cluster level in an SDS implementation, as opposed to current implementations out there.
That's really the benefit we are looking for in this architecture. Let's take a look at how the memory and storage architecture is actually converging using CXL.
Historically, when we look at the storage architecture, we look at essentially a pool of servers with a shared-nothing architecture, where everything is part of the storage server itself: the memory, the processing, and the storage. And then everything is essentially replicated and protected.
Using CXL and its switching capability, we have to look at the storage architecture somewhat differently. The switching essentially enables us to pool storage, compute, and acceleration capabilities and drive a disaggregated architecture for memory and storage. So if you look at the storage implementation, you can have the storage logic in a CXL accelerator behind the switch. Anytime that logic wants to read and write data in and out of an SSD, it can issue a peer-to-peer (P2P) operation to the PCIe SSD attached to the switch. The P2P flows are then offloaded into the switch, as opposed to the host CPU managing them. That significantly offloads the host CPU, freeing its processing capability for real workload execution. And then, of course, you can also use the CXL memory buffer with persistence as a way to deliver improved metadata lookup operations on top of that, like what we talked about in the previous slide.
So with a CXL-type architecture with switching and disaggregation, you can think of storage not as a completely shared-nothing architecture, but rather as a disaggregated and pooled architecture that significantly improves resource sharing in the data center, and that offloads compute processing to the devices behind the switch, freeing the host processor for workload execution. That's really the benefit we are looking for in the converged architecture.
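An illustrative model of that disaggregated layout, with made-up class names rather than a real switch or accelerator interface, might look like the following; the switch routes device-to-device traffic while the host only sees completions:

```python
# Illustrative model: offload logic, NVMe SSDs, and memory sit behind a switch,
# and peer-to-peer transfers between devices bypass the host CPU's memory.
class Switch:
    def __init__(self):
        self.devices = {}
    def attach(self, name, device):
        self.devices[name] = device
    def p2p_transfer(self, src, dst, payload):
        # The switch forwards traffic device-to-device; the host only sees the
        # completion, freeing its cycles for the actual workload.
        return self.devices[dst].receive(payload)

class NvmeSsd:
    def __init__(self):
        self.blocks = []
    def receive(self, payload):
        self.blocks.append(payload)
        return len(payload)

class CxlAccelerator:
    def __init__(self, switch):
        self.switch = switch
    def compress_and_store(self, data, target_ssd):
        compressed = data[: len(data) // 2]   # stand-in for real offload work
        return self.switch.p2p_transfer("accel0", target_ssd, compressed)

switch = Switch()
switch.attach("nvme0", NvmeSsd())
accel = CxlAccelerator(switch)
switch.attach("accel0", accel)
print(accel.compress_and_store(b"y" * 4096, "nvme0"))   # host CPU is not in the data path
```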
Now, let's click down on the device itself.
Historically, what we have been doing is using NVMe SSDs for block I/O workloads. An NVMe SSD provides NVMe block I/O semantics, so the software stack has to be designed to take advantage of those block I/O semantics, through the kernel drivers as well as user-space implementations, to read and write data in and out of the NVMe SSD using the NVMe block I/O protocol.
With a CXL memory buffer, you essentially have load/store semantics that the software can take advantage of. So these are two distinct protocols and two distinct devices, and the software has to change. And then, of course, on the platform side, the firmware, the device training, as well as the reliability, availability, and serviceability features need to be implemented differently based on whether it is a CXL device or an NVMe SSD. They do change; they are pretty much unique to each one of the device types.
So imagine a situation where you have a converged controller that provides both the NVMe and CXL protocol feature sets and exposes the NVM media based on the type of software requirements that you have. So let's say, for example, you want to use the NVMe SSD for inference embedding table lookups.
Embedding tables are basically in-memory array data structures where you look up embedding vectors as part of the inference execution flow. But these embedding tables are fairly large; you're looking at multi-gigabyte capacities. If you were to take the NVM media and expose it with CXL.mem semantics to the CPU, you could take advantage of the significant amount of capacity that NVM media provides, as opposed to traditional DDR media. If you don't have this functionality, the software essentially has to access the NVM media through block I/O semantics, pull that data into host DDR memory, and then do the lookups on top of that. That has a significant number of disadvantages. One, you need to make the software change. Two, the performance is somewhat slower, because you are issuing the block I/O and then translating that into load/store through a staging copy of the table in host DRAM. Instead, if you provide just the .mem semantics through the converged controller, you can bypass pretty much all software changes and improve the performance. So the workloads that really require capacity and use load/store-type semantics can significantly benefit if you have a converged controller that can expose the NVM media through load/store semantics using the CXL protocol. That's really the benefit for that set of workloads.
So that's the converged device architecture.
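As a rough illustration of the two access paths for an embedding-table lookup, here is a simulated comparison; the file below stands in for the NVM media, and no real controller or driver interface is involved:

```python
# Sketch: load/store lookup (converged controller, CXL.mem) versus block I/O lookup.
import mmap, os, struct

EMBED_DIM = 64                      # floats per embedding vector
ROW_BYTES = EMBED_DIM * 4
BLOCK = 4096
PATH = "embeddings.bin"             # stand-in for the NVM-backed table

# Create a small demo file (in reality this would be the NVM media itself).
with open(PATH, "wb") as f:
    f.truncate(1024 * ROW_BYTES)

fd = os.open(PATH, os.O_RDWR)
# Load/store path: the table is simply memory-mapped and indexed directly,
# with no software change beyond pointing the mapping at the device.
table = mmap.mmap(fd, 0, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)

def lookup_load_store(row):
    off = row * ROW_BYTES
    return struct.unpack(f"<{EMBED_DIM}f", table[off:off + ROW_BYTES])

# Block I/O path: read a whole 4 KiB block through the NVMe stack into host
# DRAM, then extract the 256-byte vector from that staging copy.
def lookup_block_io(row):
    off = row * ROW_BYTES
    block_off = (off // BLOCK) * BLOCK
    os.lseek(fd, block_off, os.SEEK_SET)
    block = os.read(fd, BLOCK)
    start = off - block_off
    return struct.unpack(f"<{EMBED_DIM}f", block[start:start + ROW_BYTES])

assert lookup_load_store(7) == lookup_block_io(7)
```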
Now, there has been a lot of focus on computational storage-type offload capability; there has been a lot of work in SNIA on the standards as well. The current computational storage implementations use block I/O: essentially, you use block I/O in the NVMe protocol to submit a command to the acceleration capability inside the NVMe SSD, which will execute that command, process the data within the SSD itself, and return the response back through the NVMe protocol to the host CPU software stack.
And then we have the CXL acceleration capability, where you are looking at .mem semantics and being able to submit commands to the accelerator using a cache-coherent interface. What that does is, all of a sudden, you have a mechanism to interact with the host CPU cache hierarchy and to seamlessly interoperate between the acceleration capability within the CXL device and the software stack running on the host CPU. So you essentially have two sets of protocols, two sets of computational offload implementations, and the architecture around them.
If you were to look at this as a converged architecture, you essentially don't have to change the software stack; the software stack continues to use the .mem semantics. And then you provide the acceleration capability, whether it is computational storage or computational memory; irrespective, you can run it through the CXL acceleration capability and then tap into the NVM media through the NVMe controller. So you can actually unify this architecture for computational offloads. The benefit, again, is that you don't have to worry about the protocol nuances or the translation layer within NVMe. You can bury the protocol nuances, avoid software changes, and deliver a seamless computational offload capability through the CXL accelerator functionality. That's really the benefit the converged architecture will give you.
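A small illustrative sketch of that unified offload model, with a plain Python list standing in for the shared, cache-coherent command queue and a bytearray standing in for the NVM media:

```python
# Illustrative unified offload model: the host submits work through shared
# memory, and the accelerator touches the NVM media through its own controller.
from dataclasses import dataclass

@dataclass
class OffloadCommand:
    op: str              # e.g. "filter", "checksum", "compress"
    src_lba: int
    length: int

class ConvergedAccelerator:
    """Models a CXL accelerator front-ending NVM media via an internal controller."""
    def __init__(self, media):
        self.media = media
        self.queue = []                      # stands in for a shared-memory work queue

    def submit(self, cmd):
        self.queue.append(cmd)               # host performs a store into shared memory

    def run(self):
        results = []
        for cmd in self.queue:
            data = self.media[cmd.src_lba * 512:(cmd.src_lba + cmd.length) * 512]
            if cmd.op == "checksum":
                results.append(sum(data) & 0xFFFFFFFF)   # processed inside the device
        self.queue.clear()
        return results                        # completions land back in shared memory

accel = ConvergedAccelerator(bytearray(b"\x01" * 512 * 16))
accel.submit(OffloadCommand("checksum", src_lba=0, length=4))
print(accel.run())
```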
In summary, CXL enables storage and memory architecture innovations. This includes the converged controller concept, which provides both storage and memory semantics to enable workloads that require memory-centric load/store semantics. It includes the CXL acceleration capability to enable computational storage offload implementations, along with a computational memory offload architecture, to deliver a seamless acceleration interface to the host platform. It includes DDR persistence within the CXL memory buffer, as opposed to depending on platform-based implementations, to speed up the SDS metadata operations, specifically the write operations, and improve the cluster-wide write data throughput. And last but not least, it brings the high-availability architecture from storage-appliance-type implementations into scale-out architectures using CXL constructs, to reduce the cluster-wide rebuild and recovery times and enable the high-availability uptime goals for the data center. Thank you for your time.
Thanks for listening. If you have questions about the material presented in this podcast, you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.