Storage Developer Conference - #166: Future of Storage Platform Architecture

Episode Date: April 5, 2022

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 166. Hello, I'm Mohan Kumar, and I'm a fellow at Intel Corporation, and I lead the cloud architecture team at Intel in the data center group. My co-presenter, Reddy Chagam, is the lead cloud storage architect. Together, we're going to talk about the future of storage platform architecture in this talk. A decade back, when PCIe and SSDs came along,
Starting point is 00:01:08 it changed the face of storage architecture. We're at a similar inflection point right now with CXL, and in this presentation we want to take you through the possible ways the storage platform could change in the future with the help of CXL. So we want to give you a quick overview of what CXL is, what the various types of CXL devices and the memory and storage uses are, and then get into some of the future storage architecture concepts in this talk. First of all, CXL stands for Compute Express Link.
Starting point is 00:01:50 It's built on top of the PCIe physical layer, and it's an alternate protocol that allows you to transport memory and cache traffic on top of the same link that used to carry just PCIe. In order to do that, you needed to change both the processor architecture and the device architecture. Processor memory used to be something that was directly available from the integrated memory controllers in the processor. Now with CXL, the memory could be from the integrated memory controllers with those DDR
Starting point is 00:02:24 channels, or it could be memory behind CXL, from a CXL-based device. And that memory could be a dedicated memory device or an accelerator that also has memory present in it. Similarly, what used to be a traditional I/O device with DMA and interrupt semantics changes into a device that is potentially capable of supporting coherency traffic as well as memory traffic in terms of cache line
Starting point is 00:02:52 accesses. And because it's built on PCIe, it benefits from the same high bandwidth as PCIe. But the transport layer of PCIe was primarily designed to carry I/O device traffic, so it was more block-oriented in terms of its latency characteristics. With CXL, because you have to carry coherency traffic and memory traffic,
Starting point is 00:03:20 it's a low-latency fabric designed to transport cache lines of traffic. There are broadly three classifications of CXL devices. Type 1 is an accelerator with a caching device, so you're allowed to cache the lines that you access from system memory. Type 2 is an accelerator that also has memory in it; now you have I/O, cache, and memory semantics, and the total memory present in the system is the memory attached to the host CPU as well as the memory on the accelerator.
Starting point is 00:04:03 Or it could be a Type 3 device, which is a dedicated memory buffer. It acts as an expansion device, adding more memory to the system through CXL. And this allows you to have more memory bandwidth: your server's memory bandwidth is no longer limited by the number of DDR channels, because you can add additional memory bandwidth by adding memory buffers over CXL. You could also add more memory capacity by doing the same, by adding more memory buffers. So your total memory capacity is no longer limited
Starting point is 00:04:35 by the total memory that's attached to your CPU directly on the DDR channels. More importantly, for the purpose of this talk, the models that are shown here in this picture are primarily one-to-one: one host and one memory buffer, one host and one I/O device, one host and one accelerator. But it needn't be so. Just like PCIe has switches, you can conceive of a CXL switch that allows you to have a one-to-any topology, where one memory buffer is potentially connected to multiple system hosts. And this type of disaggregation from the system allows us to build some interesting storage topologies.
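As a quick reference, here is a small C sketch (illustrative only, not from the talk) of the protocol mix each device type carries, following the CXL specification: Type 1 uses CXL.io plus CXL.cache, Type 2 adds CXL.mem on top of those, and Type 3 uses CXL.io plus CXL.mem.

```c
#include <stdio.h>

enum cxl_proto { CXL_IO = 1 << 0, CXL_CACHE = 1 << 1, CXL_MEM = 1 << 2 };

struct cxl_device_type {
    const char *name;
    unsigned    protocols;      /* bitmask of enum cxl_proto */
};

static const struct cxl_device_type cxl_types[] = {
    { "Type 1: caching accelerator",      CXL_IO | CXL_CACHE           },
    { "Type 2: accelerator with memory",  CXL_IO | CXL_CACHE | CXL_MEM },
    { "Type 3: memory buffer / expander", CXL_IO | CXL_MEM             },
};

int main(void)
{
    for (unsigned i = 0; i < sizeof(cxl_types) / sizeof(cxl_types[0]); i++)
        printf("%-34s io=%d cache=%d mem=%d\n", cxl_types[i].name,
               !!(cxl_types[i].protocols & CXL_IO),
               !!(cxl_types[i].protocols & CXL_CACHE),
               !!(cxl_types[i].protocols & CXL_MEM));
    return 0;
}
```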
Starting point is 00:05:17 So we're going to talk about a few of those topologies. The first one is a higher-availability architecture for scale-out storage that leverages the benefits of CXL. Similarly, if you're going to build software-defined
Starting point is 00:05:42 storage and you want to speed up its performance, one option is to take the metadata and speed up the metadata accesses, and we want to show how, using CXL, you can solve that problem more easily than you could in a host-based environment. We'll also talk about how a memory-storage converged device can unlock the potential of the storage platform architecture of the future. And finally, we will show how CXL accelerators can provide offload for storage in the future. This picture shows the storage architecture
Starting point is 00:06:23 for software-defined storage. In this type of architecture, the storage is replicated across a number of storage servers. Each storage server here is essentially storing its data and metadata on SSDs. In this case, it's a shared-nothing architecture. So any node failure, any one of the storage servers
Starting point is 00:06:49 or storage nodes failing, would cause a cluster-wide rebuild and rebalancing. And that could take several hours to complete, because you need to pull the data from the peer storage nodes. And also, of course, you have to rebuild the metadata. Those are today's problems. Let's see how it changes in the world of CXL.
Starting point is 00:07:09 On the right-hand side, with the CXL-based mechanism, the storage node reduces to the CPU and local memory, but there is both storage and memory present behind these switches. And these switches allow those devices to be mapped into two different servers, or any number of different servers for that matter. So the failure domain of your data is not the failure domain of the server. The server could fail;
for example, storage server 1 shown here could fail without essentially taking away access to the data that's associated with it. Currently, the data associated with it may be in the SSDs behind the switch beneath it, and the metadata may be in the CXL memory buffer. But the storage node failure does not mean that data is no longer available. A storage node failure still allows access to the data through another storage node, which can swap its links and reach those SSDs and that metadata through the CXL link. This means a host failure does not trigger cluster-wide rebuilds anymore. It also means the metadata is stored in CXL memory, which for the purpose of this discussion you can think of as a storage-class memory; that helps reduce the rebuild time, because you're not completely rebuilding the metadata either.
Starting point is 00:08:36 Second, since we talked about metadata in the previous concept, let's look at an option where you have a storage server with the persistent data stored in an NVMe SSD and the metadata stored in DDR-based memory. If you do this, in order to protect against a system failure, you need to make sure that the stored metadata is persistent.
Starting point is 00:09:19 And the way to do that is to essentially put the entire system, the entire server domain, into what's called full-system persistence, so your metadata is not lost on a server failure. But achieving this type of full-system persistence is a platform- and CPU-dependent problem, because there are various things involved. The CPUs have caches, right? So when there is a power failure, the caches have to be flushed. The internal fabric of the CPU has to be flushed. And all of this needs to happen in time, in a manner that ensures none of the data stored
in those caches and in the internal fabric is lost. You could go to a storage-class memory instead. But if you go to a storage-class memory, you've got to change the software semantics. You've got to change your software, because you need to explicitly call out durability points to make sure that the data has achieved that durability.
Starting point is 00:10:19 And of course, you can always go back to NVMe SSDs for your metadata. But that, as we know, is going to be slow. So what's the alternative? Our problem is the fact that the full-system persistence domain is too big. Is there a way for us to reduce that persistence domain? Here's a concept where we're using CXL, and the metadata is stored in CXL memory, the memory behind a CXL memory buffer. Therefore, when there is a power failure of the host node, you really don't have to worry about power-protecting the entire system.
Starting point is 00:11:09 All you've got to do is power-protect the CXL memory buffer. So your persistent memory domain is much reduced, and this allows for much faster metadata operations, because now you're operating at the latency of DDR memory, which is much faster. And when you have to achieve this persistency, you have no dependence on the server platform or the CPU, right? You don't have to worry about the caches or any of those things. As long as the data has reached the CXL memory buffer and you've accepted that write, you have to preserve that data. That's the only problem you need to solve, and that allows you to create a much simpler architecture.
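As an illustration of what that commit path could look like, here is a hedged C sketch. It assumes the CXL memory buffer is mapped into the host address space (for example through a DAX-style character device, not shown) and uses x86 cache-line flush intrinsics; the record layout and two-step publish are illustrative choices, not something prescribed by the talk.

```c
/* Compile with a flag such as -mclflushopt on x86. */
#include <stdint.h>
#include <immintrin.h>

#define CACHELINE 64

/* A small, cache-line-sized commit record for one LBA reservation.
 * `seqno` is written last and acts as the commit flag. */
struct meta_record {
    uint64_t volume_id;
    uint64_t lba_start;
    uint64_t lba_count;
    uint64_t seqno;
} __attribute__((aligned(CACHELINE)));

/* `slot` points into the mapped CXL memory buffer. */
void commit_metadata(struct meta_record *slot, const struct meta_record *update)
{
    /* Write the payload, then flush the line out of the CPU caches so it
     * actually reaches the battery-backed buffer. */
    slot->volume_id = update->volume_id;
    slot->lba_start = update->lba_start;
    slot->lba_count = update->lba_count;
    _mm_clflushopt(slot);
    _mm_sfence();

    /* Publish: once seqno is durable, any host that can reach this buffer
     * through the CXL switch sees a committed record. */
    slot->seqno = update->seqno;
    _mm_clflushopt(slot);
    _mm_sfence();
}

int main(void)
{
    static struct meta_record slot;           /* stand-in for the mapped buffer */
    struct meta_record upd = { 7, 4096, 8, 1 };
    commit_metadata(&slot, &upd);
    return 0;
}
```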
Starting point is 00:11:51 Now, here to take you through the sequence of steps that happen in this persistence model, and to describe a few other models in this space, is my colleague, Reddy Chagam. Thanks, Mohan, for the nice intro. Let's take a look at how CXL memory buffer persistence can be used to speed up the metadata operations in software-defined storage. Now, why metadata operations? Metadata operations tend to be fairly expensive, specifically the write metadata operations
Starting point is 00:12:33 in software-defined storage, primarily because these metadata operations tend to be fairly small in nature, like 64 bytes to 128 bytes. And in order to protect them, you have the content written in DRAM, and then you are essentially logging all the changes in the transaction to be persisted in the NVMe device, which uses a block I/O operation that is fairly expensive from a latency perspective. That in turn results in a throughput reduction
Starting point is 00:13:17 for client write I/O operations. So that's one of the reasons why we think having metadata acceleration using the CXL memory buffer significantly improves the cluster-wide throughput. So let's take a look at how the metadata operations actually play a very critical role in the storage I/O operations. In this example, we are talking about a write I/O operation that is coming from the storage client. Now, the storage client does not have any insight into exactly which storage node or which NVMe SSD this data belongs to. It is operating at a higher level of abstraction, where you have a virtual volume and an offset in that virtual volume, and you are issuing the
Starting point is 00:14:13 write operation against that virtual volume. When the storage server in the cluster receives that write I/O request, it has to go figure out exactly how that virtual volume maps to a specific physical device in the pool of NVMe SSDs that the storage server is managing. That's a metadata lookup. Once it identifies the exact location, the NVMe SSD LBA range, it issues the write operation. Once that write operation is complete,
Starting point is 00:14:49 it needs to make sure that that LBA range is reserved for this virtual volume. So there is a commit operation that happens to protect the metadata integrity. Now, that commit operation, typically, if it is an NVMe device, means you're going through the transaction log, like I mentioned before, and then issuing the block I/O. In this flow, the commit operation
Starting point is 00:15:16 basically makes a bit flip in the DDR, and then the response comes back to the host storage software. That improves the latency significantly compared to the traditional implementation. Once the local I/O is committed, it has to issue the write to the other storage node in the cluster. That itself is another metadata lookup: mapping the virtual volume to where the second copy belongs, then finding that server and issuing the write operation to it
Starting point is 00:15:54 is really the next logical step. Once that is done, once the peer server acks the write I/O, the host software on the primary storage server responds back to the client, indicating that the operation is successful. So as you can see in this flow, the metadata operations are the critical ones
Starting point is 00:16:16 that actually gate the data reads and writes. Without going through the metadata operations, you won't be able to do the media reads and writes. That's why metadata operations are fairly critical in nature, and it is important to speed up that portion of the software-defined storage, the bottleneck. So having battery-backed DDR-type persistence in the memory buffer really enables significantly higher throughput for write operations at the cluster level in the SDS implementation, as opposed to current implementations out there. That's really the benefit we are looking for in this architecture.
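Putting the steps of that flow together, a simplified sketch of the primary server's write path might look like the following. All helper names are hypothetical; the key difference from a traditional implementation is that step 3 is a small load/store update into the CXL memory buffer instead of a logged block I/O.

```c
#include <stdint.h>

typedef struct { int ssd; uint64_t lba; uint64_t len; } extent_t;

/* Stubs for the steps in the flow described above. */
static extent_t metadata_lookup(uint64_t vol, uint64_t off, uint64_t len)
{ (void)vol; (void)off; extent_t e = { 0, 4096, len }; return e; }
static void nvme_write(extent_t e, const void *buf) { (void)e; (void)buf; }
static void cxl_commit_extent(uint64_t vol, extent_t e)   /* see the earlier commit sketch */
{ (void)vol; (void)e; }
static void replicate_write(uint64_t vol, uint64_t off, const void *buf, uint64_t len)
{ (void)vol; (void)off; (void)buf; (void)len; }           /* waits for the peer's ack */

int handle_client_write(uint64_t vol, uint64_t off, const void *buf, uint64_t len)
{
    /* 1. Metadata lookup: map (volume, offset) to an NVMe SSD LBA range. */
    extent_t e = metadata_lookup(vol, off, len);

    /* 2. Write the data to the local NVMe SSD. */
    nvme_write(e, buf);

    /* 3. Commit: reserve the LBA range for this volume by updating the
     *    metadata record in the power-protected CXL memory buffer. */
    cxl_commit_extent(vol, e);

    /* 4. Second metadata lookup picks the peer holding the second copy,
     *    then forward the write and wait for its ack. */
    replicate_write(vol, off, buf, len);

    /* 5. Ack the client only after both copies are durable. */
    return 0;
}

int main(void)
{
    char buf[4096] = {0};
    return handle_client_write(7, 0, buf, sizeof buf);
}
```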
Starting point is 00:16:50 Let's take a look at how the memory and storage architecture is actually converging using CXL.
Starting point is 00:17:23 Historically, when we look at the storage architecture, we look at essentially a pool of servers with a shared-nothing architecture, where everything is part of the storage server itself. That includes the memory, the processing, and the storage. And then everything is essentially replicated and protected. Using CXL and the switching capability, we have to look at the storage architecture somewhat differently.
Starting point is 00:18:00 So the switching essentially enables us to pool the storage, compute, and acceleration types of capability and drive a disaggregated architecture for memory and storage. If you look at the storage implementation, you can have the storage logic in a CXL accelerator behind the switch. Anytime that logic wants to read and write data in and out of an SSD, it can issue a P2P operation to the PCIe SSD attached to the switch. The P2P flows are offloaded onto the switch, as opposed to the host CPU managing them, and that significantly frees up the host CPU's processing capability for real workload execution. And then, of course, you can also use the CXL memory buffer
Starting point is 00:19:05 with persistence as a way to deliver improved metadata lookup operations on top of that, like what we talked about in the previous slide. So having a CXL type of architecture, with the switching and disaggregation capability, you can think of storage not as a completely shared-nothing architecture, but rather as a disaggregated and pooled architecture that significantly improves the resource-sharing aspect in the data center, and then offloads compute to the processing capabilities behind the switch, freeing the host processor for workload execution. So that's really the benefit of what we are looking for in the converged architecture.
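A minimal sketch of that peer-to-peer path is shown below, with hypothetical helpers standing in for the switch-local transfers; the host only submits work descriptors, and the payload never crosses the host CPU.

```c
#include <stdint.h>
#include <stddef.h>

struct p2p_desc { uint64_t ssd_lba; void *buf; size_t len; };

/* Stubs standing in for switch-local peer-to-peer transfers. */
static void p2p_read_from_ssd(const struct p2p_desc *d)  { (void)d; }
static void p2p_write_to_ssd(const struct p2p_desc *d)   { (void)d; }
static void accel_transform(void *buf, size_t len)       { (void)buf; (void)len; }

/* One offloaded unit of work: the accelerator pulls an extent from the
 * PCIe SSD into memory behind the switch, processes it in place, and
 * writes the result back, all without touching the host CPU. */
void accel_process_extent(struct p2p_desc *d)
{
    p2p_read_from_ssd(d);                 /* SSD -> accelerator/CXL memory */
    accel_transform(d->buf, d->len);
    p2p_write_to_ssd(d);                  /* result back to the SSD */
}

int main(void)
{
    static uint8_t scratch[4096];         /* stand-in for switch-attached memory */
    struct p2p_desc d = { 0, scratch, sizeof scratch };
    accel_process_extent(&d);
    return 0;
}
```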
Starting point is 00:19:51 Now, let's click down on the device itself. Historically, what we have been doing is using an NVMe SSD for block I/O workloads. The NVMe SSD provides the NVMe block I/O semantics, so the software stack has to be designed
Starting point is 00:20:21 to take advantage of the block I/O semantics, through the kernel drivers as well as the user-space implementations, to read and write data in and out of the NVMe SSD using the NVMe block I/O protocols. With a CXL memory buffer, you essentially have load/store semantics that the software can take advantage of. So these are two distinct protocols, two distinct devices,
Starting point is 00:20:50 and the software has to change. And then, of course, on the platform side, the firmware device training, as well as the reliability, availability, and serviceability types of features, need to be implemented differently based on whether it is a CXL device or an NVMe SSD.
Starting point is 00:21:10 They do change; they are pretty much unique to each one of the device types. So imagine a situation where you have a converged controller, where the controller provides both the NVMe and CXL protocol feature sets and exposes the NVM media based on the type of software requirements that you have. Let's say, for example, you want to use the NVMe SSD for inference embedding table lookups. Embedding tables are basically in-memory array data structures where you look up embedding vectors as part of the inference execution flow.
Starting point is 00:22:02 But these embedding tables are fairly large; you're looking at multi-gigabyte types of capacity. If you were to take the NVM media and expose it with CXL.mem semantics to the CPU, you could take advantage of the significant amount of capacity
Starting point is 00:22:22 that NVM media provides, as opposed to traditional DDR media. If you don't have this functionality, the software essentially has to reach the NVM media through block I/O semantics, pull that data into host DDR memory, and then do the lookups on top of that. That has a significant number of disadvantages. One, you need to make the software change. Two, the performance is somewhat slower, because you are issuing the block I/O and then translating that into load/store accesses through the host DRAM. Instead, if you provide just
Starting point is 00:23:09 CXL.mem semantics through the converged controller, you can pretty much bypass the software changes as well as improve the performance. So the workloads that really require capacity and use load/store types of semantics can significantly benefit if you have a converged controller that can expose the NVM media through load/store semantics using the CXL protocol, right? So that's really where the benefit lies for the set of workloads out there that can take advantage of it. So that's the converged device architecture.
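To make the contrast concrete, here is a small C sketch of the two lookup paths (the mapping and block-read helpers are hypothetical). With the converged controller exposing the media through CXL.mem, the lookup is ordinary pointer arithmetic and loads; with block I/O, the software first has to copy the containing block into host DRAM.

```c
#include <stdint.h>
#include <string.h>

#define EMB_DIM 64                        /* floats per embedding vector */

/* Load/store path: the table simply appears as memory behind CXL.mem. */
static inline const float *lookup_cxl_mem(const float *table, uint64_t row)
{
    return table + row * EMB_DIM;         /* address arithmetic plus loads */
}

/* Block I/O path: read the 4 KiB block holding the row into DRAM first.
 * (Stub; a real implementation would issue an NVMe read.) */
static int nvme_read_block(int fd, uint64_t lba, void *buf)
{
    (void)fd; (void)lba;
    memset(buf, 0, 4096);
    return 0;
}

static void lookup_block_io(int fd, uint64_t row, float out[EMB_DIM])
{
    uint64_t off = row * EMB_DIM * sizeof(float);   /* 256-byte rows, block-aligned */
    uint8_t block[4096];
    nvme_read_block(fd, off / 4096, block);         /* extra copy into host DRAM */
    memcpy(out, block + off % 4096, EMB_DIM * sizeof(float));
}

int main(void)
{
    static float table[1024 * EMB_DIM];   /* stand-in for the mapped CXL.mem window */
    float row[EMB_DIM];
    lookup_block_io(-1, 42, row);
    return (int)(lookup_cxl_mem(table, 42)[0] + row[0]);
}
```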
Starting point is 00:23:35 Now, there has been a lot of focus on computational storage types of offload capability. There has been a lot of work in SNIA on the standards as well.
Starting point is 00:24:08 The current implementations of computational storage use block I/O. So you essentially use block I/O in the NVMe protocol to submit a command to the acceleration capability inside the NVMe SSD, which will execute that command, process the data within the SSD itself, and return the response back through the NVMe protocol to the host CPU software stack. And then we have the CXL acceleration capability, where you are looking at .mem semantics and are able to submit commands to the accelerator using a cache-coherent interface. What that does is, all of a sudden, give you a mechanism to interact with the host CPU cache hierarchy and seamlessly interoperate between the acceleration capability within the CXL device and the software stack running on the host
Starting point is 00:25:08 CPU. So you essentially have two sets of protocols, two sets of computational offload implementations, and the architecture around them. If you were to look at this as a converged architecture, you essentially don't have to change the software stack, and the software stack continues to use the .mem semantics.
Starting point is 00:25:36 And then you provide the acceleration capability, whether it is computational storage or computational memory; irrespective, you can run it through the CXL acceleration capability and then tap into the NVM media through the NVMe controller. So you can actually unify this architecture for computational offloads. The benefit, again, is that you don't have to worry about the protocol nuances. You don't have to worry about a translation layer within NVMe. You can bury the protocol nuances, avoid software changes, and deliver a seamless computational offload capability through the CXL accelerator functionality. That's really the benefit of what the converged architecture will give you.
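One way to picture that unified submission path is the sketch below. The command layout and queue placement are assumptions for illustration: the queue lives in cache-coherent CXL memory on the accelerator, the host only performs loads and stores, and the accelerator reaches the NVM media through its own NVMe controller.

```c
#include <stdint.h>
#include <stdatomic.h>

enum offload_op { OFFLOAD_FILTER = 1, OFFLOAD_COMPRESS = 2 };

/* One slot of a command queue that lives in cache-coherent CXL memory on
 * the accelerator. The device sets `done` when the command completes. */
struct offload_cmd {
    uint32_t    op;
    uint64_t    lba, nblocks;     /* input location on the NVM media */
    uint64_t    result_off;       /* where to place results in device memory */
    atomic_uint done;
};

/* Submit one command; `q` points into the mapped CXL.mem window. */
void offload_submit(struct offload_cmd *q, uint32_t op,
                    uint64_t lba, uint64_t nblocks, uint64_t result_off)
{
    q->op = op;
    q->lba = lba;
    q->nblocks = nblocks;
    q->result_off = result_off;
    atomic_store_explicit(&q->done, 0, memory_order_release);
    /* A real flow would now ring a doorbell (or let the device snoop the
     * stores coherently) and later poll q->done for completion. */
}

int main(void)
{
    static struct offload_cmd queue[1];   /* stand-in for the mapped window */
    offload_submit(&queue[0], OFFLOAD_FILTER, 0, 8, 4096);
    return 0;
}
```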
Starting point is 00:26:06 In summary, CXL enables storage and memory architecture innovations. This includes the converged controller concept, which provides both storage and memory semantics to enable workloads that require memory-centric load/store semantics. It includes the CXL acceleration capability to enable computational storage offload implementations, along with a computational memory offload architecture, to deliver a seamless acceleration interface to the host platform.
Starting point is 00:27:08 And then having DDR persistence within the CXL memory buffer, as opposed to depending on platform-based implementations, to speed up the SDS metadata operations, specifically the write operations, and improve the cluster-wide write data throughput. And last but not least, bringing the high-availability architecture from storage appliance types of implementations into scale-out architectures using CXL constructs, to reduce the cluster-wide rebuild and recovery times and enable the high-availability uptime goals for the data center. Thank you for your time.
Starting point is 00:28:02 Thanks for listening. If you have questions about the material presented in this podcast, you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
