Storage Developer Conference - #155: Innovations in Load-Store I/O Causing Profound Changes in Memory, Storage, and Compute Landscape

Episode Date: October 26, 2021

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast, Episode 155. Greetings. My name is Debendra Das Sharma. I'm excited to participate in the SDC conference. I hope everyone is staying safe and healthy. I will talk about how load-store I/O is driving exciting innovations and causing profound changes in the compute and storage infrastructure.
Starting point is 00:01:02 Here is the agenda for today. We will talk about interconnects, especially load-store interconnects, as an important pillar in the compute, memory, and storage landscape. Then we will talk about the evolution of load-store interconnects and the new Compute Express Link, or CXL, standard, which is poised to be a game changer. We will discuss in depth how CXL as an open interconnect standard
Starting point is 00:01:30 will drive innovations in the memory and storage space. I will go out on a limb and provide my projections of how these innovations will shape the future of compute. The picture here shows the interconnects at different levels of the data center looking from outside in. Other segments, such as edge computing and on-premise computing, are using cloud infrastructure concepts.
Starting point is 00:01:58 A data center comprises racks of servers. They're connected through network fabrics, with optical connections for longer distances and copper for shorter distances. Each rack consists of multiple chassis or servers as shown in the pictures here on the left. Each drawer can be a compute drawer, a memory drawer, or a storage drawer. The compute drawer has the memory and storage needed to act as a compute server. A storage or memory drawer may have compute elements, but is used primarily for storage or memory. Each server in the rack is a single domain or a node, and it's typically contained within a chassis or a drawer as shown here. This can be a single CPU socket system.
Starting point is 00:02:46 Typically, we'll have a two-socket system connected through a cache-coherent link. The interconnects at this level are the CPU-to-CPU symmetric coherency link, such as our Ultra Path Interconnect, or UPI, if you have a symmetric multiprocessing system, or SMP for short; memory interconnects; PCI Express, which is also known as PCIe; and Compute Express Link, known as CXL. These are known as load-store interconnects. They're very tightly coupled, and we'll discuss more about these and their relationship with fabrics in a bit.
Starting point is 00:03:19 Each server node connects to the networking fabric through a NIC. Increasingly, smart NICs are being deployed to help offload the compute complex with networking as well as with infrastructure tasks. The fundamental challenge that we have is to treat the data center as a computer. Whether it is compute, memory, storage, or any other resource, we have to get to warehouse-scale efficiency. If we develop the proper interconnect infrastructure, we can move resources or tasks around seamlessly and securely. This will result in better power efficiency, better performance, and better total cost of ownership, abbreviated as TCO. We are making good strides towards that goal from an interconnect perspective,
Starting point is 00:04:08 but we still have a long way to go. The graph here is from IDC and shows the explosion of data. Some of the drivers for this explosion are cloud, 5G, a lot of data coming in from sensors, automotive, IoT, you name it. Applications are demanding ever more aggressive time to insight, in real time, on these large data sets. As the data volume grows and the workload requirements shift
Starting point is 00:04:34 to address new business opportunities, it puts new levels of pressure on the underlying infrastructure. We have seen our customers struggle with these new challenges at scale. Ensuring a trusted, stable, and reliable experience can be challenging. Workload demands are changing dynamically, impacting performance, memory, and storage needs. This is resulting in stranded or overtaxed resources and the inefficient use of data center capacity. For example, a server may have unused memory, whereas another server in the data center
Starting point is 00:05:09 could have used more memory in order to service a virtual machine that has a higher memory footprint. We have to meet these scaling challenges. Latency, bandwidth, capacity, these are all important to meet the customer demand. Explosion of this data is leading to rapid innovation. We have to move the data faster, store more data, process all that data quickly, seamlessly, efficiently and securely, and determine where in the storage hierarchy to store that data. In each of these aspects, interconnects play a critical role in the
Starting point is 00:05:46 compute continuum. Let me switch gears a bit and talk about my taxonomy of different node-level interconnects and identify some of the key characteristics and their differences. I would broadly categorize the wired interconnect world into two categories: latency tolerant and latency sensitive. The fundamental reason stems from the scale at which they are deployed, resulting in some of the key differences and characteristics that are shown here. Networking interconnects, be it one of the flavors of Ethernet, InfiniBand, or Omni-Path, need to scale in connectivity to hundreds of thousands of nodes at the data center level. Hence, they are narrow and they lead the speed transition,
Starting point is 00:06:32 often at the expense of latency and cost. In contrast, the broad class of load-store I/O is latency and cost sensitive. The examples are PCI Express, CXL, and UPI in our case, effectively the SMP multiprocessing interconnect, the cache-coherent interconnect, and the memory interconnect. Because of memory access semantics,
Starting point is 00:06:57 these can be uber latency sensitive. Even a nanosecond impact to the latency for memory or coherent accesses may have a noticeable performance impact in some applications. I have provided the latency numbers here for comparison purposes. Moving beyond the 32 gigabit per second data rate to the 64 gigabit per second data rate with PAM4 signaling is a major inflection point for the load-store interconnects. We have to keep the latency flat while staying within the fixed silicon area and cost constraints. We will now delve into the evolution of load-store interconnects.
Starting point is 00:07:35 Let us look at the evolution of PCI Express, the underlying interconnect technology driving the physical layer of different types of load-store I/O interconnects. PCI Express debuted with the release of PCIe 1.0, as shown in the table here, in 2003. Since then it has evolved through five generations and is currently moving to the sixth generation. In every generation we double the data rate while maintaining full backward compatibility; you can see that in the table as well as in the graph in the picture over here. Since PCI Express has been evolving in a cost-effective, scalable, and power-efficient manner with full backward compatibility, it has become the ubiquitous I/O across the compute continuum. PC, handheld, workstation, server, cloud, enterprise, HPC, embedded, IoT, automotive,
Starting point is 00:08:23 they all use PCI Express. You have one stack, one silicon, but capable of working seamlessly across multiple form factors across the entire compute continuum. We have different widths for different types of devices. We have got x1 through x16. Full backward compatibility ensures that an x16 Gen 5 device will interoperate seamlessly with an x1 Gen 1 device. We have multiple PCI Express devices that are leading the technology transition from generation to generation; things like networking and XPUs are some of the leading candidates. Because PCI Express supports multiple protocols through its PHY, something known as alternate
Starting point is 00:09:01 protocol, the ones with the coherency and memory semantics, such as CXL, are the leading consumers for this higher data rate. Fundamentally, as the compute capability goes up within the given power envelope, in keeping with Moore's law and Dennard scaling, the interconnect bandwidth also needs to scale in order to effectively feed that beast.
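Since the talk keeps coming back to this generational doubling, here is a small C sketch, added for illustration and not part of the talk, that computes the approximate per-direction bandwidth of an x16 link for each generation from the per-lane data rate and the encoding overhead (8b/10b for Gen 1 and 2, 128b/130b for Gen 3 through 5; Gen 6 moves to PAM4 signaling with flit-based framing, which this rough sketch treats as overhead-free):

```c
/* Approximate per-direction bandwidth of an x16 PCIe link per generation.
   Illustrative only; real throughput also depends on packet/flit overheads. */
#include <stdio.h>

int main(void) {
    struct { const char *gen; double gt_per_s; double encoding; } gens[] = {
        { "1.0",  2.5,   8.0 / 10.0  },   /* 8b/10b encoding                  */
        { "2.0",  5.0,   8.0 / 10.0  },
        { "3.0",  8.0, 128.0 / 130.0 },   /* 128b/130b encoding               */
        { "4.0", 16.0, 128.0 / 130.0 },
        { "5.0", 32.0, 128.0 / 130.0 },
        { "6.0", 64.0,   1.0         },   /* PAM4 + flits; flit overhead
                                             ignored in this rough sketch     */
    };
    const int lanes = 16;
    for (int i = 0; i < (int)(sizeof(gens) / sizeof(gens[0])); i++) {
        /* GT/s per lane * encoding efficiency / 8 bits per byte = GB/s per lane */
        double gbs_per_lane = gens[i].gt_per_s * gens[i].encoding / 8.0;
        printf("PCIe %s: ~%6.1f GB/s per direction for an x%d link\n",
               gens[i].gen, gbs_per_lane * lanes, lanes);
    }
    return 0;
}
```

At Gen 5 an x16 link lands at roughly 64 GB/s per direction, and Gen 6 doubles that again, which is the bandwidth scaling the talk refers to.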
Starting point is 00:09:23 Storage is steadily migrating to PCI Express due to its predictable performance cadence with low latency. As shown in this IDC graph here on the left-hand side, you can see the traction of PCI Express there. NVM Express has gained wide traction. The picture here in the middle shows how PCIe results in fewer components and provides a much better alternative for storage to connect to the system. PCI Express I/O virtualization, as shown here, is important for storage applications. Multiple VMs can physically share the storage resources, which results in better efficiency while providing strict isolation guarantees.
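As a hedged illustration of how that I/O virtualization surfaces to system software, the following sketch lists SR-IOV capable PCIe devices on a Linux system by reading their sriov_totalvfs and sriov_numvfs sysfs attributes. It is an editorial example, not something from the talk, and it assumes a Linux sysfs layout:

```c
/* List SR-IOV capable PCIe devices and their virtual-function counts on Linux
   by scanning sysfs. Sketch only; error handling is minimal. */
#include <dirent.h>
#include <stdio.h>

static int read_int(const char *path, int *out) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;                  /* attribute absent: not SR-IOV capable */
    int ok = fscanf(f, "%d", out) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

int main(void) {
    const char *base = "/sys/bus/pci/devices";
    DIR *d = opendir(base);
    if (!d) { perror("opendir"); return 1; }

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.') continue;
        char path[512];
        int total = 0, in_use = 0;

        snprintf(path, sizeof(path), "%s/%s/sriov_totalvfs", base, e->d_name);
        if (read_int(path, &total) != 0) continue;   /* skip non-SR-IOV devices */

        snprintf(path, sizeof(path), "%s/%s/sriov_numvfs", base, e->d_name);
        read_int(path, &in_use);

        printf("%s: %d VFs enabled of %d supported\n", e->d_name, in_use, total);
    }
    closedir(d);
    return 0;
}
```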
Starting point is 00:10:07 Reliability, availability, and serviceability, RAS for short, and things like hot plug are critical for storage. PCI Express has enhanced its RAS offering through things like downstream port containment and enhanced downstream port containment, where the failure of one SSD does not bring down all the other SSDs, in spite of having load-store semantics. One of the compelling value propositions of PCI Express is that we have one specification,
Starting point is 00:10:37 one stack, one silicon with full backward compatibility. But we do support multiple form factors, which is expected, because a handheld device will have a very different power profile and form factor need than, let's say, a storage chassis. And that has been key to PCI Express's adoption across multiple market segments. Now let's switch and see why we need CXL as a new class of open standard interconnect. The picture on the top here shows the system view with PCI Express only.
Starting point is 00:11:12 Memory that is attached to the CPU, such as DRAM, is mapped as coherent memory. Data consistency is guaranteed by hardware. On the other hand, memory that is connected to a PCI device is also mapped to the system memory, but it is uncacheable memory, or memory-mapped I/O space. So it cannot be cached in the system. So the memory that is attached to the CPU today is different in its semantics than memory that is attached to a PCI device. So when a CPU wants to access DRAM memory, it can simply cache the data and do the reads. If it updates the data, it will later on do a write-back. That works just fine. Hardware basically takes care of data consistency. On the other hand,
Starting point is 00:11:50 if it wants to access memory that is attached to an I/O device, because it is uncacheable, it has to send reads and writes across the PCI Express link for every access, and each has to be completed by the device. By the same token, whenever a PCIe device needs to access system memory, it basically issues reads and writes across the link; that is known as DMA, direct memory access. And the root port is the one that takes care of merging that data, keeping track of the producer-consumer semantics of PCI Express, and basically merging it with the cache-coherent semantics in the system so that you get data consistency between these two different ordering domains.
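To make the two kinds of access concrete, here is a small C sketch, added for illustration only, that contrasts ordinary cacheable DRAM with an uncacheable memory-mapped I/O mapping of a PCIe BAR. The device path and register offsets are hypothetical; the point is simply that every load or store to the BAR mapping becomes a transaction on the link, while the DRAM accesses can be satisfied from the cache:

```c
/* Cacheable DRAM access vs. uncacheable MMIO access to a PCIe device BAR.
   Paths and register offsets are hypothetical; sketch only. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Ordinary DRAM: the CPU may cache these loads/stores freely. */
    uint64_t *dram = malloc(sizeof(uint64_t));
    *dram = 42;                         /* may sit in cache, written back later */

    /* Device BAR exposed through sysfs (illustrative path). Accesses to this
       mapping are uncacheable MMIO: every load/store crosses the PCIe link. */
    int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); free(dram); return 1; }
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); close(fd); free(dram); return 1; }

    bar[0] = 0x1;                       /* store goes across the link, not cached */
    uint32_t status = bar[1];           /* load is a round trip to the device     */
    printf("dram=%llu status=0x%x\n", (unsigned long long)*dram, status);

    munmap((void *)bar, 4096);
    close(fd);
    free(dram);
    return 0;
}
```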
Starting point is 00:12:31 By the same token, whenever you want to do peer-to-peer, that's also a non-coherent kind of access. These kinds of accesses really work. The producer-consumer ordering model works well for a wide range of I/O devices and things like bulk transfers, like getting data in and out of a storage device or moving things in and out of a network. We definitely want to preserve that for those usages. However, we are seeing the emergence of a new class of applications, a new type of things
Starting point is 00:12:59 like accelerators that want to work on a much finer-grained basis with the processors. We are seeing a lot of demand for heterogeneous types of processing. Different entities are good at processing different types of things; that's heterogeneous processing. And there you want finer-grained sharing of the data. So the PCI Express mechanism needs to be augmented with enhancements, as shown over here. This way the devices should be able to cache the system's memory in their local cache, and by the same token, the memory that is attached to the devices should be able to be mapped to the coherent memory space. So this way the devices don't really need to DMA in and out of memory when the data needs to move between the CPU and the device or between devices.
Starting point is 00:13:52 So you can basically work on the problem, on the data set, in a collaborative manner, and cache coherency takes care of the rest. You don't have to literally move the entire data structure back and forth. With CXL, we enable heterogeneous computing and disaggregation, as we talked about here, and we'll see more about that. Memory resources anywhere in the system can be shared efficiently between these computing elements due to the memory and coherency semantics. This results in efficient movement of operands and results. Furthermore, we can use CXL for memory bandwidth and memory capacity expansion, supporting multiple types of memory tiering with different memory types. CXL is an open industry standard with 150-plus member companies. All CPU, GPU, memory, and storage vendors are in the consortium. There is tremendous movement in the ecosystem with
Starting point is 00:14:37 several interop and product announcements. In short, CXL is poised to be a game changer in the industry. Compute Express Link runs on the PCIe infrastructure and leverages PCI Express, as shown over here. It uses PCIe Gen 5. You can see here a PCIe stack and a CXL stack on the CPU. They both run on PCIe Gen 5 at 32 gigabits per second. We can downgrade to 8 or 16 gigabits per second if needed. The widths supported on CXL are x4, x8, and x16.
Starting point is 00:15:08 It's full plug and play capable. So you can plug in either a PCIe card here or a CXL card here in this slot. Very early in the link training, PCI Express has this alternate protocol mode that figures out if this is a PCIe card or is it a CXL card. If it is a CXL card, then from very early in the training process, it hands over to CXL and you come up as a CXL device. If it is a PCIe, you basically come up as a PCIe device. It's all done in the hardware. No involvement of software for training that link.
Starting point is 00:15:41 So it automatically detects and comes up in the right mode. We completely leverage PCI Express in order to accomplish this goal. So we basically enhance the PCIe offerings; given that it's a ubiquitous infrastructure, it makes sense for us to only invent the parts that are really needed. So the approach of CXL is that it's defined from the ground up
Starting point is 00:16:04 to address the challenges in the evolving compute landscape by making both heterogeneous computing as well as different types of memory efficient. As we saw, CXL is built on top of the PCI Express infrastructure and leverages PCI Express. Fundamentally, it overlays caching and memory protocols on top of the existing PCIe protocols. There are three protocols: CXL.io, CXL.cache, and CXL.memory. CXL.io is the I/O part of the stack. It's almost identical to PCI Express. We use it for things like device discovery, configuration, register access, interrupts, direct memory access, I/O
Starting point is 00:16:41 virtualization, all of those. It's mandatory. CXL.cache protocol is what enables the device to cache the data in the system. CXL.memory protocol is also optional for the device. It allows the processor as well as other CXL devices, which have CXL.cache semantics, to access the device memory in a coherent manner. Latency is a critical component to ensure system performance for CXL. We are expecting latencies for CXL.Cache and CXL.Memory type
Starting point is 00:17:16 of accesses to be in line with symmetric cache coherency links, such as our UPI. So for example, a CPU load to use latency is expected to be significantly less than 200 nanoseconds, which is, again, significantly less than half of a comparable PCI access. The other aspect of CXL which is extremely important is CXL is an asymmetric protocol.
Starting point is 00:17:41 So the protocol flows, message classes are different between the host processor versus the devices. That's a conscious decision to keep the protocol simple and the implementation easy for devices. Host processors have their proprietary home agent functionality that is shown here, which basically orchestrates cache coherency in the system. There are multiple CPU cores, you got your, you know, DMA, IO agent that basically merges the DMA with the cache coherency. So there is an inherent home agent that is present. It is very dependent on the underlying microarchitecture. CXL just
Starting point is 00:18:15 abstracts the caching functionality. It assumes that there is a home agent. It doesn't specify how that home agent behaves. It just does the caching agent functionality, which is the simpler part. And it basically does that by sending a set of commands that conform to the MESI protocol: modified, exclusive, shared, and invalid. And that way, the device side of the implementation becomes easy. The home agent is the one that orchestrates coherence, resolves conflicts, all of that. The device doesn't need to worry about any of that. By the same token, even if you are a memory device, you really don't need to understand the coherency flow. You just need to provide access to the data, which is what CXL defines. So this asymmetry has helped us
Starting point is 00:19:00 tremendously with the backward compatibility without stifling performance or innovation. Here are some of the typical usage models that are targeted for CXL 1.0 and 1.1. There are three types of devices: type 1, type 2, and type 3. An example of a type 1 device would be a NIC or smart NIC or an accelerator, which uses .io and .cache semantics. .io is used for bulk transfer, but caching is used for things like advanced atomics support. For example, if you have a PGAS NIC, a partitioned global address space NIC, you might want to do the PGAS ordering enforcement. So there, instead of trying to merge the ordering between PCIe and the PGAS, what we can do is prefetch all the cache lines that will be used
Starting point is 00:19:46 for the PGAS NIC and then complete them within the local cache in the PGAS order, which will result in much better performance. Type 2 devices are GPGPUs, dense computation, where all three protocols are used. Here, accelerators and host processors work closely together. There is no need to hand off entire data structures and results. Only the cache lines that are actually used will be cached, and cache coherency enforces global visibility. Last but not least is the memory buffer. Given that SerDes is very bandwidth-efficient per pin,
Starting point is 00:20:19 we can use CXL for memory bandwidth and memory capacity expansion. This also enables heterogeneous memory, including persistent memory, multiple memory hierarchies, since the memory controller is within the device. Also of importance is that you can deploy a type 2 device in a traditional type of memory device and deliver even more compelling value propositions that we will see in a few slides.
Starting point is 00:20:49 Now let's see what CXL 2.0 does while maintaining full backward compatibility with CXL 1. CXL 1, whether it is 1.0 or 1.1, was based on a single node. What CXL 2.0 does is it enables resources to be pooled across multiple nodes. So essentially, CXL 2.0 enables one to move from a drawer level to a subrack or a rack level. This is shown in the diagram here. On the top, you have multiple hosts or nodes having access to multiple devices.
Starting point is 00:21:19 The assignment of devices to the hosts can change over time. And this will be done by using the hot-plug flows defined by the CXL 2.0 specification. The CXL 2.0 specification defines a standardized fabric manager to orchestrate the resource allocation across multiple hosts. Each host can have its own dedicated resources that are not shown here. So it can be multi-socket with its own memory, with its own I/O devices, all of that. Those are all abstracted as a host here. So it has its dedicated resources. But in addition to that, it has access to these resources in the pool. So for example, D1, it's an accelerator.
Starting point is 00:21:59 It is assigned to H2 right now. But later on, if H2 doesn't need D1, it is going to get released to the pool, and maybe H3 will be assigned it, in which case its color will change to purple. In addition to that, CXL 2.0 further expands to allow type 3 memory devices to simultaneously belong to up to 16 hosts. In this case, part of D3's memory is assigned to H3,
Starting point is 00:22:28 and the other part of it is assigned to H2. And that itself can change over time. So this allows for partial allocation of devices to a node. CXL 2.0 architects persistence flows to work seamlessly with fabrics. It also defines software APIs for devices to enable system software to
Starting point is 00:22:45 manage the devices. Security enhancements on top of the device TLB, things like device authentication, link encryption, and security through the switches are also comprehended in CXL 2.0. In short, CXL 2.0 is a game changer. It enables the construction of disaggregated systems to improve resource utilization in order to deliver significant TCO and power-efficient performance improvements in the data center. Now let's look at how innovations in load-store I/O will change the compute landscape. CXL provides a media-independent memory interface with coherence extensions. It does so while preserving all PCIe functions and PCIe services.
Starting point is 00:23:31 For example, the device management of PCIe can be fully leveraged, including NVM Express. The picture on the top right here shows the memory hierarchy. We have load-store semantics that cover DRAM, NRAM, MRAM, and 3D XPoint, effectively storage class memories using load-store semantics. What CXL enables is to move the storage class memory into the cacheable space.
Starting point is 00:23:55 NAND is accessed through different semantics using files, objects, blocks connected through PCI Express. These two semantics are different, as shown by the semantics wall in the diagram here. CXL enables the storage class memory to be mapped into the coherent space. As a result, CXL can span everything from the DRAM through the storage class memories as well as NAND kind of memory using all the semantics using CXL.Memory semantics for the top part and CXL.IO for the other bottom part and it can also use NVM express infrastructure with DMA type operations.
Starting point is 00:24:33 CXL enables new compute and memory architectures with memory and coherence extensions. One can extend the memory hierarchy to include the storage class memory and even NAND using CXL. As a result, CXL provides the additive bandwidth and capacity over traditional DIMMs across multiple types of memory and hierarchy without interference and that is possible because CXL is memory neutral. The other advantage of CXL is its leverage of PCIe form factor. We are no longer constrained by considerations such as consuming a precious DIMM slot and having the power constraint of 15 watts on a DIMM slot. We have lots of choices of form factors.
Starting point is 00:25:19 We can even enable higher power profiles, things like 25 watts, 40 watts if needed. And the benefit of leveraging PCI Express doesn't stop there. We get standard device discovery, configuration management. We get the software leverage, things like PCI driver, all the ACPI enhancements. For example, we can deploy the heterogeneous memory attribute table, HMAT to describe properties of the memory.
Starting point is 00:25:41 We can do DMA engine for data move. We can use IO virtualization services of PCI Express. Now let's look at some of the applications of CXL in the memory and storage segment. As discussed before, DRAM memory can be behind CXL as shown over here since memory bandwidth and capacity increases linearly with the core count, and different customers have different needs for core count,
Starting point is 00:26:09 today one has to build multiple platforms with different number of DIMM slots. However, with CXL, one can envision a common platform as shown over here. You've got the regular DIMM slots, and you have got the DRAM through CXL that you can plug in as many as you need. So that way you have a common platform, and you also have the flexibility of adding more memory bandwidth and more memory capacity.
Starting point is 00:26:36 CXL offers bandwidth scaling with the width and frequency. So unlike traditional memory, one can allocate different bandwidth with the same capacity by changing the link width, for example. As we had stated earlier, PCIe data rates are doubling every two to three years while staying within the power, performance, and channel reach, as well as the cost profile. So today you can get 32 gigabytes per second per direction if you have an x8 Gen 5 CXL device. Once we go to, let's say, the Gen 6 data rate of 64 gigatransfers per second, we can get 128 gigabytes per second per direction on an x16 link. And that's a lot of bandwidth that we have at our disposal. The other benefit of CXL is pin efficiency. DDR requires a lot of pins. Pin growth in the CPU package by increasing the number of memory channels to sustain core performance is not going
Starting point is 00:27:26 to be sustainable in the long run. This is where CXL can help deliver more bandwidth on a per-pin basis. You do get higher latency with CXL than a local DRAM access; that's inherent, because crossing a SerDes invariably has more latency than a parallel bus-based infrastructure. However, CXL access is similar in latency to a DRAM access in a remote CPU socket in an SMP system, where NUMA domains have been very well established. The question is how much of memory stays in the traditional DIMM and how much of it really migrates to CXL. With time, my prediction is we will see more and more memory migrate to CXL. And I can envision systems like this on the right-hand side here where pretty much all the memory is CXL and you've got on-package memory.
Starting point is 00:28:17 That will happen over time. Here, one can even have a memory hierarchy where on-package memory acts as a memory-side cache or a level 4 cache. Backward compatibility ensures interoperability, something that we don't get today in the DIMM world. So this offers a very compelling interconnect for memory to migrate to.
Starting point is 00:28:40 It makes perfect sense to me that NVDIMM moves to CXL, as shown in the picture over here. We can have storage class memory, such as NAND, that can back up the DRAM, and it will have the non-volatile characteristics. One can even envision deploying a DMA engine to back up memory from other DIMM slots. The advantage of this is that it will be hot-pluggable, in a form factor like this, and hence,
Starting point is 00:29:10 it will be serviceable, something that we just do not get today. It can also be multi-headed and connect to multiple hosts for failover access. We do not have to consume a precious DIMM slot, and we are not constrained by the form factor or power constraint. Persistent memory behind CXL is now cacheable. All of this while we maintain hot-plug serviceability and multi-headed access for availability. One can even build very large capacity memory by having DRAM as the memory-side cache for the entire or a large chunk of
Starting point is 00:29:41 the storage class memory, like NAND, for example. So this is like a 2LM kind of model, a multi-level memory model. One can build very large capacity that way, and you can use the HMAT table to interleave accordingly, and you can also report the latency characteristics accordingly. The CXL devices in both these types of memory can be augmented with DMA engines to move data efficiently, and accelerators can be deployed for local processing of the data. One of the compelling new usages for CXL is computational storage. Imagine putting an accelerator, as shown over here, in front of memory for near-memory compute. So functions such as compression, encryption, computation for reliability like RAID 5 and RAID 6, compaction for key-value stores, search engines, vector processing for AI/ML applications,
Starting point is 00:30:42 all of these can be done through these accelerator engines. DMA engines can be deployed to transfer data to where it is needed. So this can also use the standard drivers and management frameworks that we have developed over the years for PCI Express. One can envision that the storage class memory, as we talked about, can be mapped into the coherent space
Starting point is 00:31:04 and have a local DRAM back it up. So that's also possible, as we have shown over here. So the picture here shows that how you can build a very efficient system with all of these capabilities combined with the memory capacity and memory bandwidth expansion that CXL offers. What's important is everything in CXL can be located behind even a low latency switch
Starting point is 00:31:28 and it can be pooled across multiple clusters in a rack. The bottom line is that with a simple set of .cache and .mem enhancements, CXL enables one to build on top of the existing PCI Express infrastructure and services, and it offers a very effective and compelling interconnect. It enables scaling with heterogeneous processing and memory, with a shared, cacheable memory space that is accessible to all using the same mechanism.
Starting point is 00:31:59 The slide here offers a view of how to treat storage plus memory as a byte-addressable data store, with storage-style data management and protection services, while being accessible through the load-store semantics of CXL. The byte-addressable characteristics are, of course, load-store access; it offers cacheability on top, it is scalable to petabytes-plus, and it is still low latency because of the .memory access latencies. And you can still preserve the storage functions like replication, snapshots, and cluster-wide data sharing. And with CXL 2.0 and beyond, we are moving from a node level to a rack level, so the cluster-wide aspect also applies there.
Starting point is 00:32:49 CXL enables us to extend the load-store interconnect across multiple hosts. So it's a fundamental paradigm shift that we are seeing since load-store semantics so far have been constrained to captive resources, whether it is processing, IO, memory, or storage, all inside one host. Composable disaggregated infrastructure at rack level, as shown here, and potentially across racks, is essential to realize the data center of the future vision. That vision is represented pictorially here. We have pools
Starting point is 00:33:17 of compute nodes, multi-tiered memory, XPUs for acceleration, and storage pool. And you have an infrastructure processing unit that acts as the connection to the networking fabric. Each of these entities are connected through CXL, as shown here in the load store fabric. This can be at the rack level or even at a pod level. The goal is to deliver almost identical performance per watt as independent servers. Going forward, we need to enhance this usage model with shared memory, message passing, and atomics among these nodes through CXL
Starting point is 00:33:52 so that we can encompass new usage models and also new segments that require high performance computing. We need to nurture the synergy between networking fabrics and load store interconnect so that we can interoperate seamlessly, provide the same look and feel for our end users, but with different performance characteristics. Clearly, if you go through the fabric, your latencies will be a lot higher. If you go through the load-store interconnect, your latencies will be a lot lower, but your scale is
Starting point is 00:34:18 going to be also lower. We already have the necessary constructs like the fabric manager, multi-head, multi-domain, atomics support, persistence flows, smart NICs, optimized flows to access system memory without involving the host, and VM migration with CXL. We still have a lot of challenges. We need to ensure that we are taking the learnings from our initial CXL implementations into account as we move forward. One of the important challenges is performance, namely latency and bandwidth. These are two sides of the coin. As we have discussed, the access latency of a direct CXL-attached device
Starting point is 00:34:55 for load-to-use should be less than, significantly less than, 200 nanoseconds, similar to a UPI access. If we go through a single level of switching, we will probably add another 100 nanoseconds of latency to that path. Low-latency switches are absolutely critical to realize this vision. These accesses through switches can leverage the existing NUMA optimizations that we have done with SMP multiprocessing. The other aspect is bandwidth. With data rates at 64 gigatransfers per second and no signs of slowing down, load-store interconnects also have hundreds of lanes per CPU socket.
Starting point is 00:35:31 Bandwidth is not going to be a bottleneck. We can aggregate bandwidth through multiple sets of switches. A rack's worth of resources is readily accessible through a single level of switching with 128 lanes if we chop them up into x4 ports. So with two levels of switches, the connectivity goes up as the square of that. Latency does increase by another 100 nanoseconds, but now we can potentially access multiple racks in a pod with two levels of switching. So with increased nodes connected through CXL, we need to be careful about the blast radius. Unlike networking, we cannot have software-based recovery mechanisms for pooled resources. So the system needs to have an acceptable FIT, or failure in time, which is the number of failures in a billion hours of operation. We also need to design shared parts with the lowest possible FIT and also ensure that containment is provided, as well as quality of service, across the multiple domains that are accessing resources through the same silicon.
Starting point is 00:36:36 We can deploy co-packaged optics to help with the reach. Last but not least, software, software, software. Everything from management to orchestration, resource allocation, RAS, and migration, we need to have a seamless experience for our users. In summary, we have a lot of opportunities to shape the data center of the future through load-store I/O interconnects by working collaboratively across disciplines and organizations.
Starting point is 00:37:03 Together, we can overcome the challenges as they may come and keep the virtuous cycle of innovation going and change the compute landscape of the future. Thank you very much. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
