Storage Developer Conference - #155: Innovations in Load-Store I/O Causing Profound Changes in Memory, Storage, and Compute Landscape
Episode Date: October 26, 2021...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast, Episode 155.
Greetings. My name is Debendra Das Sharma. I'm excited to participate in the SDC conference.
I hope everyone is staying safe and healthy. I will talk about how load-store I/O is driving
exciting innovations and causing profound changes in the compute and storage infrastructure.
Here is the agenda for today. We will talk about interconnects, especially
load-store interconnects, as an important pillar in the compute,
memory, and storage landscape.
Then we will talk about the evolution of load-store interconnects
and the new Compute Express Link, or CXL, standard,
which is poised to be a game changer.
We will discuss in depth how CXL, as an open interconnect,
will drive innovations in the memory and storage space.
I will go out on a limb and provide my projections
of how these innovations will shape the future of compute.
The picture here shows the interconnects
at different levels of the data center
looking from outside in.
Other segments, such as edge computing and on-premise
computing, are using cloud infrastructure concepts.
A data center comprises racks of servers.
They're connected through network fabrics,
with optical connections for longer distances and copper for shorter distances. Each rack consists of multiple chassis or servers
as shown in the pictures here on the left. Each drawer can be a compute drawer, a memory drawer,
or a storage drawer. The compute drawer has memory and storage needed to act as a compute server.
A storage or memory drawer may have compute elements, but is used primarily for storage or memory.
Each server in the rack is a single domain or a node, and it's typically contained within a chassis or a drawer as shown here.
This can be a single CPU socket system.
Typically, we'll have two sockets connected through a cache-coherent link. The interconnects at this level are: a CPU-to-CPU symmetric coherency
link, such as our Ultra Path Interconnect, or UPI, if you have a symmetric multiprocessing
system, or SMP for short; memory interconnects; PCI Express, which is also known as PCIe; and Compute Express Link,
known as CXL.
These are known as load-store interconnects.
They're very tightly coupled, and we'll
discuss more about these and their relationship
with fabrics in a bit.
Each server in this node connects to the networking
fabric through a NIC. Increasingly, smart NICs are being
deployed to help offload the compute complex with networking as well as with infrastructure tasks.
The fundamental challenge that we have is to treat the data center as a computer. Whether it is
compute, memory, storage, or any other resource, we have to get to warehouse-scale efficiency.
If we develop the proper interconnect infrastructure, we can move resources or
tasks around seamlessly and securely. This will result in better power efficiency,
better performance, better total cost of ownership, abbreviated as TCO. We are making good strides towards that goal from an interconnect perspective,
but we still have a long way to go.
The graph here is from IDC and shows the explosion of data.
Some of the drivers for this explosion are cloud, 5G,
a lot of data coming in from the sensors, automotive, IoT, you name it.
Applications are demanding ever more aggressive time to insight,
in real time, on these large data sets.
As the data volume grows
and the workload requirements shift
to address new business opportunities,
it puts new levels of pressure
on the underlying infrastructure.
We have seen our customers struggle
with these new challenges at scale.
Ensuring a trusted, stable, and reliable experience can be challenging.
Workload demands are changing dynamically, impacting performance, memory, and storage needs.
This is resulting in stranded or overtaxed resources and the inefficient use of data center capacity. For example, a server may have unused memory, whereas another server in the data center
could have used more memory in order to service a virtual machine that has a higher memory
footprint.
We have to meet these scaling challenges.
Latency, bandwidth, capacity, these are all important to meet the customer demand.
Explosion of this data is leading to rapid innovation.
We have to move the data faster, store more data, process all that data quickly, seamlessly,
efficiently and securely, and determine where in the storage hierarchy to store that data.
In each of these aspects, interconnects play a critical role in the
compute continuum. Let me switch gears a bit and talk about some of my taxonomy of different node
level interconnects and identify some of the key characteristics and their differences. I would
broadly categorize the wired interconnect world into two categories: latency tolerant and latency sensitive.
The fundamental reason stems from the scale in which they are deployed,
resulting in some of the key differences and the characteristics that are shown here.
Networking interconnects, be it one of the flavors of Ethernet,
InfiniBand or Omnipath, they need to scale in connectivity to hundreds of thousands of nodes at the data center level.
Hence, they are narrow and they lead the speed transition,
often at the expense of latency and cost.
In contrast, the broad class of load-store I/O
is latency and cost sensitive.
The examples are PCI Express, CXL, and UPI in our case,
effectively the SMP multiprocessing interconnect,
cache-coherent interconnects, and memory interconnects.
Because of memory access semantics,
these can be uber latency sensitive.
Even a nanosecond impact to the latency
for memory or coherent accesses may have noticeable
performance impact in some applications. I have provided the latency numbers here for
comparison purposes. Moving beyond the 32 gigabit per second data rate to 64 gigabits per second
with PAM4 signaling is a major inflection point for load-store interconnects. We have to
keep the latency flat while staying within the fixed silicon area and cost constraints.
We will now delve into the evolution of load store interconnect.
Let us look at the evolution of PCI Express,
the underlying interconnect technology driving the physical layer of different types of load-store I/O interconnects. PCI Express debuted
with the release of PCIe 1.0, as shown in the table here, in 2003. Since then it has evolved through
five generations and is currently moving to the sixth generation. In every generation we double the
data rate while maintaining full backward compatibility; you can see that in the table
as well as in the graph in the picture over here. Since PCI Express has been evolving in a cost-effective, scalable, and power-efficient
manner with full backward compatibility, it has become the ubiquitous I/O across the compute
continuum. PC, handheld, workstation, server, cloud, enterprise, HPC, embedded, IoT, automotive,
they all use PCI Express. You have one stack,
one silicon, capable of working seamlessly across multiple form factors across the entire
compute continuum. We have different widths for different types of devices. We have got
x1 through x16. Full backward compatibility ensures that an x16 Gen 5 device will interoperate
seamlessly with an x1 Gen 1 device.
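As a quick illustration of that cadence, here is a small sketch listing the well-known per-lane signaling rates by generation; it simply checks how each generation compares to the previous one.

```python
# PCIe per-lane signaling rates by generation, in GT/s.
PCIE_GEN_GT_PER_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0, 6: 64.0}

# Every generation roughly doubles the data rate of the one before it
# (Gen 3's 8 GT/s is the one slight exception to exact doubling; its more
# efficient encoding made up the difference in delivered bandwidth).
for gen in range(2, 7):
    ratio = PCIE_GEN_GT_PER_S[gen] / PCIE_GEN_GT_PER_S[gen - 1]
    print(f"Gen {gen - 1} -> Gen {gen}: {ratio:.1f}x")
```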
We have multiple PCI Express devices that are leading the technology transition from
generation to generation, things like networking, XPUs, those are some of the leading candidates.
Because PCI Express supports multiple protocols through its PHY, something known as alternate
protocol negotiation, the ones with coherency and memory semantics,
such as CXL, are the leading consumers
of this higher data rate.
Fundamentally, as the compute capability goes up
within the given power envelope
in keeping with Moore's law and Dennard scaling,
the interconnect bandwidth also needs to scale
in order to effectively feed that beast.
Storage is steadily migrating to PCI Express due to its predictable
performance cadence with low latency. As shown in this IDC graph here on the left-hand side,
you can see the traction of PCI Express there. NVM Express has gained wide traction.
The picture here in the middle shows how PCIe results in fewer components and provides
a much better alternative for storage to connect to the system.
PCI Express I/O virtualization, as shown here, is important for storage applications.
Multiple VMs can physically share the storage resources, which results in better efficiency
while providing strict isolation guarantees.
Reliability, availability, and serviceability,
RAS for short, and things like hot plug,
these are critical for storage.
PCI Express has enhanced its RAS offering
through things like downstream port containment,
enhanced downstream port containment,
where the failure of one SSD does not bring down all the other SSDs in spite of having load-store semantics.
One of the compelling value propositions of PCI Express is that we have one specification,
one stack, one silicon with full backward compatibility. But we do support multiple
form factors, which is expected because a handheld device will have a very different power
profile and a form factor need than, let's say,
a storage chassis.
And that has been key to PCI Express's adoption
across multiple market segments.
Now let's switch and see why we need CXL as a new class of open standard interconnect.
The picture on the top here shows the system view with PCI Express only.
Memory that is attached to the CPU such as DRAM is mapped as coherent memory.
Data consistency is guaranteed by hardware.
On the other hand, memory that is connected to a PCI device is also mapped into system memory, but it is uncacheable memory, or memory-mapped I/O space.
So it cannot be cached in the system.
So the memory that is attached to the CPU today is different in its semantics than memory that is attached to a PCI device.
So when a CPU wants to access DRAM memory, it can simply cache the data, do the reads.
If it updates the data, it will later on do a write-back.
That works just fine. Hardware basically takes care of data consistency. On the other hand,
if it wants to access memory that is attached to an I.O. device, because it is uncacheable,
it has to send reads and writes across the PCI Express link for every access and that has to
be completed by the device. By the same token, whenever a PCIe device needs to access system memory, it basically
issues the read and write across the link that is known as the DMA, direct memory access.
And the root port is the one that takes care of merging that data, keeping track of the
producer-consumer semantics of PCI Express, and basically merging it with the cache-coherent
semantics in the system
so that you get data consistency between these two different ordering domains.
By the same token, whenever you want to do peer-to-peer, that's also a non-coherent kind of access.
These kinds of accesses really work.
The producer-consumer ordering model works really well for a wide range of I/O devices,
for things like bulk transfers, getting data in and out of a storage device or moving things in and out of a network.
We definitely want to preserve that for those usages.
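To make the producer-consumer model concrete, here is a minimal, hypothetical Python sketch of the flag-and-payload pattern that this ordering model protects: the producer writes the data first and raises a flag afterwards, and the consumer reads the data only once it sees the flag. On real hardware it is PCIe's write-ordering rules that keep those two steps from being observed out of order; the names and the use of threads here are purely illustrative.

```python
import threading

# Hypothetical shared "memory": a payload buffer plus a completion flag.
payload = bytearray(8)
done = threading.Event()

def producer():
    # Step 1: write the data (think: DMA of a block into a host buffer).
    payload[:] = b"blockXYZ"
    # Step 2: only then signal completion (think: a status/doorbell write).
    # PCIe's producer-consumer ordering guarantees the data write is visible
    # before the flag write; the Event plays that role in this sketch.
    done.set()

def consumer():
    # The consumer waits for the flag, then can safely read the payload.
    done.wait()
    print("consumer sees:", bytes(payload))

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t2.start(); t1.start()
t1.join(); t2.join()
```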
However, we are seeing the emergence of a new class of applications, new types of things
like accelerators, that want to work on a much finer-grained basis with the processors. We are
seeing a lot of demand for heterogeneous processing. Different entities are good at
processing different types of things; that's heterogeneous processing. And there you want
finer-grained sharing of the data. So the PCI Express mechanism needs to be augmented with
enhancements as shown over here. So this way now the devices should be able to
cache the memory in the system in the local cache and by the same token the memory that is attached
to the devices should be able to be mapped to the coherent memory space. So this way the devices
don't really need to DMA in and out of memory when the data needs to move between CPU and the device or between devices.
So you can basically work on the problem, on the data set, in a collaborative manner, and cache coherency takes care of the rest. You don't have to literally move the entire data
structure back and forth. With CXL, we enable heterogeneous computing and disaggregation as we
talked about here and we'll see more about that. Memory resources anywhere in the system can be shared efficiently between these computing
elements due to memory and coherency semantics. This results in efficient movement of operands
and results. Furthermore, we can use CXL for memory bandwidth and memory capacity expansion,
supporting multiple types of memory tiering with different memory
types. CXL is an open industry standard with 150-plus member companies. All CPU, GPU, memory,
and storage vendors are in the consortium. There is tremendous movement in the ecosystem with
several interop and product announcements. In short, CXL is poised to be a game changer in
the industry. Compute Express Link runs on PCI infrastructure
and leverages PCI Express as shown over here.
It uses PCIe Gen 5.
You can see here a PCIe stack and a CXL stack on the CPU.
They both run on PCIe Gen 5 at 32 gigabit per second.
We can downgrade to 8 or 16 if needed.
The widths supported on CXL are x4, x8, and x16.
It's fully plug-and-play capable.
So you can plug in either a PCIe card here or a CXL card here in this slot.
Very early in the link training, PCI Express has this alternate protocol mode that figures
out whether this is a PCIe card or a CXL card.
If it is a CXL card, then from very early in the training process, it hands over to CXL and you come up as a CXL device.
If it is a PCIe, you basically come up as a PCIe device.
It's all done in the hardware.
No involvement of software for training that link.
So it automatically detects and comes up in the right mode.
We completely leverage PCI Express
in order to accomplish this goal.
So we basically enhance the PCIe offerings;
given that it's a ubiquitous infrastructure,
it makes sense for us to invent only the parts
that are really needed.
So the approach of CXL is that it's defined from the ground up
to address the challenges in the evolving compute landscape by making both heterogeneous computing and different types of memory efficient.
As we saw, CXL is built on top of PCI Express infrastructure, leverages PCI Express.
Fundamentally, it overlays caching and memory protocols on top of existing PCI protocols.
There are three protocols: CXL.io, CXL.cache, and CXL.memory.
CXL.io is the I/O part of the stack.
It's almost identical to PCI Express.
We use it for things like device discovery, configuration,
register access, interrupts, direct memory access, I/O
virtualization, all of those.
It's mandatory.
The CXL.cache protocol, which is optional for the device, is what enables the device to cache data in the system.
The CXL.memory protocol is also optional for the device.
It allows the processor, as well as other CXL devices which have CXL.cache semantics, to access the device memory in a coherent manner.
Latency is a critical component to ensure system performance
for CXL.
We are expecting latencies for CXL.cache and CXL.memory types
of accesses to be in line with symmetric cache coherency
links, such as our UPI.
So, for example, a CPU load-to-use latency
is expected to be significantly less than 200 nanoseconds,
which is, again, significantly less than half
of a comparable PCIe access.
The other aspect of CXL which is extremely important
is CXL is an asymmetric protocol.
So the protocol flows, message classes
are different between the host processor versus
the devices. That's a conscious decision to keep the protocol simple and the implementation easy
for devices. Host processors have their proprietary home agent functionality that is shown here,
which basically orchestrates cache coherency in the system. There are multiple CPU cores,
you got your, you know, DMA, IO agent that basically merges the DMA
with the cache coherency. So there is an inherent home agent that is present.
It is very dependent on the underlying microarchitecture. CXL just
abstracts the caching functionality. It assumes that there is a home agent. It doesn't
specify how that home agent behaves. It just does the caching agent functionality,
which is the simpler part. And it basically does that by sending a set of commands that
conform to the MESI protocol: Modified, Exclusive, Shared, and Invalid. And that way,
the device side of the implementation becomes easy. The home agent is the one that orchestrates
coherence, resolves conflict, all of that. The device doesn't need to worry about any of that. By the same token, even if
you are a memory device, you really don't need to understand the coherency flow. You just need to
provide access to the data, which is what CXL defines. So this asymmetry has helped us
tremendously with the backward compatibility without stifling performance or innovation.
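As a rough illustration of what the device-side caching agent has to track, here is a minimal, hypothetical sketch of MESI state handling for a single cache line; it ignores the actual CXL.cache message names and conflict handling, which the host's home agent deals with.

```python
# Minimal MESI sketch for one cache line on the device side.
# States: Modified, Exclusive, Shared, Invalid.
class CacheLine:
    def __init__(self):
        self.state = "I"   # start with no copy of the line

    def local_read(self):
        if self.state == "I":
            # Must request a readable copy from the home agent.
            self.state = "S"          # (or "E" if no other sharers exist)
        # M, E, and S all permit local reads.

    def local_write(self):
        if self.state in ("I", "S"):
            # Must request exclusive ownership before writing.
            self.state = "E"
        # Writing dirties the line; E can silently move to M.
        self.state = "M"

    def snoop_invalidate(self):
        # The host's home agent asks us to give up the line.
        if self.state == "M":
            self.writeback()          # dirty data must go back first
        self.state = "I"

    def writeback(self):
        # Placeholder for returning dirty data to the host/home agent.
        pass
```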
Here are some of the typical usage models that are targeted for CXL 1.0 and 1.1.
There are three types of devices: type 1, type 2, and type 3. An example of a type 1 device would be
a NIC, a smart NIC, or an accelerator, which uses .io and .cache semantics. .io is used for bulk transfer, but caching is used for things like advanced atomics support.
For example, if you have a PGAS NIC, a partitioned global address space NIC,
you might want to do the PGAS ordering enforcement.
So there, instead of trying to merge the ordering between PCIe and the PGAS,
what we can do is we can prefetch all the cache lines that will be used
for the PGAS NIC and then complete them within the local cache in the PGAS order, which will
result in much better performance. And type 2 devices are GPGPUs, dense computation, where all
three protocols are used. Here, accelerators and host processors work closely together. There is no
need to hand off entire data structures and results.
Only the cache lines that are being used will be cached,
and cache coherency enforces global visibility.
Last but not least is the memory buffer, the type 3 device.
Given that SerDes is very bandwidth-efficient per pin,
we can use CXL for memory bandwidth and memory capacity expansion.
This also enables heterogeneous memory,
including persistent memory and multiple memory hierarchies,
since the memory controller is within the device.
Also of importance is that you can deploy type 2
capabilities in a traditional type 3 memory device
and deliver even more compelling value propositions,
as we will see in a few slides.
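To summarize the three device types and which protocols each one carries, here is a small descriptive sketch; the mapping follows the usage models just described, and the example device names are only illustrative.

```python
# CXL device types and the protocols each one uses.
CXL_DEVICE_TYPES = {
    "Type 1": {
        "protocols": ("CXL.io", "CXL.cache"),
        "examples": "NICs / smart NICs / accelerators that cache host memory",
    },
    "Type 2": {
        "protocols": ("CXL.io", "CXL.cache", "CXL.memory"),
        "examples": "GPGPUs and dense-compute accelerators with their own memory",
    },
    "Type 3": {
        "protocols": ("CXL.io", "CXL.memory"),
        "examples": "memory buffers for bandwidth/capacity expansion",
    },
}

for dev_type, info in CXL_DEVICE_TYPES.items():
    print(f"{dev_type}: {', '.join(info['protocols'])} -> {info['examples']}")
```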
Now let's see what CXL 2.0 does while maintaining full backward compatibility with CXL 1.
CXL 1, whether it is 1.0 or 1.1,
was based on a single node.
What CXL 2.0 does is it enables resources
to be pooled across multiple nodes.
So essentially, CXL 2.0 enables one to move from a drawer level to a subrack or a rack level.
This is shown in the diagram here.
On the top, you have multiple hosts or nodes having access to multiple devices.
The assignment of devices to the hosts can change over time.
And this will be done by using the hot plug flows defined by the CXL 2.0 specification. The CXL 2.0 specification defines a standardized fabric manager to orchestrate the resource
allocation across multiple hosts. Each host can have its own dedicated resources that are not
shown here. So it can be multi-socket with its own memory, with its own I/O devices, all of that.
Those are all abstracted as a host here.
So it has its dedicated resources.
But in addition to that, it has access to these resources in the pool.
So for example, D1, it's an accelerator.
It is assigned to H2 right now.
But later on, if H2 doesn't need D1, it is going to get released to the pool,
and maybe H3 will be assigned it, in which case
its color here will change to purple.
So in addition to that, CXL 2.0 further
expands the definition so that type 3 memory devices
can simultaneously belong to up to 16 hosts.
In this case, part of D3's memory is assigned to H3,
and the other part of it is assigned to H2.
And that itself can change over time.
So this allows for partial allocation of devices
to the node.
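As a toy illustration of what such a fabric manager orchestrates, here is a minimal, hypothetical sketch of pooling: whole devices get assigned to and released by hosts, and a multi-headed type 3 memory device hands out capacity to several hosts at once. The class and method names are invented for illustration and do not reflect the CXL fabric manager API.

```python
# Toy model of CXL 2.0-style resource pooling (illustrative only).
class PooledDevice:
    def __init__(self, name):
        self.name = name
        self.owner = None           # a whole device belongs to at most one host

class PooledMemoryDevice:
    """A multi-headed type 3 device whose capacity can be split across hosts."""
    MAX_HOSTS = 16

    def __init__(self, name, capacity_gb):
        self.name = name
        self.free_gb = capacity_gb
        self.allocations = {}       # host -> GB assigned

    def assign(self, host, gb):
        if host not in self.allocations and len(self.allocations) >= self.MAX_HOSTS:
            raise RuntimeError("device already serves 16 hosts")
        if gb > self.free_gb:
            raise RuntimeError("not enough free capacity")
        self.allocations[host] = self.allocations.get(host, 0) + gb
        self.free_gb -= gb

    def release(self, host):
        self.free_gb += self.allocations.pop(host, 0)

# A fabric manager would drive calls like these as workload needs change:
d1 = PooledDevice("D1-accelerator")
d1.owner = "H2"                     # assigned to host H2 for now
d1.owner = None                     # released back to the pool later

d3 = PooledMemoryDevice("D3-memory", capacity_gb=512)
d3.assign("H3", 256)
d3.assign("H2", 128)
d3.release("H3")                    # capacity returns to the pool over time
```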
CXL 2.0 architects persistence flows
to work seamlessly with fabrics.
It also defines a software API for devices
to enable system software to
manage the devices. Security enhancements on top of the device TLB, things like device
authentication, link encryption, and security through the switches, are also comprehended in
CXL 2.0. In short, CXL 2.0 is a game changer. It enables the construction of disaggregated systems to improve resource
utilization in order to deliver significant TCO and power-efficient performance improvements
in the data center. Now let's look at how innovations in load-store I/O will change the
compute landscape. CXL provides a media independent memory interface with coherence extensions.
It does so while preserving all PCIe functions and PCIe
services.
For example, the device management of PCIe
can be fully leveraged, including NVM Express.
The picture on the top right here shows the memory hierarchy.
We have load-store semantics that cover DRAM, NRAM, MRAM,
and 3D XPoint, effectively storage class memories
using load-store semantics.
What CXL enables is moving the storage class memory
into the cacheable space.
NAND is accessed through different semantics
using files, objects, blocks connected through PCI Express.
These two semantics are different,
as shown by the semantics wall in
the diagram here. CXL enables the storage class memory to be mapped into the coherent space.
As a result, CXL can span everything from DRAM through the storage class memories as well as
NAND kinds of memory, using CXL.memory semantics for the top part and CXL.io
for the bottom part, and it can also use the NVM Express infrastructure with DMA-type operations.
CXL enables new compute and memory architectures with memory and coherence extensions. One can
extend the memory hierarchy to include the storage class memory and even NAND
using CXL. As a result, CXL provides additive bandwidth and capacity over traditional DIMMs,
across multiple types of memory and hierarchy, without interference, and that is possible because
CXL is memory neutral. The other advantage of CXL is its leverage of the PCIe form factor.
We are no longer constrained by considerations such as consuming a precious DIMM slot and
having the power constraint of 15 watts on a DIMM slot.
We have lots of choices of form factors.
We can even enable higher power profiles, things like 25 watts, 40 watts if needed.
And the benefit of leveraging PCI Express doesn't stop there.
We get standard device discovery, configuration
management.
We get the software leverage, things like PCI driver,
all the ACPI enhancements.
For example, we can deploy the heterogeneous memory attribute
table, HMAT, to describe properties of the memory.
We can use a DMA engine for data movement.
We can use the I/O virtualization services of PCI Express.
Now let's look at some of the applications of CXL
in the memory and storage segment.
As discussed before, DRAM memory can be behind CXL,
as shown over here. Since memory bandwidth and capacity needs
increase linearly with core count,
and different customers have different needs for core count,
today one has to build multiple platforms with different numbers
of DIMM slots.
However, with CXL, one can envision a common platform
as shown over here.
You've got the regular DIMM slots,
and you have got the DRAM through CXL
that you can plug in as many as you need.
So that way you have a common platform, and you also have the flexibility of adding more memory bandwidth and more memory capacity.
CXL offers bandwidth scaling with the width and frequency.
So unlike traditional memory, one can allocate different bandwidth with the same capacity by changing the link width, for example. As we had stated earlier, PCIe data rates are doubling every
two to three years while staying within the power, performance, and channel reach, as
well as cost profile. So today you can get 32 gigabytes per second per direction if you
have a x8 Gen 5 CXL device. Once we go to, let's say, Gen 6 data rate, 64 gig, we can get 128 gigabytes
per second per direction. And that's a lot of bandwidth that we have at our disposal.
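Just to sanity-check the numbers quoted here, a quick back-of-the-envelope sketch (ignoring encoding and protocol overhead, so these are approximations rather than spec numbers): bandwidth scales with both the link width you provision and the data rate of the generation.

```python
# Per-direction bandwidth ~ data_rate (GT/s) * lanes / 8 bits, overheads ignored.
def per_direction_gbps(gt_per_s: float, lanes: int) -> float:
    return gt_per_s * lanes / 8.0

print(per_direction_gbps(32, 8))    # ~32 GB/s: a x8 Gen 5 CXL device
print(per_direction_gbps(64, 8))    # ~64 GB/s: the same width at the Gen 6 data rate
print(per_direction_gbps(64, 16))   # ~128 GB/s: a x16 link at the Gen 6 data rate
```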
The other benefit of CXL is pin efficiency. DDR requires a lot of pins. Pin growth in the CPU
package by increasing the number of memory channels to sustain core performance is not going
to be sustainable in the long run. This is where CXL can help deliver more bandwidth on a per-pin
basis. You do get higher latency with CXL than a local DRAM access; that's inherent, because you're
crossing a chip, and a SerDes invariably has more latency than a parallel bus-based infrastructure. However, CXL access
is similar in latency to a DRAM access in a remote CPU socket in an SMP system, where
NUMA domains have been very well established. The question is how much of memory stays in
the traditional DIMM and how much of it really migrates to CXL. With time, my prediction is we will see more and more memory migrate to CXL.
And I can envision systems like this on the right-hand side here
where pretty much all the memory is CXL and you've got on-package memory.
That will happen over time.
Here, one can even have a memory hierarchy
where on-package memory acts as a memory-side
cache or a level 4 cache.
Backward compatibility ensures interoperability,
something that we don't get today in the DIMM world.
So this offers a very compelling interconnect for memory
to migrate to.
It makes perfect sense to me that NVDIMMs move to CXL,
as shown in the picture over here.
We can have storage class memory, such as NAND, that can back up the DRAM, and it will have the non-volatile characteristics.
One can even envision deploying a DMA engine to back up memory from other DIMM slots.
The advantage of this is it will be hot-pluggable,
like a form factor like this, and hence,
it will be serviceable, something that we just
do not get today.
It can be also multi-headed and connect to multiple hosts
for failover access.
We do not have to consume a precious DIMM slot,
and we are not constrained by the form factor or power
constraint.
Persistent memory behind CXL is now cacheable. All of this while we maintain
hot plug serviceability and multi-headed access for availability. One can even build very large
capacity memory by having DRAM as the memory-side cache for the entire, or a large chunk of,
the storage class memory, like NAND for example. So this is like a 2LM kind of model, a multi-level memory model.
One can build very large capacity that way, and you can use the HMAT table to interleave
accordingly, and you can also report the latency characteristics accordingly.
The CXL devices in both these types of memory can be augmented with DMA engines to move
data efficiently, and they can be deployed with accelerators for local processing of the data.
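To make the 2LM idea concrete, here is a minimal, hypothetical sketch of a direct-mapped DRAM cache sitting in front of a much larger storage-class-memory backing store; real memory-side caches are implemented in hardware with very different policies, so this is only meant to show the hit/miss/writeback flow.

```python
# Toy 2LM model: a small direct-mapped DRAM cache in front of a large SCM store.
LINE = 64          # cache line size in bytes
DRAM_LINES = 4     # keep the "near memory" tiny so misses are easy to see

scm = {}                                  # far memory: line number -> data
dram = [None] * DRAM_LINES                # near memory: (tag, data, dirty)

def read(addr):
    line = addr // LINE
    slot = line % DRAM_LINES
    entry = dram[slot]
    if entry and entry[0] == line:        # hit in the DRAM memory-side cache
        return entry[1]
    if entry and entry[2]:                # evict a dirty line: write it back to SCM
        scm[entry[0]] = entry[1]
    data = scm.get(line, bytes(LINE))     # miss: fill from storage class memory
    dram[slot] = (line, data, False)
    return data

def write(addr, data):
    line = addr // LINE
    slot = line % DRAM_LINES
    entry = dram[slot]
    if not (entry and entry[0] == line):
        read(addr)                        # allocate on write miss
    dram[slot] = (line, data, True)       # mark dirty; SCM updated on eviction

write(0, b"x" * LINE)
print(len(read(0)), "bytes served from the near-memory tier")
```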
One of the compelling new usages for CXL is computational storage. Imagine putting an accelerator, as shown over here, in front of memory for near-memory compute.
Functions such as compression, encryption, computation for reliability like RAID 5 and RAID 6,
compaction for key-value stores, search engines, vector processing for AI/ML applications,
all of these can be done through these accelerator engines.
DMA engines can be deployed to transfer data
to where it is needed.
So this can also use standard drivers
and management framework that we have developed
over the years for PCI Express.
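As one tiny example of the kind of computation such an engine could run next to the data, here is a sketch of RAID-5-style parity: the parity strip is just the XOR of the data strips, and a lost strip can be rebuilt from the parity plus the survivors. This is plain Python for illustration, not an accelerator offload API.

```python
from functools import reduce

def xor_parity(strips):
    """RAID-5-style parity: byte-wise XOR across equal-length data strips."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), strips))

data = [b"AAAA", b"BBBB", b"CCCC"]      # three data strips
parity = xor_parity(data)

# Rebuild a lost strip from the parity and the surviving strips.
rebuilt = xor_parity([parity, data[0], data[2]])
assert rebuilt == data[1]
print("rebuilt strip:", rebuilt)
```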
One can envision that the storage class memory,
as we talked about, can be mapped into the coherent space
and have a local DRAM back it up.
So that's also possible, as we have shown over here.
So the picture here shows how you can build
a very efficient system with all of these capabilities
combined with the memory capacity
and memory bandwidth expansion that CXL offers.
What's important is that everything in CXL can be located
behind even a low-latency switch,
and it can be pooled across multiple clusters in a rack.
The bottom line is that with a simple set of .cache
and .mem enhancements,
CXL enables one to build
on top of existing PCI Express infrastructure
and services, and offers a very effective and compelling interconnect.
It enables scaling with heterogeneous processing and memory, with a shared,
cacheable memory space that is accessible to all using the same mechanism.
The slide here offers a view of how to treat storage plus memory as a byte-addressable data store,
with storage-style data management and protection services, while being accessible through the load-store
semantics of CXL. The byte-addressable characteristic comes, of course, from load-store access;
it offers cacheability on top; it is scalable to petabytes-plus; and it is still low latency because of the .memory access latencies.
And you can still preserve the storage functions like replication, snapshots, and cluster-wide
data sharing.
And with CXL 2.0 and beyond, we are moving from a node level to a rack level.
So the cluster-wide thing also applies there.
CXL enables us to extend the load-store interconnect across multiple hosts.
So it's a fundamental paradigm shift that we are seeing
since load-store semantics so far have been constrained
to captive resources, whether it is processing, IO, memory,
or storage, all inside one host.
Composable disaggregated infrastructure at
rack level, as shown here, and potentially across racks, is essential to realize the
data center of the future vision. That vision is represented pictorially here. We have pools
of compute nodes, multi-tiered memory, XPUs for acceleration, and storage pool.
And you have an infrastructure processing unit that acts as the connection to the networking fabric.
Each of these entities is connected through CXL, as shown here in the load-store fabric.
This can be at the rack level or even at a pod level.
The goal is to deliver almost identical performance per watt as independent servers.
Going forward, we need to enhance this usage model
with shared memory, message passing,
and atomics among these nodes through CXL
so that we can encompass new usage models
and also new segments
that require high performance computing.
We need to nurture the synergy between networking fabrics
and load store interconnect
so that we can interoperate seamlessly, provide the same look and feel for our end users, but with different performance
characteristics. Clearly, if you go through the fabric, your latencies will be a lot higher. If
you go through the load-store interconnect, your latencies will be a lot lower, but your scale is
going to be also lower. We already have the necessary constructs like the fabric manager,
multi-head, multi-domain, atomics support, persistence flows, smart NICs, optimized flows to access system memory without involving the host, and VM migration with CXL.
We still have a lot of challenges. We need to ensure that we are taking the learnings from our initial CXL implementations into account as we move forward.
One of the important challenges is performance,
namely latency and bandwidth.
These are two sides of the same coin.
As we have discussed,
the load-to-use access latency of a directly CXL-attached device
should be significantly less than 200 nanoseconds,
similar to a UPI access.
If we go through a single level of switching,
we will probably add another 100 nanosecond latency to that path. Low latency switches are absolutely critical to realize
this vision. These accesses through switches can leverage the existing NUMA optimizations that we
have done with SMP multiprocessing. The other aspect is bandwidth. With giga transfers per second and no signs of slowing down.
Load store interconnects have also hundreds of lanes per CPU socket.
Bandwidth is not going to be a bottleneck.
We can aggregate bandwidth through multiple sets of switches.
A rack worth of resources are fairly accessible through a single level of switch with 128 lanes if we chop them up into by four ports. So with two levels of switches,
the connectivity goes up as a square of that. Latency does increase by another 100 nanoseconds,
but now we can potentially access multiple racks in a pod with two levels of switching.
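Here is a small worked sketch of those two budgets, using the round numbers quoted in this talk (a sub-200-nanosecond direct load-to-use latency, roughly 100 nanoseconds per switch level, and 128 lanes carved into x4 ports); the figures are illustrative planning numbers, not measurements.

```python
# Rough latency and fan-out budget for switched CXL, using the talk's round numbers.
DIRECT_LOAD_TO_USE_NS = 200      # "significantly less than 200 ns" for direct attach
PER_SWITCH_LEVEL_NS = 100        # roughly 100 ns added per level of switching
PORTS_PER_SWITCH = 128 // 4      # 128 lanes carved into x4 ports = 32 ports

for levels in (0, 1, 2):
    latency = DIRECT_LOAD_TO_USE_NS + levels * PER_SWITCH_LEVEL_NS
    fanout = PORTS_PER_SWITCH ** levels if levels else 1
    print(f"{levels} switch level(s): <~{latency} ns load-to-use, "
          f"~{fanout} downstream ports reachable")
```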
So with increased nodes connected through CXL, we need to be careful about the blast radius.
Unlike networking, we cannot have software-based recovery mechanisms for pooled resources.
So the system needs to have an acceptable FIT, or failures in time, which is the number of failures in a billion hours of operation. We also need to design shared parts with the lowest possible FIT, and also ensure that containment is provided, as well as quality of service, across the multiple domains that are accessing resources through the same silicon.
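For a sense of scale, a short sketch of the FIT arithmetic: FIT rates of shared components add up, and the combined rate translates into a mean time between failures that every host sharing that silicon experiences. The component FIT values below are made up purely for illustration.

```python
# FIT = failures per 1e9 device-hours. FIT rates of shared parts add.
# These component values are invented for illustration only.
shared_components_fit = {
    "cxl_switch": 50,
    "pooled_memory_controller": 100,
    "fabric_manager_path": 20,
}

total_fit = sum(shared_components_fit.values())
mtbf_hours = 1e9 / total_fit
mtbf_years = mtbf_hours / (24 * 365)

print(f"aggregate FIT of the shared path: {total_fit}")
print(f"=> MTBF ~ {mtbf_hours:,.0f} hours (~{mtbf_years:,.0f} years) "
      f"seen by every host behind that shared silicon")
```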
We can deploy co-packaged optics to help with the reach.
Last but not the least, software, software, software. Everything from management to orchestration,
resource allocation, RAS, migration,
we need to have seamless experience for our users.
In summary, we have a lot of opportunities to
shape the data center of the future through
load-store I/O interconnects by working
collaboratively across disciplines and organizations.
Together, we can overcome the
challenges as they may come and keep the virtuous cycle of innovation going and change the compute
landscape of the future. Thank you very much. Thanks for listening. If you have questions
about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.