Storage Developer Conference - #146: Understanding Compute Express Link
Episode Date: May 25, 2021...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast, Episode 146.
My name is Debendra Das Sharma. I lead the IO Technologies and Standards Group at Intel.
I'm a co-chair of the CXL Technical Task Force and have been leading CXL since its inception.
I will be delving into Compute Express Link, which is also abbreviated as CXL.
So when we look at the industry landscape today, we see some very clear mega trends emerge.
Cloud computing has become ubiquitous. Networking and edge computing build on the cloud infrastructure
and have also become ubiquitous. AI and analytics to process that data are driving a lot of innovation across the board. All of these are driving the demand for faster data processing.
We see increasing demand for heterogeneous processing
as people want to deploy different types of compute for different applications.
So whether it is general-purpose CPUs, GPGPUs, custom ASICs, or FPGAs,
each is important and best suited to solve some class of problems.
There is enough volume to drive customized solutions in each of these segments.
And increasingly, we see people deploying a combination of these different types of
compute in their platform depending on their needs. In addition to the demand for
heterogeneous computing, we also see the need for increased memory capacity and memory bandwidth in
our platforms. Significant technical innovations in storage class memories have also resulted in
those memories approaching DRAM-like latency and bandwidth characteristics while maintaining
non-volatility, and they also have larger capacity. So we have a class of memory
between DRAM and SSD that needs to be thought of as a separate memory tier. This tier offers
a compelling value proposition due to its performance, capacity, and persistence. For example, we
could now store an entire database in this new memory, which makes search faster and lets us run
a lot more AI and analytics types of applications. These mega trends that we are discussing here
take advantage of these types of memory, in addition to the evolution we are going to see in traditional DRAM and storage,
plus heterogeneous computing. Compute Express Link is defined from the ground up to address these
challenges in this evolving landscape by making heterogeneous computing, as well as these different types of
memory, efficient, and it has been designed to sustain the needs of different compute
platforms for many years to come. So the question is: why do we need a new class of interconnect?
So if you look at the picture on the top, this represents a typical
system today. So you got CPU, you got memory that is attached to the CPU, which is typically DRAM,
and those get mapped as coherent memory. Now, for coherent memory, data consistency is guaranteed by
hardware. The memory that is attached to a PCI Express device, on the other hand, is also mapped into the system memory,
but it is mapped as uncached memory or memory-mapped IO space.
So the memory attached to the CPU is different in its semantics
than the memory that is attached to an IO device.
So when the CPU wants to access its DRAM,
it simply caches the data,
accesses the data from its local cache,
later on does a write back if it has updated the data.
On the other hand, the memory that is attached to the IO device
cannot be cached. It is uncached memory.
So when the CPU has to access it, it does so using load-store semantics,
but those load-store semantics always have to traverse the entire hierarchy
and access the device for every access.
Similarly, when an IO device wants to do a read or a write from system memory,
it does that through what is known
as the PCI Express DMA mechanism.
So if you are an IO device, typically you are connected into the CPU.
The CPU would have a write cache, and whenever it gets a DMA write from the IO device,
it goes ahead and fetches that cache line using caching semantics.
It does the protocol translation between the producer-consumer ordering model and the ordering model that exists in system memory.
So it does the merge of the data, and then it can do the writeback.
The IO device is not allowed to cache any of the system memory.
It always has to issue explicit reads and writes to the host
in order to get access to that memory.
On a read, the root port simply goes ahead and asks for a
coherent snapshot of the data and provides it to the device.
Now, this is the producer-consumer ordering model.
It really works well for a wide range of IO devices,
has worked really well over the last two plus decades.
And it's really good when performing bulk transfers,
such as those involved with traditional storage
or moving data in and out of a networking interface.
It's very efficient, works really well.
And we definitely want to preserve
that kind of DMA model going forward for those usages.
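To make the ordering model concrete, here is a minimal sketch of the producer-consumer pattern being described: the data is written first and a flag is written last, and the consumer only reads the data after it sees the flag. The buffer and flag names are invented for the example; a real device would do this with DMA descriptors and doorbells.

```c
#include <stdio.h>

/* Illustrative sketch of the producer-consumer ordering model: the
 * producer (e.g. a NIC doing DMA writes) writes the payload first and
 * sets a flag last; the consumer polls the flag and only then reads the
 * payload. PCIe's ordering rules guarantee the payload writes are
 * visible before the flag write. Names are invented for the example. */
static int payload[4];   /* data buffer written by the producer */
static int ready = 0;    /* flag written last by the producer   */

static void producer(void)
{
    for (int i = 0; i < 4; i++)
        payload[i] = i * i;   /* 1. write the bulk data        */
    ready = 1;                /* 2. then publish the flag      */
}

static void consumer(void)
{
    if (ready) {              /* 3. see the flag ...           */
        int sum = 0;
        for (int i = 0; i < 4; i++)
            sum += payload[i];/* 4. ... then the data is valid */
        printf("consumed payload, sum = %d\n", sum);
    }
}

int main(void)
{
    producer();
    consumer();
    return 0;
}
```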
However, things like accelerators, for example, want to
do fine-grained sharing of data with the processor. In that case, the PCI Express
mechanism needs to be augmented with some extra semantics that will allow these devices to
be able to cache data from the CPU.
And also, by the same token, if they have memory attached to them,
be able to map all or some of that memory into the system's coherency space.
And that's basically what CXL allows.
So if we fix those, effectively all of these things that we had in red here will, of course, still be prevalent.
They will continue to exist.
But in addition to that, we enable write back memory on the device side.
We enable direct memory load-store on behalf of the device, where the device can cache the data.
And similarly, your PCIe DMA, whatever that looks like for a pure PCIe device,
is also going to look very similar if you are doing DMA directly into
the memory that is attached to this new CXL-enabled environment. So, in summary, this allows
us to do efficient resource sharing. We can share memory pools across different devices,
you know, devices across the system,
different types of compute elements can work on problems collaboratively
without having to effectively transfer the entire data
and then a flag to tell whoever they're partnering with
that they're done with the processing.
They can just work through all of these accesses
in a very seamless manner.
But for this to happen,
what we really need is an open standard so that the industry can innovate on
this common set of standards.
And together we should then be able to get the best
power-efficient performance in the compute systems that we
have. So that's basically the proposition of Compute Express Link. So this is an overview of
the CXL Consortium. If we look back to March of 2019, when we went public with the CXL Consortium, to now,
this slide shows 100-plus members; I believe it is more than 123 members now, and it's growing
rapidly. This entire membership reflects the breadth and the depth of
the industry, and it's very essential for us to create this vibrant ecosystem.
Compute Express Link is not a one-time thing. It is going to evolve. And there are work groups,
technical work groups, five technical work groups that are developing the next generation of CXL
2.0 specification in a backwards-compatible manner. And this journey continues: as more and
more problems come through, we are going to go ahead and address them. And of course, we are going to go through the speeds-and-feeds
adjustments in order to meet the demand. So that's, in a nutshell, what CXL is. And, you know,
these are the board of directors; you can go to the Compute Express Link website to get more details.
So now let's get into an overview of what Compute Express Link is all about.
So Compute Express Link is, this is looking at the system outside in.
So you can see that this is a data center, right?
You can see there are a bunch of networking connections.
You have a whole lot of racks in a data center.
Within each rack, you've got a lot of chassis.
And in each chassis there is a system,
with one or more CPUs connected through a symmetric cache coherency protocol.
They have their own memory.
There will be IO devices.
This is where CXL fits in. It defines
new protocol semantics to work at this level, tightly coupled with the CPUs,
between accelerators and memory. And as I said, we'll see that this leverages PCI Express and is targeted at AI, machine learning, HPC,
comms, and a variety of emerging applications.
So what is CXL?
CXL is basically based on PCI Express infrastructure.
So if you think of this as a processor,
you've got an IO link, a PCI Express link.
This is PCI Express 5.0 going through an x16 PCI Express
connector, and you can plug in either a PCI Express x16 card or a CXL card;
it will work either way. PCI Express is the ubiquitous interconnect technology across the
entire compute continuum. It spans everything from your handheld, laptop, desktop, server,
comms, you name it, right?
So that's the reason that, you know, it's present everywhere.
PCI Express 5.0 defined the alternate protocol mechanism.
So Compute Express Link sits on top of that.
So this way what happens is you get a flexible port.
Let's say in my system I have five x16 slots.
I could choose to populate all five with x16 PCIe cards,
or I could choose to populate all five with x16 CXL cards,
or I could choose any combination thereof
depending on the user's needs.
So by keeping it in this particular way where it is
completely interoperable, you are offering users the choice. Otherwise, you would have to give them
dedicated slots, and that costs more real estate, more power, more area, and more pins on the CPU,
which is not the best way to make progress, right? So, you know, we truly believe
in this plug and play. And PCI Express, as I said, has the alternate protocol mechanism. So
fundamentally, what happens is when the link comes up, we run with 8b/10b encoding
at the PCI Express 2.5 GT/s rate. Very early on, the CPU is going to query the device, saying, "I support CXL, do you support CXL?"
If the device says, "Yes, I support CXL," we are going to talk CXL. If the device doesn't know what
that is, in which case it's not going to respond, we will simply go ahead and proceed with PCI Express.
So, you know, by the time the link even comes up
through just the Gen 1 rate,
very early during the training,
you have decided whether it is PCI Express
or whether it is CXL, and it is done dynamically.
So you could have a slot,
the same slot can have a PCI Express today
or in somebody else's system,
the same slot can have CXL.
It will work plug and play.
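Here is a rough model of that decision, assuming an invented host-side helper. The real mechanism is the alternate-protocol negotiation carried in the PCIe training sets; this sketch only captures the logic just described: ask the device early, pick CXL if both sides support it, otherwise fall back to PCIe.

```c
#include <stdbool.h>
#include <stdio.h>

/* Rough model of the flexible-port behaviour described above. The real
 * negotiation happens in hardware during link training; the names below
 * are invented for illustration only. */
enum link_mode { MODE_PCIE, MODE_CXL };

struct device_caps {
    bool supports_cxl;   /* does the endpoint advertise CXL support? */
};

/* Host side: query the device very early in training (still at the
 * 2.5 GT/s Gen 1 rate) and pick the protocol for this slot. */
static enum link_mode negotiate(const struct device_caps *dev)
{
    if (dev->supports_cxl)
        return MODE_CXL;    /* both sides speak CXL: run CXL flits */
    return MODE_PCIE;       /* otherwise fall back to plain PCIe   */
}

int main(void)
{
    struct device_caps pcie_ssd = { .supports_cxl = false };
    struct device_caps cxl_mem  = { .supports_cxl = true  };

    printf("slot 0: %s\n", negotiate(&pcie_ssd) == MODE_CXL ? "CXL" : "PCIe");
    printf("slot 1: %s\n", negotiate(&cxl_mem)  == MODE_CXL ? "CXL" : "PCIe");
    return 0;
}
```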
And CXL usages are expected to be the key driver for more data rate in PCI Express, and we believe this will be one of the lead usage models for the PCIe 6.0 data rate transition.
So we basically reuse everything on the PCI Express side.
We reuse the PCI Express link training.
We use the circuits, the channels, everything.
And we'll see a more detailed picture of this coming up.
So this one talks about how do different CXL protocols coexist.
So if you look into this picture here, we have three types of protocols that run with CXL and they all run on top of the PCI Express infrastructure. So CXL consists of these
three protocols: CXL.io, CXL.cache, and CXL.memory. CXL.io is the IO part of the stack. This is almost
identical to PCI Express. We use it for discovery, configuration, register access, interrupts,
virtualization, and most importantly, the bulk DMA with producer-consumer semantics.
More or less identical to PCI and it is mandatory in CXL. So that part is there, right?
As we said, in the PCI Express 5.0 logical PHY there are some modifications that are needed to
do the alternate protocol negotiation that I talked about, and the same thing applies on the other side. Now
let's look into the other two protocols. So you've got CXL.cache. This is optional for a device.
It allows a device to be able to cache the system memory. So memory that is attached to the system, host memory, this is the coherent memory.
So CXL.cache will allow the device to access that memory, store it in the local cache,
and effectively have the same kind of caching agent functionality that a core might have, CPU core might have.
CXL.memory protocol is also optional for a device.
And what it does is if the device has memory attached to it,
the device can choose to allocate a portion or all of it or none of it
to be mapped into the system memory space as coherent memory.
So that way, you know, the CPU can access that part of the memory
using the same types of semantics that it would if it were accessing host memory.
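As a rough picture of what that can look like from the software side, here is an illustrative system address map with host DRAM plus a device-attached, host-managed range exposed over CXL.mem. The addresses and sizes are made up for the example.

```c
#include <stdio.h>

/* Sketch of the address-map idea behind CXL.mem: the device chooses how
 * much of its attached memory to expose, and that portion appears in the
 * system physical address map as ordinary coherent memory alongside host
 * DRAM. Sizes and layout here are invented for illustration. */
struct range { unsigned long long base, size; const char *what; };

int main(void)
{
    struct range memmap[] = {
        { 0x0000000000ULL, 64ULL << 30, "host DRAM (coherent)" },
        { 0x1000000000ULL, 32ULL << 30, "device-attached memory exposed "
                                        "via CXL.mem (coherent)" },
        { 0x2000000000ULL,  4ULL << 30, "device-private memory (kept for "
                                        "local use, not mapped coherent)" },
    };

    for (unsigned i = 0; i < 3; i++)
        printf("0x%010llx + %llu GiB : %s\n",
               memmap[i].base, memmap[i].size >> 30, memmap[i].what);
    return 0;
}
```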
So, you know, CXL.cache, CXL.mem, and then CXL.io,
they all go through their own independent stacks, but they get multiplexed at the flit level.
And CXL specification defines flits as the basic unit of transfer.
It dynamically multiplexes these three protocols.
Our basic unit of transfer size for CXL is 528 bits. So what this allows us to do is, if a device is, let's say, in the middle of a large DMA transfer with a 512-byte payload and the host needs to send a snoop to that device, we don't want that snoop to be stuck
behind a 512-byte payload,
because our flit sizes are small.
They are, as I said, 528 bits,
which is about 66 bytes.
At that flit boundary,
we can pause the DMA transfer,
send the snoop to the device
and then resume back to the DMA bulk transfer that we were doing.
So all of these are provided in CXL. The flits are protected by a 16-bit CRC.
And on the IO side, of course, we have the same link level CRC on top of that. So now let's talk a little bit about CXL features and the benefits that it offers.
CXL has been designed ground up for low latency. And we'll talk through this in terms of the stack,
right? So each of these three critical usages,
they have their own latency critical elements into it.
So, you know, if it is the IO part of the stack,
it's identical to PCI Express.
So, you know, we made some enhancements into the link layer and the transaction layer
so that it can operate in a CXL.IO environment.
These are fairly small changes.
Things like, you know, you need to be able to pause things.
You need to be able to break things on a flit boundary.
All of those, right?
CXL.Cache and CXL.MEM, that's optimized for latency.
And that fundamental flit size is the flit size in the CXL protocol. And these transactions, as you can see, we
multiplex just before we hit the logical PHY. So, you know, we talked about how we can interrupt
a 512-byte or other large payload for a performance-critical one, and that helps us with the
latency. Now let's look into an alternate approach that we could have taken, which we didn't, right?
We could have done the multiplexing between these different protocols at the transaction
layer level and gone through the PCIe link layer. There are two issues with that from a latency point of view. First of all,
for the 512 byte example that I gave,
if I have a snoop,
I have to really wait
for that particular transaction to end
before I can send my transaction.
Whereas here, I don't have to do that.
I just have to wait till the flit boundary
and the flit boundary is a much smaller,
64-byte or 66-byte granularity.
Two of those bytes are for the CRC.
The other thing is that PCIe deals with variable packet sizes
in the link layer. I mean, you can have up to four TLPs in a given clock cycle,
assuming it's one gigahertz at the Gen 5 data rate with an x16 link, and you can have a TLP that
goes across multiple cycles. So the CRC logic, for example, has to be able to deal with all of those cases;
consequently it has a lot of pipeline stages built in throughout the whole thing. It works
really well for PCIe, and it is really well suited and optimized for that. But coherent traffic works on a cache-line basis; it's more or less small packets, and it's not very efficient from that point of view to go through the PCIe path. For CXL, flit sizes are fixed and CRC sizes are fixed,
and that gives us a lot of efficiency
in terms of how quickly we can get
to the latency numbers that we want.
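Here is a toy model of that flit-level multiplexing, assuming the 64-byte-payload-plus-2-byte-CRC flit just described. The queue and scheduling details are invented, but it shows why a snoop only waits for the current flit boundary rather than for a full 512-byte payload.

```c
#include <stdio.h>

/* Toy model of flit-level multiplexing: a 512-byte bulk CXL.io transfer
 * is chopped into fixed-size flits, and a latency-critical CXL.cache
 * snoop can be injected at the next flit boundary instead of waiting for
 * the whole payload. Sizes come from the talk; the scheduling model is
 * simplified for illustration. */
#define FLIT_PAYLOAD   64   /* payload bytes per flit (plus 2 bytes CRC) */
#define BULK_BYTES    512   /* size of the bulk transfer being chopped   */

int main(void)
{
    int sent = 0;

    while (sent < BULK_BYTES) {
        /* Pretend a snoop shows up while the bulk transfer is in flight. */
        int snoop_pending = (sent == 2 * FLIT_PAYLOAD);

        if (snoop_pending)
            /* Only wait for the current flit boundary, not the whole 512B. */
            printf("  inject snoop flit after %d bytes of bulk data\n", sent);

        printf("bulk flit: bytes %d..%d\n", sent, sent + FLIT_PAYLOAD - 1);
        sent += FLIT_PAYLOAD;
    }
    return 0;
}
```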
So the question is, you know,
how much of latency are we targeting?
What is good enough for us, right?
And the answer is very simple.
We want, for CXL.cache and CXL.mem,
those latencies to be in the same ballpark as what we
would have if we were doing a symmetric cache coherency protocol. So we gave some guidance in the CXL
specification. For example, if you are a device and you get a snoop request on the pin, then on a snoop miss we expect you to return
the response for that snoop. Basically,
"I don't have the line, it's invalid" is the response, right? You'll come back with that
within 50 nanoseconds, five-zero, right? Pin to pin. So that's fairly aggressive. And we can do that because
we have this structure here. Similarly, if I'm giving a memory read request here,
and then the device has got either HBM memory or DDR memory, it is supposed to respond with the
data starting at 80 nanoseconds. So a pin-to-pin latency of 80 nanoseconds is what is given in the spec.
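Those two guidance numbers, captured as constants in a small illustrative check; only the 50 ns and 80 ns figures come from the talk, and the helper and sample measurements are made up.

```c
#include <stdio.h>

/* The pin-to-pin guidance quoted above, expressed as constants. The
 * numbers are from the talk; the helper is just an illustrative check. */
#define SNOOP_MISS_RESPONSE_NS   50   /* snoop in to snoop response out  */
#define MEM_READ_DATA_NS         80   /* MemRd in to first data returned */

static const char *check(int measured_ns, int guidance_ns)
{
    return measured_ns <= guidance_ns ? "within guidance" : "exceeds guidance";
}

int main(void)
{
    printf("snoop response at 45 ns: %s\n", check(45, SNOOP_MISS_RESPONSE_NS));
    printf("memory read data at 95 ns: %s\n", check(95, MEM_READ_DATA_NS));
    return 0;
}
```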
Now, you can say that, hey, what about storage class memories?
Because those have higher latencies, and that's fine.
That's not an issue.
We have a reporting mechanism, an HMAT kind of table,
where you say what kind of memory you have,
so that the system software is aware
of what kind of device you have and what kind of latency you have,
and maps the memory into your space accordingly.
CXL is an asymmetric protocol, so the protocol flows and the message classes are different between the host processor side and the device side. This has been a conscious decision to keep
the protocol simple and the implementation easy, and I'll go through that reasoning.
We have experience with enabling the industry with
symmetric cache coherency protocols. Invariably you will find a lot of excitement initially, and then
the vast majority of them, 90-plus percent in my experience, basically just do not make it
to the finish line because of the complexity, because of the huge design effort and validation effort.
And most importantly, symmetric cache coherency protocols change over time.
So doing something that is backwards compatible becomes a challenge.
Now let's look into each component.
Why is that the case?
The host processor has a mechanism to orchestrate cache coherency between its multiple
caching agents, right? It has got cores, it has got home agents, it has got, you know, I've not shown it here,
but PCI Express root ports, all of that. It might have peer CPUs that it is connected to
using a symmetric cache coherency protocol link.
So it has to deal with orchestrating.
And I want to emphasize on the word,
orchestrating cache coherency between multiple caching agents.
So typically this involves,
you'll take the request,
you'll resolve any conflict,
you're going to start tracking cache lines.
So this is what is known as home agent functionality, and that's what is represented here in this diagram.
That's the complicated piece. That's the one that is very tied to the individual microarchitecture.
That's the one that changes from generation to generation. The consumption side, which is the
caching agent, is relatively straightforward. And I've not seen very many
generations of CPU do this. I've worked at two companies for more than two decades, and I have not seen very
many, actually I have not seen any, CPU work with its predecessor CPU using its cache coherency
link, and there are very good technical reasons why it doesn't. On the other hand, from a device perspective, it needs to cache, right?
It has to cache something because it has a need, a performance benefit that it
can gain from caching. But it really is not in the business of
orchestrating cache coherency between different cores, or between other caching agents or other accelerators that
might be there. So there is no need for the device side to get bogged down in orchestrating
cache coherency by having home agent functionality, which is complex and, as I said, changes across
generations. So what the CXL specification does is abstract that away. It says: home agent
functionality, you the host processor are dealing with it anyway, so use whatever
you do. I'm just going to provide you with a simple set of abstracted cache coherency commands,
and those are similar to what you would find in a MESI protocol.
So things like, you know, I want to read this cache line.
You know, I want to read it shared.
I want to read it private, all of those, right?
If I updated a cache line, I want to do a write back.
And, you know, occasionally I might get a snoop,
in which case I give a snoop response.
That's pretty much what we need.
We don't need any of these,
you know, how do I resolve conflicts
across multiple caching agents
and all of those things, right?
So that basically keeps it simple.
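A sketch of what that abstracted, MESI-like request set might look like from the device side; the enum names and the snoop handler are invented for illustration and are not the specification's actual opcodes.

```c
#include <stdio.h>

/* Sketch of the abstracted, MESI-like request set a CXL.cache device
 * uses. The host keeps the home-agent complexity; the device only issues
 * simple requests and answers snoops. These names are invented for
 * illustration and are not the specification's opcodes. */
enum device_request {
    REQ_READ_SHARED,      /* read a line for shared (read-only) use    */
    REQ_READ_OWNED,       /* read a line for exclusive/private use     */
    REQ_WRITE_BACK,       /* return an updated (dirty) line to the host */
};

enum snoop_response {
    RSP_INVALID,          /* "I don't have the line (any more)"        */
    RSP_DATA_DIRTY,       /* "here is my modified copy"                */
};

/* Device-side handling of a host snoop: on a miss, respond that the line
 * is invalid; on a dirty hit, return the data. */
static enum snoop_response handle_snoop(int have_line, int line_dirty)
{
    if (have_line && line_dirty)
        return RSP_DATA_DIRTY;
    return RSP_INVALID;
}

int main(void)
{
    printf("snoop miss -> response %d\n", handle_snoop(0, 0));
    printf("dirty hit  -> response %d\n", handle_snoop(1, 1));
    return 0;
}
```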
Now on the same token,
if I have memory,
I really am not in the business
or I should not be in the business
of orchestrating cache coherency.
That's on behalf of the CPU.
All I do is there is a read to that memory location.
I provide the data.
You want to write something to the memory location.
I write the data, right?
I'm just trying to do the best I can to manage the memory
and provide the lowest latency possible.
So contrast that with a symmetric cache coherency protocol
where it is all,
you know, every side is a peer,
they have got caching agents.
So this is what basically,
as I said,
it's a very deliberate decision.
It's to keep things simple,
keep things simple for the developers,
keep things simple for the accelerator providers,
for the memory expansion providers.
And in any case,
the CPU needs to orchestrate cache coherency as we talked about.
So, you know, keep things where they really belong, right?
Now, I did mention that, you know,
CXL.memory allows for memory on the device side, right?
So host memory is mapped mostly into the coherent memory space.
But in the case of a device, its memory can either be kept
for its local usage,
or part or all of it can be mapped
into the system.
So there are two different views of how things can be. And as I said, you can have
part of the memory in one type
and the other part in the other type.
It's really up to the device
how much it needs to map into the system memory space.
Now, if the device owns the memory
and it is not mapped into the coherent memory space
in the system,
then that is what is known as device bias.
In device bias, the memory is still mapped into the system memory
space, but it is mapped as memory-mapped IO, or uncached. So anytime the CPU wants to access that
memory, it does so using an uncached flow, just like it does today in PCI Express.
The picture on the right shows the other case, host bias:
even though the memory physically resides with the device,
it's really mapped into the system memory space as coherent memory.
So the home agent in the host processor is in charge of that memory.
So even if the device wants to access its memory,
it has to go through the home agent and then it can come back and take the data.
And that is because you want to make sure that, you know, when you are accessing a location, nobody else has it.
Right. You have the right coherency semantics built into it.
And you can flip the bias for a given location between these two. And even if you didn't get it right, by construction the CXL protocol is such that
it's going to work properly, in the sense that data consistency will still be guaranteed.
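A small sketch of the two bias states as just described, with invented names, showing who has to be consulted before the device touches its own memory in each state.

```c
#include <stdio.h>

/* Sketch of the two bias states for device-attached (CXL.mem) memory as
 * described above. In device bias the device can access its own memory
 * directly, and the CPU uses uncached/MMIO flows; in host bias every
 * access, even the device's own, goes through the host's home agent so
 * coherency is resolved there. Names are illustrative only. */
enum bias_state { DEVICE_BIAS, HOST_BIAS };

static void device_access(enum bias_state bias)
{
    if (bias == DEVICE_BIAS)
        printf("device: access local memory directly (host uses uncached flows)\n");
    else
        printf("device: go through the host home agent, then use the data\n");
}

int main(void)
{
    device_access(DEVICE_BIAS); /* device is crunching on its own operands */
    device_access(HOST_BIAS);   /* host is producing/consuming the results */
    return 0;
}
```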
Now let's talk a little bit about some of the use cases of CXL.
So what you see in this picture are three types of devices. The leftmost one is a Type 1
device. Type 1 devices use the CXL.io and CXL.cache semantics, and we have
provided some example usages, like a SmartNIC that can benefit from caching. Now, if
the SmartNIC implements a partitioned global address space, PGAS,
it needs to ensure that the ordering model is preserved.
Now, note that PCIe has got the
producer-consumer ordering model.
That mandates that writes be able to bypass prior reads
to avoid deadlocks in the PCIe hierarchy.
This is known as posted transactions bypassing non-posted transactions,
and that can cause a problem with the PGAS model,
where two strongly ordered transactions are not allowed to complete out of order.
So the way to work around this ordering issue is to serialize access
which can result in performance implications.
Now, if you have CXL.cache,
the NIC can simply cache these locations and prefetch them, because the beauty of cache
coherence is that you can prefetch the locations, and when you retire the transactions,
or when you do the write, as long as the line has not been snooped out, you're guaranteed that the data
is still there with you.
And then in that case, it just needs to complete the transactions in order in its local cache.
Now, another type of usage for type 1 device is around the atomics.
Now, increasingly, applications are using advanced atomic semantics involving floating point operations.
Any standard IO protocol, such as PCI Express,
can be enhanced to support this natively.
In fact, PCI Express has got some amount of atomic semantics.
Now, the usage of these atomic semantics
evolves so fast, right?
By the time you get it specified and get to implementations on the CPU side
and on the device side, that can take years,
which slows down innovations.
On the other hand,
if you have CXL.cache,
you can simply get ownership of that line,
perform any complex atomic semantics that you want
by keeping ownership of that data.
And then there you are done.
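A sketch of that argument, assuming stand-in ownership helpers: once the device holds the line exclusively, it can perform whatever read-modify-write it wants, here a floating-point add, and correctness follows from holding ownership rather than from the interconnect defining that particular atomic.

```c
#include <stdio.h>

/* Sketch of the atomics argument above: with CXL.cache the device takes
 * exclusive ownership of the line, performs whatever read-modify-write
 * it wants locally (here a floating-point add), and writes back later.
 * The acquire/release helpers are stand-ins for the device's
 * caching-agent requests, not real API calls. */
static double shared_line = 1.5;   /* models a cache line in host memory */

static void acquire_exclusive(void) { /* read-for-ownership request      */ }
static void release_line(void)       { /* write back / keep until snooped */ }

static void device_fp_atomic_add(double value)
{
    acquire_exclusive();       /* nobody else can modify the line now    */
    shared_line += value;      /* arbitrary atomic: FP add, min, CAS ... */
    release_line();
}

int main(void)
{
    device_fp_atomic_add(2.25);
    printf("line value after device atomic add: %g\n", shared_line);
    return 0;
}
```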
You don't really need to make sure that somebody else implements atomics the way you envision them.

The middle type of device is what is known as a Type 2 device.
Typical usages are your GPGPUs and FPGAs for dense computing. Now, these devices may have
some amount of local memory attached to them that is used for their computation, but it
can also be mapped into the system space, and we expect these Type 2 devices to implement all
three protocols. Now, the caching and memory semantics would be used to populate and pass
operands and results back and forth between the computing entities with very low latency and high efficiency.
So this is where you would use the bias flipping, you'd crunch on some data, and then you would
just tell the CPU to come pick it up, right, without having to send an entire set of data
and flag and all of that.
It's just local.
In the same way, when the CPU crunches something, you basically get unfettered access without having to go through explicit synchronization of the "hey, here is the data, here is the flag" kind of thing.
Third type on the right is a type 3 device.
And the usage would be memory bandwidth expansion, memory capacity expansion, and storage class memory.
Now, these only need to implement the CXL.IO
and CXL.memory semantics.
This memory will be mapped to system memory
as coherent memory.
But then the host processor orchestrates the cache coherency
as we mentioned earlier.
The device doesn't have to know anything
about coherency flows.
All it does is implement the memory semantics, which is a set of reads and writes. It doesn't even need to implement CXL.cache.
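Summarizing the three device types as a small capability table; this just restates the talk, and the struct and names are illustrative rather than anything from the specification.

```c
#include <stdio.h>

/* Capability summary of the three CXL device types described above.
 * CXL.io is mandatory for every device; .cache and .mem are optional
 * and chosen per device type. */
struct cxl_device_type {
    const char *name;
    int uses_io;     /* CXL.io    - always required                     */
    int uses_cache;  /* CXL.cache - device caches host memory           */
    int uses_mem;    /* CXL.mem   - device exposes memory to the system */
    const char *example;
};

static const struct cxl_device_type types[] = {
    { "Type 1", 1, 1, 0, "SmartNIC, accelerator without exposed memory" },
    { "Type 2", 1, 1, 1, "GPGPU / FPGA with attached local memory"      },
    { "Type 3", 1, 0, 1, "memory expansion, storage-class memory"       },
};

int main(void)
{
    for (unsigned i = 0; i < 3; i++)
        printf("%s: io=%d cache=%d mem=%d (%s)\n",
               types[i].name, types[i].uses_io, types[i].uses_cache,
               types[i].uses_mem, types[i].example);
    return 0;
}
```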
So what we are doing with CXL is that we are enabling a lot of capabilities with some very simple extensions on top of PCI Express.
And the purpose of this is to make sure
that the ecosystem can really innovate
and can build some really good either accelerators
or memory expansion devices or whatever.
These are very powerful constructs,
very simple constructs.
And the barrier to entry is fairly
minimal, because all the main things, like the PCI Express PHY and all of those, are already
taken care of, right? You have that anyway. And on top of that, there is a handful
of semantics that you need to implement to get the benefit, and you don't have to orchestrate
cache coherency and all of that. So, for example, I can imagine that somebody will build
in-memory or near-memory processing by doing a Type 2 device. You've got a bunch of memory attached
to the device; on the other side, you want to do some kind of search
or whatever it is that you want to do on that memory, and then you can work very collaboratively
with the processor. You've got very high-bandwidth access to that type of memory,
and all of that doesn't need to cross the wire, and you can provide a very valuable
solution to a particular problem that you're trying to solve,
right? I mean, you can speed up database accesses, you can do a bunch of things,
because you've got caching semantics built into the protocol along
with the memory semantics. So it's a very powerful construct that we are enabling the industry with.
So now let's look into, you know,
what we started off by saying, right?
I mean, why a new interconnect and see how well we did, right?
You know, effectively,
I hope I convinced you that, you know,
the memory that is attached to the device
is write-back memory,
because you can use CXL.mem semantics to map it.
Because of CXL.cache, you could do memory loads and stores just the way a CPU core
would do.
And PCIe DMA, just like it happens to the memory attached to the CPU,
looks very similar if you were to do peer-to-peer access to that device-attached memory.
So we have, in addition to the existing load store semantics,
we have been able to bring these into picture.
And this will result in, you know,
if I'm operating between different sites,
I could do efficient, you know, population of operands and results.
And you could borrow memory resources
when you need to work on something,
and you could do a bunch of things
like user kernel level data access and data movement.
And all of these are very low latency,
as we talked about.
Extremely low latency.
Latency is similar to
a cache-coherent symmetric multiprocessing system,
which is, by the way,
much smaller than
a PCIe load-store type of latency. PCIe load-store latency is itself small, but these are much, much
smaller in terms of latency access characteristics, and we talked about some of the
numbers. So, in summary, CXL has the right features and the right architecture, the right level of abstraction, and, most importantly,
very low barrier to entry to adopt things
to enable a broad and open ecosystem, right?
Both are important: a broad ecosystem
and an open ecosystem.
And that will enable us to do heterogeneous computing,
allow for a bunch of different memories to be put in the system, and enable server disaggregation. We also provide the right level of abstraction so that you can classify different memories differently and optimize your performance accordingly. For coherent interfaces, we leverage PCIe, and again, we only innovate where
it makes sense. We wanted to piggyback on PCI Express because
that allowed us the plug and play and all of those things that I talked about.
We don't have to do any of the heavy lifting in terms of the PHY, in terms of the channels, in terms
of even discovery for IO; all of those, right, are completely leveraged from PCI Express.
On top of that, we built a low-latency approach for .cache and .mem,
targeted at near-CPU cache-coherent latency.
We expect it to be in the same ballpark, right?
So if you're looking into, you know, pick your favorite vendor; I can talk about
our Intel CPUs. For example, on our UPI link, whatever latency you see, we expect CXL.cache and CXL.mem
to have similar latency characteristics. We are talking about asymmetric complexity, which
eases the burden of cache-coherent interface design, because as an accelerator
or as a memory expansion developer,
you need to really worry about a handful of things.
You really are not in the business
of orchestrating cache coherency
across a wide plethora of designs.
That is the asymmetric nature of CXL, and it is extremely valuable. It keeps latency low and makes sure that backwards compatibility is going to work
really well. Last but not least, and very important, it's an open industry standard with growing, broad
industry support, right? If you look at it, all the CPU vendors are in CXL, committed
to it. All the memory vendors are in CXL, all the FPGA vendors are in CXL, all the GPGPU vendors are
in CXL. You know, the cloud service providers are there in CXL, the OEMs, the comms service providers; you saw the list,
right? It's a very impressive list of companies that are fully committed to CXL.
We are not resting on our laurels with CXL 1.1, which we published in Q2 of 2019;
in March we published 1.0, and 1.1 added compliance to it.
And we are doing the next generation CXL 2.0,
which, you know,
looks like we are in the final stretch.
I'm keeping my fingers crossed.
So this journey will continue.
And looking at the types of companies that are there,
the investments that we are getting, and the interest that is
there, it's an ongoing journey. You know, my take on it is, if you are not a member,
please consider becoming a member. And thank you all for attending. I'm really glad to
have you all as co-travelers on this journey. Stay safe, my friends. Thank you.
Thanks for listening. You can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.