Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Compute Express Link
Episode Date: June 12, 2024...
Transcript
Okay, then let's get started. Good morning everyone. Today's lecture is about Compute Express Link.
First, let's have a look at where we are right now. In yesterday's lecture, we left the CPU and memory domain and talked about storage.
We talked about PCIe and peripheral devices.
And CXL is an interconnect technology that fits very well with PCIe because it is actually based on PCIe.
When we communicate over PCIe, we're usually dealing with peripheral devices such as GPUs, network interface cards, disks, and FPGAs, and not so much with DRAM, because DRAM is usually directly attached to the CPU. But with CXL, we're also talking about DRAM again, because this is a technology that allows moving the memory DIMMs away from the main board and attaching memory to peripheral devices.
Regarding our timeline, after today's CXL lecture, next week, the first session on Tuesday,
we will discuss the task two solution and then also introduce you to the third task,
which is about buffer management.
And then we will have a networking session. After that, we will continue with GPU sessions, RDMA, and then also two FPGA sessions.
In today's lecture, I first want to briefly discuss the current limitations of interconnects, specifically the limitations of PCIe and DDR, because this is the motivation for CXL and what CXL tries to address. Then I will give you a brief overview of CXL: what is CXL in a nutshell, and what are the most important things that you should remember. I will go into detail about managing cache coherence and explain some topologies that are possible with CXL-connected CPUs and devices. I will very briefly talk about how you can actually program with CXL if you have the chance to work with some CXL memory devices. And I will give you a brief insight into performance numbers that you can expect from such devices and also into available hardware, so what is already either announced by hardware vendors or even already commercially available.
So, with that: what is Compute Express Link? It's a standardized interconnect technology between, basically, CPUs and devices. Such devices can be FPGAs, GPUs, but also storage devices such as SSDs, or ASICs.
And CXL allows CPUs and devices to access and to cache data stored on each other's memory.
So a CPU can access memory that is potentially located on the device in a cache-coherent way, and also the other way around. CXL is responsible for maintaining cache coherence here, and it is based on the PCIe physical layer.
Meaning, if you already have a system that has a PCIe 5 physical layer, then CXL basically uses it and specifies alternative protocols over which the CPU and the devices can communicate on top of the PCIe physical layer. And with the PCIe physical layer as a basis, this is also the limiting factor in terms of throughput. With PCIe 5, for example, if you have a x16 connection, meaning 16 lanes, you can achieve about 63 gigabytes per second per direction, and with PCIe 6, you could achieve about 121 gigabytes per second.
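As a rough sanity check of these numbers (per-lane rates from the PCIe specifications; the exact usable throughput depends on encoding and protocol overhead, so treat this as an approximation):

    PCIe 5.0: 32 GT/s per lane, 128b/130b encoding
              -> 16 lanes x 32 GT/s x (128/130) / 8 bits per byte ≈ 63 GB/s per direction
    PCIe 6.0: 64 GT/s per lane (PAM4 signaling, FLIT mode)
              -> roughly double that, ≈ 121 GB/s per direction after overhead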
Okay.
So I mentioned memory. I mentioned peripheral devices, but we already
have interconnects for that.
So why do we need CXL?
Peripheral devices are connected via PCIe.
And if you want to access your memory from your CPU core,
then you usually communicate over the DDR interface
on your main board.
There are certain limitations.
Let's have a look at the PCIe device first.
If you want to access the system's memory from a PCIe device, let's say from a GPU,
there is no cache-coherent read or write supported, and PCIe does not allow caching system memory to exploit temporal or spatial locality. And on the other side, if we look at the CPU, the host cannot access a PCIe device's memory in a cache-coherent way, so the device memory cannot be mapped into the cacheable system address space. And also, when you want to utilize accelerators,
data structures usually need to move from the host memory
to the accelerator for data processing
before then results are moved back to main memory.
And multiple devices cannot access parts of the same data structure simultaneously, which is another limitation.
Second is memory scalability.
We can observe that there is a high demand for memory capacity
and memory bandwidth, so the demand is increasing
with more data-intensive applications.
And DDR memory struggles to keep up with this demand.
And regarding this limitation, I invite you to have a look at the pins of DDR memory DIMMs and PCIe connectors. On the left side here, you can see a DDR5 DIMM. It has 288 pins, and on the right side you can see a PCIe 5 x16 connector, which has 82 pins. If we compare the throughput here, the PCIe 5 connector, again with 16 lanes, supports about 63 GB per second per direction with only 82 pins, while the memory DIMM in this case supports about 50 GB per second but has 288 pins, so its bandwidth per pin is significantly lower. And also note that this is a quite beefy memory DIMM here, with a transfer rate of 6.5 gigatransfers per second.
Current CPUs, let's talk for example about AMD Genoa or Intel Sapphire Rapids, support 4,800 megatransfers per second, so the DIMM shown here is even faster than that. And still, the efficiency per pin is quite low compared to PCIe pins.
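A quick back-of-the-envelope comparison with the numbers from the slide (the pin counts include power and ground pins, so this is only a rough measure of bandwidth per pin):

    PCIe 5.0 x16 : 63 GB/s / 82 pins  ≈ 0.77 GB/s per pin
    DDR5 DIMM    : 50 GB/s / 288 pins ≈ 0.17 GB/s per pin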
And also, if you want to have more memory on your main board,
you also need to have more memory slots.
I mean, over CPU generations and main board generations
accordingly, the number of slots also increased.
But also temperature-wise, from a technical or electrical engineering point of view, it's quite challenging to add more DRAM memory DIMMs.
Another limitation with memory located on the main board, or directly attached via DDR, is the distance. I already mentioned the electrical engineering challenges: everything is packed tightly together, which also makes it hard to design. With PCIe, for example, you can have much larger distances
between your CPU and the actual device.
You can, for example, also add some kind of riser cards,
as you can see it here in this visualization.
So, for example, let's say this is your motherboard
and you want to attach a PCIe device here,
but for some reason you need a little bit more space, because your PCIe device is too big to fit in there. You can add a riser card so that theoretically you could attach it here. But let's say for some reason this is still not enough; then you can also add retimers, which are basically components that retransmit a fresh copy of the signal. With that, you can extend the distance between your motherboard and your PCIe devices and then, for example, install an SSD array here. So basically, PCIe allows increasing the distance, which also helps when it comes to providing additional memory.
Another limitation or some factor that we see in today's data centers are stranded resources.
A resource is stranded when idle capacity remains while another resource is fully
used. A popular example here is CPU capacity and memory capacity. So let's say you have a very
memory-intensive application, for example, an in-memory database system, and you utilize
all the memory of your server,
but not all the CPU cores might be fully utilized,
then to provide more memory, you might scale out,
use an additional system here
and use the memory of the second system.
But on both servers now,
you do not fully utilize the CPU resources.
And also the other way around, you might have a very compute intensive application where
you utilize all the CPU cores, but you barely or you only use a small fraction of memory.
To deal with this issue, servers are usually over-provisioned, so they are configured in a way that ideally the peak demand of resources is covered. But this peak demand of an application is not its average demand. If the server is configured in that way, it will tolerate the peak workloads and will not fail in that situation, or will not go out of memory if we focus on memory consumption here. But on average, there is now a lot of memory provisioned that is not used.
And one way of dealing with this problem is that we can separate individual hardware resources. Let's say you have a server with CPU resources but just a small fraction of memory, and then you have some kind of disaggregated memory: memory that is located in a different server rack, or in a different chassis in your rack, and your server can now access that memory. This is a use case where CXL is getting more interesting. Another limitation or issue that we can see is data sharing.
In a distributed system where you have different nodes, small fractions of data are often shared and communicated over the network. Think, for example, of the nodes of a distributed key-value store: they want to perform certain transactions, they have to decide in which order transactions should be executed, and they need to communicate with each other. This is also known as distributed consensus, so they need to follow consensus protocols. There are often small fractions of data that are sent around over the network, and with such small data chunks, this communication is a typical delay in data centers and often dominates the waiting time for certain workloads.
Just to summarize here, and also to point to one or the other additional limitation: PCIe does not allow coherent memory access, so we cannot exploit temporal and spatial locality of data accesses.
Then with DDR we have a memory bandwidth limitation, as mentioned before.
Also, one aspect that I haven't mentioned yet: we are very limited in the type of memory that we can use if we only use DDR-attached memory. A single server usually supports only one type of memory, for example DDR4 or DDR5. So we cannot attach, let's say, DDR3 DIMMs to our mainboard; we are very restricted here. And there are also alternative memory media types that are not common on mainboards. So we are definitely limited if we only use DDR-attached memory.
And also, as mentioned with the stranded resources, we have a tight coupling of CPU and memory resources. And when it comes to memory sharing, PCIe does not support sharing memory across multiple systems while maintaining cache-coherent access. Compute Express Link tries to solve this, at least to a certain degree. I will talk about the details in a few moments, but it is basically a specification that comes with three sub-protocols, which are communicated and negotiated over the PCIe physical layer, as mentioned. The first one is CXL.io. This is, I would say, the base protocol. It is used, for example, for device enumeration, initialization, and device registration. So you need it for every device, for certain setup steps, so that you are able to communicate via CXL. It is based on PCIe, and it also supports the non-coherent load and store semantics of PCIe.
The more interesting ones are CXL.cache and CXL.mem. CXL.cache, as I tried to illustrate in the figure on the right (the brown arrow), allows devices to cache data stored in system memory.
And CXL.MEM is used for the other direction,
so CPUs can access and cache data stored in CXL device memory.
Then, based on the different protocols that a device supports,
there are three different
device types.
Note that the CXL.IO protocol is mandatory, so every device needs to implement it or needs
to support it.
And then the device could, for example, only support the cache protocol.
And an example use case here is a smart NIC with coherent access to system memory.
And in this case, the device, the smart NIC, can benefit from coherent caching to perform, for example, complex atomic operations, which are not trivial to do with PCIe.
And this kind of device type is called, according to the specification, type 1 device.
So the names are quite easy here, type 1, type 2,
and type 3 devices.
The type 2 device supports all the three protocols.
Use case here, or an example, is an accelerator with memory.
This can be, for example, a GPU.
This can be an FPGA.
And a CPU can write data into the device memory and also read results generated by the device from the device's attached memory. So the device has memory attached, and it is also able to cache data that is stored in system memory.
And then the third type only supports CXL.io and CXL.mem,
and this type of device is a memory extension device.
So it also has memory attached, and the CPU can access the memory in a cache coherent way
and use case here is mainly memory bandwidth and memory capacity extension.
And as I mentioned before with the limitations of DDR, you can also use different memory media types here. You can use DDR memory on your PCIe device, but you don't have to; you can also use alternative memory media types. And there are also hardware proposals in which SSDs are used that support the memory protocol, meaning you could actually access the SSD via the CXL.mem protocol from your CPU and then perform persistent writes to the SSD, even though from an application point of view you're writing to the virtual memory address space.
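To keep the three device types apart, here is a compact sketch of the mapping between device types and protocols as just described, written as C flags; this is my own illustration, not code from the CXL specification:

    enum cxl_protocol {
        CXL_IO    = 1 << 0,  /* mandatory for every device (enumeration, setup)  */
        CXL_CACHE = 1 << 1,  /* device may cache system memory                   */
        CXL_MEM   = 1 << 2,  /* host may access and cache device-attached memory */
    };

    static const int CXL_TYPE1 = CXL_IO | CXL_CACHE;            /* e.g. smart NIC          */
    static const int CXL_TYPE2 = CXL_IO | CXL_CACHE | CXL_MEM;  /* accelerator with memory */
    static const int CXL_TYPE3 = CXL_IO | CXL_MEM;              /* memory expansion device */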
Some more details about the Type 3 device, the memory extension device: it can also be seen as a cost-, power-, and pin-efficient alternative to DDR, as mentioned before in the limitations part. And it also offers flexibility in system topologies due to the longer trace length, as shown in the PCIe extension card example.
Then a special case is a global fabric attached memory.
So this is basically a Type 3 device,
but the scope is much higher here.
It can be connected to up to 4,096 endpoints.
There are different CXL generations,
which I will also talk about in a minute.
And this is rather a feature of a later generation
or later revision.
This is only available with CXL 3.0.
And with that, I briefly want to introduce you to the timeline.
When did it start?
Where are we right now?
What kind of revisions do exist?
Because it's very important to know about what revision
we're talking, because the feature sets are quite different.
CXL is an open interconnect specification, and the first revision, 1.0, was published in 2019. It basically specifies the protocols that I showed you and the three main device types I introduced you to. And it only allows devices to be directly attached to the host system. With 1.1, a few months later in 2019, some compliance test mechanisms were added, so no further features.
CXL 2.0 now allows larger topologies. Remember, with CXL 1.x you can only directly attach devices to your CPU, to your mainboard, but CXL 2.0 allows bigger topologies with one level of switching, and it also allows partitioning inside devices. So you can have one physical device with multiple logical memory partitions inside, which then, in a topology with a switch, can be allocated to different servers. CXL 3.0 was published in August 2022. Note here that this is the first
version that requires PCIe 6. The previous versions require PCIe 5.
And this generation or this revision
allows even bigger topologies with multiple levels
of switching.
It adds an advanced routing approach
that also allows connecting 4,096 endpoints with a single device.
It allows, and this is one quite important feature here,
it allows memory sharing across multiple hosts and devices.
So you can imagine one device with memory attached
that is connected to multiple servers
and each server can access that memory
in a cache coherent way.
And it also supports direct communication between devices which the previous generations or revisions do not support.
And then in November last year there was another version published, CXL 3.1,
which has some other features that are not listed here, but it also supports non-coherent
memory access across different CXL domains.
So before, the access was rather limited inside a single CXL domain, and now communication
across such domains is possible in a non-cache coherent way. With that I want to dive a little bit deeper into how cache
coherence is managed with CXL. For this I briefly want to introduce some more
details about the CXL.cache protocol. As mentioned, this protocol allows devices to cache data stored in host memory, and it uses the MESI protocol with a 64-byte cache line size. CXL is an asymmetric coherence protocol. You might remember a few lectures ago, when we talked about locking, we also introduced you to cache coherence, to snooping protocols and directory-based protocols, and we also briefly mentioned the MESI protocol. CXL is using that one at 64-byte cache line granularity, and it is an asymmetric coherence protocol, meaning that a CXL device which wants to cache system memory is itself not responsible for managing the coherence; instead, the host is responsible for the management and also tracks the coherence for peer caches.
And peer caches here can be caches of other devices. So assume a topology where you have one server and, for example, two CXL devices with caches, and both devices can cache system memory; then the host tracks which memory each device is caching. This keeps the control logic at the device simple, since the coherence management logic is located on the CPU side.
The device never directly interacts with other caches.
Again, this is what the host is responsible for.
And the device only manages its own cache
and sends requests to the host if it needs certain data.
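A minimal sketch of what this asymmetric split means in terms of state, assuming a simple host-side directory; the names and layout are my own illustration, not the specification's data structures:

    #include <stdint.h>

    /* The host tracks, per 64-byte cache line, which peer caches hold a copy and
       in which MESI state; the device only manages its own cache and sends
       requests to the host when it needs data. */
    enum mesi_state { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED };

    struct host_coherence_entry {
        uint64_t        line_addr;      /* 64-byte aligned physical address       */
        uint32_t        sharer_bitmap;  /* which peer caches may hold a copy      */
        enum mesi_state state;          /* aggregate state as tracked by the host */
    };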
The CXL memory protocol again enables the device to expose device memory so the CPU
can access that memory.
It basically supports simple reads and writes from the host to the device. And it's only the protocol that is specified, not the memory media type.
So DRAM, PMEM, high bandwidth memory, flash memory could be supported.
And it also allows a device to back-invalidate cache line copies.
And this is the part that is relevant for the scenario in which multiple servers share the same memory located on one device: from the servers' point of view, they are not in the same coherence domain, so we need some measures to invalidate cache lines. Let's say one server caches a certain cache line of our shared memory device, and then another server also accesses this memory and writes to the data. We somehow need to communicate to the first server that this cache line should now be invalidated, and for this reason CXL 3.0 introduced back-invalidation, meaning that the device itself can send back-invalidation requests to the connected servers.
Also, when you have memory attached on the CXL device, this is visible to the operating system.
This will be exposed in the unified virtual memory address
space.
So from an application point of view,
you're writing to virtual memory, which
makes the interaction or integration of CXL memory
quite easy from a programming point of view.
Okay, how is CXL integrated into the CPU cache hierarchy? So if we look at the modern CPU
hierarchy, we have multiple levels of coherent caches. We have our level 1 caches, which are small, have low latencies, and have the highest bandwidth compared to the other levels. We have our level 2 caches with larger capacities; they might be shared between multiple cores. And we have our level 3 or last-level cache, which is shared across many cores. And now CXL: we have two sides here. First, at the bottom, we leverage the CXL.mem protocol to access memory. But we can also have the case that we use the CXL.cache protocol, which allows devices, indicated here at the bottom in blue and greenish, to cache memory of the system. Such a device cache can have a size of, for example, one megabyte, and then via the CXL.cache protocol, for which there needs to be a controller implemented in the CPU, we can send requests that a certain cache line should be cached.
And then above our last-level cache, if we have a multi-socket server, we have our coherent CPU-to-CPU inter-socket interconnect. For Intel this would be the Ultra Path Interconnect, and on the AMD side this would be Infinity Fabric, also known as xGMI. On our CPU, we also have our home agent, which on the one hand resolves conflicts of last-level caches trying to cache the same cache line, the same address, and on the other hand fetches data from memory via the memory controllers. This memory controller can, if we do not use CXL, just be an integrated memory controller where a memory DIMM is connected to the DDR interface, or it can be a memory controller located on a peripheral device, on a CXL device, and in this case the home agent needs to communicate with this memory controller via the CXL.mem protocol.
I want to show you some example protocol flows: how the CXL.cache and how the CXL.mem protocols work.
And for this we need to know some terminology here.
So we saw the home agent before,
but there's not only one home agent.
There can also be multiple home agents
depending on the address that's being addressed.
So if I, for example, want to access memory that is connected
or that one of these memory controllers is responsible for, then
I have my home agent on that socket.
If I want to access memory that is directly attached to the neighbor socket, then the
neighbor socket's home agent is responsible.
Then we have peer caches. We look at peer caches from the perspective of a CXL.cache device, the one that is marked here. A different CXL device and also the caches on my CPU are considered peer caches.
Then I have my memory controller. As briefly mentioned before, this can be an integrated one,
but this can also be a memory controller located on my CXL device.
And also note, this memory controller could also be located on a remote socket.
Okay, if we want to write data, for example, this is an example protocol flow. Note that we will use a meta value in this example flow; this is only mandatory for type 2 devices. Remember, type 2 devices support CXL.cache and CXL.mem, so they can cache system memory but also allow the system to access their own memory. These metadata bits are necessary for this type of device because, for each memory access from the server side, the host encodes the required cache line state and thereby exposes the state to the device. The device thus has further knowledge and can store the information about which cache lines are cached in one of the host caches or in the peer caches. This is relevant when it comes to protocol or communication paths, because there is a differentiation between "the host has copies of the memory" and "the host does not have any copies". But I will also talk about it in a moment.
Right, back to the write protocol flow here. Let's assume the host CPU wants to write data, so it sends a request to the memory controller. We assume a CXL type 3 device here, which therefore does not support the cache protocol. So we send a memory write request with the metadata value. For type 3 devices this metadata value is optional, so it's up to the host how to use it.
It can be used, for example, for security attributes
or compression attributes.
And the memory controller then sends the write command to the actual memory media, so this can be a DDR DIMM in this case, again with the metadata. But the memory controller does not need to wait for the completion from the memory media before sending a completion back to the host. It can already send a completion response back to the host when it can ensure that the written data will be visible when the next read request arrives. So this can be earlier than the completion from the memory media. This is a quite simple flow of how a CPU can use the memory protocol for a write.
Let's also have a look at a read. In this case, we assume that the host wants to
read a certain address and it also updates the meta value that is stored for the address.
Again, it sends a request first to the memory controller,
which then sends the read request to the memory media type.
The media sends the data back to the memory controller and also sends back the meta value that was stored before. And since we need to update the meta value here, the memory controller needs to send a write with the updated meta value, which is also completed. But the memory controller can already send the data back to the host after the data was returned as a response from the memory media. In this case, the memory controller's response to the host also includes the meta value that was stored before.
In the second read case, we do not have any meta value update here, so therefore this
workflow or this protocol flow is even easier.
We just send the read to the memory controller and the read to the memory media, which then responds with the data and the meta value and both are then returned to the host.
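To summarize the two CXL.mem flows just described, here they are as compact message sequences; the names are simplified paraphrases of the figure, not the exact message names from the specification:

    static const char *mem_write_flow[] = {
        "host     -> mem ctrl : memory write (addr, data, meta)",
        "mem ctrl -> media    : write (addr, data, meta)",
        "mem ctrl -> host     : completion  (may be sent before the media is done,",
        "                       as long as a later read is guaranteed to see the data)",
    };

    static const char *mem_read_flow_with_meta_update[] = {
        "host     -> mem ctrl : memory read (addr, new meta)",
        "mem ctrl -> media    : read (addr)",
        "media    -> mem ctrl : data + previously stored meta",
        "mem ctrl -> media    : write back the updated meta",
        "mem ctrl -> host     : data + previously stored meta (can be returned before",
        "                       the meta write-back completes)",
    };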
Okay, this was the CXL.mem side. Now let's have a look at the CXL.cache protocol. Remember that we have the home agent; which home agent is used depends on the memory address, so on which socket the target memory is located. We have our different peer caches, meaning the caches of other CXL devices or the caches of the CPU. In this case, we have a CXL device which wants to cache a certain cache line, and its state is first invalid, so it sends a read request to the home agent. The home agent is responsible for managing the coherence, so it also needs to send snoop messages to the peer caches. There could be a peer cache that has the requested cache line in exclusive state, and since our CXL device will receive this cache line, the cache controller of this peer cache needs to change the state.
Also remember that the caches and the cache controllers usually maintain a finite state machine, and the peer cache will then send a response that the cache line state was changed accordingly. In parallel, already after the home agent received the request from the CXL device, it can contact the corresponding memory controller, sending a memory read request, which will then respond with the data. The home agent then sends two messages back to the CXL device that requested the cache line. First, there is a global observed event, meaning that globally (this can also be multiple peer caches, it doesn't need to be only one) the cache states of all the peer caches have been managed; the home agent sends this event once the cache lines are managed coherently. And the data will also be sent to the CXL device. When the coherence management approval, the global observation, arrives, the CXL device can already switch its own cache line state, and it will also receive the data.
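The read flow just described can be summarized as the following message sequence (the device's line starts in invalid state, one peer cache holds it exclusively); the names are simplified versions of the ones in the figure, not spec-exact:

    static const char *cxl_cache_read_flow[] = {
        "device     -> home agent : read request (shared)",
        "home agent -> peer cache : snoop                   /* downgrade E -> S */",
        "home agent -> mem ctrl   : memory read             /* in parallel */",
        "peer cache -> home agent : snoop response (state changed)",
        "mem ctrl   -> home agent : data",
        "home agent -> device     : global observed (GO)    /* device may go I -> S */",
        "home agent -> device     : data",
    };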
The write flow looks a little bit different, it's a little bit more complicated, but if
we compare with the read flow, the first half is very similar.
So first we have our request: the home agent manages the coherence of the other peer caches, the peer caches respond accordingly, and the home agent fetches the data from the memory controller. It will send a global observed event, but in this case we request ownership of the cache line. Before, we only wanted to read the cache line, and we had one peer cache that had it in exclusive state. Since we only wanted to read, copies of the cache line may exist in multiple caches, so we only needed to change the exclusive state to shared, and when the CXL device has the data, it also holds the cache line in shared state, since multiple caches have a valid copy. Now we want to write at the CXL device side, so we need the cache line in exclusive state. Therefore, the home agent needs to invalidate the other copies of the cache line, and then, similar to before, it sends a global observed event back to the CXL device with the data. Then, at some point, there will be a silent write. Silent write here means that there is no immediate write-through: the device can write the data locally without any communication to the home agent or to other caches.
And at some point, the cache line will be evicted, for example because some other data needs to be fetched into the cache, or, depending on how the cache and memory hierarchy is implemented for the specific CPU, from a lower-level cache to the last-level cache. But at some point an eviction of this cache line can be triggered, and at this point we need to make sure that the data is written back. Therefore, the device sends a dirty evict event to the home agent. The home agent sends a write pull message to the CXL device, and with that, the data is sent to the home agent. The cache line state is updated at the device side, and a memory write is triggered by the home agent: the home agent communicates with the correct memory controller, again depending on the address, and the memory controller will at some point send a completion event.
Okay, these were the easier examples. There are different ways of how memory that is located on a device is handled.
Remember, as shown here on the right side, we have type 2 and type 3 devices, both have memory attached to the device.
And the memory, or the parts of the memory, that are exposed to the CPU are called host-managed device memory, or HDM for short.
And the CPU's caching agent interacts with this HDM via the CXL.MEM protocol, also as visualized in the architecture of the CPU before.
And with that, it is integrated into the host's coherence domain. There are different management options for this type of memory, for this exposed memory. The first one is the easiest one.
It is host memory expansion with host-only coherent memory. This is valid for a type 3 device: the memory only needs to be coherent for the host, so there is no cache on the device that needs to be maintained. This is the easier case. Then there are two more interesting ones: accelerator memory exposed to the host as device-coherent memory, which is applicable for type 2 devices, and device-coherent memory using back-invalidation. I briefly touched on back-invalidation before; it allows memory sharing across multiple servers, and with that, the way memory accesses and cache coherence are managed differs from the other approaches. As I briefly mentioned, the first one is the easiest, and I will not go into more details about it. Let's have a four-minute break and then look into the more interesting ways of how to manage cache coherence.
I will briefly go back to that slide, to this one. Note that we saw that when the CXL device
requests a certain cache line
it needs to communicate with the home agent
which then resolves
coherence issues here and manages the coherence
and communicates with the peer caches accordingly.
But we definitely see that there's this flow
from the device to the home agent.
But it could also be the case that the CXL device wants to access its own memory, since it has memory attached in the case of a type 2 device.
And this is something that we want to look into right now.
Right, so this is the device-coherent memory case: a type 2 device can access its own memory without violating coherence and also without communicating with the host. I mentioned before that the host is responsible for tracking the cache states of the peer caches, but there can also be some tracking logic implemented in type 2 devices. In this case, the device contains a device-internal caching agent, which is responsible for tracking which cache lines are cached at the host side. So, for each cache line of the device's memory: is that cache line cached in one of the host's caches? For this, the device can implement a bias table, and this bias table stores one bit per page. This bit indicates whether the host may have a copy of the corresponding cache lines.
And then depending on the state, there can be two states here.
There can be device bias, and in this case the host does not have a copy of a certain cache line,
or the cache line can be in state host bias. In this case the host has a copy of the related cache line.
In this case the read goes through the host as we saw in the example before.
But if the cache line is in device bias state then the device can access its memory directly without any communication to the host.
Again, the host tracks which peer caches have copies, so the type 2 device does not store further cache state information about other peer caches.
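A minimal sketch of such a bias table, assuming 4 KiB pages and, purely for illustration, 256 GiB of device memory; the layout and names are my own, the specification only describes the concept:

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12                      /* assumed 4 KiB pages                 */
    static uint64_t bias_table[1u << 20];      /* 1 bit per page: 64 Mi pages x 4 KiB */

    /* true  -> host bias  : the host may hold copies, access goes via the home agent
       false -> device bias: the device may access this memory directly             */
    static bool page_is_host_bias(uint64_t device_phys_addr) {
        uint64_t page = device_phys_addr >> PAGE_SHIFT;
        return (bias_table[page >> 6] >> (page & 63)) & 1;
    }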
And this protocol flow diagram shows how such a bias flow happens. Before, we looked at type 3 devices and how memory access happens there; now we focus on type 2 devices. This device has memory, therefore it also needs to have an internal memory controller, and "CXL device" here rather represents the cache of the device. Let's assume that the device wants to cache its own memory, but based on its bias table it knows that the page for this cache line is in host bias state.
So therefore it cannot just simply read the memory from its own memory with its memory controller.
It needs to communicate with a home agent.
Because there could be some peer caches that have this cache line in certain valid states, shared, exclusive, modified.
So therefore, it needs to send a read-own (RdOwn) request to the home agent.
Again, the home agent resolves coherence.
And this can already happen while the CXL device is reading its own memory. This only works because there is something called a memory read forward message, but I will come to this in a second. Keep in mind: first, the CXL device needs to communicate with the home agent; then, as shown before, the home agent needs to resolve the cache coherence states by communicating with the other peer caches, and it can then send a memory read forward message, which allows this memory read in the upper part here. What would be the alternative? This read forward basically allows the device to read its own memory instead of requesting the memory from the host.
If we look at the cache protocol and the memory protocol separately: with the CXL.cache protocol, we request certain memory from the host, while the memory protocol is there to get certain data from memory. If we had them strictly separate, with no bias flow optimization, the home agent would now need to communicate via the memory protocol with the CXL device, get the data, and only then could the CXL device retrieve it. But this would be communication back and forth, some kind of communication ping-pong, and it can be avoided by the home agent also supporting this read forward response, which allows the type 2 device to fetch the data by itself instead of telling the home agent again to fetch the data. The internal memory controller then returns the data to the cache of the CXL device and also updates the cache line state from invalid to exclusive. At this moment, the device also has enough information to know that all other copies of the same cache line were invalidated at the peer caches, so we now have the device bias state for that specific cache line.
Are there any questions to that flow?
Because I think this was at the end a little bit more complicated with the data ping pong.
Yes?
What's the purpose of the first memory read from the device to its memory?
We need to somehow read it. The idea of this bias flow is that we do not need to fetch the data from the home agent; instead, the caching agent of the CXL device can get the data directly from its integrated memory controller. Therefore, we need to somehow request it, otherwise we would not get the data response. And if you look at this arrow here, maybe this was not completely clear: the message from the home agent at that point, after managing the cache coherence in this part, goes directly to the memory controller, and the data then basically goes to the CXL device instead of being sent back to the home agent.
So here's the question: could we actually avoid this and have sufficient information in the memory read forward so that it then goes back directly?
Yeah, good point. I can only tell you what the protocol specifies. I assume there is a reason, but I don't know the details right now of why this is required here. Good catch.
Okay, and now let's have a look at the other scenario, in which the cache line that the CXL device wants to access is in device bias state. In this case, no other peer cache has a valid copy of that cache line, and the device can simply send a read request to its memory controller, and the memory controller responds with the data. Very straightforward.
And now the last way of managing cache coherence is back-invalidation. Remember, this is a feature of CXL 3.0, and it is valid for type 2 and type 3 devices. Those devices can then send back-invalidate snoop messages via the CXL.mem protocol to the host.
When a device sends this back-invalidation message, the host then manages the coherence by invalidating the copies of the device memory's cache lines. This also allows the device to implement an inclusive snoop filter (it didn't show on the slide, and now it does). An inclusive snoop filter is basically logic on the device that tracks the host's caching at cache line granularity, so the device always knows which of the cache lines of its own memory are cached at the host side. And to enable this communication and the back-invalidation, each device that supports back-invalidation needs to have a caching agent.
And note that type 3 devices do not support the CXL.cache protocol. But this back-invalidation, which is one way of managing cache coherence, is implemented in the memory protocol. So even though memory devices do not implement CXL.cache, this feature sits on the memory protocol side, and therefore memory extension devices, type 3 devices, also need to support it if they want to back-invalidate cached copies.
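A small sketch of what such an inclusive snoop filter could track per cache line; the layout is illustrative, the specification only requires the behavior, not this structure:

    #include <stdint.h>

    struct snoop_filter_entry {
        uint64_t line_addr;    /* 64-byte aligned device physical address          */
        uint16_t host_bitmap;  /* bit i set -> connected host i may cache the line */
    };
    /* Before letting a conflicting access proceed, the device sends a
       back-invalidate snoop to every host whose bit is set and clears the bits
       once the invalidation responses have arrived. */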
And again, if we go back to the example where we have a device with memory attached and multiple servers can access it in a cache-coherent way, it definitely makes sense that, even though the device is not able to cache any of the servers' memory, we need this feature of back-invalidating the caches so that no inconsistent memory access or write happens to a fraction of shared memory.
Okay, with that, I want to talk a little bit more about
topologies and about the topology scope. So I introduced you to the different
revisions 1.0, 1.1, 2.0, 3.0, and 3.1. The first major versions are in the single-server scope: we do not have bigger topologies with multiple levels of switches, not even a single level of switching, so devices are really directly connected to a server. With CXL 2.0, we are expanding the scope to rack level with a single level of switching. With CXL 3.0 and 3.1, we are still at rack level, or one might also consider a multi-rack level in which we have multiple switches; in the neighboring racks there might be, for example, only memory devices and switching infrastructure, so that the different servers are connected to these memory-only devices.
Architecturally, it's theoretically possible to connect 4,096 endpoints. There is a quite comprehensive introduction to CXL by the CXL Consortium, which is driving this technology, and they also mention that even though it is theoretically possible, they would rather expect a scale of hundreds of endpoints than 4,000 endpoints.
Just to let you know.
Okay.
When we look at a single device,
there are also differences in the different generations.
So I want to talk about multiple logical devices, which are introduced with CXL 2.0. With CXL 1.1, we can only have one logical device per physical device, so each physical device represents one logical device. But with CXL 2.0, memory devices, so type 3 devices, can be partitioned: one physical device can be partitioned into multiple logical devices. You can think about it in a way that our memory is partitioned, for example, in this case into four partitions, and each individual partition can then be assigned to a different server. With CXL 2.0, such a device supports up to 16 domains, so you can assume 16 partitions that can theoretically be connected to 16 different servers.
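A tiny sketch of this multiple-logical-device idea, assuming a type 3 device split into at most 16 partitions that can each be bound to a host; names and fields are illustrative only:

    #include <stdint.h>

    #define MAX_LOGICAL_DEVICES 16

    struct logical_device {
        uint64_t capacity_bytes;  /* size of this memory partition                     */
        int      bound_host;      /* id of the host it is assigned to, -1 = unassigned */
    };

    struct physical_mld_device {
        struct logical_device partitions[MAX_LOGICAL_DEVICES];
    };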
Then I want to talk about resource pooling, also
a feature that CXL2 introduces.
You can assume a topology in which you have a switch; note that with CXL 2.0 there is only a single level of switching. And there is a fabric manager that manages the assignment, so which device is assigned to which host, and also when it is released and maybe reassigned to another host. These topologies are possible on the left side with only single logical devices, but you can also combine them with multiple logical devices so that you have both in your topology. But also note that switches add latency. And if access latency is a high priority for your application and you, for some reason, definitely want to avoid the additional latency of switched topologies, you could also use direct connections in which you connect the servers to different devices. But note that one device then has multiple connections, and this is of course only possible when the physical device also supports multiple connectors, multiple slots, into which you can plug your PCIe/CXL cables.
With CXL 3.0, the scope is even larger with something
called CXL Fabric.
So CXL Fabric extends the topology scale
from a node to rack or multi-rack level
as shown before in the visualization
with multiple levels of switches.
And before 3.0, CXL only supported tree topologies.
And now this is not a restriction anymore.
So non-tree topologies are supported. And if you wondered where the limit of 4,096 endpoints comes from: CXL 3.0 implemented something called port-based routing, which uses 12 bits as an identifier for each individual port, and therefore allows a maximum of 2^12 = 4,096 ports.
Right. CXL Fabric also specifies the domain in which we can access different devices. A single CXL domain scopes multiple hosts, ports, and devices within a physical address space.
So you can have a large topology,
but you can have different CXL domains on this large topology.
So let's say you have, as in the picture before, depending on which devices
and hosts are connected together,
they form a CXL domain.
Oh, sorry.
And Fabric basically connects one or more host ports
to devices within each domain.
So therefore, in a topology with multiple switches, with multiple servers, multiple devices,
we can have multiple CXL domains in the topology.
And in terms of CXL fabric, there are some additional terms that you might find if you want to read more about CXL.
One is fabric-attached memory. This is basically a memory resource connected to multiple hosts.
There are different ways of how this memory is connected: it can be pooled or shared. In the pooled fabric-attached memory scenario, a memory region is only assigned to a single host, and, as you can imagine, in the shared scenario a memory region is assigned to multiple hosts. You can also see this in this visualization; again, we are talking about CXL 3.0 in this case. Pooled fabric-attached memory is, for example, the dark... okay, we only have one case here: the orange one is only attached to host 3 in that case. But we can see the bright green one.
This is shared fabric attached memory, which is assigned to host 3, 4, and 5.
All of them can coherently access it, as you have learned before with the back-invalidation technique of cache coherence management.
There's another term called logical device fabric
attached memory.
And such memory is basically type 3 devices
with up to 16 logical devices.
So here, we do not use CXL Fabric, so we do not use
port-based routing, which is essential for CXL Fabric. Then we have global fabric-attached memory, and this is the opposite of LD-FAM: here we have fabric-attached memory exposed using port-based routing links. As shown before, this global fabric-attached memory can be shared across multiple hosts or peer devices. If we now have this kind of memory exposed to multiple servers and devices, the memory resources are located somewhere on a certain device; this is then a global fabric-attached memory device, a GFAM device. This device has its own physical address space, and if we want to access the memory located on that device from our server, the device is responsible for translating the host physical address to the device physical address.
And also in the case if we have now multiple servers accessing this global fabric-attached
memory, then this device needs to maintain multiple translation tables for the different physical address spaces of the servers and the device
address space.
Then, as also shown in multiple of the topology pictures before, we have the fabric manager. This is logic, separate from the switching function itself, that performs tasks related to resource binding: it allows allocating resources, let's say memory resources, to a server, and it allows releasing and reallocating them. In all the pictures before, and also in this one, it is located in the switch, but this is only one option. It does not necessarily need to be implemented physically in the switch; it is basically a program or some logic that can be located in the switch, but you could also imagine a separate device, or it could, for example, be implemented on the server side, the host side. What is important is that the fabric manager, however it is implemented, needs to have some connection to the switch for binding the resources and for managing the ports and devices. As I briefly mentioned, it allows dynamic resource assignment and reassignment, and there are different ways to communicate with it: the API is media-agnostic, so this can be, for example, Ethernet, this can be BMC
or USB.
Right, so I think with that I wanted to show you the most, in my opinion, interesting features of CXL 2 and 3 in the last slides. Are there any questions?
[Question from the audience: is this switching something completely new, or are there some similarities to existing PCIe switches?]
It is very similar. There are PCIe switches existing without CXL support, and CXL switches are similar to PCIe switches with additional support for the CXL protocols.
And also, even though I mentioned the Fabric Manager does not necessarily need to be implemented in the switch, we can assume that it will be.
So we also talked to some hardware vendors that also mentioned that this will be expected in the switch.
And latency-wise, maybe I can briefly jump to it right now. There is existing work from last year, a paper called Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, and one of the interesting parts is its latency estimations. They compare local DRAM, where in addition to some latency on the CPU and fabric side there is latency at the memory controller and DRAM side, with, for example, a setup with a switch in the very bottom case here. Just to give you some numbers: if you have a directly attached device, a memory access would take, based on this estimation, about 85 nanoseconds, and a level of switching would add an additional 70 nanoseconds in this case. You can also see that there is additional latency, and I will come back to this shortly: every CXL port adds additional latency, and we might have retimers. Retimers I showed you in the beginning on the PCIe slide, where we can expand the distance between a core and, in this case, memory. So, just to give you a number that you can somehow compare to local memory access latencies.
Okay, I very briefly want to talk about programming with CXL device memory.
I mentioned before that device memory is exposed in the unified virtual address space, and this also allows us to use libraries that already exist and have been used for virtual memory, for local memory, and also for remote-socket memory. So we can basically apply what we used in NUMA-aware data management, let's call it that, to CXL device memory.
So what we can see or what we see when we have a type 3 device attached to our server, is that we have an additional NUMA node.
Let's say we have a single-socket server without any CXL devices: we might have one NUMA node. With modern CPUs, even these memory regions can be subdivided, so that on a single CPU you still have multiple NUMA regions; in this case, the memory controllers and the cores are grouped into NUMA regions. But you can think about it as a NUMA node.
A type 3 device could be exposed to the operating system as a NUMA node without any CPU resources, so a memory-only NUMA node. Then, if we assume a Linux operating system, you can use a system call such as mbind to set a memory allocation policy for a specific memory region that you've allocated, or set_mempolicy to change the memory allocation policy for your thread without limiting it to a given memory region, or also move_pages to move memory pages from one NUMA memory region to another, which also means from local memory to CXL memory or in the other direction.
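As a minimal sketch of how this looks in practice (assuming Linux with the libnuma headers, linked with -lnuma; the CXL memory appearing as NUMA node 1 is an assumption, check numactl -H on your system):

    #define _GNU_SOURCE
    #include <numaif.h>      /* mbind, MPOL_BIND */
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1UL << 30;                       /* 1 GiB region */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        unsigned long nodemask = 1UL << 1;            /* node 1 = assumed CXL memory node */
        if (mbind(buf, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0) != 0)
            perror("mbind");                          /* falls back to the default policy */

        ((char *)buf)[0] = 42;                        /* first touch allocates the page on the CXL node */
        return 0;
    }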
I briefly want to talk about some performance numbers that we can also really find in CXL work
that is in the field of data management.
There was one paper published last year at MICRO called Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices. It has a long list of authors, and they present a CXL device memory benchmark and analysis. In their work, they have three CXL memory devices: two ASIC-based ones and one FPGA-based one. And they basically perform a set of microbenchmarks
and end-to-end benchmarks.
And at the top right figure here,
they also show random memory access latencies
for the different devices.
They used as benchmark tools first
the Intel Memory Latency Checker.
So if you ever want to do some memory access
latency measurements, this is a widely used tool
provided by Intel, but they also use Memo,
which is their benchmark tool that they provide
or that they developed and used for their work
and also provided as open source contribution.
In this figure, you can see the different workloads on the x-axis. Let's focus on the Memo workloads first: on the very left is the Memory Latency Checker, and with their Memo tool they have a load, a non-temporal load, a store instruction, and a non-temporal store instruction. Just to get you on the same page: non-temporal basically means that you bypass the cache, so you can read data from memory without propagating it into your cache hierarchy, and also the other way around. On the y-axis, they show the access latency normalized to local memory. The other options they have here are, in blue, the different CXL devices, and in yellow, the access latency for memory that is located on a remote socket.
And, okay, now we don't have absolute numbers here, but we can assume,
let's assume that the local case is about 100 nanoseconds.
It can be a little bit lower, it can also be a little bit higher depending on the system, but just for a rule of thumb here we could
calculate with 100 nanoseconds, and then we can see that all of these accesses are within the range of 100 to 500 nanoseconds. Interestingly, the non-temporal stores perform quite well compared to the other access operations. And just to give you a little bit more information about the different devices: they call them CXL-A, B, and C, and there is an additional table in the paper, which you can see at the bottom right. A and B are hard IP, meaning ASIC-based, while the last one is soft IP, meaning FPGA-based.
And I mentioned before that we do not have memory media type restrictions.
But in this case, the authors used devices with DDR memory attached.
In two cases, it was DDR4.
In one case, DDR5 with a different speed here that you can see at the end of this description.
And also the resulting memory bandwidth shown here.
So with these devices, we are in a range between 16 and 38
gigabytes per second.
And I already briefly showed it.
There's another work called Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, which was also published last year. They have different analyses in their paper, for example workload sensitivity to memory latency, or the effectiveness and latency of different CXL memory pool sizes. But what I find quite interesting is this figure with a prediction. One of their contributions is a prediction model for latency and resource management at data center scale. They evaluated with simulated memory, so they did not have real CXL hardware, but I think this figure that they provide is quite insightful, showing that your memory access latency really depends on your topology. So you cannot definitively say that CXL memory access has a latency of, say, 500 nanoseconds, because even if you have the same device, it depends on the topology, on how the device is attached to your CPU, how long it takes to actually access it.
And as you can see here, it ranges from 85 nanoseconds in the local case to more than 270 nanoseconds in the case where we have a large memory pool. There, we first have the CXL port latency, which adds 25 nanoseconds; we have a retimer, which adds 30 nanoseconds; then we have a switch, again with two ports, each adding 25 nanoseconds; then we might have some network-on-chip latency and also some switch arbitration latency, which is basically the process of managing and coordinating access to a shared communication medium inside the switch; and then again retimer latency. So it really depends on what the topology looks like and what the characteristics of the device are.
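As a rough additive illustration using the per-component numbers just mentioned (the exact composition in the Pond paper differs slightly, and the switch-internal share is not broken out precisely, so treat this only as a back-of-the-envelope sum):

    local DRAM path                    ~  85 ns
    + CXL port (host side)             ~  25 ns
    + retimer                          ~  30 ns
    + switch, 2 ports x 25 ns          ~  50 ns
    + switch NoC / arbitration         ~  a few tens of ns
    + retimer                          ~  30 ns
    + CXL port (device side)           ~  25 ns
    ----------------------------------------------
    ≈ 245 ns plus the switch-internal share, i.e. the "more than 270 ns" regime shown in the figure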
And even if we assume a CXL device directly attached to our CPU, with our given PCIe throughput or PCIe bandwidth, then theoretically there could also be another bottleneck on the device.
In commercially available products, I would assume,
maybe I'm not saying too much here,
but I would assume that we do not have a limitation on the device,
but that rather the PCIe connection
is the limiting factor.
But theoretically, as also shown here in this figure, in these prototypes that the authors used,
they have DDR memory DIMMs attached on the CXL device.
And the limitations that you can see here
are rather based on the maximum bandwidth
that you can achieve with these DDR DIMMs attached.
Alright, I want to finish up with some available hardware.
I will be quick since we are running out of time.
There are already CPUs available that support CXL.
For example, the server CPUs from Intel, Sapphire Rapids, and also the later generation, Emerald Rapids: both of them support CXL 1.1. And there are also Intel Agilex 7 FPGAs available, which also support CXL 1.1.
Regarding Sapphire Rapids, you can see a visualization here. I talked about NUMA nodes and also NUMA sub-regions; this is an example of a CPU that can be exposed as multiple NUMA nodes. This CPU physically consists of four dies, four CPU chips, shown in light blue, which are connected (the purple part here) with an inter-die interconnect. Basically, each of these quarters can be exposed as one NUMA node with its integrated memory controllers. The red parts here are the DDR memory interfaces, and if you want to use CXL, you basically have to use the orange parts, which are the PCIe/CXL ports. Each of these ports supports 16 lanes, so per die that would be 32 lanes, and in total you have 128 PCIe/CXL lanes that you can use for peripherals, which can be plain PCIe devices or CXL 1.1 devices.
AMD also has CXL support; for example, the AMD Genoa CPUs also support CXL 1.1. And there is another product, the AMD 400G Adaptive SmartNIC (sorry, not a CPU, a smart network card), which supports CXL 2.0. Also, ARM recently announced that with Neoverse V3 there will be up to 128 cores, with support for CXL 3.0 and High Bandwidth Memory 3.
When we look at devices, this is just an example of Samsung devices. They recently announced memory modules that are connected via PCIe 5 with 8-lane connections. They have 256 gigabytes of capacity, a bandwidth of up to 28 gigabytes per second, and 520 nanoseconds of average access latency. Samsung calls it the CMM-D, short for CXL Memory Module D. And if you have multiple of them, then you have a CMM-B, which is basically a memory module that can host up to eight of the above ones. So then this would end up at a maximum of two terabytes. I haven't looked into the details about the bandwidth here: Samsung claims 60 gigabytes per second, and I definitely want to look into that, because each individual module supports 28 gigabytes per second, so I wonder what the limitations are. But that's at least what they claim for the second device, with access latencies lower than 600 nanoseconds, and it should be CXL 1.1 and 2.0 compliant. And one feature of the CXL 2.0 specification is the support for pooling of single logical devices.
In summary, if your friends and family ask you tonight what CXL is, I hope you can tell them that CXL tries to address different limitations of the PCIe and DDR interconnects. The previous ones do not provide coherent memory accesses, they have memory bandwidth limitations, they only support homogeneous memory media types, they tightly couple CPUs and memory together, and they do not allow memory sharing, which CXL is trying to address. We talked about the different protocols and device types
that this CXL specification defines.
We looked into managing cache coherence,
how is CXL integrated into a modern CPU,
and how does the CPU and the device
communicate via CXL.cache and mem.
We briefly talked about the different revisions, about feature sets of these revisions and
also how the revisions determine the hardware scope.
And I also showed you some example topologies and how topologies can be built and also what the consequences of different topologies are,
for example, in terms of memory access latencies.
With that, we are at the end of this session.
Thanks for your attention.
Are there any questions?
All right.
Everyone wants to go for lunch.
And then the next topic will be networking.
There will be a session about it next Wednesday. But the next session, on Tuesday, will be another task session in which, as I mentioned in the beginning, we present the solution of task 2 and also introduce you to task 3, which is about buffer management. Thank you.