Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Compute Express Link
Episode Date: July 25, 2023
Transcript
All right, yeah, I think we can start. Good morning, everybody. Welcome to the last session
of this semester's Hardware-Conscious Data Processing lecture. Today, I'm going to talk
about Compute Express Link, a new hardware interconnect. In the last months, we talked
about different hardware aspects. We started with the CPU basics, CPU internals, different
instructions, SIMD instructions. We moved on with memory technologies.
We also talked about persistent memory and then moved over to peripherals connected via
PCIe, for example, network interface cards or accelerators such as GPUs, FPGAs, also
disks.
And today we want to talk about a different hardware interconnect which is not
that far away from PCIe. It's actually PCIe-based, and it's called Compute Express Link.
Today is the last lecture of this Hardware-Conscious Data Processing
lecture series, and tomorrow we will have a data center tour. All the details can be found in the Moodle.
Many of you already voted on whether you will attend or not.
I personally will not be there, but Laurence will pick you up.
The location is also mentioned in Moodle.
It's at the south entrance of this building.
Please be five minutes early, so at 10:55,
so that you can start at 11 a.m.
Okay. In this lecture, we first want to talk about limitations of today's hardware
interconnects. After that, I want to give you a Compute Express Link overview. One important
aspect of this interconnect is cache coherence, which is why I want to talk about this in a little bit more detail.
The interconnect specification defines different protocols.
Among them are the cache and the memory protocol
that I will talk about.
Also, CXL has different generations.
And there are certain important enhancements
with the last generation, which is the third generation.
And then at the end, I want to talk about some performance
estimations and some experimental measurements
that have already been done.
The lecture is mainly based on different documents
from the CXL consortium.
If you are interested in the topic,
feel free to check
them out. There are a lot more details that I cannot cover in this lecture. Okay, let's start
with the limitations of today's interconnects. First we want to look at PCIe. With PCIe we do not
have cache-coherent accesses. So if you want to access the host's memory from a PCIe device, you have to do it with non-coherent reads and writes.
A PCIe device cannot cache the system's memory to exploit, for example, temporal and spatial locality.
Also, the other way around, if you want to access the PCIe device's memory from a host CPU, this has to happen non-cache-coherently, and we can
also not map the PCIe device's memory into our CPU's global memory
space. If we look at accelerators, for example FPGAs, we also have the issue
that data structures have to be moved from the host's main memory
to actually the accelerator's memory.
Then at the accelerator we can process the data before we then have to move the data
back to main memory.
And therefore multiple devices cannot access parts of the same data structure simultaneously
together with the CPU without moving the data structure back and forth.
Another limitation of today's interconnects is memory scaling. Usually memory is attached via
the DDR interface, so you are probably familiar with the DDR DIMMs. The demand for memory capacity and also bandwidth increases with the growth of
compute resources, and DDR memory actually struggles to match this demand. So we are limited here in
both memory capacity and also bandwidth per CPU. And if we compare the DDR interconnect with PCIe, we can also observe that per pin, DDR is not as efficient as PCIe.
Here on the right side, you can see a PCIe generation 5 connector with 16 lanes. This allows a transfer rate of 32 giga
transfers per second, which translates to about 63 gigabytes per second, and this
can be achieved with 82 pins. If we look at a modern DDR5 DIMM, we
would achieve, if we have a device with 6,400 megatransfers
per second, which is already quite a lot,
about 50 gigabytes per second
with 288 pins.
So per pin, we can actually achieve a higher bandwidth
with PCIe pins.
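To make this per-pin comparison concrete, here is a small back-of-the-envelope sketch in Python using exactly the numbers just quoted (PCIe 5 x16 with about 63 gigabytes per second over 82 pins, and a DDR5-6400 DIMM with 288 pins); it is only a rough comparison, not an exact electrical analysis.

# Rough per-pin bandwidth comparison using the numbers from the slide.
def gb_per_s_per_pin(total_gb_per_s: float, pins: int) -> float:
    return total_gb_per_s / pins

# PCIe 5.0 x16: 32 GT/s per lane, ~63 GB/s per direction, ~82 pins on the connector.
pcie5_x16 = gb_per_s_per_pin(63.0, 82)

# DDR5-6400 DIMM: 6400 MT/s * 8 bytes per transfer ~= 51.2 GB/s, 288 pins.
ddr5_6400 = gb_per_s_per_pin(6400e6 * 8 / 1e9, 288)

print(f"PCIe 5 x16 : {pcie5_x16:.2f} GB/s per pin")   # ~0.77 GB/s per pin
print(f"DDR5-6400  : {ddr5_6400:.2f} GB/s per pin")   # ~0.18 GB/s per pin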
Another limitation related to DDR is that we only have a limited number of DDR slots.
So if we want to scale up with more DDR DIMMs, then of course our mainboard also needs to provide more channels,
which would introduce higher costs and also make it more challenging with regard to signal integrity.
Another aspect regarding memory scalability is that with PCIe we can
move the memory further away from the CPUs. There exists something called a
retimer, which you can see here in the middle,
which is a physical-layer, protocol-aware,
software-transparent extension device.
And it actually retransmits a fresh copy of the signal.
So it does not only amplify the signal, but it resends it.
And such extension devices are necessary
when the electrical path between a root complex and an endpoint
is longer than the specification allows.
So in this example on the left, you have a main board.
And you can also attach something like a riser card.
This allows you to move the PCI slot further away.
And then at this extension, you can theoretically
plug this retimer in and attach
different PCIe devices, for example an SSD array.
And therefore, you have a much larger physical space
that you can use to attach different devices.
Also, if we look at DDR DRAM DIMMs, you already saw this figure in the storage lecture.
Since 2013-2014, the growth in capacity per dollar has stagnated, so it stays flat.
And therefore, this is also a limiting factor in this aspect.
Another issue with today's interconnects is stranded resources.
A stranded resource is when you have some capacity of a certain resource type, let's say memory, for example,
and this capacity remains unused or idle while another type of resource is fully used.
So the cause here is the tight coupling of different resource
types on one main board.
And this actually results in servers over-provisioning
different resource types to then handle workloads
with peak capacity demands.
And an example here could be a very memory-intensive
application that uses all available memory, but it's not very compute intensive, so we do not fully utilize our CPUs.
And vice versa, you can also have servers with a very compute-intensive application where the CPUs are fully utilized, but the memory might not be fully used.
And these stranded resources actually have a negative power and cost impact.
And this could also be observed in data centers of, for example, Microsoft, Google, Alibaba,
but also AWS.
Another aspect is data sharing.
So if you imagine a distributed system,
then the different components or the different nodes
often rely on fine-grained synchronization.
We often have small updates that are latency-sensitive
because we might have some data flows
where we need to wait until we received a certain answer
from a peer node.
And an example here could be distributed databases, in which
we have to communicate kilobyte-sized pages, for example.
Or distributed consensus. Consensus here means that we
need to communicate some data to agree on what transactions
should be committed and also in which order. And compared to page sizes, such distributed consensus
messages can be even smaller.
And therefore, with these small data chunks,
communication delay in typical data center networks
dominates the wait time for updates,
which can actually slow down these use cases.
So we can summarize these four challenges as follows.
We have the missing coherent access challenge.
We have scalability limitations, as mentioned.
We can observe in today's data centers
that resources are not fully utilized,
and we have resource stranding.
And another issue is that
there is the need for actually fast cache coherent data sharing.
And these challenges are actually what CXL tries to solve.
And CXL is a PCIe-based open standard interconnect between processors and devices.
Such devices can be accelerators, memory buffers, network interface cards, or ASICs.
And CXL offers, on the one hand, coherency and memory semantics with a bandwidth that
scales with the PCIe bandwidth.
And just as a reference with PCIe 4, for example,
with 16 lanes, we can achieve up to 32 gigabytes per second.
And with PCIe 5, we can achieve up to approximately 64
gigabytes per second per direction
if we have a 16-lane connection.
CXL defines new protocols using the PCIe physical layer.
So we use already-existing hardware here, and PCIe generation 5
allows alternative protocols.
So you can still use the physical layer of PCIe,
but communicate with a different protocol.
And CXL specifies specifically a cache and memory protocol
here to optimize such data flows.
We have three major generations, 1, 2, and 3.
There was a minor change from 1.0 to 1.1,
but this only included additional compliance testing
mechanisms.
Generation 1 and 2 are both PCIe 5 based,
and generation 3 requires PCIe 6.
Therefore, it can still take some time until we
actually see CXL 3 in the industry. The three different specified protocols are
CXL.io, CXL.cache, and CXL.memory, which I will talk about in a minute. The
development of Compute Express Link is driven by the CXL
consortium, which has grown to about 250 companies since 2019. Talking about the
different protocols: CXL.io is based on the PCIe protocol, and it is used,
for example, for device discovery, status reporting, virtual-to-physical address translation, and direct memory access.
And it uses non-coherent load and store semantics of PCIe.
The more interesting protocols are cache and the memory protocol.
The cache protocol is used by a device to cache the host memory.
So if you have a look at the right figure here, let's assume a host,
which can be a single- or multi-socket system, and a certain CXL device. For simplification, we
just assume that both have a cache and memory. With the CXL.cache protocol, it is now possible that the CXL device can cache the host's memory and
access this memory in a cache-coherent way. The memory protocol
does the same in the other direction. It enables CPUs and also other
CXL devices to access device memory as cacheable memory.
And with that, it also enables a uniform view
for the CPUs across device memory and also the host
memory.
So the device's memory is also in the CPU's unified global memory
address space.
CXL.io is mandatory for all devices,
and the cache and memory protocols are optional, which
actually brings us to the different CXL device types.
Depending on what protocol the device supports, it can be classified as a Type 1 device, Type
2 device or Type 3 device.
Starting with the Type 1 device: this implements, besides the mandatory CXL.io protocol,
the cache protocol.
And an example use case here are smart network interface cards
that use coherency semantics along with direct memory access
transfers.
The second device type implements all three protocols.
And use cases here can be accelerators,
for example GPUs or FPGAs, with local memory that is partially
used by the host as coherent device memory.
Or another use case is that devices cache the host memory
for processing, because for some data processing
on the accelerator, it wants to access data that
is in the host memory that is located on the DRAM
DIMMs of the host.
And the last type here is the type 3 device.
This only implements.io and the memory protocol.
And the use case here is simply memory bandwidth and capacity expansion.
If we look at the third device type, then such a memory extension device can be used as
a cost-, power-, and pin-efficient alternative to adding more DDR channels to server CPUs.
And as we could see with the retimer example
in the beginning, it also offers flexibility in system topologies
since we can have longer trace lengths.
With the third generation of CXL,
there is one specialized Type 3 device.
It's called Global Fabric Attached Memory.
This is very similar to a regular Type 3 device,
but it actually allows connecting up to 4,096 nodes
using an advanced routing protocol that I will not
cover in this lecture.
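To keep the three device types apart, here is a small sketch that maps each type to the protocols it implements, following the classification just described; the dictionary layout and the helper function are only illustrative.

# Device-type classification as described above: CXL.io is mandatory for all
# types, and the optional cache/memory protocols determine the type.
CXL_DEVICE_TYPES = {
    "Type 1": {
        "protocols": ["CXL.io", "CXL.cache"],
        "example": "smart NIC using coherency semantics with DMA transfers",
    },
    "Type 2": {
        "protocols": ["CXL.io", "CXL.cache", "CXL.memory"],
        "example": "accelerator (GPU/FPGA) with local memory exposed to the host",
    },
    "Type 3": {
        "protocols": ["CXL.io", "CXL.memory"],
        "example": "memory bandwidth and capacity expansion device",
    },
}

def classify(has_cache: bool, has_memory: bool) -> str:
    """Derive the device type from the optional protocols it implements."""
    if has_cache and has_memory:
        return "Type 2"
    if has_cache:
        return "Type 1"
    if has_memory:
        return "Type 3"
    return "plain PCIe-style device (CXL.io only)"

print(classify(has_cache=False, has_memory=True))  # Type 3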
OK. One of the features,
as I mentioned, is that we, from a host CPU,
can access the device's memory.
And this memory that
is now accessible and cacheable by the CPU
is called host-managed device memory.
And there are three different use cases for that,
also with different requirements for the memory protocol.
The first use case is host memory expansion.
In this case, we have a Type 3 device, as shown before, and this only requires host-only coherency.
The second use case here is an accelerator with memory exposed to the host.
And in this case, it has to be device coherent since the accelerator can also access the host's memory.
And then there is another special use case that is introduced with the third generation,
which is device memory exposed as device-coherent memory, so similar to the second use case,
but now it uses back-invalidation.
This is an extension of the memory protocol, which actually allows Type 2 and Type 3 devices to back-invalidate caches,
but I will talk about this later on. The key aspect here is that this
memory exposed to the host in a cache-coherent way is called host-managed device
memory, and we have different use cases for that. If we have such memory within a Type
2 or Type 3 device and it can be accessible to multiple hosts, then
this is also called fabric-attached memory, which we will also talk about later on.
Okay, so now we know the three base device types and what we call the memory
that is exposed to other CPUs or devices.
And what we also have to take into account is we have to differentiate between logical
and physical devices.
For all the type 1, 2, and 3 devices, we can have single logical devices, meaning that
all the resources of that device can only be assigned to one potential host.
In this example, we see a Type 3 device only with memory,
but this can also be an accelerator, a smart NIC, or any other Type 1 or 2 device.
The other type of device here is called a multiple logical device. It's still one
physical device, but it partitions the resources, and each individual partition
can be assigned to a different host. This is only valid for Type 3 devices,
meaning memory-only devices, and you can have up to 16 logical devices that you can each assign to different
hosts. In this example, we have a maximum of three partitions assigned to three different hosts.
So this is already a first way of realizing some kind of memory pooling. You remember the inefficiency because of resource coupling.
And with this approach, you can already have a device
with different memory regions assigned to different hosts.
But this can be even improved with such topologies that you can see here.
So this is something that is introduced with the second generation of this interconnect,
which actually allows single level switching.
So if you see on the very left, you can use single logical devices,
connect them to a switch and also multiple hosts.
And then you have a central standardized fabric manager,
which can be used to assign different of these single logical devices to different hosts.
And in the middle, you can see a slightly different topology in which you can also use multiple logical devices.
So you can also mix them so it doesn't have to be either MLDs or SLDs. And also in this case, you have a standardized fabric manager, which is responsible for assigning
different logical devices to hosts.
If we use a switch, of course, from a latency perspective, we also have additional overhead,
and you can reduce the overhead if you do not use that switch. For example, in the rightmost topology, you
can also use multiple logical devices and use direct connections between your
hosts and your devices, but then you do not have the same dynamic flexibility as
with the switch in which the fabric manager can dynamically
reassign the different devices to different hosts.
If we increase the generation number again and look at CXL 3.0, then
we can also realize resource sharing,
or more specifically, memory sharing.
In this topology, we can still assign
individual logical devices to certain hosts,
but it's also possible to assign memory regions
to multiple hosts, which can then cache coherently
access this memory range.
For example, in this case here, the blue memory region of the logical device can be assigned to hosts 1 and 2,
while the light green, for example, can be assigned to 3, 4 and 5.
And hosts 3, 4 and 5 can cache coherently read and write to the data that is on that memory region.
Regarding hardware scope, we also have different granularities. With CXL 1.0 and 1.1, we are more
in the scope of a single server. This is because switches are not supported. So you don't have single or multilevel switches.
And you have to directly attach your PCIe CXL device
to the server.
With CXL 2.0, we have single level switches.
This generation actually requires tree topologies,
therefore also not allowing multilevel switches.
But here, we are more in a rack-level scope.
And with CXL 3.0, as we mentioned before,
we can have up to 4,096 endpoints connected.
It supports multilevel switches.
So therefore, we are here in the scope of multiple racks
in a data center.
All right. With that, I want to move on with cache coherence,
which is one of the key or one of the important features
of CXL.
Cache coherence protocols usually
track the state of any copy of a certain data
block of physical memory.
And snooping is one approach to implement that tracking.
And every cache with a copy of the data
from this block or cache line of physical memory
can then track the sharing status of the block that it has.
And caches are typically all accessible via some broadcast
medium, which can be, for example,
a bus that connects the different level caches,
so the per-core level 1 caches with a shared cache or memory.
And all the caches have cache controllers
that monitor, or snoop, the medium, i.e. the bus, to determine whether they have a copy of a certain block that is requested on the bus.
A write-invalidate protocol is the most common approach to actually maintain coherence. And the approach with this kind of coherence protocol
is that it ensures that the processor has exclusive access
to a data copy before writing to that copy.
All other copies are invalidated.
So no other readable or writable copy of the cache line,
for example, exists when the write occurs.
Such a snooping coherence
protocol usually introduces a finite state controller
per core.
And this is actually responsible for responding
to requests from the core's processor and also from the bus, and it then changes the state of the
individual cache line copies and also uses the bus to access or invalidate
data. An example of such a protocol is the MESI cache coherence protocol, which
we also briefly touched in one of the
lectures before. It is a write-back cache coherence protocol used for snooping on the bus or on the
broadcast medium. And it is actually named after the initials of the different states. So we have
the state modified, exclusive, shared, and invalid for each cache line. The state invalid here means that the line does not contain valid data.
Shared means that multiple caches have copies of this cache line and the data is also valid,
meaning it's up to date.
A line can be in exclusive state, meaning no other cache has a copy of this particular
cache line.
The data is still valid.
And a cache line can be in modified state, meaning the data in this line is valid,
but the copy in main memory is stale.
And no other cache has a copy of that cache line.
So the cache that has this cache line in modified state would be the owner of this cache line
at this time.
After we boot the system, all the caches have their cache lines marked as invalid.
And here we briefly go through an example with three CPUs
and memory.
And the CPUs and memory are connected via a bus.
And first, the CPU1 reads block A, and it fetches the data.
And since no other cache has this cache line currently
in its cache, the cache line is in exclusive state.
If now CPU2 reads the block,
CPU1 gets notified or observes
that there is now another cache that holds this cache line.
Therefore, it changes the state from exclusive to shared.
So both caches of CPU1 and 2 have this cache line now
in shared state.
If CPU 2 wants to write this cache line, then
also other peer caches, or in this case,
the cache of CPU 1 gets notified that it
has to invalidate the cache line.
And then CPU 2 has this cache line in exclusive state and
can then modify its copy. If now CPU 3, here at the top on the right side, wants to read that block,
then before this read, CPU 2 has the cache line in modified state. Therefore, when CPU 3 signals that it
wants to read the data, CPU 2 indicates that CPU 3 first
has to wait, because CPU 2 needs to write back
the data to the memory.
And after the data is written back,
CPU 3 also gets notified about the data being written back and can then fetch the data
from memory, and CPU 2 and CPU 3 then have the same cache line in shared state.
And then, similar to the write before, if CPU 2 wants to write to this cache line
again, the cache line has to be invalidated at CPU 3.
Therefore, CPU 2 has this cache line in exclusive state
and can then write to the cache line so that it is then in modified state.
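To make this walkthrough concrete, here is a minimal MESI sketch in Python that replays exactly this sequence; it is a simplified textbook model (a write jumps straight to modified state and write-backs are only printed), not a model of any real hardware.

from enum import Enum

class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}          # block address -> State

    def state(self, addr):
        return self.lines.get(addr, State.INVALID)

class Bus:
    def __init__(self, caches):
        self.caches = caches

    def read(self, requester, addr):
        # Other caches snoop the read: a modified copy is written back first,
        # and every existing copy is downgraded to Shared.
        others_had_copy = False
        for c in self.caches:
            if c is requester:
                continue
            if c.state(addr) is not State.INVALID:
                others_had_copy = True
                if c.state(addr) is State.MODIFIED:
                    print(f"{c.name}: write back block {addr} to memory")
                c.lines[addr] = State.SHARED
        # Requester gets Exclusive if it is the only holder, otherwise Shared.
        requester.lines[addr] = State.SHARED if others_had_copy else State.EXCLUSIVE

    def write(self, requester, addr):
        # Write-invalidate: all other copies are invalidated before the write.
        for c in self.caches:
            if c is not requester and c.state(addr) is not State.INVALID:
                c.lines[addr] = State.INVALID
        requester.lines[addr] = State.MODIFIED  # simplification: straight to M

cpu1, cpu2, cpu3 = Cache("CPU1"), Cache("CPU2"), Cache("CPU3")
bus = Bus([cpu1, cpu2, cpu3])

bus.read(cpu1, "A")    # CPU1: Exclusive
bus.read(cpu2, "A")    # CPU1: Shared, CPU2: Shared
bus.write(cpu2, "A")   # CPU1: Invalid, CPU2: Modified
bus.read(cpu3, "A")    # CPU2 writes back, CPU2: Shared, CPU3: Shared
bus.write(cpu2, "A")   # CPU3: Invalid, CPU2: Modified
for c in (cpu1, cpu2, cpu3):
    print(c.name, c.state("A").value)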
Talking about CXL: CXL
does this coherence tracking in hardware,
at cache line granularity, similar to existing cache coherency hierarchies.
It uses 64-byte cache lines,
and it actually uses existing services from the PCIe
for address translation.
So to translate virtual addresses,
it uses address translation service.
This is specified in the PCIe specification.
The CXL cache coherence protocol is asymmetric, meaning the host is responsible for resolving coherency, which significantly simplifies implementing coherency in devices.
A host can scale to multiple CPU sockets. In a multi-socket system with Intel CPUs, you have the Ultra Path Interconnect (the previous generation was the QuickPath Interconnect), and with AMD CPUs, you have Infinity Fabric.
And these vendor-specific protocols define an internal home agent to resolve coherency between the different host caches. These home agents are also
now responsible for incorporating CXL.cache devices and enforcing a
simple MESI coherence protocol.
And if a CPU wants to support CXL.cache devices, then it is
expected to size its tracking data structures
accordingly, so that it can also track the coherent state of cache lines
of connected CXL.cache devices.
So, usually we have multiple levels of coherent caches.
We have our small level one caches with small capacity with the lowest latency and also
with the highest bandwidth. Then we have level 2 caches, they have larger capacity, might
be shared between multiple cores but they also have a higher latency and lower
bandwidth compared to L1 caches and then in our level 3 caches
also called last level caches we have the highest capacity but also still a lower
bandwidth and higher latency, and they're shared between many CPU cores. CXL
now allows devices to directly engage in the cache hierarchy of the CPU below the
last-level cache, as we can see here with the CXL.cache block.
So if we assume two devices that support the CXL.cache protocol, then in the CPU's cache hierarchy, the caches of these devices sit as peers to the cores within the CPU socket, and the expected
cache size of these devices is one megabyte or smaller. Above our last level
cache then we have our home agent potentially connected with other CPU sockets via the vendor-specific CPU to CPU
link, so UPI or Infinity Fabric in the case of Intel or AMD.
And the home agent resolves the conflicts of last level caches
trying to cache the same address.
And on the top here, we can see then different kind of memory.
First, data that is accessible via DDR channels,
or we can also have CXL memory devices attached.
So therefore, we could access this memory
using the CXL PCIe interconnect
and then use the memory controller
that is located on the memory device.
Okay, now I want to talk in a little bit more detail about the CXL.cache protocol.
This protocol, as I mentioned before, enables a device to cache host memory using the MESI protocol.
We use 64 byte cache lines.
And it specifies or uses 15 different request types
for cacheable reads and writes from the device to the host.
Here, I mentioned it's an asymmetric protocol.
So we keep the protocol simple here at the device side,
meaning that the host is responsible for tracking
the coherence of the peer caches.
And the device also never directly
interacts with any peer cache.
So here, also, the host is responsible if there is some interaction required and the
device only manages its own cache and sends requests to the host.
The CXL.cache protocol uses two communication directions.
One is from the device to the host, also called D to H,
or host to device, H to D. And per direction,
it uses three different communication channel,
a request, a response, and a data channel,
as you can see on the right.
In the direction from the device to the host,
the requests are requests to get cacheable access for reading or writing memory.
And in the other direction, if
a host has to perform some requests, then these are
mainly snoop messages for updating cache states, for example, invalidating certain
cache lines that are currently stored in the device's cache.
Very briefly, the 15 different requests from the device to host can be classified in four
different classes.
So we first have reads, meaning with such requests,
we request the coherent state and data for a cache line.
And the response then from the host
is, as expected, the coherent state with the data.
Then we have Read0 requests.
In this case, the device only requests a coherent state,
but without the need to get the data.
And this is, for example, used to upgrade existing cache copies
and update the state, for example, from shared to exclusive,
or also to bring a line into exclusive state if
data needs to be written. The Write class is used to evict data from the device cache;
this is used for dirty data, but also for clean data.
With such a write request, the host will indicate a write pull. This means that the host is ready to receive the data from the cache
device. Before the device gets this response, the host also sends a globally observed message to the
device, which indicates that the host successfully resolved any required cache coherence communication.
It might be the case that, if a device wants to get the cache line in exclusive state and other
caches have a copy of the cache line, the host needs to invalidate
the other peer caches first, and if this has happened successfully, then this globally observed message is sent to
the cache device. And then the last class is Read0-Write, also called streaming
writes. In this case, the CXL.cache device writes data to the host directly without
having any coherent state prior to issuing the request, and similar to the Write class we also get
write pull messages and globally observed signals from the host to the cache device.
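As a small illustration of this classification, here is a sketch of the four device-to-host request classes; the enum names follow the lecture's grouping, the opcodes in the comments are the ones that appear in the flow examples below, and the message layout itself is purely illustrative.

from dataclasses import dataclass
from enum import Enum, auto

class D2HRequestClass(Enum):
    READ = auto()         # request coherent state AND data (e.g. a read-shared request)
    READ0 = auto()        # request coherent state only, no data (state upgrades)
    WRITE = auto()        # evict data from the device cache (e.g. a dirty evict)
    READ0_WRITE = auto()  # streaming write, no prior coherent state needed

@dataclass
class D2HRequest:
    req_class: D2HRequestClass
    address: int          # host physical address of the 64-byte cache line

# Example: the device asks for a shared copy of a cache line.
req = D2HRequest(D2HRequestClass.READ, address=0x4000)
print(req)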
Before looking at some example data flows, I want to introduce you to some of the
terminology that is relevant here. In the figure on the right, the left
device is marked with a red frame. This is our device from which we
want to communicate with the host via the CXL.cache protocol, and from this
perspective we have a certain home agent, and the home agent's location depends
actually on the address that is being read.
So it can be on our local socket, but it can also be the case that our home agent is located
on a remote peer socket.
Then our peer cache or peer caches in general are all the caches that we have on peer CXL devices with a cache,
or also CPU caches that are in our local or remote socket. And a memory controller
is either a native DDR memory controller, which can be on our local or on the remote socket,
or our memory controller can also be located on a CXL memory device, i.e. a peer CXL memory
device.
So also here, our target memory controller that might be involved in our cache protocol communication depends on the address that we want to access.
OK, knowing this terminology, let's
have a look at the read flow example of the cache protocol.
Here in this case, the CXL device
wants to read a certain memory address from the host's memory,
and the cache line is currently in valid state.
Now it has to send a read shared request to the home agent.
Note that one other peer cache currently has a copy of this cache line in exclusive state.
Since our device only wants to read and does not need the data in exclusive state, it just requests the data in shared state.
The home agent then, after receiving this request, requests from the peer caches to
update the state of that cache line, because the one cache that has this cache line is
not the only one anymore. Therefore, the
cache state has to be updated from exclusive to shared, and the
corresponding cache will then also notify the home agent that it
successfully changed the state of that cache line copy and in parallel the home
agent also sends a memory read request to the memory controller, which responds with the data.
And when the home agent has received the acknowledgement of the peer cache
that the cache state was updated,
and has also received the data from the memory controller,
it can then first send a signal
that the shared state of this cache line is now globally observed.
With that message, the CXL device can update its own state of the cache line copy,
and the data is also sent from the home agent to the CXL device.
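Condensed into a short trace, the read flow just described looks roughly like this; the message names are paraphrased from the lecture, not exact specification mnemonics, and the snoop and memory read that happen in parallel are simply printed in order.

def cxl_cache_read_shared():
    print("device      -> home agent : RdShared(addr)            # line is Invalid locally")
    print("home agent  -> peer cache : Snoop(addr, downgrade E->S)")
    print("home agent  -> memory ctrl: MemRd(addr)                # issued in parallel")
    print("peer cache  -> home agent : SnoopResponse(now Shared)")
    print("memory ctrl -> home agent : Data(addr)")
    print("home agent  -> device     : GO-S                       # Shared state globally observed")
    print("home agent  -> device     : Data(addr)")
    print("device: cache line state Invalid -> Shared")

cxl_cache_read_shared()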
The write flow looks a little bit different.
So in this case, we have again our CXL device with a cache line
in invalid state.
But now we actually need the cache line in exclusive state
since we want to modify it.
Therefore, we request this cache line in exclusive state.
Note that a peer cache
or multiple peer caches can have a copy
of that cache line in shared state.
When the home agent receives this request,
it also sends snoop invalidate signals to the peer caches,
which then invalidate their cache line state
and report the change back to the home agent.
In parallel, the home agent again reads the memory, and after receiving the snoop responses and the
data, the home agent again sends a message to the CXL device notifying it that the exclusive state of that cache line copy is globally observed, and the data is sent to the CXL device.
Then the CXL device, with the copy in exclusive state, can perform a silent write.
Silent, because no other cache has a copy, and therefore we do not need to notify any cache to change
a coherent state.
And at some point, when the CXL device's cache
wants to evict this cache line, then
we need to send a dirty evict signal
to the home agent, which then might resolve some cache states,
but not in this case here, since the line was in modified state. It sends a
globally observed write pull, and once that signal is received on the CXL device side, the cache line of
the CXL device can be updated to invalid, and the data is sent to the home agent, which then writes the data
to the memory controller, which acknowledges this with a complete message.
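The corresponding write path, again as a paraphrased trace: the device first obtains exclusive ownership, writes silently, and later evicts the dirty line; as before, the message names are simplified and not exact specification mnemonics.

def cxl_cache_write_and_evict():
    # 1. Obtain ownership (line is Invalid locally, peers hold it Shared).
    print("device      -> home agent : ReadExclusive(addr)")
    print("home agent  -> peer caches: SnoopInvalidate(addr)")
    print("home agent  -> memory ctrl: MemRd(addr)")
    print("peer caches -> home agent : SnoopResponse(now Invalid)")
    print("memory ctrl -> home agent : Data(addr)")
    print("home agent  -> device     : GO-E + Data   # Exclusive globally observed")
    print("device: silent write, state Exclusive -> Modified (no peer holds a copy)")

    # 2. Later: evict the dirty line back to host memory.
    print("device      -> home agent : DirtyEvict(addr)")
    print("home agent  -> device     : GO-WritePull  # host ready to receive the data")
    print("device      -> home agent : Data(addr); local state Modified -> Invalid")
    print("home agent  -> memory ctrl: MemWr(addr)")
    print("memory ctrl -> home agent : Complete")

cxl_cache_write_and_evict()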
Yeah, sure.
What kind of time frame are we talking about for a whole round trip?
Yes. You mean, so the entire round trip from the CXL device to actually getting the completion
response from the memory controller?
So most of the data that you can find are estimations, because there are not that many CXL devices
commercially available yet.
But looking at the estimations from the CXL consortium: if we have a Type 3 device directly attached to our server and we
access its memory, then based on some CXL consortium documents, we are in a range of
less than 200 nanoseconds.
Okay, then let us continue with the memory protocol. As mentioned before, this is used
to enable simple reads and writes from the host to memory.
And the protocol is designed to be
memory media type independent.
So behind this protocol, there can actually
be high bandwidth memory, DDR memory attached on the CXL device or persistent memory.
And with CXL 2.0, CXL also supports mechanisms to manage persistence here, which I will not go into in more detail.
And also the protocol uses two additional bits for metadata per cache line.
These bits are optional for type 3 devices, meaning for memory-only devices.
And the host can here define the usage of these two bits, for example, for some security attributes or for compression attributes.
But these bits are
mandatory for Type 2 devices. In these bits, the host encodes the required cache state.
So with these bits,
the protocol exposes the host's coherence state to the device, and this allows the device to know what state the host is caching for each address
in the host-managed device memory region.
The protocol uses or has two different communication
directions.
One direction is from the master to a subordinate.
Master can here be a host or a switch.
And subordinate to master is then in the other direction.
Each direction has two channels.
In the direction from M to S, master to subordinate,
we have requests and requests with data.
In the other direction,
we have non-data responses and data responses. Note that this is valid for the first two
generations, CXL 1.0/1.1 and 2.0. With CXL 3.0, there is another channel per direction: from the subordinate to the master, so from
our device to a host or switch, this is a back-invalidate channel, and in the other direction, this is
a back-invalidate response channel. This is actually an important feature for CXL 3.0, because it allows larger topologies and also cache-coherent access between different
CXL devices, which I will talk about later on.
Also here we want to look at a certain memory protocol flow.
Let's look at writes first. On the very left, we have our host, which
wants to write data. For this, the host sends a memory write request to our CXL
device. We assume a Type 3 device here, meaning a memory-only device, which
has a memory controller and its memory media. After the memory
controller receives the memory write request, it writes the data. Note that
here the optional meta field is used, so the host requests that the data is
written with this meta value 3. The meta value is also stored together with the data
at the memory media.
And the memory controller can already send a completion
when it can ensure that the next read request would
give the correct data.
So it can already send a completion when
the data is visible to future reads, even though the memory
controller might not have fully committed
the data on the memory media.
And let's look at a read flow.
Here again, the host wants to interact with the memory,
and it sends a memory read request.
Also for read requests, the host can specify this meta value.
In this case, it is 0.
And the memory controller then reads the data
from the memory media.
As mentioned, the meta value is stored together with the data
and at the memory media it is set to 2.
And after reading it, the memory controller will send the data
together with the old meta value to the host.
But even though it's only a read request here,
the memory controller still needs to perform a write to the memory media to update the requested meta value.
And this gets simplified if the host does not have any meta value requirements.
In that case, the memory controller can simply read the data and return it to the host.
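A toy model of this Type 3 memory controller behavior could look as follows: data and the two meta bits are stored together, a write completes as soon as later reads would see it, and a read with a requested meta value triggers an extra media write; this is purely illustrative and not how a real controller is implemented.

class Type3MemoryController:
    def __init__(self):
        self.media = {}  # address -> (data, meta)

    def mem_write(self, addr, data, meta):
        self.media[addr] = (data, meta)
        return "Cmp"          # completion: future reads will return this data

    def mem_read(self, addr, requested_meta=None):
        data, old_meta = self.media[addr]
        if requested_meta is not None and requested_meta != old_meta:
            # Even a read needs a media write to update the stored meta value.
            self.media[addr] = (data, requested_meta)
        return data, old_meta  # the old meta value is returned to the host

mc = Type3MemoryController()
mc.mem_write(0x1000, b"cache line payload", meta=3)
print(mc.mem_read(0x1000, requested_meta=0))  # returns the data and the old meta value 3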
So far, we talked about the cache protocol and now about some simple flows of the memory protocol,
but there are also devices, Type 2 devices,
that support both the cache and the memory protocols.
And such a device, usually an accelerator,
can access its own memory without violating the coherence and without communicating with the host.
For that, a Type 2 device has to implement a so-called bias table, which is a coherence directory with an entry for each of its cache blocks that indicates whether the host has a copy of that cache block.
If the host has no copy, then this state is called device bias.
And in this state, the type 2 device can directly read its own memory.
But if the host has a copy of that cache block, then it has to read the data through the host.
And the host here is responsible for tracking potential cache block copies in other caches or CXL devices as well.
So the Type 2 device itself
has to track whether the host has a copy of that cache block, but it doesn't have to track
if some other peer cache has a copy of this cache block.
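A minimal sketch of such a bias table, under the simplifying assumption of one entry per cache block: only in device bias can the device read its memory directly, otherwise it has to resolve the access through the host first.

from enum import Enum

class Bias(Enum):
    DEVICE = "device bias"  # host holds no copy -> direct local access allowed
    HOST = "host bias"      # host may hold a copy -> must resolve via the host

class Type2Device:
    def __init__(self):
        self.bias_table = {}  # cache block address -> Bias

    def read_own_memory(self, block):
        if self.bias_table.get(block, Bias.DEVICE) is Bias.DEVICE:
            return "read directly from the device's memory controller"
        # Host bias: ask the home agent, which snoops/invalidates host copies,
        # then forwards the read back to the device's memory controller.
        self.bias_table[block] = Bias.DEVICE   # afterwards no host copy remains
        return "resolved through the host, then read locally"

dev = Type2Device()
dev.bias_table[0x2000] = Bias.HOST
print(dev.read_own_memory(0x2000))
print(dev.read_own_memory(0x2000))  # now in device bias: direct read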
And this protocol flow is visualized here for the host bias case, meaning
that the cache line that we want to read is present in the host cache.
So in this case, we have a type 2 device with its internal memory controller. And this device, using the cache protocol,
wants to read the cache block.
It's actually a cache block that is located on its own memory.
But it has to communicate with the home agent
first, because in CXL, especially in generations 1 and 2, the cache coherence is host-managed.
When the CXL device sends the read request to the home agent, the home agent again, as we saw before, has to invalidate the corresponding copies in other caches
and therefore sends snoop messages to the peer caches.
These again respond that they successfully changed the state of their cache line copies
and the home agent can then send a memory read forward. Why would it send a memory read forward in this case? Usually, with the flow that we saw before,
the home agent would read the data from the memory controller
and when it receives the data, it will then send the data to the CXL device,
which would be wasteful in this case, since the CXL device itself has the memory. Therefore, the home agent can
send this message to the internal memory controller of the CXL device, letting the memory controller
know that the data needs to be sent to the CXL device's cache, which then happens. After the CXL device has received the data,
so it's now in its cache, the cache line can be updated
or the copy of the cache line can be updated from invalid to exclusive.
And the state is still in device bias
since we did not propagate any cache line copies to other peer caches.
So to summarize here, the home agent resolves the coherency while the device can read its own memory. And the workflow looks quite...
Oh, sorry, a little mistake here.
So before we were in host bias state.
And since the home agent at this point invalidated the cache line state of peer caches,
we can be sure at this point that no other peer cache has a copy of that cache line.
And therefore, in our bias table, where we track which cache lines
might be available in the host cache, we can change the state from
host bias to device bias. But we started initially with host bias, and
here we have the flow when the state is device bias, meaning no other peer
cache has a valid copy of the data that we want to read. Therefore, we do not have to
communicate with the home agent, and the CXL device can directly read the memory from its
internal memory controller, which then returns the data, and we can change the state of that
cache line copy to exclusive. And since we did not propagate any copy of this
cache line to other peer caches, we are still in the device bias state, which is
stored in the bias table. Okay, as I mentioned, we have the
different protocol generations or interconnect generations
and there are some important enhancements with the CXL 3.0 generation.
We already saw multi-level switching before, which is introduced with that
generation. We have support for up to 4096 end devices.
We can build large fabric topologies with multiple paths between the source and the destination pair.
Before, in generations 1 and 2, the specification requires that there is only one path between a source and a destination. Therefore, from the host perspective,
we always would have tree topologies. And with the third generation, we can also break this
constraint and support non-tree topologies. Another limitation of generations 1 and 2 is that only one CXL Type 1 or Type 2 device can be present in one topology from the host perspective.
This is because we need to track the different states of cache line copies of our CXL cache devices.
So remember, our device caches are host-managed,
and therefore we need some tracking information or some tracking data structures in our CPUs.
And they also have to be updated depending on what state the cache line is in. And with the third generation,
we have these back invalidations
and back invalidation responses of the memory protocol
that we saw before.
And with that, we do not have this limitation anymore.
We can have multiple Type 1 and Type 2 devices,
up to 16 according to the specification.
We also double the bandwidth.
This is simply because it requires PCIe 6, which also compared to PCIe 5 doubles the
bandwidth.
So with a 16-lane connection, we can achieve up to approximately 128 gigabytes per second
while we are at 64 approximately with PCIe 5.
It also allows direct peer-to-peer access from a PCIe device or a CXL device
to the coherent exposed memory hosted by a type 2 or type 3
device.
And for this peer-to-peer communication,
a host does not have to be involved.
And one of the most important features
is this possibility to share memory in a coherent way
across multiple hosts.
This brings us again to the back-invalidation. I want to highlight it again.
So you saw before that in the memory protocol we have the new flows, back-invalidation and back-invalidate response.
You saw before that this also enables a new memory type, which is memory that
is exposed to the host but also supports caching at the device. So the
cache coherency also needs to be present at the device, but the device itself can
now invalidate other caches with these back-invalidation requests. There are three
different use cases. One is the one I mentioned before: direct peer-to-peer communication between CXL devices.
Another use case is that we can map a larger chunk of memory of the type 2 device to
be coherently accessible to the host.
I talked before about the accelerators, the Type 2 devices. Assume an FPGA with a certain amount of memory: not all the capacity of this memory has to be exposed in a cache
coherent way or defined as HDM, because we have some limitations here, since the CPU
needs to track the cache lines of this exposed memory.
And this is actually a limiting factor here.
And with back-invalidation, we can map a significantly larger chunk of memory
to be cache-coherently exposed to the CPU or other CXL devices.
So before, we needed to track the coherence states
for our bias flow. You remember that we are either in device or in host bias.
And if the Type 2 device wants to access its own data, we need
to check whether we need to communicate with the host or whether we can directly read our data.
For that, we have our bias table tracking the state of the cache lines. Now we do not have to implement the full bias table anymore;
we can instead invalidate a certain cache line by
sending a back-invalidation message to the host,
and therefore do not need to store this bias table anymore.
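Sketched as code, the same local access with back-invalidation instead of a full bias table could look like this; the message names (back-invalidate snoop and response) are paraphrased from the lecture, and the classes are purely illustrative.

class Host:
    def __init__(self):
        self.cached_blocks = {0x3000}

    def back_invalidate(self, block):
        self.cached_blocks.discard(block)       # host invalidates its copy
        return "back-invalidate response"

class Type2DeviceCxl3:
    def read_own_memory(self, block, host):
        # Ask the host to drop any cached copies of this block instead of
        # consulting a per-block bias table.
        host.back_invalidate(block)             # subordinate-to-master back-invalidate
        # Once the back-invalidate response arrives, the device is free to
        # read the block from its local memory controller.
        return f"local read of block {hex(block)}"

host = Host()
dev = Type2DeviceCxl3()
print(dev.read_own_memory(0x3000, host))
print(host.cached_blocks)  # the host no longer caches the block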
And another use case is here for a coherent shared memory
access across multiple independent hosts.
Assume that you have two hosts or multiple hosts that
can access the same memory region.
And if one host accesses the data
while a second host has a copy of that cache line,
there would not be a solution without back-invalidation.
Or in other words, the CXL consortium
implemented that solution to realize cache coherent memory
accesses across multiple independent hosts.
And here's one figure also from documentation
of the CXL consortium in which you
can see potential large fabric topology that we
can achieve with CXL 3.0.
So in this case, we could have, for example,
different end devices that you can see in the bottom area
here, which can be connected to
different switches. These switches that are connected to end devices are called leaf switches,
and then inside of this large CXL topology, we can have the leaf switches connected with
spine switches, so switches that are only connected to other CXL switches. And I also talked about
tree-based topologies and non-tree-based topologies. As you can see here, there are
definitely multiple paths from one end node to another. So marked in red is one potential path, but you could also go from this CPU over the top mid switch
and then back from the top mid switch to the bottom mid switch and then to this
global fabric attached memory, which is just as a reminder, a Type 3 device that can be accessed by multiple thousands of other devices.
Okay, now you have some insights and an overview of the CXL protocol, its different generations, and some features.
But what is already available, what can we work with, or what has been announced? Intel released the Intel Sapphire Rapids CPUs this year, and they fully support CXL 1.1. Intel also provides the Intel Agilex 7 FPGAs, which are CXL 1.1 compatible.
And they also announced Agilex 7 M-Series FPGAs with CXL 2.0 support.
On the right side, you can see the physical layout of the Intel Sapphire Rapids multi-die architecture.
So you have four dies here.
So this is one CPU subdivided into four dies that are connected.
And per die, you can see here in orange the PCIe root complex.
And each of these orange boxes is one 16-lane connection.
So you have actually 32 lanes per die and 128 in total, which is quite a lot.
So you could imagine attaching to all of these connections
a CXL device, potentially for memory or bandwidth extension.
We are limited here,
so we cannot use any switch topologies,
since Sapphire Rapids only supports CXL 1.1.
But still, there's a lot of potential
if we could attach multiple CXL devices here.
And on the bottom, it's mainly a black box, but this is the
memory extension board that was announced by Samsung.
They already have two generations, so this is now the latest one.
They announced the development of a 128 gigabyte DRAM
that supports CXL 2.0.
It uses PCIe 5.
It's an x8 device,
meaning the PCIe link has eight lanes,
and they announced a bandwidth of up to 35 GB per second.
And AMD also released CPUs that are CXL 1.1 compliant, the new AMD Genoa CPUs.
And they also announced a smart network interface card
that supports CXL 2.0.
In the last minutes I want to briefly talk about the performance. There
was already a question about how long it takes to access or to perform a write, for example. And here are some memory access or some latency estimations
that are provided in the introduction to Compute
Express Link by the CXL Consortium.
So if you are more interested, feel free to check the document.
But yeah, from the CPU to a Type 3 device, if we assume it's a
single logical device with DDR memory on the device, they
estimate that accessing the memory would take 170 nanoseconds, and they estimate that it also takes about 170 nanoseconds when we access pooled or shared memory from the CPU on a directly attached multi-logical Type 3 device.
And also if we would communicate from a device to the host's memory with CXL.cache.
And in a different topology with a type 3 device connected
via a switch, so we have one level of switching here,
the estimation is 250 nanoseconds.
And when we message a peer CPU or a peer device
through a CXL switch, it would take 220 nanoseconds.
Or if we send a message to a peer CPU or device
through two levels of switches, through two switches,
then the estimation would be 270 nanoseconds.
Okay, here is one example experimental evaluation that is already publicly available.
So there are not that many evaluations yet, and this one was published this year. It's a work from Intel and UIUC, and in their work
they use an Intel Agilex FPGA, as you can see here at the bottom. Here is some setup information: it is CXL 1.1 compatible, and it is attached via PCIe Gen 5 with a 16-lane connection.
And they use a single DIMM with 16 GB of DDR4 memory.
And in their dual socket system they have Intel Xeon CPUs.
It's basically Intel Sapphire Rapids.
I'm not sure if it's pre-release or...
I have to double-check if it's a pre-release version
or already the release version.
But in their work, they performed some micro
benchmarks and as you can see in the middle for example they wanted to measure how long it takes
if you read or write a cache line. So the LD is the load instruction here.
And then we have other accesses, which are a store with write-back and a non-temporal
store, meaning bypassing the caches, and they performed all of these instructions as AVX-512
instructions. Without going into more detail here, let's look at the load instruction, for example.
So in green you can see the time, the latency that it requires
to access the memory on the local node, on the local socket.
In purple is the access to a remote NUMA memory region. And in orange, you can see the latencies
that it took to access the memory of the CXL device
or this FPGA board in this case.
And for the load instruction, what is surprising here
is that for the local memory we are already
at about 250
nanoseconds.
But compared to the remote NUMA node, the CXL memory latency does not require much more
time.
So it's comparable with the load from a remote NUMA node. And yeah, as you can also see, also for the other instructions,
the access to the CXL FPGA performs quite well
with lower latencies compared to the remote NUMA node.
And yeah, there were also some further measurements regarding pointer chasing.
What they also did here, they have certain...
So they first flushed the cache lines and then used certain working sets, meaning they read
a certain set of data and brought it into the cache, which was their warm-up.
And then they performed pointer chasing operations.
And what they measured was the latency
they achieved when they increased the working set size
here on the x-axis.
And you can observe these jumps.
And these are actually the jumps that you see when you exceed the capacity of a cache level.
So, for example, here at around 64 kilobytes, you see a slight increase of the latency, which would mean that we exceed the level 1 cache capacity
here, and then at some point we exceed the level 2 cache
capacity, and then also the last-level cache capacity.
Interesting to see is that, again, the green line here is
the access to the local memory, and the orange is the access to the CXL device.
Accessing the CXL device with pointer chasing results in latencies very similar,
I would almost say equal to the latencies that we can observe with local memory.
But as soon as the data does not fit into our caches anymore, the latency significantly
increases, since we then have to fetch the data from the memory controller of the attached
CXL device or the FPGA in this case.
And at the bottom, you can see bandwidth measurements with sequential access.
On the left side they used local memory with eight DIMMs.
On the very right side, they used remote memory of a remote socket with only one channel.
And in the middle, they have the CXL memory device.
An interesting observation here was for the green line, which shows load instructions: if you increase the number of threads,
they achieved a maximum bandwidth
of about 20 gigabytes per second with eight threads.
And after that, the bandwidth decreased
if they used more threads.
Also an interesting observation was that already with two threads, they measured the
maximum bandwidth with non-temporal stores.
And with four, six, and more threads, the bandwidth dropped significantly compared to using only two threads.
To summarize, CXL tries to tackle the different interconnect
issues, as mentioned before.
It wants to achieve cache-coherent accesses
from PCIe or CXL devices to host memory and vice versa. It wants to tackle the scalability
limitation by allowing memory capacity and bandwidth expansion, and also allowing access to
multiple or different resources.
This is also relevant for the resource stranding aspect.
With CXL 2 and 3, it provides memory pooling and memory sharing.
And especially the memory sharing feature is also relevant for the fourth challenge here with the distributed data sharing.
And you can see this summarized here in the table,
with the first generation, whose specification is
available since 2019.
We talked about the hardware scope.
So it's on a single machine level
where we have to attach the device directly to the server.
It has the speed of PCIe 5, which is 32 gigatransfers per second.
And the bandwidth is calculated based on the number of lanes
that the device supports.
And this addresses the coherence challenge,
meaning we want to access the host's memory
from our accelerator and vice versa.
And it allows bandwidth and capacity
expansion. CXL 2.0 has been available since 2020; available means the specification is available.
Here we are in a larger scope; we could say we are on the rack level, translated to data center scale. We still have PCIe 5 with 32 gigatransfers per second, and it mainly addresses the challenge
of resource pooling.
And with CXL 3.0, which was published last year, we are in a much larger scale, so theoretically we can support up to 4096 end devices.
But the consortium claims or mentions that it's more realistic that we have hundreds of machines
connected with multiple switches. Here we can double the speed, since this requires PCIe 6, and this allows larger-scale
resource pooling. So it increases the scalability of the resource pooling that
we achieved with the second generation and also allows sharing. Therefore, it
tackles the challenges 3 and 4. Here you can see a last overview of the different features with different generations.
We did not cover all of them in detail,
but we at least touched on some of them.
So you know about the different device types, type 1, type 2, type 3 devices.
We talked about the differences between single and multiple logical devices.
We briefly talked about single-level switching and multi-level switching
and the topologies that are possible with such switches.
With generation 3,
peer-to-peer communication between CXL devices is possible.
We have enhanced coherency.
The back-invalidation plays an important role here,
first for the enhanced coherence. This is relevant for the Type 2 device
memory capacity that I mentioned, so that we do not need, for example, this bias table anymore, but
can back-invalidate other caches. We have memory sharing capabilities with CXL 3.0. And, as we mentioned, with the third generation
the number of Type 1 and Type 2 devices, which was limited to one with CXL 1 and 2, is increased to up to 16 devices.
And we have large fabric capabilities.
And we also saw an example figure of that.
All right.
With this, this semester's lecture series ends.
Tomorrow, as I mentioned, we will do the data center tour.
And if you like the lecture, if you like our research group,
and if you're also interested in student assistant positions,
just drop by and talk to us,
or attend our next courses.
So next semester, especially relevant for hardware
conscious data processing, we will
also do a seminar with hands-on work in groups related to HDP topics.
And the topics are not finalized yet, but we will probably provide an FPGA topic and
also a topic related to NUMA-aware data processing.
Besides that, we also have other courses, for example, the Big Data Systems lecture.
This next winter term, we will for the first time have a Big Data Labs course in which
you can work hands-on on big data processing concepts and also learn how to deploy, for example, large clusters and work with them and process data with such approaches.
We offer a master project with the topic benchmarking real-time analytics systems.
And if you're interested in writing your master thesis with us, then look at our research topics. So we mainly work on database systems and modern hardware, stream processing, machine learning
systems, and benchmarking.
Thanks for your attention, for attending this course,
and hope to see you next semester in the HDP seminar.
Thanks.