Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Compute Express Link

Episode Date: July 25, 2023

...

Transcript
Starting point is 00:00:00 All right, yeah, I think we can start. Good morning, everybody. Welcome to the last session of this semester's Hardware-Conscious Data Processing lecture. Today, I'm going to talk about Compute Express Link, a new hardware interconnect. In the last months, we talked about different hardware aspects. We started with the CPU basics, CPU internals, different instructions, SIMD instructions. We moved on with memory technologies. We also talked about persistent memory and then moved over to peripherals connected via PCIe, for example, network interface cards or accelerators such as GPUs and FPGAs, and also disks.
Starting point is 00:00:41 And today we want to talk about a different hardware interconnect which is not that far away from PCIe. It's actually PCIe-based, and it's called Compute Express Link. Today is the last lecture of this semester's Hardware-Conscious Data Processing lecture series, and tomorrow we will have a data center tour. All the details can be found in the Moodle. Many of you already voted on whether you will attend or not. I personally will not be there, but Laurence will pick you up. The location is also mentioned in Moodle. It's at the south entrance of this building.
Starting point is 00:01:20 Please be five minutes early, so at 10:55, so that you can start at 11 a.m. Okay. In this lecture, we first want to talk about limitations of today's hardware interconnects. After that, I want to give you a Compute Express Link overview. One important aspect of this interconnect is cache coherence, which is why I want to talk about this in a little bit more detail. The interconnect specification defines different protocols. Among them are the cache and the memory protocol that I will talk about.
Starting point is 00:01:56 Also, CXL has different generations, and there are certain important enhancements in the latest generation, which is the third generation. And then at the end, I want to talk about some performance estimations and some experimental measurements that have already been done. The lecture is mainly based on different documents from the CXL Consortium.
Starting point is 00:02:22 If you are interested in the topic, feel free to check them out. There are a lot more details that I cannot cover in this lecture. Okay, let's start with the limitations of today's interconnects. First we want to look at PCIe. With PCIe, we do not have cache-coherent accesses. So if you want to access the host's memory from a PCIe device, you have to do it with non-coherent reads and writes. A PCIe device cannot cache the system's memory to exploit, for example, temporal and spatial locality. Also, the other way around, if you want to access the PCIe device's memory from a host CPU, this has to happen non-cache-coherently, and we also cannot map the PCIe device's memory into our CPU's global memory
Starting point is 00:03:15 space. If we look at accelerators for example FPGAs we also have the issue that data structures have to be moved from the host's main memory to actually the accelerator's memory. Then at the accelerator we can process the data before we then have to move the data back to main memory. And therefore multiple devices cannot access parts of the same data structure simultaneously with the CPU together without moving the data structure back and forth. Another limitation of today's interconnects is memory scaling. Usually memory is attached via
Starting point is 00:03:55 the DDR interface, so you are probably familiar with the DDR DIMMs, and the demand for memory capacity and also bandwidth increases with the growth of compute resources and DDR memory actually lacks matching this demand. So we are here limited in both memory capacity and also bandwidth per CPU. And if we would compare the DDR interconnect with the PCIe, for example, we can also observe that for each pin, DDR is not as efficient as PCIe, for example. Here on the right side, you can see PCIe generation 5 connector with 16 lanes. This actually allows us a transfer rate of 32 giga transfers per second which is translated about 63 gigabytes per second and this can be achieved with 82 pins and if we would look at a modern DDR5 DIMM we would achieve if we have a device with 6,400 megatrends per second, which is already quite a lot,
Starting point is 00:05:10 we would achieve about 50 gigabytes per second with 280 ADP pins. So per pin, we can actually achieve a higher bandwidth with PCIe pins. And another limitation related to DR is that we only have a limited number of DDR slots. So if you want to scale up with more DDR DIMMs, then of course our mainboard also needs to provide more channels, which also would introduce higher costs and also would make it more challenging in regard to signal integrity challenges. Another aspect regarding to memory scalability is that with PCIe we can
Starting point is 00:05:53 move the memory further away from the CPUs. So there exists something like retimers, so you can actually see the retimers here in the middle, which is a physical layer, protocol aware, software transparent extension device. And it actually retransmits a fresh copy of the signal. So it does not only amplify the signal, but it resends it. And such extension devices are necessary when the electrical path between a root complex and an endpoint
Starting point is 00:06:27 is longer than the specification allows. So in this example on the left, you have a main board. And you can also attach something like a riser card. This allows you to move the PCI slot further away. And then at this extension, you can theoretically plug this retimer in and attach, for example, different PCIe devices, for example, in SSD array. And therefore, you have a much larger physical space
Starting point is 00:06:57 that you can use to attach different devices. Also, if we look at DDR D-RAM DIMMs, you already saw this figure in the storage lecture. Since 2013-2014 the growth in capacity per dollar has stagnated, so it stays flat. And therefore, this is also a limiting factor in this aspect. Another issue with today's interconnects is stranded resources. So a stranded resource is when you have some capacity of a certain resource type, let's say memory, for example. And when this remains unused or idle while another type of resource is fully used. So the cause here is the tight coupling of different resource
Starting point is 00:07:48 types on one main board. And this actually results in servers over-provisioning different resource types to then handle workloads with peak capacity demands. And an example here could be a very memory-intensive application that uses all available memory, but it's not very compute intensive, so we do not fully utilize our CPUs. And vice versa, you can also have servers with very compute intensive application when the CPUs are fully utilized, but the memory might not be fully used. And these stranding resources actually have a negative power and cost impact.
Starting point is 00:08:28 And this could also be observed in data centers of, for example, Microsoft, Google, Alibaba, but also AWS. Another aspect is data sharing. So if you imagine a distributed system, then the different components or the different nodes often rely on fine-grained synchronization. We often have small updates that are latency-sensitive because we might have some data flows
Starting point is 00:08:58 where we need to wait until we received a certain answer from a peer node. And an example here could be distributed databases in which We have to communicate different kilobyte size pages, for example. Or distributed consensus. Consensus here means that we Need to communicate some data to agree on what transaction Should be committed and also in which order. And compared to page sizes, such distributed consensus can be even smaller.
Starting point is 00:09:31 And therefore, with this small data chunks, communication delay in typical data center networks dominates the wait time for updates, and which can actually slow down these use cases. So we can summarize these four challenges as follows. We have the missing coherent access challenge. We have scalability limitations, as mentioned. We can observe in today's data centers
Starting point is 00:10:00 that resources are not fully utilized, and we have resource stranding. And another issue is that there is the need for actually fast cache coherent data sharing. And these challenges is actually what CXL tries to solve. And CXL is a PCIe-based open standard interconnect between processors and devices. Such devices can be accelerators, memory buffers, network interface cards, or ASICs. And CXL offers, on the one hand, coherency and memory semantics with a bandwidth that
Starting point is 00:10:42 scales with the PCIe bandwidth. And just as a reference with PCIe 4, for example, with 16 lanes, we can achieve up to 32 gigabytes per second. And with PCIe 5, we can achieve up to approximately 64 gigabytes per second per direction if we have a 16-lane connection. CXL defines new protocols using the PCI physical layer. So we use already existing hardware here and with PCIe generation 5,
Starting point is 00:11:21 it allows alternative protocols. So you can still use the physical layer of PCIe, but communicate with a different protocol. And CXL specifies specifically a cache and memory protocol here to optimize such data flows. We have three major generations, 1, 2, and 3. There was a minor change from 1.0 to 1.1, but this only included additional compliance testing
Starting point is 00:11:55 mechanisms. Generation 1 and 2 are both PCIe 5 based, and generation 3 requires PCIe 6. Therefore, it can still take some time until we actually see CXL 3 in the industry. The three different specified protocols are CXL.IO,.cache and.memory which I will talk about in a minute and the development of the Compute Express link is driven by a CXL consortium which has grown to about 250 companies since 2019. Talking about the
Starting point is 00:12:33 different protocols CXL.io is based on the PCIe protocol and it is used for for example this device discovery, status reporting, virtual to physical address translation, and direct memory access. And it uses non-coherent load and store semantics of PCIe. The more interesting protocols are cache and the memory protocol. The cache protocol is used by a device to cache the host memory. So if you have a look at the right figure here, let's assume a host and a host can be a single or multi-socket system and a certain CXL device. For simplification we just assume that both have a host and a memory and with the CXL.cache protocol it is now possible that the CXL device can cache the host's memory and
Starting point is 00:13:30 access this memory in a cache-curing way and actually the memory protocol does the same in the other direction. It enables CPUs and also other CXL devices to access device memory as cacheable memory. And with that, it also enables a uniform view for the CPUs across device memory and also the host memory. So devices memory is also in the CPU's unified global memory address space.
Starting point is 00:14:02 CXL.io is mandatory for all devices, and the cache and memory protocols are optional, which actually brings us to the different CXL device types. Depending on what protocol the device supports, it can be classified as a Type 1 device, Type 2 device or Type 3 device. Starting with Type 1 device, this implements, besides the mandatory IAO protocol, the cache protocol. And an example use case here are smart network interface cards
Starting point is 00:14:33 that use coherency semantics along with direct memory access transfers. The second device type implements all three protocols. And use cases here can be accelerators, example, GPUs, FPGAs, with local memory that is partially used by the host as coherent device memory. Or another use case is that devices cache the host memory for processing, because for some data processing
Starting point is 00:15:03 on the accelerator, it wants to access data that is in the host memory that is located on the DRAM DIMMs of the host. And the last type here is the type 3 device. This only implements.io and the memory protocol. And the use case here is simply memory bandwidth and capacity expansion if we look at the third device type then such a memory extension device can be used as a as a cost power and pin efficient alternative to adding more ddr channels to service cpus
Starting point is 00:15:45 and it also as we could see with the retimer example in the beginning, it offers flexibility in system topologies since we can have longer trace lengths. With the third generation of CXL, there is one specialized Type 3 device. It's called a Global Fabric Attached Memory. And this is very similar to a Type 3 device. It's called a Global Fabric Attached Memory. And this is very similar to a Type 3 device, but it actually allows to connect up to 4,905 nodes
Starting point is 00:16:16 using an advanced routing protocol that I will not cover in this lecture. OK. When we talk about, so one of the features, as I mentioned, is that we, from a host CPU, can access the device's memory. And this kind of, or this memory that is now accessed and cacheable by the CPU is called host managed device memory.
Starting point is 00:16:40 And there are three different use cases for that, also with different requirements for the memory protocol. The first use case is host memory expansion. In this case, we have a Type 3 device, as shown before, and this only requires host-only coherency. The second use case here is an accelerator with memory exposed to the host. And in this case, it has to be device coherent since the accelerator can also access the host's memory. And then there is another special use case that is introduced with the third generation, which is device memory exposed with also device coherent memory, so similar to the second use case,
Starting point is 00:17:28 but now it uses back-in validation. This is an extension of the memory protocol, which actually allows type 2 and type 3 device to back-in validate caches, but I will talk about this later on the the key aspect here is that this memory exposed to um to the um to the host in the cache queue in a way is called host managed device memory and we have different use cases for that and if we have such type of memory within a type 2 or type 3 device and it can be accessible to multiple hosts then this is also called fabric attached memory which we all will also talk about later on okay so now we know the different three base device types and how we can call the memory
Starting point is 00:18:21 that is exposed to other CPUs or devices. And what we also have to take into account is we have to differentiate between logical and physical devices. For all the type 1, 2, and 3 devices, we can have single logical devices, meaning that all the resources of that device can only be assigned to one potential host. In this example, we see a Type 3 device only with memory, but this can also be an accelerator, a smart NIC, or any other Type 1 or 2 device. The other type of device here is called multiple logical devices but it's still one
Starting point is 00:19:07 physical device but it partitions the resources and each individual partition can be assigned to a different host but this is only valid for type 3 devices meaning memory only devices and therefore you can have up to 16 logical devices that you can each assign to different hosts so maximum three or maximum three partitions assigned to three different hosts so this is already like a first way of um of realizing some kind of memory pooling. So you remember the inefficiency because of resource coupling. And with this approach, you can already have a device with different memory regions assigned to different hosts. But this can be even improved with such topologies that you can see here.
Starting point is 00:20:03 So this is something that is introduced with the second generation of this interconnect, which actually allows single level switching. So if you see on the very left, you can use single logical devices, connect them to a switch and also multiple hosts. And then you have a central standardized fabric manager, which can be used to assign different of these single logical devices to different hosts. And in the middle, you can see a slightly different topology in which you can also use multiple logical devices. So you can also mix them so it doesn't have to be either MLDs or SLDs. And also in this case, you have a standardized fabric manager, which is responsible for assigning
Starting point is 00:20:51 different logical devices to hosts. If we use a switch, of course, from a latency perspective, we also have additional overheads and you can reduce the overhead if you do not use that switch for example in the very right topology you can also use multiple logical devices and use direct connections between your hosts and your devices but then you do not have such a dynamic flexibility as with the switch in which the fabric manager can dynamically reassign the different devices to different hosts. If we continue or increase the generation number again and we look at CXL 3.0, then
Starting point is 00:21:42 we also could realize resource sharing, or more specifically, memory sharing. In this topology, we can still assign individual logical devices to certain hosts, but it's also possible to assign memory regions to multiple hosts, which can then cache coherently access this memory range. For example, in this case here, the blue memory region of the logical device can be assigned to hosts 1 and 2,
Starting point is 00:22:16 while the light green, for example, can be assigned to 3, 4 and 5. And hosts 3, 4 and 5 can cache coherently read and write to the data that is on that memory region. From hardware scope we also have different granularities. With CXL1 and 1.1 we are more in the scope of a single server. This is because switches are not supported. So you don't have single or multilevel switches. And you have to directly attach your PCIe CXL device to the server. With CXL 2.0, we have single level switches. This generation actually requires tree topologies,
Starting point is 00:23:02 therefore also not allowing multilevel switches. But here, we are more in a rack-level scope. And with CXL 3.0, as we mentioned before, we can have up to 4,096 endpoints connected. It supports multilevel switches. So therefore, we are here in the scope of multiple racks in a data center. All right. With that, I want to move on with cache coherence,
Starting point is 00:23:28 which is one of the key or one of the important features of CXL. Cache coherence protocols usually tracks the state of any copy of a certain data block of my physical memory. And snooping is one approach to implement that tracking. And every cache with a copy of the data from this block or cache line of physical memory
Starting point is 00:23:58 can then track the sharing status of the block that it has. And caches are typically all accessible via some broadcast medium, which can be, for example, a bus that connects the different level caches, so the per-core level 1 caches with a shared cache or memory. And all the caches have cache controllers that monitor or also called snoop the medium, the bus, to determine whether they have a copy of a certain block that is requested on the bus. Write invalidation or write invalid protocol is the most common approach to actually maintain the coherence. And the approach with this kind of coherence protocols
Starting point is 00:24:47 is that they ensure that the processor has exclusive access to a data copy before writing to that copy. And all other copies are invalidated. So no other readable or writable copy of cache line, for example, exists when the writer cores. And yeah, usually caches or such kind of snooping coherence protocol usually introduces a finite state controller per core.
Starting point is 00:25:20 And this is actually responsible for responding to requests from the cores processor and also from the bus and then changes the state of the individual cash line copies also uses the bus to access or invalidates the data an example of such a protocol is the messy cash coherence protocol which we also briefly touched in one of the lectures before. It is a write-back cache coherence protocol used for snooping on the bus or on the broadcast medium. And it is actually named after the initials of the different states. So we have the state modified, exclusive, shared, and invalid for each cache line. The state invalid here means that the line does not contain valid data.
Starting point is 00:26:08 Shared means that multiple caches have copies of this cache line and the data is also valid, meaning it's up to date. It can be an exclusive shared state, meaning no other caches has a copy of especially this cache line. The data is still valid. And a cache line can be in modified state, meaning the data in this line is valid. But the copy and main memory is invalid. And no other cache has a copy of that cache line.
Starting point is 00:26:47 So the cache that has this cache line in modified state would be the owner of this cache line at this time. After we boot system, all the caches have their cache lines marked as invalid. And here we briefly go through an example with three CPUs and memory. And the CPUs and memory are connected via a bus. And first, the CPU1 reads block A, and it fetches the data. And since no other cache has this cache line currently
Starting point is 00:27:22 in its cache, the cache line is in exclusive state. If now CPU2 reads the block, cache, CPU1 gets notified or observes that there is now another cache that holds this cache line. Therefore, it changes the state from exclusive to shared. So both caches of CPU1 and 2 have this cache line now in shared state. If CPU 2 wants to write this cache line, then
Starting point is 00:27:53 also other peer caches, or in this case, the cache of CPU 1 gets notified that it has to invalidate the cache line. And then CPU 2 has this cache line in exclusive state and can then modify its copy if now cpu3 here on the top on the right side wants to read that block this so before it wants to read it the cpu2 has the cache line in modified state. So therefore, when CPU 3 notifies that it wants to read the data, CPU 2 notifies that CPU 3 first has to wait, because CPU 2 needs to write back
Starting point is 00:28:34 the data to the memory. And after the data is written back, CPU 3 also gets notified about the data being written back and can then fetch the data from memory and CPU 2 and CPU 3 then have the same cache line in shared state. And then similar to the write before if CPU 2 wants to write on this cache line again then the cache line has to be invalidated at CPU3. Therefore, CPU2 has this cache line in an exclusive state and can then write to the cache line so that it's then modified.
Starting point is 00:29:18 Talking about the CXL does this in hardware. And as a cache line granularity similar to existing cache coherency hierarchies. It uses 64-byte cache lines, and it actually uses existing services from the PCIe for address translation. So to translate virtual addresses, it uses address translation service.
Starting point is 00:30:03 This is specified in the PCIe specification. CXL cache coherency or the use cache coherent protocol is ASOMATRIC, meaning data flowsency, which significantly simplifies implementing coherency in devices. A host scales to multiple CPU sockets or caches ass in a multi-circuit system, you have ultra-path interconnects, or the previous generation was quick-path interconnect, or with AMD CPUs, you have Infinity Fabric here. And these specific protocols define an internal home agent to resolve coherency between the different host caches. And these host agents are also Now responsible to incorporate cxl cache devices and enforce a Simple messy coherence protocol. And if a cpu wants to support cxl cache devices, then it is Expected to size their tracking data structures
Starting point is 00:31:28 accordingly so that they can also track cache lines or the coherent state of cache lines of connected CXL gets involved into that. So, usually we have multiple levels of coherent caches. We have our small level one caches with small capacity with the lowest latency and also with the highest bandwidth. Then we have level 2 caches, they have larger capacity, might be shared between multiple cores but they also have a higher latency and lower bandwidth compared to L1 caches and then in our level 3 caches
Starting point is 00:32:19 also called last level caches we have the highest capacity but also still a lower bandwidth higher latency and they're shared between many CPU cores and CXL now allows devices to directly engage in the cache hierarchy of CPU below the last-level cache so as we can see here the CXL.cache block. So if we assume two devices, two devices that support the CXL.cache protocol, then in the CPU's cache hierarchy, the caches of these CPUs are sitting as a peer to the cores within the CPU socket and the expected cache size of these devices is one megabyte or smaller. Above our last level cache then we have our home agent potentially connected with other CPU sockets via the vendor-specific CPU to CPU link, so UPI or Infinity Fabric in the case of Intel or AMD.
Starting point is 00:33:32 And the home agent resolves the conflicts of last level caches trying to cache the same address. And on the top here, we can see then different kind of memory. First, data that is accessible via DDR channels, or we can also have CXL memory devices attached. So therefore, we could access this memory using the CXL PCIe interconnect and then use the memory controller
Starting point is 00:34:03 that is located on the memory device. Okay, now I want to talk a little bit about the more details of the CXL cache protocol. This protocol, as I mentioned before, enables a device to cache host memory using the MESI protocol. We use 64 byte cache lines. And it specifies or uses 15 different request types for cacheable reads and writes from the device to the host. Here, I mentioned it's an asymmetric protocol. So we keep the protocol simple here at the device side,
Starting point is 00:34:50 meaning that the host is responsible for tracking the coherence of the peer caches. And the device also never directly interacts with any peer cache. So here, also, the host is responsible if there is some interaction required and the device only manages its own cache and sends requests to the host. The CXL.cache protocol uses two communication directions. One is from the device to the host, also called D to H,
Starting point is 00:35:25 or host to device, H to D. And per direction, it uses three different communication channel, a request, a response, and a data channel, as you can see on the right. In the direction from the device to the host, the requests are requests to get cacheable access through reading or writing memory. And in the other direction, if a host has to perform some requests, then these are
Starting point is 00:36:00 mainly snoop messages for updating cache states, for example, invalidating certain cache lines that are currently stored in the device's cache. Very briefly, the 15 different requests from the device to host can be classified in four different classes. So we first have reads, meaning with such requests, we request the coherent state and data for a cache line. And the response then from the host is, as expected, the coherent state with the data.
Starting point is 00:36:43 Then we have read zero requests. In this case, the device only requests a coherent state with the data. Then we have read zero requests. In this case, the device only requests a coherent state, but without the need to get the data. And this is, for example, used to upgrade existing cache copies and update the state, for example, from shared to exclusive, or also bring in the exclusive shared if if data needs to be written the write class is used to evict data from the device cache this is for example used for dirty data but also for clean data and
Starting point is 00:37:32 yeah with with such a write request if the device requests, then the host will indicate a write pull. This means that the data or that the host is ready to receive the data from the cache device. cache device and before it gets this response it also sends a globally observed message to the device which indicates that the host successfully resolved any required cache coherence communication so it might be the case that if if a device wants to get the cache line in exclusive state and other caches have this copy of the cache line that the host needs to invalidate the other peer caches first and if this has happened successfully then this globally observed message is sent to the cache device. And then the last class is read0write also called streaming writes. In this case the CXL cache device writes data to the host directly without
Starting point is 00:38:33 having any coherent state prior to issue and similar to the write we also get write pull messages and globally observed signals from the host to the cache device. Before looking at some example data flows, I want to introduce you to some of the terminology that is relevant here. So on the right here the left device with this marked with a red frame. So this is our device from which we want to communicate to the host via the CXL.cache protocol and from this perspective we have a certain home agent and the home agents location depends actually on the address that is being read.
Starting point is 00:39:25 So it can be on our local socket, but it can also be the case that our home agent is located on a remote peer socket. Then our peer cache or peer caches in general are all the caches that we have on peer CXL devices with a cache, or also CPU caches that are in our local or remote socket. And a memory controller is either a native DDR memory controller, or it can also this can be on our local or also on the remote socket. Our memory controller can also be located on a CXL memory device, of a CXL memory peer device. So also here, our target memory controller that might be involved in our cache protocol communication depends on the address that we want to access.
Starting point is 00:40:30 OK, knowing this terminology, let's have a look at the read flow example of the cache protocol. Here in this case, the CXL device wants to read a certain memory address from the host's memory, and the cache line is currently in valid state. Now it has to send a read shared request to the home agent. Note that one other peer cache has especially this copy of the cache line in exclusive state and therefore because of snooping messages the cxl device is aware of the fact that there's already an exclusive copy and it also does not need the data in exclusive states, so therefore it just requests the data in shared state.
Starting point is 00:41:25 The home agent then, after receiving this request, requests from the peer caches to update the state of that cache line, because the one cache that has this cache line is not the only one anymore, so therefore the cache state has to cache line is not the only one anymore so therefore the cache that has to be updated from exclusive to shared and the corresponding cache will then also notify the home agent that it successfully changed the state of that cache line copy and in parallel the home agent also sends a memory read request to the memory controller, which responds with the data. And when the home agent received the acknowledgement of the peer cache
Starting point is 00:42:11 that the current state was updated and also received the data from the memory controller, it can then send first a signal that the shared state of this cache line was now globally observed. And with that message, the CXL device can then update its own state of the cache line copy and also the data is sent from the home agent to the CXL device. WriteFlow looks a little bit different. So in this case, we have again our CXL device with a cache line
Starting point is 00:42:47 in invalid state. But now we actually need the cache line in exclusive state, since we want to modify it. Therefore, we request this cache line in exclusive state, and note that a peer cache, or multiple peer caches, can have a copy of that cache line in shared state. When the home agent receives this request,
Starting point is 00:43:15 it sends snoop-invalidate signals to the peer caches, which then invalidate their cache line state and report the change back to the home agent. In parallel, the home agent again reads the memory. After receiving the snoop responses and the data, the home agent sends a message to the CXL device notifying it that the exclusive state of that cache line copy is globally observed, and the data is sent to the CXL device. Then the CXL device, with the copy in exclusive state, can perform a silent write. Silent, because no other cache has a copy of this line, and therefore we do not need to notify any cache to change a coherence state.
Starting point is 00:44:09 And at some point, when the CXL device's cache wants to evict this cache line, we need to send a dirty-evict signal to the home agent, which might then have to resolve some cache states, but not in this case, since the line was in modified state. It sends a globally-observed write pull, and once that signal is received on the CXL device side, the cache line of the CXL device can be updated to invalid, and the data is sent to the home agent, which then writes the data to the memory controller and acknowledges this with a completion message.
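To make the flow concrete, here is a toy sketch of the read-shared request described above. All class and message names (HomeAgent, PeerCache, "GO-S", "RspS") are illustrative stand-ins, not actual CXL.cache opcodes; only the control flow mirrors the protocol:

```python
# Toy model of the CXL.cache read-shared flow described above.
# All names and message strings are illustrative, not from the spec.

class PeerCache:
    def __init__(self):
        self.state = {}  # address -> "I" | "S" | "E" | "M"

    def snoop_to_shared(self, addr):
        # The home agent asks us to downgrade Exclusive to Shared.
        if self.state.get(addr) in ("E", "M"):
            self.state[addr] = "S"
        return "RspS"  # acknowledge the state change

class HomeAgent:
    def __init__(self, peers, memory):
        self.peers = peers
        self.memory = memory

    def read_shared(self, addr):
        # 1. Snoop all peer caches so nobody keeps an exclusive copy.
        for peer in self.peers:
            peer.snoop_to_shared(addr)
        # 2. In parallel in hardware (sequential here), read the memory.
        data = self.memory[addr]
        # 3. Signal that the shared state is now globally observed.
        return "GO-S", data

peer = PeerCache()
peer.state[0x40] = "E"                 # a peer holds the line exclusive
home = HomeAgent([peer], {0x40: b"hello"})
go, data = home.read_shared(0x40)      # the device issues its read-shared
requester_state = "S" if go == "GO-S" else "I"
```

The write flow would differ only in that the snoop invalidates the peer copies entirely and the requester ends up in exclusive state.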
Starting point is 00:44:55 Yeah, sure. What kind of time frame are we talking about for a whole request? Yes. You mean the entire round trip from the CXL device to actually getting the completion response from the memory controller? So most of the numbers that you can find are estimations, because there are not that many CXL devices commercially available yet. But in the estimations from the CXL consortium, if we have a type 3 device directly attached to our server and we access its memory, then based on some CXL consortium documents, we are in a range of
Starting point is 00:45:57 less than 200 nanoseconds. Okay, then let us continue with the memory protocol. As mentioned before, this is used to enable simple reads and writes from the host to memory. The protocol is designed to be independent of the memory media type, so behind this protocol there can actually be high-bandwidth memory, DDR memory attached to the CXL device, or persistent memory. And with CXL 2.0, CXL also supports mechanisms to manage persistence, which I will not go into in more detail.
Starting point is 00:46:42 The protocol also uses two additional bits of metadata per cache line. These bits are optional for type 3 devices, meaning memory-only devices, and the host can define the usage of these two bits, for example for some security attributes or for compression attributes. But these bits are mandatory for type 2 devices. In these bits, the host encodes the required cache state, so with these bits the protocol exposes the host's coherence state to the device. This allows the device to know what state the host is caching for each address in the host-managed device memory region.
Starting point is 00:47:38 The protocol has two communication directions. One direction is from the master to a subordinate; a master can here be a host or a switch. Subordinate to master is then the other direction. Each direction has two channels. In the direction from M to S, master to subordinate, we have requests and requests with data.
Starting point is 00:48:04 And in the other direction, we have non-data responses and data responses. Note that this is valid for the first two generations, CXL 1.1 and 2.0. With CXL 3.0, there is another channel per direction: from the subordinate to the master, so from our device to a host or switch, this is a back-invalidate, and in the other direction this is a back-invalidate response. This is actually an important feature of CXL 3, because it allows larger topologies and also cache-coherent access between different CXL devices, which I will talk about later on. Also here, we want to look at a particular memory protocol flow. Let's look at writes first. On the very left, we have our host, who
Starting point is 00:49:07 wants to write data. For this, the host sends a memory write request to our CXL device. We assume a type 3 device here, meaning a memory-only device, which has a memory controller and its memory media. After the memory controller has received the memory write request, it writes the data. Note that here the optional meta field is used, so the host requests that the data is written with the meta value 3. The meta value is also stored together with the data at the memory media. And the memory controller can already send a completion
Starting point is 00:49:57 when it can ensure that the next read request would return the correct data. So it can already send a completion when the data is visible to future reads, even though the memory controller might not have fully committed the data to the memory media. Now let's look at a read flow. Here again, the host wants to interact with the memory,
and it sends a memory read request. Also for read requests, the host can specify this meta value; in this case, it is 0. The memory controller then reads the data from the memory media. As mentioned, the meta value is stored together with the data, and at the memory media it is set to 2. After reading it, the memory controller will send the data
Starting point is 00:50:54 together with the old meta value to the host. But even though it is only a read request, the memory controller still needs to perform a write to the memory media to update the requested meta value. This gets simpler if the host does not have any meta value requirements: in that case, the memory controller can simply read the data and return it to the host. So before, we talked about the cache protocol and now about some simple flows of the memory protocol, but there are also devices, type 2 devices, that support both the cache and the memory protocol.
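Here is a toy sketch of these CXL.mem read and write flows with the 2-bit meta field, under the same caveat as before: the names (MemController, mem_wr, mem_rd, "Cmp") are made up for illustration, not the actual protocol encoding:

```python
# Toy CXL.mem type 3 flow with the optional 2-bit meta field.
# Names and message strings are illustrative only.

class MemController:
    def __init__(self):
        self.media = {}  # address -> (data, meta)

    def mem_wr(self, addr, data, meta):
        # Store the data together with the requested 2-bit meta value.
        assert 0 <= meta <= 3
        self.media[addr] = (data, meta)
        return "Cmp"  # completion: data now visible to future reads

    def mem_rd(self, addr, req_meta=None):
        data, old_meta = self.media[addr]
        if req_meta is not None and req_meta != old_meta:
            # The host requested a different meta value, so the
            # controller must perform an extra write to the media
            # to update the stored meta bits.
            self.media[addr] = (data, req_meta)
        # The response carries the data and the *old* meta value.
        return data, old_meta

mc = MemController()
completion = mc.mem_wr(0x80, b"payload", meta=3)
data, old = mc.mem_rd(0x80, req_meta=0)  # read returns the old meta
```

The extra write on a read with a differing requested meta value corresponds to the media update mentioned above; with no meta requirement (req_meta=None), the read path needs no write at all.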
Starting point is 00:51:40 And such a device, usually an accelerator, can access its own memory without violating the coherence and without communicating with the host. For that, Type-2 device has to implement so-called BIOS table, which is a coherence directory. for each of its cache block copies and indicates if the host has a copy of that cache block. And if the host has no copy, then we call it, or this state is then called device bias. And in this state, the type 2 device can directly read its own memory. But if the host has a copy of that cache block, then it has to read the data through the host. And the host here is responsible for tracking potential cache block copies in other caches or CXL devices as well. So the type two device itself,
Starting point is 00:52:50 it has to track if the host has a copy of that cache, but it doesn't have to track if some other peer cache has a copy of this cache block. And this protocol flow is visualized here for the case that is a host BIOS flow, meaning that the cache line that we want to read is already or is present in the host cache. So in this case, we have a type 2 device with its internal memory controller. And this device, using the cache protocol, wants to read the cache block. It's actually a cache block that is located on its own memory. But it has to communicate with the host agent
Starting point is 00:53:42 first, because in CXL, especially 1 and 2, the accesses or the cache coherence is host managed. And when the CXL device sends the read request to the home agent, it again, as we saw it before, has to invalidate the corresponding copies in other caches. And therefore, sends snoop messages to the peer caches. These again respond that they successfully changed the state of their cache line copies and the home agent then can send a memory read forward. Why would it send a memory read forward in this case? So usually with a workflow that we saw before, the home agent would read the data from the memory controller and when it receives the data, it will then send the data to the CXL device, which would be wasteful in this case since the CXL device itself has the memory and therefore the home agent can
Starting point is 00:54:48 send this message to the internal memory control of the CXL device letting the memory controller know that the data needs to be sent to the CXL device's cache and which then happens and after the CXL device received the data, so it's now in its cache, the cache line can be updated or the copy of the cache line can be updated from invalid to exclusive. And the state is still in device bias since we did not propagate any cache line copies to other peer caches. So to summarize here, the cache agent resolves the coherency while the device can read its own memory. And the workflow looks quite... Oh, sorry, a little mistake here.
Starting point is 00:55:52 So before, we were in host bias state. And since the home agent at this point has invalidated the cache line state of the peer caches, we can be sure that no other peer cache has a copy of that cache line. Therefore, in our bias table, where we track which cache lines might be available in the host cache, we can change the state from host bias to device bias. So we started initially in host bias, and here is the flow when the state is in device bias, meaning no other peer cache has a valid copy of the data that we want to read. Therefore, we do not have to
Starting point is 00:56:46 communicate with the home agent, and the CXL device can directly read the memory from its internal memory controller, which then returns the data, and we can change the state of that cache line copy to exclusive. And since we did not propagate any copy of this cache line to other peer caches, we are still in device bias state, which is stored in the bias table. Okay, as I mentioned, we have the different protocol generations, or interconnect generations, and there are some important enhancements with the CXL 3.0 generation. We already saw multi-level switching, which is introduced with that
Starting point is 00:57:37 generation. We have support for up to 4096 end devices. We can build large fabric topologies with multiple paths between the source and the destination pair. So before, in generation 1 and 2, the specification requires that a path from the source and the destination, that there's only one path. Therefore, from the host perspective, we always would have tree topologies. And with the third generation, we can also break this constraint and support non-tree topologies. Some limitation of generation one and 2 is also that only one CXL type 1 or type 2 device can be present in one topology from the host perspective. This is because we need to track the different states of cache line copies of our CXL cache devices. So remember, we have our cache that is host managed, and therefore we need some tracking information or some tracking data structures in our CPUs.
Starting point is 00:58:57 And they also have to be updated depending on what state the cache line is in. And with the third generation, we have these back invalidations and back invalidation responses of the memory protocol that we saw before. And with that, we do not need, or we do not have this limitation anymore. We can have multiple type 1 and type 2 devices up to 16 according to the specification.
Starting point is 00:59:27 We also double the bandwidth. This is simply because it requires PCIe 6, which also compared to PCIe 5 doubles the bandwidth. So with a 16-lane connection, we can achieve up to approximately 128 gigabytes per second while we are at 64 approximately with PCIe 5. It also allows direct peer-to-peer access from a PCIe device or a CXL device to the coherent exposed memory hosted by a type 2 or type 3 device.
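Coming back to those bandwidth numbers for a moment, here is a quick back-of-envelope check, counting only the raw line rate and ignoring encoding and protocol overhead:

```python
# Raw per-direction link bandwidth: each lane carries one bit per
# transfer, so GT/s * lanes gives gigabits per second; divide by 8
# for gigabytes per second. Encoding and protocol overhead ignored.

def raw_bandwidth_gb_per_s(gigatransfers_per_s, lanes):
    return gigatransfers_per_s * lanes / 8  # 8 bits per byte

pcie5_x16 = raw_bandwidth_gb_per_s(32, 16)  # PCIe 5: 32 GT/s per lane
pcie6_x16 = raw_bandwidth_gb_per_s(64, 16)  # PCIe 6: 64 GT/s per lane
```

This reproduces the approximately 64 and 128 gigabytes per second mentioned above; achievable throughput is somewhat lower in practice because of encoding and protocol overheads.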
Starting point is 01:00:07 And for this peer-to-peer communication, a host does not have to be involved. One of the most important features is the possibility to share memory in a coherent way across multiple hosts. This brings us again to back-invalidation; I want to highlight it again. You saw before that in the memory protocol, we have the new flows, back-invalidate and back-invalidate response. This also enables a new memory type, which is memory that
is exposed to the host but also supports caching on the device. So the cache coherency also needs to be present at the device, but the device itself can now invalidate other caches with these back-invalidation requests. There are three different use cases. One is the one I mentioned before: direct peer-to-peer communication between CXL devices. Another use case is that we can map a larger chunk of memory of a type 2 device to be coherently accessible to the host. I talked before about the accelerators, the type 2 devices. But assume an FPGA with a certain amount of memory: not all of the capacity of this memory has to be exposed in a cache-coherent way, or defined as HDM, because we have some limitations here, since the CPU
Starting point is 01:02:03 needs to track the cache lines of this exposed memory. And this is actually a limiting factor. With back-invalidation, we can map a significantly larger chunk of memory to be cache-coherently exposed to the CPU or other CXL devices. Before, we needed to track the coherence states for our bias flow as well. You remember that we are either in device or in host bias, and if the type 2 device wants to access its own data, we need to check whether we need to communicate with the host or whether we can directly read our data.
Starting point is 01:03:00 And for that, we have our bias table tracking the state of the cache lines. Now we do not have to implement the full bias table anymore; we can instead send a back-invalidation message for a certain cache line to the host, and therefore do not need to store this bias table anymore. And another use case is coherent shared memory access across multiple independent hosts. Assume that you have two hosts, or multiple hosts, that can access the same memory region.
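The two mechanisms discussed here, the bias table of CXL 1.1/2.0 and the back-invalidation of CXL 3.0, can be contrasted in a toy sketch. All class and method names are invented for illustration and do not correspond to spec-defined messages; only the control flow is modeled:

```python
# Toy contrast: bias-table lookup (CXL 1.1/2.0 style) versus
# back-invalidation (CXL 3.0 style). All names are illustrative.

class ToyHost:
    def __init__(self):
        self.cached = set()        # blocks the host currently caches
        self.invalidations = []    # invalidations it has processed

    def resolve_coherence(self, addr):
        # CXL 1.1/2.0: the host-managed home agent drops peer copies.
        self.cached.discard(addr)
        self.invalidations.append(addr)

    def back_invalidate(self, addr):
        # CXL 3.0: the device tells the host directly to drop its copy.
        self.cached.discard(addr)
        return "BIRsp"  # back-invalidate response

class BiasTableDevice:
    def __init__(self, host):
        self.host = host
        self.bias = {}    # block -> "HOST" | "DEVICE"
        self.memory = {}

    def read_own_memory(self, addr):
        if self.bias.get(addr, "DEVICE") == "HOST":
            # Host bias: must resolve coherence through the host first.
            self.host.resolve_coherence(addr)
            self.bias[addr] = "DEVICE"   # flip the block to device bias
        return self.memory[addr]

class BackInvalidateDevice:
    def __init__(self, host):
        self.host = host
        self.memory = {}

    def write_own_memory(self, addr, data):
        # No bias table needed: just drop any stale host copy.
        self.host.back_invalidate(addr)
        self.memory[addr] = data

host = ToyHost()
host.cached.update({0x100, 0x200})
d1 = BiasTableDevice(host)
d1.memory[0x100] = b"own data"
d1.bias[0x100] = "HOST"              # host currently caches this block
value = d1.read_own_memory(0x100)    # goes through the host once
d2 = BackInvalidateDevice(host)
d2.write_own_memory(0x200, b"new")   # direct back-invalidate, no table
```

The bias-table device pays one host round trip only while a block is in host bias; the back-invalidate device needs no per-block bias state at all, which is what allows much larger coherently exposed memory regions.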
Starting point is 01:03:37 And if one host accesses the data while a second host has a copy of that cache line, there would be no way to resolve this without back-invalidation. Or in other words, the CXL consortium introduced back-invalidation to realize cache-coherent memory accesses across multiple independent hosts. And here is one figure, also from documentation of the CXL consortium, in which you
Starting point is 01:04:09 can see a potential large fabric topology that we can achieve with CXL 3.0. In this case, we could have, for example, different end devices, which you can see in the bottom area here, connected to different switches. The switches that are connected to end devices are called leaf switches. Then, inside of this large CXL topology, we can have the leaf switches connected with spine switches; spine switches are only connected to other CXL switches. And I also talked about
tree-based topologies and non-tree-based topologies. As you can see here, there are definitely multiple paths from one end node to another. Marked in red is one potential path, but you could also go from this CPU over the top mid switch, then back from the top mid switch to the bottom mid switch, and then to this global fabric-attached memory, which is, just as a reminder, a type 3 device that can be accessed by many thousands of other devices. Okay, so now you have some insights into and an overview of the CXL protocol, the different generations, and some features. But what is already available, what can we work with, and what is maybe announced? Intel released the Intel Sapphire Rapids CPUs this year, and they fully support CXL 1.1. Intel also provides the Intel Agilex 7 FPGAs, which are CXL 1.1 compatible, and they announced Agilex 7 M-series FPGAs with CXL 2.0 support. On the right side, you can see the physical layout of the Intel Sapphire Rapids multi-die architecture.
Starting point is 01:06:27 So you have four dies here. This is one CPU, subdivided into four connected dies. Per die, you can see here in orange the PCIe root complex. Each of these orange boxes is one 16-lane connection, so you actually have 32 lanes per die and 128 in total, which is quite a lot. You could imagine attaching a CXL device to each of these connections, potentially for memory or bandwidth extension. We are limited here,
in that we cannot use any switch topologies, since Sapphire Rapids only supports CXL 1.1. But still, there is a lot of potential if we could attach multiple CXL devices here. And at the bottom, it is mainly a black box, but this is the memory extension board that was announced by Samsung. They already have two generations; this is the latest one. They announced the development of a 128-gigabyte DRAM device
that supports CXL 2.0. It uses PCIe 5, and it is a x8 device, meaning the PCIe width is eight lanes, and they announced a bandwidth of up to 35 gigabytes per second. AMD has also released CPUs that are CXL 1.1 compliant, the new AMD Genoa CPUs. And they also announced a smart network interface card that supports CXL 2.0.
Starting point is 01:08:32 In the last minutes, I want to briefly talk about performance. There was already a question about how long it takes to perform an access or a write, for example. Here are some memory access latency estimations that are provided in the Introduction to Compute Express Link by the CXL Consortium; if you are more interested, feel free to check the document. From the CPU to a type 3 device, if we assume it is a single logical device with DDR memory on the device, they estimate that accessing the memory would take 170 nanoseconds. They also estimate that it takes about 170 nanoseconds when we access pooled or shared memory from the CPU on a directly attached multi-logical type 3 device,
Starting point is 01:09:37 and also when we communicate from a device to the host's memory with CXL.cache. In a different topology, with a type 3 device connected via a switch, so with one level of switching, the estimation is 250 nanoseconds. When we message a peer CPU or a peer device through one CXL switch, it would take 220 nanoseconds. And if we send a message to a peer CPU or device through two levels of switches, through two switches,
Starting point is 01:10:23 then the estimation would be 270 nanoseconds. Okay, here is one example experimental evaluation that is already publicly available. So there are not that many evaluations yet, and this was published this year. It's a work from Intel and UIUC and in their work they use Intel AgileX FPGA as you can see here in the bottom. So here are some setup information. This is CXL 1.1 compatible and it is attached by our PCIe Gen 5 with a 16 LAN connection. And they use a single DIMM with 16 GB of DDR4 memory. And in their dual socket system they have Intel Xeon CPUs. It's basically Intel Sapphire Rapids. I'm not sure if it's pre-release or...
Starting point is 01:11:36 I have to double-check if it's a pre-release version or already the release version. But in their work, they performed some micro benchmarks and as you can see in the middle for example they wanted to measure how long it takes if you read or write a cache line. So the LD is the load instruction here. And then we have another access which is a store with write back or a non-temporal store meaning bypassing the caches and they performed all of these instructions as AVX 512 instructions. Without going into more detail here for the load instruction for example.
Starting point is 01:12:26 So in green, you can see the latency that it takes to access the memory on the local node, on the local socket. In purple is the access to a remote NUMA memory region. And in orange, you can see the latencies to access the memory of the CXL device, or this FPGA board in this case. For the load instruction, what I find surprising here is actually that we are already at about 250
nanoseconds for the local memory. But compared to the remote NUMA node, the CXL memory latency does not require much more time; it is comparable with a load from a remote NUMA node. And as you can also see for the other instructions, the access to the CXL FPGA performs quite well, with low latencies compared to the remote NUMA node. And there were also some further measurements regarding pointer chasing.
Starting point is 01:13:51 So they first flushed the cache lines and then used certain working sets, meaning they read a certain set of data and brought it into the cache, which was their warm-up. Then they performed pointer-chasing operations. What they measured was the latency they achieved when they increased the working set size, here on the x-axis. And you can observe these jumps; these are actually the jumps that occur when you exceed the capacity of a cache level.
Starting point is 01:14:30 So, for example, here at around 64 kilobytes, you see a slight increase of the latency, which would mean that we exceed the level 1 cache capacity. Then at some point we exceed the level 2 cache capacity, and then also the last-level cache capacity. Interesting to see is that, again, the green line here is the access to the local memory, and the orange is the access to the CXL device. And accessing the CXL device with pointer chasing results in latencies very similar, I would almost say equal, to the latencies that we can observe with local memory.
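The pointer-chasing methodology can be sketched as follows. This is an illustrative Python version of the idea only; a real benchmark like the one in the paper would use native code with cache-line-sized elements and explicit cache flushes. A random cyclic permutation makes every load depend on the previous one, which defeats prefetching and exposes the true access latency:

```python
# Sketch of the pointer-chasing methodology: build a random cyclic
# permutation so each access depends on the previous one, then
# measure the time per dependent access for a given working set.
import random
import time

def build_chain(n_slots):
    # next_idx[i] gives the next slot to visit; a single random cycle
    # over all slots defeats hardware prefetching, so each access pays
    # the full latency of its level in the memory hierarchy.
    order = list(range(n_slots))
    random.shuffle(order)
    next_idx = [0] * n_slots
    for a, b in zip(order, order[1:] + order[:1]):
        next_idx[a] = b
    return next_idx

def chase(next_idx, steps):
    start = time.perf_counter()
    i = 0
    for _ in range(steps):
        i = next_idx[i]   # dependent load: cannot be overlapped
    elapsed = time.perf_counter() - start
    return i, elapsed / steps  # final index, seconds per access

chain = build_chain(1024)
_, per_access = chase(chain, 10_000)
```

Sweeping the working-set size (n_slots times the element size) over growing values is what produces the latency jumps at each cache-capacity boundary in the plot.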
Starting point is 01:15:28 But as soon as the data does not fit into our caches anymore, the latency increases significantly, since we then have to fetch the data from the memory controller of the attached CXL device, or the FPGA in this case. At the bottom, you can see bandwidth measurements with sequential accesses. On the left side, they used local memory with eight DIMMs. On the very right side, they used remote memory of a remote socket with only one channel. And in the middle, they have the CXL memory device. An interesting observation here was, for the green line, which are load instructions: if you increase the number of threads,
Starting point is 01:16:28 they achieved a maximum bandwidth of about 20 gigabytes per second with eight threads, and after that, the bandwidth decreased if they used more threads. Another interesting observation was that already with two threads, they measured the maximum bandwidth with non-temporal stores, and with four, six, and more threads, the bandwidth dropped significantly compared to using only two threads. To summarize, CXL tries to tackle the different interconnect
issues mentioned before. It wants to achieve cache-coherent accesses from PCIe or CXL devices to host memory and vice versa. It wants to tackle scalability limitations, allowing memory capacity and bandwidth extension, and also allowing access to multiple different resources; this is relevant for the resource pooling aspects. CXL 2 and 3 provide memory pooling and memory sharing, and especially the memory sharing feature is also relevant for the fourth challenge here, the distributed data sharing.
Starting point is 01:18:04 And you can see this summarized here in the table. For the first generation, the specification is available since 2019. We talked about the hardware scope: it is at the single-machine level, where we have to attach the device directly to the server. It has the speed of PCIe 5, which is 32 gigatransfers per second, and the bandwidth is calculated based on the number of lanes
Starting point is 01:18:32 that the device supports. This generation addresses the coherence challenge, meaning we want to access the host's memory from our accelerator and vice versa, and it allows bandwidth and capacity expansion. CXL 2.0 is available since 2020; available means the specification is available. Here we are at a larger scope; we could say we are at the rack level, translated to data center scale. We still have PCIe 5 with 32 gigatransfers per second, and it mainly addresses the challenge of resource pooling.
Starting point is 01:19:15 And with CXL 3.0, which was published last year, we are at a much larger scale: theoretically, we can support up to 4,096 end devices, but the consortium mentions that it is more realistic to have hundreds of machines connected with multiple switches. Here we can double the speed, since this generation requires PCIe 6, and it allows larger-scale resource pooling, so it increases the scalability of the resource pooling that we achieved with the second generation, and it also allows sharing. Therefore, it tackles challenges 3 and 4. Here you can see a last overview of the different features of the different generations. We did not tackle all of them.
Starting point is 01:20:17 But we at least touched some of them. So you know about the different device types, type 1, type 2, type 3 devices. We talked about the differences between single and multiple logical devices. We briefly talked about single-level switching and multi-level switching and the topologies that are possible with such switches. With generation 3, peer-to-peer communication between CXL devices is possible. We have an enhanced coherency, an important role.
Starting point is 01:21:00 Back-invalidation plays an important role here, first for the enhanced coherence. This is relevant for the type 2 device memory capacity that I mentioned, so that we do not need, for example, the bias table anymore, but can back-invalidate other caches instead. We have memory sharing capabilities with CXL 3.0. And the limit of one type 1 or type 2 device that we had with CXL 1 and 2 is increased to up to 16 devices. And we have large fabric capabilities; we also saw an example figure of that. All right.
Starting point is 01:21:56 With this, this semester's lecture series ends. Tomorrow, as I mentioned, we will do the data center tour. And if you liked the lecture, if you like our research group, and if you are also interested in student assistant positions, just drop by and talk to us, or attend our next courses. So next semester, especially relevant for hardware-conscious data processing, we will
also do a seminar with hands-on work in groups related to HCDP topics. The topics are not finalized yet, but we will probably provide an FPGA topic and also a topic related to NUMA-aware data processing. Besides that, we also have other courses, for example the Big Data Systems lecture. Next winter term, we will for the first time have a Big Data Lab course, in which you can work hands-on on big data processing concepts and also learn how to deploy, for example, large clusters, work with them, and process data with such approaches. We offer a master project on the topic of benchmarking real-time analytics systems. And if you are interested in writing your master thesis with us, then have a look at our research topics. We mainly work on database systems and modern hardware, stream processing, machine learning
systems, and benchmarking. Thanks for your attention, for attending this course, and I hope to see you next semester in the HCDP seminar. Thanks.
